banner

For a full list of BASHing data blog posts, see the index page.     RSS


Putting information into a table from the table's filename

An example of this data-processing task would be grabbing the date part of a date-stamped filename and adding it to the table records (assuming they don't have a date), so that the files can be combined for a time-series study.

Below I use simplified files to make this task clearer. Each tab-separated file has the unit prices of 3 investments on a particular day. I'd like to combine the files after adding to each record the date given in the filename.

equities_05-Dec-2018.tsv:

APIR codeEntry priceExit price
MAQ0477AU0.94170.9372
MAQ5378AU0.95030.9478
MAQ0464AU1.03501.0322

equities_06-Dec-2018.tsv:

APIR codeEntry priceExit price
MAQ0477AU0.92860.9241
MAQ5378AU0.94770.9452
MAQ0464AU1.03211.0294

equities_07-Dec-2018.tsv:

APIR codeEntry priceExit price
MAQ0477AU0.92860.9241
MAQ5378AU0.93760.9351
MAQ0464AU1.02101.0183

To get the date part of the filename I can use AWK's "split" function:

split1

The "split" function is normally used with 3 arguments. The first is the string to be split, the second is the name of the array to be loaded with the split-out parts, and the third is the character used as a part-separator. However, that third argument doesn't have to be a literal character, like "_". It can also be a regex, and in this case the regex means "either underscore or period", with the "." escaped so that the regex engine doesn't interpret it as "any character".

And here's the first file by itself with a header line and a date added to each record:

awk 'BEGIN {split(ARGV[1],a,/[_\.]/)} NR==1 {print $0 "\t" "Date"} NR>1 {print $0 "\t" a[2]}' equities_05-Dec-2018.tsv

split2

In order to grab the date part from the filename in the BEGIN statement, I use "ARGV[1]", which is the first argument offered to the AWK command, and in this case is the filename. To the first line of the file ("NR==1") AWK adds the tab-separated field name "Date". To all the remaining lines ("NR>1") AWK adds the tab-separated date part from the filename, stored in the array "a" as "a[2]".

Next I'll combine the 3 files into one table. First I print a header line with echo, then use a "for" loop to run the AWK command on just the non-header lines in each file:

echo -e "APIR Code\tEntry price\tExit price\tDate"; for i in *.tsv; do awk 'BEGIN {split(ARGV[1],a,/[_\.]/)} NR>1 {print $0 "\t" a[2]}' "$i"; done

split3

Not bad, but for ease of sorting the output table, it would be better if the dates were in ISO 8601 format, namely YYYY-MM-DD. Possibly the least fussy way to do this is to pipe the output table to a second AWK command, and use the date command "internally" to do the conversion:

echo -e "APIR Code\tEntry price\tExit price\tDate"; for i in *.tsv; do awk 'BEGIN {split(ARGV[1],a,/[_\.]/)} NR>1 {print $0 "\t" a[2]}' "$i" | awk 'BEGIN {FS=OFS="\t"} {"date -d "$4" +%Y-%m-%d" | getline iso; print $1,$2,$3,iso}'; done

split4

I rarely use the "command | getline" option in AWK, but in this case it's simpler than deconstructing the date and rebuilding it as YYYY-MM-DD. What happens here is that AWK allows the shell to run the command "date -d "$4" +%Y-%m-%d" (note the quotes in the command) and captures the output in the variable "iso". The date command recognises "05-Dec-2018" as a date and knows how to convert it.

If wanted, that output table with ISO 8601 dates can be easily sorted by individual investments in chronological order:

echo -e "APIR Code\tEntry price\tExit price\tDate"; for i in *.tsv; do awk 'BEGIN {split(ARGV[1],a,/[_\.]/)} NR>1 {print $0 "\t" a[2]}' "$i" | awk 'BEGIN {FS=OFS="\t"} {"date -d "$4" +%Y-%m-%d" | getline iso; print $1,$2,$3,iso}'; done | sort -t $'\t' -k1,1 -k4,4

split5

Last update: 2018-12-13