logo

Some handy functions

broken
fields
hunt
maxchk
minchk
recbyfld
tally

If you use a command chain often, you can save a lot of work by writing it into a function with a short, appropriate name. Here are some functions I've often used in data auditing, shown in C-style and as one-liners (note closing semi-colon). I keep these functions in my .bashrc file, from which they can be printed in the shell with the type command:

$ type fields
fields is a function
fields ()
{
head -n 1 "$1" | tr "\t" "\n" | nl | pr -t -2
} 



broken

broken ()
{
awk -F"\t" '{print NF}' "$1" | sort -n | uniq -c
}
 
broken() { awk -F"\t" '{print NF}' "$1" | sort -n | uniq -c; }

Use broken to tally the number of fields in each record in a tab-separated data table. It's the easiest way to determine if there are broken records in the table. Here I'm checking the 10264 records (including header) in the 15-field events table:

broken 

fields

fields ()
{
head -n 1 "$1" | tr '\t' '\n' | nl | pr -t -2
}
 
fields() { head -n 1 "$1" | tr '\t' '\n' | nl | pr -t -2; }

fields names and numbers the fields in a tab-separated data table, making it easy to use those numbers with AWK and cut. It assumes that the first line in the table is a header with field names, and takes as its 1 argument the filename of the data table.

Here's a screenshot of fields at work on a 15-field data table of mine called events:

fields 1

Suppose I want to cut out the 3 fields 'Locality_in_words', 'Lat_DD_WGS84' and 'Long_DD_WGS84' from my events table and put them in a new file, places. I can never remember what the field numbers in events are, so I run fields first, as above, to remind me.

$ cut -f2,4,5 events > places 



hunt

hunt ()
{
awk -F"\t" -v tgt="$2" '{for (i=1;i<=NF;i++) if ($i ~ tgt) print NR" f"i": "$i}' "$1"
}
 
hunt() { awk -F"\t" -v tgt="$2" '{for (i=1;i<=NF;i++) if ($i ~ tgt) print NR" f"i": "$i}' "$1"; }

hunt is a grep-like use of AWK. The first argument is a filename, the second a target string. hunt loops through each field on each line and returns not only the line number for a match, but also its field address, with separate returns for each field occurrence on a line.

This function is useful for finding all occurrences of a string before doing a global replacement. In the specimens table below, hunt has found the string 4165 in fields 3 and 6 on the 7 indicated lines:

hunt 

maxchk

maxchk ()
{
tail -n +2 "$1" | cut -f"$2" | awk -F"\t" '{print length($1)"\t"$1}' | sort | uniq | sort -nr | head -n "$3"
}
 
maxchk() { tail -n +2 "$1" | cut -f"$2" | awk -F"\t" '{print length($1)"\t"$1}' | sort | uniq | sort -nr | head -n "$3"; }

maxchk counts characters in unique data items in a field and returns the longest data items in reverse order, beginning with number of characters. It takes 3 arguments: filename, field and cut-off (how many items do you want to see?). This example uses gazOz, an Australian gazetteer, and looks at the 3 longest place names:

maxchk 

minchk

minchk ()
{
tail -n +2 "$1" | cut -f"$2" | awk -F"\t" '{print length($1)"\t"$1}' | sort | uniq | sort -n | head -n "$3"
}
 
minchk() { tail -n +2 "$1" | cut -f"$2" | awk -F"\t" '{print length($1)"\t"$1}' | sort | uniq | sort -n | head -n "$3"; }

minchk is a 'shortest data item' version of maxchk:

minchk 

recbyfld

recbyfld ()
{
paste <(head -n1 "$1" | tr '\t' '\n') <(sed -n "$2p" "$1"| tr '\t' '\n') | nl
}
 
recbyfld() { paste <(head -n1 "$1" | tr '\t' '\n') <(sed -n "$2p" "$1"| tr '\t' '\n') | nl; }

recbyfld is especially handy for tables with a lot of fields. The first argument is the table name, the second the line number of a record. The function prints out the contents of each field for that record, vertically, and also numbers the fields:

recbyfld 

tally

tally ()
{
tail -n +2 "$1" | cut -f"$2" | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/'
}

tally takes 2 arguments: filename, then field number. It lists the sorted, unique data items in the field and gives the number of times that item occurs, ignoring the header entry. It modifies the usual output of the uniq command to generate left-justified numbers separated from the items counted by a tab character, making the result ready to put into a data table. Here's tally at work on the 'State' field in my events table:

tally

Because field number is the last argument, it's easy to tally sequentially through the fields of a data table by repeating the command (up arrow key) and changing just the last character or characters.