logo

Truncated fields

(See also the companion post on the BASHing data blog.)

Truncations in the strict sense have data items with their right-hand ends trimmed off by software. They can be hard to detect, and might also be data entry errors. For example, in the data item Melbourne, Vict. 201 did the data enterer simply leave off the last digit of a year, or did software (at some stage) truncate the string Melbourne, Vict. 2016 (for example) at 20 characters?

I use three methods to look for possible truncations. In the first, a field to be checked is passed to an AWK command that tallies up data items by field width, and the result is sorted to print the tallies in reverse order of width. For example, to check field 33 in the tab-separated file midges:

awk -F"\t" 'NR>1 {a[length($33)]++} \
END {for (i in a) print i FS a[i]}' midges | sort -nr

distro1

The longest entries have exactly 50 characters, which is suspicious, and there's a "bulge" of data items at that width, which is even more suspicious. Inspection of those 50-character-wide items reveals truncations:

distro2

Other tables I've checked this way had bulges at 100, 200 and 255 characters. In each case the bulges contained apparent truncations.

A second way to check for truncated data items is to use the unmatched script (or function). A good starting point is a tally of all the punctuation in the data table. Here I'm checking the file mag2. Note that the numbers of opening and closing round brackets aren't equal:

grep -o "[[:punct:]]" file | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/'

punct

ummatched checks the round brackets and finds some likely truncations:

unmatched

A third way to detect truncations is to look for data items with a trailing space or a non-terminal punctuation mark, like a comma or hyphen. This can be done on a single field with cut piped to grep, or in one step with AWK. :

cut -f[field to be checked] file | grep -n "[ ,;:-]$"
 
awk -F"\t" '$[field to be checked] ~ /[ ,;:-]$/ {print NR": "$47}' file

The all-fields version of the AWK command is:

awk -F"\t" '{for (i=1;i<=NF;i++) if ($i ~ /[ ,;:-]$/) \
print "line "NR", field "i":\n"$i}' file

Repairing truncations means finding out what the non-truncated string was. As with domain schizophrenia, that may involve politely communicating with the data manager or compiler.