Truncated fields

Truncations are hard to detect because a cut-off data item might just as easily be a data entry error. In the data item "Melbourne, Vict. 201", did the data enterer simply leave off the last digit of a year, or did software (at some stage) truncate the string "Melbourne, Vict. 2016" (for example) at 20 characters?
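The second possibility is easy to reproduce. Here is a minimal sketch, assuming a 20-character limit as in the example; the variable names are only for illustration:

```shell
# Simulate a 20-character field limit (the limit is this example's assumption)
item="Melbourne, Vict. 2016"
# cut -c1-20 keeps only the first 20 characters of each line
truncated=$(printf '%s\n' "$item" | cut -c1-20)
printf '%s\n' "$truncated"
```

The result is the 20-character string "Melbourne, Vict. 201", which on its own can't be told apart from a year with its last digit left off.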

One way to look for truncation is to run maxchk on a field and see if any of the longer data items look cut off. Even better evidence would be a series of possible truncations at a particular number of characters. For example, in one biological database I found a field with a maximum data item length of 100 characters. There were many truncated data items at the 100-character mark, and another run of truncations at 50 characters. I'm guessing that some of the data had passed through a database in which the field in question had a 50-character limit.

$ maxchk taxa.txt 22 10000
# Examples only:
100 sensu Withering [Bot. Arr. Brit. Pl. 4: 389 (1801)]; fide Checklist of Basidiomycota of Great Britai
100 sensu Reid [TBMS 84: 719 (1985)]; fide Checklist of Basidiomycota of Great Britain and Ireland , 200
100 sensu Rea (1922), Bres [(Icon.Mycol. 12 (1929)]; fide Checklist of Basidiomycota of Great Britain an
100 Rodrigues, Camacho, Sales Nunes, Sousa Recoder, Teixeira Jr., Valdujo, Ghellere, Mott & Nogueira, 20
100 A. Cali, P.M. Takvorian, S. Lewin, M. Rendel, C.S. Sian, M. Wittner, H.B. Tanowitz, E. Keohane & L.
-------
50 Apothéloz-Perret-Gentil, Holzmann & Pawlowski, 201
50 (Ax & Armonies, 1990) Martens & Curini-Galletti, 1
50 Amon, Wiklund, Dahlgren, Copley, Smith, Jamieson &
50 Baluk & Radwanski in Baluk, Radwanski & Grimm, 197
50 (Carballo, Hepburn, Nava, Cruz-Barraza & Bautista-
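A rough way to look for a pile-up of data items at one particular length is to tally the lengths in the field. What follows is only a sketch and is not the maxchk function: lentally is a hypothetical helper name, the file is assumed to be tab-separated, and awk's length() counts characters in GNU awk under a UTF-8 locale but may count bytes in other versions of awk.

```shell
# lentally: hypothetical helper, not part of the original toolkit
# $1 = tab-separated file, $2 = number of the field to check
# Prints "length<TAB>count" for each data item length, longest first
lentally() {
    awk -F"\t" -v f="$2" '
        { n[length($f)]++ }                               # count items of each length
        END { for (len in n) print len "\t" n[len] }
    ' "$1" | sort -nr                                     # sort by length, descending
}

# Possible usage (header line, if any, is counted too):
# lentally taxa.txt 22 | head
```

An unexpectedly large count at a round number such as 50 or 100, compared with the counts at neighbouring lengths, would be worth a closer look.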

Another way to detect truncations is to look for data items that end in a space or in a non-terminal punctuation mark, like a comma or hyphen. Here the search is in field 2 of the tab-separated file "file":

$ cat file
aaa   mmm      # One trailing space after 'mmm'
bbb   nnn
ccc   ooo,
ddd   ppp
 
$ cut -f2 file | grep -n "[ ,;:-]$"
1:mmm
3:ooo,
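The two checks can be combined, so that each suspicious data item is reported with its length and any cluster at one length stands out. This is a sketch under assumptions: trunclen is a hypothetical helper name, the file is taken to be tab-separated, and the bracket expression is the same one used in the grep command above.

```shell
# trunclen: hypothetical helper, not part of the original toolkit
# $1 = tab-separated file, $2 = number of the field to check
# Prints "line number<TAB>length<TAB>data item" for each data item
# ending in a space, comma, semicolon, colon or hyphen
trunclen() {
    awk -F"\t" -v f="$2" '
        $f ~ /[ ,;:-]$/ { print NR "\t" length($f) "\t" $f }
    ' "$1"
}

# Possible usage:
# trunclen file 2
```

The bracket expression can be adjusted to suit the data. For author strings like those in the taxa.txt examples, adding an ampersand (with the hyphen kept last in the brackets) might catch items cut off just after "&".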

Repairing truncations means finding out what the non-truncated string was. As with domain schizophrenia, that may involve politely communicating with the data manager or compiler.