logo

Disagreement between fields

By 'disagreement' I mean that information in a field contradicts information in another field or fields. Three kinds of discord I've come across could be called 'subset', 'failed process' and 'which is right?' disagreements.


Subset

In a 'subset' disagreement, the information in one field is supposed to be just a subset of the information in another field. Sometimes the larger field has been parsed to extract the subset, and the parsing has failed — resulting in either an incorrect string in the subset field or a blank. Alternatively, the subset and larger field were independently filled in, and there was a data entry or other error.

Detecting a discord of this kind is fairly straightforward once you determine which fields should contain the same information. In the example below, the 'Year' item should match the year in brackets in 'Citation'. AWK is shown here finding disagreement, then agreement.

$ cat list
Year   Citation
1922   Brown, F. (1922) Poems. Melody, London; 64 pp.
1948   Smith, A. (1948) Chiaroscuro. Arthouse, New York; 96 pp.
1981   Jones, H. (1980) Sausages. Home Arts, Sydney;128 pp.
2001   Messenger, J. (2001) Speeches. Politica, Glasgow; 92 pp.
 
$ tail -n +2 list | awk -F"\t" '{a=$1} $2 !~ a'
1981   Jones, H. (1980) Sausages. Home Arts, Sydney;128 pp.
 
$ tail -n +2 list | awk -F"\t" '{a=$1} $2 ~ a'
1922   Brown, F. (1922) Poems. Melody, London; 64 pp.
1948   Smith, A. (1948) Chiaroscuro. Arthouse, New York; 96 pp.
2001   Messenger, J. (2001) Speeches. Politica, Glasgow; 92 pp.

This recipe obviously won't work if either the subset field or the larger field is blank. Records with blank fields should be checked individually. (See the invalid values page.)


Failed process

Information in one field is supposed to be derived from information in another, but the deriving has failed. Disagreements of this kind are particularly abundant in a certain Australian biodiversity database, which has a large number of fields with either 'true' or 'false' as the data item. Each of the true/false items is said to be the result of a data quality check on information in fields containing data supplied to the database. However, the true/false results are unreliable and there are many false positives and false negatives. These can be found by extracting a data field and its corresponding data quality check field with cut, passing the pair to sort | uniq and inspecting the results.

The same database has other failed-process issues. One field (call it field 1) contains a scientific name from an external data provider, while another field (call it field 2) contains a matched scientific name, derived programmatically by reference to an official list of such names. If the program can't find an exact match, it puts into field 2 the next higher grouping in the biological hierarchy. For example, if the provided name is genus Amblium, species potamum and the program can't find potamum in the official list of Amblium species, it fills field 2 with Amblium.

This command (a little simplified from the one I actually used)

$ awk -F"\t" '$1 != $2 {print $1"\t"$2}' table | sort | uniq -c

gave a lot of results! Inspection showed that many of them arose from failed processing of scientific names that were supplied in 'non-standard' forms. Field 1 items like these

Arthropoda Hexapoda Insecta Coleoptera Adephaga Carabidae Carabinae
Arthropoda Hexapoda Insecta Coleoptera Adephaga Carabidae Lebiinae Lebiini
Arthropoda Hexapoda Insecta Coleoptera Adephaga Carabidae
Arthropoda Hexapoda Insecta Coleoptera Polyphaga Aderidae

wound up as 'Arthropoda' in field 2, rather than as the perfectly valid scientific name at the other end of the string.


Which is right?

Here 2 or more fields disagree with each other, and there's no additional information in the data table which can help you decide how and why the fields disagree. 'Which is right?' cases are very hard to detect by eye or with commands, and are impossible to resolve without further investigation (searching online or an authority file, or asking the data manager or compiler).

One kind of 'Which is right?' discord I find in biodiversity databases is disagreement between a 'Locality' field and the corresponding 'Latitude' and 'Longitude' fields. For example, the record has 4 km E of Mathinna, Tasmania at latitude -41.4720 and longitude 147.8424. In fact, that latitude/longitude is for a point about 4 km west of Mathinna, not east.

A strategy I've used to detect locality-lat/lon discord is to (1) parse out placenames in the 'Locality' field (with regex followed by manual checking), (2) add 'reference' latitude and longitude fields (with AWK) for those placenames from an authority file (in my case, a digital gazetteer of Australia or one of its States), (3) calculate the differences (again with AWK) between 'reference' lat/lons and the given ones and (4) sort these differences by decreasing size with sort -nr. Big differences suggest there's a problem.

That strategy wouldn't work in the Mathinna example, however! My usual practice now is to check lat/lon data against locality record-by-record, using Google Maps or a suitable online GIS. See also the geochecking page.