Duplicate records

It's unlikely that you'll find exact, character-for-character duplicate records in a data table, especially if the table derives from a database which has added a unique ID string to each record. For this reason a search for duplicates needs to be based on how the records will actually be used. In general, you might say that there's duplication in a data table when another record contains all the information you currently need from a particular record. Duplication often happens when you combine 2 different versions of a single table, quite a common occurrence when databases are merged.

In the following example, we're looking for records with the same data items in fields 3 and 5. There's 1 pair and 1 triplet of such records: the pair shares '71-49a9-b' and '1fca079c31', and the triplet shares 'd4-4318-8' and 'cbe71417f2':

$ cat table
IDField1    Field2    Field3    Field4    Field5
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10002    ff-78    71-49a9-b    701-9e    6a8470122b
10003    be-17    c8-4d7e-9    7ee-ec    55168faf14
10004    b6-47    41-4e3d-b    681-4f    cbe71417f2
10005    9f-af    f0-4df5-8    ea3-b5    a092755912
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10007    c8-8f    21-4239-b    fea-2b    82d21a1cbd
10008    b4-07    9e-47e8-b    e1e-0e    a9999a8cd7
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10010    af-c2    d1-4dfd-a    589-87    edfbe7d380
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10012    42-ab    7b-4947-9    6fd-10    36c98a1235
10013    ae-aa    d7-4204-a    351-c1    851871300b
10014    d3-1c    64-4a57-b    aa0-68    804191270a
10015    7b-ea    6a-426e-b    ae2-45    ed2c358c6a
10016    be-29    b2-4aae-9    e61-d4    d019adef79
10017    cd-b5    71-49a9-b    c6d-eb    97b49401b9
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2
10019    9c-f8    bf-4da7-a    400-25    19042a98df
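A quick way to see which field3/field5 combinations repeat is to count them with cut, sort and uniq -c. This is a supplementary check, not part of the recipes below, sketched here on a small stand-in table ('sample', with invented values) so the counts are easy to follow:

```shell
# Small stand-in for 'table' (tab-separated, same 5-field layout);
# rows 10001 and 10003 share fields 3 (X1) and 5 (H1).
printf '10001\taa\tX1\tbb\tH1\n10002\tcc\tX2\tdd\tH2\n10003\tee\tX1\tff\tH1\n' > sample

# Count each field3+field5 combination; a count greater than 1 means
# the combination is duplicated. Here X1+H1 gets a count of 2.
cut -f3,5 sample | sort | uniq -c | sort -rn
```

Run against the real table, the same pipeline would show a count of 3 for the d4-4318-8/cbe71417f2 triplet and 2 for the 71-49a9-b/1fca079c31 pair.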

The most straightforward way I know to pull the field3/field5 duplicates out of this table is to use an AWK array:

$ awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' table | sort -t $'\t' -k3,3 -k1 | uniq
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2

The array 'a' is indexed with the 2 fields of interest, and each element is set equal to the whole line. If a line's fields 3 and 5 are already indexed in 'a', the stored line is assigned to the variable 'b' (a non-empty assignment, so the pattern succeeds), and 'b' is printed followed by the line currently being checked. Because 'a' is updated after every line, a third line with the same 2 fields reprints the second such line, so some lines appear twice:

$ awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' table
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2
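If the repeated lines are unwanted, a two-pass AWK variant (an alternative sketch, not the command above) reads the table twice: the first pass counts each field3/field5 combination, and the second prints every line whose combination occurred more than once, so nothing is printed twice:

```shell
# Small stand-in for 'table' (tab-separated); rows 10001 and 10003
# share fields 3 and 5.
printf '10001\taa\tX1\tbb\tH1\n10002\tcc\tX2\tdd\tH2\n10003\tee\tX1\tff\tH1\n' > sample

# Pass 1 (FNR==NR) counts each field3+field5 combination; pass 2 prints
# the lines whose combination was seen more than once, in file order.
awk -F"\t" 'FNR==NR {c[$3,$5]++; next} c[$3,$5]>1' sample sample
```

The trade-off is reading the table twice, but no full lines are stored and none are repeated.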

The full command above removes repeated lines with uniq, after sorting the lines to group them by field 3 and listing them in IDField1 order.

Although the full command is a straightforward way to approach the problem, it requires a fair bit of memory and isn't really suited to very large data tables. A less straightforward approach also uses an AWK array, but is a little clearer in its logic and less demanding of memory, and also generates a possibly useful file containing the duplicated fields and their record IDs.

In this alternative recipe, fields 1, 3 and 5 are cut out with cut, the cut-out fields are sorted on the second field (the former field 3) with sort, and duplicates are found with uniq and its -D option, ignoring the first field (the former field 1) with the -f option:

$ cut -f1,3,5 table | sort -t $'\t' -k2 | uniq -f1 -D
10006    71-49a9-b    1fca079c31
10011    71-49a9-b    1fca079c31
10001    d4-4318-8    cbe71417f2
10009    d4-4318-8    cbe71417f2
10018    d4-4318-8    cbe71417f2

Note that uniq -f only works properly if all the fields to be ignored (in this case just field 1) are simple strings without spaces. If the table lacks such a field, you can add a suitable first ID field with the nl command.
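As a sketch of that fix (on an invented 2-field table with no ID field), nl's default field separator is already a tab, so its line numbers make a suitable space-free first field:

```shell
# Hypothetical tab-separated table lacking an ID field.
printf 'aa\tX1\nbb\tX2\n' > noid

# -b a numbers every line; -w 1 trims the number padding. nl's default
# separator is a tab, so the number becomes a clean first field.
nl -b a -w 1 noid
```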

The first step in the recipe is to save the duplicate record fragments as a temp file:

$ cut -f1,3,5 table | sort -t $'\t' -k2 | uniq -f1 -D > temp

Next, we use an AWK array keyed on the ID-field values from the temp file to find the corresponding full records in the main table. These are then sorted to group the duplicates and to put them in IDField1 order:

$ awk -F"\t" 'FNR==NR {a[$1];next} $1 in a' temp table | sort -t $'\t' -k3,3 -k1
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2
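Spelled out with comments, the lookup step works like this. The sketch below uses invented files 'temp2' and 'main' with the same shapes as the real temp file and table:

```shell
# 'temp2' holds ID + the duplicated fields; 'main' holds full records.
printf '10001\tX1\tH1\n10003\tX1\tH1\n' > temp2
printf '10001\taa\tX1\tbb\tH1\n10002\tcc\tX2\tdd\tH2\n10003\tee\tX1\tff\tH1\n' > main

# While reading the first file (FNR==NR), store each ID as an array key
# and skip to the next line. While reading the second file, the bare
# pattern '$1 in a' prints only records whose ID is a stored key.
awk -F"\t" 'FNR==NR {a[$1]; next} $1 in a' temp2 main
```

Only the IDs are stored in memory, which is why this approach scales better than holding whole lines in the array.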

In a real-world case, the data table I was checking had 31 fields, among them separate fields for each of the components of a scientific name. For example, in addition to a field with the full scientific name Acantholophus dixoni Ferguson, 1915, there were also fields for genus (= Acantholophus), species (= dixoni) and authorship (= Ferguson, 1915).

A check of the full-scientific-name field only found 18 duplicated records. For a broader search I looked for duplicates in the genus and species fields and found more than 21000! When I added back the authorship field I could see why: records were being kept separate even when they differed only in minor spelling errors or formatting differences in the authorship field. Some simple data cleaning would remove more than 10000 duplicates.

RecordID     Genus     Species     Authorship
7837290     Acantholophus     dixoni     Fergusson, 1915
7854925     Acantholophus     dixoni     Ferguson, 1915
7794680     Acalles     anchonoides     Champ. , 1910
7854125     Acalles     anchonoides     Champion , 1910

Note that in this particular case the record ID was an uninterrupted string of digits, so I could use uniq -f. In the simplified commands below, field 1 is record ID, field 2 is genus, field 3 is species and field 4 is authorship. After finding the genus+species duplicates (temp), I built another table (authorship) containing just record ID and authorship. Authorship was then added to the temp table using an AWK array:

$ cut -f1,2,3 table | sort -t $'\t' -k2 | uniq -f1 -D > temp
$ cut -f1,4 table > authorship
$ awk -F"\t" 'FNR==NR {a[$1]=$2;next} {print $0"\t"a[$1]}' authorship temp > duplicates
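Chained together on a tiny synthetic 4-field table (invented values, and a portable `$(printf '\t')` standing in for bash's `$'\t'`), the three steps behave like this:

```shell
# Tiny stand-in table: ID, genus, species, authorship (tab-separated).
printf '1\tAcalles\tanchonoides\tChamp. , 1910\n2\tAcalles\tanchonoides\tChampion, 1910\n3\tAcantholophus\tdixoni\tFerguson, 1915\n' > table4
tab=$(printf '\t')   # portable spelling of the bash-only $'\t'

# Genus+species duplicates, ignoring the ID field:
cut -f1,2,3 table4 | sort -t "$tab" -k2 | uniq -f1 -D > temp
# ID and authorship only:
cut -f1,4 table4 > authorship
# Re-attach authorship to each duplicate fragment via its ID:
awk -F"\t" 'FNR==NR {a[$1]=$2; next} {print $0"\t"a[$1]}' authorship temp > duplicates
cat duplicates
```

The 'duplicates' file then shows the two Acalles anchonoides records side by side with their differing authorship strings, which is exactly the view needed to spot the near-duplicates.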