
Duplicate records

Exact, character-for-character duplicate records are most easily found with sort and uniq -D:

$ cat file
aaa    bbb    ccc
ddd    eee    fff
ggg    hhh    iii
jjj    kkk    lll
ddd    eee    fff

$ sort file | uniq -D
ddd    eee    fff
ddd    eee    fff

However, it's unlikely that you'll find exact, character-for-character duplicate records in a data table, especially if the table comes from a database that has added a unique ID string to each record. For this reason a search for duplicates needs to be based on the expected uses of the records. In general, you might say there's duplication in a data table when all the information you currently need from a particular record is also found in one or more other records. Duplication of this kind often happens when you combine 2 different versions of a single table, quite a common occurrence when databases are merged.


In the following example, we're looking for records with the same data items in fields 3 and 5. There's 1 pair and 1 triplet of such records:

$ cat table
IDField1    Field2    Field3    Field4    Field5
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10002    ff-78    71-49a9-b    701-9e    6a8470122b
10003    be-17    c8-4d7e-9    7ee-ec    55168faf14
10004    b6-47    41-4e3d-b    681-4f    cbe71417f2
10005    9f-af    f0-4df5-8    ea3-b5    a092755912
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10007    c8-8f    21-4239-b    fea-2b    82d21a1cbd
10008    b4-07    9e-47e8-b    e1e-0e    a9999a8cd7
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10010    af-c2    d1-4dfd-a    589-87    edfbe7d380
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10012    42-ab    7b-4947-9    6fd-10    36c98a1235
10013    ae-aa    d7-4204-a    351-c1    851871300b
10014    d3-1c    64-4a57-b    aa0-68    804191270a
10015    7b-ea    6a-426e-b    ae2-45    ed2c358c6a
10016    be-29    b2-4aae-9    e61-d4    d019adef79
10017    cd-b5    71-49a9-b    c6d-eb    97b49401b9
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2
10019    9c-f8    bf-4da7-a    400-25    19042a98df

One way to find these duplicates is to cut out fields 1, 3 and 5 with cut, sort the cut-out fields on the second field (the former field 3) with sort, and look for duplicates with uniq -D, ignoring the first field (the former field 1) with the -f option:

$ cut -f1,3,5 table | sort -t $'\t' -k2 | uniq -f1 -D
10006    71-49a9-b    1fca079c31
10011    71-49a9-b    1fca079c31
10001    d4-4318-8    cbe71417f2
10009    d4-4318-8    cbe71417f2
10018    d4-4318-8    cbe71417f2

Note that uniq -f only works properly if all the fields to be ignored (in this case just field 1) are simple strings without spaces. If a table lacks a suitable first ID field, you can add one with nl.
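
For example, nl numbers the lines of a file (non-blank lines, by default) and separates each number from the text with a tab; the -w1 option keeps the numbers unpadded. A quick demonstration on the file from the first example:

$ nl -w1 file
1    aaa    bbb    ccc
2    ddd    eee    fff
3    ggg    hhh    iii
4    jjj    kkk    lll
5    ddd    eee    fff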

Save the duplicate record fragments as a temp file:

$ cut -f1,3,5 table | sort -t $'\t' -k2 | uniq -f1 -D > temp

Next, use an AWK array indexed with the ID-field values from the temp file to find the corresponding full records in the main table. These are then sorted to group the duplicates and to put them in IDField1 order:

$ awk -F"\t" 'FNR==NR {a[$1];next} $1 in a' temp table \
| sort -t $'\t' -k3,3 -k1
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2
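
If your shell supports process substitution (bash and zsh do), you can skip the temp file and feed the duplicate fragments straight to AWK. This is just the same two commands combined, not a new method:

$ awk -F"\t" 'FNR==NR {a[$1];next} $1 in a' \
<(cut -f1,3,5 table | sort -t $'\t' -k2 | uniq -f1 -D) table \
| sort -t $'\t' -k3,3 -k1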

You can get the same result more directly, without the temp file, using a single AWK command:

$ awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' table \
| sort -t $'\t' -k3,3 -k1 | uniq
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2

The array 'a' is indexed with the 2 fields of interest, and each array element stores a whole line. If a line's fields 3 and 5 are already in 'a', the previously stored line is assigned to the variable 'b', and 'b' is printed followed by the line currently being checked. Because the current line then replaces the stored one, the next successful check for the same field pair prints that line again. In other words, the middle line of a triplet appears twice:

$ awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' table
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2

In the full command (above), these repeated lines are removed with uniq after the lines are sorted to group them by field 3 and put them in IDField1 order.
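
For readability, here's the AWK program from the full command written out with comments; it's functionally identical to the one-liner:

$ awk -F"\t" '
b = a[$3,$5] {        # if fields 3 and 5 have been seen before, b gets the stored line
    print b "\n" $0   # print the stored duplicate, then the current line
}
{ a[$3,$5] = $0 }     # always store the current line under its fields-3-and-5 key
' table | sort -t $'\t' -k3,3 -k1 | uniq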

The full command is a straightforward way to approach the problem, but because the array stores whole lines it requires a fair bit of memory and isn't really suited to very large data tables. The cut method above is less direct, but its logic is arguably clearer, it demands less memory, and it also generates a possibly useful temp file containing the duplicated fields and their record IDs.
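
If memory is tight and the table is in a file (so AWK can read it twice), another option, offered here as a sketch rather than part of the recipe above, is a two-pass command that only counts the field pairs on the first pass, then prints the records whose pair occurs more than once. It has the side benefit of preserving the original record order:

$ awk -F"\t" 'FNR==NR {c[$3,$5]++; next} c[$3,$5]>1' table table
10001    34-ba    d4-4318-8    df9-ab    cbe71417f2
10006    a8-36    71-49a9-b    a15-c2    1fca079c31
10009    39-cf    d4-4318-8    75c-e7    cbe71417f2
10011    db-3f    71-49a9-b    24c-5a    1fca079c31
10018    89-bb    d4-4318-8    f9d-60    cbe71417f2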



In a real-world case, the data table I was checking had 31 fields, among them separate fields for each of the components of a scientific name. For example, in addition to a field with the full scientific name Acantholophus dixoni Ferguson, 1915, there were also fields for genus (= Acantholophus), species (= dixoni) and authorship (= Ferguson, 1915).

A check of the full-scientific-name field found only 18 duplicated records. For a broader search I looked for duplicates in the genus and species fields and found more than 21000! When I added back the authorship field I could see why: records were being kept separate even when they differed only in minor spelling errors or formatting in the authorship field. Some simple data cleaning would remove more than 10000 duplicates.

RecordID     Genus     Species     Authorship
7837290     Acantholophus     dixoni     Fergusson, 1915
7854925     Acantholophus     dixoni     Ferguson, 1915

7794680     Acalles     anchonoides     Champ. , 1910
7854125     Acalles     anchonoides     Champion , 1910
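
Applied to a tab-separated file built from a fragment like the one above (called records here, a hypothetical name), the AWK-array one-liner from earlier finds these near-duplicates by matching on the genus and species fields alone:

$ awk -F"\t" 'b=a[$2,$3] {print b"\n"$0} {a[$2,$3]=$0}' records
7837290     Acantholophus     dixoni     Fergusson, 1915
7854925     Acantholophus     dixoni     Ferguson, 1915
7794680     Acalles     anchonoides     Champ. , 1910
7854125     Acalles     anchonoides     Champion , 1910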