
Duplicate records

Exact, character-for-character duplicate records are most easily found with sort and uniq -D:

$ cat file
aaa    bbb    ccc
ddd    eee    fff
ggg    hhh    iii
jjj    kkk    lll
ddd    eee    fff

$ sort file | uniq -D
ddd    eee    fff
ddd    eee    fff

However, it's unlikely that you'll find exact, character-for-character duplicate records in a data table, especially if the table derives from a database that has added a unique ID string to each record. For this reason a search for duplicates needs to be based on how the records will actually be used. In general, you might say that there's duplication in a data table when all the information you currently need from a particular record is also found in another record or records. Duplication of this kind often happens when you combine two different versions of a single table, quite a common occurrence when databases are merged.


In the following example, we're looking for records with the same data items in fields 3 and 5. There's one pair and one triplet of such records:

[Image: dupes1, the example table]

Two AWK recipes will find these partial duplicates. One command reads the file twice, uses very little memory and is suited to files of any size. The other reads the file once; with big files it uses a lot of memory. Both commands require post-AWK sorting to make the results easier to read.

Here's the two-pass command at work on the example:

awk -F"\t" 'FNR==NR {a[$3,$5]++; next} a[$3,$5]>1' file file \
| sort -t $'\t' -k3 -k1

[Image: dupes2, output of the two-pass command]

and the one-pass command:

awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' file \
| sort -t $'\t' -k3 -k1 | uniq

[Image: dupes3, output of the one-pass command]
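
To spell out what the two recipes are doing, here are the same commands again with annotations added as comments; the commands themselves are unchanged:

# Two-pass: the first read of the file (while FNR==NR) counts each
# field 3 + field 5 combination; the second read prints any record
# whose combination was counted more than once.
awk -F"\t" 'FNR==NR {a[$3,$5]++; next} a[$3,$5]>1' file file \
| sort -t $'\t' -k3 -k1

# One-pass: if the field 3 + field 5 combination has been seen before,
# print the record stored for it (b) and the current record, then
# store the current record; sort and uniq remove the repeats this
# printing produces.
awk -F"\t" 'b=a[$3,$5] {print b"\n"$0} {a[$3,$5]=$0}' file \
| sort -t $'\t' -k3 -k1 | uniq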

I recommend the two-pass command for general use. Any number of fields can be checked in this way (not just two), but the choice of fields can be tricky, and if the checked fields are blank in some records, those records will show up in the output. To remove the records with blanks, add a "no-blanks" condition to the command, as sketched below. See this BASHing data blog post for an example.
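
A minimal sketch of that no-blanks condition, assuming (as in the example above) that the checked fields are 3 and 5. The added tests simply exclude records in which either checked field is empty:

awk -F"\t" 'FNR==NR {a[$3,$5]++; next} $3!="" && $5!="" && a[$3,$5]>1' file file \
| sort -t $'\t' -k3 -k1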



In a real-world case, the data table I was checking had 31 fields, among them separate fields for each of the components of a scientific name. For example, in addition to a field with the full scientific name Acantholophus dixoni Ferguson, 1915, there were also fields for genus (= Acantholophus), species (= dixoni) and authorship (= Ferguson, 1915).

A check of the full-scientific-name field found only 18 duplicated records. For a broader search I looked for duplicates in the genus and species fields and found more than 21000! When I added back the authorship field I could see why. Separate records were being kept even when they differed only by minor spelling errors or formatting differences in the authorship field (see the table below, and the sketch after it). Some simple data cleaning would remove more than 10000 duplicates.

RecordID    Genus            Species        Authorship
7837290     Acantholophus    dixoni         Fergusson, 1915
7854925     Acantholophus    dixoni         Ferguson, 1915

7794680     Acalles          anchonoides    Champ. , 1910
7854125     Acalles          anchonoides    Champion , 1910
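
For illustration only, here's how the broader genus + species check might look using the two-pass recipe. The field numbers below follow the 4-column excerpt above (Genus = field 2, Species = field 3), not the real 31-field table:

awk -F"\t" 'FNR==NR {a[$2,$3]++; next} a[$2,$3]>1' file file \
| sort -t $'\t' -k2 -k3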