logo

Incorrect or inconsistent formatting

One way to find formatting problems in a field is to look over the results of a tally. In the following list (from a real-world case), note the different versions of the name "Arthur Mills Lea" (and some domain schizophrenia):

$ tally ala4 28
# Extracts only
   3 A Lea (1917)
  42 A M. Lea
  40 A. M. Lea
1027 Arthur M. Lea
   1 Lea, A M
  37 Lea, A. M.
  32 Lea, A.M.
  52 Lea,A.M.
 120 Lea, Arthur M. - South Australian Museum
   4 [Lea coll.]
  92 Lea, Mr Arthur Mills
   1 R. M. Lea    # From context, this is an error for A. M. Lea

The 'clustering' function of OpenRefine software can be used (with practice) for finding variants like these and replacing all of them with a single selected format, but I prefer the simplicity of working on the command line. To complement tally I use regex. In this case

$ awk -F"\t" '$28 ~ /Lea/ {print $28}' ala4 | sort | uniq

would be a way to check for variants that I missed when looking over the tally results.

Regex can also be used to check for incorrect formatting. The command

$ grep -o "[[:punct:]]" table | sort | uniq -c | pr -t -3

will extract, sort and count all punctuation characters, then display the result in 3 columns:

punctuation

In this real-world example, there are 2330 apostrophes (single quotes) in the file, but also 1184 acute accents, 36 left single quotes and 11272 right single quotes, all doing the same punctuation job! The non-apostrophe "apostrophes" can be found in the table by grepping and replaced with the standard apostrophe character. (The 123 blanks at the end of the output are non-breaking spaces.)

Also of interest are the paired round brackets: 9009 left ones and 9014 right ones. Some must be unmatched. To check whether there are matching pairs of punctuation in every individual field I use the following script, called unmatched. The "unmatched" command takes 3 arguments. The first is the filename and the second and third are the brackets pair, each enclosed in double quotes.

#!/bin/bash
 
awk -F"\t" -v start="$2" -v end="$3" \
'{for (i=1;i<=NF;i++) \
if (split($i,a,start) != split($i,b,end)) \
{print "line "NR", field "i": "$i}}' "$1"
 
exit

unmatched returns line number, field number and the contents of the field with the unmatched bracket. Below I run unmatched on the file demo:

unmatched

Note that unmatched did not find the field ")kkkk(k" with matched but backwards braces. Cases like these could be found with a regex:

awk -F"\t" \
'{for (i=1;i<=NF;i++) \
if ($i ~ /^[^()]*\)[^()]*\(/) \
{print "line "NR", field "i": "$i}}' "$1"
 
exit

backwards

Formatting checks can be done with regex by looking for punctuation errors, such as a space before a comma or no space after one. This is command-line proofreading and is slow, but it's a very effective way to detect inconsistent formatting.

$ cat file1
aaa, bbb    ccc, ddd    eee, fff
ggg , hhh    iii, jjj    kkk, mmm
nnn, ooo    ppp,qqq    rrr, sss
 
$ grep -En ' ,|,[^[:blank:]]' file1
2:ggg ,/ hhh    iii, jjj    kkk, mmm
3:nnn, ooo    ppp,qqq    rrr, sss

A useful check is to see if there are any non-blank fields which don't begin and end with an alphanumeric character:

$ cat file2
oranges   apples,            ?grapes
pears    bananas   cherries!   apricots    # 'pears ' ends in a whitespace

$ awk -F"\t" '{for (i=1;i<=NF;i++) if ($i != "" && ($i !~ /^[[:alnum:]]/ || $i !~ /[[:alnum:]]$/)) print NR" f"i": "$i}' file2
1 f2: apples,
1 f4: ?grapes
2 f1: pears
2 f3: cherries!

Finally, the easiest way to reduce a run of multiple whitespaces (or other single character) to a single whitespace is with the tr command and its '-s' ('squeeze') option:

$ cat file3
aaa      bbb
 
$ tr -s ' ' < file3 > file4
 
$ cat file4
aaa bbb