logo

Broken records - 1

In a table with N fields, every record should have exactly N fields. A record with more or less than N fields is broken and needs fixing.

Check for broken records using AWK to count fields, either with a typed-in command or with the broken function (to save typing):

$ awk -F"\t" '{print NF}' table | sort -n | uniq -c
 
$ broken() { awk -F"\t" '{print NF}' "$1" | sort -n | uniq -c; }

For a 10000-record file called table with 20 fields, the result should be

$ broken table
10000 20

If there are many records with other than the expected number of fields, those records can be exported for separate inspection. In this particular case:

$ awk -F"\t" 'NF!=20' table > inspect

The 'broken' check can also be done on an all-fields-double-quoted CSV by using its field separator:

$ awk -F'","' '{print NF}' table.csv | sort -n | uniq -c

Examples of broken records are shown below and on the next page.


Too many fields

In file1 there's an unneeded tab embedded in the 3rd field of record 2:

$ cat file1
aaa   bbb   ccc
ddd   eee   ff   f
ggg   hhh   iii
jjj   kkk   lll
 
$ broken file1
3 3
1 4
 
$ awk -F"\t" 'NF==4 {print NR": "$0}' file1
2: ddd   eee   ff   f

To fix this:

$ sed '2s/ff\tf/fff/' file1
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ awk -F"\t" 'NR==2 {print $1"\t"$2"\t"$3$4; next} 1' file1
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ awk 'BEGIN {FS=OFS="\t"} NR==2 {print $1,$2,$3$4; next} 1' file1
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll

In file2, the extra within-field tab is here replaced with a single whitespace:

$ cat file2
aaa   bbb   ccc
ddd   eee   ff   X
ggg   hhh   iii
jjj   kkk   lll
 
$ broken file2
3 3
1 4
 
$ awk -F"\t" 'NF==4 {print NR": "$0}' file2
2: ddd   eee   ff   X
 
----------
 
$ sed '2s/ff\tX/ff X/' file2
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll
 
$ awk -F"\t" 'NR==2 {print $1"\t"$2"\t"$3" "$4; next} 1' file2
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll
 
$ awk 'BEGIN {FS=OFS="\t"} NR==2 {print $1,$2,$3" "$4; next} 1' file2
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll

In file3, field 2 of record 2 has been separated off by 2 tabs instead of 1:

$ cat file3
aaa   bbb   ccc
ddd      eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ broken file3
3 3
1 4
 
$ awk -F"\t" 'NF==4 {print NR": "$0}' file3
2: ddd      eee   fff
 
----------
 
$ sed '2s/d\t\te/d\te/' file3
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ awk -F"\t" 'NR==2 {print $1"\t"$3"\t"$4; next} 1' file3
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll
 
$ awk 'BEGIN {FS=OFS="\t"} NR==2 {print $1,$3,$4; next} 1' file3
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll

Note that in this last example, sed is addressed to line 2 and the actual contents of fields 1 and 2. In a real-world case, running sed '2s/\t\t/\t/' or sed '2s/\t\t/\t/g' could delete genuinely blank fields.

On to too few fields...