
Broken records - 2

Back to too many fields...


Too few fields

A common cause of too few fields is a record that has been split over 2 or more lines, with the break falling either between fields or within a field. In file1 the split is between fields:

$ cat file1
aaa   bbb   ccc   ddd
eee
fff   ggg   hhh
iii   jjj   kkk   lll
mmm   nnn   ooo   ppp
qqq   rrr   sss   ttt
 
$ broken file1
1 1
1 3
4 4

and in file2 the split is within a field:

$ cat file2
aaa   bbb   ccc   ddd
eee   ff
f   ggg   hhh
iii   jjj   kkk   lll
mmm   nnn   ooo   ppp
qqq   rrr   sss   ttt
 
$ broken file2
1 2
1 3
4 4
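
The broken command is not defined on this page. As a rough stand-in, assuming it simply tallies how many records have each number of tab-separated fields, a hypothetical function like fieldtally below would give the same kind of report (number of records, then number of fields). The function name and the temporary file are illustrative only:

```shell
# Hypothetical stand-in for 'broken' (an assumption, not the original definition):
# tally records by their number of tab-separated fields.
# Prints "number_of_records number_of_fields", sorted by field count.
fieldtally() {
    awk -F"\t" '{count[NF]++} END {for (n in count) print count[n], n}' "$1" |
        sort -k2,2n
}

# Rebuild file1 from the example above in a temporary file (tabs between fields)
tally_demo=$(mktemp)
printf 'aaa\tbbb\tccc\tddd\neee\nfff\tggg\thhh\niii\tjjj\tkkk\tlll\nmmm\tnnn\tooo\tppp\nqqq\trrr\tsss\tttt\n' > "$tally_demo"

fieldtally "$tally_demo"
# 1 1
# 1 3
# 4 4
```

A tally with most records at one field count and a few at smaller counts is the pattern to look for when a record has been split.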

If the split is across 2 successive lines, a search for the two short field counts with AWK's 'either/or' operator (||) will find the adjoining pair of lines:

$ awk -F"\t" 'NF==1 || NF==3 {print NR": "$0}' file1
2: eee
3: fff   ggg   hhh
 
$ awk -F"\t" 'NF==2 || NF==3 {print NR": "$0}' file2
2: eee   ff
3: f   ggg   hhh
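
When the short field counts are not known in advance, the same search can be written against the expected field count instead. This is a minimal sketch, assuming a tab-separated file in which every intact record has the same number of fields; shortrecs and want are illustrative names, not part of the original commands:

```shell
# shortrecs FILE WANT
# Print "line number: line" for every record with fewer than WANT fields.
shortrecs() {
    awk -F"\t" -v want="$2" 'NF < want {print NR": "$0}' "$1"
}

# Rebuild file2 from the example above in a temporary file (tabs between fields)
short_demo=$(mktemp)
printf 'aaa\tbbb\tccc\tddd\neee\tff\nf\tggg\thhh\niii\tjjj\tkkk\tlll\nmmm\tnnn\tooo\tppp\nqqq\trrr\tsss\tttt\n' > "$short_demo"

# Reports lines 2 and 3, the two pieces of the split record
shortrecs "$short_demo" 4
```

Short records that turn up on adjoining lines are good candidates for a split, but a record that is genuinely missing fields would be reported in the same way, so the output still needs checking by eye.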

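Joining 2 successive lines is most easily done with sed. On line 2 the N command appends the following line to the pattern space, and the substitution then replaces the embedded newline with a tab (split between fields) or with nothing (split within a field):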

$ sed '2N;s/\n/\t/' file1
aaa   bbb   ccc   ddd
eee   fff   ggg   hhh
iii   jjj   kkk   lll
mmm   nnn   ooo   ppp
qqq   rrr   sss   ttt
 
$ sed '2N;s/\n//' file2
aaa   bbb   ccc   ddd
eee   fff   ggg   hhh
iii   jjj   kkk   lll
mmm   nnn   ooo   ppp
qqq   rrr   sss   ttt

and joins with sed can be 'ganged' in a single command, with one N for each line that needs to be joined to its successor:

$ sed '311N;4065N;4067N;8339N;s/\n/\t/' table
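
A point worth checking on a small case is that the addresses in a ganged command are line numbers in the original file, so they do not shift after an earlier join. The sketch below uses a made-up 7-line table (not one of the files above) with records split between fields at lines 2-3 and 5-6, and assumes GNU sed, which understands \t in the replacement:

```shell
# Made-up table: records split between fields at lines 2-3 and lines 5-6
gang_demo=$(mktemp)
printf 'aaa\tbbb\tccc\nddd\neee\tfff\nggg\thhh\tiii\njjj\tkkk\nlll\nmmm\tnnn\tooo\n' > "$gang_demo"

# gangjoin FILE
# Join line 2 to line 3 and line 5 to line 6 with a tab.
# The second address is still 5, the line number in the original file.
gangjoin() {
    sed '2N;5N;s/\n/\t/' "$1"
}

gangjoin "$gang_demo"
```

The result has 5 lines of 3 fields each. If the splits were within fields rather than between them, the substitution would be s/\n// instead, so between-field and within-field joins need separate commands.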

If the pieces of the split record have been shuffled after the split, some fancy AWK work may be required to rejoin them. In the command used here, the file name is given twice so that AWK makes 2 passes through the file; the condition FNR==NR is true only during the first pass. In the first pass AWK stores the trailing piece 'fff' on line 5 in a variable, and in the second pass it appends a tab and the contents of the variable to line 2, and doesn't print line 5:

$ cat file3
aaa   bbb   ccc
ddd   eee
ggg   hhh   iii
jjj   kkk   lll
fff
mmm   nnn   ooo
 
$ awk 'FNR==NR {if (NR==5) var=$0; next} FNR==2 {print $0"\t"var; next} FNR==5 {next} 1' file3 file3
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll
mmm   nnn   ooo
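
The same two-pass idea can be wrapped in a small function so that the line numbers are passed in rather than hard-coded. This is only a sketch, assuming a single stray piece that belongs at the end of a single target line; rejoin, target, stray and piece are illustrative names:

```shell
# rejoin FILE TARGET STRAY
# Append line STRAY to the end of line TARGET (tab-separated) and drop line STRAY.
rejoin() {
    awk -v target="$2" -v stray="$3" '
        FNR == NR { if (FNR == stray) piece = $0; next }  # pass 1: remember the stray piece
        FNR == stray { next }                             # pass 2: skip the stray line
        FNR == target { print $0 "\t" piece; next }       # pass 2: rebuild the split record
        1                                                 # pass 2: print all other lines unchanged
    ' "$1" "$1"
}

# Rebuild file3 from the example above in a temporary file (tabs between fields)
rejoin_demo=$(mktemp)
printf 'aaa\tbbb\tccc\nddd\teee\nggg\thhh\tiii\njjj\tkkk\tlll\nfff\nmmm\tnnn\tooo\n' > "$rejoin_demo"

# Same result as the command above: 5 records of 3 fields each
rejoin "$rejoin_demo" 2 5
```

For a piece split off within a field rather than between fields, the "\t" in the print statement would be dropped so that the two pieces are joined directly.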


