logo

Blank records

Blank records contribute nothing to a data table because they contain no information. Finding them can be tricky, however, since apparently blank lines can be filled with invisible spaces and tabs:

$ cat file
This line is followed by a genuinely blank line.
 
This line is followed by one with 10 spaces.
 
This line is followed by one with 2 tabs.
 

Here's what file looks like when opened in the Gedit text editor with its 'Draw Spaces' plugin and the lines numbered:

gedit

To find all blank lines, genuine and non-genuine, together with their line numbers in file:

$ awk 'NF==0 {print NR}' file
2
4
6
 
$ grep -n "^[[:blank:]]*$" file | sed 's/://'
2
4
6

The list of blank lines from AWK or grep should be recorded before any blank lines are removed from the table:

$ awk 'NF==0 {print NR}' file > list_of_blank_lines
 
$ grep -n "^[[:blank:]]*$" file | sed 's/://' > list_of_blank_lines

To delete all blank lines in the file table, reverse the blank-finding commands. In other words, look for lines with content other than tabs and spaces:

$ awk 'NF!=0' table > table_no_blanks
 
$ awk 'NF' table > table_no_blanks
 
$ grep -v "^[[:blank:]]*$" table > table_no_blanks



Blank data items

From a data cleaner's point of view, there's nothing wrong with a blank data item. It's just an empty string in a field in a particular record, and unless you know that some piece of information should definitely be in that field, there's nothing to be done about it.

However, data tables sometimes have 'blank' data items which are actually whitespaces. Whitespaces in 'blank' data items may cause auditing problems, because those data items are not, in fact, blank:

$ cat file
Col1    Col2    Col3
a aa    fff    bbb    #Col2 here has 2 whitespaces
cc c    fff    ddd    #Col2 here is genuinely blank
eee    fff    gg g
 
$ tally file 2
1
1
1 fff

To count records containing data items with whitespace only:

$ grep -cP "\t[ ]+\t"
 
$ awk -F"\t" '/\t[ ]+\t/ {count++} END {print count}'

To locate data items with whitespace only:

$ cat file
Col1    Col2    Col3
a aa    fff    bbb    #Col2 here has 2 whitespaces
cc c    fff    ddd    #Col2 here is genuinely blank
eee    fff    gg g
 
$ awk -F"\t" '{for (i=1;i<=NF;i++) if ($i ~ /^[ ]+$/) {print "line "NR": field "i}}' file
line 2: field 2

To get rid of whitespaces in 'blank' data items, you can use sed to edit particular items, or do a global edit by first squeezing all multiple whitespaces in the table to one space using tr, then delete the single whitespaces in the 'blank' data items with sed:

$ cat file
Col1    Col2    Col3
a aa    fff    bbb    #Col2 here has 2 whitespaces
cc c    fff    ddd    #Col2 here is genuinely blank
eee    fff    gg g
 
$ tr -s ' ' < file | sed 's/\t \t/\t\t/g' > fileA
 
$ grep -cP "\t[ ]+\t" fileA
0



Blank fields

Like blank records, blank fields (blank apart from the field name in the header line) contribute nothing to a data table, apart from the field name. The following BASH script (I call it empties) takes the filename of the data table as its one argument and lists any blank fields in the table (if there are any, otherwise the script exits) along with their field numbers in the text file [filename]_emptyfields. The script reports the number of blank fields in the table and offers to build a new table without the blank fields, with a filename that you choose. The script is shown working on a 5-field table called fileC in which fields 2 and 4 are blank. Note that empties does not take into account data items made up entirely of whitespace. Whitespace-only entries in a field are best dealt with in field-by-field checking with tally.

#!/bin/bash
 
awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' "$1" > /tmp/flds
head -n1 "$1" | tr '\t' '\n' | nl -w1 > /tmp/allflds
 
if [ ! -s /tmp/flds ]; then
   echo
   echo "No empty fields in $1 table" && rm /tmp/flds /tmp/allflds && exit
else
   awk -F"\t" 'FNR==NR {b[$1]=$2; next} $1 in b {print $1":"b[$1]}' /tmp/allflds /tmp/flds > "$1"_emptyfields
   echo
   echo "$(wc -l < "$1"_emptyfields) empty field(s) in "$1" table"
   echo
   read -p "Build table with non-empty fields only? (y/n) " build
   if [[ "$build" == "y" ]]; then
      read -p "What name for the new table? " name
      cut --complement -f$(paste -d',' -s /tmp/flds) "$1" > "$name"
   else
      rm /tmp/flds /tmp/allflds && exit
   fi
fi
rm /tmp/flds /tmp/allflds
exit


empties