logo

Blank records

Blank records contribute nothing to a data table because they contain no information. Finding them can be tricky, however, since apparently blank lines can be filled with invisible spaces and tabs:

$ cat file
This line is followed by a genuinely blank line.
 
This line is followed by one with 10 spaces.
 
This line is followed by one with 2 tabs.
 

Here's what file looks like when opened in the Gedit text editor with its 'Draw Spaces' plugin and the lines numbered:

gedit

To find all blank lines, genuine and non-genuine, together with their line numbers in file:

$ awk 'NF==0 {print NR}' file
2
4
6
 
$ grep -n "^[[:blank:]]*$" file
2:
4:
6:

You can strip away everything except the line numbers from the grep -n result with a cut command, by specifying a colon as field delimiter for cut and selecting just the first field. The list of blank lines from AWK or grep should be recorded before any blank lines are removed from the table:

$ awk 'NF==0 {print NR}' file > list_of_blank_lines
 
$ grep -n "^[[:blank:]]*$" file | cut -d ':' -f1 > list_of_blank_lines

To delete all blank lines in the file table, reverse the blank-finding commands. In other words, look for lines with content other than tabs and spaces:

$ awk 'NF!=0' table > table_no_blanks
 
$ awk 'NF' table > table_no_blanks
 
$ grep -v "^[[:blank:]]*$" table > table_no_blanks



Blank data items

From a data cleaner's point of view, there's nothing wrong with a blank data item. It's just an empty string in a field in a particular record, and unless you know that some piece of information should definitely be in that field, there's nothing to be done about it.

However, data tables sometimes have 'blank' data items which are actually whitespaces. Whitespaces in 'blank' data items will cause auditing problems, because those data items are not, in fact, blank:

$ cat file
Col1    Col2    Col3
a aa    fff    bbb    #Col2 here has 2 whitespaces
cc c    fff    ddd    #Col2 here is genuinely blank
eee    fff    gg g
 
$ tally file 2
1
1
1 fff

To count records containing data items with whitespace only:

$ grep -cP "\t[ ]+\t"
 
$ awk -F"\t" '/\t[ ]+\t/ {count++} END {print count}'

To locate data items with whitespace only:

$ cat file
Col1    Col2    Col3
a aa    fff    bbb    #Col2 here has 2 whitespaces
cc c    fff    ddd    #Col2 here is genuinely blank
eee    fff    gg g
 
$ awk -F"\t" '{for (i=1;i<=NF;i++) if ($i ~ /^[ ]+$/) {print "line "NR": field "i}}' file
line 2: field 2

To get rid of whitespaces in 'blank' data items, you can use sed to edit particular items, or do a global edit by first squeezing all multiple whitespaces in the table to one space using tr, then delete the single whitespaces in the 'blank' data items with sed:

$ cat file
Col1    Col2    Col3
a aa    fff    bbb    #Col2 here has 2 whitespaces
cc c    fff    ddd    #Col2 here is genuinely blank
eee    fff    gg g
 
$ tr -s ' ' < file | sed 's/\t \t/\t\t/g' > fileA
 
$ grep -cP "\t[ ]+\t" fileA
0



Blank fields

Like blank records, blank fields contribute nothing to a data table, apart from the field name. The following BASH script (I call it empties) takes the name of the data table as its one argument and lists any blank fields in the table along with their field numbers. The list is printed to screen and also saved in a text file. The script also reports the number of blank fields in the table and offers to build a new table without the blank fields. The script is shown working on a 5-field table called demo in which fields 2 and 4 are blank. Note that empties ignores whitespaces.

#!/bin/bash
 
totf=$(($(head -n 1 "$1" | grep -o $'\t' | wc -l)+1))
for ((i=1;i<="$totf";i++)); do \
cut -f "$i" "$1" | awk -v fnum="$i" 'NR==1 {label=$0} \
NR>1 && gsub(/[[:graph:]]/,"")>0 f=1; exit} \
END {if (!f) print fnum ":" label}';
done | tee "$1"_emptyfields
echo ""
echo "$(wc -l < "$1"_emptyfields) empty field(s) in "$1" table"
echo
read -p "Build table with non-empty fields only? (y/n)" foo
if [[ "$foo" == "y" ]]; then
cut --complement -f$(cut -d':' -f1 "$1"_emptyfields | paste -d',' -s) "$1" > "$1"_nonemptyfields
else
exit
fi
 
exit


empties