logo

Invalid values

Invalid values can be detected either by simply looking over the results of a tally, maxchk or minchk on a field, or by searching the field using an appropriate regex. In the example below, field 3 of table is 'Year'. The grep and AWK searches are looking for values that aren't '19nn', '20nn' or blank.

$ tally table 3
 11
  1 1798
 25 1995
 23 1996
 28 1997
 24 1998
 31 1999
  1 200
 29 2000
# etc
  1 417.30
 
$ cut -f3 table | grep -Env '^19[0-9]{2}$|^20[0-9]{2}$|^$'
62:200
149:417.30
156:1798
 
$ awk -F"\t" '$3 !~ /^19[0-9]{2}$/ && $3 !~ /^20[0-9]{2}$/ && $3 != "" {print NR": "$3}' table
62: 200
149: 417.30
156: 1798

The first returns on tally are 11 blanks. Are those years really missing from the data set, or are they in another field, perhaps not properly parsed out from (for example) Sydney1998, which should have been entered as Sydney 1998? Needs checking.

$ awk -F"\t" '$3 == ""' table > blanks
# Saves records where field 3 is blank to a new file for inspection

The 1798 on line 156 and 200 on line 62 look like data entry errors, the first possibly a typo for 1998, the second for a year in the 2000's. The 417.30 on line 149 looks like a case of field-shifting, where a money item from another, possibly adjacent field has been shifted into the 'Year' field. The record needs checking to see if any other fields have been shifted.

$ awk -F"\t" '$3 == "200" || $3 == "1798" || $3 == "417.30"' table
# Prints to screen the 3 questionable records

When invalid values have been checked and their correct values found, replacements need to be carefully done record-by-record. For example, if the 1798 on line 156 was really meant to be 1998, the sed substitution

$ sed '156s/1798/1998/' table

should only be done if neither field 1 nor field 2 in record 156 contains the string '1798', because a simple sed substitution will target the first (leftmost) match to the target string in the record. (Use hunt to check for this.) A safer option is to use AWK to target just the record and the field in question:

$ awk 'BEGIN {FS=OFS="\t"} NR==156 {sub("1798","1998",$3)} 1' table
 
$ awk 'BEGIN {FS=OFS="\t"} NR==156 {$3="1998")} 1' table
 
--------
 
$ cat file
abbbc   dbbbe
abbbc   dbbbe
abbbc   dbbbe
abbbc   dbbbe
 
awk 'BEGIN {FS=OFS="\t"} NR==3 {sub("bbb","xxx",$2)} 1' file
abbbc   dbbbe
abbbc   dbbbe
abbbc   dxxxe
abbbc   dbbbe
 
awk 'BEGIN {FS=OFS="\t"} NR==3 {$2="dxxxe"} 1' file
abbbc   dbbbe
abbbc   dbbbe
abbbc   dxxxe
abbbc   dbbbe