
Gremlin characters

When cleaning UTF-8 text files I sometimes come across invisible characters that I call 'gremlins'. These aren't the usual non-printing characters, like space and (horizontal) tab, which I expect to find in the plain text files I work with. Gremlins are weird things like 'vertical tab', 'device control 2' and 'soft hyphen'.

Unless you deliberately hunt for gremlins, you can easily miss them. I use a BASH script called gremlins, shown at the bottom of this page, to do my hunting. The first part of the script checks for soft hyphens and non-breaking spaces, as well as replacement characters and Windows carriage returns.

A soft hyphen is placed at a preferred breakpoint in a word, so that programs like Web browsers know where to insert a visible hyphen when splitting the word over two successive lines. In a terminal a soft hyphen appears as a single space, but it's invisible in a text editor. Soft hyphens can be found with a search for their hex value, but they remain invisible until the output is highlighted:

$ grep -P "\xad" table
 
$ grep $'\xc2\xad' table


Another visibility trick is to pass the text through sed -n 'l', which reveals soft hyphens as octal 302 255 (hex c2 ad). The octal value can then be grepped:

$ sed -n 'l' table | grep -F '\302\255'


A non-breaking space is a character that prevents a line break from happening at its position. Like a soft hyphen, a non-breaking space appears as a single space in a terminal and is invisible in a text editor. Also like soft hyphens, non-breaking spaces can be grepped by their hex code (c2 a0) and revealed by highlighting or with the sed -n 'l' trick.
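
As a rough sketch of the non-breaking space search, the commands below build a throwaway two-line file and look for the bytes c2 a0 in it. The sample text and the variable names are made up for illustration, and LC_ALL=C is my own addition so that the tools work on raw bytes regardless of locale.

```shell
# Hypothetical sample: line 1 has a non-breaking space, line 2 a plain space.
sample=$(mktemp)
printf '10\xc2\xa0km\n10 km\n' > "$sample"

# Find lines containing the byte pair c2 a0 (add --color=always to highlight).
nbsp_hits=$(LC_ALL=C grep -n $'\xc2\xa0' "$sample")
printf '%s\n' "$nbsp_hits"

# Make the character visible as octal (302 240) with sed's l command,
# then search for that literal text with a fixed-string grep.
nbsp_octal=$(LC_ALL=C sed -n 'l' "$sample" | grep -F '\302\240')
printf '%s\n' "$nbsp_octal"

rm -f "$sample"
```

The -F option makes grep treat the backslashes as ordinary characters rather than as regular-expression escapes.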

The second part of the gremlins script uses a bit of AWK voodoo to tally the invisible characters other than space, horizontal tab, soft hyphen, non-breaking space and Windows carriage return, together with their hexadecimal values. These gremlins are C0 and C1 control characters. The script returns the tally of each gremlin after its name and hex code. The gremlin names are taken from a look-up table of mine which you can download here. (Note that the script uses the filename chars.)
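
The final step of the script joins the tally to the look-up table on the hex code. Here is a minimal sketch of that join, assuming the table is tab-separated with the hex code in field 1 and the name in field 2 (which is how the script reads it). The two table rows and the tally figures are invented for the demonstration.

```shell
# Hypothetical look-up table (hex code, name) and tally (count, hex code).
chars=$(mktemp)
tally=$(mktemp)
printf '0b\tvertical tab\n1e\trecord separator\n' > "$chars"
printf '3\t1e\n1\t0b\n' > "$tally"

# First file: remember each name by hex code. Second file: print name, hex, count.
report=$(awk 'BEGIN {FS=OFS="\t"} FNR==NR {name[$1]=$2; next}
{print name[$2]" (hex "$2"): "$1}' "$chars" "$tally")
printf '%s\n' "$report"

rm -f "$chars" "$tally"
```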

(Screenshot: a real-world example of the gremlins script at work.)

For some purposes, gremlins can be destroyed globally with tr or sed, using the gremlin hex values. A more targeted approach may be called for, as in line 1066231 of taxa.txt, where a space and a record separator (hex 1e) have taken the place of a hyphen in the name 'Schmid-Eggr'. The right approach here is to replace the space and record separator with a hyphen.
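
The two approaches might look something like this. It's only a sketch on an invented two-line file (not the real taxa.txt): the first command deletes every vertical tab, the second repairs the space-plus-record-separator pair.

```shell
# Hypothetical sample: line 1 has space + record separator (hex 1e) where a
# hyphen belongs; line 2 has a stray vertical tab (hex 0b, octal 013).
sample=$(mktemp)
printf 'Schmid \x1eEggr\nAus\x0bbus\n' > "$sample"

# Global removal: delete every vertical tab with tr (tr takes octal escapes).
no_vt=$(tr -d '\013' < "$sample")

# Targeted repair: replace "space + record separator" with a hyphen.
fixed=$(sed $'s/ \x1e/-/g' "$sample")

printf '%s\n' "$no_vt" | sed -n 'l'
printf '%s\n' "$fixed" | sed -n 'l'

rm -f "$sample"
```

Piping the results through sed -n 'l' shows which gremlins are still present after each command.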

Having detected gremlins, I can find their line number and context using a function similar to the one I use for question marks. The function is called gremhunt and takes as its two arguments the filename and the hex value of the gremlin:

gremhunt()
{
grep -noP "[^[:blank:]]*\x$2[^[:blank:]]*" "$1"
}
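
Here's a small usage sketch for gremhunt. It assumes a grep with -P support (GNU grep), and the sample table is invented. The function is repeated so that the example runs on its own.

```shell
gremhunt()
{
grep -noP "[^[:blank:]]*\x$2[^[:blank:]]*" "$1"
}

# Hypothetical tab-separated table with a record separator (hex 1e) on line 2.
sample=$(mktemp)
printf 'id\tname\n1\tfoo\x1ebar baz\n' > "$sample"

# Arguments: filename, then the gremlin's hex value without any prefix.
hits=$(gremhunt "$sample" 1e)

# Show the result with the gremlin made visible.
printf '%s\n' "$hits" | sed -n 'l'

rm -f "$sample"
```

The result is the line number followed by the run of non-blank characters around the gremlin, so 'baz' after the space is left out.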


You can isolate the field containing the gremlin with a variation on the gremhunt command:

grep -noP "[^\t]*\xhex value[^\t]*" table


You can also use non-Perl grep to search for the characters as their hex values, but the syntax is a little complicated:

grep --color=always "["$'\xhex value of character'"]" table

(Screenshot: looking for the first line with a replacement character, hex ef bf bd, in the file sam1.)
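
A search of that kind could be sketched as follows. The sample file is invented, and the -m1 option (stop after the first matching line) is my assumption about how to get just the first hit, not necessarily what was used originally.

```shell
# Hypothetical sample: lines 2 and 3 each contain a replacement character.
sample=$(mktemp)
printf 'caf\xc3\xa9\ncaf\xef\xbf\xbd one\ncaf\xef\xbf\xbd two\n' > "$sample"

# Bracket-expression syntax with the bytes supplied by bash's $'...' quoting.
# -n prints the line number, -m1 stops at the first matching line.
first_hit=$(grep -n -m1 "["$'\xef\xbf\xbd'"]" "$sample")
printf '%s\n' "$first_hit"

rm -f "$sample"
```

Adding --color=always, as in the command above, highlights the character in the terminal.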

Gremlins can also be found and manipulated using AWK, but for gremlins above hex 7f (the C1 control characters, soft hyphen and non-breaking space) the full two-byte UTF-8 value (with "c2") needs to be used. For example:

$ awk -F"\t" '{for (i=1;i<=NF;i++) \
{if ($i ~ /\xc2\xhex value of character/) \
{print "line "NR", field "i": \n"$i}}}' table
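
As for manipulating gremlins with AWK, here's a minimal sketch that replaces every non-breaking space with a plain space. The sample table is invented, and passing the two bytes in through a shell variable (rather than writing \xc2\xa0 inside the AWK program) is my own choice, so that the example doesn't depend on how a particular AWK handles hex escapes.

```shell
# Hypothetical tab-separated table with a non-breaking space in field 2 of line 1.
sample=$(mktemp)
printf 'a\t10\xc2\xa0km\nb\t5 km\n' > "$sample"

# nbsp holds the full two-byte value c2 a0; gsub swaps it for a plain space.
cleaned=$(awk -v nbsp=$'\xc2\xa0' 'BEGIN {FS=OFS="\t"} {gsub(nbsp," ")} 1' "$sample")
printf '%s\n' "$cleaned" | sed -n 'l'

rm -f "$sample"
```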


For more on gremlin characters, see Control characters in ASCII and Unicode, produced by the Finnish software company Aivosto Oy, and the wonderful Graphemica website.



The gremlins script:

#!/bin/bash
 
red="\033[1;31m"
gray="\033[1;37m"
reset="\033[0m"
 
printf "\nFirst check for gremlins, please wait...\n\n"
 
wincr=$(grep -cP "\r" "$1")
if [ "$wincr" -eq "0" ]; then
wc=none
else
wc=$(awk -F"\r" 'NF>1 {a+=(NF-1); b++} END {print a" on "b" lines"}' "$1")
fi
 
shy=$(grep -c $'\xc2\xad' "$1")
if [ "$shy" -eq "0" ]; then
sh=none
else
sh=$(awk -F"\xc2\xad" 'NF>1 {c+=(NF-1); d++} END {print c" on "d" lines"}' "$1")
fi
 
nbsp=$(grep -c $'\xc2\xa0' "$1")
if [ "$nbsp" -eq "0" ]; then
nb=none
else
nb=$(awk -F"\xc2\xa0" 'NF>1 {c+=(NF-1); d++} END {print c" on "d" lines"}' "$1")
fi
 
repcha=$(grep -c "["$'\xef\xbf\xbd'"]" "$1")
if [ "$repcha" -eq "0" ]; then
rc=none
else
count=$(grep -o "["$'\xef\xbf\xbd'"]" "$1" | wc -l)
lines=$(grep -c "["$'\xef\xbf\xbd'"]" "$1")
rc=$(printf "$count on $lines lines\n")
fi
 
printf "$red$1$reset has:\n\nWindows carriage returns (\\\r): $gray$wc$reset\nSoft hyphens (\\\xad): $gray$sh$reset\nNon-breaking spaces (\\\xa0): $gray$nb$reset\nReplacement characters (\\\xef\\\xbf\\\xbd): $gray$rc$reset\n"
 
printf "_ _ _ _ _ _ _ _ _ _ _ \n"
 
printf "\nChecking now for gremlin control characters, please wait...\n"
 
awk 'BEGIN {FS=""; for (n=0;n<256;n++) ord[sprintf("%c",n)]=n; \
list="\x00|\x01|\x02|\x03|\x04|\x05|\x06|\x07|\x08|\x0b|\x0c|\x0e|\x0f|" \
"\x10|\x11|\x12|\x13|\x14|\x15|\x16|\x17|\x18|\x19|\x1a|\x1b|\x1c|\x1d|\x1e|\x1f|\x7f|" \
"\xc2\x80|\xc2\x81|\xc2\x82|\xc2\x83|\xc2\x84|\xc2\x85|\xc2\x86|\xc2\x87|" \
"\xc2\x88|\xc2\x89|\xc2\x8a|\xc2\x8b|\xc2\x8c|\xc2\x8d|\xc2\x8e|\xc2\x8f|" \
"\xc2\x90|\xc2\x91|\xc2\x92|\xc2\x93|\xc2\x94|\xc2\x95|\xc2\x96|\xc2\x97|" \
"\xc2\x98|\xc2\x99|\xc2\x9a|\xc2\x9b|\xc2\x9c|\xc2\x9d|\xc2\x9e|\xc2\x9f"} \
{if ($0 ~ list) \
{for (i=1;i<=NF;i++) if ($i ~ /[^[:graph:]]/ && $i !~ /[[:blank:]]/ && $i !~ /\r/) {b[$i]++}}} \
END {for (j in b) printf("%s\t%02x\n", b[j],ord[j])}' "$1" > /tmp/list
 
echo
 
if [ -s /tmp/list ]; then
awk -v GRAY="$gray" -v RESET="$reset" 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1]=$2;next} {print a[$2]" (hex "$2"): " ,GRAY$1RESET}' ~/scripts/chars /tmp/list
else
printf "No gremlin control characters found\n\n"
fi
 
echo
 
rm /tmp/list
 
exit