Gremlin characters

When cleaning UTF-8 text files I sometimes come across invisible characters that I call 'gremlins'. These aren't the usual non-printing characters, like space and (horizontal) tab, which I expect to find in the plain text files I work with. Gremlins are weird things like 'vertical tab', 'device control 2' and 'soft hyphen'.

Unless you deliberately hunt for gremlins, you can easily miss them. I use two BASH scripts to do my hunting. The first, gremlins1, checks for soft hyphens and non-breaking spaces, as well as replacement characters and Windows carriage returns.

A soft hyphen is placed at a preferred breakpoint in a word, so that programs like Web browsers know where to insert a visible hyphen when splitting the word over two successive lines. In a terminal a soft hyphen appears as a single whitespace, but it's invisible in a text editor. Soft hyphens aren't greppable as control characters or as non-printing characters (with [^[:print:]]). They can be found with a search for their hex value, but they remain invisible until the output is highlighted:

soft hyphen

Another visibility trick is to pass the text through sed -n 'l', which reveals soft hyphens as octal 302 255 (hex c2 ad):

$ echo "Entomologicheskoe Obozrenie 61: 569-583" | sed -n 'l'
Entomologicheskoe Obozrenie 61: 569\302\255-583$
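
A minimal sketch for reproducing this without a ready-made file. The helper name reveal and the sample string are assumptions made for illustration. printf writes the two bytes of a soft hyphen from their octal values, and LC_ALL=C asks sed to treat its input as raw bytes:

```shell
# reveal: show non-printing bytes as octal escapes (hypothetical helper name)
reveal() {
    LC_ALL=C sed -n 'l' "$@"
}

# Build a test string with a soft hyphen (octal 302 255) before the hyphen
printf 'Obozrenie 61: 569\302\255-583\n' | reveal
```

The output should show the same \302\255 escape as in the example above, followed by the $ that sed uses to mark the end of the line.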

A non-breaking space is a character that prevents a line break from happening at its position. Like a soft hyphen, a non-breaking space appears as a single whitespace in a terminal and is invisible in a text editor. Also like soft hyphens, non-breaking spaces can be grepped by their hex code and revealed by highlighting or the sed -n 'l' trick:

nbsp
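
The -P option needs a grep built with Perl-regex support. Where that isn't available, one alternative is to search for the raw UTF-8 bytes instead. The sketch below is that byte-level variant, not the method used in the scripts in this post; the sample file and the function name count_seq are assumptions:

```shell
# Hypothetical sample file: one clean line, one line with a soft hyphen,
# one line with a non-breaking space
sample=$(mktemp)
printf 'clean line\nsoft\302\255hyphen\nnon\302\240breaking\n' > "$sample"

# count_seq FILE BYTES: count the lines containing the raw byte sequence BYTES
count_seq() {
    LC_ALL=C grep -c -F -e "$2" "$1"
}

shy=$(printf '\302\255')    # soft hyphen, hex c2 ad
nbsp=$(printf '\302\240')   # non-breaking space, hex c2 a0

echo "soft hyphens: $(count_seq "$sample" "$shy")"
echo "non-breaking spaces: $(count_seq "$sample" "$nbsp")"
```

Each count is a number of lines, not a number of characters, which is also what grep -c reports.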

My gremlins1 script:

#!/bin/bash
 
wincr=$(grep -cP "\r" "$1")
softhyph=$(grep -cP "\xad" "$1")
nonbrsp=$(grep -cP "\xa0" "$1")
replchars=$(awk '/\xef\xbf\xbd/ {repcha++} \
END {print (repcha == "" ? 0 : repcha)}' "$1")
 
echo -e \
"Found lines with:\nWindows carriage returns: $wincr\nsoft hyphens: $softhyph\nnon-breaking spaces: $nonbrsp\nreplacement characters: $replchars"
 
exit

And here it is at work:

gremlins1

The second script, gremlins2, uses a bit of AWK voodoo to tally the invisible characters other than space, horizontal tab, soft hyphen and non-breaking space, together with their sorted hexadecimal values:

#!/bin/bash
 
awk 'BEGIN {FS="";for (n=0;n<256;n++) ord[sprintf("%c",n)]=n} \
{for (i=1;i<=NF;i++) if ($i ~ /[^[:graph:]]/ && $i !~ /\t/ && $i !~ / /) {arr[$i]++}} \
END {for (j in arr) printf("%s\t%x\n", arr[j],ord[j])}' "$1" | sort -t $'\t' -nk2
 
exit

gremlins2

In the real-world example above, the script has found 11 "index" characters (hex 84), 9 "next line" (85), 298 "character tab with justification" (89), 104 "private use 1" (91), 191 "private use 2" (92), 83 "set transmit state" (93), 87 "cancel character" (94), 418 "start of protected area" (96) and 1 "start of string" (98).
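
As a cross-check, a similar tally can be made on bytes rather than characters. This is a different technique from gremlins2, sketched here under assumptions: the function name byte_tally and the sample file are hypothetical. It deletes tab, newline and printable ASCII, then counts whatever bytes are left:

```shell
# byte_tally FILE: count each byte that is not tab, newline or printable ASCII
byte_tally() {
    LC_ALL=C tr -d '\11\12\40-\176' < "$1" |
        od -An -v -tx1 |          # one two-digit hex value per byte
        tr -s ' ' '\n' |          # one value per line
        grep . |                  # drop empty lines
        LC_ALL=C sort | uniq -c |
        awk '{print $1 "\t" $2}'  # count, tab, hex byte
}

# Hypothetical sample: two "next line" characters (c2 85), one record separator (1e)
sample=$(mktemp)
printf 'a\302\205b\302\205\036c\n' > "$sample"
byte_tally "$sample"
```

Because the tally is by byte, a two-byte gremlin such as "next line" appears as separate counts for c2 and 85, and the bytes of legitimate non-ASCII letters are counted too.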

For my own use I've built an "explainer" table which labels each gremlin according to its hex value. The labels are extracted from the explainer table after the main part of gremlins2 has done its tallying:

#!/bin/bash
 
awk 'BEGIN {FS="";for (n=0;n<256;n++) ord[sprintf("%c",n)]=n} \
{for (i=1;i<=NF;i++) if ($i ~ /[^[:graph:]]/ && $i !~ /\t/ && $i !~ / /) {arr[$i]++}} \
END {for (j in arr) printf("%s\t%x\n", arr[j],ord[j])}' "$1" | sort -t $'\t' -nk2 > /tmp/list

awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1]=$2;next} \
{print $1,$2,a[$2]}' [path to explainer file] /tmp/list
 
rm /tmp/list
 
exit

Another real-world example:

explainer
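
The lookup step expects the explainer table to have the hex value in its first tab-separated field and the label in its second. A minimal sketch of that step on its own, with a hypothetical three-line explainer file and a hypothetical function name label_tally (the real table's contents are not shown in this post):

```shell
# Hypothetical explainer file: hex value, tab, label
explainer=$(mktemp)
printf '84\tindex\n85\tnext line\n98\tstart of string\n' > "$explainer"

# label_tally EXPLAINER: read "count<TAB>hex" lines on stdin, append the label
label_tally() {
    awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1]=$2; next} \
    {print $1, $2, a[$2]}' "$1" -
}

# Two tally lines in the format produced by gremlins2
printf '11\t84\n9\t85\n' | label_tally "$explainer"
```

A hex value missing from the explainer file simply gets an empty third field.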

For some purposes, gremlins can be destroyed globally with tr or sed, using the gremlin hex values. A more targeted approach may be called for, as in line 1066231 of taxa.txt (see image below), where a space and a record separator (hex 1e) have taken the place of a hyphen in the name 'Schmid-Eggr'. The right approach here is to replace the space and record separator with a hyphen.

gremlin hunt
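
Both approaches can be sketched as small filters. The function names are assumptions, and the gremlin bytes are written with printf octal escapes so that the commands don't depend on sed understanding hex escapes:

```shell
# Global: delete every record separator (hex 1e, octal 036)
strip_rs() {
    LC_ALL=C tr -d '\036'
}

# Global, two-byte gremlin: delete every "next line" (hex c2 85)
strip_nel() {
    nel=$(printf '\302\205')
    LC_ALL=C sed "s/${nel}//g"
}

# Targeted: replace "space + record separator" with a hyphen
fix_rs_hyphen() {
    rs=$(printf '\036')
    LC_ALL=C sed "s/ ${rs}/-/g"
}

printf 'Schmid \036Eggr\n' | fix_rs_hyphen
```

Note that tr works on single bytes, so it suits gremlins like the record separator. For a two-byte gremlin, deleting c2 with tr would also damage other characters that share that byte, which is why strip_nel uses sed on the whole sequence.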

Having detected gremlins with the gremlins2 script, I find their line number and context using a function similar to the one I use for question marks. The function is called gremhunt and takes as its two arguments the filename and the hex value of the gremlin:

gremhunt()
{
grep -noP "[^[:blank:]]*\x$2[^[:blank:]]*" "$1"
}

gremhunt

You can isolate the field containing the gremlin with a variation on the gremhunt command:

grep -noP "[^\t]*\xhex value[^\t]*" table

gremhunt2

You can also use non-Perl grep to search for the characters as their hex values, but the syntax is a little complicated:

grep --color=always "["$'\xhex value of character'"]" table

Here I'm looking for the first line with a replacement character, hex ef bf bd, in the file sam1:

syntax
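
The non-Perl search can be wrapped in a function so that the quoting only has to be written once. This is a sketch under assumptions, not the gremhunt function above: the name gremhunt_bytes and the test file are hypothetical, and the gremlin is given as one or more hex byte values (for example 1e, or c2 85):

```shell
# gremhunt_bytes FILE HEX...: print line number and surrounding "word" for a
# gremlin given as one or more hex byte values
gremhunt_bytes() {
    file=$1
    shift
    pat=''
    for hex in "$@"; do
        # hex value -> three-digit octal -> raw byte
        pat="$pat$(printf "\\$(printf '%03o' "0x$hex")")"
    done
    # -a: treat the file as text; LC_ALL=C: match byte by byte
    LC_ALL=C grep -a -n -o "[^[:blank:]]*${pat}[^[:blank:]]*" "$file"
}

# Hypothetical test file: record separator on line 2, "next line" on line 3
table=$(mktemp)
printf 'clean line\nSchmid \036Eggr 1999\nfoo\302\205bar baz\n' > "$table"

gremhunt_bytes "$table" 1e
gremhunt_bytes "$table" c2 85
```

The sketch assumes the bytes searched for are not regex metacharacters, which holds for control characters and for bytes above 7f.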

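Gremlins can also be found and manipulated using AWK, but for many gremlins the hex value returned by gremlins2 needs to be preceded by "c2". This is because characters with hex values from 80 to bf are stored in UTF-8 as two bytes: c2 followed by the hex value. For example: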

$ awk -F"\t" '{for (i=1;i<=NF;i++) \
{if ($i ~ /\xc2\xhex value of character/) \
{print "line "NR", field "i": \n"$i}}}' table

AWKgrem
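
Where the "c2" comes from can be checked with a little shell arithmetic. The sketch below (the function name utf8_bytes is an assumption) takes a hex value as reported by gremlins2 and prints the bytes UTF-8 uses for that character. It covers values up to 7ff, which is more than gremlins2 reports:

```shell
# utf8_bytes HEX: print the UTF-8 byte sequence for a character value up to 7ff
utf8_bytes() {
    cp=$((0x$1))
    if [ "$cp" -lt 128 ]; then
        # below hex 80: a single byte, same as the value
        printf '%02x\n' "$cp"
    else
        # hex 80 to 7ff: lead byte 110xxxxx, continuation byte 10xxxxxx
        printf '%02x %02x\n' $((0xc0 | (cp >> 6))) $((0x80 | (cp & 0x3f)))
    fi
}

utf8_bytes 1e   # record separator: 1e
utf8_bytes 85   # next line: c2 85
utf8_bytes ad   # soft hyphen: c2 ad
```

Values from c0 to ff get c3 as their lead byte instead, so the "c2" prefix only applies to the range 80 to bf.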

For more on gremlin characters, see Control characters in ASCII and Unicode, produced by the Finnish software company Aivosto Oy, and the wonderful Graphemica website.