Character encoding - 2

Uninterpretable characters

Sometimes a UTF-8 file contains characters that no program can interpret. Each of these appears on a webpage, in a terminal or in a text editor as the Unicode replacement character (code point U+FFFD; UTF-8 hex bytes ef bf bd), usually displayed as a polygon enclosing a contrasting question mark, like this: �
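Those three bytes can be generated directly in the shell with printf (a minimal check; octal escapes are used because they're more portable than hex ones):

```shell
# Emit the raw bytes of the replacement character (hex ef bf bd,
# octal 357 277 275). A UTF-8 terminal displays them as �.
printf '\357\277\275\n'
```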

A replacement character should be replaced with whatever character was originally there, not simply deleted. Discovering the original character may be difficult, but finding a replacement character by its hex value is easy with sed or AWK:

$ cat file
There's one in the next line
� is a replacement character
This line doesn't have one
$ sed -n '/\xef\xbf\xbd/p' file
� is a replacement character
$ awk '/\xef\xbf\xbd/ {print NR": "$0}' file
2: � is a replacement character

Replacement characters can also appear when non-UTF-8 files are viewed in a UTF-8 locale. In the example shown below, copy.csv is ISO-8859-1 encoded. Grepping in my UTF-8 locale for bytes that can't be decoded as characters finds 9850 of them in the ISO-8859-1 file, and each appears in my terminal as a replacement character. When the file is converted to UTF-8 with iconv, those undecodable bytes disappear, replaced by the Unicode equivalents of the ISO-8859-1 characters.
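A minimal sketch of that conversion (only the filename copy.csv comes from the example; the one-line file content is invented):

```shell
# Build a one-line ISO-8859-1 file: byte e9 (octal 351) is é in that
# encoding, but isn't valid UTF-8, so a UTF-8 terminal shows � there.
printf 'caf\351 au lait\n' > copy.csv

# GNU grep trick: -a treats the file as text, and -x -v '.*' prints
# just the lines that fail to decode in the current (UTF-8) locale.
grep -axv '.*' copy.csv

# Convert with iconv; the lone e9 byte becomes the two-byte UTF-8 é.
iconv -f ISO-8859-1 -t UTF-8 copy.csv > copy-utf8.csv
```

After conversion the same grep finds nothing to complain about in copy-utf8.csv.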


Mysterious question marks

Data items that have a complicated encoding history (like UTF-8 > windows-1252 > UTF-8) can accumulate question marks in place of original characters. The question marks were inserted by programs en route that couldn't interpret the encoding of the characters concerned. Since a question mark is a valid character in most encodings, once inserted it's always a question mark. Unlike some replacement characters (see above), it can't be converted back to an original character by changing the encoding. The original character is lost and can only be re-inserted after considering context, external sources or the advice of the data manager or custodian.
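One step of such a history can be reproduced with iconv (a sketch; é is just an example character):

```shell
# The UTF-8 encoding of é is the two bytes c3 a9. A program that
# misreads those bytes as windows-1252 sees the two characters "é",
# and re-encoding that as UTF-8 bakes the mistake in. A still less
# capable program in the chain substitutes "?" instead.
printf 'é' | iconv -f WINDOWS-1252 -t UTF-8
```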

I use a regular expression to find all the question marks in a data table, together with their immediate context. As shown below, grep extracts matches built as zero or more 'not space or tab' characters, followed by a literal question mark, followed by zero or more 'not space or tab' characters. The results shown are from a real-world data table.

$ grep -o "[^[:blank:]]*?[^[:blank:]]*" table | sort | uniq -c
# Examples only:
2 Abendroth?s
2 adresse?e
2 anÌ?os
4 ant?s
8 author?s
8 Beru?cksichtigung
2 brou?ích
19 d?Entomologie
2 ?echoslovenia
2 l?Algérie
2 O?Keefe
1 P?ísp?vek

Once found, the question marks can be replaced globally with sed, as in this 'ganged' command (see also the bulk replace pages):

$ sed 's/adresse?e/adressée/g;s/Beru?cksichtigung/Berücksichtigung/g' table
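Before running a ganged command on the full table, it can be checked on a small invented sample. Note that in sed's default basic regular expressions, ? is an ordinary character, so the patterns match the literal question marks:

```shell
# Two damaged words from the list above, in a throwaway sample file.
printf 'adresse?e\nBeru?cksichtigung\n' > sample
# Each s command repairs one word; ';' gangs them in a single pass.
sed 's/adresse?e/adressée/g;s/Beru?cksichtigung/Berücksichtigung/g' sample
```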

For routine use I've saved that grep command as an alias, qwords:

alias qwords='grep -o "[^[:blank:]]*?[^[:blank:]]*"'

Another time-saver is the following script, called qfinder, which finds each of the "words" with question marks in a table and saves them to a file called qlist, together with their line and field numbers. The script then works on qlist to generate some summary statistics.

awk -F"\t" '{for (i=1;i<=NF;i++) \
{if ($i ~ /[^[:blank:]]*\?[^[:blank:]]*/ && $i !~ /http/) \
{print NR,i,$i}}}' "$1" \
| awk '{for (j=3;j<=NF;j++) \
{if ($j ~ /[^[:blank:]]*\?[^[:blank:]]*/) \
{print $1,$2,$j}}}' > qlist
echo -e "The table \e[1;31m$1\e[0m has \e[1;31m$(cut -d' ' -f3- qlist | sort | uniq | wc -l)\e[0m different words with \e[1;36m?\e[0m on \e[1;31m$(cut -d' ' -f1 qlist | sort | uniq | wc -l)\e[0m lines"
declare -a labels=($(head -n1 "$1" | tr '\t' '\n'))
qflds=$(cut -d' ' -f2 qlist | sort -n | uniq)
echo -e "The \e[1;31m$(echo "$qflds" | wc -l)\e[0m fields containing \e[1;36m?\e[0m words are:"
for k in $(echo "$qflds"); do echo -e " field \e[1;31m$k\e[0m, \e[1;36m$(echo ${labels[$k-1]})\e[0m, with \e[1;31m$(awk -v fld="$k" '$2 == fld {print $3}' qlist | sort | uniq | wc -l)\e[0m different \e[1;36m?\e[0m words"; done
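The first awk stage of qfinder can be sketched on a tiny tab-separated table (the sample data here is invented, and the containing-a-"?" test is simplified to /\?/, which matches the same fields):

```shell
# A header line plus one record; only field 1 of line 2 has a "?".
printf 'name\tplace\nO?Keefe\tParis\n' > table
# For each field, print line number, field number and the field
# itself if it contains "?" and isn't part of a URL.
awk -F"\t" '{for (i=1;i<=NF;i++) \
{if ($i ~ /\?/ && $i !~ /http/) \
{print NR,i,$i}}}' table
```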


Detecting encoding fails

Replacement characters and mysterious question marks are signs of failed conversions, but there are others, and unfortunately these are much harder to find. I use a function I call graph to tally all the visible characters in the table. The function also finds invisible soft hyphens and non-breaking spaces, which are discussed on the gremlin characters page.

awk 'BEGIN {FS=""} \
{for (i=1;i<=NF;i++) if ($i ~ /[[:graph:]]/) {arr[$i]++}} \
END {for (j in arr) printf("%7s %s\n",arr[j],j)}' "$1" \
| sort -k2 | pr -t -4
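Run on a three-character sample, the tallying stage of graph behaves like this (a sketch; the pr pagination is omitted, and FS="" character-splitting is a gawk feature):

```shell
printf 'aab\n' > sample
# FS="" makes every character its own field; tally the graphical ones.
awk 'BEGIN {FS=""} \
{for (i=1;i<=NF;i++) if ($i ~ /[[:graph:]]/) {arr[$i]++}} \
END {for (j in arr) printf("%7s %s\n",arr[j],j)}' sample \
| sort -k2
```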

In the very large table sam1, the graph function found a few characters not expected in the table, like a copyright symbol (©).


Grepping for the copyright symbol turns up the odd combination ©".


This combination is almost certainly an encoding fail, possibly of an e with acute accent, é. A table of possibilities is on Tex Texin's very useful encoding website. What was the original character or characters? In this case, as in others, it's probably best to ask the data compiler.