
Character encoding - 1

Is it UTF-8?

If data are going to be shared, they should be Unicode text encoded as UTF-8. For that reason, data cleaning should be done on a system with a UTF-8 locale, and the cleaned data should be saved as UTF-8.

To check the encoding of a data table, use file -i or uchardet:

$ file -i table1
table1: text/plain; charset=utf-8
 
$ uchardet table1
UTF-8
 
--------
 
$ file -i table2
table2: text/plain; charset=iso-8859-1
 
$ uchardet table2
windows-1252
 
--------
 
$ file -i table3
table3: text/plain; charset=us-ascii
 
$ uchardet table3
ascii/unknown

table1 is already UTF-8, and table2 will probably convert cleanly to UTF-8 (but see below). table3 may contain a mix of character encodings, so conversion problems are possible.

Note that the output of file -i and uchardet is only a best guess. If you suspect that the character encoding of a data table might be something other than UTF-8, convert it to UTF-8 (see next section).
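Because both tools only guess, a more direct check can help. The sketch below uses a hypothetical two-line file called demo (not one of the tables above). It asks iconv to read the file as UTF-8, which fails if any byte sequence is not valid UTF-8, and then counts the lines containing bytes other than printable ASCII and whitespace. It assumes an iconv and grep that behave like the GNU versions:

```shell
# Hypothetical demo file: the second line contains the single byte 0xF6
# (o-umlaut in ISO-8859-1 / windows-1252), which is not valid UTF-8.
tmp=$(mktemp -d)
printf 'Helmut\tKarl\tKiel\nHelmut\tKarl\tPl\366n\n' > "$tmp/demo"

# Reading the file as UTF-8 fails (non-zero exit) on any invalid sequence.
if iconv -f utf-8 -t utf-8 < "$tmp/demo" > /dev/null 2>&1; then
    utf8_ok=yes
else
    utf8_ok=no
fi
echo "valid UTF-8: $utf8_ok"

# In the C locale every byte is one character, so this counts lines that
# contain anything other than printable ASCII and whitespace.
suspect_lines=$(LC_ALL=C grep -c '[^[:print:][:space:]]' "$tmp/demo")
echo "lines with non-ASCII or control bytes: $suspect_lines"

rm -r "$tmp"
```

A file that passes the first check and has a count of zero in the second is plain ASCII, which is already valid UTF-8.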


Converting to UTF-8

The best all-purpose conversion tool is iconv:

$ iconv -f windows-1252 -t utf-8 < table2 > table2-utf8
 
$ iconv -f us-ascii -t utf-8 < table3 > table3-utf8

If there are uninterpretable byte sequences in the data, iconv will stop at the first one with an error message:

$ iconv -f windows-1252 -t utf-8 < table2 > table2-utf8
iconv: illegal input sequence at position 661187

Find out where the 'illegal input sequence' is by looking at the last line of the converted file; the next (missing) character is the problem:

$ tail -n 1 table2-utf8
Helmut   Karl   Pl
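The position in the error message can also be used directly. This is a minimal sketch, assuming the number is a zero-based byte offset into the input (which is how glibc iconv appears to report it; other implementations word the message differently). It uses a hypothetical demo file in which the bad byte sits at offset 30:

```shell
# Hypothetical demo file: byte 0x81 (assumed to be undefined in
# windows-1252 for the local iconv) is at zero-based offset 30, on line 2.
tmp=$(mktemp -d)
printf 'Anna\tMaria\tKiel\nHelmut\tKarl\tPl\201n\n' > "$tmp/demo"
pos=30    # assumed to come from "illegal input sequence at position 30"

# Newlines before the offset = complete lines before the problem line.
line_no=$(( $(head -c "$pos" "$tmp/demo" | wc -l) + 1 ))

# tail -c +N starts at the Nth byte counting from 1, hence pos + 1.
bad_byte=$(tail -c +"$((pos + 1))" "$tmp/demo" | head -c 1 | od -An -tx1 | tr -d ' \n')

echo "problem on line $line_no, byte 0x$bad_byte"
rm -r "$tmp"
```

Knowing the byte value can help when guessing which character was intended.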

Find the problem line in the unconverted file and temporarily replace the problematic character with a placeholder like '@'. Re-do the conversion, then replace the placeholder in the converted file with the character that was originally intended:

$ grep -n "Helmut   Karl   Pl" table2
1243: Helmut   Karl   Pl�n   farmer
 
$ awk 'BEGIN {FS=OFS="\t"} NR==1243 {$3="Pl@n"} 1' table2 > temp
 
$ iconv -f windows-1252 -t utf-8 < temp > table2-utf8
$
 
$ sed -i '1243s/Pl@n/Plön/' table2-utf8

If the data table contains several uninterpretable characters, this procedure will need to be repeated until every 'illegal input sequence' has been replaced. Sometimes the original, uncorrupted characters will be obvious from context or from external data sources, but it may be necessary to check with the data manager or compiler.
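When many characters are affected, it may save time to list every problem line before fixing any of them. The loop below is a sketch under assumptions: it uses a hypothetical demo file, and it converts one line at a time, which is slow on large tables but reports all failing line numbers in a single pass. It relies only on iconv returning a non-zero exit status when a line cannot be converted, and assumes that bytes 0x81 and 0x8D are rejected as windows-1252 input (as glibc iconv is believed to do):

```shell
# Hypothetical demo file: lines 2 and 3 each contain a byte that is
# assumed to be undefined in windows-1252 (0x81 and 0x8D).
tmp=$(mktemp -d)
printf 'Anna\tKiel\nKarl\tPl\201n\nEva\tK\215ln\nUwe\tBonn\n' > "$tmp/demo"

bad_lines=""
n=0
while IFS= read -r line; do
    n=$((n + 1))
    # Convert this line on its own; a non-zero exit marks it as a problem.
    if ! printf '%s\n' "$line" | iconv -f windows-1252 -t utf-8 > /dev/null 2>&1; then
        bad_lines="$bad_lines $n"
    fi
done < "$tmp/demo"

echo "lines that fail conversion:$bad_lines"
rm -r "$tmp"
```

Each reported line can then be given a placeholder as described above before the whole table is converted in one run.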