Character encoding - 1

Is it UTF-8?

Data that are going to be shared should be in Unicode, encoded as UTF-8. For that reason, data cleaning should be done on a system with a UTF-8 locale, and the cleaned data should be saved as UTF-8.

To check the encoding of a data table, use file -i or uchardet:

$ file -i table1
table1: text/plain; charset=utf-8
 
$ uchardet table1
UTF-8
 
--------
 
$ file -i table2
table2: text/plain; charset=iso-8859-1
 
$ uchardet table2
windows-1252
 
--------
 
$ file -i table3
table3: text/plain; charset=us-ascii
 
$ uchardet table3
ascii/unknown

table1 is fine, and table2 will probably convert cleanly to UTF-8 (but see below). table3 may have mixed character encodings, so there may be conversion problems.

Note that the output of file -i or uchardet is only a best guess. Convert to UTF-8 (see next section) if you suspect that the character encoding of the data table might be something other than UTF-8.
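Because both programs only guess, a stricter check is to ask iconv to read the file as UTF-8 and see whether it complains. The following is a minimal sketch, not part of the original recipe; the function names is_utf8 and report_utf8 are made up for illustration. Note that a pure ASCII file will also pass, because ASCII bytes are valid UTF-8.

```shell
# is_utf8 and report_utf8 are hypothetical helper names, not standard commands.

# Ask iconv to read the file as UTF-8 and throw the output away.
# iconv exits with a non-zero status if it meets a byte sequence
# that is not valid UTF-8 (or if it cannot read the file).
is_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

# Print a one-line verdict for each file named on the command line.
report_utf8() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "$f: cannot read"
        elif is_utf8 "$f"; then
            echo "$f: valid UTF-8"
        else
            echo "$f: not valid UTF-8"
        fi
    done
}
```

With the functions loaded in the shell, the three example tables could be checked with: report_utf8 table1 table2 table3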


Converting to UTF-8

The best all-purpose conversion tool is iconv:

$ iconv -f windows-1252 -t utf-8 < table2 > table2-utf8

If there are uninterpretable byte sequences in the data, iconv will stop and print an error message:

$ iconv -f windows-1252 -t utf-8 < table2 > table2-utf8
iconv: illegal input sequence at position 661187
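In a script it can help to test iconv's exit status, so that an incomplete output file is not mistaken for a finished conversion. This is only a sketch: the function name to_utf8 and its argument order are assumptions made for illustration.

```shell
# to_utf8 is a hypothetical wrapper, not a standard command.
# Usage: to_utf8 SOURCE_ENCODING INFILE OUTFILE
to_utf8() {
    if iconv -f "$1" -t utf-8 < "$2" > "$3"; then
        echo "converted $2 to $3"
    else
        # iconv has already printed its own error message on stderr;
        # the output file holds only what was converted before the failure.
        echo "conversion of $2 failed; $3 is incomplete" >&2
        return 1
    fi
}
```

For example: to_utf8 windows-1252 table2 table2-utf8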

The best approach to this problem is to look at the last line of the converted table, which ends at the point where iconv stopped. The character that should follow the last one on that line, and is missing from the output, must be the "illegal" one. It can be replaced with another character once its byte sequence is known; for details see this BASHing data blog post.
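Another way to see the offending bytes is to print them in hexadecimal straight from the original file, using the position in the error message. The sketch below assumes that iconv reports a zero-based byte offset into the input (this appears to be the case for glibc iconv, but check against your own version); show_bytes_at is a made-up helper name.

```shell
# show_bytes_at is a hypothetical helper, not a standard command.
# Usage: show_bytes_at FILE OFFSET [COUNT]
# Prints COUNT bytes (default 4) of FILE in hex, starting at the
# zero-based byte offset OFFSET.
show_bytes_at() {
    # tail -c +N starts output at byte N, counting from 1,
    # so a zero-based offset needs 1 added to it.
    tail -c +"$(( $2 + 1 ))" "$1" | head -c "${3:-4}" | od -An -tx1
}
```

For the error shown above, the call would be: show_bytes_at table2 661187. If the first byte printed looks like an ordinary character, try the neighbouring offsets in case the offset assumption does not hold for your iconv.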

If there are a number of uninterpretable characters in the data table, this procedure will need to be repeated until all 'illegal input sequences' have been replaced. Sometimes the original, "legal" characters will be obvious from context or from external data sources, but it may be necessary to check with the data manager or compiler.

Note that ASCII is a subset of UTF-8: every ASCII file is already valid UTF-8. If you use iconv to convert an "ascii" or "us-ascii" file to UTF-8, the file will be unchanged, and file -i and uchardet will still report "ascii". However, not all files that test as "ascii" or "us-ascii" are genuinely ASCII; they may contain isolated non-ASCII characters.
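One way to look for such stray characters is to search for any byte outside the ASCII range. The sketch below assumes GNU grep built with support for the -P option; find_non_ascii is a made-up helper name.

```shell
# find_non_ascii is a hypothetical helper, not a standard command.
# Usage: find_non_ascii FILE
# Prints each line of FILE that contains at least one byte outside
# the ASCII range (0x00 to 0x7F), prefixed with its line number.
# LC_ALL=C makes grep treat the input as single bytes, not characters.
find_non_ascii() {
    LC_ALL=C grep -n -P '[^\x00-\x7F]' "$1"
}
```

For example: find_non_ascii table3. No output means that every byte in the file is ASCII.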