Windows carriage returns

On Linux and Mac machines, a newline is built with just one character, the UNIX linefeed '\n' ('LF'). On Windows computers, a newline is created using two characters, one after the other: '\r\n' ('CRLF'), where '\r' is called a 'carriage return' ('CR'). Carriage returns aren't necessary in a data table and can cause problems in data cleaning.
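To see the difference between the two line endings at the byte level, here is a minimal sketch that builds the same one-line record with each ending and dumps it with od. The variable names are made up for illustration, and the exact spacing of od's output can vary between systems.

```shell
# Build the same tab-separated line with a UNIX ending and a Windows ending,
# then dump the bytes. od -c prints a carriage return as \r and a linefeed as \n.
unix_bytes=$(printf 'aaa\tbbb\n' | od -An -c)
win_bytes=$(printf 'aaa\tbbb\r\n' | od -An -c)

echo "LF:   $unix_bytes"
echo "CRLF: $win_bytes"
```

The CRLF dump should show an extra \r just before the final \n.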

There are several ways to find CR characters. You can use sed -n 'l' to visualise any '\r' in a table, and grep to select out the lines with a CR and print their line numbers. Alternatively, a CR character will be shown as '^M' if you use cat -v, where the '-v' option shows non-printing characters other than tabs and linefeeds. In the example below, the file winCR has an invisible Windows carriage return at the end of the first line:

$ cat winCR
aaa   bbb
ccc   ddd
eee   fff
 
$ sed -n 'l' winCR
aaa\tbbb\r$
ccc\tddd$
eee\tfff$
 
$ sed -n 'l' winCR | grep -n '\\r'
1:aaa\tbbb\r$
 
$ cat -v winCR | grep -n "\^M"
1:aaa   bbb^M

It's wise to run these commands with grep's '-c' option first rather than '-n'. The '-c' option returns only the number of lines with a CR, and if that number is big, you avoid having a large number of lines printed at high speed in your terminal. If your grep supports Perl-type regular expressions (the '-P' option), you can count the lines containing '\r' directly.

$ sed -n 'l' winCR | grep -c '\\r'
1
 
$ cat -v winCR | grep -c "\^M"
1
 
$ grep -cP "\r" winCR
1
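Note that grep -c counts lines, not characters, so a line with several CRs is counted once. If you want the total number of CR characters, one possible approach (a sketch, not the only way) is to delete everything except CRs with tr and count the bytes that remain. The sample file below is hypothetical and is built on the spot. Its second line has a CR inside a data item as well as one at the end.

```shell
# Hypothetical sample: line 1 ends in CRLF, line 2 has an embedded CR
# plus a CRLF ending, line 3 has a plain UNIX ending.
sample=$(mktemp)
printf 'aaa\tbbb\r\nccc\rddd\tzzz\r\neee\tfff\n' > "$sample"

# Lines containing at least one CR. The pattern is a literal CR character
# produced by printf, so this does not depend on grep -P.
cr_lines=$(grep -c "$(printf '\r')" "$sample")

# Total CR characters: keep only CRs (-c complements the set, -d deletes),
# then count the bytes. tr -d ' ' strips padding that some wc versions add.
cr_chars=$(tr -cd '\r' < "$sample" | wc -c | tr -d ' ')

echo "lines with CR: $cr_lines"
echo "CR characters: $cr_chars"
rm -f "$sample"
```

If the two numbers differ, at least one line has more than one CR, which suggests there are CRs inside data items.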

You can strip away everything except the line numbers from the grep -n result with a cut command, by specifying a colon as field delimiter for cut:

$ sed -n 'l' winCR | grep -n '\\r' | cut -d ':' -f1 > list_of_records_with_CR
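As an alternative to the grep and cut pipeline, awk can print the line numbers in one step, because its built-in NR variable holds the current record number. This is a sketch on a hypothetical three-line sample, not a replacement for the commands above.

```shell
# Hypothetical sample: lines 1 and 3 end in CRLF, line 2 ends in LF only.
sample=$(mktemp)
printf 'aaa\tbbb\r\nccc\tddd\neee\tfff\r\n' > "$sample"

# Print the record number (NR) of every line that contains a CR.
# awk regular expressions accept \r as an escape for carriage return.
cr_line_numbers=$(awk '/\r/ {print NR}' "$sample")

echo "$cr_line_numbers"
rm -f "$sample"
```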

The easiest way to remove all Windows carriage returns from a table is with tr:

$ tr -d '\r' < table > table_without_CR

Deleting all the carriage returns could be a mistake, however, if any of them are within data items. The screenshot below shows a real-world example. In the file afd1, I used sed to replace each of the 2 carriage returns in line 67893 with a single space. Note that this was an 'in-place' edit with sed's '-i' option.


[Screenshot: CR fix]
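The exact command used on afd1 isn't reproduced in the text here, so the following is only a sketch of the same idea on a small hypothetical file: strip CRs that end a line, then replace the CRs left inside one particular line with a space. The sample data and the line number 2 are assumptions for illustration. The sketch writes to standard output rather than editing in place, because the syntax of sed's '-i' option differs between GNU and BSD sed.

```shell
# Hypothetical stand-in for the real table: every line ends in CRLF,
# and line 2 also has two CRs inside a data item.
sample=$(mktemp)
printf 'id\tnote\r\n1\tfirst\rsecond\rthird\r\n2\tplain\r\n' > "$sample"

# A literal CR character, so the sed expressions work in any POSIX sed.
cr=$(printf '\r')

# First expression: delete a CR only where it ends a line.
# Second expression: on line 2 only, replace each remaining CR with a space.
fixed=$(sed -e "s/${cr}\$//" -e "2s/${cr}/ /g" "$sample")

printf '%s\n' "$fixed"

# Confirm that no CR characters remain in the result.
remaining=$(printf '%s\n' "$fixed" | tr -cd '\r' | wc -c | tr -d ' ')
echo "CR characters remaining: $remaining"
rm -f "$sample"
```

Restricting the second expression to a line number keeps the replacement from touching any other record, which is the safer choice when you have only inspected the lines you know about.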