
AWK

Many of the recipes here rely on AWK, an interpreted programming language first developed in the 1970s by Alfred Aho, Peter Weinberger and Brian Kernighan. AWK is designed to process text files consisting of records broken up into fields, such as data tables. The version included in many current Linux distributions is 'gawk 4' (e.g. GNU AWK 4.1.1, 2014), and that's the AWK version used in this cookbook.

AWK is powerful and elegant. It's powerful because it can do so many different processing jobs, and it's elegant because you don't need to write much to tell AWK how to do its job.
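
As a minimal sketch of the record-and-field model, the commands below build a small tab-separated table and then process it with two short AWK programs. The file name demo_table.tsv and its contents are made up for illustration.

```shell
# Build a small tab-separated table (hypothetical sample data)
printf 'species\tcount\nwren\t3\nrobin\t5\nwren\t2\n' > demo_table.tsv

# Each line is a record; -F'\t' makes the tab character the field separator.
# Print the first field of every record after the header line.
awk -F'\t' 'NR > 1 {print $1}' demo_table.tsv

# Add up the second field over all data records and print the total at the end
total=$(awk -F'\t' 'NR > 1 {sum += $2} END {print sum}' demo_table.tsv)
echo "$total"
```

The first command prints the three species names, one per line, and the second prints 10.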

For an introduction to AWK I recommend these online articles by Dan Robbins:
   Awk by example, Part 1
   Awk by example, Part 2
   Awk by example, Part 3
 
and this one-stop tutorial:
   AWK [by Bruce Barnett]

There are many introductions to AWK arrays on the Web, but none I've seen are both comprehensive and elementary enough for an AWK beginner. One of my own efforts is here. Clear explanations of GNU AWK 4 arrays are in a manual written by the chief GNU AWK 4 developer, available as a website, a free PDF and a non-free printed book. Another good way to learn AWK arrays is to see how they're used to solve specific problems on websites like Stack Overflow.
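
For a first taste of AWK arrays, here is a minimal sketch that tallies how often each value appears in a one-column file. The file name demo_names.txt, the array name 'count' and the data are all made up for illustration.

```shell
# Hypothetical sample data: one name per line, with repeats
printf 'wren\nrobin\nwren\nfinch\nwren\n' > demo_names.txt

# count[] is an associative array indexed by the text in field 1.
# AWK creates each element the first time it is used, starting from zero.
# The order of 'for (name in count)' is not guaranteed, so the output is sorted.
counts=$(awk '{count[$1]++} END {for (name in count) print name, count[name]}' demo_names.txt | sort)
echo "$counts"
```

This should print 'finch 1', 'robin 1' and 'wren 3' on separate lines.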



sed

GNU sed (version 4.2.2 is used here) is a command-line stream editor that, like AWK, processes a file line by line. This cookbook mainly uses sed for data-cleaning recipes.

A few online resources for sed are good for beginners. The first 3 are again by Dan Robbins:
   Sed by example, Part 1
   Sed by example, Part 2
   Sed by example, Part 3
 
   The SED FAQ
   Unix - Regular Expressions with SED

Almost uniquely among command-line tools, sed has an in-place option, '-i'. With '-i', sed writes its changes back to the file you give it, instead of sending the result to a new file and leaving the old file untouched. Here I replace all instances of 'aaa' in file1 with 'bbb', first by generating a new file, then by replacing in-place:

$ sed 's/aaa/bbb/g' file1 > new_file1
 
$ sed -i 's/aaa/bbb/g' file1

The '-i' option can be dangerous if you haven't followed data-cleaning rule no. 1! However, you can create a backup of a file at the same time that you modify it, by adding a suffix directly after the '-i', with no space in between. sed saves the original file under its old name plus that suffix:

$ ls
file2
 
$ sed -i_old 's/aaa/bbb/g' file2
 
$ ls
file2 file2_old
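
To check what ends up in each file, here is a self-contained sketch of the same backup step. The file name demo_file2 and its contents are made up for illustration.

```shell
# Hypothetical sample file containing the text to be replaced
printf 'aaa bbb\nccc aaa\n' > demo_file2

# Edit in place; the untouched original is saved as demo_file2_old
sed -i_old 's/aaa/bbb/g' demo_file2

# The edited file has the replacements, the backup keeps the original text
edited=$(cat demo_file2)
backup=$(cat demo_file2_old)
echo "$edited"
echo "$backup"
```

The first echo shows 'bbb bbb' and 'ccc bbb'; the second shows the original lines, 'aaa bbb' and 'ccc aaa'.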

On this website I show sed commands without the '-i' option, but if you're confident you know what the result will be, you're welcome to use the recipe with '-i'. 



Regular expressions

Regular expressions, or 'regex', are sometimes complicated and confusing, but there are some excellent explainers online. The best and most complete I've seen is by Jan Goyvaerts:
   Regular-Expressions.info

Less complete but somewhat easier introductions are
   Regular Expressions - User Guide
   Simple RegEx Tutorial

and there's a clever try-it-yourself regex builder at
   RegExr
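
As a small illustration of regex in a data-cleaning job, the sketch below uses sed with extended regular expressions ('-E') to tidy whitespace. The sample lines are made up, and the three substitutions are just one possible way to do this.

```shell
# Hypothetical messy input: leading, trailing and repeated spaces
cleaned=$(printf '  wren   3 \nrobin 5\n' |
  sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//; s/[[:space:]]+/ /g')
# ^[[:space:]]+   one or more whitespace characters at the start of the line
# [[:space:]]+$   one or more whitespace characters at the end of the line
# [[:space:]]+    any remaining run of whitespace, replaced by a single space
echo "$cleaned"
```

The result is 'wren 3' and 'robin 5', each on its own line.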