
About this website

This is version 1 of a cookbook that will help you check whether a data table (defined on the data tables page) is properly structured and free from formatting errors, inconsistencies, duplicates and other data headaches.

All the data-auditing and -cleaning recipes on this website use GNU/Linux tools in a BASH shell and work on plain text files. If your data table is in a spreadsheet or a database, you first need to export the table as a text file, or copy/paste it into a text editor (explained on this page) and save the pasted text as a file.

Why go to all that trouble? Why not check your data table inside a spreadsheet or database? Because it's often much easier to find and fix errors in a plain text file. The GNU/Linux tools for processing plain text have up to 40 years of development behind them. The tools are fast, reliable, simple to use and easy to learn. They're particularly good at working with files too big (100 MB+) for editing in a spreadsheet.
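As a small illustration of the kind of check that is easy on plain text, here is a minimal sketch (not necessarily the recipe used elsewhere in this cookbook) that tallies how many tab-separated fields each line of a table has. The function name field_tally and the three-line sample table are made up for this example.

```shell
# Tally the number of tab-separated fields on each line.
# A consistently structured table produces a single tally line.
# "field_tally" is a hypothetical helper name, not a standard command.
field_tally() {
    awk -F'\t' '{print NF}' "$@" | sort -n | uniq -c
}

# Hypothetical sample table: the last record is missing one field
printf 'ID\tName\tYear\n1\tSmith\t1999\n2\tJones\n' | field_tally
```

For the sample above the tally has two lines: one line of the table has 2 fields and two lines have 3 fields, so the short record stands out straight away. The function also accepts file names as arguments in place of piped input.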

Command-line recipes on this website are shown as below. This is real text, not an image, so you can copy and paste commands to try them out. Please note, however, that the 'tabs' in the demonstration files aren't real tabs, as tabs don't exist in HTML. If you copy/paste a demonstration file into a text editor or terminal, you'll have 3 or 4 spaces between data items, not tabs. Note also that the < and > characters are represented on these pages by their HTML equivalents.

$ This is a command
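If you do paste a demonstration file and want real tabs back, one possible approach is to turn each run of 3 or more spaces into a single tab. This is only a sketch: it assumes GNU sed (which understands \t in the replacement) and assumes that no data item itself contains 3 or more consecutive spaces. The function name spaces_to_tabs and the sample line are made up for this example.

```shell
# Replace every run of 3 or more spaces with one tab character.
# Assumes GNU sed; "spaces_to_tabs" is a hypothetical helper name.
spaces_to_tabs() {
    sed -E 's/ {3,}/\t/g' "$@"
}

# Hypothetical pasted line with 3 or 4 spaces between data items
printf 'ID   Name    Year\n' | spaces_to_tabs
```

The function reads standard input when called without arguments, or you can give it a file name and redirect the result to a new file.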

The maximum width of a page in this cookbook is a bit over 900 pixels, so you can keep recipes open side-by-side with a terminal on a widescreen display.



About you

You know how to use the command line in a BASH shell, you know at least a little about regular expressions, and you may have used AWK and sed before.

I'm hoping you run Linux on your computer. If your computer is a Mac, you can run a BASH shell in the Terminal app in Applications. Most of the commands used in this cookbook are already there, but I strongly recommend that you install their GNU versions (e.g. with Homebrew), rather than use the BSD versions that come with a Mac. The GNU versions generally run faster than their BSD equivalents and usually have more features.
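To see which versions you currently have, a rough check is to look at the first line each tool prints for --version. The sketch below is a heuristic only: it assumes that GNU tools announce themselves with "(GNU ...)" or "GNU Awk" in that first line, and that non-GNU tools either print something else or reject the option. The function name is_gnu is made up for this example.

```shell
# Succeed if the first line of "TOOL --version" looks like a GNU banner.
# Heuristic: matches "(GNU " or a line starting with "GNU Awk".
# "is_gnu" is a hypothetical helper name.
is_gnu() {
    "$1" --version 2>/dev/null | head -n 1 | grep -E -q '\(GNU |^GNU Awk'
}

# Report on a few of the tools used in this cookbook
for tool in awk sed grep sort; do
    if is_gnu "$tool"; then
        echo "$tool: looks like the GNU version"
    else
        echo "$tool: not GNU, or not found"
    fi
done
```

The pattern is deliberately stricter than a plain search for "GNU", because some BSD tools describe themselves as "GNU compatible".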

If your computer runs Windows, you can try Cygwin, but dual-booting a Linux distribution is a better option. Microsoft announced in March 2016 that a BASH shell will run natively in upcoming editions of Windows 10, but writing this in June 2016 I can't vouch for the success of this idea.



About me

I'm a retired scientist and I've been mucking around with data tables for nearly 50 years. I started with printed columns on paper (and a calculator) before moving to spreadsheets and relational databases (Microsoft Access, FileMaker Pro, MySQL, SQLite). In 2012 I discovered the AWK language and realised that every processing job I'd ever done with data tables could be done faster and more simply on the command line. Since then my data tables have been stored as plain text and managed with GNU/Linux command-line tools, especially AWK.

If you find mistakes on this website or have suggestions for better recipes, please email me. You can also contact me directly if you have data that you would like audited or cleaned at commercial rates.

Robert Mesibov, West Ulverstone, Tasmania, Australia
robert (dot) mesibov (at) gmail (dot) com
Updated: 4 March 2017



About the banner

The webpage banner shows a detail from a painting by the 17th-century Flemish artist David Rijckaert III. I like the look of concentration on the alchemist's face as he studies a text before doing something with that flask in his right hand. Working with the command line isn't alchemy, but sometimes it seems like magic.



Legal stuff

The text and images on this website are my own work and are licensed under a Creative Commons license (attribution + non-commercial, CC BY-NC). You are welcome to use or copy the information and images on this website for non-commercial purposes, but please attribute that use to this source.

Please note that the software commands on this website are provided 'as is', without warranty of any kind, express or implied, including fitness for particular purposes. In no event shall the website author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software commands on this website.