Bulk replacement - 1

Replacing one item at a time with sed or AWK can be tedious, but there are ways to do replacements in bulk when cleaning data. One kind of bulk replacement could be called 'many to one' (below), where various items are all replaced with the same new item. 'Address-based' bulk replacement (next page) does the same replacement over and over, but only at specified line addresses (and fields, if needed).


Many to one

This is best done with sed. First construct a list of individual replacement commands from a list of the items to be replaced. I'll demonstrate with a tab-separated file built around the list of 12 variants on 'Arthur Mills Lea' from the formatting issues page:

$ cat file
Locality    Collector    Date
Burnie    A Lea (1917)    1917-08-19
Marion Bay    A M. Lea    1920-09-28
Yorktown    A. M. Lea    1914-11-09
Devonport    Arthur M. Lea    1915-04-03
Burnie    Lea, A M    1915-04-06
Launceston    Lea, A. M.    1914-11-08
Uraidla    Lea, A.M.    1920-06-15
Hahndorf    Lea,A.M.    1920-06-16
Morgan    Lea, Arthur M. - South Australian Museum    1920-00-00
Hobart    [Lea coll.]    1914-12-09
Melbourne    Lea, Mr Arthur Mills    1915-06-11
Hobart    R. M. Lea    1914-12-09

Build the list:

$ tail -n +2 file | cut -f2 > list
 
$ cat list
A Lea (1917)
A M. Lea
A. M. Lea
Arthur M. Lea
Lea, A M
Lea, A. M.
Lea, A.M.
Lea,A.M.
Lea, Arthur M. - South Australian Museum
[Lea coll.]
Lea, Mr Arthur Mills
R. M. Lea

To replace each of these variants with 'Arthur Mills Lea', it might seem that all we have to do is replace the beginning of each line in list with 's/' and the end of each line with '/Arthur Mills Lea/g'. (Note: although there is only one instance of each variant on each line in file, I've included the 'g' option to allow global replacement, as a model for other commands you might build.)

$ sed 's/^/s\//;s/$/\/Arthur Mills Lea\/g/' list
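As a quick check of what such generated commands do, here is a minimal sketch that runs two of them by hand on single lines copied from file. The names sample_plain and sample_bracket are made up for this illustration, and 'X' is a stand-in replacement chosen only to make each match visible.

```shell
# Two sample lines copied from file (tab-separated)
printf 'Burnie\tLea, A M\t1915-04-06\n' > sample_plain
printf 'Hobart\t[Lea coll.]\t1914-12-09\n' > sample_bracket

# A variant with no special characters is replaced as intended
sed 's/Lea, A M/Arthur Mills Lea/g' sample_plain

# Unescaped, [Lea coll.] is read as a bracket expression: any ONE of the
# characters L, e, a, space, c, o, l or '.' counts as a match, so each of
# those characters is replaced separately and the brackets are left alone
sed 's/[Lea coll.]/X/g' sample_bracket
```

The first command behaves as intended because 'Lea, A M' contains nothing that sed treats specially.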

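However, this particular list includes characters that have special meaning in regular expressions: '.', '[' and ']'. (The parentheses in 'A Lea (1917)' are literal in sed's default basic regular expressions, so they can stay as they are.) The special characters need to be escaped with additional sed substitutions: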

$ sed 's/^/s\//;s/$/\/Arthur Mills Lea\/g/;s/\./\\&/g;s/\[/\\&/g;s/\]/\\&/g' list
s/A Lea (1917)/Arthur Mills Lea/g
s/A M\. Lea/Arthur Mills Lea/g
s/A\. M\. Lea/Arthur Mills Lea/g
s/Arthur M\. Lea/Arthur Mills Lea/g
s/Lea, A M/Arthur Mills Lea/g
s/Lea, A\. M\./Arthur Mills Lea/g
s/Lea, A\.M\./Arthur Mills Lea/g
s/Lea,A\.M\./Arthur Mills Lea/g
s/Lea, Arthur M\. - South Australian Museum/Arthur Mills Lea/g
s/\[Lea coll\.\]/Arthur Mills Lea/g
s/Lea, Mr Arthur Mills/Arthur Mills Lea/g
s/R\. M\. Lea/Arthur Mills Lea/g
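The three escapes above cover this list, but other lists might contain further special characters such as '*' or the '/' delimiter itself. The following is a minimal sketch of a more general approach, assuming sed's default basic regular expressions: escape all the usual special characters in one substitution, and only then wrap each line. The files list_demo and cmds_demo and the second list entry are invented for the illustration.

```shell
# Hypothetical two-line list; the second entry adds '/' and '*' to show
# characters that the three separate escapes would not cover
printf '%s\n' '[Lea coll.]' 'A/M Lea*' > list_demo

# First command: put a backslash before each of ] [ \ / . * ^ $
# ('|' is the delimiter here so that '/' can sit in the bracket expression)
# Second and third commands: wrap the escaped line as s/.../Arthur Mills Lea/g
sed -e 's|[][\/.*^$]|\\&|g' \
    -e 's/^/s\//' \
    -e 's/$/\/Arthur Mills Lea\/g/' list_demo > cmds_demo

# Show the generated commands
cat cmds_demo

# Try the generated commands on a line containing both variants
printf 'A/M Lea*\t[Lea coll.]\n' | sed -e "$(cat cmds_demo)"
```

Here the escaping has to come first, because it also escapes '/', and the slashes added by the wrapping steps must stay unescaped. The sketch also assumes that the replacement text contains no '/', '&' or backslash, which is true of 'Arthur Mills Lea'.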

Finally, we pass the generated list of replacement commands to sed through command substitution, using sed's '-e' option, and give file as the argument:

$ sed -e "$(sed 's/^/s\//;s/$/\/Arthur Mills Lea\/g/;s/\./\\&/g;s/\[/\\&/g;s/\]/\\&/g' list)" file
Locality    Collector    Date
Burnie    Arthur Mills Lea    1917-08-19
Marion Bay    Arthur Mills Lea    1920-09-28
Yorktown    Arthur Mills Lea    1914-11-09
Devonport    Arthur Mills Lea    1915-04-03
Burnie    Arthur Mills Lea    1915-04-06
Launceston    Arthur Mills Lea    1914-11-08
Uraidla    Arthur Mills Lea    1920-06-15
Hahndorf    Arthur Mills Lea    1920-06-16
Morgan    Arthur Mills Lea    1920-00-00
Hobart    Arthur Mills Lea    1914-12-09
Melbourne    Arthur Mills Lea    1915-06-11
Hobart    Arthur Mills Lea    1914-12-09
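A variation that may be easier to review is to save the generated commands in a file and run them with sed's '-f' option, then check the result by tallying what is left in the Collector field. This is only a sketch on a three-row sample copied from file; the names demo.tsv, demo.list, demo.sed and demo.out are made up for the illustration.

```shell
# Small tab-separated sample: header plus three rows copied from file
printf 'Locality\tCollector\tDate\n' > demo.tsv
printf 'Burnie\tLea, A M\t1915-04-06\n' >> demo.tsv
printf 'Hobart\t[Lea coll.]\t1914-12-09\n' >> demo.tsv
printf 'Hobart\tR. M. Lea\t1914-12-09\n' >> demo.tsv

# Build the list of variants, then save the generated commands as a script
tail -n +2 demo.tsv | cut -f2 > demo.list
sed 's/^/s\//;s/$/\/Arthur Mills Lea\/g/;s/\./\\&/g;s/\[/\\&/g;s/\]/\\&/g' demo.list > demo.sed

# Run the saved script with -f instead of -e "$(...)"
sed -f demo.sed demo.tsv > demo.out

# Tally the distinct values remaining in field 2 (header skipped)
tail -n +2 demo.out | cut -f2 | sort | uniq -c
```

If every variant was caught, the tally reports a single value, 'Arthur Mills Lea', with a count equal to the number of data rows. Any other value in the tally is a variant that was missing from the list.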

On to address-based bulk replacement...