Regular Expressions

Overview

We have looked at grep, which can search for patterns in its input. However, the patterns we have seen so far were all just simple text. grep, along with tools like find, vim, sed and others, can search for more complex patterns called regular expressions, or simply regexes.

Regexes are somewhat similar to the "wild cards" that the shell recognizes, such as "*.txt" which matches all files ending in ".txt", except that they are more powerful.

Regexes consist of text which is matched exactly, along with other characters which have special meanings for describing patterns. These special characters are covered below.

One tricky thing with regexes is that different tools use slightly different syntaxes for dealing with them. The notes which follow talk about using grep, vim/sed, and rename which all differ slightly. The biggest difference is which characters are literal and which have special meanings. For example, the + character could mean literally a plus sign, or it could mean one or more of the previous character. Remembering the differences is hard, so sometimes trial and error is needed.

Using Regular Expressions with grep

By default, grep treats many of the special characters for regexes as just regular text. To make grep treat them as regular expression symbols, we must pass the -E flag. We could alternatively use the egrep command, which is equivalent, although recent versions of GNU grep mark egrep as obsolescent and print a warning recommending grep -E instead.

As a first example, we can consider the "|" symbol. Inside of a regex, this means "or", so if we want to match "alligator" or "crocodile", we could use this regex:

$ grep -E 'alligator|crocodile' /usr/share/dict/words
alligator
alligator's
alligators
crocodile
crocodile's
crocodiles

Here we are using grep to search a dictionary file which contains a list of around 100 thousand English words. Note that we are passing the -E flag to grep. Without it, grep would find nothing as it would be looking for | as a literal part of the search string.

The /usr/share/dict/words file is used by spell checkers and can be installed with the wamerican package.

Also note the use of single quotes around the regex. Those are necessary because otherwise, the shell would see the | symbol as a pipe and try to pass the output of grep to the "crocodile" command which sadly does not exist.
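
The effect of the -E flag can be checked on a small input without the dictionary file. The sketch below is only an illustration: the three-word sample and the variable names are made up for this example.

```shell
# Hypothetical three-word sample standing in for the dictionary file.
words=$(printf 'alligator\ncrocodile\nlizard\n')

# With -E, | means "or", so two of the three lines match.
with_e=$(printf '%s\n' "$words" | grep -E 'alligator|crocodile')

# Without -E, | is a literal character, so nothing matches.
# "|| true" hides grep's "no match" exit status from scripts run with set -e.
without_e=$(printf '%s\n' "$words" | grep 'alligator|crocodile' || true)

printf 'with -E:\n%s\n' "$with_e"
printf 'without -E: [%s]\n' "$without_e"
```

The single quotes matter in both commands: they stop the shell from treating | as a pipe before grep ever sees it.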

Matching any Character

The '.' symbol matches any single character at all in a regex. So if we wanted to find all words which contain a 'z', then any one character, followed by a second 'z', we could use:

$ grep -E 'z.z' /usr/share/dict/words
Azazel
Azazel's
Brzezinski
Brzezinski's
pizazz
pizzazz

Note that the "." matches an 'a' in most of the results, but matches an 'e' in "Brzezinski". There are apparently no other words in this dictionary which fit the pattern.

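Which entries are at least 20 characters long? Because '.' matches any character, the apostrophe counts too, so possessives such as "disenfranchisement's" are included: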

$ grep -E '....................' /usr/share/dict/words
Andrianampoinimerina
Andrianampoinimerina's
counterintelligence's
counterrevolutionaries
counterrevolutionary
counterrevolutionary's
disenfranchisement's
electrocardiograph's
electroencephalogram
electroencephalogram's
electroencephalograms
electroencephalograph
electroencephalograph's
electroencephalographs
oversimplification's
transubstantiation's
uncharacteristically

Note that grep searches line by line. Only lines which contain the pattern above are printed; a pattern cannot span multiple lines.
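
A quick way to see the line-by-line behavior is to split a phrase across two lines. This is a minimal sketch with made-up input; grep -c prints the number of matching lines.

```shell
# "foo" and "bar" are on separate lines. The dot cannot match the
# newline between them, so no line matches and the count is 0.
across_lines=$(printf 'foo\nbar\n' | grep -c 'foo.bar' || true)

# On a single line, the dot matches the space and the count is 1.
same_line=$(printf 'foo bar\n' | grep -c 'foo.bar')

echo "across lines: $across_lines, same line: $same_line"
```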

Anchors

Notice that grep will produce matches even in the middle of a word.

Anchors allow us to specify that a match should be anchored at a specific point, but do not consume a character.

Anchor	Meaning
^	Start of a line.
$	End of a line.
\<	Start of a word.
\>	End of a word.

For instance, if we grep for 'x', we will get any line that contains an x anywhere in it. If we grep for '^x', we will match only those lines which start with x:

$ grep -E '^x' /usr/share/dict/words
x
xenon
xenon's
xenophobia
xenophobia's
xenophobic
xerographic
xerography
xerography's
xylem
xylem's
xylophone
xylophone's
xylophones
xylophonist
xylophonists

The following regex will search for words of four letters which both begin and end with 'a':

$ grep -E '^a..a$' /usr/share/dict/words
alga
aqua
area
aria
aura
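
The word anchors from the table are not used in the examples above, so here is a small sketch comparing them with the line anchors. The sample lines are invented for illustration, and \< and \> are extensions that GNU grep supports; other grep implementations may not.

```shell
# Hypothetical sample: "cat" alone, inside a longer word, and as one word of a sentence.
lines=$(printf 'cat\nconcatenate\na cat sat\n')

# No anchors: every line containing "cat" anywhere is counted.
anywhere=$(printf '%s\n' "$lines" | grep -cE 'cat')

# Line anchors: the whole line must be exactly "cat".
whole_line=$(printf '%s\n' "$lines" | grep -cE '^cat$')

# Word anchors: "cat" must stand as a complete word somewhere on the line.
whole_word=$(printf '%s\n' "$lines" | grep -cE '\<cat\>')

echo "anywhere=$anywhere whole_line=$whole_line whole_word=$whole_word"
```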

Repetition

What if we wanted to search for any word which both began and ended with an 'a'? We could attempt something like the following:

$ grep -E '(^a$)|(^aa$)|(^a.a$)|(^a..a$)|(^a...a$)' /usr/share/dict/words
a
aha
alga
aloha
alpha
ameba
aorta
aqua
area
arena
aria
aroma
atria
aura

And we could continue on for every case up until we had covered all possible words. Notice that, just as in math, parentheses are used for control of precedence in regular expressions. It would be better, however, to use one of the regular expression repetition operators:

Operator	Meaning
*	Zero or more of the preceding element.
+	One or more of the preceding element.
?	Zero or one of the preceding element.
{N}	Match exactly N of the preceding element where N is an integer.

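With these, we can shorten our regular expression which finds words beginning and ending with 'a'. The short version requires two separate 'a' characters, so unlike the long version it does not match the one-letter word "a":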

$ grep -E '^a.*a$' /usr/share/dict/words
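
One detail worth checking is how the shortened pattern treats the one-letter word "a". The sketch below uses a made-up four-word sample. The second pattern, which makes everything after the first 'a' optional, is one possible way to include the one-letter case and is not part of the original notes.

```shell
# Hypothetical sample words.
words=$(printf 'a\naha\nalpha\nabc\n')

# Needs an 'a' at the start AND a second 'a' at the end: "a" alone is skipped.
two_a=$(printf '%s\n' "$words" | grep -E '^a.*a$')

# Making "anything, then a final a" optional also accepts the single letter.
one_or_two_a=$(printf '%s\n' "$words" | grep -E '^a(.*a)?$')

printf 'two_a:\n%s\n' "$two_a"
printf 'one_or_two_a:\n%s\n' "$one_or_two_a"
```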

We could also use the {N} form to simplify our search for lines of at least 20 characters:

$ grep -E '.{20}' /usr/share/dict/words
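
To compare the four repetition operators side by side, we can run each against the same small input. This is a sketch with invented sample words; each operator applies only to the 'u' immediately before it.

```shell
# Hypothetical sample: zero, one, and two u's between "colo" and "r".
words=$(printf 'color\ncolour\ncolouur\n')

zero_or_more=$(printf '%s\n' "$words" | grep -cE '^colou*r$')    # all three lines
one_or_more=$(printf '%s\n' "$words" | grep -E '^colou+r$')      # at least one u
zero_or_one=$(printf '%s\n' "$words" | grep -E '^colou?r$')      # at most one u
exactly_two=$(printf '%s\n' "$words" | grep -E '^colou{2}r$')    # exactly two u's

echo "u* matched $zero_or_more lines"
printf 'u+:\n%s\n' "$one_or_more"
printf 'u?:\n%s\n' "$zero_or_one"
printf 'u{2}:\n%s\n' "$exactly_two"
```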

Escaping

What if we want to actually search for one of the characters with special meaning? For example, if we want to search for ellipses in a paper we have written, we could try:

$ grep -E '...' paper.txt

However, this will match every line which has at least three consecutive characters on it.

In order to actually match a regex operator literally, we "escape" it with a backslash (\):

$ grep -E '\.\.\.' paper.txt

This allows us to selectively decide whether to treat the operators as literal characters or as regex operators.
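
The difference between the escaped and unescaped patterns shows up clearly on a three-line input. This is a minimal sketch with made-up text. The last command uses grep's -F flag, which treats the whole pattern as fixed text and is an alternative when no operators are needed at all.

```shell
# Hypothetical sample text: only the first line contains an ellipsis.
text=$(printf 'Wait...\nabc\nno\n')

# Unescaped: any three characters, so "Wait..." and "abc" both count.
unescaped=$(printf '%s\n' "$text" | grep -cE '...')

# Escaped: three literal periods.
escaped=$(printf '%s\n' "$text" | grep -cE '\.\.\.')

# -F: the pattern is a fixed string, so no escaping is needed.
fixed=$(printf '%s\n' "$text" | grep -cF '...')

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
```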

Below are some other escape sequences that are useful:

Escape Sequence	Meaning
\s	Any whitespace.
\S	Anything but whitespace.
\w	A "word" character (a letter, digit, or underscore).
\W	A "non-word" character (anything else, such as punctuation or whitespace).
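
These escape sequences are extensions provided by GNU grep, so they may not work with every grep. As a sketch, the example below pairs \w and \s with the POSIX bracket expressions [[:alnum:]_] and [[:space:]], which should select the same lines on this made-up input.

```shell
# Hypothetical sample: underscore, space, and hyphen as separators.
lines=$(printf 'foo_bar1\nfoo bar\nfoo-bar\n')

# Lines made up entirely of word characters (letters, digits, underscore).
word_only=$(printf '%s\n' "$lines" | grep -E '^\w+$')
word_only_posix=$(printf '%s\n' "$lines" | grep -E '^[[:alnum:]_]+$')

# Lines containing at least one whitespace character.
has_space=$(printf '%s\n' "$lines" | grep -E '\s')
has_space_posix=$(printf '%s\n' "$lines" | grep -E '[[:space:]]')

echo "word only: $word_only / $word_only_posix"
echo "has space: $has_space / $has_space_posix"
```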

Character Classes

If we want to match any decimal digit, we could do the following:

(0|1|2|3|4|5|6|7|8|9)

However, a simpler way to do this is with a character class. The following regex will match any digit as well:

[0123456789]

We can also use a range:

[0-9]

We can use letter ranges as well. The following will find all words which both begin and end with a capital letter:

$ grep -E '^[A-Z].*[A-Z]$' /usr/share/dict/words
AOL
BMW
FDR
FNMA
GE
GTE
IBM
JFK
LBJ
LyX
MCI
MGM
MIT
MiG
NORAD
OHSA
OK
PhD
RCA
TWA
UCLA
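
Character classes combine naturally with the repetition operators. The sketch below uses invented sample lines to pick out strings shaped like a YYYY-MM-DD date and strings made only of hexadecimal digits. It checks the shape only; it does not check that a date is valid.

```shell
# Four digits, a hyphen, two digits, a hyphen, two digits.
dates=$(printf '2024-01-15\n24-1-15\nabcd-ef-gh\n' | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$')

# Several ranges can share one class: digits plus a-f in either case.
hex=$(printf 'ff00\nBEEF\nxyz\n' | grep -E '^[0-9a-fA-F]+$')

echo "dates: $dates"
printf 'hex:\n%s\n' "$hex"
```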

Inverted Character Classes

Oftentimes we want to match any character except for one or two exceptions. Rather than list all the possibilities, we can list the exceptions. The regex below matches any vowel:

[aeiou]

The regex below matches anything except for a vowel.

[^aeiou]

The caret as the first character after the opening bracket here means the character class is inverted.

To search for words that contain a q followed by any character other than u (including an apostrophe), we could use the following:

$ grep -iE 'q[^u]' /usr/share/dict/words
Chongqing
Compaq's
Esq's
Iqaluit
Iqaluit's
Iqbal
Iqbal's
Iraq's
Iraqi
Iraqi's
Iraqis
Q's
Qaddafi
Qaddafi's
Qantas
Qantas's
Qatar
Qatar's
Qingdao
Qiqihar
Qiqihar's
Qom
Qom's
Sq's
Urumqi

Notice that we are using the '-i' (ignore case) option here. Otherwise, we would not have matched the words with a capital 'Q'.
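
An inverted class still has to match some character, so a word that ends in q is not found by 'q[^u]'. The sketch below shows this on a made-up sample. The second pattern, which also accepts the end of the line after the q, is one possible workaround and an addition to the original notes.

```shell
# Hypothetical sample: q at the end, q before i, q before u.
words=$(printf 'Iraq\nIraqi\nquiz\n')

# q must be followed by some character that is not u: "Iraq" is missed.
q_then_other=$(printf '%s\n' "$words" | grep -iE 'q[^u]')

# q followed by a non-u character OR by the end of the line.
q_not_u=$(printf '%s\n' "$words" | grep -iE 'q([^u]|$)')

echo "q[^u]: $q_then_other"
printf 'q([^u]|$):\n%s\n' "$q_not_u"
```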

Back References

A back reference allows us to reference some portion of a regex later on in the same regex. To reference some portion of a regex, the portion to reference must be enclosed in parentheses.

The back reference itself is a backslash followed by a number. The number refers to which parenthesized portion we are referencing.

For example, in the following regex:

'(.)(.)(.)\3\2\1'

\1 refers to the text matched by the regex inside the first set of parentheses, \2 refers to the second and \3 refers to the third. A subset of the output of this regex on the dictionary is:

$ grep -E '(.)(.)(.)\3\2\1' /usr/share/dict/words | head 
Brenner
Brenner's
Chattahoochee
Chattahoochee's
assesses
braggart
braggart's
braggarts
cassettes
collocate
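
Back references are easier to follow on shorter patterns. As a sketch with made-up words, the first command below finds four-letter palindromes and the second finds any doubled letter. Back references with -E are supported by GNU grep, but other implementations may differ.

```shell
# Four-letter palindromes: the third letter repeats the second (\2),
# and the fourth repeats the first (\1).
palindromes=$(printf 'noon\nabba\nabab\ndeed\n' | grep -E '^(.)(.)\2\1$')

# Any character immediately followed by itself.
doubled=$(printf 'letter\nbook\ncat\n' | grep -E '(.)\1')

printf 'palindromes:\n%s\n' "$palindromes"
printf 'doubled letters:\n%s\n' "$doubled"
```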

Vim Regular Expressions

Vim supports regular expressions in its search as well.

Vim regexes differ from those of grep -E in that the following symbols are literal by default and need to be escaped with a backslash in order to act as operators:

+
?
|
(
)
{

The following do not need to be escaped, however:

.
^
*
$
[
]
- and ^ when inside a character class.

Vim can also ignore case when doing searches. To get this behavior, run:

:set ic

Sed and Vim Substitutions

Sed and Vim substitutions can also use regular expressions, including back references. For instance, let's say we have a file containing names in the last name-comma-first name format, and we wish to re-write them to have the first name first.

Johnson, Paul
Torres, Ana
van Dyke, Susan
Smith, John Michael

We could do this with a Vim substitution command as follows:

:%s/\([^,]*\), \(.*\)$/\2 \1/

This breaks down as follows:

: enter a Vim command
%s/ the % means the entire file (we could instead specify a range of lines). The s stands for substitute and the / marks the beginning of the regex
\( begins the first capture group, that we can back-reference. Unlike with grep, parens have to be escaped in Vim
[^,]* The last name, which is any number of non-comma characters
\) marks the end of the first capture group
, the comma and space which separate the names, which are not captured
\( the start of the second capture group
.* the first name, which is any characters at all
\) marks the end of the second capture group
$ the end of the line the name appears on
/ separates the search portion from the replace portion
\2 \1 what we replace it with: the first name, a space, and then the last name
/ marks the end of the command

If we want to do substitutions using Vim-style regular expressions, we can also do so using the sed command. This can be invoked from the command line:

$ sed 's/\([^,]*\), \(.*\)$/\2 \1/' names.txt
Paul Johnson
Ana Torres
Susan van Dyke
John Michael Smith

By default, sed simply prints the new text to the screen, but with the -i flag, it makes the substitutions in place. This form is for GNU sed; the sed on BSD and macOS requires a backup suffix argument after -i:

$ sed -i 's/\([^,]*\), \(.*\)$/\2 \1/' names.txt
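
sed also accepts the -E flag, which switches it to the same extended syntax as grep -E so that the parentheses no longer need backslashes. The sketch below applies the name-swapping substitution that way to two of the sample names; the variable names are made up for this example.

```shell
# Two of the sample names from above, in "Last, First" order.
names=$(printf 'Johnson, Paul\nvan Dyke, Susan\n')

# With -E, ( and ) are grouping operators without any escaping.
swapped=$(printf '%s\n' "$names" | sed -E 's/([^,]*), (.*)$/\2 \1/')

printf '%s\n' "$swapped"
```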

Vim is easier for doing substitutions interactively in a file, while sed is better for automating changes across multiple files.

Renaming Files with Regular Expressions

We can also do a regular expression search and replace to rename files with the rename command. This refers to the Perl-based rename; some distributions ship a different rename from util-linux which does not accept substitution expressions. The Perl-based command takes a substitution command and a set of files, and applies the substitution to the file names. For instance, if we wish to take a set of files and add "-backup" between the name and the extension, we might use this command:

$ ls
input.txt  output.txt  program.py
$ rename 's/([^.]*)\.(.*)/\1-backup.\2/' * 
$ ls
input-backup.txt  output-backup.txt  program-backup.py

Here the regular expression we are matching is:

([^.]*) - This is the main part of the file name and consists of zero or more characters which are not '.', which does not have to be escaped in a character class. It is in parentheses so it can be referenced as \1.
\. - The '.' between the file name and the extension.
(.*) - Whatever characters comprise the extension. It is in parentheses so it can be referenced as \2.

We then specify the new name as \1-backup.\2 which is the original file name suffixed with "-backup", then the '.', and then the original extension.

rename has a very helpful "-n" flag which makes rename just tell you what changes it would make without actually renaming anything. I recommend using this first to see what will happen.
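
On a system without the Perl-based rename, a similar result can be had with a shell loop that computes each new name with sed and then calls mv. This is a sketch of a different technique, not how rename itself works. It runs in a temporary directory created just for the example, and all of the variable names are made up.

```shell
# Create a scratch directory holding the three example files.
workdir=$(mktemp -d)
touch "$workdir/input.txt" "$workdir/output.txt" "$workdir/program.py"

for path in "$workdir"/*; do
    old=$(basename "$path")
    # Same pattern as the rename example, written for sed -E.
    new=$(printf '%s\n' "$old" | sed -E 's/^([^.]*)\.(.*)$/\1-backup.\2/')
    # Only rename when the substitution actually changed the name.
    if [ "$new" != "$old" ]; then
        mv -- "$path" "$workdir/$new"
    fi
done

renamed=$(ls "$workdir")
printf '%s\n' "$renamed"

# Remove the scratch directory again.
rm -r "$workdir"
```

Replacing the mv line with an echo of the old and new names gives a dry run in the spirit of rename's "-n" flag.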

Conclusion

Regular expressions can look incomprehensible at first, but they are easier to write than read. Writing your first ones can be frustrating as a simple error can be hard to find.

Adding regexes as a tool will be worth it in the long run, however, as they can perform in a few lines what would otherwise be long and tedious tasks.