Regular Expressions
Overview
We have looked at grep which is able to search for patterns in its
input. However, the patterns we have seen so far were all just simple text.
grep, and tools like find, vim, sed and others
are able to search for more complex patterns called regular expressions
or simply regexes.
Regexes are somewhat similar to the "wild cards" that the shell recognizes, such as "*.txt" which matches all files ending in ".txt", except that they are more powerful.
Regexes consist of text which is matched exactly, along with other characters which have special meanings for describing patterns; these are covered below.
One tricky thing with regexes is that different tools use slightly different
syntaxes for dealing with them. The notes which follow talk about using grep,
vim/sed, and rename which all differ slightly. The biggest difference is which
characters are literal and which have special meanings. For example,
the + character could mean literally a plus sign, or it could
mean one or more of the previous character. Remembering the differences
is hard, so sometimes trial and error is needed.
Using Regular Expressions with grep
By default, grep treats many of the special characters for
regexes as just regular text. To make grep treat them as regular
expression symbols, we must pass the -E flag. We could alternatively use the
egrep command, which is equivalent, although recent versions of GNU grep mark
it as obsolescent in favor of grep -E.
As a first example, we can consider the "|" symbol. Inside of a regex, this means "or", so if we want to match "alligator" or "crocodile", we could use this regex:
$ grep -E 'alligator|crocodile' /usr/share/dict/words
alligator
alligator's
alligators
crocodile
crocodile's
crocodiles
Here we are using grep to search a dictionary file which contains
a list of around 100 thousand English words. Note that we are passing the
-E flag to grep. Without it, grep would find nothing as
it would be looking for | as a literal part of the search string.
The /usr/share/dict/words file is used by spell checkers and, on Debian and
Ubuntu systems, can be installed with the wamerican package.
Also note the use of single quotes around the regex. Those are
necessary because otherwise, the shell would see the | symbol as a pipe
and try to pass the output of grep to the "crocodile"
command which sadly does not exist.
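The effect of the -E flag can be checked without the dictionary file. The sketch below is only an illustration: the three-word list and the variable names are made up for this example, and it assumes a grep that follows the behavior described above.

```shell
# Hypothetical three-word sample standing in for the dictionary file.
words=$(printf '%s\n' alligator crocodile lizard)

# With -E, | means "or", so two of the three lines match.
with_e=$(printf '%s\n' "$words" | grep -E 'alligator|crocodile')

# Without -E, | is a literal character, so nothing matches.
# "|| true" keeps a script going when grep exits non-zero on no match.
without_e=$(printf '%s\n' "$words" | grep 'alligator|crocodile' || true)

printf 'with -E:\n%s\n' "$with_e"
printf 'without -E: [%s]\n' "$without_e"
```

The single quotes matter in both commands: they stop the shell from treating | as a pipe before grep ever sees it.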
Matching any Character
The '.' symbol matches any character at all in a regex. So if we wanted to find all words which contain a 'z', then any character, followed by a second 'z', we could use:
$ grep -E 'z.z' /usr/share/dict/words
Azazel
Azazel's
Brzezinski
Brzezinski's
pizazz
pizzazz
Note that the "." matches an 'a' in most of the results, but matches an 'e' in "Brzezinski". No other words in this dictionary fit the pattern.
What words consist of at least 20 characters?
$ grep -E '....................' /usr/share/dict/words
Andrianampoinimerina
Andrianampoinimerina's
counterintelligence's
counterrevolutionaries
counterrevolutionary
counterrevolutionary's
disenfranchisement's
electrocardiograph's
electroencephalogram
electroencephalogram's
electroencephalograms
electroencephalograph
electroencephalograph's
electroencephalographs
oversimplification's
transubstantiation's
uncharacteristically
The '.' also matches the apostrophe, so a possessive such as "oversimplification's" reaches 20 characters even though it has fewer than 20 letters.
Note that grep searches line by line. Only lines which
contain the pattern above are matched; a pattern cannot span multiple lines.
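Because '.' matches any character and not only letters, a run of dots counts characters. The following sketch uses made-up sample lines (they are not from the dictionary file) to show this on a small scale:

```shell
# Hypothetical sample lines; "cat's" has 5 characters but only 4 letters.
five=$(printf '%s\n' "cat" "cats" "cat's" | grep -E '.....')

# The dot is happy to match punctuation: only "z-z" has a character
# between two z's, so "zz" is not matched.
dash=$(printf '%s\n' "z-z" "zz" | grep -E 'z.z')

echo "$five"
echo "$dash"
```

Only "cat's" is long enough for five dots, and only "z-z" has something between the two z's.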
Anchors
Notice that grep will produce matches even in the middle of a word.
Anchors allow us to specify that a match should be anchored at a specific point, but do not consume a character.
| Anchor | Meaning |
| ^ | Start of a line. |
| $ | End of a line. |
| \< | Start of a word. |
| \> | End of a word. |
For instance, if we grep for 'x', we will get any line that contains an x anywhere in it. If we grep for '^x', we will match only those lines which start with x:
$ grep -E '^x' /usr/share/dict/words
x
xenon
xenon's
xenophobia
xenophobia's
xenophobic
xerographic
xerography
xerography's
xylem
xylem's
xylophone
xylophone's
xylophones
xylophonist
xylophonists
The following regex will search for words of four letters which both begin and end with 'a':
$ grep -E '^a..a$' /usr/share/dict/words
alga
aqua
area
aria
aura
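The word anchors from the table are not exercised by the dictionary searches, since each line there is a single word. A small sketch on made-up sample text shows the difference. Note that \< and \> are extensions that GNU grep supports, so this assumes GNU grep or a compatible implementation.

```shell
# Hypothetical sample text with "cat" in three different positions.
text=$(printf '%s\n' "the cat sat" "concatenate" "a bobcat")

# Unanchored: "cat" anywhere on the line, so all three lines match.
any=$(printf '%s\n' "$text" | grep -E 'cat')

# \< and \> : "cat" only where it stands as a whole word.
whole=$(printf '%s\n' "$text" | grep -E '\<cat\>')

# ^ : only lines that begin with "con".
start=$(printf '%s\n' "$text" | grep -E '^con')

printf 'whole word: %s\n' "$whole"
printf 'line start: %s\n' "$start"
```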
Repetition
What if we wanted to search for any word which both began and ended with an 'a'? We could attempt something like the following:
$ grep -E '(^a$)|(^aa$)|(^a.a$)|(^a..a$)|(^a...a$)' /usr/share/dict/words
a
aha
alga
aloha
alpha
ameba
aorta
aqua
area
arena
aria
aroma
atria
aura
And we could continue on for every case up until we had covered all possible words. Notice that, just as in math, parentheses are used for control of precedence in regular expressions. It would be better, however, to use one of the regular expression repetition operators:
| Operator | Meaning |
| * | Zero or more of the preceding element. |
| + | One or more of the preceding element. |
| ? | Zero or one of the preceding element. |
| {N} | Match exactly N of the preceding element where N is an integer. |
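With these, we can shorten our regular expression which finds words beginning and ending with 'a':
$ grep -E '^a.*a$' /usr/share/dict/words
Unlike the long alternation above, this pattern requires two separate 'a' characters, so it no longer matches the one-letter word "a".
We could also use the last form to simplify our search for words of at least 20 characters:
$ grep -E '.{20}' /usr/share/dict/words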
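To see how the four operators differ, we can run each one against the same tiny word list. This is only a sketch: the "words" ct, cat, caat and caaat and the show helper are invented for the example.

```shell
# Hypothetical word list: "c", then zero to three a's, then "t".
words=$(printf '%s\n' ct cat caat caaat)

star=$(printf '%s\n' "$words" | grep -E '^ca*t$')    # zero or more a's
plus=$(printf '%s\n' "$words" | grep -E '^ca+t$')    # one or more a's
opt=$(printf '%s\n' "$words" | grep -E '^ca?t$')     # zero or one a
two=$(printf '%s\n' "$words" | grep -E '^ca{2}t$')   # exactly two a's

# Helper (made up for this example): print a label and matches on one line.
show() { printf '%-6s %s\n' "$1" "$(printf '%s' "$2" | tr '\n' ' ')"; }

show 'a*'   "$star"   # ct cat caat caaat
show 'a+'   "$plus"   # cat caat caaat
show 'a?'   "$opt"    # ct cat
show 'a{2}' "$two"    # caat
```

The anchors are what make the counts strict here; without ^ and $, a pattern such as 'a{2}' would also match inside "caaat".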
Escaping
What if we want to actually search for one of the characters with special meaning? For example, if we want to search for ellipses in a paper we wrote, we could try:
$ grep -E '...' paper.txt
However, this will match every line which contains at least three characters.
In order to match a regex operator literally, we "escape" it with a backslash (\):
$ grep -E '\.\.\.' paper.txt
This allows us to selectively decide whether to treat the operators as literal characters or as regex operators.
Below are some other escape sequences that are useful:
| Escape Sequence | Meaning |
| \s | Any whitespace. |
| \S | Anything but whitespace. |
| \w | A "word" character (a letter, digit, or underscore). |
| \W | A "non-word" character (anything else, such as punctuation or a space). |
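The sketch below contrasts the unescaped and escaped dots, and tries out two of the sequences from the table. The sample lines are made up, and \s and \w are assumed to be available as they are in GNU grep; other grep implementations may not support them.

```shell
# Hypothetical sample lines standing in for paper.txt.
text=$(printf '%s\n' "Wait..." "abc" "a b")

# Unescaped: any three characters, so every line matches.
loose=$(printf '%s\n' "$text" | grep -E '...')

# Escaped: three literal periods, so only the ellipsis line matches.
strict=$(printf '%s\n' "$text" | grep -E '\.\.\.')

# A word character, whitespace, then another word character.
spaced=$(printf '%s\n' "$text" | grep -E '\w\s\w')

printf 'escaped dots: %s\n' "$strict"
printf 'word-space-word: %s\n' "$spaced"
```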
Character Classes
If we want to match any decimal digit, we could do the following:
(0|1|2|3|4|5|6|7|8|9)
However, a simpler way to do this is with a character class. The following regex will match any digit as well:
[0123456789]
We can also use a range:
[0-9]
Ranges work with letters as well. The following will find all words which both begin and end with a capital letter:
$ grep -E '^[A-Z].*[A-Z]$' /usr/share/dict/words
AOL
BMW
FDR
FNMA
GE
GTE
IBM
JFK
LBJ
LyX
MCI
MGM
MIT
MiG
NORAD
OHSA
OK
PhD
RCA
TWA
UCLA
Inverted Character Classes
Oftentimes we want to match any character except for one or two exceptions. Rather than list all the possibilities, we can list the exceptions. The regex below matches any vowel:
[aeiou]
The regex below matches anything except for a vowel.
[^aeiou]
The caret as the first character after the opening bracket here means the character class is inverted.
To search for words that contain a q followed by a character other than u, we could use the following:
$ grep -iE 'q[^u]' /usr/share/dict/words
Chongqing
Compaq's
Esq's
Iqaluit
Iqaluit's
Iqbal
Iqbal's
Iraq's
Iraqi
Iraqi's
Iraqis
Q's
Qaddafi
Qaddafi's
Qantas
Qantas's
Qatar
Qatar's
Qingdao
Qiqihar
Qiqihar's
Qom
Qom's
Sq's
Urumqi
Notice we are using the '-i' (ignore case) option here; otherwise, we would not have gotten the words with a capital 'Q'. Because [^u] matches any character other than u, not just letters, possessives such as "Iraq's" match as well.
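The same searches can be tried on a handful of words to see exactly what a character class, an inverted class, and the -i flag each contribute. The word list here is invented for the example.

```shell
# Hypothetical word list.
words=$(printf '%s\n' Iraqi queen Qatar quiz)

# Lowercase q followed by any character that is not u.
q_not_u=$(printf '%s\n' "$words" | grep -E 'q[^u]')

# Same search ignoring case, which also picks up the capital Q in "Qatar".
q_not_u_i=$(printf '%s\n' "$words" | grep -iE 'q[^u]')

# A range inside a class: lines containing at least one digit.
digits=$(printf '%s\n' "room 101" "lobby" | grep -E '[0-9]')

printf 'case-sensitive: %s\n' "$q_not_u"
printf 'with digit: %s\n' "$digits"
```

Here "queen" and "quiz" are rejected in both searches because their q is followed by u.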
Back References
A back reference allows us to reference some portion of a regex later on in the same regex. To reference some portion of a regex, the portion to reference must be enclosed in parentheses.
The back reference itself is a backslash followed by a number. The number refers to which parenthesized portion we are referencing.
For example, in the following regex:
'(.)(.)(.)\3\2\1'
\1 refers to the text matched by the regex inside the first set of parentheses, \2 refers to the second and \3 refers to the third. A subset of the output of this regex on the dictionary is:
$ grep -E '(.)(.)(.)\3\2\1' /usr/share/dict/words | head
Brenner
Brenner's
Chattahoochee
Chattahoochee's
assesses
braggart
braggart's
braggarts
cassettes
collocate
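A small sketch makes it easier to see which part of a word the back references line up with. The word list is made up for the example. Back references with -E are assumed to behave as in GNU grep; they are not guaranteed by every grep implementation.

```shell
# Hypothetical word list: two words contain a six-character mirrored run.
words=$(printf '%s\n' braggart redder banana)

# Three captured characters followed by the same three in reverse order:
#   braggart -> r a g | g a r
#   redder   -> r e d | d e r
mirrored=$(printf '%s\n' "$words" | grep -E '(.)(.)(.)\3\2\1')

echo "$mirrored"
```

"banana" is skipped: it repeats letters, but never has three characters immediately followed by their mirror image.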
Vim Regular Expressions
Vim supports regular expressions in its search as well.
Vim regexes differ from those of grep -E in that the following symbols are
literal by default and must be preceded by a backslash, as shown, in order to act as operators:
- \(
- \)
- \+
- \|
- \?
The following do not need to be escaped, however:
- .
- ^
- *
- $
- [
- ]
- - and ^ when inside a character class.
Vim can also ignore case when doing searches. To get this behavior, run:
:set ic
Sed and Vim Substitutions
Sed and Vim substitutions can also use regular expressions, including back references. For instance, let's say we have a file containing names in the last name-comma-first name format, and we wish to re-write them to have the first name first.
Johnson, Paul
Torres, Ana
van Dyke, Susan
Smith, John Michael
We could do this with a Vim substitution command as follows:
:%s/\([^,]*\), \(.*\)$/\2 \1/
This breaks down as follows:
- : - enter a Vim command
- %s/ - the % means the entire file (we could instead specify a range of lines). The s stands for substitute and the / marks the beginning of the regex
- \( - begins the first capture group, that we can back-reference. Unlike with grep, parens have to be escaped in Vim
- [^,]* - the last name, which is any number of non-comma characters
- \) - marks the end of the first capture group
- , followed by a space - the comma and space which separate the names, which is not captured
- \( - the start of the second capture group
- .* - the first name, which is any characters at all
- \) - marks the end of the second capture group
- $ - the end of the line the name appears on
- / - separates the search portion from the replace portion
- \2 \1 - what we replace it with: the first name, a space, and then the last name
- / - marks the end of the command
If we want to do substitutions using Vim-style regular expressions, we can also
do so using the sed command. This can be invoked from the command
line:
$ cat names.txt | sed 's/\([^,]*\), \(.*\)$/\2 \1/'
Paul Johnson
Ana Torres
Susan van Dyke
John Michael Smith
By default, sed simply prints the new text to the screen, but with the
-i flag, it does the substitutions in place:
$ sed -i 's/\([^,]*\), \(.*\)$/\2 \1/' names.txt
Vim is easier for doing substitutions interactively in a file, while sed is better for automating changes across multiple files.
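As a side note on the syntax differences between tools, sed also accepts a -E flag (in both GNU and BSD sed) which switches it to the same unescaped-parentheses style as grep -E. The sketch below runs both forms on two of the sample names, fed in with printf instead of a file, and is meant only as an illustration.

```shell
# Two of the sample names from above, supplied inline instead of names.txt.
names=$(printf '%s\n' "Johnson, Paul" "van Dyke, Susan")

# Basic syntax, as used in the text: parentheses must be escaped.
basic=$(printf '%s\n' "$names" | sed 's/\([^,]*\), \(.*\)$/\2 \1/')

# Extended syntax with -E: parentheses are operators without a backslash.
extended=$(printf '%s\n' "$names" | sed -E 's/([^,]*), (.*)$/\2 \1/')

echo "$basic"
echo "$extended"
```

Both commands should print the same two lines, "Paul Johnson" and "Susan van Dyke".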
Renaming Files with Regular Expressions
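We can also do a regular expression search and replace to rename
files with the rename command. This command takes a substitution
command, and a set of files, and applies the substitution to the file names.
Note that this describes the Perl-based rename found on Debian and Ubuntu; some
other distributions ship a different rename command with a simpler, non-regex syntax.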
For instance, if we wish to take a set of files and add "-backup" between
the name and the extension, we might use this command:
$ ls
input.txt output.txt program.py
$ rename 's/([^.]*)\.(.*)/\1-backup.\2/' *
$ ls
input-backup.txt output-backup.txt program-backup.py
Here the regular expression we are matching is:
- ([^.]*) - This is the main part of the file name and consists of zero or more characters which are not '.', which does not have to be escaped in a character class. It is in parentheses so it can be referenced as \1.
- \. - The '.' between the file name and the extension.
- (.*) - Whatever characters comprise the extension. It is in parentheses so it can be referenced as \2.
We then specify the new name as \1-backup.\2 which is the original
file name suffixed with "-backup", then the '.', and then the original extension.
rename has a very helpful "-n" flag which makes rename just tell
you what changes it would make without actually renaming anything. I recommend using
this first to see what will happen.
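If rename is not installed, one rough way to preview what a substitution would do to a set of names is to pass the names through sed -E instead. This is a swapped-in technique, not the rename command itself: the loop below only prints old and new names, renames nothing, and uses made-up file names.

```shell
# Hypothetical file names; nothing on disk is touched by this loop.
preview=$(
    for f in input.txt program.py archive.tar.gz; do
        # Same search and replace as the rename example, applied by sed.
        new=$(printf '%s\n' "$f" | sed -E 's/([^.]*)\.(.*)/\1-backup.\2/')
        printf '%s -> %s\n' "$f" "$new"
    done
)

echo "$preview"
```

The last name shows how the pattern treats multiple dots: [^.]* stops at the first '.', so "archive.tar.gz" would become "archive-backup.tar.gz".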
Conclusion
Regular expressions can look incomprehensible at first, but they are easier to write than read. Writing your first ones can be frustrating as a simple error can be hard to find.
Adding regexes as a tool will be worth it in the long run, however, as they can perform in a few lines what would otherwise be long and tedious tasks.