A Tao of Regular Expressions

Steve Mansour

(copied by jm /at/ jmason.org from [url]http://www.scruz.net/%7esman/regexp.htm[/url], after the original disappeared! )


What Are Regular Expressions

A regular expression is a formula for matching strings that follow some pattern. Many people are afraid to use them because they can look confusing and complicated. Unfortunately, nothing in this write-up can change that. However, I have found that with a bit of practice, it's pretty easy to write these complicated expressions. Plus, once you get the hang of them, you can reduce hours of laborious and error-prone text editing down to minutes or seconds. Regular expressions are supported by many text editors, class libraries such as Rogue Wave's Tools.h++, scripting tools such as awk, grep, sed, and increasingly in interactive development environments such as Microsoft's Visual C++.

Regular expression usage is explained by examples in the sections that follow. Most examples are presented as vi substitution commands or as grep file search commands, but they are representative examples and the concepts can be applied in the use of tools such as sed, awk, perl and other programs that support regular expressions. Have a look at Regular Expressions In Various Tools for examples of regular expression usage in other tools. A short explanation of vi's substitution command and syntax is provided at the end of this document.

Regular Expression Basics

Regular expressions are made up of normal characters and metacharacters. Normal characters include upper and lower case letters and digits. The metacharacters have special meanings and are described in detail below.

In the simplest case, a regular expression looks like a standard search string. For example, the regular expression "testing" contains no metacharacters. It will match "testing" and "123testing" but it will not match "Testing".

To really make good use of regular expressions it is critical to understand metacharacters. The table below lists metacharacters and a short explanation of their meaning.

Metacharacter   Description

.               Matches any single character. For example, the regular expression r.t would match
                the strings rat, rut, and r t, but not root.

$               Matches the end of a line. For example, the regular expression weasel$ would match
                the end of the string "He's a weasel" but not the string "They are a bunch of weasels."

^               Matches the beginning of a line. For example, the regular expression ^When in would
                match the beginning of the string "When in the course of human events" but would not
                match "What and When in the".

*               Matches zero or more occurrences of the character immediately preceding. For example,
                the regular expression .* means match any number of any characters.

\               This is the quoting character; use it to treat the following character as an ordinary
                character. For example, \$ is used to match the dollar sign character ($) rather than
                the end of a line. Similarly, the expression \. is used to match the period character
                rather than any single character.

[]              Matches any one of the characters between the brackets. For example, the regular
[c1-c2]         expression r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be
[^c1-c2]        specified by using a hyphen. For example, the regular expression [0-9] means match
                any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z]
                means match any upper or lower case letter. To match any character except those in the
                range (the complement range), use the caret as the first character after the opening
                bracket. For example, the expression [^269A-Z] will match any character except 2, 6,
                9, and upper case letters.

\< \>           Matches the beginning (\<) or end (\>) of a word. For example, \<the matches on "the"
                in the string "for the wise" but does not match "the" in "otherwise". NOTE: this
                metacharacter is not supported by all applications.

\( \)           Treat the expression between \( and \) as a group. Also, saves the characters matched
                by the expression into temporary holding areas. Up to nine pattern matches can be
                saved in a single regular expression. They can be referenced as \1 through \9.

|               Or two conditions together. For example, (him|her) matches the line "it belongs to
                him" and matches the line "it belongs to her" but does not match the line "it belongs
                to them." NOTE: this metacharacter is not supported by all applications.

+               Matches one or more occurrences of the character or regular expression immediately
                preceding. For example, the regular expression 9+ matches 9, 99, 999. NOTE: this
                metacharacter is not supported by all applications.

?               Matches 0 or 1 occurrence of the character or regular expression immediately
                preceding. NOTE: this metacharacter is not supported by all applications.

\{i\}           Match a specific number of instances or instances within a range of the preceding
\{i,j\}         character. For example, the expression A[0-9]\{3\} will match "A" followed by
                exactly 3 digits. That is, as a complete match it accepts A123 but not A1234. The
                expression [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this
                metacharacter is not supported by all applications.

The simplest metacharacter is the dot. It matches any one character (excluding the newline character).
Consider a file named test.txt consisting of the following lines:

he is a rat
he is in a rut
the food is Rotten
I like root beer


We can use grep to test our regular expressions. Grep uses the regular expression we supply and tries to match it to every line of the file. It prints all lines where the regular expression matches at least one sequence of characters on a line. The command

grep r.t test.txt


searches for the regular expression r.t in each line of test.txt and prints the matching lines. The regular expression r.t matches an r followed by any character followed by a t. It will match rat and rut. It does not match the Rot in Rotten because regular expressions are case sensitive. To match both upper and lower case, the square brackets (character range metacharacters) can be used. The regular expression [Rr] matches either R or r. So, to match an upper or lower case r followed by any character followed by the character t, the regular expression [Rr].t will do the trick.
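The two searches above can be tried without creating a file. This is a minimal sketch that pipes the four sample lines into grep; the variable names (sample, lower, either) are just illustrative.

```shell
# The four sample lines from the text, held in a variable instead of test.txt.
sample=$(printf '%s\n' 'he is a rat' 'he is in a rut' 'the food is Rotten' 'I like root beer')

# r.t: an r, any one character, then a t (case sensitive).
lower=$(printf '%s\n' "$sample" | grep 'r.t')

# [Rr].t: the same, but the first letter may be R or r.
either=$(printf '%s\n' "$sample" | grep '[Rr].t')

printf 'r.t matched:\n%s\n\n[Rr].t matched:\n%s\n' "$lower" "$either"
```

The first search prints the rat and rut lines; the second also picks up the Rotten line. The root beer line matches neither, because two characters separate its r and t.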

To match characters at the beginning of a line use the circumflex character (sometimes called a caret). For example, to find the lines containing the word "he" at the beginning of each line in the file test.txt you might first think to use the simple expression he. However, this would also match the "he" inside "the" in the third line. The regular expression ^he only matches "he" at the beginning of a line.

Sometimes it is easier to indicate what should not be matched rather than all the cases that should be matched. When the circumflex is the first character between the square brackets it means to match any character which is not in the range. For example, to match he when it is not preceded by t or s, the following regular expression can be used: [^st]he.
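A small sketch of the two uses of the circumflex, run against a made-up word list (the words are assumptions chosen for illustration, not part of test.txt). One detail worth seeing in action: [^st]he requires some character before "he", so it does not match "he" at the very start of a line.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' he the she ache where)

# ^he: "he" only at the beginning of the line.
anchored=$(printf '%s\n' "$words" | grep '^he')

# [^st]he: "he" preceded by a character that is neither s nor t.
negated=$(printf '%s\n' "$words" | grep '[^st]he')

printf '^he matched:\n%s\n\n[^st]he matched:\n%s\n' "$anchored" "$negated"
```

Here ^he selects only "he", while [^st]he selects "ache" and "where".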

Several character ranges can be specified between the square brackets. For example, the regular expression [A-Za-z] matches any letter in the alphabet, upper or lower case. The regular expression [A-Za-z][A-Za-z]* matches a letter followed by zero or more letters. We can use the + metacharacter to do the same thing. That is, the regular expression [A-Za-z]+ means the same thing as [A-Za-z][A-Za-z]*. Note that the + metacharacter is not supported by all programs that have regular expressions. See Regular Expressions Syntax Support for more details.
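As a quick check of that equivalence, the sketch below runs both forms over the same made-up input. It assumes a grep that accepts the -E option for extended expressions, where + is available; plain grep is given the longer spelling.

```shell
# Hypothetical input: one line of letters, one of digits, one mixed.
sample=$(printf '%s\n' 'abc' '123' 'x9')

# Basic syntax: a letter followed by zero or more letters.
basic=$(printf '%s\n' "$sample" | grep '[A-Za-z][A-Za-z]*')

# Extended syntax: one or more letters, written with +.
extended=$(printf '%s\n' "$sample" | grep -E '[A-Za-z]+')

printf 'basic:\n%s\n\nextended:\n%s\n' "$basic" "$extended"
```

Both commands select the same two lines, abc and x9.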

To specify the number of occurrences matched, use the braces (they must be escaped with a backslash). As an example, to match 100 and 1000 but not 10, use the following: 10\{2,3\}. This regular expression matches the digit 1 followed by either 2 or 3 0's. (Inside a longer run such as 10000 it still matches the leading 1000, so anchor the pattern if such numbers must be excluded.) A useful variation is to omit the second number. For example, the regular expression 0\{3,\} will match 3 or more successive 0's.
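The following sketch tries the interval expressions on four numbers, one per line. It uses grep's -x option (match the whole line) as one possible way to anchor the pattern; without it, 10000 is also selected because it contains 1000.

```shell
nums=$(printf '%s\n' 10 100 1000 10000)

# Unanchored: any line containing a 1 followed by two or three 0's.
loose=$(printf '%s\n' "$nums" | grep '10\{2,3\}')

# Whole-line match: only lines that are exactly 100 or 1000.
exact=$(printf '%s\n' "$nums" | grep -x '10\{2,3\}')

# Open-ended interval: a 1 followed by three or more 0's, whole line.
three_plus=$(printf '%s\n' "$nums" | grep -x '10\{3,\}')

printf 'loose:\n%s\n\nexact:\n%s\n\nthree or more:\n%s\n' "$loose" "$exact" "$three_plus"
```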

Simple Examples

Here are a few representative, simple examples.

vi command                  What it does

:%s/  */ /g                 Change 1 or more spaces into a single space (the pattern is two spaces followed by *).
:%s/ *$//                   Remove all spaces from the end of the line.

:%s/^/ /                    Insert a space at the beginning of every line.
:%s/^[0-9][0-9]* //         Remove all numbers at the beginning of a line.
:%s/b[aeio]g/bug/g          Change all occurrences of bag, beg, big, and bog to bug.
:%s/t\([aou]\)g/h\1t/g      Change all occurrences of tag, tog, and tug to hat, hot, and hut, respectively.
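These substitutions can be tried outside vi. The sketch below feeds made-up sample strings to sed, whose s command accepts the same basic patterns; the variable names and input strings are assumptions for illustration.

```shell
# One or more spaces become a single space (pattern: two spaces, then *).
squeezed=$(printf '%s\n' 'a   b  c' | sed 's/  */ /g')

# Spaces at the end of the line are removed.
trimmed=$(printf '%s\n' 'text   ' | sed 's/ *$//')

# A leading number and the space after it are removed.
unnumbered=$(printf '%s\n' '42 apples' | sed 's/^[0-9][0-9]* //')

# \1 carries the matched vowel from the pattern into the replacement.
swapped=$(printf '%s\n' 'tag tog tug' | sed 's/t\([aou]\)g/h\1t/g')

printf '%s\n' "$squeezed" "$trimmed" "$unnumbered" "$swapped"
```

The last command prints "hat hot hut": the vowel saved by \( \) is reinserted between the new h and t.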

Medium Examples (Strange Incantations)

Example 1

Change all instances of foo(a,b,c) to foo(b,a,c), where a, b, and c can be any parameters supplied to foo(). That is, we must be able to make changes like the following:

Before                      After

foo(10,7,2)                 foo(7,10,2)
foo(x+13,y-2,10)            foo(y-2,x+13,10)
foo( bar(8), x+y+z, 5)      foo( x+y+z, bar(8), 5)


The following substitution command will do the trick :

:%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g


Now, let's break this apart and analyze what's happening. The idea behind this expression is to identify invocations of foo() with three parameters between the parentheses. The first parameter is identified by the regular expression \([^,]*\), which we can analyze from the inside out.

[^,] means any character which is not a comma

[^,]* means 0 or more characters which are not commas

\([^,]*\) tags the non-comma characters as \1 for use in the replacement part of the command

\([^,]*\), means that we must match 0 or more non-comma characters which are followed by a comma. The non-comma characters are tagged.

This is a good time to point out one of the most common problems people have with regular expressions. Why would we use an expression like [^,]*, instead of something more straightforward like .*, to match the first parameter? Consider applying the pattern .*, to the string "10,7,2". Should it match "10," or "10,7,"? To resolve this ambiguity, regular expressions will always match the longest string possible. In this case that is "10,7," which covers two parameters instead of the one parameter we want. So, by using the expression [^,]*, we force the pattern to match all characters up to the first comma.

The expression up to this point is: foo(\([^,]*\), and can be roughly translated as "after you find foo( tag all characters up to the next comma as \1". We tag the second parameter just like the first and it can be referenced as \2. The tag used on the third parameter is exactly like the others except that we search for all characters up to the right parenthesis. It may be superfluous to search for the last parameter since we don't have to move it. But this pattern guarantees that we update only those instances of foo() where 3 parameters are specified. In these times of function and method overloading, being explicit often proves to be useful. In the substitution portion of the command, we explicitly enter the invocation of foo(), referencing the tagged parameters in their new order: foo(\2,\1,\3).
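Both points can be checked from the shell. The sketch below uses sed rather than vi, which is an assumption of convenience: it first contrasts the longest-match behaviour of .*, with [^,]*, on the string 10,7,2, then applies the full substitution to the sample calls. The foo(a,b) line is an extra made-up case showing that two-parameter calls are left alone.

```shell
# Longest match: .*, runs to the LAST comma, [^,]*, stops at the first.
# In a sed replacement, & stands for the text that was matched.
greedy=$(printf '%s\n' '10,7,2' | sed 's/.*,/[&]/')
bounded=$(printf '%s\n' '10,7,2' | sed 's/[^,]*,/[&]/')

# The substitution from the text, written as a sed script.
swap='s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g'
swapped=$(printf '%s\n' 'foo(10,7,2)' 'foo(x+13,y-2,10)' \
    'foo( bar(8), x+y+z, 5)' 'foo(a,b)' | sed "$swap")

printf '%s\n' "$greedy" "$bounded" "$swapped"
```

The first two lines of output are [10,7,]2 and [10,]7,2, which makes the difference between the two patterns visible.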

Example 2

We have a CSV (comma separated value) file with information we need, but in the wrong format. The columns of data are currently arranged in the following order: Name, Company Name, State, Postal Code. We need to reorganize the data into the following order in order to use it with a particular piece of software: Name, State-Postal Code, Company Name. This means that we must change the order of the columns in addition to merging two columns to form a new column value. The particular piece of software that needs this data will not work if there are any whitespace characters (spaces or tabs) before or after the commas. So we must remove whitespace around the commas.

Here are a few lines from the data we have:

Bill Jones, HI-TEK Corporation , CA, 95011
Sharon Lee Smith, Design Works Incorporated, CA, 95012
B. Amos , Hill Street Cafe, CA, 95013
Alexander Weatherworth, The Crafts Store, CA, 95014
...


We need to transform them to look like this:

Bill Jones,CA 95011,HI-TEK Corporation
Sharon Lee Smith,CA 95012,Design Works Incorporated
B. Amos,CA 95013,Hill Street Cafe
Alexander Weatherworth,CA 95014,The Crafts Store
...


We'll look at two regular expressions to solve this problem. The first moves the columns around and merges the data. The second removes the excess spaces.

Here is the first pass at a substitution command that will solve the problem:

:%s/\([^,]*\),\([^,]*\),\([^,]*\),\(.*\)/\1,\3 \4,\2/


The approach is similar to that of Example 1. The Name is matched by the expression \([^,]*\), that is, all characters up to the first comma. The name can then be referenced as \1 in the replacement pattern. The Company Name and State fields are matched just like the Name field and are referenced as \2 and \3 in the replacement pattern. The last field is matched with the expression \(.*\) which can be translated as "match all characters through the end of the line". The replacement pattern is constructed by calling out each tagged expression in the appropriate order and adding or not adding the delimiter.

The following substitution command will remove the excess spaces:

:%s/[ \t]*,[ \t]*/,/g


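To break it down: [ \t] matches a space or tab character; [ \t]* matches 0 or more spaces or tabs; [ \t]*, matches 0 or more spaces or tabs followed by a comma; and finally [ \t]*,[ \t]* matches 0 or more spaces or tabs followed by a comma followed by 0 or more spaces or tabs. In the replacement pattern, we simply replace whatever we matched with a single comma. The optional g parameter is added to the end of the substitution command to apply the substitution to all commas in the line. Note that the order of the two commands matters: if the cleanup runs after the reordering, the space that followed the comma before the postal code is already inside the merged State-Postal Code field, where this pattern cannot reach it. Running the cleanup first avoids that.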
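A sketch of the two passes using sed on one of the sample lines. Two assumptions are made here: sed stands in for vi, and the POSIX class [[:blank:]] (space or tab) replaces [ \t], since not every sed treats \t inside brackets as a tab. The example runs the passes in both orders to show how the result differs.

```shell
trim='s/[[:blank:]]*,[[:blank:]]*/,/g'
reorder='s/\([^,]*\),\([^,]*\),\([^,]*\),\(.*\)/\1,\3 \4,\2/'
line='B. Amos , Hill Street Cafe, CA, 95013'

# Whitespace cleanup first, then reorder and merge the columns.
trim_first=$(printf '%s\n' "$line" | sed -e "$trim" -e "$reorder")

# Reorder first: the postal code's leading space ends up inside the
# merged field, where the cleanup pattern no longer sees it.
reorder_first=$(printf '%s\n' "$line" | sed -e "$reorder" -e "$trim")

printf '%s\n' "$trim_first" "$reorder_first"
```

With the cleanup first the line becomes "B. Amos,CA 95013,Hill Street Cafe". In the other order, two spaces remain between CA and 95013.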

Example 3

Suppose you have a multi-character sequence that repeats. For example, consider the following:

Billy tried really hard
Sally tried really really hard
Timmy tried really really really hard
Johnny tried really really really really hard


Now suppose you want to change "really", "really really", and any number of consecutive "really"
strings to a single word: "very". The command

:%s/\(really \)\(really \)*/very /


changes the text above to:

Billy tried very hard
Sally tried very hard
Timmy tried very hard
Johnny tried very hard


The expression \(really \)* matches 0 or more sequences of "really ". The sequence \(really \)\(really \)* matches one or more instances of the sequence "really ".
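The same substitution can be checked with sed, used here as a stand-in for vi. This is a minimal sketch over two of the sample lines.

```shell
# One "really " followed by zero or more further copies becomes "very ".
very='s/\(really \)\(really \)*/very /'

out=$(printf '%s\n' 'Billy tried really hard' \
    'Timmy tried really really really hard' | sed "$very")

printf '%s\n' "$out"
```

Because the starred group takes as many repetitions as it can, both lines come out with a single "very".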

Hard Examples (Magical Hieroglyphics)

coming soon.


Regular Expressions In Various Tools

OK, you'd like to use regular expressions, but you can't bring yourself to use vi. Here, then, are a few examples of how to use regular expressions in other tools. Also, I have attempted to summarize the differences in regular expressions you will find between different programs.

You can use regular expressions in the Visual C++ editor. Select Edit->Replace, then be sure to check the checkbox labeled "Regular expression". For vi expressions of the form :%s/pat1/pat2/g set the Find What field to pat1 and the Replace with field to pat2. To simulate the range (% in this case) and the g option you will have to use the Replace All button or appropriate combinations of Find Next and Replace.

sed

Sed is a Stream EDitor which can be used to make changes to files or pipes. For complete details, see the man page sed(1).

Here are a few interesting sed scripts. Assume that we're processing a file called price.txt. Note that the edits don't actually happen to the input file; sed simply processes each line of the file with the command you supply and echoes the result to its standard output.

sed script                          Description

sed '/^$/d' price.txt               removes all empty lines
sed '/^[ \t]*$/d' price.txt         removes all lines containing only whitespace
sed 's/"//g' price.txt              removes all quotation marks
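The sketch below runs the three scripts against a made-up stand-in for price.txt (the contents are an assumption). The deletions use sed's /pattern/d form, and [[:blank:]] is used in place of [ \t] because not every sed reads \t inside brackets as a tab.

```shell
# Hypothetical price.txt contents: two data lines, one empty line,
# and one line holding only spaces.
prices=$(printf '%s\n' '"widget" 4.50' '' '   ' '"gadget" 12.00')

# Delete empty lines (the spaces-only line survives).
no_empty=$(printf '%s\n' "$prices" | sed '/^$/d')

# Delete lines that are empty or contain only spaces and tabs.
no_blank=$(printf '%s\n' "$prices" | sed '/^[[:blank:]]*$/d')

# Remove every quotation mark.
no_quotes=$(printf '%s\n' "$prices" | sed 's/"//g')

printf '%s\n' "$no_blank"
```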

awk

Awk is a programming language which can be used to perform sophisticated analysis and manipulation of text data. For complete details, see the man page awk(1). Its peculiar name is an acronym made up of the first character of its authors' last names (Aho, Weinberger, and Kernighan).

There are many good awk examples in the book The AWK Programming Language (written by Aho, Weinberger, and Kernighan). Please don't form any broad opinions about awk's capabilities based on the following trivial sample scripts. For purposes of these examples, assume that we're working with a file called price.txt. As with sed, awk simply echoes its output to its standard output.

awk script                                              Description

awk '$0 !~ /^$/' price.txt                              removes all empty lines
awk 'NF > 0' price.txt                                  a better way to remove all blank lines in awk
awk '$2 ~ /^[JT]/ {print $3}' price.txt                 print the third field of all lines whose
                                                        second field begins with 'J' or 'T'
awk '$2 !~ /[Mm]isc/ {print $3 + $4}' price.txt         for all lines whose second field does not
                                                        contain 'Misc' or 'misc' print the sum of
                                                        columns 3 and 4 (assumed to be numbers).
awk '$3 !~ /^[0-9]+\.[0-9]*$/ {print $0}' price.txt     print all lines where field 3 is not a number.
                                                        The number must be of the form: d.d or d.
                                                        where d is any number of digits from 0 to 9.
awk '$2 ~ /John|Fred/ {print $0}' price.txt             print the entire line if the second field
                                                        contains 'John' or 'Fred'
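To see a few of these scripts produce output, the sketch below pipes a made-up four-column data set into awk. The layout (id, name, price, quantity) and its values are assumptions, since the real contents of price.txt are not given.

```shell
# Hypothetical data: id, name, price, quantity.
prices=$(printf '%s\n' '1 John 2.50 4' '2 Tina 3.00 1' '3 Misc 1.25 2' '4 Fred abc 7')

# Third field of lines whose second field starts with J or T.
jt=$(printf '%s\n' "$prices" | awk '$2 ~ /^[JT]/ {print $3}')

# Sum of fields 3 and 4 where field 2 does not contain Misc or misc.
# A non-numeric field such as "abc" counts as 0 in the addition.
sums=$(printf '%s\n' "$prices" | awk '$2 !~ /[Mm]isc/ {print $3 + $4}')

# Lines whose third field is not of the form d.d or d.
not_num=$(printf '%s\n' "$prices" | awk '$3 !~ /^[0-9]+\.[0-9]*$/ {print $0}')

printf '%s\n' "$jt" "$sums" "$not_num"
```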

grep

grep is a program used to match regular expressions in one or more specified files or in an input stream. For complete details, see the man page grep(1). Its peculiar name stems from its roots as a command in the ed and ex line editors that underlie vi: g/re/p, meaning global regular expression print.

For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

Francis, John 5-3871
Wong, Fred 4-4123
Jones, Thomas 1-4122
Salazar, Richard 5-2522

grep command                        Description

grep '\t5-...1' phone.txt           print all the lines in phone.txt where the phone number begins with 5
                                    and ends with 1. Note that the tab character is represented by \t.
grep '^S[^ ]* R' phone.txt          print lines where the last name begins with S and first name begins
                                    with R.
grep '^[JW]' phone.txt              print lines where the last name begins with J or W
grep ', ....\t' phone.txt           print lines where the first name is 4 characters. The tab character is
                                    represented by \t.
grep -v '^[JW]' phone.txt           print lines that do not begin with J or W
grep '^[M-Z]' phone.txt             print lines where the last name begins with any letter from M to Z.
grep '^[M-Z].*[12]$' phone.txt      print lines where the last name begins with a letter from M to Z and
                                    where the phone number ends with a 1 or 2.
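A sketch of three of these searches. Many grep implementations do not interpret \t as a tab, so this version inserts a real tab character produced by printf; that substitution, the variable names, and the use of LC_ALL=C to keep the [M-Z] range predictable are all assumptions of this example.

```shell
tab=$(printf '\t')

# phone.txt contents: "Last, First<TAB>number", one record per line.
phone=$(printf '%s\t%s\n' 'Francis, John' '5-3871' 'Wong, Fred' '4-4123' \
    'Jones, Thomas' '1-4122' 'Salazar, Richard' '5-2522')

# Phone number begins with 5 and ends with 1.
five_one=$(printf '%s\n' "$phone" | grep "${tab}5-...1")

# First name is exactly 4 characters long.
four=$(printf '%s\n' "$phone" | grep ", ....${tab}")

# Last name starts with M to Z and the line ends in 1 or 2.
# The trailing $ matters: without it, any 1 or 2 later in the line would do.
m_to_z=$(printf '%s\n' "$phone" | LC_ALL=C grep '^[M-Z].*[12]$')

printf '%s\n' "$five_one" "$four" "$m_to_z"
```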

egrep

egrep is an extended version of grep. It supports a few more metacharacters in its regular expressions. For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

Francis, John 5-3871
Wong, Fred 4-4123
Jones, Thomas 1-4122
Salazar, Richard 5-2522

egrep command                       Description

egrep '(John|Fred)' phone.txt       print all lines that contain the name John or Fred.
egrep 'John|22$|^W' phone.txt       print lines that contain John or that end with 22 or that begin with W.
egrep 'net(work)?s' report.txt      print lines in report.txt that contain networks or nets.
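The same patterns can be run with grep -E, which accepts the extended syntax on systems where a separate egrep command is missing or discouraged. The sketch below uses the phone list (with a space standing in for the tab) and three made-up lines in place of report.txt.

```shell
phone=$(printf '%s\n' 'Francis, John 5-3871' 'Wong, Fred 4-4123' \
    'Jones, Thomas 1-4122' 'Salazar, Richard 5-2522')

# Lines containing John or Fred.
names=$(printf '%s\n' "$phone" | grep -E '(John|Fred)')

# Lines containing John, or ending with 22, or beginning with W.
mixed=$(printf '%s\n' "$phone" | grep -E 'John|22$|^W')

# Optional group: matches both "nets" and "networks" (hypothetical input).
nets=$(printf '%s\n' 'two networks' 'fishing nets' 'netting' | grep -E 'net(work)?s')

printf '%s\n' "$names" "$mixed" "$nets"
```

Note that every line of the phone list satisfies the second pattern: two numbers end in 22, one line contains John, and one begins with W.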

The vi Substitution Command

Vi's substitution command has the form

:ranges/pat1/pat2/g

where

: begins an ex (command line editor) command which is applied to the file currently being edited.
range is the line range specifier. Use the percent sign (%) to indicate all lines. Use the dot (.) to indicate the current line. Use the dollar sign ($) to indicate the last line. You can also use specific line numbers. Examples: 10,20 means lines 10 through 20; .,$ means from the current line to the last line; .+2,$-5 means from two lines after the current through the fifth line up from the end of the file.
s is the substitution command.
pat1 is the regular expression to be searched for. This paper is full of examples.
pat2 is the replacement pattern. This paper is full of examples.
g is optional. When present the substitution is made to all matches on the line. When it is not present, the substitution is applied only to the first match on the line.
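The effect of the range and of the g option can be seen outside vi as well. The sketch below uses sed, which accepts numeric line addresses and $ for the last line in front of its own s command. It is only an analogy: sed has no current line, so vi's dot address has no counterpart here.

```shell
text=$(printf '%s\n' 'aaa' 'aaa' 'aaa' 'aaa')

# Lines 2 through 3, every match on the line (g present).
ranged=$(printf '%s\n' "$text" | sed '2,3s/a/X/g')

# Last line only, first match only (g absent).
last_first=$(printf '%s\n' "$text" | sed '$s/a/X/')

printf '%s\n\n%s\n' "$ranged" "$last_first"
```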

There are many online manuals for vi that provide more complete detail. This page has a number of good vi links and information.