Programming in Lua
Part III. The Standard Libraries, Chapter 20. The String Library
20.2 - Patterns
You can make patterns more useful with character classes. A character class is an item in a pattern that can match any character in a specific set. For instance, the class '%d' matches any digit. Therefore, you can search for a date in the format dd/mm/yyyy with the pattern '%d%d/%d%d/%d%d%d%d':
s = "Deadline is 30/05/1999, firm"
date = "%d%d/%d%d/%d%d%d%d"
print(string.sub(s, string.find(s, date)))   --> 30/05/1999
The following table lists all character classes:
. | all characters |
%a | letters |
%c | control characters |
%d | digits |
%l | lower case letters |
%p | punctuation characters |
%s | space characters |
%u | upper case letters |
%w | alphanumeric characters |
%x | hexadecimal digits |
%z | the character with representation 0 |
An upper case version of any of those classes represents the complement of the class. For instance, '%A' represents all non-letter characters:
print(string.gsub("hello, up-down!", "%A", ".")) --> hello..up.down. 4
(The 4 is not part of the result string. It is the second result of gsub, the total number of substitutions. Other examples that print the result of gsub will omit this count.)
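As a small sketch of how complemented classes can be used (the sample string and variable names are only illustrative), deleting every character matched by '%D' keeps just the digits, and deleting every '%s' strips the spaces. The extra parentheses around each call discard the substitution count:

```lua
local s = "room 101, floor 3"

-- '%D' is the complement of '%d': delete everything that is not a digit.
local digits = (string.gsub(s, "%D", ""))
print(digits)    --> 1013

-- '%s' matches space characters: delete them all.
local packed = (string.gsub(s, "%s", ""))
print(packed)    --> room101,floor3
```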
Some characters, called magic characters, have special meanings when used in a pattern. The magic characters are
( ) . % + - * ? [ ^ $
The character `%´ works as an escape for those magic characters. So, '%.' matches a dot; '%%' matches the character `%´ itself. You can use the escape `%´ not only for the magic characters, but also for all other non-alphanumeric characters. When in doubt, play safe and put an escape.
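The difference an escape makes can be seen in a short sketch (the sample string is made up for illustration). The last call relies on the optional fourth argument of string.find, which requests a plain substring search with no pattern matching at all:

```lua
local s = "version 315 or 3.5"

-- Unescaped, '.' matches any character, so "315" is found first.
local i1, j1 = string.find(s, "3.5")
print(i1, j1)    --> 9 11

-- Escaped, '%.' matches only a literal dot.
local i2, j2 = string.find(s, "3%.5")
print(i2, j2)    --> 16 18

-- Plain search: the fourth argument turns off pattern matching.
local i3, j3 = string.find(s, "3.5", 1, true)
print(i3, j3)    --> 16 18
```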
For Lua, patterns are regular strings. They have no special treatment and follow the same rules as other strings. Only inside the pattern-matching functions are they interpreted as patterns, and only then does the `%´ work as an escape. Therefore, if you need to put a quote inside a pattern, you must use the same techniques that you use to put a quote inside other strings; for instance, you can escape the quote with a `\´, which is the escape character for Lua.
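A minimal sketch of both ways to put a quote inside a pattern (the subject string is only an example). Either way the pattern that reaches string.find is the same five characters:

```lua
local s = 'say "hello" now'

-- Double-quoted pattern: each quote inside it needs a backslash.
local i1, j1 = string.find(s, "\"%a+\"")
print(i1, j1)    --> 5 11

-- Single-quoted pattern: the double quotes need no escape.
local i2, j2 = string.find(s, '"%a+"')
print(i2, j2)    --> 5 11

-- To Lua itself, '%' is an ordinary character in a string.
print(string.len("%a+"))    --> 3
```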
A char-set allows you to create your own character classes, combining different classes and single characters between square brackets. For instance, the char-set '[%w_]' matches both alphanumeric characters and underscores, the char-set '[01]' matches binary digits, and the char-set '[%[%]]' matches square brackets. To count the number of vowels in a text, you can write
_, nvow = string.gsub(text, "[AEIOUaeiou]", "")
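The counting idiom above can be wrapped in small functions. This is only a sketch: count_vowels and count_brackets are hypothetical helper names, not part of the string library.

```lua
-- Count the vowels in a text, using the second result of gsub.
local function count_vowels(text)
  local _, n = string.gsub(text, "[AEIOUaeiou]", "")
  return n
end

-- Count square brackets; inside the char-set each bracket is escaped.
local function count_brackets(text)
  local _, n = string.gsub(text, "[%[%]]", "")
  return n
end

print(count_vowels("Programming in Lua"))    --> 6
print(count_brackets("a[1][2]"))             --> 4
```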
You can also include character ranges in a char-set, by writing the first and the last characters of the range separated by a hyphen. You will seldom need this facility, because most useful ranges are already predefined; for instance, '[0-9]' is simpler when written as '%d', and '[0-9a-fA-F]' is the same as '%x'. However, if you need to find an octal digit, then you may prefer '[0-7]', instead of an explicit enumeration ('[01234567]'). You can get the complement of a char-set by starting it with `^´: '[^0-7]' finds any character that is not an octal digit and '[^\n]' matches any character different from newline. But remember that you can negate simple classes with their upper case versions: '%S' is simpler than '[^%s]'.
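A brief sketch of ranges and complemented char-sets at work (the sample strings and variable names are invented for illustration):

```lua
-- Count the octal digits in a string.
local _, noctal = string.gsub("mode 0755 and 0644", "[0-7]", "")
print(noctal)    --> 8

-- Delete every character that is not an octal digit.
local kept = (string.gsub("a1b8c9", "[^0-7]", ""))
print(kept)      --> 1

-- '[^\n]+' matches up to, but not including, the first newline.
local i, j = string.find("first line\nsecond", "[^\n]+")
print(i, j)      --> 1 10
```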
Character classes follow the current locale set for Lua. Therefore, the class '[a-z]' can be different from '%l'. In a proper locale, the latter form includes letters such as `ç´ and `ã´. You should always use the latter form, unless you have a strong reason to do otherwise: It is simpler, more portable, and slightly more efficient.
You can make patterns still more useful with modifiers for repetitions and optional parts. Patterns in Lua offer four modifiers:
+ | 1 or more repetitions |
* | 0 or more repetitions |
- | also 0 or more repetitions |
? | optional (0 or 1 occurrence) |
The `+´ modifier matches one or more characters of the original class. It will always get the longest sequence that matches the pattern. For instance, the pattern '%a+' means one or more letters, or a word:
print(string.gsub("one, and two; and three", "%a+", "word")) --> word, word word; word word
The pattern '%d+' matches one or more digits (an integer):
i, j = string.find("the number 1298 is even", "%d+")
print(i,j)   --> 12 15
The modifier `*´ is similar to `+´, but it also accepts zero occurrences of characters of the class. A typical use is to match optional spaces between parts of a pattern. For instance, to match an empty parenthesis pair, such as () or ( ), you use the pattern '%(%s*%)'. (The pattern '%s*' matches zero or more spaces. Parentheses have a special meaning in a pattern, so we must escape them with a `%´.) As another example, the pattern '[_%a][_%w]*' matches identifiers in a Lua program: a sequence that starts with a letter or an underscore, followed by zero or more underscores or alphanumeric characters.
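The two patterns just described can be tried out as follows. This is only a sketch, and the subject strings are made up:

```lua
-- An empty parenthesis pair, with or without spaces inside.
local a1, a2 = string.find("call()", "%(%s*%)")
print(a1, a2)    --> 5 6
local b1, b2 = string.find("call(   )", "%(%s*%)")
print(b1, b2)    --> 5 9

-- The first identifier in a piece of code.
local i, j = string.find("2 * _count1", "[_%a][_%w]*")
print(i, j)      --> 5 11
```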
Like `*´, the modifier `-´ also matches zero or more occurrences of characters of the original class. However, instead of matching the longest sequence, it matches the shortest one. Sometimes, there is no difference between `*´ and `-´, but usually they present rather different results. For instance, if you try to find an identifier with the pattern '[_%a][_%w]-', you will find only the first letter, because the '[_%w]-' will always match the empty sequence. On the other hand, suppose you want to find comments in a C program. Many people would first try '/%*.*%*/' (that is, a "/*" followed by a sequence of any characters followed by "*/", written with the appropriate escapes). However, because the '.*' expands as far as it can, the first "/*" in the program would close only with the last "*/":
test = "int x; /* x */ int y; /* y */"
print(string.gsub(test, "/%*.*%*/", "<COMMENT>"))
  --> int x; <COMMENT>
The pattern '.-', instead, will expand the least amount necessary to find the first "*/", so that you get your desired result:
test = "int x; /* x */ int y; /* y */"
print(string.gsub(test, "/%*.-%*/", "<COMMENT>"))
  --> int x; <COMMENT> int y; <COMMENT>
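The earlier remark about identifiers can be checked the same way. In this sketch (the subject string is only an example), the `-´ version stops after the first letter, unless something after it in the pattern forces it to expand:

```lua
local s = "count = 1"

-- Longest match: the whole identifier.
local i1, j1 = string.find(s, "[_%a][_%w]*")
print(i1, j1)    --> 1 5

-- Shortest match: '[_%w]-' is satisfied by the empty sequence.
local i2, j2 = string.find(s, "[_%a][_%w]-")
print(i2, j2)    --> 1 1

-- With a following '%s', the '-' item expands until a space is found.
local i3, j3 = string.find(s, "[_%a][_%w]-%s")
print(i3, j3)    --> 1 6
```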
The last modifier, `?´, matches an optional character. As an example, suppose we want to find an integer in a text, where the number may contain an optional sign. The pattern '[+-]?%d+' does the job, matching numerals like "-12", "23" and "+1009". The '[+-]' is a character class that matches either a `+´ or a `-´ sign; the following `?´ makes that sign optional.
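A short sketch of the signed-integer pattern applied to the three numerals mentioned above (the surrounding text in each subject string is invented):

```lua
local pat = "[+-]?%d+"

local a1, a2 = string.find("-12 degrees", pat)
print(a1, a2)    --> 1 3

local b1, b2 = string.find("23", pat)
print(b1, b2)    --> 1 2

-- The sign is included in the match when it is present.
local s = "temp +1009"
local c1, c2 = string.find(s, pat)
print(string.sub(s, c1, c2))    --> +1009
```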
Unlike some other systems, in Lua a modifier can only be applied to a character class; there is no way to group patterns under a modifier. For instance, there is no pattern that matches an optional word (unless the word has only one letter). Usually you can circumvent this limitation using some of the advanced techniques that we will see later.
If a pattern begins with a `^´, it will match only at the beginning of the subject string. Similarly, if it ends with a `$´, it will match only at the end of the subject string. These marks can be used both to restrict the patterns that you find and to anchor patterns. For instance, the test
if string.find(s, "^%d") then ...
checks whether the string s starts with a digit, and the test
if string.find(s, "^[+-]?%d+$") then ...
checks whether that string represents an integer number, without other leading or trailing characters.
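The second test can be packaged as a predicate. This is a minimal sketch; is_integer is a hypothetical helper name, and the comparison with nil simply turns the result of string.find into a boolean:

```lua
-- True only if the whole string is an optionally signed integer.
local function is_integer(s)
  return string.find(s, "^[+-]?%d+$") ~= nil
end

print(is_integer("42"))      --> true
print(is_integer("-7"))      --> true
print(is_integer("4.2"))     --> false
print(is_integer(" 42"))     --> false  (leading space)
print(is_integer(""))        --> false
```

Without the two anchors, the pattern would also accept strings such as "4.2", because a match anywhere in the subject would be enough.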
Another item in a pattern is '%b', which matches balanced strings. Such an item is written as '%bxy', where x and y are any two distinct characters; the x acts as an opening character and the y as the closing one. For instance, the pattern '%b()' matches parts of the string that start with a `(´ and finish at the respective `)´:
print(string.gsub("a (enclosed (in) parentheses) line", "%b()", ""))   --> a  line
Typically, this pattern is used as '%b()', '%b[]', '%b{}', or '%b<>', but you can use any characters as delimiters.
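A few more uses of '%b', as a sketch with made-up subject strings. Note that the two delimiter characters follow '%b' directly, with no escapes:

```lua
-- Find the first balanced parenthesized part, nested pair included.
local i, j = string.find("f(a, g(b)) + h(c)", "%b()")
print(i, j)        --> 2 10

-- Collapse a nested index expression to empty brackets.
local t = (string.gsub("x = t[a[1]] + 1", "%b[]", "[]"))
print(t)           --> x = t[] + 1

-- Strip simple angle-bracket tags.
local plain = (string.gsub("<b>bold</b> text", "%b<>", ""))
print(plain)       --> bold text
```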
Copyright © 2003-2004 Roberto Ierusalimschy. All rights reserved.