Regular Expressions
Overview
ICU's Regular Expressions package provides applications with the ability to apply regular expression matching to Unicode string data. The regular expression patterns and behavior are based on Perl's regular expressions. The C++ programming API for using ICU regular expressions is loosely based on the JDK 1.4 package java.util.regex, with some extensions to adapt it for use in a C++ environment. A plain C API is also provided.
The ICU Regular Expression API supports operations including testing for a pattern match, searching for a pattern match, and replacing matched text. Capture groups allow subranges within an overall match to be identified, and to appear within replacement text.
A Perl-inspired split() function that breaks a string into fields based on a delimiter pattern is also included.
ICU Regular Expressions conform to Unicode Technical Standard #18, Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2.
A detailed description of regular expression patterns and pattern matching behavior is not included in this user guide. The best reference for this topic is the book "Mastering Regular Expressions, Second Edition" by Jeffrey E. F. Friedl, O'Reilly & Associates; 2nd edition (July 15, 2002). Matching behavior can sometimes be surprising, and this book is highly recommended for anyone doing significant work with regular expressions.
Using ICU Regular Expressions
The ICU C++ Regular Expression API includes two classes, RegexPattern and RegexMatcher, that parallel the classes from the Java JDK package java.util.regex. A RegexPattern represents a compiled regular expression, while a RegexMatcher associates a RegexPattern with an input string to be matched, and provides API for the various find, match and replace operations. In most cases, however, only the class RegexMatcher is needed, and the existence of class RegexPattern can safely be ignored.
The first step in using a regular expression is typically the creation of a RegexMatcher object from the source (string) form of the regular expression.
RegexMatcher holds a pre-processed (compiled) pattern and a reference to an input string to be matched, and provides API for the various find, match and replace operations. RegexMatchers can be reset and reused with new input, thus avoiding object creation overhead when performing the same matching operation repeatedly on different strings.
The following code will create a RegexMatcher from a string containing a regular expression, and then perform a simple find() operation.
    #include <unicode/regex.h>

    UErrorCode status = U_ZERO_ERROR;
    ...
    RegexMatcher *matcher = new RegexMatcher("abc+", 0, status);
    if (U_FAILURE(status)) {
        // Handle any syntax errors in the regular expression here
        ...
    }

    UnicodeString stringToTest = "Find the abc in this string";
    matcher->reset(stringToTest);
    if (matcher->find()) {
        // We found a match.
        int startOfMatch = matcher->start(status);   // string index of start of match.
        ...
    }
Several types of matching tests are available
Function | Description |
---|---|
matches() | True if the pattern matches the entire string, from the start through to the last character. |
lookingAt() | True if the pattern matches at the start of the string. The match need not include the entire string. |
find() | True if the pattern matches somewhere within the string. Successive calls to find() will find additional matches, until the string is exhausted. |
If additional text is to be checked for a match with the same pattern, there is no need to create a new matcher object; just reuse the existing one.
    myMatcher->reset(anotherString);
    if (myMatcher->matches(status)) {
        // We have a match with the new string.
    }
Note that matching happens directly in the string supplied by the application. This reduces the overhead when resetting a matcher to an absolute minimum - the matcher need only store a reference to the new string - but it does mean that the application must be careful not to modify or delete the string while the matcher is holding a reference to the string.
After finding a match, additional information is available about the range of the input matched, and the contents of any capture groups. Note that, for simplicity, any error parameters have been omitted. See the API reference for a complete description of the API.
Function | Description |
---|---|
start() | Return the index of the start of the matched region in the input string. |
end() | Return the index of the first character following the match. |
group() | Return a UnicodeString containing the text that was matched. |
start(n) | Return the index of the start of the text matched by the nth capture group. |
end(n) | Return the index of the first character following the text matched by the nth capture group. |
group(n) | Return a UnicodeString containing the text that was matched by the nth capture group. |
Regular Expression Metacharacters
Character | outside of sets | [inside sets] | Description |
---|---|---|---|
\a | ✓ | ✓ | Match a BELL, \u0007 |
\A | ✓ | | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. |
\b | ✓ | | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. For better word boundaries, see ICU Boundary Analysis. |
\B | ✓ | | Match if the current position is not a word boundary. |
\cX | ✓ | ✓ | Match a control-X character. |
\d | ✓ | ✓ | Match any character with the Unicode General Category of Nd (Number, Decimal Digit.) |
\D | ✓ | ✓ | Match any character that is not a decimal digit. |
\e | ✓ | ✓ | Match an ESCAPE, \u001B. |
\E | ✓ | ✓ | Terminates a \Q ... \E quoted sequence. |
\f | ✓ | ✓ | Match a FORM FEED, \u000C. |
\G | ✓ | | Match if the current position is at the end of the previous match. |
\n | ✓ | ✓ | Match a LINE FEED, \u000A. |
\N{UNICODE CHARACTER NAME} | ✓ | ✓ | Match the named character. |
\p{UNICODE PROPERTY NAME} | ✓ | ✓ | Match any character with the specified Unicode Property. |
\P{UNICODE PROPERTY NAME} | ✓ | ✓ | Match any character not having the specified Unicode Property. |
\Q | ✓ | ✓ | Quotes all following characters until \E. |
\r | ✓ | ✓ | Match a CARRIAGE RETURN, \u000D. |
\s | ✓ | ✓ | Match a white space character. White space is defined as [\t\n\f\r\p{Z}]. |
\S | ✓ | ✓ | Match a non-white space character. |
\t | ✓ | ✓ | Match a HORIZONTAL TABULATION, \u0009. |
\uhhhh | ✓ | ✓ | Match the character with the hex value hhhh. |
\Uhhhhhhhh | ✓ | ✓ | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff. |
\w | ✓ | ✓ | Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. |
\W | ✓ | ✓ | Match a non-word character. |
\x{hhhh} | ✓ | ✓ | Match the character with hex value hhhh. From one to six hex digits may be supplied. |
\xhh | ✓ | ✓ | Match the character with two digit hex value hh |
\X | ✓ | | Match a Grapheme Cluster. |
\Z | ✓ | | Match if the current position is at the end of input, but before the final line terminator, if one exists. |
\z | ✓ | | Match if the current position is at the end of input. |
\n | ✓ | | Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= total number of capture groups in the pattern. |
\0ooo | ✓ | ✓ | Match an Octal character. 'ooo' is from one to three octal digits. 0377 is the largest allowed Octal character. The leading zero is required; it distinguishes Octal constants from back references. |
[pattern] | ✓ | ✓ | Match any one character from the set. |
. | ✓ | | Match any character. |
^ | ✓ | | Match at the beginning of a line. |
$ | ✓ | | Match at the end of a line. |
\ | ✓ | | Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . |
\ | | ✓ | Quotes the following character. Characters that must be quoted to be treated as literals are [ ] \ ; characters that may need to be quoted, depending on the context, are - & |
Regular Expression Operators
Operator | Description |
---|---|
| | Alternation. A|B matches either A or B. |
* | Match 0 or more times. Match as many times as possible. |
+ | Match 1 or more times. Match as many times as possible. |
? | Match zero or one times. Prefer one. |
{n} | Match exactly n times |
{n,} | Match at least n times. Match as many times as possible. |
{n,m} | Match between n and m times. Match as many times as possible, but not more than m. |
*? | Match 0 or more times. Match as few times as possible. |
+? | Match 1 or more times. Match as few times as possible. |
?? | Match zero or one times. Prefer zero. |
{n}? | Match exactly n times |
{n,}? | Match at least n times, but no more than required for an overall pattern match |
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n. |
*+ | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match). |
++ | Match 1 or more times. Possessive match. |
?+ | Match zero or one times. Possessive match. |
{n}+ | Match exactly n times |
{n,}+ | Match at least n times. Possessive Match. |
{n,m}+ | Match between n and m times. Possessive Match. |
( ... ) | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. |
(?: ... ) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. |
(?> ... ) | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>". |
(?# ... ) | Free-format comment (?# comment ). |
(?= ... ) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
(?! ... ) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
(?<= ... ) | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) |
(?<! ... ) | Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) |
(?ismwx-ismwx: ... ) | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. |
(?ismwx-ismwx) | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. |
Set Expressions (Character Classes)
Example | Description |
---|---|
[abc] | Match any of the characters a, b or c |
[^abc] | Negation - match any character except a, b or c |
[A-M] | Range - match any character from A to M. The characters to include are determined by Unicode code point ordering. |
[\u0000-\U0010ffff] | Range - match all characters. |
[\p{Letter}] [\p{General_Category=Letter}] [\p{L}] | Characters with Unicode Category = Letter. All forms shown are equivalent. |
[\P{Letter}] | Negated property. (Upper case \P) Match everything except Letters. |
[\p{numeric_value=9}] | Match all numbers with a numeric value of 9. Any Unicode Property may be used in set expressions. |
[\p{Letter}&&\p{script=cyrillic}] | Logical AND or intersection. Match the set of all Cyrillic letters. |
[\p{Letter}--\p{script=latin}] | Subtraction. Match all non-Latin letters. |
[[a-z][A-Z][0-9]] [a-zA-Z0-9] | Implicit Logical OR or Union of Sets. The examples match ASCII letters and digits. The two forms are equivalent. |
[:script=Greek:] | Alternate POSIX-like syntax for properties. Equivalent to \p{script=Greek} |
Case Insensitive Matching
- Anything from a regular expression pattern that looks like a literal string (even of one character) will be matched against the text using full case folding. The pattern string and the matched text may be of different lengths.
- Any sequence that is composed by the matching engine from originally separate parts of the pattern will not match with the composition boundary within a case folding expansion of the text being matched.
- Matching of [set expressions] uses simple matching. A [set] will match exactly one code point from the text.
Examples:
- pattern "fussball" will match "fußball" or "fussball"
- pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL" but not "fußball"
- pattern "ß" will find occurrences of "ss" or "ß"
- pattern "s+" will not find "ß"
Replacement Text
The replacement text for find-and-replace operations may contain references to capture-group text from the find. References are of the form $n, where n is the number of the capture group.
Character | Descriptions |
---|---|
$n | The text of capture group n will be substituted for $n. n must be >= 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $. |
\ | Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for '$' and '\', but may be used on any other character without bad effects. |
Flag Options
The following flags control various aspects of regular expression matching. The flag values may be specified at the time that an expression is compiled into a RegexPattern object, or they may be specified within the pattern itself using the (?ismwx-ismwx) pattern options.
The UREGEX_CANON_EQ option is not yet available.
Flag (pattern) | Flag (API Constant) | Description |
---|---|---|
(none) | UREGEX_CANON_EQ | If set, matching will take the canonical equivalence of characters into account. NOTE: this flag is not yet implemented. |
i | UREGEX_CASE_INSENSITIVE | If set, matching will take place in a case-insensitive manner. |
x | UREGEX_COMMENTS | If set, allow use of white space and #comments within patterns |
s | UREGEX_DOTALL | If set, a "." in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behaves as a single line terminator, and will match a single "." in a RE pattern. |
m | UREGEX_MULTILINE | Controls the behavior of "^" and "$" in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, "^" and "$" will also match at the start and end of each line within the input text. |
w | UREGEX_UWORD | Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters. |
Using split()
ICU's split() function is similar in concept to Perl's - it will split a string into fields, with a regular expression match defining the field delimiters and the text between the delimiters being the field content itself.
Suppose you have a string of words separated by spaces
    UnicodeString s = "dog cat giraffe";
This code will extract the individual words from the string.
    UErrorCode status = U_ZERO_ERROR;
    RegexMatcher m("\\s+", 0, status);
    const int maxWords = 10;
    UnicodeString words[maxWords];
    int numWords = m.split(s, words, maxWords, status);
After the split(),
Variable | value |
---|---|
numWords | 3 |
words[0] | “dog” |
words[1] | “cat” |
words[2] | “giraffe” |
words[3 to 9] | “” |
The field delimiters, the spaces from the original string, do not appear in the output strings.
Note that, in this example, “words” is a local, or stack array of actual UnicodeString objects. No heap allocation is involved in initializing this array of empty strings (C++ is not Java!). Local UnicodeString arrays like this are a very good fit for use with split(); after extracting the fields, any values that need to be kept in some more permanent way can be copied to their ultimate destination.
If the number of fields in a string being split exceeds the capacity of the destination array, the last destination string will contain all of the input string data that could not be split, including any embedded field delimiters. This is similar to split() in Perl.
If the pattern expression contains capturing parentheses, the captured data ($1, $2, etc.) will also be saved in the destination array, interspersed with the fields themselves.
If, in the “dog cat giraffe” example, the pattern had been “(\s+)” instead of “\s+”, split() would have produced five output strings instead of three. words[1] and words[3] would have been the spaces.
Find and Replace
Description of appendReplacement() and appendTail(). To be added.
Performance Tips
Consider the pattern
(A+)+B
matched against the input
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
The expression can't match - there is no 'B' in the input - but the engine is too dumb to realize that, and will try every possible way of distributing the input between the terms of the expression before failing.
Some suggestions:
- Avoid, or examine carefully, any expressions with nested repeating quantifiers, like in the example above. They can often be recast in some other way. Any ambiguity in how input text could be distributed between the terms of the expression will cause problems.
- Narrow every term in a pattern to match as small a set of characters as possible at each point. Fail as early as possible with bad input, rather than letting broad .* style terms eat intermediate input and relying on later terms in the expression to produce a failure.
- Use possessive quantifiers when possible - *+ instead of *, ++ instead of +
These operators prevent backtracking; the initial match of a *+ qualified pattern is either used in its entirety as part of the complete match, or it is not used at all.
- Follow or surround * or + expressions with terms that the repeated expression can not match. The idea is to have only one possible way to match the input, with no possibility of redistributing the input between adjacent terms of the pattern.
- Avoid overly long and complex regular expressions. Just because it's possible to do something completely in one large expression doesn't mean that you should. Long expressions are difficult to understand and can be almost impossible to debug when they go wrong. It is no sin to break a parsing problem into pieces and to have some code involved in the process.
- Set a time limit. ICU includes the ability to limit the time spent on a regular expression match. This is a good idea when running untested expressions from users of your application, or as a fail safe for servers or other processes that cannot afford to be hung.
Examples from actual bug reports,
The initial part of the expression can be recast as
A further note: this expression was intended to parse email addresses, and has a number of other flaws. For common tasks like this there are libraries of freely available regular expressions that have been well debugged. It's worth making a quick search before writing a new expression.
Heap and Stack Usage
Differences with Java Regular Expressions
- ICU does not support UREGEX_CANON_EQ. See http://bugs.icu-project.org/trac/ticket/9111
- The behavior of \cx (Control-X) is different from Java when x is outside the range A-Z. See http://bugs.icu-project.org/trac/ticket/6068
- Java allows quantifiers (*, +, etc) on zero length tests. ICU does not. Occurrences of these in patterns are most likely unintended user errors, but it is an incompatibility with Java. http://bugs.icu-project.org/trac/ticket/6080
- ICU recognizes all Unicode properties known to ICU, which is all of them. Java is restricted to just a few.
- ICU case insensitive matching works with all Unicode characters, and, within string literals, does full Unicode matching (where matching strings may be different lengths.) Java does ASCII.
- ICU has an extended syntax for set [bracket] expressions, including additional operators. Added for improved compatibility with the original ICU implementation, which was based on ICU UnicodeSet pattern syntax.
- ICU does not support named capture groups. http://bugs.icu-project.org/trac/ticket/5312