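Basic concepts of regular expressions
A regular expression is a pattern that describes a set of strings. Regular expressions are used to search and replace text.
Three regular-expression syntaxes are in common use today: “basic” (BRE), “extended” (ERE), and “perl” (PCRE). Perl regular expressions offer more features than the extended syntax, but not every platform supports all of them. For the Perl syntax, see man 3 pcrepattern. This article mainly uses the extended syntax.
For an introduction to regular-expression syntax, run man 1 grep in an Ubuntu terminal and read the “REGULAR EXPRESSIONS” section.
Summary of regular-expression metacharacters
The tables below are reproduced from MSDN's description of regular-expression metacharacters. They are a good summary and handy for quick reference.
Source: Regular Expression Syntax
Single-character metacharacters
A regular expression consists of ordinary characters (for example, the letters a through z) and special characters, known as metacharacters.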
Metacharacter | Behavior | Example |
---|---|---|
* | Matches the preceding character or subexpression zero or more times. Equivalent to {0,}. | zo* matches “z” and “zoo”. |
+ | Matches the preceding character or subexpression one or more times. Equivalent to {1,}. | zo+ matches “zo” and “zoo”, but not “z”. |
? | Matches the preceding character or subexpression zero or one time. Equivalent to {0,1}. When ? immediately follows any other quantifier (*, +, ?, {n}, {n,}, or {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible. The default greedy pattern matches as much of the searched string as possible. | zo? matches “z” and “zo”, but not “zoo”. o+? matches a single “o” in “oooo”, and o+ matches all “o”s. do(es)? matches the “do” in “do” or “does”. |
^ | Matches the position at the start of the searched string. If the m (multiline search) character is included with the flags, ^ also matches the position following \n or \r. When used as the first character in a bracket expression, ^ negates the character set. | ^\d{3} matches 3 numeric digits at the start of the searched string. [^abc] matches any character except a, b, and c. |
$ | Matches the position at the end of the searched string. If the m (multiline search) character is included with the flags, $ also matches the position before \n or \r. | \d{3}$ matches 3 numeric digits at the end of the searched string. |
. | Matches any single character except the newline character \n. To match any character including the \n, use a pattern like [\s\S]. | a.c matches “abc”, “a1c”, and “a-c”. |
[] | Marks the start and end of a bracket expression. | [1-4] matches “1”, “2”, “3”, or “4”. [^aAeEiIoOuU] matches any non-vowel character. |
{} | Marks the start and end of a quantifier expression. | a{2,3} matches “aa” and “aaa”. |
() | Marks the start and end of a subexpression. Subexpressions can be saved for later use. | A(\d) matches “A0” to “A9”. The digit is saved for later use. |
| | Indicates a choice between two or more items. | z|food matches “z” or “food”. (z|f)ood matches “zood” or “food”. |
/ | Denotes the start or end of a literal regular expression pattern in JScript. After the second “/”, single-character flags can be added to specify search behavior. | /abc/gi is a JScript literal regular expression that matches “abc”. The g (global) flag specifies to find all occurrences of the pattern, and the i (ignore case) flag makes the search case-insensitive. |
\ | Marks the next character as a special character, a literal, a backreference, or an octal escape. | \n matches a newline character. \( matches “(“. \\ matches “\”. |
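Notes
1. To match one of the special characters in the table above literally, you must escape it. For example, to match “+” use the regular expression “\+”, and to match “\” use “\\”.
2. Most metacharacters lose the special meaning described above when placed inside a bracket expression (such as [a-b]). To match the three characters ], ^, and - themselves inside a bracket expression: ] must come first (as in []abc]), because in any later position it pairs with [ and closes the expression; ^ must not come first; and - should come first or last (for example [-a-z] or [a-z-]).
3. Greedy versus non-greedy matching: a greedy quantifier matches as many characters as possible, and a non-greedy quantifier matches as few as possible. The typical greedy quantifiers are *, +, and {n,}; their non-greedy counterparts are *?, +?, and {n,}?. See the entry for ? in the table above.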
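The three notes above can be tried out with a short sketch. It assumes Python's `re` module, which follows the Perl-style syntax of the table rather than ERE; the sample strings are made up for illustration.

```python
import re

# Note 1: a backslash turns a metacharacter into a literal.
plus_hits = re.findall(r"\+", "1+1=2")
backslash_hits = re.findall(r"\\", r"C:\temp")

# Note 2: inside brackets, ] goes first, - goes last, and ^ is not first.
bracket_hits = re.findall(r"[]^-]", "a]b^c-d")

# Note 3: o+ is greedy and takes every "o"; o+? stops after the first one.
greedy = re.match(r"o+", "oooo").group()
lazy = re.match(r"o+?", "oooo").group()

print(plus_hits, backslash_hits, bracket_hits, greedy, lazy)
```

Raw strings (`r"..."`) keep Python from interpreting the backslashes before the regex engine sees them.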
Multi-character metacharacters
Metacharacter | Behavior | Example |
---|---|---|
\b | Matches a word boundary; that is, the position between a word and a space. | er\b matches the “er” in “never” but not the “er” in “verb”. |
\B | Matches a word non-boundary. | er\B matches the “er” in “verb” but not the “er” in “never”. |
\d | Matches a digit character. Equivalent to [0-9]. | In the searched string “12 345”, \d{2} matches “12” and “34”. \d matches “1”, “2”, “3”, “4”, and “5”. |
\D | Matches a nondigit character. Equivalent to [^0-9]. | \D+ matches “abc” and ” def” in “abc123 def”. |
\w | Matches any of the following characters: A-Z, a-z, 0-9, and underscore. Equivalent to [A-Za-z0-9_]. | In the searched string “The quick brown fox…”, \w+ matches “The”, “quick”, “brown”, and “fox”. |
\W | Matches any character except A-Z, a-z, 0-9, and underscore. Equivalent to [^A-Za-z0-9_]. | In the searched string “The quick brown fox…”, \W+ matches “…” and all of the spaces. |
[xyz] | A character set. Matches any one of the specified characters. | [abc] matches the “a” in “plain”. |
[^xyz] | A negative character set. Matches any character that is not specified. | [^abc] matches the “p”, “l”, “i”, and “n” in “plain”. |
[a-z] | A range of characters. Matches any character in the specified range. | [a-z] matches any lowercase alphabetical character in the range “a” through “z”. |
[^a-z] | A negative range of characters. Matches any character that is not in the specified range. | [^a-z] matches any character that is not in the range “a” through “z”. |
{n} | Matches exactly n times. n is a nonnegative integer. | o{2} does not match the “o” in “Bob”, but does match the two “o”s in “food”. |
{n,} | Matches at least n times. n is a nonnegative integer. * is equivalent to {0,}. + is equivalent to {1,}. | o{2,} does not match the “o” in “Bob” but does match all the “o”s in “foooood”. |
{n,m} | Matches at least n and at most m times. n and m are nonnegative integers, where n <= m. There cannot be a space between the comma and the numbers. ? is equivalent to {0,1}. | In the searched string “1234567”, \d{1,3} matches “123”, “456”, and “7”. |
(pattern) | Matches pattern and saves the match. You can retrieve the saved match from array elements returned by the exec Method in JScript. To match parentheses characters ( ), use “\(” or “\)”. | (Chapter|Section) [1-9] matches “Chapter 5”, and “Chapter” is saved for later use. |
(?:pattern) | Matches pattern but does not save the match; that is, the match is not stored for possible later use. This is useful for combining parts of a pattern with the “or” character (|). | industr(?:y|ies) is equivalent to industry|industries. |
(?=pattern) | Positive lookahead. After a match is found, the search for the next match starts before the matched text. The match is not saved for later use. | ^(?=.*\d).{4,8}$ applies a restriction that a password must be 4 to 8 characters long, and must contain at least one digit. Within the pattern, .*\d finds any number of characters followed by a digit. For the searched string “abc3qr”, this matches “abc3”. Starting before instead of after that match, .{4,8} matches a 4-8 character string. This matches “abc3qr”. The ^ and $ specify the positions at the start and end of the searched string. This is to prevent a match if the searched string contains any characters outside of the matched characters. |
(?!pattern) | Negative lookahead. Matches a search string that does not match pattern. After a match is found, the search for the next match starts before the matched text. The match is not saved for later use. | \b(?!th)\w+\b matches words that do not start with “th”. Within the pattern, \b matches a word boundary. For the searched string ” quick “, this matches the first space. (?!th) matches a string that is not “th”. This matches “qu”. Starting before that match, \w+ matches a word. This matches “quick”. |
\cx | Matches the control character indicated by x. The value of x must be in the range of A-Z or a-z. If it is not, c is assumed to be a literal “c” character. | \cM matches a CTRL+M or carriage return character. |
\xn | Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. Allows ASCII codes to be used in regular expressions. | \x41 matches “A”. \x041 is equivalent to “\x04” followed by “1”, (because n must be exactly 2 digits). |
\num | Matches num, where num is a positive integer. This is a reference to saved matches. | (.)\1 matches two consecutive identical characters. |
\n | Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7). | (\d)\1 matches two consecutive identical digits. |
\nm | Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captured subexpressions, n is a backreference followed by literal m. If neither of those conditions exist, \nm matches octal escape value nm when n and m are octal digits (0-7). | \11 matches a tab character. |
\nml | Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7). | \011 matches a tab character. |
\un | Matches n, where n is a Unicode character expressed as four hexadecimal digits. | \u00A9 matches the copyright symbol (©). |
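A few rows of the table can be checked directly. The following is a minimal sketch in Python's `re` module (an assumption; the table itself documents JScript), reusing the example strings from the table.

```python
import re

# {n} and {n,m}: findall scans left to right and never overlaps matches.
pairs = re.findall(r"\d{2}", "12 345")
chunks = re.findall(r"\d{1,3}", "1234567")

# \b: "er" at the end of a word versus "er" inside a word.
at_end = re.search(r"er\b", "never") is not None
inside = re.search(r"er\b", "verb") is not None

# (?:...) groups without capturing, so findall returns the whole matches.
forms = re.findall(r"industr(?:y|ies)", "industry and industries")

# (?=.*\d) requires a digit without consuming text; .{4,8} checks the length.
password = re.compile(r"^(?=.*\d).{4,8}$")
checks = {s: bool(password.search(s)) for s in ("abc3qr", "abcdef", "ab3")}

print(pairs, chunks, at_end, inside, forms, checks)
```

Here "abc3qr" passes, "abcdef" fails for lack of a digit, and "ab3" fails for being too short.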
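Additional notes:
1. Subexpressions and backreferences: enclosing part of a regular expression in parentheses defines a subexpression, which can later be referenced as \1, \2, \3, and so on for the first, second, and third subexpression. For example, take the text “This is a block of of text, several words here are are repeated, and and they should not be.” The regular expression [ ]+(\w+)[ ]+\1 matches consecutive repeated words: [ ]+ matches one or more spaces, (\w+) is a subexpression that matches a word, and \1 refers back to that subexpression, so “of of”, “are are”, and “and and” are all found. A backreference guarantees that the later text is identical to the earlier match, because it refers to whatever the subexpression actually matched. Some implementations use $ rather than \ to reference subexpressions, and in some implementations \0 refers to the entire match.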
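The repeated-word pattern can be run as is. This sketch assumes Python's `re` module; the second pattern, which adds a capture group for the leading spaces so the duplicates can be removed, is an illustrative variation and not part of the original text.

```python
import re

text = ("This is a block of of text, several words here are are "
        "repeated, and and they should not be.")

# findall returns the contents of group 1 for each match.
repeated = re.findall(r"[ ]+(\w+)[ ]+\1", text)
print(repeated)

# Variation: capture the leading spaces too, then keep one copy of the word.
# In the replacement string, \1 is the spaces and \2 is the word.
cleaned = re.sub(r"([ ]+)(\w+)[ ]+\2", r"\1\2", text)
print(cleaned)
```

In Python the whole match is written `\g<0>` in a replacement string rather than `\0`.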
Nonprinting characters in regular expressions
Character | Matches | Equivalent to |
---|---|---|
\f | Form-feed character. | \x0c and \cL |
\n | Newline character. | \x0a and \cJ |
\r | Carriage-return character. | \x0d and \cM |
\s | Any white-space character. This includes space, tab, and form feed. | [ \f\n\r\t\v] |
\S | Any non–white space character. | [^ \f\n\r\t\v] |
\t | Tab character. | \x09 and \cI |
\v | Vertical tab character. | \x0b and \cK |
Operator precedence in regular expressions
Operator or operators | Description |
---|---|
\ | Escape |
(), (?:), (?=), [] | Parentheses and brackets |
*, +, ?, {n}, {n,}, {n,m} | Quantifiers |
^, $, \anymetacharacter | Anchors and sequences |
| | Alternation |
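The effect of the precedence order is easiest to see by comparing a pattern with and without parentheses. A minimal sketch, assuming Python's `re` module; the helper `whole` is a hypothetical name introduced here.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Alternation binds loosest: z|food means "z" or "food".
alt_plain = whole(r"z|food", "zood")
alt_grouped = whole(r"(z|f)ood", "zood")

# Quantifiers bind tighter than sequences: in ab+ the + applies to b only.
rep_plain = whole(r"ab+", "abab")
rep_grouped = whole(r"(ab)+", "abab")

print(alt_plain, alt_grouped, rep_plain, rep_grouped)
```

Parentheses sit near the top of the table, so they are the tool for overriding the default grouping.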
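POSIX character classes
POSIX character classes are an extension of bracket expressions: they group characters into categories and give each category a name.
Class | Description |
---|---|
[:alnum:] | Any letter or digit; equivalent to [a-zA-Z0-9] |
[:alpha:] | Any letter; equivalent to [a-zA-Z] |
[:blank:] | Space or tab; equivalent to [ \t] |
[:cntrl:] | ASCII control characters, that is, codes 0 through 31, plus 127 |
[:digit:] | Any digit; equivalent to [0-9] |
[:graph:] | Same as [:print:], but excluding the space |
[:lower:] | Any lowercase letter; equivalent to [a-z] |
[:print:] | Any printable character (in ASCII, codes 32 through 126) |
[:punct:] | Any printable character that is neither a space nor in [:alnum:] |
[:space:] | Any whitespace character, including the space; equivalent to [ \f\n\r\t\v] |
[:upper:] | Any uppercase letter; equivalent to [A-Z] |
[:xdigit:] | Any hexadecimal digit; equivalent to [a-fA-F0-9] |
Note:
A POSIX character class is written between [: and :], and those brackets are part of the class name. Inside a pattern the class must therefore appear within a bracket expression, for example [[:alnum:]].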
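Not every engine supports these class names. Python's `re` module, for one, does not, so the sketch below expands a few of them into the ASCII ranges from the table before compiling. The lookup table `POSIX_ASCII` and the helper `expand_posix` are hypothetical names introduced for illustration, and the expansion is only valid inside a bracket expression.

```python
import re

# Hypothetical lookup table: ASCII equivalents of some POSIX classes.
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "a-fA-F0-9",
}

def expand_posix(pattern: str) -> str:
    """Replace each [:name:] with its ASCII range (raises KeyError if unknown)."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

# [[:xdigit:]] becomes [a-fA-F0-9]; the outer brackets stay in place.
hex_color = expand_posix(r"#[[:xdigit:]]{6}")
colors = re.findall(hex_color, "color: #1A2b3C; #GGGGGG")
print(hex_color, colors)
```

Tools such as grep accept `[[:xdigit:]]` directly, so no expansion is needed there.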
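Converting letter case with regular expressions
Some regular-expression implementations let you convert letter case with the metacharacters in the table below.
Metacharacter | Description |
---|---|
\E | Ends a \L or \U conversion |
\l | Converts the next character to lowercase |
\L | Converts all characters between \L and \E to lowercase |
\u | Converts the next character to uppercase |
\U | Converts all characters between \U and \E to uppercase |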
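Python's `re` module is one implementation that does not support these metacharacters in replacement strings. A similar effect can be sketched with a replacement function; the HTML snippet and the variable names below are made up for illustration.

```python
import re

html = "<h1>Welcome to my homepage</h1>"

# Rough equivalent of replacing with \1\U\2\E\3: uppercase only group 2.
shouted = re.sub(
    r"(<h1>)(.*?)(</h1>)",
    lambda m: m.group(1) + m.group(2).upper() + m.group(3),
    html,
)

# Rough equivalent of \u: uppercase the first letter of each word.
titled = re.sub(r"\b[a-z]", lambda m: m.group().upper(), "hello world")

print(shouted, titled)
```

The function receives the match object, so any string method can be applied to the captured text before it is put back.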
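Lookaround in regular expressions
A lookaround defines a pattern that must match but is not returned as part of the result. “Ahead” and “behind” describe where the assertion looks relative to the current match position: a lookahead inspects the text to the right (what comes next), and a lookbehind inspects the text to the left (what came before).
Lookahead: this is simply a subexpression that begins with ?=, with the text to match following the =. The syntax is (?=pattern).
Example:
Text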
http://www.forta.com/
ftp://ftp.fforta.com/
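The regular expression .+(?=:) matches http and ftp. (?=:) defines a lookahead that matches the : without returning it in the result, so the whole expression returns the characters that precede the :.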
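A minimal sketch of this example, assuming Python's `re` module, with the plain pattern `.+:` included for contrast:

```python
import re

text = "http://www.forta.com/\nftp://ftp.fforta.com/"

# The lookahead requires a ":" next but leaves it out of the match.
schemes = re.findall(r".+(?=:)", text)

# Without the lookahead, the ":" is consumed and returned.
with_colon = re.findall(r".+:", text)

print(schemes, with_colon)
```

Because `.` does not match a newline, each line of the text is handled separately.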
Lookbehind: this is simply a subexpression that begins with ?<=, with the text to match following the <=. The syntax is (?<=pattern).
Example:
Text
ABC01: $23.45
HGG42: $5.31
CFMX1: $899.00
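The regular expression (?<=\$)[0-9.]+ matches 23.45, 5.31, and 899.00. The $ must be escaped here, because an unescaped $ is the end-of-string anchor.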
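A minimal sketch of the price example, assuming Python's `re` module, with the same pattern minus the lookbehind for contrast:

```python
import re

text = "ABC01: $23.45\nHGG42: $5.31\nCFMX1: $899.00"

# Only runs of digits and dots that directly follow a literal "$".
prices = re.findall(r"(?<=\$)[0-9.]+", text)

# Without the lookbehind, digits inside the product codes match too.
all_numbers = re.findall(r"[0-9.]+", text)

print(prices, all_numbers)
```

The lookbehind filters by context without adding the "$" to the returned text.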
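Negative lookaround: lookaround is really a positioning tool. Matching a given pattern fixes a position in the text, and the match then proceeds forward or backward from that position. The forms shown so far are called positive lookahead and positive lookbehind.
A less common form is negative lookaround. A negative lookahead succeeds only where the text ahead does not match the given pattern, and a negative lookbehind succeeds only where the text behind does not match it.
Operator | Description |
---|---|
(?=pattern) | Positive lookahead |
(?!pattern) | Negative lookahead |
(?<=pattern) | Positive lookbehind |
(?<!pattern) | Negative lookbehind |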
Example:
Text
I paid $30 for 100 apples,
50 oranges, and 60 pears.
I saved $5 on this order.
The positive lookbehind (?<=\$)\d+ matches 30 and 5.
The negative lookbehind \b(?<!\$)\d+\b matches 100, 50, and 60.
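Both patterns can be compared side by side. The sketch below assumes Python's `re` module and adds a third pattern without \b to show why the word boundaries matter.

```python
import re

text = ("I paid $30 for 100 apples,\n"
        "50 oranges, and 60 pears.\n"
        "I saved $5 on this order.")

# Positive lookbehind: numbers that directly follow "$".
prices = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers that do not follow "$".
quantities = re.findall(r"\b(?<!\$)\d+\b", text)

# Without \b the engine can start inside "30", right after the "3".
no_boundary = re.findall(r"(?<!\$)\d+", text)

print(prices, quantities, no_boundary)
```

The stray "0" in the last result comes from "$30": the digit "0" is preceded by "3" rather than "$", so the negative lookbehind alone does not reject it.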