Regular Expression Study Guide (3): Characters

wushuai1346

Category: Regular Expressions. Tags: regular expressions, character, regex, compiler, search, insert

Literal Characters

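The most basic regular expression consists of a single literal character, e.g. a. It matches the first occurrence of that character in the string. If the string is "Jack is a boy", it matches the a after the J. The fact that this a sits in the middle of a word does not matter to the regex engine. If it matters to you, you need to tell the engine so by using word boundaries. We will get to that later.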

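This regex can match the second a too, but only when you tell the engine to keep searching the rest of the string after the first match. In a text editor, you can do this with the "Find Next" or "Find Previous" function. In a programming language, there is usually a dedicated function you can call to continue searching after the previous match.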
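As a minimal sketch in Python (the language and its re module are assumptions made for illustration; the text above does not prescribe either), re.search returns only the first match, while re.finditer keeps searching after each previous match:

```python
import re

text = "Jack is a boy"

# re.search stops at the first match: the "a" right after the "J".
first = re.search(r"a", text)
first_pos = first.start()

# re.finditer resumes the search after each previous match.
all_pos = [m.start() for m in re.finditer(r"a", text)]

print(first_pos)  # 1
print(all_pos)    # [1, 8]
```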

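Similarly, the regex cat matches cat in "About cats and dogs". This regular expression consists of three literal characters. It is like telling the regex engine: find a c, immediately followed by an a, immediately followed by a t.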

Note that regex engines are case sensitive by default unless you tell them otherwise: cat does not match Cat.
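The two points above can be sketched in Python, assuming its re module. The re.IGNORECASE flag is Python's way of switching off case sensitivity; other tools expose this option differently.

```python
import re

text = "About cats and dogs"

# The three literals c, a, t must appear consecutively.
m = re.search(r"cat", text)
matched = m.group(0)   # "cat"
start = m.start()      # 6, the position of the "c" in "cats"

# Matching is case sensitive by default ...
case_sensitive = re.search(r"cat", "Cat")                    # None
# ... unless a flag such as re.IGNORECASE says otherwise.
case_insensitive = re.search(r"cat", "Cat", re.IGNORECASE)   # matches "Cat"
```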

Special Characters

Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 11 characters with special meanings: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ). These special characters are often called "metacharacters".

If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.

Note that 1+1=2, with the backslash omitted, is a valid regex. So you will not get an error message. But it will not match 1+1=2. It would match 111=2 in 123+111=234, due to the special meaning of the plus character.
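A short sketch of this difference, using Python's re module as an assumed stand-in for the flavors discussed here:

```python
import re

# Escaped: the plus sign is a literal character.
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped: the plus sign means "one or more of the preceding 1".
unescaped_literal = re.search(r"1+1=2", "1+1=2")      # no match
unescaped_digits = re.search(r"1+1=2", "123+111=234")  # matches "111=2"

print(escaped.group(0))           # 1+1=2
print(unescaped_literal)          # None
print(unescaped_digits.group(0))  # 111=2
```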

If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
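How the error surfaces depends on the tool. As an illustration in Python (an assumption, since the text names no particular language here), compiling the pattern raises re.error:

```python
import re

try:
    re.compile(r"+1")        # a quantifier with nothing before it to repeat
    compiled = True
    message = ""
except re.error as exc:
    compiled = False
    message = str(exc)       # the engine's description of the problem

# Escaping the plus sign makes the pattern valid again.
fixed = re.search(r"\+1", "x+1")
```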

Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like {1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. An exception to this rule is the java.util.regex package: it requires all literal braces to be escaped.

All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d will match a single digit from 0 to 9.

Escaping a single metacharacter with a backslash works in all regular expression flavors. Many flavors also support the \Q...\E escape sequence. All the characters between the \Q and the \E are interpreted as literal characters. E.g. \Q*\d+*\E matches the literal text *\d+*. The \E may be omitted at the end of the regex, so \Q*\d+* is the same as \Q*\d+*\E. This syntax is supported by the JGsoft engine, Perl, PCRE and Java, both inside and outside character classes. However, in Java, this feature does not work correctly in JDK 1.4 and 1.5 when used in a character class or followed by a quantifier.
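Python's re module is not among the flavors listed above and does not support \Q...\E. As a sketch of a comparable approach in that language (swapped in for illustration, not the same mechanism), re.escape adds a backslash in front of every metacharacter of a string so that it can be used as a literal pattern:

```python
import re

literal = r"*\d+*"             # the literal text from the example above

pattern = re.escape(literal)   # backslash-escapes each metacharacter
found = re.search(pattern, r"see *\d+* here")

print(found.group(0) == literal)  # True
```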

Special Characters and Programming Languages

If you are a programmer, you may be surprised that characters like the single quote and double quote are not special characters. That is correct. When using a regular expression or grep tool like PowerGREP or the search function of a text editor like EditPad Pro, you should not escape or repeat the quote characters like you do in a programming language.

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler, before the regex library sees the string. So the regex 1\+1=2 must be written as "1\\+1=2" in C++ code. The C++ compiler will turn the escaped backslash in the source code into a single backslash in the string that is passed on to the regex library. To match c:\temp, you need to use the regex c:\\temp. As a string in C++ source code, this regex becomes "c:\\\\temp". Four backslashes to match a single one indeed.
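The same two layers of escaping exist in other languages. As a sketch in Python rather than C++ (an assumption made for illustration), an ordinary string literal needs the doubled backslashes, while a raw string literal passes the regex to the library unchanged:

```python
import re

# Ordinary string literals: the language halves each pair of backslashes.
plus_regular = "1\\+1=2"
path_regular = "c:\\\\temp"

# Raw string literals: backslashes reach the regex library as written.
plus_raw = r"1\+1=2"
path_raw = r"c:\\temp"

same_plus = plus_regular == plus_raw   # both hold the regex 1\+1=2
same_path = path_regular == path_raw   # both hold the regex c:\\temp

# The subject string below contains a single backslash: cd c:\temp
m = re.search(path_raw, "cd c:\\temp")
```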

See the tools and languages section of this help file for more information on how to use regular expressions in various programming languages.

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.
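A small sketch of these tokens in Python's re module (an assumed example flavor). The sample strings are made up, and the pattern \r?\n relies on the question mark metacharacter to make the carriage return optional:

```python
import re

windows_text = "first\r\nsecond\r\nthird"
unix_text = "first\nsecond\nthird"

# \r?\n accepts both the Windows (\r\n) and the UNIX (\n) line terminator.
line_break = re.compile(r"\r?\n")

windows_lines = line_break.split(windows_text)
unix_lines = line_break.split(unix_text)

# \t matches a tab character (0x09).
columns = re.split(r"\t", "name\tage")

print(windows_lines == unix_lines)  # True
print(columns)                      # ['name', 'age']
```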

You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9. Another way to search for a tab is to use \x09. Note that the leading zero is required.
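As an illustration in Python (an assumed flavor; with text strings, its \xHH token denotes the Unicode code point of that value, which coincides with Latin-1 for 0x00 through 0xFF), both hexadecimal escapes behave as described. The sample strings are made up:

```python
import re

# \xA9 finds the copyright symbol (written here as the escape \u00a9).
copyright_match = re.search(r"\xA9", "\u00a9 Example Corp")

# \x09 is another way to write the tab character.
tab_match = re.search(r"\x09", "a\tb")

print(copyright_match.start())  # 0
print(tab_match.start())        # 1
```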

Most regex flavors also support the tokens \cA through \cZ to insert ASCII control characters. The letter after the backslash is always a lowercase c. The second letter is an uppercase letter A through Z, to indicate Control+A through Control+Z. These are equivalent to \x01 through \x1A (26 decimal). E.g. \cM matches a carriage return, just like \r and \x0D. In XML Schema regular expressions, \c is a shorthand character class that matches any character allowed in an XML name.
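Python's re module is one flavor that does not accept the \cA through \cZ tokens, so the sketch below only illustrates the arithmetic behind them: the control code is the letter's ASCII code minus 0x40. The helper name control_char is hypothetical.

```python
import re

def control_char(letter: str) -> str:
    """Return the character for Control+<letter>, with letter in A..Z."""
    # 'A' is 0x41 and Control+A is 0x01, so the offset is 0x40.
    return chr(ord(letter) - 0x40)

ctrl_m = control_char("M")   # 0x4D - 0x40 = 0x0D, the carriage return

# \r and \x0D both match the character that \cM stands for.
matches_r = re.fullmatch(r"\r", ctrl_m) is not None
matches_hex = re.fullmatch(r"\x0D", ctrl_m) is not None
```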

If your regular expression engine supports Unicode, use \uFFFF rather than \xFF to insert a Unicode character. The euro currency sign occupies code point 0x20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with \u20AC.
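A minimal sketch in Python 3, whose re module accepts \uXXXX in text patterns (the sample string is made up for illustration):

```python
import re

# The raw string keeps \u20AC intact so the regex engine interprets it.
euro = re.search(r"\u20AC", "Price: \u20ac10")

print(euro.start())  # 7
```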