RE2: A C++ Regular Expression Library in Practice

Introduction to RE2

RE2

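RE2 is an efficient, principled regular expression library implemented in C++ by Rob Pike and Russ Cox, two well-known engineers at Google. Both are also leading figures behind the Go language, and Go's regexp package is a Go implementation of the same design.

RE2 is a fast, safe, thread-friendly alternative to the backtracking regular expression engines used in PCRE, Perl, and Python. RE2 supports Linux and most Unix platforms, but not Windows (you can hack it yourself if you need to).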

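Features of RE2

Backtracking engines are typically full of features and convenient syntactic sugar, but even small inputs can force them into exponential running time. RE2 uses automata theory to guarantee that a regular expression search runs in time linear in the size of the input. RE2 also implements memory limits, so a search can be constrained to a fixed amount of memory. RE2 is engineered to use a small, fixed C++ stack footprint no matter what input or regular expression it has to process, which makes it useful in multithreaded environments where thread stacks cannot grow arbitrarily large.

On large inputs, RE2 is usually much faster than backtracking engines. Because it is built on automata theory, it can apply optimizations that the other engines cannot.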
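The linear-time claim can be made concrete with a toy model. The sketch below is not RE2's implementation; it is a minimal illustration under the assumption that a pattern is just a list of required or optional literal characters. The `Item` struct and all function names are invented here. It runs the classic pathological pattern `a?a?...a?aa...a` against a string of `a`s twice: once with a take-first backtracking search, and once by tracking the set of reachable pattern positions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One pattern element: a literal that is either required (c) or optional (c?).
struct Item {
    char c;
    bool optional;
};

// Backtracking matcher: tries "take" first, then "skip" for optional items.
// `steps` counts calls so the blow-up is visible.
bool backtrack(const std::vector<Item>& pat, std::size_t p,
               const std::string& text, std::size_t t, long long& steps) {
    ++steps;
    if (p == pat.size()) return t == text.size();
    if (t < text.size() && text[t] == pat[p].c &&
        backtrack(pat, p + 1, text, t + 1, steps)) return true;
    return pat[p].optional && backtrack(pat, p + 1, text, t, steps);
}

// State-set simulation: after each input character, remember every pattern
// position that is reachable. Work per character is bounded by pat.size().
bool simulate(const std::vector<Item>& pat, const std::string& text,
              long long& steps) {
    const std::size_t n = pat.size();
    std::vector<char> cur(n + 1, 0), next(n + 1, 0);
    // An optional item may be skipped without consuming input.
    auto skip_optionals = [&](std::vector<char>& s) {
        for (std::size_t p = 0; p < n; ++p) {
            ++steps;
            if (s[p] && pat[p].optional) s[p + 1] = 1;
        }
    };
    cur[0] = 1;
    skip_optionals(cur);
    for (char ch : text) {
        std::fill(next.begin(), next.end(), 0);
        for (std::size_t p = 0; p < n; ++p) {
            ++steps;
            if (cur[p] && pat[p].c == ch) next[p + 1] = 1;
        }
        skip_optionals(next);
        cur.swap(next);
    }
    return cur[n] != 0;
}

// Builds n optional 'a's followed by n required 'a's.
std::vector<Item> pathological(int n) {
    std::vector<Item> pat;
    for (int i = 0; i < n; ++i) pat.push_back({'a', true});
    for (int i = 0; i < n; ++i) pat.push_back({'a', false});
    return pat;
}
```

Calling both functions with `pathological(16)` and a text of sixteen `a`s returns true in each case, but the backtracking step counter grows roughly as 2^n while the simulation grows as n times the pattern length.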

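Unlike most automata-based engines, RE2 implements almost all of the common Perl and PCRE features and syntactic sugar. It finds the leftmost-first match, the same match Perl would find, and it can return submatch information. The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.

For users who want a simpler syntax, RE2 has a POSIX mode that accepts only the POSIX egrep operators and implements leftmost-longest overall matching.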


¹ Technical note: there's a difference between submatches and backreferences. Submatches let you find out what certain subexpressions matched after the match is over, so that you can find out, after matching dogcat against (cat|dog)(cat|dog), that \1 is dog and \2 is cat. Backreferences let you use those subexpressions during the match, so that (cat|dog)\1 matches catcat and dogdog but not catdog or dogcat.

RE2 supports submatch extraction, but not backreferences.

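If you absolutely need backreferences or generalized assertions, which RE2 does not support, you can take a look at irregexp, the regular expression engine of Google Chrome.

Getting Started with RE2

Installation

You can download a release tarball, unpack it, and install from there. This section describes another way to install.

You need Mercurial SCM and a C++ compiler (g++ or a compatible one) installed.

Download the code and install it: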


    hg clone http://re2.googlecode.com/hg re2
    cd re2
    make test
    sudo make install
    make testinstall

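On BSD systems, use gmake instead of make.

Using the RE2 library

To build a C++ application with RE2, include the re2/re2.h header in your code and add -lre2 when linking, plus -lpthread in multithreaded environments.

Syntax

In POSIX mode, RE2 accepts standard POSIX (egrep) regular expression syntax. In Perl mode, RE2 accepts most Perl operators. The only exceptions are the ones that require backtracking (with its potential for exponential running time) to implement. These include backreferences (submatches are still supported) and generalized assertions. RE2 defaults to Perl mode.

C++ high-level interface

There are two basic operations:

  • RE2::FullMatch: requires the regexp to match the entire input text.
  • RE2::PartialMatch: looks for a match on a substring of the input text. It returns the leftmost-longest match in POSIX mode and, in Perl mode, the same (leftmost-first) match Perl would have chosen.

For example,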

vi re2_high_interface_test.cc

#include <re2/re2.h>
#include <iostream>
#include <assert.h>

int
main(void)
{
    assert(RE2::FullMatch("hello", "h.*o"));
    assert(!RE2::FullMatch("hello", "e"));

    assert(RE2::PartialMatch("hello", "h.*o"));
    assert(RE2::PartialMatch("hello", "e"));

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

 g++ -o re2_high_interface_test re2_high_interface_test.cc -lre2

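Run re2_high_interface_test; the program runs normally and prints Ok.

Submatch extraction

Both matching functions take additional arguments in which the submatches are stored. Each argument is a pointer to a string, an integer type, or a StringPiece. A StringPiece is a pointer into the original input text together with a length. It behaves somewhat like a string, but it does not own its storage. As with any pointer, you must be careful not to use a StringPiece after the original text has been deleted or has gone out of scope.

Example: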

vi re2_submatch_ex_test.cc

#include <re2/re2.h>
#include <iostream>
#include <string>
#include <assert.h>

int
main(void)
{
    int i;
    std::string s;
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s, &i));
    assert(s == "ruby");
    assert(i == 1234);

    // Fails: "ruby" cannot be parsed as an integer.
    assert(!RE2::FullMatch("ruby", "(.+)", &i));

    // Success; does not extract the number.
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s));

    // Success; skips NULL argument.
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", (void*)NULL, &i));

    // Fails: integer overflow keeps value from being stored in i.
    assert(!RE2::FullMatch("ruby:123456789123", "(\\w+):(\\d+)", &s, &i));

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

g++ -o re2_submatch_ex_test re2_submatch_ex_test.cc -lre2

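Pre-compiled regular expressions

The examples above compile the regular expression on every call. Instead, you can compile it once, store it in an RE2 object, and reuse that object on every call.

Example: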

vi re2_prec_re_test.cc

#include <re2/re2.h>
#include <iostream>
#include <string>
#include <assert.h>

int
main(void)
{
    int i;
    std::string s;
    RE2 re("(\\w+):(\\d+)");
    assert(re.ok());  // compiled; if not, see re.error();

    assert(RE2::FullMatch("ruby:1234", re, &s, &i));
    assert(RE2::FullMatch("ruby:1234", re, &s));
    assert(RE2::FullMatch("ruby:1234", re, (void*)NULL, &i));
    assert(!RE2::FullMatch("ruby:123456789123", re, &s, &i));

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

g++ -o re2_prec_re_test re2_prec_re_test.cc -lre2

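Options

The RE2 constructor takes an optional second argument that changes RE2's default options. For example, the predefined Quiet option suppresses the error message that is normally printed when a regular expression fails to parse: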

vi re2_options_test.cc

#include <re2/re2.h>
#include <iostream>
#include <assert.h>

int
main(void)
{
    RE2 re("(ab", RE2::Quiet);  // don't write to stderr for parser failure
    assert(!re.ok());  // can check re.error() for details

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

g++ -o re2_options_test re2_options_test.cc -lre2

Other useful predefined options are Latin1 (disable UTF-8) and POSIX (use POSIX syntax and leftmost-longest matching).

You can also define your own RE2::Options object and configure it. All of the options are listed in re2/re2.h.

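Unicode normalization

RE2 operates on Unicode code points and makes no attempt at normalization. For example, the regular expression /ü/ (U+00FC, u with diaeresis) does not match the input "ü" (U+0075 U+0308, u followed by a combining diaeresis). Normalization is a long, involved topic. If you need such matches, the simplest solution is to normalize both the regular expression and the input in a preprocessing step before using RE2. For more details on the topic, see http://www.unicode.org/reports/tr15/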

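Additional tips and tricks

For advanced usage, such as constructing your own argument lists, using RE2 as a lexer, or parsing hexadecimal, octal, and C-radix numbers, see re2.h.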

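"Backtracking" versus "non-backtracking"

The following pictures are taken from the slides of the talk "sregex: matching Perl 5 regexes on data streams".

What backtracking means

The backtracking implementation

Rob Pike's algorithm

Thompson's construction algorithm

Wrappers for RE2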

An Inferno wrapper is at http://code.google.com/p/inferno-re2/.

A Python wrapper is at http://github.com/facebook/pyre2/.

A Ruby wrapper is at http://github.com/axic/rre2/.

An Erlang wrapper is at http://github.com/tuncer/re2/.

A Perl wrapper is at http://search.cpan.org/~dgl/re-engine-RE2-0.05/lib/re/engine/RE2.pm.

An Eiffel wrapper is at http://sourceforge.net/projects/eiffelre2/.

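Syntax supported by RE2

This section lists the regular expression syntax accepted by RE2. It also lists syntax accepted by PCRE, PERL, and VIM. Syntax that RE2 does not support is marked (NOT SUPPORTED).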

 
Single characters:
.           any character, including newline (s=true)
[xyz]       character class
[^xyz]      negated character class
\d          Perl character class
\D          negated Perl character class
[:alpha:]   ASCII character class
[:^alpha:]  negated ASCII character class
\pN         Unicode character class (one-letter name)
\p{Greek}   Unicode character class
\PN         negated Unicode character class (one-letter name)
\P{Greek}   negated Unicode character class
 
Composites:
xy          x followed by y
x|y         x or y (prefer x)
 
Repetitions:
x*          zero or more x, prefer more
x+          one or more x, prefer more
x?          zero or one x, prefer one
x{n,m}      n or n+1 or ... or m x, prefer more
x{n,}       n or more x, prefer more
x{n}        exactly n x
x*?         zero or more x, prefer fewer
x+?         one or more x, prefer fewer
x??         zero or one x, prefer zero
x{n,m}?     n or n+1 or ... or m x, prefer fewer
x{n,}?      n or more x, prefer fewer
x{n}?       exactly n x
x{}         (≡ x*) (NOT SUPPORTED) VIM
x{-}        (≡ x*?) (NOT SUPPORTED) VIM
x{-n}       (≡ x{n}?) (NOT SUPPORTED) VIM
x=          (≡ x?) (NOT SUPPORTED) VIM
 
Possessive repetitions:
x*+         zero or more x, possessive (NOT SUPPORTED)
x++         one or more x, possessive (NOT SUPPORTED)
x?+         zero or one x, possessive (NOT SUPPORTED)
x{n,m}+     n or ... or m x, possessive (NOT SUPPORTED)
x{n,}+      n or more x, possessive (NOT SUPPORTED)
x{n}+       exactly n x, possessive (NOT SUPPORTED)
 
Grouping:
(re)           numbered capturing group
(?P<name>re)   named & numbered capturing group
(?<name>re)    named & numbered capturing group (NOT SUPPORTED)
(?'name're)    named & numbered capturing group (NOT SUPPORTED)
(?:re)         non-capturing group
(?flags)       set flags within current group; non-capturing
(?flags:re)    set flags during re; non-capturing
(?#text)       comment (NOT SUPPORTED)
(?|x|y|z)      branch numbering reset (NOT SUPPORTED)
(?>re)         possessive match of re (NOT SUPPORTED)
re@>           possessive match of re (NOT SUPPORTED) VIM
%(re)          non-capturing group (NOT SUPPORTED) VIM
 
Flags:
i           case-insensitive (default false)
m           multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
s           let . match \n (default false)
U           ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)
Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z).
 
Empty strings:
^           at beginning of text or line (m=true)
$           at end of text (like \z not \Z) or line (m=true)
\A          at beginning of text
\b          at word boundary (\w on one side and \W, \A, or \z on the other)
\B          not a word boundary
\G          at beginning of subtext being searched (NOT SUPPORTED) PCRE
\G          at end of last match (NOT SUPPORTED) PERL
\Z          at end of text, or before newline at end of text (NOT SUPPORTED)
\z          at end of text
(?=re)      before text matching re (NOT SUPPORTED)
(?!re)      before text not matching re (NOT SUPPORTED)
(?<=re)     after text matching re (NOT SUPPORTED)
(?<!re)     after text not matching re (NOT SUPPORTED)
re&         before text matching re (NOT SUPPORTED) VIM
re@=        before text matching re (NOT SUPPORTED) VIM
re@!        before text not matching re (NOT SUPPORTED) VIM
re@<=       after text matching re (NOT SUPPORTED) VIM
re@<!       after text not matching re (NOT SUPPORTED) VIM
\zs         sets start of match (= \K) (NOT SUPPORTED) VIM
\ze         sets end of match (NOT SUPPORTED) VIM
\%^         beginning of file (NOT SUPPORTED) VIM
\%$         end of file (NOT SUPPORTED) VIM
\%V         on screen (NOT SUPPORTED) VIM
\%#         cursor position (NOT SUPPORTED) VIM
\%'m        mark m position (NOT SUPPORTED) VIM
\%23l       in line 23 (NOT SUPPORTED) VIM
\%23c       in column 23 (NOT SUPPORTED) VIM
\%23v       in virtual column 23 (NOT SUPPORTED) VIM
 
Escape sequences:
\a          bell (≡ \007)
\f          form feed (≡ \014)
\t          horizontal tab (≡ \011)
\n          newline (≡ \012)
\r          carriage return (≡ \015)
\v          vertical tab character (≡ \013)
\*          literal *, for any punctuation character *
\123        octal character code (up to three digits)
\x7F        hex character code (exactly two digits)
\x{10FFFF}  hex character code
\C          match a single byte even in UTF-8 mode
\Q...\E     literal text ... even if ... has punctuation
 
\1          backreference (NOT SUPPORTED)
\b          backspace (NOT SUPPORTED) (use \010)
\cK         control char ^K (NOT SUPPORTED) (use \001 etc)
\e          escape (NOT SUPPORTED) (use \033)
\g1         backreference (NOT SUPPORTED)
\g{1}       backreference (NOT SUPPORTED)
\g{+1}      backreference (NOT SUPPORTED)
\g{-1}      backreference (NOT SUPPORTED)
\g{name}    named backreference (NOT SUPPORTED)
\g<name>    subroutine call (NOT SUPPORTED)
\g'name'    subroutine call (NOT SUPPORTED)
\k<name>    named backreference (NOT SUPPORTED)
\k'name'    named backreference (NOT SUPPORTED)
\lX         lowercase X (NOT SUPPORTED)
\ux         uppercase x (NOT SUPPORTED)
\L...\E     lowercase text ... (NOT SUPPORTED)
\K          reset beginning of $0 (NOT SUPPORTED)
\N{name}    named Unicode character (NOT SUPPORTED)
\R          line break (NOT SUPPORTED)
\U...\E     upper case text ... (NOT SUPPORTED)
\X          extended Unicode sequence (NOT SUPPORTED)
 
\%d123       decimal character 123 (NOT SUPPORTED) VIM
\%xFF        hex character FF (NOT SUPPORTED) VIM
\%o123       octal character 123 (NOT SUPPORTED) VIM
\%u1234      Unicode character 0x1234 (NOT SUPPORTED) VIM
\%U12345678  Unicode character 0x12345678 (NOT SUPPORTED) VIM
 
Character class elements:
x           single character
A-Z         character range (inclusive)
\d          Perl character class
[:foo:]     ASCII character class foo
\p{Foo}     Unicode character class Foo
\pF         Unicode character class F (one-letter name)
 
Named character classes as character class elements:
[\d]          digits (≡ \d)
[^\d]         not digits (≡ \D)
[\D]          not digits (≡ \D)
[^\D]         not not digits (≡ \d)
[[:name:]]    named ASCII class inside character class (≡ [:name:])
[^[:name:]]   named ASCII class inside negated character class (≡ [:^name:])
[\p{Name}]    named Unicode property inside character class (≡ \p{Name})
[^\p{Name}]   named Unicode property inside negated character class (≡ \P{Name})
 
Perl character classes:
\d          digits (≡ [0-9])
\D          not digits (≡ [^0-9])
\s          whitespace (≡ [\t\n\f\r ])
\S          not whitespace (≡ [^\t\n\f\r ])
\w          word characters (≡ [0-9A-Za-z_])
\W          not word characters (≡ [^0-9A-Za-z_])
 
\h          horizontal space (NOT SUPPORTED)
\H          not horizontal space (NOT SUPPORTED)
\v          vertical space (NOT SUPPORTED)
\V          not vertical space (NOT SUPPORTED)
 
ASCII character classes:
[:alnum:]   alphanumeric (≡ [0-9A-Za-z])
[:alpha:]   alphabetic (≡ [A-Za-z])
[:ascii:]   ASCII (≡ [\x00-\x7F])
[:blank:]   blank (≡ [\t ])
[:cntrl:]   control (≡ [\x00-\x1F\x7F])
[:digit:]   digits (≡ [0-9])
[:graph:]   graphical (≡ [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[:lower:]   lower case (≡ [a-z])
[:print:]   printable (≡ [ -~] == [ [:graph:]])
[:punct:]   punctuation (≡ [!-/:-@[-`{-~])
[:space:]   whitespace (≡ [\t\n\v\f\r ])
[:upper:]   upper case (≡ [A-Z])
[:word:]    word characters (≡ [0-9A-Za-z_])
[:xdigit:]  hex digit (≡ [0-9A-Fa-f])
 
Unicode character class names--general category:
C           other
Cc          control
Cf          format
Cn          unassigned code points (NOT SUPPORTED)
Co          private use
Cs          surrogate
L           letter
LC          cased letter (NOT SUPPORTED)
L&          cased letter (NOT SUPPORTED)
Ll          lowercase letter
Lm          modifier letter
Lo          other letter
Lt          titlecase letter
Lu          uppercase letter
M           mark
Mc          spacing mark
Me          enclosing mark
Mn          non-spacing mark
N           number
Nd          decimal number
Nl          letter number
No          other number
P           punctuation
Pc          connector punctuation
Pd          dash punctuation
Pe          close punctuation
Pf          final punctuation
Pi          initial punctuation
Po          other punctuation
Ps          open punctuation
S           symbol
Sc          currency symbol
Sk          modifier symbol
Sm          math symbol
So          other symbol
Z           separator
Zl          line separator
Zp          paragraph separator
Zs          space separator
 
Unicode character class names--scripts:
Arabic               Arabic
Armenian             Armenian
Balinese             Balinese
Bengali              Bengali
Bopomofo             Bopomofo
Braille              Braille
Buginese             Buginese
Buhid                Buhid
Canadian_Aboriginal  Canadian Aboriginal
Carian               Carian
Cham                 Cham
Cherokee             Cherokee
Common               characters not specific to one script
Coptic               Coptic
Cuneiform            Cuneiform
Cypriot              Cypriot
Cyrillic             Cyrillic
Deseret              Deseret
Devanagari           Devanagari
Ethiopic             Ethiopic
Georgian             Georgian
Glagolitic           Glagolitic
Gothic               Gothic
Greek                Greek
Gujarati             Gujarati
Gurmukhi             Gurmukhi
Han                  Han
Hangul               Hangul
Hanunoo              Hanunoo
Hebrew               Hebrew
Hiragana             Hiragana
Inherited            inherit script from previous character
Kannada              Kannada
Katakana             Katakana
Kayah_Li             Kayah Li
Kharoshthi           Kharoshthi
Khmer                Khmer
Lao                  Lao
Latin                Latin
Lepcha               Lepcha
Limbu                Limbu
Linear_B             Linear B
Lycian               Lycian
Lydian               Lydian
Malayalam            Malayalam
Mongolian            Mongolian
Myanmar              Myanmar
New_Tai_Lue          New Tai Lue (aka Simplified Tai Lue)
Nko                  Nko
Ogham                Ogham
Ol_Chiki             Ol Chiki
Old_Italic           Old Italic
Old_Persian          Old Persian
Oriya                Oriya
Osmanya              Osmanya
Phags_Pa             'Phags Pa
Phoenician           Phoenician
Rejang               Rejang
Runic                Runic
Saurashtra           Saurashtra
Shavian              Shavian
Sinhala              Sinhala
Sundanese            Sundanese
Syloti_Nagri         Syloti Nagri
Syriac               Syriac
Tagalog              Tagalog
Tagbanwa             Tagbanwa
Tai_Le               Tai Le
Tamil                Tamil
Telugu               Telugu
Thaana               Thaana
Thai                 Thai
Tibetan              Tibetan
Tifinagh             Tifinagh
Ugaritic             Ugaritic
Vai                  Vai
Yi                   Yi
 
Vim character classes:
\i          identifier character (NOT SUPPORTED) VIM
\I          \i except digits (NOT SUPPORTED) VIM
\k          keyword character (NOT SUPPORTED) VIM
\K          \k except digits (NOT SUPPORTED) VIM
\f          file name character (NOT SUPPORTED) VIM
\F          \f except digits (NOT SUPPORTED) VIM
\p          printable character (NOT SUPPORTED) VIM
\P          \p except digits (NOT SUPPORTED) VIM
\s          whitespace character (≡ [ \t]) (NOT SUPPORTED) VIM
\S          non-white space character (≡ [^ \t]) (NOT SUPPORTED) VIM
\d          digits (≡ [0-9]) VIM
\D          not \d VIM
\x          hex digits (≡ [0-9A-Fa-f]) (NOT SUPPORTED) VIM
\X          not \x (NOT SUPPORTED) VIM
\o          octal digits (≡ [0-7]) (NOT SUPPORTED) VIM
\O          not \o (NOT SUPPORTED) VIM
\w          word character VIM
\W          not \w VIM
\h          head of word character (NOT SUPPORTED) VIM
\H          not \h (NOT SUPPORTED) VIM
\a          alphabetic (NOT SUPPORTED) VIM
\A          not \a (NOT SUPPORTED) VIM
\l          lowercase (NOT SUPPORTED) VIM
\L          not lowercase (NOT SUPPORTED) VIM
\u          uppercase (NOT SUPPORTED) VIM
\U          not uppercase (NOT SUPPORTED) VIM
\_x         \x plus newline, for any x (NOT SUPPORTED) VIM
 
Vim flags:
\c          ignore case (NOT SUPPORTED) VIM
\C          match case (NOT SUPPORTED) VIM
\m          magic (NOT SUPPORTED) VIM
\M          nomagic (NOT SUPPORTED) VIM
\v          verymagic (NOT SUPPORTED) VIM
\V          verynomagic (NOT SUPPORTED) VIM
\Z          ignore differences in Unicode combining characters (NOT SUPPORTED) VIM
 
Magic:
(?{code})             arbitrary Perl code (NOT SUPPORTED) PERL
(??{code})            postponed arbitrary Perl code (NOT SUPPORTED) PERL
(?n)                  recursive call to regexp capturing group n (NOT SUPPORTED)
(?+n)                 recursive call to relative group +n (NOT SUPPORTED)
(?-n)                 recursive call to relative group -n (NOT SUPPORTED)
(?C)                  PCRE callout (NOT SUPPORTED) PCRE
(?R)                  recursive call to entire regexp (≡ (?0)) (NOT SUPPORTED)
(?&name)              recursive call to named group (NOT SUPPORTED)
(?P=name)             named backreference (NOT SUPPORTED)
(?P>name)             recursive call to named group (NOT SUPPORTED)
(?(cond)true|false)   conditional branch (NOT SUPPORTED)
(?(cond)true)         conditional branch (NOT SUPPORTED)
(*ACCEPT)             make regexps more like Prolog (NOT SUPPORTED)
(*COMMIT)             (NOT SUPPORTED)
(*F)                  (NOT SUPPORTED)
(*FAIL)               (NOT SUPPORTED)
(*MARK)               (NOT SUPPORTED)
(*PRUNE)              (NOT SUPPORTED)
(*SKIP)               (NOT SUPPORTED)
(*THEN)               (NOT SUPPORTED)
(*ANY)                set newline convention (NOT SUPPORTED)
(*ANYCRLF)            (NOT SUPPORTED)
(*CR)                 (NOT SUPPORTED)
(*CRLF)               (NOT SUPPORTED)
(*LF)                 (NOT SUPPORTED)
(*BSR_ANYCRLF)        set \R convention (NOT SUPPORTED) PCRE
(*BSR_UNICODE)        (NOT SUPPORTED) PCRE
 

Further reading

  1. "perlre - Perl regular expressions" http://perldoc.perl.org/perlre.html

  2. "Implementing Regular Expressions" http://swtch.com/~rsc/regexp

  3. The re1 project: http://code.google.com/p/re1

  4. The re2 project: http://code.google.com/p/re2

  5. sregex: A non-backtracking regex engine matching on data streams

  6. sregex: matching Perl 5 regexes on data streams: http://agentzh.org/misc/slides/yapc-na-2013-sregex.pdf
