Using C++ regex Regular Expressions

ve12345

Tags: c/c++

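In C++ there are three regular expression libraries to choose from: C++ regex (std::regex from the <regex> header), C regex, and boost regex. When developing C++ on Windows, the latter two are not available by default, so C++ regex is clearly the most convenient option if you want to get started quickly. This article discusses how to use C++ regex.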

C++ regex provides three functions: regex_match, regex_search, and regex_replace.

regex_match

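regex_match checks whether a regular expression matches an entire target sequence. The example below illustrates it. For a systematic treatment, see the regex_match reference.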

// regex_match example
#include <iostream>
#include <string>
#include <regex>

int main ()
{

  if (std::regex_match ("subject", std::regex("(sub)(.*)") ))
    std::cout << "string literal matched\n";

  std::string s ("subject");
  std::regex e ("(sub)(.*)");
  if (std::regex_match (s,e))
    std::cout << "string object matched\n";

  if ( std::regex_match ( s.begin(), s.end(), e ) )
    std::cout << "range matched\n";

  std::cmatch cm;    // same as std::match_results<const char*> cm;
  std::regex_match ("subject",cm,e);
  std::cout << "string literal with " << cm.size() << " matches\n";

  std::smatch sm;    // same as std::match_results<string::const_iterator> sm;
  std::regex_match (s,sm,e);
  std::cout << "string object with " << sm.size() << " matches\n";

  std::regex_match ( s.cbegin(), s.cend(), sm, e);
  std::cout << "range with " << sm.size() << " matches\n";

  // using explicit flags:
  std::regex_match ( "subject", cm, e, std::regex_constants::match_default );

  std::cout << "the matches were: ";
  for (unsigned i=0; i<sm.size(); ++i) {
    std::cout << "[" << sm[i] << "] ";
  }

  std::cout << std::endl;

  return 0;
}

Output:

string literal matched
string object matched
range matched
string literal with 3 matches
string object with 3 matches
range with 3 matches
the matches were: [subject] [sub] [ject]

regex_search

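regex_search is the other matching function; an example follows. The main difference between the two is that regex_match succeeds only if the whole target sequence matches the pattern, whereas regex_search looks for a matching substring anywhere inside it. For a systematic treatment, see the regex_search reference.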

// regex_search example
#include <iostream>
#include <regex>
#include <string>

int main(){
  std::string s ("this subject has a submarine as a subsequence");
  std::smatch m;
  std::regex e ("\\b(sub)([^ ]*)");   // matches words beginning by "sub"

  std::cout << "Target sequence: " << s << std::endl;
  std::cout << "Regular expression: /\\b(sub)([^ ]*)/" << std::endl;
  std::cout << "The following matches and submatches were found:" << std::endl;

  while (std::regex_search (s,m,e)) {
    for (auto x=m.begin();x!=m.end();x++) 
      std::cout << x->str() << " ";
    std::cout << "--> ([^ ]*) match " << m.format("$2") <<std::endl;
    s = m.suffix().str();
  }
}

Output:

Target sequence: this subject has a submarine as a subsequence
Regular expression: /\b(sub)([^ ]*)/
The following matches and submatches were found:
subject sub ject --> ([^ ]*) match ject
submarine sub marine --> ([^ ]*) match marine
subsequence sub sequence --> ([^ ]*) match sequence

/********  ruthless divider  ********* /
  Author: 没有开花的树
  Blog: blog.csdn.net/mycwq
/ *******  ruthless copy  *********/

regex_replace

regex_replace replaces the text matched by a regular expression; an example follows. For a systematic treatment, see the regex_replace reference.

#include <cstring>   // std::strlen
#include <iostream>
#include <regex>
#include <string>

int main() {
    char buf[20];    // large enough for the 6-character result plus '\0'
    const char *first = "axayaz";
    const char *last = first + std::strlen(first);
    std::regex rx("a");
    std::string fmt("A");
    // format_first_only: replace only the first match
    std::regex_constants::match_flag_type fonly =
        std::regex_constants::format_first_only;

    // iterator overload: writes into buf and returns the end of the output
    *std::regex_replace(&buf[0], first, last, rx, fmt) = '\0';
    std::cout << &buf[0] << std::endl;

    *std::regex_replace(&buf[0], first, last, rx, fmt, fonly) = '\0';
    std::cout << &buf[0] << std::endl;

    // string overload: returns a new std::string
    std::string str("adaeaf");
    std::cout << std::regex_replace(str, rx, fmt) << std::endl;

    std::cout << std::regex_replace(str, rx, fmt, fonly) << std::endl;

    return 0;
}

Output:

AxAyAz
Axayaz
AdAeAf
Adaeaf

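The rules of C++ regex (whose default grammar is ECMAScript) are much the same as in other programming languages:

Special characters (for matching characters that are hard to write literally):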

characters	description	matches
`.`	not newline	any character except line terminators (LF, CR, LS, PS).
`\t`	tab (HT)	a horizontal tab character (same as `\u0009`).
`\n`	newline (LF)	a newline (line feed) character (same as `\u000A`).
`\v`	vertical tab (VT)	a vertical tab character (same as `\u000B`).
`\f`	form feed (FF)	a form feed character (same as `\u000C`).
`\r`	carriage return (CR)	a carriage return character (same as `\u000D`).
`\c`letter	control code	a control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. For example: `\ca` is the same as `\u0001`, `\cb` the same as `\u0002`, and so on...
`\x`hh	ASCII character	a character whose code unit value has an hex value equivalent to the two hex digits hh. For example: `\x4c` is the same as `L`, or `\x23` the same as `#`.
`\u`hhhh	unicode character	a character whose code unit value has an hex value equivalent to the four hex digits hhhh.
`\0`	null	a null character (same as `\u0000`).
`\`int	backreference	the result of the submatch whose opening parenthesis is the int-th (int shall begin by a digit other than `0`). See groups below for more info.
`\d`	digit	a decimal digit character
`\D`	not digit	any character that is not a decimal digit character
`\s`	whitespace	a whitespace character
`\S`	not whitespace	any character that is not a whitespace character
`\w`	word	an alphanumeric or underscore character
`\W`	not word	any character that is not an alphanumeric or underscore character
`\`character	character	the character character as it is, without interpreting its special meaning within a regex expression. Any character can be escaped except those which form any of the special character sequences above. Needed for: `^ $ \ . * + ? ( ) [ ] { } |`
`[`class`]`	character class	the target character is part of the class
`[^`class`]`	negated character class	the target character is not part of the class

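Note that in a C++ string literal the backslash (\) is itself an escape character, so every backslash in a pattern must be doubled:

std::regex e1 ("\\d");  //  \d -> matches a digit character
std::regex e2 ("\\\\"); //  \\ -> matches a backslash character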

Quantifiers:

characters	times	effects
`*`	0 or more	The preceding atom is matched 0 or more times.
`+`	1 or more	The preceding atom is matched 1 or more times.
`?`	0 or 1	The preceding atom is optional (matched either 0 times or once).
`{`int`}`	int	The preceding atom is matched exactly int times.
`{`int`,}`	int or more	The preceding atom is matched int or more times.
`{`min`,`max`}`	between min and max	The preceding atom is matched at least min times, but not more than max.

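Note that quantifiers are greedy by default: matching the pattern "(a+).*" against "aardvark" captures aa in the group, whereas "(a+?).*" captures only a, because appending ? to a quantifier makes it match as few characters as possible.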

Groups (for matching a sequence of several characters as a unit):

characters	description	effects
`(`subpattern`)`	Group	Creates a backreference.
`(?:`subpattern`)`	Passive group	Does not create a backreference.

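Note that the first form creates a backreference, which can be used to extract the matched text; the second form does not, and so avoids that overhead.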

characters	description	condition for match
`^`	Beginning of line	Either it is the beginning of the target sequence, or follows a line terminator.
`$`	End of line	Either it is the end of the target sequence, or precedes a line terminator.
`|`	Separator	Separates two alternative patterns or subpatterns.

Single characters

[abc] matches a, b, or c.
[^xyz] matches any character other than x, y, or z.

Ranges
[a-z] matches any lowercase letter (a, b, c, ..., z).
[abc1-5] matches a, b, c, or a digit from 1 to 5.

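C++ regex also supports a POSIX-like notation for character classes. These names are written inside a bracket expression, for example `[[:alpha:]]`:

class	description	equivalent (with regex_traits, default locale)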
`[:alnum:]`	alpha-numerical character	isalnum
`[:alpha:]`	alphabetic character	isalpha
`[:blank:]`	blank character	isblank
`[:cntrl:]`	control character	iscntrl
`[:digit:]`	decimal digit character	isdigit
`[:graph:]`	character with graphical representation	isgraph
`[:lower:]`	lowercase letter	islower
`[:print:]`	printable character	isprint
`[:punct:]`	punctuation mark character	ispunct
`[:space:]`	whitespace character	isspace
`[:upper:]`	uppercase letter	isupper
`[:xdigit:]`	hexadecimal digit character	isxdigit
`[:d:]`	decimal digit character	isdigit
`[:w:]`	word character	isalnum (plus the underscore `_`)
`[:s:]`	whitespace character	isspace