Regex Extractor: C++11 New Features, Part 7 - Regular Expressions

Latest recommended article published on 2023-07-11 21:11:29

weixin_39960019

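C++11 added regular expression support to the standard library. This article gives a brief introduction to using regular expressions in C++.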

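Using regular expressions in C++ is not very different from using them in other languages. The facilities are declared in the <regex> header, in namespace std.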

#include <iostream>
#include <regex>
using namespace std;

int main() {
  regex e("abc*");
  bool m = regex_search("abccc", e);

  // Prints "yes"
  cout << (m ? "yes" : "no") << endl;
}
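The example above uses regex_search, which succeeds when any part of the input matches. The standard library also offers regex_match (not used elsewhere in this article), which succeeds only when the whole input matches. A minimal sketch of the difference; the helper names contains_match and is_full_match are hypothetical and exist only for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if some substring of `text` matches `pattern`.
bool contains_match(const std::string& text, const std::string& pattern) {
  return std::regex_search(text, std::regex(pattern));
}

// Hypothetical helper: true only if the whole of `text` matches `pattern`.
bool is_full_match(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}
```

With the pattern abc*, contains_match("xxabcccyy", "abc*") is true while is_full_match("xxabcccyy", "abc*") is false, because the surrounding characters are not covered by the pattern.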

C++11 ships with support for six regular expression grammars:

ECMAScript
basic
extended
awk
grep
egrep

C++11 uses the ECMAScript grammar by default, which is also the most powerful of the six. To use one of the other five, specify it when constructing the regex object:

regex e("^a.", regex_constants::grep);
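To see that the grammar flag changes how a pattern is read, consider the character +. In the ECMAScript and extended grammars it is a "one or more" quantifier, while in the POSIX basic grammar it is an ordinary character. The sketch below assumes a hypothetical helper, matches_with, that compiles a pattern under a chosen grammar and tests for a full match:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: full-match `text` against `pattern`,
// compiling the pattern with the given grammar flag.
bool matches_with(const std::string& text, const std::string& pattern,
                  std::regex_constants::syntax_option_type grammar) {
  std::regex e(pattern, grammar);
  return std::regex_match(text, e);
}
```

Under these assumptions, matches_with("abbb", "ab+", std::regex_constants::ECMAScript) is true, whereas with std::regex_constants::basic the same pattern matches only the literal text ab+.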

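Suppose we want to know not only whether a regular expression matches a string, but also which parts matched. For example, to extract the user name and the domain from an e-mail address, we need match_results: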

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str("Email a@bc.com abc");

  // Same as match_results<string::const_iterator>
  smatch m;

  // The dot is escaped so it matches a literal '.'
  regex e(
    "([[:w:]]+)@([[:w:]]+\\.com)"
  );
  bool found = regex_search(
    str, m, e
  );
  if (!found) return 1;

  // m.size=3: three results are stored
  cout << "m.size="
    << m.size() << endl;

  /* Iterate over match_results; prints
  m[0]=a@bc.com (the whole match)
  m[1]=a (group 1)
  m[2]=bc.com (group 2)
  */
  for (size_t n = 0; n < m.size(); n++) {
    cout << "m[" << n << "]="
      << m[n].str() << endl;
    // Equivalent: m.str(n), *(m.begin()+n)
  }

  // m.prefix=Email (followed by a space)
  cout << "m.prefix="
    << m.prefix().str() << endl;

  // m.suffix= abc
  cout << "m.suffix="
    << m.suffix().str() << endl;
}
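The extraction above can be wrapped in a small function that also handles the case where nothing matches. This is only a sketch: the name split_email and its out-parameter interface are hypothetical, and the pattern is the article's with the dot escaped.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first e-mail-like token in `text` and
// stores its two capture groups. Returns false when nothing matches,
// in which case `user` and `domain` are left untouched.
bool split_email(const std::string& text,
                 std::string& user, std::string& domain) {
  static const std::regex e("([[:w:]]+)@([[:w:]]+\\.com)");
  std::smatch m;
  if (!std::regex_search(text, m, e)) {
    return false;  // no match: m holds no sub-matches
  }
  user = m[1].str();    // first group: text before '@'
  domain = m[2].str();  // second group: text after '@'
  return true;
}
```

Calling split_email("Email a@bc.com abc", user, domain) returns true and sets user to a and domain to bc.com. Checking the return value before reading the groups avoids silently working with empty strings.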

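Suppose several substrings of the target string match the regular expression and we want to find all of them, for example when a string contains several e-mail addresses. For that we need regex_iterator: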

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str(
    "a@bc.com, d@ef.com, aa@bb.com"
  );

  // The dot is escaped so it matches a literal '.'
  regex e(
    "([[:w:]]+)@([[:w:]]+\\.com)"
  );

  sregex_iterator pos(
    str.cbegin(), str.cend(), e
  ); // Define the regex_iterator

  // C++ convention: a default-constructed iterator marks the end of the sequence
  sregex_iterator end;

/*
email=a@bc.com, user=a, domain=bc.com
email=d@ef.com, user=d, domain=ef.com
email=aa@bb.com, user=aa, domain=bb.com
*/
  for (; pos != end; ++pos) {
    cout << "email=" << pos->str(0)
      << ", user=" << pos->str(1)
      << ", domain=" << pos->str(2)
      << endl;
  }
}
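Instead of printing inside the loop, the matches can be collected into a container for later use. A minimal sketch, assuming a hypothetical helper named all_users that returns group 1 of every match:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the user part (group 1) of every
// e-mail-like token found in `text`, in order of appearance.
std::vector<std::string> all_users(const std::string& text) {
  static const std::regex e("([[:w:]]+)@([[:w:]]+\\.com)");
  std::vector<std::string> users;
  // A default-constructed sregex_iterator marks the end of the sequence.
  for (std::sregex_iterator it(text.cbegin(), text.cend(), e), end;
       it != end; ++it) {
    users.push_back(it->str(1));
  }
  return users;
}
```

Note that the iterator refers to both the string and the regex object, so both have to outlive it; here the regex is a function-local static and the string is the caller's argument.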

As shown above, regex_iterator simply iterates over the match_results of every match of the regular expression in the string.

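In addition, C++ provides another iterator, regex_token_iterator. The difference is that regex_token_iterator iterates over a specified sub-expression of every match, or over the substrings that did not match.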

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str(
    "a@bc.com, d@ef.com, aa@bb.com"
  );

  // The dot is escaped so it matches a literal '.'
  regex e(
    "([[:w:]]+)@([[:w:]]+\\.com)"
  );

  sregex_token_iterator pos(
    str.cbegin(), str.cend(), e
  ); // Define the regex_token_iterator
  sregex_token_iterator end; // End of sequence

  /*
  Matched: a@bc.com
  Matched: d@ef.com
  Matched: aa@bb.com
  */
  for (; pos != end; ++pos) {
    cout << "Matched: "
      << *pos << endl;
  }
}

We can change the definition of pos so that each iteration yields the second group of the match_results:

// The fourth argument selects which group to iterate over
sregex_token_iterator pos(
  str.cbegin(), str.cend(), e, 2
);
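The effect of the fourth argument can be sketched with a small function that takes the group index as a parameter. The name submatches is hypothetical; the pattern is the article's with the dot escaped.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the chosen group of every match in `text`.
// group 0 is the whole match, 1 the user part, 2 the domain part.
std::vector<std::string> submatches(const std::string& text, int group) {
  static const std::regex e("([[:w:]]+)@([[:w:]]+\\.com)");
  std::vector<std::string> out;
  for (std::sregex_token_iterator pos(text.cbegin(), text.cend(), e, group),
           end;
       pos != end; ++pos) {
    out.push_back(pos->str());
  }
  return out;
}
```

For the string a@bc.com, d@ef.com, aa@bb.com, submatches(str, 2) yields bc.com, ef.com and bb.com, while submatches(str, 0) yields the three full addresses.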

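Note that if the fourth argument is set to -1, the iterator walks over every part of the string that does not match the regular expression, which amounts to splitting the string with the regular expression as the delimiter: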

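#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str("a bb   cd");

  // Matches whitespace; the backslash must be doubled in a string literal
  regex e("\\s+");

  // Iterate over the parts that do not match the regular expression
  sregex_token_iterator pos(
    str.cbegin(), str.cend(), e, -1
  );
  sregex_token_iterator end;

  /*
  Matched: a
  Matched: bb
  Matched: cd
  */
  for (; pos != end; ++pos) {
    cout << "Matched: "
      << *pos << endl;
  }
}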
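The splitting idiom is often packaged as a reusable function. A minimal sketch, assuming a hypothetical helper named split_on_whitespace:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on runs of whitespace by iterating
// over the non-matching parts (submatch index -1).
std::vector<std::string> split_on_whitespace(const std::string& text) {
  static const std::regex e("\\s+");
  std::vector<std::string> parts;
  for (std::sregex_token_iterator pos(text.cbegin(), text.cend(), e, -1),
           end;
       pos != end; ++pos) {
    parts.push_back(pos->str());
  }
  return parts;
}
```

One detail worth knowing: when the input starts with a delimiter, the first token is an empty string, because the text before the first match is empty but still reported. Callers that do not want it can skip empty tokens.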

Another common use of regular expressions is string replacement. In C++ we can use regex_replace:

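#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str(
    "a@bc.com, d@ef.com, aa@bb.com"
  );

  // The dot is escaped so it matches a literal '.'
  regex e(
    "([[:w:]]+)@([[:w:]]+\\.com)"
  );

  // $1 and $2 refer to the first and second groups
  cout << regex_replace(
    str, e, "$1 is on $2"
  ) << endl;
}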

The output is:

a is on bc.com, d is on ef.com, aa is on bb.com
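regex_replace also accepts an optional flags argument that changes how the result is assembled. The sketch below wraps the call in a hypothetical helper, describe, to show two standard flags: format_first_only replaces only the first match, and format_no_copy drops the text between matches.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the article's replacement to `text`
// with the given formatting flags.
std::string describe(
    const std::string& text,
    std::regex_constants::match_flag_type flags =
        std::regex_constants::format_default) {
  static const std::regex e("([[:w:]]+)@([[:w:]]+\\.com)");
  return std::regex_replace(text, e, "$1 is on $2", flags);
}
```

For the input a@bc.com, d@ef.com, aa@bb.com, passing std::regex_constants::format_first_only gives a is on bc.com, d@ef.com, aa@bb.com, and passing std::regex_constants::format_no_copy gives a is on bc.comd is on ef.comaa is on bb.com, since the separating commas and spaces are not copied.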

This article is mainly translated from Bo Qian's YouTube videos:

C++ 11 Library: Regular Expression 1 (youtu.be)
C++ 11 Library: Regular Expression 2 -- Submatch (youtu.be)
C++ 11 Library: Regular Expression 3 -- Iterators (youtu.be)
