C++ and Regular Expressions

wangxudongx

Last modified: 2022-06-26 17:50:06


Tags: regular expressions, C++

First published: 2022-06-26 17:47:45

Article link: https://blog.csdn.net/wangxudongx/article/details/125471827


Regular expressions

Regular expressions come in several dialects. This article describes the syntax using ECMAScript, the default grammar in C++.

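Features
- Greedy matching
  - The * and + quantifiers are greedy: they match as much text as possible. Appending ? to a quantifier makes it non-greedy (minimal) instead.
  - A greedy quantifier extends the match as far to the right as it can
- Lazy matching
  - A trailing ? switches the quantifier to the lazy strategy
  - A lazy quantifier stops at the nearest position where the rest of the pattern can match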
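The difference between the two strategies can be sketched as follows. This is a minimal illustration, not code from the original article; the helper names first_match, greedy_tag and lazy_tag are made up for the example.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `text` matched by `pattern`,
// or an empty string when nothing matches.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// Greedy: ".*" runs on to the last '>' in the text.
std::string greedy_tag(const std::string& text) {
    return first_match(text, "<.*>");
}

// Lazy: ".*?" stops at the first '>' that lets the pattern succeed.
std::string lazy_tag(const std::string& text) {
    return first_match(text, "<.*?>");
}
```

For the input <b>bouquet of roses</b>, greedy_tag returns the whole string, while lazy_tag returns only <b>.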
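Metacharacters
- Special characters
  - [
    - Starts a character set
  - ]
    - Ends a character set
  - \
    - Escape
  - ^
    - Negation (as the first character inside [ ])
  - -
    - Range (inside [ ])
  - |
    - Alternation: either one side or the other
      - Wrap all the alternatives in parentheses () and separate adjacent alternatives with |
        A regex for times in hh:mm form (e.g. 18:26): ^([01]\d|2[0-3]):[0-5]\d$
  - \n
    - Newline
  - \t
    - Tab
  - \\
    - A literal backslash
  - \xhh
    - The character whose code is the two hexadecimal digits hh
  - \uhhhh
    - The Unicode character whose code is the four hexadecimal digits hhhh
  - Anchors
    - ^
      - Matches the start of the input, except inside a bracket expression, where a leading ^ means the characters in that set are not accepted. To match a literal ^, use \^.
    - $
      - Matches the end of the input. In multiline mode (the Multiline property of a JavaScript RegExp object; regex_constants::multiline in C++17 and later) $ also matches before a line terminator such as '\n' or '\r'. To match a literal $, use \$.
    - \b
      - Matches a word boundary, i.e. the position between a word character and a non-word character such as a space.
    - \B
      - Matches a position that is not a word boundary
    - \i or \ii
      - Refers to the i-th group (subpattern), where i is a one- or two-digit number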
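- Shorthand
  - \d = [0-9]
  - \w = a word character (letter, digit or underscore)
  - \s = a whitespace character
  - \S = any non-whitespace character; equivalent to [^ \f\n\r\t\v]
  - [\D] = [^\d]
  - [\W] = [^\w]
  - . = any single character except a line terminator (\n, \r); equivalent to [^\n\r]
- Repetition
  - ?
    - Zero or one time
  - +
    - One or more times
  - *
    - Zero or more times
- Bounded repetition
  - {min,max}
    - At least min and at most max times
  - {min}
    - Exactly min times
  - {min,}
    - min or more times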
- Character sets
  - Character set shorthands
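
Several of the metacharacters listed above can be combined in a short sketch. The code below is only an illustration under my own assumptions: the function names is_valid_time and first_date_like are hypothetical, and the hh:mm check restricts hours to 00-23 and minutes to 00-59.

```cpp
#include <regex>
#include <string>

// True when `s` is a 24-hour time such as "18:26".
// Uses alternation (|), character sets ([01], [0-3], [0-5]) and the \d shorthand.
// regex_match already requires the whole string to match, so ^ and $ are not needed.
bool is_valid_time(const std::string& s) {
    static const std::regex re(R"(([01]\d|2[0-3]):[0-5]\d)");
    return std::regex_match(s, re);
}

// Returns the first yyyy-mm-dd shaped substring of `s`, or "" if there is none.
// {4} and {2} are bounded repetitions; \b keeps the match from starting or
// ending in the middle of a longer run of word characters.
std::string first_date_like(const std::string& s) {
    static const std::regex re(R"(\b\d{4}-\d{2}-\d{2}\b)");
    std::smatch m;
    return std::regex_search(s, m, re) ? m.str(0) : std::string();
}
```

Note that first_date_like only checks the shape of the text; it does not verify that the month or day is in range.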

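Groups (subpatterns)
- The part of a pattern enclosed in parentheses forms a group. Groups can sit side by side, as in ()(), and they can also be nested, as in (()), which contains two groups. Groups are numbered by the position of their opening parenthesis, from left to right.
- (<(.*?)>(.*?)</.*?>)
 - Three groups: the outer () and the two nested (.*?)
 - Sample input: <b>bouquet of roses</b>

R"(<(.*?)>(.*?)</.*?>)"

In this raw string literal the outer parentheses are the delimiters of the literal, so the pattern itself is <(.*?)>(.*?)</.*?> with two groups. Searching the sample input produces three results:

1. The whole match: <b>bouquet of roses</b>

2. The first (.*?): b

3. The second (.*?): bouquet of roses

By default a match result stores the whole match followed by one entry per capture group, which is why the contents of the two () groups are reported as well.

When the regex is constructed with the flag regex_constants::nosubs,

only one result is produced:

<b>bouquet of roses</b>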

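- Backreferences
 - Wrapping a pattern, or part of a pattern, in parentheses stores the text it matched in a temporary buffer. Captured submatches are stored in the order in which their groups appear in the pattern, from left to right. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer is accessed as \n, where n is a one- or two-digit decimal number that identifies the buffer.
 - <(.*?)>(.*?)</\1>
 - \1 must match the same text that group 1 matched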
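
The numbering of groups and the effect of a backreference can be sketched like this. The struct TagMatch and the function find_tag are hypothetical names introduced for the illustration; the pattern is the backreference pattern from the list above.

```cpp
#include <regex>
#include <string>

// The three entries a successful search stores for this pattern.
struct TagMatch {
    std::string whole;  // m[0]: the entire match
    std::string tag;    // m[1]: first group, the tag name
    std::string text;   // m[2]: second group, the text between the tags
};

// Finds the first <tag>text</tag> pair whose closing tag repeats the opening one.
// \1 must match exactly what group 1 captured, so "<b>x</i>" is rejected.
bool find_tag(const std::string& s, TagMatch& out) {
    static const std::regex re(R"(<(.*?)>(.*?)</\1>)");
    std::smatch m;
    if (!std::regex_search(s, m, re)) return false;
    out.whole = m.str(0);
    out.tag = m.str(1);
    out.text = m.str(2);
    return true;
}
```

Index 0 always holds the whole match; capture groups start at index 1.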
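Conditional tests
- (?ifthen|else)
- With a lookahead as the condition
  - (?(?=regex)then|else)
- Note: conditionals come from other regex flavors such as PCRE and .NET. They are not part of the ECMAScript grammar, so they cannot be used with the default std::regex syntax; the lookaheads (?=regex) and (?!regex) on their own are supported.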
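
Under the ECMAScript grammar, an if-then-else can be approximated with a pair of lookaheads: (?=cond)then|(?!cond)else. This is a substitute idiom rather than the conditional syntax itself. The sketch below is my own example; the function name contact_token_ok and the rule it checks are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Rule being checked (invented for this example):
//   if the token contains '@', it must look like user@host.tld;
//   otherwise any run of non-whitespace characters is accepted.
// (?=.*@) plays the role of "if", (?!.*@) the role of "else".
bool contact_token_ok(const std::string& token) {
    static const std::regex re(R"((?=.*@)\w+@\w+\.\w+|(?!.*@)\S+)");
    return std::regex_match(token, re);
}
```

The negative lookahead matters here: without it, a malformed token such as a@b would fall through to the \S+ branch and be accepted.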

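C++ regex

String literals that start with R
- Example
 - R"(<(.*?)>(.*?)</.*?>)"
- What they do
 - Text inside the () does not need \ escaping, which avoids many problems with special characters
 - The R"( and )" around the text are delimiters and are not part of the string
 - The text may span multiple lines
- The C++ standard calls them raw string literals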
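
A short sketch of how a raw string literal compares with an ordinary one. The variable names are made up for the example, and the delimiter "re" in the third literal is an arbitrary choice.

```cpp
#include <string>

// The same pattern text written twice: once with every backslash doubled
// for the compiler, once as a raw string literal.
const std::string escaped_pattern = "\\d{4}-\\d{2}-\\d{2}";
const std::string raw_pattern = R"(\d{4}-\d{2}-\d{2})";

// A custom delimiter between the quote and the parenthesis lets the text
// itself contain the sequence )" without ending the literal.
const std::string with_delimiter = R"re(say "(hi)")re";

// A raw string literal may span lines; the line break becomes part of the string.
const std::string two_lines = R"(first
second)";
```

Both escaped_pattern and raw_pattern hold the same characters, so either can be passed to a regex constructor.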
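regex
- A matching engine (a state machine) built from a character sequence such as a string
- How the pattern is interpreted is controlled by syntax_option_type constants
  - If no syntax_option_type is given, the default is ECMAScript
- syntax_option_type
  - icase
    - Case-insensitive matching
  - nosubs
    - Do not store subexpression matches in the match result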
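  - optimize
    - Prefer fast matching over fast construction of the regex object
  - collate
    - Character ranges such as [a-b] are locale sensitive
  - ECMAScript
    - The grammar is the one ECMA-262 defines for ECMAScript (with minor modifications)
  - basic
    - The grammar is that of POSIX basic regular expressions
  - extended
    - The grammar is that of POSIX extended regular expressions
  - awk
    - The grammar is the one used by awk in POSIX
  - grep
    - The grammar is the one used by grep in POSIX
  - egrep
    - The grammar is the one used by grep -E in POSIX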
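- Construction
  - regex has several constructor overloads; only the two most representative ones are listed here
  - regex r(x, flags);
    - x can be a string, a C-style string, and so on
    - flags is a combination of the syntax_option_type constants
    - Example
      - regex r(R"([a-zA-Z]+://[^\s]*)", regex_constants::ECMAScript);
  - regex r{};
    - Default constructor; builds an empty pattern with the flags set to regex_constants::ECMAScript
- r.flags() returns the flags the regex was constructed with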
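
Construction with flags, reading the flags back, and handling an invalid pattern can be sketched as follows. The three function names are hypothetical; the calls themselves (the two-argument constructor, flags(), and the std::regex_error exception) are standard library facilities.

```cpp
#include <regex>
#include <string>

// Case-insensitive whole-string comparison against `pattern`.
bool matches_ignoring_case(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern,
                        std::regex_constants::ECMAScript | std::regex_constants::icase);
    return std::regex_match(text, re);
}

// True when the regex object was constructed with the icase option.
bool has_icase(const std::regex& re) {
    return (re.flags() & std::regex_constants::icase) == std::regex_constants::icase;
}

// The constructor throws std::regex_error when the pattern is malformed.
bool is_valid_pattern(const std::string& pattern) {
    try {
        const std::regex re(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Options are combined with |, and flags() is tested with & rather than compared to a number, because the numeric values of the constants differ between standard library implementations.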
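Regular expression functions
- regex_match()
  - Checks whether a pattern matches a complete sequence of known length (for example, one line of text)
- regex_search()
  - Looks for a pattern somewhere inside a sequence (for example, a file)
  - Finds the first occurrence of the pattern in the data
- regex_replace()
  - Replaces the parts of a sequence (for example, a file) that match a pattern
  - The regex_constants::match_flag_type constants whose names start with format control the formatting rules used by regex_replace

- Match control options
  - regex_constants::match_flag_type
    - Default: match_default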
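
The three functions can be compared side by side in a small sketch. The function names and the date-shaped pattern are my own choices for illustration; format_first_only is one of the format-prefixed match_flag_type constants mentioned above.

```cpp
#include <regex>
#include <string>

// regex_match: the pattern has to cover the entire string.
bool is_date_shaped(const std::string& s) {
    static const std::regex re(R"(\d{4}-\d{2}-\d{2})");
    return std::regex_match(s, re);
}

// regex_search: the pattern may match anywhere inside the string.
bool contains_date_shaped(const std::string& s) {
    static const std::regex re(R"(\d{4}-\d{2}-\d{2})");
    return std::regex_search(s, re);
}

// regex_replace: rewrites every match; $1..$3 refer to the capture groups.
std::string to_day_first(const std::string& s) {
    static const std::regex re(R"((\d{4})-(\d{2})-(\d{2}))");
    return std::regex_replace(s, re, "$3/$2/$1");
}

// With format_first_only, only the first match is replaced.
std::string mask_first_date(const std::string& s) {
    static const std::regex re(R"(\d{4}-\d{2}-\d{2})");
    return std::regex_replace(s, re, "****-**-**",
                              std::regex_constants::format_first_only);
}
```

A common source of confusion is calling regex_match on text that merely contains the pattern; in that case regex_search is the function to use.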

Match results
- smatch
- An element of an smatch object can be converted directly to a string to obtain the matched text.
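
A minimal sketch of reading an smatch. EmailParts and find_email are hypothetical names, and the simplified pattern is only meant for the illustration, not as a complete email validator.

```cpp
#include <cstddef>
#include <regex>
#include <string>

struct EmailParts {
    std::string user;        // text before '@'
    std::string host;        // text after '@'
    std::size_t offset = 0;  // where the match starts in the input
};

// Finds the first user@host substring and splits it using the capture groups.
bool find_email(const std::string& text, EmailParts& out) {
    static const std::regex re(R"((\w+)@(\w+(?:\.\w+)+))");
    std::smatch m;
    if (!std::regex_search(text, m, re)) return false;
    const std::string user = m[1];  // a sub_match converts implicitly to std::string
    out.user = user;
    out.host = m.str(2);            // str(n) is the explicit form of the same conversion
    out.offset = static_cast<std::size_t>(m.position(0));
    return true;
}
```

The smatch holds iterators into the searched string, so the string has to stay alive and unmodified while the result is being read.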
Iterators
- regex_iterator
  - Forward iterator
  - Use regex_iterator to visit every match of a pattern in a character sequence and act on each one
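
Visiting every match with an iterator can be sketched as below. The function name all_dates is made up for the example; sregex_iterator is the std::string specialization of regex_iterator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every yyyy-mm-dd shaped substring of `text`, in order of appearance.
std::vector<std::string> all_dates(const std::string& text) {
    static const std::regex re(R"(\d{4}-\d{2}-\d{2})");
    std::vector<std::string> found;
    // A default-constructed sregex_iterator serves as the end-of-sequence marker.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        found.push_back(it->str());  // *it is an smatch for the current match
    }
    return found;
}
```

Each step resumes the search after the previous match, so the loop does what repeated calls to regex_search on the remaining text would do.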

Code example

#include <iostream>
#include <string>
#include <regex>
#include <fstream>
#include <locale>
#include <codecvt>

using namespace std;

// Print every sub-match stored in a match result, one per line.
void printAll(const smatch &result)
{
	for (const auto &e : result)
	{
		cout << e << endl;
	}
}

void testRegexMatchFunc()
{
	const char *pstr = "hello world,\n tom and Jerry";
	string str = "hello world,\n tom and<b>bouquet of roses</b> Jerry";
	string str2 = "hello 2022-01-22 world,\n tom and<b>bouquet of roses</b> Jerry http://baidu.com  2022-05-01 heihiei\t\n 293743@qq.com lasjdfljds@gmail.com";
	string str3 = "tomcat";

	// Raw string literals keep \s intact; in an ordinary literal "\s" is not a valid escape sequence.
	regex urlPattern(R"([a-zA-Z]+://[^\s]*)");
	regex datePattern("([0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]{1}|[0-9]{1}[1-9][0-9]{2}|[1-9][0-9]{3})-(((0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|((0[469]|11)-(0[1-9]|[12][0-9]|30))|(02-(0[1-9]|[1][0-9]|2[0-8])))", regex_constants::nosubs);
	regex emailPattern(R"([\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?)");
	regex xmlPattern(R"(<(.*?)>(.*?)</.*?>)", regex_constants::nosubs);

	smatch result;

	regex pattern(R"([a-zA-Z]+://[^\s]*)", regex_constants::ECMAScript);

	// regex_match requires the whole input to match, so neither call succeeds.
	bool found = regex_match(pstr, pattern);

	found = regex_match(str3, result, pattern);
	if (found)
	{
		printAll(result);
	}

	// "tomcat" contains no URL, so the search fails as well.
	found = regex_search(str3, result, pattern);
	if (found)
	{
		printAll(result);
	}

	// The numeric value of the flags is implementation-defined (1 in the output below).
	auto urlPatFlags = urlPattern.flags();
	cout << static_cast<unsigned>(urlPatFlags) << endl;

	found = regex_search(str2, result, urlPattern);
	if (found)
	{
		printAll(result);
	}

	found = regex_search(str2, result, datePattern);
	if (found)
	{
		printAll(result);
	}

	found = regex_search(str2, result, emailPattern);
	if (found)
	{
		for (const auto &e2 : result)
		{
			// first points at the start of the sub-match, second one past its end.
			// Dereferencing second is valid here only because the match ends before the end of str2.
			cout << *e2.first << endl;
			cout << *e2.second << endl;
			cout << e2.str() << endl;
		}
	}

	found = regex_search(str2, result, xmlPattern);
	if (found)
	{
		for (const auto &e2 : result)
		{
			cout << *e2.first << endl;
			cout << *e2.second << endl;
			cout << e2.str() << endl;
		}
	}

	// Both regex_match calls fail because the patterns cover only part of the input.
	regex pattern2("<(.*?)>(.*?)</\\1>");
	found = regex_match(str, result, pattern2);
	cout << found << result.length() << endl;

	regex pattern3(R"(([0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]{1}|[0-9]{1}[1-9][0-9]{2}|[1-9][0-9]{3})-(((0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|((0[469]|11)-(0[1-9]|[12][0-9]|30))|(02-(0[1-9]|[1][0-9]|2[0-8]))))");
	found = regex_match(str2, result, pattern3);
	cout << found << result.length() << endl;
}

void testRegexSearchFunc()
{
	string str = "hello world,\n tom and<b>bouquet of roses</b> Jerry lsjdfl@outlook.com  hshdfehjf@qq.com";

	regex pattern(R"(<(.*?)>(.*?)</\1>)");
	regex emailPattern(R"([\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?)");

	smatch result;

	// The tag pattern covers only part of str, so regex_match reports 0.
	bool found = regex_match(str, result, pattern);
	cout << found << endl;

	// regex_search finds the first email address inside str and reports 1.
	found = regex_search(str, result, emailPattern);
	cout << found << endl;
}

void test_nosubs_flag_func() {
	cout << "######################### test nosubs flag start #####################" << endl;
	string str2 = "hello 2022-01-22 world,\n tom and<b>bouquet of roses</b> Jerry http://baidu.com  2022-05-01 heihiei\t\n 293743@qq.com lasjdfljds@gmail.com";
	regex xmlPattern(R"(<(.*?)>(.*?)</.*?>)");
	regex xmlPatternWithNosubs(R"(<(.*?)>(.*?)</.*?>)", regex_constants::nosubs);
	smatch result;
	bool found;

	// Without nosubs: the whole match plus the two capture groups.
	found = regex_search(str2, result, xmlPattern);
	if (found)
	{
		printAll(result);
	}

	// With nosubs: only the whole match is stored.
	found = regex_search(str2, result, xmlPatternWithNosubs);
	if (found)
	{
		printAll(result);
	}

	cout << "######################### test nosubs flag end #####################" << endl;

}

void TestUnicodeUTF16()
{
	wifstream textFileInputStream("./utf16textfile.txt");
	// codecvt_utf16 is deprecated since C++17 but is still what decodes the UTF-16 input here.
	textFileInputStream.imbue(locale(locale::classic(), new codecvt_utf16<wchar_t>()));
	if (textFileInputStream.is_open())
	{
		wstring wstr;

		// Test the extraction itself; looping on eof() can run one extra iteration after the last word.
		while (textFileInputStream >> wstr)
		{
			wcout << wstr << endl;
		}
	}

	textFileInputStream.close();
}

int main()
{

	{
		// A raw string literal: well suited to regex patterns because no manual backslash escaping is needed
		string str_r_1 = R"(
{
  "id": "velit enim ipsum nostrud nisi",
  "name": "ut dolore quis mollit in",
  "namePinyin": "ullamco",
  "namePy": "id in mollit",
  "username": "proident nulla aliquip",
  "sex": -10581978.03552425,
  "jobNumber": "mollit in anim",
  "mobile": "incididunt qui laborum do",
  "email": "exercitation magna labore do anim",
  "rfid": "velit officia consequat qui",
  "avatarLink": "in incididunt amet",
  "preferences": {
    "lang": "irure elit amet ea"
  }
}
)";

		cout << str_r_1 << endl;

	}


	testRegexMatchFunc();
	testRegexSearchFunc();
	test_nosubs_flag_func();

	TestUnicodeUTF16();

	return 0;
}

Output


{
  "id": "velit enim ipsum nostrud nisi",
  "name": "ut dolore quis mollit in",
  "namePinyin": "ullamco",
  "namePy": "id in mollit",
  "username": "proident nulla aliquip",
  "sex": -10581978.03552425,
  "jobNumber": "mollit in anim",
  "mobile": "incididunt qui laborum do",
  "email": "exercitation magna labore do anim",
  "rfid": "velit officia consequat qui",
  "avatarLink": "in incididunt amet",
  "preferences": {
    "lang": "irure elit amet ea"
  }
}

1
http://baidu.com
2022-01-22
2

293743@qq.com
<

<b>bouquet of roses</b>
00
00
0
1
######################### test nosubs flag start #####################
<b>bouquet of roses</b>
b
bouquet of roses
<b>bouquet of roses</b>
######################### test nosubs flag end #####################