C++ regular-expression matching: using std::regex

ding_zhikai

Last modified 2023-01-30 12:47:40

First published 2023-01-30 12:37:25

Article link: https://blog.csdn.net/ding_zhikai/article/details/128797473

Contents

0. Purpose of this article
1. Points of interest
2. Brief introduction
- 2.1 Syntax overview
3. Usage example

0. Purpose of this article

This article gives a brief introduction to regular-expression matching with std::regex.

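1. Points of interest

(1) Compiling a regular expression
(2) Running a match, and the positions of () groups
(3) Chinese text and wchar_t:
① The position of each Chinese character
② Printing
③ Reading from a file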

2. Brief introduction

2.1 Syntax overview

2.1.1 Syntax: header

2.1.1.1 char type

#include <regex>

using std::regex;
using std::cmatch;
using std::regex_match;
using std::regex_search;

*Note: the first two are types; the last two are functions (function templates).

2.1.1.2 wchar_t type

using std::wregex;
using std::wcmatch;

*Note: usage is the same as for the char type (note that wchar_t[] literals are written like L"abc"); the functions are still std::regex_match() and std::regex_search().

2.1.2 Syntax: compiling a regular expression

regex rule(rule_str);

*Note: rule_str is a std::string holding the regular expression.
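Compilation is also where a malformed pattern is detected: the std::regex constructor throws std::regex_error. The sketch below is only an illustration of that behaviour; the helper name is_valid_pattern is hypothetical and does not appear in the original code.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `pattern` compiles as a std::regex
// (default ECMAScript grammar), false if the constructor throws.
bool is_valid_pattern(const std::string& pattern) {
    try {
        std::regex rule(pattern);  // compilation happens here
        return true;
    } catch (const std::regex_error&) {
        return false;              // e.g. an unbalanced parenthesis
    }
}
```

A check like this may be useful when the patterns come from a file that is edited by hand, as in the file-loading example in section 3.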

2.1.3 Syntax: creating a variable to hold the match result

cmatch match_rslt;

This variable gives access to the details of a match.

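2.1.4 Syntax: running a match

Match the whole text

regex_match(line.c_str(), match_rslt, rule)

Match part of the text

regex_search(line.c_str(), match_rslt, rule)

*Notes:
(1) line is the text to be matched.
(2) match_rslt stores the detailed match result. If you only need to know whether the text matches, this argument can be omitted.
(3) rule is the regular expression compiled earlier. For example, if the patterns come from a static text file used as a preset resource, they can be compiled once at program start-up and then reused, which saves compilation time.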
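To make the difference between the two calls concrete, here is a small sketch that classifies several lines against one precompiled rule. The type MatchCounts and the function count_matches are hypothetical names chosen for this illustration; the overloads that omit the match-result argument are used because only a yes/no answer is needed.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct MatchCounts {
    int full = 0;     // regex_match succeeded
    int partial = 0;  // only regex_search succeeded
    int none = 0;     // neither succeeded
};

// `rule` is compiled once by the caller and reused for every line.
MatchCounts count_matches(const std::regex& rule,
                          const std::vector<std::string>& lines) {
    MatchCounts counts;
    for (const std::string& line : lines) {
        if (std::regex_match(line, rule)) {          // the whole text must match
            ++counts.full;
        } else if (std::regex_search(line, rule)) {  // some substring matches
            ++counts.partial;
        } else {
            ++counts.none;
        }
    }
    return counts;
}
```

With the rule "[0-9]+([a-z]+)([A-Z]+)" from the usage example, "123abABCD" is a full match, "ab123abABCD45" only contains a match, and "123ab" does not match at all.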

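2.1.5 Syntax: reading the match result

2.1.5.1 How to tell whether anything matched

Check the return value of regex_match() and regex_search(): both return true on a match and false otherwise.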

2.1.5.2 Whole match

Whole match: start position

match_rslt.position()

Whole match: length

match_rslt.length()

Whole match: matched text

A) Method 1

match_rslt.str()

B) Method 2

match_rslt[0].str()

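*Notes:
a) First check that the match succeeded (or that the result is not empty).
b) The text can be printed with cout; match_rslt[0] can even be printed directly.
c) With the wchar_t type, the result cannot be printed to cout this way!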
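The three accessors above can be collected in one place. The following is a minimal sketch for the char case; WholeMatch and find_whole_match are hypothetical names introduced here, and regex_search is used so that the start position can be non-zero.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type: details of the whole match (match_rslt[0]).
struct WholeMatch {
    bool found = false;
    std::size_t pos = 0;  // start index in the input
    std::size_t len = 0;  // length in chars
    std::string text;     // copy of the matched text
};

WholeMatch find_whole_match(const std::regex& rule, const std::string& line) {
    WholeMatch out;
    std::cmatch match_rslt;
    // match_rslt stores pointers into line's buffer, so it is read
    // here, while line is still alive and unchanged.
    if (std::regex_search(line.c_str(), match_rslt, rule)) {
        out.found = true;
        out.pos = static_cast<std::size_t>(match_rslt.position());
        out.len = static_cast<std::size_t>(match_rslt.length());
        out.text = match_rslt.str();  // same text as match_rslt[0].str()
    }
    return out;
}
```

Checking found before using the other fields corresponds to note a) above.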

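2.1.5.3 Partial matches (groups)

Groups: number of groups

match_rslt.size()-1

*Notes:
(1) The -1 is there because the first element is the whole match.
(2) This number equals the number of capturing () groups in the regular expression.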
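As a sketch of the two notes above, the group count can be read either from a successful match or from the compiled pattern itself via mark_count(). Both helper names below are hypothetical. Non-capturing groups written as (?:...) in the default ECMAScript grammar are not counted.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: number of groups reported by a successful full match,
// or 0 if `line` does not match `pattern`.
std::size_t matched_group_count(const std::string& pattern,
                                const std::string& line) {
    std::regex rule(pattern);
    std::cmatch match_rslt;
    if (!std::regex_match(line.c_str(), match_rslt, rule)) {
        return 0;  // size() is 0 after a failed match, so do not subtract 1
    }
    return match_rslt.size() - 1;  // element 0 is the whole match
}

// Hypothetical helper: number of capturing groups in the pattern itself,
// independent of any input text.
std::size_t pattern_group_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}
```

For the pattern "[0-9]+([a-z]+)([A-Z]+)" both functions report two groups when the input is "123abABCD".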

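Groups: length of a group
Suppose we iterate over the matched groups like this:

for(auto& elm: match_rslt)

Here elm corresponds to match_rslt[i].
The length of a group is then:

elm.length()

Groups: computing the start position of a group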

elm.first-match_rslt[0].first+match_rslt.position()

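Explanation:
(1) elm.first is a pointer to the first character of this group inside the original input sequence.
(2) match_rslt[0].first is the pointer to the start of the whole match in the original sequence, so the difference between the two is the group's index relative to the whole match.
(3) Adding the start index of the whole match in the original sequence gives the group's start index relative to the original sequence.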
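The pointer arithmetic described above can be sketched as follows. GroupSpan and group_spans are hypothetical names. The field direct is filled from match_rslt.position(i), the standard member that reports the offset of sub-match i, so the two values can be compared.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record for one element of the match result.
struct GroupSpan {
    std::size_t start;   // offset computed with the pointer formula
    std::size_t len;     // elm.length()
    std::size_t direct;  // offset reported by match_rslt.position(i)
};

// Spans of match_rslt[0..n] after a regex_search; empty if nothing matched.
std::vector<GroupSpan> group_spans(const std::regex& rule,
                                   const std::string& line) {
    std::vector<GroupSpan> spans;
    std::cmatch match_rslt;
    if (!std::regex_search(line.c_str(), match_rslt, rule)) {
        return spans;
    }
    for (std::size_t i = 0; i < match_rslt.size(); ++i) {
        const std::csub_match& elm = match_rslt[i];
        // Group offset relative to the whole match, plus the whole match's
        // own offset in the input.
        std::ptrdiff_t start =
            elm.first - match_rslt[0].first + match_rslt.position();
        spans.push_back({static_cast<std::size_t>(start),
                         static_cast<std::size_t>(elm.length()),
                         static_cast<std::size_t>(match_rslt.position(i))});
    }
    return spans;
}
```

For the input "ab123abABCD45" and the rule "[0-9]+([a-z]+)([A-Z]+)", the whole match starts at 2, the first group at 5 and the second at 7, and start agrees with direct for each element.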

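Example:
Regular expression: “你好[，。！]?我叫.{1,4}[，。！]?(请多指教|)$”
Input sequence: “你好我叫甲乙丙，请多指教”
Whole match: “你好我叫甲乙丙，请多指教”, start=0, len=12
Group match (only one group): “请多指教”, start=8, len=4

*Notes:
(1) The example above was matched with the wchar_t versions (wregex, wcmatch).
(2) .{1,4} is a wildcard for one to four characters. With char it counts bytes, so for multi-byte text such as UTF-8 it covers roughly one Chinese character; with wchar_t it can match up to four Chinese characters.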
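The numbers in the example above can be reproduced with a short wide-character sketch. WideSpan and wide_group_span are hypothetical names; positions and lengths are counted in wchar_t units, which is one unit per character for the Chinese text used here.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct WideSpan {
    bool found;
    std::size_t start;  // index in wchar_t units
    std::size_t len;    // length in wchar_t units
};

// Span of group `index` (0 = whole match) when `line` matches `pattern`
// in full; found is false otherwise.
WideSpan wide_group_span(const std::wstring& pattern, const std::wstring& line,
                         std::size_t index) {
    std::wregex rule(pattern);
    std::wcmatch match_rslt;
    if (!std::regex_match(line.c_str(), match_rslt, rule) ||
        index >= match_rslt.size()) {
        return {false, 0, 0};
    }
    return {true,
            static_cast<std::size_t>(match_rslt.position(index)),
            static_cast<std::size_t>(match_rslt.length(index))};
}
```

Called with the pattern and input from the example as wide literals (L"..."), index 0 gives start=0, len=12 and index 1 gives start=8, len=4. This assumes the compiler reads the source file with the encoding it was saved in, for example UTF-8.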

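Groups: content of a group

For char
Method 1:

elm.str()

*Notes:
(1) The result can be printed with cout; elm can even be printed directly.
(2) For wchar_t, printing with something like wcout<<elm.str()<<endl; is not recommended.

Method 2:

line.substr(start,len)

For wchar_t
(1) A wstring group is extracted the same way as with char (elm.str() or line.substr(start,len)).
(2) Printing it with something like wcout<<elm.str()<<endl; is not recommended.
(3) The recommended way to print is to convert to string first and then use cout: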

cout<<"1===>"<<to_byte_string(elm.str())<<endl;

*Note:
to_byte_string() is a function taken from the web that converts a wstring to a string; the full code is in section "3. Usage example".
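The article's to_byte_string() relies on std::wstring_convert, which is deprecated since C++17. As an alternative, and not the author's method, the conversion to UTF-8 can be written by hand. The function name wide_to_utf8 is hypothetical, and the sketch does no validation of its input beyond combining UTF-16 surrogate pairs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8 bytes.
// Works for 32-bit wchar_t (one code point per unit) and for 16-bit
// wchar_t (UTF-16, surrogate pairs are combined). No other validation.
std::string wide_to_utf8(const std::wstring& input) {
    std::string out;
    for (std::size_t i = 0; i < input.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(input[i]);
        // Combine a high surrogate with the following low surrogate.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < input.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(input[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {            // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: most Chinese characters
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be written with cout in the same way as the output of to_byte_string(), provided the terminal expects UTF-8.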

3. Usage example

#include <iostream>
#include <fstream>

#include <string>
#include <regex>

#include <locale>
#include <codecvt>
#include <Windows.h>

using std::cout;
using std::endl;
using std::ifstream;

using std::string;

using std::regex;
using std::cmatch;
using std::regex_match;
using std::regex_search;

using std::wcout;
using std::wstring;
using std::wregex;
using std::wcmatch;
using std::wifstream;

// === functions ===================
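// Converts a wide string to a UTF-8 byte string so it can be printed with cout.
// Note: std::wstring_convert and <codecvt> are deprecated since C++17.
std::string to_byte_string(const std::wstring& input)
{
    //std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(input);
}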

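std::wstring StringToWString(const std::string& str)
{
    // First call: ask for the required length in wchar_t units (including the terminating null).
    int num = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
    if (num <= 0) {
        return std::wstring(); // conversion failed
    }
    // Second call: convert into a buffer owned by std::wstring (no manual new/delete).
    std::wstring w_str(num, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &w_str[0], num);
    w_str.resize(num - 1); // drop the terminating null written by the API
    return w_str;
}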

void show_wstr(const wstring& line, const string& split_tag = ", ", const string& link_tag=":") {
    // Print every wide character together with its index.
    for(size_t i=0; i<line.size(); ++i) {
        cout<<i<<link_tag<<to_byte_string(wstring(1, line[i]))<<split_tag;
    }
    cout<<endl;
}


void try_regex(const string& rule_str, const string& line, const string& end_line = "\n") {
    regex rule(rule_str);
    cmatch match_rslt;

    if (regex_match(line.c_str(), match_rslt, rule)) {
        // whole-line match
        cout<<"MATCH: line(\""<<line<<"\") MATCH rule(\""<<rule_str<<"\")."<<endl;
        int i = 0;
        for(auto& elm: match_rslt) {
            cout<<"match_part["<<i<<"]: \""<<elm
                <<"\", start="<<elm.first-match_rslt[0].first+match_rslt.position() // start position of the matched part !!!
                <<", len="<<elm.length()<<endl;
            ++i;
        }
        cout<<"match poi: "<<match_rslt.position()<<endl;
    } else if (regex_search(line.c_str(), match_rslt, rule)) {
        // partial match
        cout<<"SEARCH: line(\""<<line<<"\") SEARCH rule(\""<<rule_str<<"\")."<<endl;
        int i = 0;
        for(auto& elm: match_rslt) {
            cout<<"match_part["<<i<<"]: \""<<elm
                <<"\", start="<<elm.first-match_rslt[0].first+match_rslt.position()
                <<", len="<<elm.length()<<endl;
            ++i;
        }
        cout<<"match poi: "<<match_rslt.position()<<endl;
    } else {
        // no match
        cout<<"NOT MATCH: line(\""<<line<<"\") NOT match rule(\""<<rule_str<<"\")!"<<endl;
    }
    cout<<"match part num: "<<match_rslt.size()<<endl;
    cout<<end_line;
}

void try_wregex(const wstring& rule_wstr, const wstring& line, const string& end_line = "\n") {
    wregex rule(rule_wstr);
    wcmatch match_rslt;

    if (regex_match(line.c_str(), match_rslt, rule)) {
        // whole-line match
        cout<<"MATCH: line(\""<<to_byte_string(line)<<"\") MATCH rule(\""<<to_byte_string(rule_wstr)<<"\")."<<endl;
        int i = 0;
        for(auto& elm: match_rslt) {
            size_t start = elm.first-match_rslt[0].first+match_rslt.position();
            size_t len = elm.length();
            cout<<"match_part["<<i<<"]: \""<<to_byte_string(line.substr(start,len)) // a wstring elm cannot be printed to cout directly !!!
                <<"\", start="<<start
                <<", len="<<len<<endl;
            ++i;
        }
        cout<<"match poi: "<<match_rslt.position()<<endl;
    } else if (regex_search(line.c_str(), match_rslt, rule)) {
        // partial match
        cout<<"SEARCH: line(\""<<to_byte_string(line)<<"\") SEARCH rule(\""<<to_byte_string(rule_wstr)<<"\")."<<endl;
        int i = 0;
        for(auto& elm: match_rslt) {
            size_t start = elm.first-match_rslt[0].first+match_rslt.position();
            size_t len = elm.length();
            cout<<"match_part["<<i<<"]: \""<<to_byte_string(line.substr(start,len))
                <<"\", start="<<start
                <<", len="<<len<<endl;
            ++i;
        }
        cout<<"match poi: "<<match_rslt.position()<<endl;
    } else {
        // no match
        cout<<"NOT MATCH: line(\""<<to_byte_string(line)<<"\") NOT match rule(\""<<to_byte_string(rule_wstr)<<"\")!"<<endl;
    }
    cout<<"match part num: "<<match_rslt.size()<<endl;
    cout<<end_line;
}

void load_rule_txt_and_test(const char* pth, const string& in_str) {
    ifstream f(pth);
    string rule_line;
    while(getline(f,rule_line)){
        if (rule_line.size()) {
            try_regex(rule_line, in_str);
        }
    }

    f.close();
}

void load_wrule_txt_and_test(const char* pth, const wstring& in_wstr) {
    ifstream f(pth);
    string rule_line;
    while(getline(f,rule_line)){
        if (rule_line.size()) {
            cout<<"rule now: "<<rule_line<<endl;
            try_wregex(StringToWString(rule_line), in_wstr); // Chinese rule file: read each line as a string first, then convert it to wstring !!!
        }
    }

    f.close();
}

// === main ==========================
int main() {
    // 1. match & search
    // string rule2 = "[0-9]+([a-z]+)([A-Z]+)";
    // try_regex(rule2, "123ab"); // not match
    // try_regex(rule2, "ab123abABCD45"); // search
    // try_regex(rule2, "123abABCD");// match
    // cout<<"======================"<<endl;

    // 2. wstring
    // wstring wrule1 = L"你好[，。！]?我叫.{1,4}[，。！]?(请多指教|)$";
    // try_wregex(wrule1, L"你好我叫甲乙丙，请多指教");
    // try_wregex(wrule1, L"你好我叫12345，请多指教");

    // 3. get regex rules by open file
    // const char* pth = "D:\\learn\\cpp\\code\\regex_rule_for_test.txt";
    // load_rule_txt_and_test(pth, "你好我叫123,请多指教");
    // cout<<"-------------------"<<endl;
    // load_rule_txt_and_test(pth, "你好我叫甲乙丙,请多指教");
    // cout<<"======================"<<endl;

    // 4. get regex rules (zh_CN mode) by open file
    const char* pth = "D:\\learn\\cpp\\code\\regex_rule_for_test.txt";
    load_wrule_txt_and_test(pth, L"你好我叫甲乙丙，请多指教");
    cout<<"-------------------"<<endl;
    load_wrule_txt_and_test(pth, L"你好我叫12345，请多指教");
    cout<<"======================"<<endl;

    return 0;
}