C++11 Regular Expressions

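I have learned very little about string handling in C; the most I had done was split strings with strtok. Whenever I needed to process text, I did it in Python. Recently, while working on multi-language support, a colleague wrote a program in C++11 that scanned the entire code base for Chinese string literals and wrote the list to an Excel file. That showed me C++ can now handle this kind of task as well. In this post I learn C++11 regular expressions by reading through the reference manual.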

Reference manual

Objects

Header file:

#include <regex>

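Almost all regex operations can be described as operations on the following objects:

Target sequence

The target sequence is the character sequence the regex is applied to. It may be a range specified by two iterators, a null-terminated character string, or a std::string.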

Pattern

The pattern is the regular expression itself. Its grammar is chosen through the C++11 syntax options (ECMAScript by default).

Matched array

Match results: information about a match can be retrieved as an object of type std::match_results.

Replacement string

This string determines how the matched text is replaced.
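To make the four objects concrete, here is a minimal sketch that uses each of them once. The helper name mask_digits and its sample pattern are made up for illustration and are not part of the manual.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (illustration only): masks every run of digits in
// `target` and reports the first run through `first_digits`.
std::string mask_digits(const std::string& target, std::string& first_digits)
{
    const std::regex pattern("[0-9]+");     // the pattern
    std::smatch matches;                    // the match results
    if (std::regex_search(target, matches, pattern)) {
        first_digits = matches.str();       // text of the first match
    }
    const std::string replacement = "#";    // the replacement string
    return std::regex_replace(target, pattern, replacement);
}
```

Here `target` plays the role of the target sequence; calling mask_digits("room 101, floor 3", first) is expected to return the text with both numbers replaced by #.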

Main classes

basic_regex

The regular expression object.
Narrow- and wide-character support:

Type      Definition
regex     basic_regex<char>
wregex    basic_regex<wchar_t>
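As a small sketch of the two typedefs side by side, the same check can be written once for narrow and once for wide strings. The function name is_txt_name is an assumption chosen for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers (illustration only): one overload per character type.
bool is_txt_name(const std::string& name)
{
    static const std::regex re("[a-z]+\\.txt");      // basic_regex<char>
    return std::regex_match(name, re);
}

bool is_txt_name(const std::wstring& name)
{
    static const std::wregex re(L"[a-z]+\\.txt");    // basic_regex<wchar_t>
    return std::regex_match(name, re);
}
```

The wide version needs wide literals (the L prefix) for both the pattern and the text.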

flag_type: options that customize how the regular expression object behaves (see the flags documentation).
The main options are listed below.

Value        Effect(s)
icase        Character matching should be performed without regard to case.
nosubs       When performing matches, all marked sub-expressions (expr) are treated as non-marking sub-expressions (?:expr). No matches are stored in the supplied std::match_results structure and mark_count() is zero.
optimize     Instructs the regular expression engine to make matching faster, with the potential cost of making construction slower. For example, this might mean converting a non-deterministic FSA to a deterministic FSA.
collate      Character ranges of the form "[a-b]" will be locale sensitive.
ECMAScript   Use the Modified ECMAScript regular expression grammar.
basic        Use the basic POSIX regular expression grammar.
extended     Use the extended POSIX regular expression grammar.
awk          Use the regular expression grammar used by the awk utility in POSIX.
grep         Use the regular expression grammar used by the grep utility in POSIX. This is effectively the same as the basic option with the addition of newline '\n' as an alternation separator.
egrep        Use the regular expression grammar used by the grep utility, with the -E option, in POSIX. This is effectively the same as the extended option with the addition of newline '\n' as an alternation separator in addition to '|'.

For example:

std::regex("meow", std::regex::icase)
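A slightly fuller sketch of the flags in use follows. The helper names matches_meow and group_count are invented for this example. Per the table above, nosubs should make mark_count() report zero, but I have not checked that on every standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: case-insensitive match of the whole string.
bool matches_meow(const std::string& s)
{
    // ECMAScript is the default grammar; flags are combined with operator|.
    const std::regex re("meow", std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(s, re);
}

// Hypothetical helper: number of capture groups in the compiled pattern,
// optionally compiled with the nosubs flag.
unsigned group_count(const std::string& pattern, bool nosubs)
{
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (nosubs) {
        flags |= std::regex::nosubs;
    }
    return std::regex(pattern, flags).mark_count();
}
```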

sub_match

Identifies the sequence of characters matched by a sub-expression.

match_results

Identifies one regular expression match, including all sub-expression matches.
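The relationship between the two classes can be sketched as follows: a std::smatch is indexed to obtain std::ssub_match objects, and each sub_match carries a matched flag that is false when an optional group did not take part in the match. The helpers group_or and parse_pair, and the "key=value" format, are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: text of group `n`, or `fallback` when that group
// did not take part in the match.
std::string group_or(const std::smatch& m, std::size_t n, const std::string& fallback)
{
    // m[n] is a std::ssub_match; `matched` is false for an unused optional group.
    return (n < m.size() && m[n].matched) ? m[n].str() : fallback;
}

// Hypothetical helper: splits "key=value" or a bare "key" into its parts.
bool parse_pair(const std::string& s, std::string& key, std::string& value)
{
    static const std::regex re("([a-z]+)(?:=([0-9]+))?");
    std::smatch m;
    if (!std::regex_match(s, m, re)) {
        return false;
    }
    key = m[1].str();
    value = group_or(m, 2, "<none>");
    return true;
}
```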

Algorithms

regex_match

regex_match attempts to match the regular expression against the entire target sequence.

#include <iostream>
#include <string>
#include <regex>

int main()
{
    // Simple regular expression matching
    std::string fnames[] = {"foo.txt", "bar.txt", "baz.dat", "zoidberg"};
    std::regex txt_regex("[a-z]+\\.txt");

    for (const auto &fname : fnames) {
        std::cout << fname << ": " << std::regex_match(fname, txt_regex) << '\n';
    }   

    // Extraction of a sub-match
    std::regex base_regex("([a-z]+)\\.txt");
    std::smatch base_match;

    for (const auto &fname : fnames) {
        if (std::regex_match(fname, base_match, base_regex)) {
            // The first sub_match is the whole string; the next
            // sub_match is the first parenthesized expression.
            if (base_match.size() == 2) {
                std::ssub_match base_sub_match = base_match[1];
                std::string base = base_sub_match.str();
                std::cout << fname << " has a base of " << base << '\n';
            }
        }
    }

    // Extraction of several sub-matches
    std::regex pieces_regex("([a-z]+)\\.([a-z]+)");
    std::smatch pieces_match;

    for (const auto &fname : fnames) {
        if (std::regex_match(fname, pieces_match, pieces_regex)) {
            std::cout << fname << '\n';
            for (size_t i = 0; i < pieces_match.size(); ++i) {
                std::ssub_match sub_match = pieces_match[i];
                std::string piece = sub_match.str();
                std::cout << "  submatch " << i << ": " << piece << '\n';
            }   
        }   
    }   
}

The example shows how to use sub_match to read the captured groups of a single match. Output:

foo.txt: 1
bar.txt: 1
baz.dat: 0
zoidberg: 0
foo.txt has a base of foo
bar.txt has a base of bar
foo.txt
  submatch 0: foo.txt
  submatch 1: foo
  submatch 2: txt
bar.txt
  submatch 0: bar.txt
  submatch 1: bar
  submatch 2: txt
baz.dat
  submatch 0: baz.dat
  submatch 1: baz
  submatch 2: dat

regex_search

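regex_search looks for a match anywhere in the target sequence; calling it repeatedly on the remaining suffix finds every match.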

#include <iostream>
#include <string>
#include <regex>

int main()
{
    std::string lines[] = {"Roses are #ff0000",
                           "violets are #0000ff",
                           "all of my base are belong to you"};

    std::regex color_regex("#([a-f0-9]{2})"
                            "([a-f0-9]{2})"
                            "([a-f0-9]{2})");

    // simple match
    for (const auto &line : lines) {
        std::cout << line << ": " << std::boolalpha
                  << std::regex_search(line, color_regex) << '\n';
    }   
    std::cout << '\n';

    // show contents of marked subexpressions within each match
    std::smatch color_match;
    for (const auto& line : lines) {
        if(std::regex_search(line, color_match, color_regex)) {
            std::cout << "matches for '" << line << "'\n";
            std::cout << "Prefix: '" << color_match.prefix() << "'\n";
            for (size_t i = 0; i < color_match.size(); ++i) 
                std::cout << i << ": " << color_match[i] << '\n';
            std::cout << "Suffix: '" << color_match.suffix() << "\'\n\n";
        }
    }

    // repeated search (see also std::regex_iterator)
    std::string log(R"(
        Speed:  366
        Mass:   35
        Speed:  378
        Mass:   32
        Speed:  400
    Mass:   30)");
    std::regex r(R"(Speed:\s+\d+)"); // \s+ accepts the spaces or tabs after the colon
    std::smatch sm;
    while(regex_search(log, sm, r))
    {
        std::cout << sm.str() << '\n';
        log = sm.suffix();
    }
}

Output:

Roses are #ff0000: true
violets are #0000ff: true
all of my base are belong to you: false

matches for 'Roses are #ff0000'
Prefix: 'Roses are '
0: #ff0000
1: ff
2: 00
3: 00
Suffix: ''

matches for 'violets are #0000ff'
Prefix: 'violets are '
0: #0000ff
1: 00
2: 00
3: ff
Suffix: ''

Speed:  366
Speed:  378
Speed:  400
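The repeated search above copies the suffix back into the string on every pass. As an alternative sketch, the iterator overload of std::regex_search can resume from the end of the previous match without copying. The helper name find_colors is made up for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every "#rrggbb" colour found in `text`.
std::vector<std::string> find_colors(const std::string& text)
{
    static const std::regex color("#[a-f0-9]{6}");
    std::vector<std::string> found;
    std::smatch m;
    auto first = text.cbegin();
    // Each search starts where the previous match ended, so `text` is never copied.
    while (std::regex_search(first, text.cend(), m, color)) {
        found.push_back(m.str());
        first = m[0].second;
    }
    return found;
}
```

Note the contrast with regex_match: a line such as "Roses are #ff0000" contains a match for the colour pattern, but the line as a whole does not match it.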

regex_replace

Finds the text that matches and replaces it. The usual capture-group references in the replacement string are supported as well.

#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
   std::string text = "Quick brown fox";
   std::regex vowel_re("a|e|i|o|u");

   // write the results to an output iterator
   std::regex_replace(std::ostreambuf_iterator<char>(std::cout),
                      text.begin(), text.end(), vowel_re, "*");

   // construct a string holding the results
   std::cout << '\n' << std::regex_replace(text, vowel_re, "[$&]") << '\n';
}

Output:

Q**ck br*wn f*x
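Q[u][i]ck br[o]wn f[o]x

A second example uses capture groups to turn an Erlang -define into a C #define:

#include <iostream>
#include <regex>
#include <string>

void auto_test2()
{
    std::string erl_text = "-define(CMD_CONNECT_EXCHANGE_KEY_REQ, 3).";
    std::regex match_erl_define("-define\\(([a-zA-Z_]+), ([0-9]+)\\)\\.");

    // construct a string holding the results; $1 and $2 refer to the two capture groups
    std::cout << '\n' << std::regex_replace(erl_text, match_erl_define, "#define $1 $2") << '\n';
}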

Output:

#define CMD_CONNECT_EXCHANGE_KEY_REQ 3

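For the full meaning of the built-in replacement placeholders, see this reference (it documents the JavaScript RegExp properties that the ECMAScript-style format string is modeled on):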

RegExp.lastMatch
RegExp['$&']
RegExp.$1-$9
RegExp.input ($_)
RegExp.lastParen ($+)
RegExp.leftContext ($`)
RegExp.rightContext ($')

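Note that inside a C++ string literal the backslash that introduces a regex escape must itself be escaped, so it is written as \\. Capture groups written with () are supported directly.
The quantifiers * and + are written without escaping, and so is a repetition count such as {2}.
Wikipedia calls these Quantification.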

fmt - the regex replacement format string, exact syntax depends on the value of flags
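Two points from the notes above can be sketched together: a raw string literal avoids the doubled backslashes, and the flags argument of std::regex_replace changes how fmt is applied. The function names below are invented for this illustration; format_first_only and format_no_copy are the standard flags from std::regex_constants.

```cpp
#include <regex>
#include <string>

// The Erlang -define pattern from the earlier example, written as a raw string
// literal so that each regex escape needs only one backslash.
const std::regex erl_define(R"(-define\(([a-zA-Z_]+), ([0-9]+)\)\.)");

// Hypothetical helper: rewrites an Erlang -define line as a C #define.
std::string to_c_define(const std::string& line)
{
    return std::regex_replace(line, erl_define, "#define $1 $2");
}

// format_first_only: replace only the first match.
std::string star_first_vowel(const std::string& s)
{
    return std::regex_replace(s, std::regex("[aeiou]"), "*",
                              std::regex_constants::format_first_only);
}

// format_no_copy: drop the parts of the input that did not match.
std::string only_vowels(const std::string& s)
{
    return std::regex_replace(s, std::regex("[aeiou]"), "$&",
                              std::regex_constants::format_no_copy);
}
```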

Iterators

regex_iterator

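regex_iterator steps through the matches of a regular expression one at a time. It suits cases where you need to control how many matches are consumed during the search. For example, suppose a text lists what each guest spent at a hotel, and you want the guests whose spending brought the hotel's takings to 100 yuan. You can iterate over the matches and stop as soon as the running total reaches 100, instead of extracting the whole list first, walking through it entry by entry, and using only the first few.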

#include <regex>
#include <iterator>
#include <iostream>
#include <string>

int main()
{
    const std::string s = "Quick brown fox.";

    std::regex words_regex("[^\\s]+");
    auto words_begin = 
        std::sregex_iterator(s.begin(), s.end(), words_regex);
    auto words_end = std::sregex_iterator();

    std::cout << "Found " 
              << std::distance(words_begin, words_end) 
              << " words:\n";

    for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
        std::smatch match = *i;                                                 
        std::string match_str = match.str(); 
        std::cout << match_str << '\n';
    }   
}

Output:

Found 3 words:
Quick
brown
fox.
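The hotel scenario described earlier could be sketched like this. The "name:amount" input format and the function name guests_until are assumptions made purely for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical input format: "name:amount" entries separated by anything else.
// Returns the guests seen up to the point where the running total reaches `limit`.
std::vector<std::string> guests_until(const std::string& text, int limit)
{
    static const std::regex entry("([A-Za-z]+):([0-9]+)");
    std::vector<std::string> guests;
    int total = 0;
    // The loop stops as soon as the total reaches the limit, so later
    // entries are never matched at all.
    for (std::sregex_iterator it(text.begin(), text.end(), entry), end;
         it != end && total < limit; ++it) {
        guests.push_back(it->str(1));
        total += std::stoi(it->str(2));
    }
    return guests;
}
```

With the input "Ann:40 Bob:70 Cid:30" and a limit of 100, the loop ends after Bob, because the running total has then passed 100.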

regex_token_iterator

#include <algorithm>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
   std::string text = "Quick brown fox.";
   // tokenization (non-matched fragments)
   // Note that regex is matched only two times: when the third value is obtained
   // the iterator is a suffix iterator.
   std::regex ws_re("\\s+"); // whitespace
   std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));

   // iterating the first submatches
   std::string html = "<p><a href=\"http://google.com\">google</a> "
                      "< a HREF =\"http://cppreference.com\">cppreference</a>\n</p>";
   std::regex url_re("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", std::regex::icase);
   std::copy( std::sregex_token_iterator(html.begin(), html.end(), url_re, 1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

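The difference between the two is that regex_token_iterator lets you choose which sub-matches are returned during the search (or, with -1, the text between matches).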
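As a sketch of selecting sub-matches, the token iterator also accepts a list of group indices and then yields those groups in turn for every match. The helpers split_ws and keys_and_values and the "key=number" format are assumptions made for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits on runs of whitespace: index -1 selects the text between matches.
std::vector<std::string> split_ws(const std::string& text)
{
    static const std::regex ws("\\s+");
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), ws, -1),
        std::sregex_token_iterator());
}

// Yields group 1 and then group 2 of every match: key, value, key, value, ...
std::vector<std::string> keys_and_values(const std::string& text)
{
    static const std::regex kv("([a-z]+)=([0-9]+)");
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), kv, {1, 2}),
        std::sregex_token_iterator());
}
```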

Exceptions

regex_error

The exception thrown to report errors from the regex library, such as an invalid pattern.

#include <regex>
#include <iostream>

int main()
{
    try {
        std::regex re("[a-b][a");
    } 

    catch (const std::regex_error& e) {
        std::cout << "regex_error caught: " << e.what() << '\n';
        if (e.code() == std::regex_constants::error_brack) {
            std::cout << "The code was error_brack\n";
        }
    }
}

Output (the exact message text varies between standard library implementations):

regex_error caught: The expression contained mismatched [ and ].
The code was error_brack
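When patterns come from user input, it can be convenient to turn the exception into a return value. A minimal sketch, with try_compile as a made-up helper name:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: compiles `pattern` into `out`. On failure it returns
// false and stores the error code in `code`.
bool try_compile(const std::string& pattern, std::regex& out,
                 std::regex_constants::error_type& code)
{
    try {
        out.assign(pattern);
        return true;
    } catch (const std::regex_error& e) {
        code = e.code();
        return false;
    }
}
```

Which error code a given bad pattern produces may differ between implementations, so callers should probably treat the code as diagnostic information only.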

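Summary

With these basics covered, C++ can be used for ordinary matching work. For example, extracting the Chinese words from a string: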

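#include <iostream>
#include <regex>
#include <string>

void test_chinese_re()
{
    // Assumes a UTF-8 source and execution character set. std::regex works on
    // single bytes here, so the class below matches UTF-8 byte values rather
    // than code points; std::wregex is the stricter choice.
    std::string text = "vice jax teemo, 老张  武松 ";
    std::regex reg(" ([\u4e00-\u9fa5]+) ");

    std::sregex_iterator pos(text.cbegin(), text.cend(), reg);
    std::sregex_iterator end;
    for (; pos != end; ++pos) {
        std::cout << "match:  " << pos->str() << std::endl;
        std::cout << " tag:   " << pos->str(1) << std::endl;
    }
}

Output:

match:   老张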
 tag:   老张
match:   武松
 tag:   武松
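For matching by code point rather than by byte, a wide-character variant is one option. This is a sketch under the assumption that wchar_t can hold the characters in question (true for the range U+4E00 to U+9FA5); cjk_runs is a made-up name, and printing wide strings is left out because it depends on locale setup.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of characters in U+4E00..U+9FA5.
std::vector<std::wstring> cjk_runs(const std::wstring& text)
{
    static const std::wregex re(L"[\u4e00-\u9fa5]+");
    std::vector<std::wstring> runs;
    for (std::wsregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        runs.push_back(it->str());
    }
    return runs;
}
```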