Introduction to Regular Expression With Modern C++


Original article: https://medium.com/dev-genius/introduction-to-regular-expression-with-modern-c-2d6e50e6d5a7


Regular expressions (regex for short) are a much-hated and underrated topic in Modern C++. Yet, used correctly, regex can spare you from writing many lines of code. If you have spent a fair amount of time in the industry without knowing regex, you are missing out on 20–30% productivity. In that case, I highly recommend learning regex, as it is a one-time investment (similar to the "learn once, write anywhere" philosophy).

/!\: This article was originally published on my blog. If you are interested in receiving my latest articles, please sign up for my newsletter.

Initially, I planned to cover regex in general in this article as well. But that doesn't make sense, as there are already people and tutorials out there that teach regex better than I can. Still, I have left a small section on Motivation & Learning Regex. For the rest of the article, I will focus on the functionality C++ provides for working with regex. And if you already know regex, you can use the above mind-map as a refresher.

Pointer: The C++ standard library offers several different "flavours" of regex syntax, but the default flavour (the one you should always use, and the one I demonstrate here) was borrowed wholesale from the ECMAScript standard.

Motivation

I know it's a pathetic and somewhat confusing tool-set. Consider the regex pattern below, which extracts a time in 24-hour format, i.e. HH:MM.

\b([01]?[0-9]|2[0-3]):([0-5]\d)\b

I mean! Who wants to work with this cryptic text?
Whatever is running through your mind is 100% reasonable. In fact, I procrastinated on learning regex twice for the same reason. But believe me, the ugly-looking things are not that bad.
The approach (↓) I describe here won't take more than 2–3 hours, and you will learn regex intuitively. After learning it, you will see the return on investment compound over time.
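To make the HH:MM pattern above less cryptic, here is a minimal sketch that applies it with std::regex_search and annotates each piece. The helper name extract_time and its return type are my own choices for illustration, not something from the standard library or the original article.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {hour, minute} for the first HH:MM time
// found in `text`, or std::nullopt when the text contains none.
std::optional<std::pair<int, int>> extract_time(const std::string& text) {
    // \b             word boundary
    // ([01]?[0-9]    hours 0-19, with an optional leading 0 or 1
    //  |2[0-3])      or hours 20-23
    // :              literal colon
    // ([0-5]\d)      minutes 00-59
    // \b             word boundary
    static const std::regex r(R"(\b([01]?[0-9]|2[0-3]):([0-5]\d)\b)");
    std::smatch m;
    if (!std::regex_search(text, m, r)) {
        return std::nullopt;
    }
    return std::make_pair(std::stoi(m[1].str()), std::stoi(m[2].str()));
}
```

Calling extract_time("Build started at 09:45 sharp") should give the pair (9, 45), while a text such as "at 25:00" gives no result because neither alternative accepts 25 as an hour.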

Learning Regex

Don't google too much trying to work out which tutorial is best. In fact, don't waste time on such analysis, because there is no point in doing so. At this point (well, if you don't know regex yet), what really matters is "Getting Started" rather than "What Is Best!".
Just go to https://regexone.com without overthinking it, and complete all the lessons. Trust me here: I have explored many articles, courses (<= this one is free, BTW) & books, but this is the best of them all for getting started without losing motivation.
And after that, if you still have an appetite for more problems & exercises, consider the links below:

std::regex & std::regex_error Example

#include <cassert>
#include <cstdlib>
#include <iostream>
#include <regex>

int main() {
    try {
        static const auto r = std::regex(R"(\)"); // Escape sequence error
    } catch (const std::regex_error &e) {
        // The what() text is implementation-defined, so print it
        // rather than assert on one library's exact wording.
        std::cout << e.what() << '\n';
        assert(e.code() == std::regex_constants::error_escape);
    }
    return EXIT_SUCCESS;
}

You see, I am using raw string literals. You can also use a normal string, but in that case you have to use a double backslash for an escape sequence.
The current implementation of std::regex is slow (as it needs to interpret the regex and build its data structures at runtime), bloated, and unavoidably requires heap allocation (it is not allocator-aware). So beware if you are using std::regex in a loop (see C++ Weekly - Ep 74 - std::regex optimize by Jason Turner). Also, the only member function that I think could be of use is std::regex::mark_count(), which returns the number of capture groups.
Moreover, if you build a regex pattern from multiple strings at run time, you may need exception handling, i.e. std::regex_error, to validate its correctness.
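The two points above (validating a pattern assembled at run time, and mark_count()) can be sketched as follows. The helper names is_valid_pattern and capture_group_count are hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` compiles with the default
// (ECMAScript) grammar, false when std::regex throws std::regex_error.
bool is_valid_pattern(const std::string& pattern) {
    try {
        const std::regex r(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Hypothetical helper: number of capture groups in a pattern that is
// already known to be valid.
unsigned capture_group_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}
```

Note that R"(\d+)" and "\\d+" are the same pattern; the raw literal only saves the extra backslash. Non-capturing groups such as (?:com|in) are not counted by mark_count().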

std::regex_search Example

#include <cassert>
#include <cstdlib>
#include <regex>
#include <string>
using namespace std;

int main() {
    const string input = "ABC:1->   PQR:2;;;   XYZ:3<<<"s;
    const regex r(R"((\w+):(\w+);)");
    smatch m;

    if (regex_search(input, m, r)) {
        assert(m.size() == 3);
        assert(m[0].str() == "PQR:2;");                // Entire match
        assert(m[1].str() == "PQR");                   // Substring that matches 1st group
        assert(m[2].str() == "2");                     // Substring that matches 2nd group
        assert(m.prefix().str() == "ABC:1->   ");      // All before 1st character match
        assert(m.suffix().str() == ";;   XYZ:3<<<");   // All after last character match

        // for (string &&str : m) { // Alternatively. You can also do
        //     cout << str << endl;
        // }
    }
    return EXIT_SUCCESS;
}

smatch is the specialization of std::match_results for std::string; it stores the information about the matches so that it can be retrieved.
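Since an smatch behaves like a container of sub-matches, the commented-out loop in the example can be turned into a small function. This is only a sketch; first_match_groups is a name I made up for it.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: copies the entire first match (index 0) and every
// capture group of it into a vector; the vector is empty when nothing matches.
std::vector<std::string> first_match_groups(const std::string& input,
                                            const std::regex& r) {
    std::vector<std::string> groups;
    std::smatch m;
    if (std::regex_search(input, m, r)) {
        for (const auto& sub : m) {  // each element is a std::ssub_match
            groups.push_back(sub.str());
        }
    }
    return groups;
}
```

With the input and pattern from the example above, the result would be the three strings "PQR:2;", "PQR" and "2".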

std::regex_match Example

A short & sweet example that you will find in almost every regex book is email validation. And that is where our std::regex_match function fits perfectly.

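#include <cassert>
#include <cstdlib>
#include <regex>
#include <string_view>
using namespace std;

bool is_valid_email_id(string_view str) {
    static const regex r(R"(\w+@\w+\.(?:com|in))");
    return regex_match(str.begin(), str.end(), r); // data() may lack a '\0'
}

int main() {
    assert(is_valid_email_id("vishalchovatiya@ymail.com") == true);
    assert(is_valid_email_id("@abc.com") == false);
    return EXIT_SUCCESS;
}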

I know this is not a foolproof email validator regex pattern, but that is not my intention either.
Rather, you should wonder why I have used std::regex_match and not std::regex_search. The rationale is simple: std::regex_match matches the whole input sequence.
Also noteworthy is the static regex object, which avoids constructing ("compiling/interpreting") a new regex object every time the function is entered.
The irony of the tiny code snippet above is that it produces around 30k lines of assembly, even with the -O3 flag. That is ridiculous. But don't worry, this has already been brought to the attention of the ISO C++ community, and we may get some updates soon. Meanwhile, we do have other alternatives (mentioned at the end of this article).

Difference Between std::regex_match & std::regex_search?

You might be wondering why we have two functions doing almost the same work. Even I had that doubt initially. But after reading the description provided by cppreference over and over, I found the answer. To explain it, I have created an example (obviously with the help of StackOverflow):

int main() {
    const string input = "ABC:1->   PQR:2;;;   XYZ:3<<<"s;
    const regex r(R"((\w+):(\w+);)");
    smatch m;

    assert(regex_match(input, m, r) == false);

    assert(regex_search(input, m, r) == true && m.ready() == true && m[1] == "PQR");

    return EXIT_SUCCESS;
}

std::regex_match only returns true when the entire input sequence has been matched, while std::regex_search succeeds even if only a sub-sequence matches the regex.
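One way to see the relationship is to put the two calls side by side and add a third variant where std::regex_search is given an anchored pattern. This is a sketch with hypothetical function names; for this particular pattern the anchored search and std::regex_match agree, which I am not claiming holds for every pattern.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Whole-input check: the entire string must be one NAME:DIGIT pair.
bool is_pair(const std::string& s) {
    static const std::regex r(R"((\w+):(\d))");
    return std::regex_match(s, r);
}

// Sub-sequence check: a NAME:DIGIT pair anywhere in the string is enough.
bool contains_pair(const std::string& s) {
    static const std::regex r(R"((\w+):(\d))");
    return std::regex_search(s, r);
}

// regex_search tied to both ends of the input with ^ and $.
bool is_pair_anchored(const std::string& s) {
    static const std::regex r(R"(^(\w+):(\d)$)");
    return std::regex_search(s, r);
}
```

So "ABC:1" passes all three checks, whereas "ABC:1->" only passes contains_pair.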

std::regex_iterator Example

std::regex_iterator is helpful when you need very detailed information about matches & sub-matches.

#define C_ALL(X) cbegin(X), cend(X)

int main() {
    const string input = "ABC:1->   PQR:2;;;   XYZ:3<<<"s;
    const regex r(R"((\w+):(\d))");

    const vector<smatch> matches{
        sregex_iterator{C_ALL(input), r},
        sregex_iterator{}
    };

    assert(matches[0].str(0) == "ABC:1"
        && matches[0].str(1) == "ABC"
        && matches[0].str(2) == "1");

    assert(matches[1].str(0) == "PQR:2"
        && matches[1].str(1) == "PQR"
        && matches[1].str(2) == "2");

    assert(matches[2].str(0) == "XYZ:3"
        && matches[2].str(1) == "XYZ"
        && matches[2].str(2) == "3");

    return EXIT_SUCCESS;
}

Earlier (in C++11), nothing stopped you from constructing a std::regex_iterator from a temporary regex object, which left the iterator referring to a regex that had already been destroyed. This was rectified in C++14 with a deleted overload that rejects temporaries at compile time.
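As an illustration of the "detailed information" each match carries, the sketch below walks the matches in a loop and records where every one starts. find_all is a hypothetical helper; note that the regex is passed in as a named object that outlives the iterator, never as a temporary.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects (start offset, matched text) for every
// non-overlapping match of `r` in `input`.
std::vector<std::pair<std::size_t, std::string>>
find_all(const std::string& input, const std::regex& r) {
    std::vector<std::pair<std::size_t, std::string>> found;
    // A default-constructed sregex_iterator is the end-of-sequence iterator.
    for (std::sregex_iterator it(input.begin(), input.end(), r), end; it != end; ++it) {
        found.emplace_back(static_cast<std::size_t>(it->position()), it->str());
    }
    return found;
}
```

For the input used throughout this article and the pattern (\w+):(\d), the matches start at offsets 0, 10 and 21.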

std::regex_token_iterator Example

std::regex_token_iterator is the utility you are going to use 80% of the time. It is a slight variation on std::regex_iterator. The difference between std::regex_iterator & std::regex_token_iterator is:
- std::regex_iterator points to match results.
- std::regex_token_iterator points to sub-matches.
In std::regex_token_iterator, each iterator refers to only a single sub-match.

#define C_ALL(X) cbegin(X), cend(X)

int main() {
    const string input = "ABC:1->   PQR:2;;;   XYZ:3<<<"s;
    const regex r(R"((\w+):(\d))");

    // Note: vector<string> here, unlike vector<smatch> as in std::regex_iterator
    const vector<string> full_match{
        sregex_token_iterator{C_ALL(input), r, 0}, // Mark `0` here i.e. whole regex match
        sregex_token_iterator{}
    };
    assert((full_match == decltype(full_match){"ABC:1", "PQR:2", "XYZ:3"}));

    const vector<string> cptr_grp_1st{
        sregex_token_iterator{C_ALL(input), r, 1}, // Mark `1` here i.e. 1st capture group
        sregex_token_iterator{}
    };
    assert((cptr_grp_1st == decltype(cptr_grp_1st){"ABC", "PQR", "XYZ"}));

    const vector<string> cptr_grp_2nd{
        sregex_token_iterator{C_ALL(input), r, 2}, // Mark `2` here i.e. 2nd capture group
        sregex_token_iterator{}
    };
    assert((cptr_grp_2nd == decltype(cptr_grp_2nd){"1", "2", "3"}));

    return EXIT_SUCCESS;
}
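The example above makes one pass per capture group. std::regex_token_iterator can also take a list of sub-match indices, in which case it yields those groups in turn for every match. The following is a minimal sketch of that overload; parse_pairs is a hypothetical helper that turns the NAME:DIGIT pairs into a map.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: builds {name -> digit} from every NAME:DIGIT pair.
std::map<std::string, int> parse_pairs(const std::string& input) {
    static const std::regex r(R"((\w+):(\d))");
    std::map<std::string, int> result;
    // {1, 2} asks for capture group 1, then capture group 2, of each match.
    std::sregex_token_iterator it(input.begin(), input.end(), r, {1, 2});
    const std::sregex_token_iterator end;
    while (it != end) {
        const std::string key = it->str();    // capture group 1
        ++it;
        const std::string value = it->str();  // capture group 2
        ++it;
        result[key] = std::stoi(value);
    }
    return result;
}
```

Every match contributes exactly two tokens here because both groups always participate in a match of this pattern, which is why the loop can advance the iterator twice per round.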

Inverted Match With std::regex_token_iterator

#define C_ALL(X) cbegin(X), cend(X)

int main() {
    const string input = "ABC:1->   PQR:2;;;   XYZ:3<<<"s;
    const regex r(R"((\w+):(\d))");

    const vector<string> inverted{
        sregex_token_iterator{C_ALL(input), r, -1}, // `-1` = parts that are not matched
        sregex_token_iterator{}
    };
    assert((inverted == decltype(inverted){
                            "",
                            "->   ",
                            ";;;   ",
                            "<<<",
                        }));

    return EXIT_SUCCESS;
}

std::regex_replace Example

// Takes const string& rather than string_view: string_view::data() need not be null-terminated
string transform_pair(const string &text, regex_constants::match_flag_type f = {}) {
    static const auto r = regex(R"((\w+):(\d))");
    return regex_replace(text, r, "$2", f);
}

int main() {
    assert(transform_pair("ABC:1, PQR:2"s) == "1, 2"s);

    // Things that aren't matched are not copied
    assert(transform_pair("ABC:1, PQR:2"s, regex_constants::format_no_copy) == "12"s);
    return EXIT_SUCCESS;
}

As you can see, in the 2nd call of transform_pair we passed the flag std::regex_constants::format_no_copy, which says not to copy the parts that aren't matched. There are many such useful flags under std::regex_constants.
Also, we have constructed a fresh string to hold the results. But what if we do not want a new string, and would rather append the results directly somewhere (probably a container, a stream, or an already existing string)? Guess what! The standard library covers this too with an overloaded std::regex_replace, as follows:

#define C_ALL(X) cbegin(X), cend(X)

int main() {
    const string input = "ABC:1->   PQR:2;;;   XYZ:3<<<"s;
    const regex r(R"(-|>|<|;| )");

    // Prints "ABC:1     PQR:2      XYZ:3   "
    regex_replace(ostreambuf_iterator<char>(cout), C_ALL(input), r, " ");
    return EXIT_SUCCESS;
}
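To show another of those flags, and a format string that uses more than one capture group, here is a small sketch. swap_pair is a hypothetical helper of mine; std::regex_constants::format_first_only is the standard flag that stops after the first replacement.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: rewrites NAME:DIGIT as DIGIT:NAME.
// $1 and $2 in the format string refer to the 1st and 2nd capture group.
std::string swap_pair(const std::string& text,
                      std::regex_constants::match_flag_type flags =
                          std::regex_constants::format_default) {
    static const std::regex r(R"((\w+):(\d))");
    return std::regex_replace(text, r, "$2:$1", flags);
}
```

By default every pair is rewritten; with format_first_only only the first pair changes and the rest of the input is copied through untouched.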

Use Cases

Splitting a String With Delimiter

Although std::strtok is the most suitable & optimal candidate for such a task, here is how you can do it with regex, just as a demonstration:

#define C_ALL(X) cbegin(X), cend(X)

vector<string> split(const string& str, string_view pattern) {
    // Iterator constructor: string_view::data() need not be null-terminated
    const auto r = regex(pattern.begin(), pattern.end());
    return vector<string>{
        sregex_token_iterator(C_ALL(str), r, -1),
        sregex_token_iterator()
    };
}

int main() {
    assert((split("/root/home/vishal", "/")
                == vector<string>{"", "root", "home", "vishal"}));
    return EXIT_SUCCESS;
}
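The split above keeps the empty token produced by the leading delimiter. If that is not wanted, one possible variant filters empty tokens out while iterating. This is only a sketch, and split_nonempty is a name I introduced for it.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical variant of split(): same -1 (inverted match) technique,
// but tokens of length zero are skipped.
std::vector<std::string> split_nonempty(const std::string& str,
                                        const std::string& pattern) {
    const std::regex r(pattern);
    std::vector<std::string> tokens;
    for (std::sregex_token_iterator it(str.begin(), str.end(), r, -1), end;
         it != end; ++it) {
        if (it->length() > 0) {
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

Because the delimiter is itself a regex, a pattern such as \s*,\s* also swallows the whitespace around each comma.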

Trim Whitespace From a String

string trim(const string &text) {
    // Note: this removes every run of whitespace, not only leading/trailing
    static const auto r = regex(R"(\s+)");
    return regex_replace(text, r, "");
}

int main() {
    assert(trim("12   3 4      5"s) == "12345"s);
    return EXIT_SUCCESS;
}
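The function above strips all whitespace. If only the leading and trailing whitespace should go, a possible variant anchors the pattern to the two ends of the input. This is a sketch under the assumption that the text is a single line; trim_ends is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical variant: removes whitespace only at the start (^\s+) and
// at the end (\s+$) of the input, leaving inner whitespace alone.
std::string trim_ends(const std::string& text) {
    static const std::regex r(R"(^\s+|\s+$)");
    return std::regex_replace(text, r, "");
}
```

For example, trim_ends("  12 3  ") keeps the space between 12 and 3.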

Finding Lines Containing or Not Containing Certain Words From a File

string join(const vector<string>& words, const string& delimiter) {
    // Precondition: `words` is not empty
    return accumulate(next(begin(words)), end(words), words[0],
            [&delimiter](const string& p, const string& word)
            {
                return p + delimiter + word;
            });
}

vector<string> lines_containing(const string& file, const vector<string>& words) {
    auto prefix = "^.*?\\b("s;
    auto suffix = ")\\b.*$"s;

    //  ^.*?\b(one|two|three)\b.*$
    const auto pattern = move(prefix) + join(words, "|") + move(suffix);
    const regex r(pattern); // Build the regex once, not once per line

    ifstream        infile(file);
    vector<string>  result;

    for (string line; getline(infile, line);) {
        if (regex_match(line, r)) {
            result.emplace_back(move(line));
        }
    }

    return result;
}

int main() {
    assert((lines_containing("test.txt", {"one","two"})
                                        == vector<string>{"This is one",
                                                          "This is two"}));
    return EXIT_SUCCESS;
}
/* test.txt
This is one
This is two
This is three
This is four
*/

The same goes for finding lines that do not contain the words, using the pattern ^((?!(one|two|three)).)*$.
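A minimal sketch of that negative-lookahead pattern applied to a single line follows. lacks_words is a hypothetical helper; unlike the "containing" pattern it has no \b, so a word that merely contains "one" (such as "someone") is rejected as well.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `line` contains none of one/two/three.
// (?!(one|two|three)) refuses to step over any position where one of the
// words starts, so the whole line can only match if no such position exists.
bool lacks_words(const std::string& line) {
    static const std::regex r(R"(^((?!(one|two|three)).)*$)");
    return std::regex_match(line, r);
}
```

Plugging this predicate into the getline loop of lines_containing would give the "not containing" counterpart.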

Finding Files in a Directory

namespace fs = std::filesystem;

vector<fs::directory_entry> find_files(const fs::path &path, string_view rg) {
    vector<fs::directory_entry> result;
    // Iterator constructor: string_view::data() need not be null-terminated
    const regex r(rg.begin(), rg.end());
    copy_if(
        fs::recursive_directory_iterator(path),
        fs::recursive_directory_iterator(),
        back_inserter(result),
        [&r](const fs::directory_entry &entry) {
            return fs::is_regular_file(entry.path()) &&
                   regex_match(entry.path().filename().string(), r);
        });
    return result;
}

int main() {
    const auto pattern    = R"(\w+\.png)";
    const auto result     = find_files(fs::current_path(), pattern);
    for (const auto &entry : result) {
        cout << entry.path().string() << endl;
    }
    return EXIT_SUCCESS;
}

Tips For Using Regex-In-General

- Use raw string literals for describing the regex pattern in C++.
- Use a regex validating tool like https://regex101.com. What I like about regex101 is its code generation & time-taken (helpful when optimizing a regex) features.
- Also, try to add the explanation generated by the validation tool as a comment right above the regex pattern in your code.
- Performance:
  - If you are using alternation, try to arrange the options in order of decreasing probability, like com|net|org.
  - Try to use lazy quantifiers if possible.
  - Use non-capture groups wherever possible.
  - Disable backtracking.
  - Using a negated character class is more efficient than using a lazy dot.
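To illustrate the last tip, the sketch below extracts the first double-quoted string in two ways: with a negated character class and with a lazy dot. Both hypothetical helpers return the same text; the efficiency claim is the tip's, and I have not measured it here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Negated character class: [^"]* consumes everything up to the next quote
// in one go and can never run past it.
std::string first_quoted(const std::string& text) {
    static const std::regex r(R"re("([^"]*)")re");
    std::smatch m;
    return std::regex_search(text, m, r) ? m[1].str() : std::string{};
}

// Lazy dot: .*? grows one character at a time, checking for the closing
// quote after each step.
std::string first_quoted_lazy(const std::string& text) {
    static const std::regex r(R"re("(.*?)")re");
    std::smatch m;
    return std::regex_search(text, m, r) ? m[1].str() : std::string{};
}
```

The raw string literals use a custom delimiter (re) because the pattern itself contains the sequence )" that would otherwise end the literal early.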

Parting Words

It's not that you will use regex only with C++ or any other programming language. I myself use it mostly in the IDE (in VS Code, to analyse log files) & on the Linux terminal. But bear in mind that overusing regex gives a feeling of cleverness, and it's a great way to make your co-workers (and anyone else who needs to work with your code) very angry with you. Also, regex is overkill for most parsing tasks that you'll face in your daily work.

Regexes really shine for complicated tasks where hand-written parsing code would be just as slow anyway, and for extremely simple tasks where the readability and robustness of regular expressions outweigh their performance costs.

One more notable thing is that the current regex implementations (as of 19th June 2020) in the standard libraries have performance & code-bloat issues. So choose wisely between the Boost, CTRE and standard library versions. Most probably you will go with Hana Dusíková's work on Compile Time Regular Expressions. Her CppCon talks from 2018 & 2019 would also be helpful, especially if you plan to use regex in embedded systems.