C++ regex && sed: removing terminal control sequences with regular expressions

cupidove

Article link: https://blog.csdn.net/cupidove/article/details/45871055

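In C++ there are three regex libraries to choose from: C++ regex (std::regex in <regex>), the C regex API, and Boost regex. When developing C++ on Windows, the latter two are not available out of the box, so if you want to get started quickly, C++ regex is clearly the most convenient choice. This article covers how to use it.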

C++ regex provides three main functions: regex_match, regex_search, and regex_replace.

regex_match

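regex_match tests whether a regular expression matches a target sequence; the example below shows how it is used. For a systematic treatment, see the reference documentation for regex_match.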

// regex_match example
#include <iostream>
#include <string>
#include <regex>

int main ()
{

  if (std::regex_match ("subject", std::regex("(sub)(.*)") ))
    std::cout << "string literal matched\n";

  std::string s ("subject");
  std::regex e ("(sub)(.*)");
  if (std::regex_match (s,e))
    std::cout << "string object matched\n";

  if ( std::regex_match ( s.begin(), s.end(), e ) )
    std::cout << "range matched\n";

  std::cmatch cm;    // same as std::match_results<const char*> cm;
  std::regex_match ("subject",cm,e);
  std::cout << "string literal with " << cm.size() << " matches\n";

  std::smatch sm;    // same as std::match_results<string::const_iterator> sm;
  std::regex_match (s,sm,e);
  std::cout << "string object with " << sm.size() << " matches\n";

  std::regex_match ( s.cbegin(), s.cend(), sm, e);
  std::cout << "range with " << sm.size() << " matches\n";

  // using explicit flags:
  std::regex_match ( "subject", cm, e, std::regex_constants::match_default );

  std::cout << "the matches were: ";
  for (unsigned i=0; i<sm.size(); ++i) {
    std::cout << "[" << sm[i] << "] ";
  }

  std::cout << std::endl;

  return 0;
}

Output:

string literal matched
string object matched
range matched
string literal with 3 matches
string object with 3 matches
range with 3 matches
the matches were: [subject] [sub] [ject]

regex_search

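regex_search is the other matching function; an example follows. The main difference between the two: regex_match succeeds only if the pattern matches the entire target sequence, whereas regex_search looks for a match anywhere inside it. For a systematic treatment, see the reference documentation for regex_search.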

// regex_search example
#include <iostream>
#include <regex>
#include <string>

int main(){
  std::string s ("this subject has a submarine as a subsequence");
  std::smatch m;
  std::regex e ("\\b(sub)([^ ]*)");   // matches words beginning by "sub"

  std::cout << "Target sequence: " << s << std::endl;
  std::cout << "Regular expression: /\\b(sub)([^ ]*)/" << std::endl;
  std::cout << "The following matches and submatches were found:" << std::endl;

  while (std::regex_search (s,m,e)) {
    for (auto x=m.begin();x!=m.end();x++) 
      std::cout << x->str() << " ";
    std::cout << "--> ([^ ]*) match " << m.format("$2") <<std::endl;
    s = m.suffix().str();
  }
}

Output:

Target sequence: this subject has a submarine as a subsequence
Regular expression: /\b(sub)([^ ]*)/
The following matches and submatches were found:
subject sub ject --> ([^ ]*) match ject
submarine sub marine --> ([^ ]*) match marine
subsequence sub sequence --> ([^ ]*) match sequence
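
The whole-sequence versus search distinction can be sketched as follows. This is a minimal illustration, not part of the original examples: the helper names `matches_whole`, `matches_anywhere`, and `find_all` are hypothetical. `find_all` uses `std::sregex_iterator` as an alternative to re-searching the suffix in a loop.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: true only if the whole of `text` matches `pattern`.
bool matches_whole(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical helper: true if some part of `text` matches `pattern`.
bool matches_anywhere(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// Hypothetical helper: collects every non-overlapping match in `text`.
std::vector<std::string> find_all(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        out.push_back(it->str());   // it->str() is the text of the whole match
    return out;
}
```

For instance, `matches_whole("this subject", "sub[a-z]*")` should be false while `matches_anywhere` with the same arguments is true, because only the latter accepts a match in the middle of the string.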

regex_replace

regex_replace replaces the text matched by a regular expression; an example follows. For a systematic treatment, see the reference documentation for regex_replace.

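// regex_replace example
#include <cstring>   // std::strlen
#include <iostream>
#include <regex>
#include <string>

int main() {
    char buf[20];   // large enough for the 6-character result plus '\0'
    const char *first = "axayaz";
    const char *last = first + std::strlen(first);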
    std::regex rx("a"); 
    std::string fmt("A"); 
    std::regex_constants::match_flag_type fonly = 
        std::regex_constants::format_first_only; 
 
    *std::regex_replace(&buf[0], first, last, rx, fmt) = '\0'; 
    std::cout << &buf[0] << std::endl; 
 
    *std::regex_replace(&buf[0], first, last, rx, fmt, fonly) = '\0'; 
    std::cout << &buf[0] << std::endl; 
 
    std::string str("adaeaf"); 
    std::cout << std::regex_replace(str, rx, fmt) << std::endl; 
 
    std::cout << std::regex_replace(str, rx, fmt, fonly) << std::endl; 
 
    return 0; 
}

Output:

AxAyAz
Axayaz
AdAeAf
Adaeaf
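
The format string passed to regex_replace can also refer to what was matched. The sketch below assumes the default ECMAScript format rules, where `$1`, `$2` stand for capture groups and `$&` for the whole match. The function names `swap_pairs` and `bracket_numbers` are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites every "key=value" pair as "value=key".
std::string swap_pairs(const std::string& line) {
    static const std::regex pair_re("(\\w+)=(\\w+)");
    return std::regex_replace(line, pair_re, "$2=$1");   // $1, $2: capture groups
}

// Hypothetical helper: wraps every run of digits in square brackets.
std::string bracket_numbers(const std::string& line) {
    static const std::regex number_re("\\d+");
    return std::regex_replace(line, number_re, "[$&]");  // $&: the whole match
}
```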

Splitting a string with a regular expression

#include <regex>
#include <string>
#include <vector>

std::vector<std::string> split(const std::string& input, const std::string& regex)
{
	// passing -1 as the submatch index parameter performs splitting
	std::regex re(regex);
	std::sregex_token_iterator first{ input.begin(), input.end(), re, -1 }, last;
	return { first, last };
}
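
As a rough sketch of how the same token iterator behaves with other submatch indices: -1 yields the text between matches, while a non-negative index yields that capture group from each match. `split_on` mirrors the split function above under a different name, and `capture_all` is a hypothetical companion, not something from the original article.

```cpp
#include <regex>
#include <string>
#include <vector>

// Same idea as split above: index -1 selects the text between delimiter matches.
std::vector<std::string> split_on(const std::string& input, const std::string& delimiter) {
    const std::regex re(delimiter);
    std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;
    return std::vector<std::string>(first, last);
}

// Hypothetical helper: index `group` selects that capture group of every match.
std::vector<std::string> capture_all(const std::string& input,
                                     const std::string& pattern, int group) {
    const std::regex re(pattern);
    std::sregex_token_iterator first{input.begin(), input.end(), re, group}, last;
    return std::vector<std::string>(first, last);
}
```

A call such as `split_on("a, b,c", ",\\s*")` is expected to produce the three tokens a, b, and c, and `capture_all("k1=v1;k2=v2", "(\\w+)=(\\w+)", 2)` the values v1 and v2.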

The pattern syntax of C++ regex (by default the ECMAScript grammar) is much the same as in other programming languages:

Special pattern characters (for matching characters that are hard to write literally):

characters	description	matches
`.`	not newline	any character except line terminators (LF, CR, LS, PS).
`\t`	tab (HT)	a horizontal tab character (same as `\u0009`).
`\n`	newline (LF)	a newline (line feed) character (same as `\u000A`).
`\v`	vertical tab (VT)	a vertical tab character (same as `\u000B`).
`\f`	form feed (FF)	a form feed character (same as `\u000C`).
`\r`	carriage return (CR)	a carriage return character (same as `\u000D`).
`\c`letter	control code	a control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. For example: `\ca` is the same as `\u0001`, `\cb` the same as `\u0002`, and so on...
`\x`hh	ASCII character	a character whose code unit value has an hex value equivalent to the two hex digits hh. For example: `\x4c` is the same as `L`, or `\x23` the same as `#`.
`\u`hhhh	unicode character	a character whose code unit value has an hex value equivalent to the four hex digits hhhh.
`\0`	null	a null character (same as `\u0000`).
`\`int	backreference	the result of the submatch whose opening parenthesis is the int-th (int shall begin by a digit other than `0`). See groups below for more info.
`\d`	digit	a decimal digit character
`\D`	not digit	any character that is not a decimal digit character
`\s`	whitespace	a whitespace character
`\S`	not whitespace	any character that is not a whitespace character
`\w`	word	an alphanumeric or underscore character
`\W`	not word	any character that is not an alphanumeric or underscore character
`\`character	character	the character character as it is, without interpreting its special meaning within a regex expression. Any character can be escaped except those which form any of the special character sequences above. Needed for: `^ $ \ . * + ? ( ) [ ] { } |`
`[`class`]`	character class	the target character is part of the class
`[^`class`]`	negated character class	the target character is not part of the class

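Note that in a C++ string literal the backslash (\) is itself an escape character, so every backslash in a pattern has to be doubled:

std::regex e1 ("\\d"); // \d -> matches a digit character

std::regex e2 ("\\\\"); // \\ -> matches a backslash character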
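
One way to avoid doubling backslashes is a C++11 raw string literal, which passes the pattern through unchanged. The sketch below is an illustration only: the two functions and the date pattern are invented for the example. It builds the same regex both ways.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: pattern written as a raw string literal, no doubling needed.
bool is_date_raw(const std::string& s) {
    static const std::regex re(R"(\d{4}-\d{2}-\d{2})");
    return std::regex_match(s, re);
}

// The same pattern written as an ordinary literal with doubled backslashes.
bool is_date_escaped(const std::string& s) {
    static const std::regex re("\\d{4}-\\d{2}-\\d{2}");
    return std::regex_match(s, re);
}
```

Both functions should agree on every input, since the regex engine receives the identical pattern text in each case.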

Quantifiers:

characters	times	effects
`*`	0 or more	The preceding atom is matched 0 or more times.
`+`	1 or more	The preceding atom is matched 1 or more times.
`?`	0 or 1	The preceding atom is optional (matched either 0 times or once).
`{`int`}`	int	The preceding atom is matched exactly int times.
`{`int`,}`	int or more	The preceding atom is matched int or more times.
`{`min`,`max`}`	between min and max	The preceding atom is matched at least min times, but not more than max.

Note that quantifiers are greedy by default: matching the pattern "(a+).*" against "aardvark" captures aa in the group, whereas the non-greedy "(a+?).*" captures only a.
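
The greedy versus non-greedy behaviour can be checked with a small sketch. `first_group` is a hypothetical helper, not part of the original article; it returns the text of capture group 1 when the whole string matches.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text captured by group 1 if `pattern` matches all of
// `text`, otherwise an empty string.
std::string first_group(const std::string& text, const std::string& pattern) {
    std::smatch m;
    if (std::regex_match(text, m, std::regex(pattern)) && m.size() > 1)
        return m[1].str();
    return "";
}
```

With this helper, `first_group("aardvark", "(a+).*")` yields aa and `first_group("aardvark", "(a+?).*")` yields a.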

Groups (for matching a sequence of several characters as a unit):

characters	description	effects
`(`subpattern`)`	Group	Creates a backreference.
`(?:`subpattern`)`	Passive group	Does not create a backreference.

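Note that the first form creates a backreference, which can be used to extract the matched text; the second does not, and therefore avoids that overhead.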
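
A quick way to see the difference between the two group forms is to count submatches. The sketch below uses made-up helper names: it reads `mark_count()` from the compiled regex and the size of the match results.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: number of capturing groups in `pattern`.
unsigned group_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Hypothetical helper: number of entries in the match results
// (the whole match plus one per capturing group), or 0 if there is no match.
std::size_t match_size(const std::string& text, const std::string& pattern) {
    std::smatch m;
    const std::regex re(pattern);
    return std::regex_match(text, m, re) ? m.size() : 0;
}
```

Here "(sub)(.*)" has two capturing groups, while "(?:sub)(.*)" has only one, so the match on "subject" holds three entries in the first case and two in the second.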

Assertions and alternatives:

characters	description	condition for match
`^`	Beginning of line	Either it is the beginning of the target sequence, or follows a line terminator.
`$`	End of line	Either it is the end of the target sequence, or precedes a line terminator.
`|`	Separator	Separates two alternative patterns or subpatterns.

Individual characters

[abc] matches a, b, or c.
[^xyz] matches any character other than x, y, or z.

Ranges

[a-z] matches any lowercase letter (a, b, c, ..., z).
[abc1-5] matches a, b, c, or a digit from 1 to 5.

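C++ regex also supports POSIX-like character classes. They are written inside a bracket expression, for example [[:digit:]]: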

class	description	equivalent (with regex_traits, default locale)
`[:alnum:]`	alpha-numerical character	isalnum
`[:alpha:]`	alphabetic character	isalpha
`[:blank:]`	blank character	isblank
`[:cntrl:]`	control character	iscntrl
`[:digit:]`	decimal digit character	isdigit
`[:graph:]`	character with graphical representation	isgraph
`[:lower:]`	lowercase letter	islower
`[:print:]`	printable character	isprint
`[:punct:]`	punctuation mark character	ispunct
`[:space:]`	whitespace character	isspace
`[:upper:]`	uppercase letter	isupper
`[:xdigit:]`	hexadecimal digit character	isxdigit
`[:d:]`	decimal digit character	isdigit
`[:w:]`	word character	isalnum (plus the underscore `_`)
`[:s:]`	whitespace character	isspace
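
A minimal sketch of the POSIX-like classes in use, assuming the default ECMAScript grammar and default locale. The function names are hypothetical. Note the double brackets: the class name sits inside a bracket expression.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` consists only of hexadecimal digits.
bool is_hex(const std::string& s) {
    static const std::regex re("[[:xdigit:]]+");
    return std::regex_match(s, re);
}

// Hypothetical helper: collapses each run of whitespace into a single space.
std::string squeeze_spaces(const std::string& s) {
    static const std::regex re("[[:space:]]+");
    return std::regex_replace(s, re, " ");
}
```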

sed examples:

Remove color codes (special characters) with sed

sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g"

Remove (color / special / escape / ANSI) codes from text with sed

Credit to the original authors of the command above, from which this one is adapted.

The difference is in the final bracket expression:

Theirs: [m|K]

Theirs is supposed to remove \E[NUMBERS;NUMBERS[m OR K]

This statement is incorrect in 2 ways.

1. The letters m and K are only two of the more than 20 letters that can end these sequences.

2. Inside [], alternation is already implied, so [m|K] also matches sequences ending in a literal |, which is not intended.

This one: [a-zA-Z]

This resolves the "OR" issue noted above, and takes care of all sequences, as they all end with a lower or upper cased letter.

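This removes the escape sequences of the form ESC [ parameters letter, which covers colors, cursor movement, and line erasing. Sequences with other parameter characters (such as ?) or escapes that do not start with ESC [ are not matched by this pattern. Both commands assume GNU sed, which understands the \x1B escape.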

sed "s,\x1B\[[0-9;]*[a-zA-Z],,g"
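
The same pattern carries over to C++ regex almost unchanged. The following is a minimal sketch rather than a complete ANSI parser; the function name `strip_ansi` is an assumption made for this example. It applies the second sed pattern through std::regex_replace.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes sequences of the form ESC [ digits/semicolons letter,
// the same set that the sed pattern \x1B\[[0-9;]*[a-zA-Z] removes.
std::string strip_ansi(const std::string& text) {
    static const std::regex csi("\\x1B\\[[0-9;]*[a-zA-Z]");
    return std::regex_replace(text, csi, "");
}
```

For example, `strip_ansi("\033[31mred\033[0m plain")` should return "red plain". As with the sed version, sequences outside this form are left in place.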
