re: Regular Expression Quick Start, a Compendium

非正经研究生

Categories: 搞笑开发, linux

Article link: https://blog.csdn.net/paulkg12/article/details/85253109

Image notes

The text version is here

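Preface
http://www.regular-expressions.info/quickstart.html

This article is a brief overview. For a complete treatment, see http://www.regular-expressions.info/tutorial.html

Different programming languages may use different regex flavors; these notes describe the most widely used one.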

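Warm-up: practice in subl (Sublime Text)

Find leading whitespace at the start of a line: ^\s+
Locate function openings written as ){ (non-K&R style): \)\s{0,3}\{
Find image statements in a Markdown file: !.+?\)
The ? makes the match lazy, which is probably the right way to locate this kind of line.

Use [\s][0-9].+? KB to find every match, then press Shift+End and delete everything after it, which gives:
[image]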
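
The three search patterns above can also be tried outside the editor. This is a minimal sketch using Python's re module, which is my choice of language and not something the original notes use. The sample text is made up for illustration, and Sublime's own engine may differ in small details.

```python
import re

# Hypothetical sample text, invented for this sketch.
source = "    int main(void) {\n![logo](img/logo.png) and more text\n"

# ^\s+ with re.MULTILINE: whitespace at the start of any line.
leading = re.findall(r"^\s+", source, flags=re.MULTILINE)

# \)\s{0,3}\{ : a closing parenthesis, up to three whitespace characters, then {.
brace = re.search(r"\)\s{0,3}\{", source)

# !.+?\) : lazily match from ! up to the first closing parenthesis.
image = re.search(r"!.+?\)", source)

print(leading)        # ['    ']
print(brace.group())  # ) {
print(image.group())  # ![logo](img/logo.png)
```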

Text Patterns and Matches

A regular expression, or regex for short, is a pattern describing a certain amount of text. On this website, regular expressions are highlighted in red as regex. This is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text regex. Matches are highlighted in blue on this site. We use the term “string” to indicate the text that the regular expression is applied to. Strings are highlighted in green.

Characters with special meanings in regular expressions are highlighted in various different colors. The regex (?x)([Rr]egexp?)? shows meta tokens in purple, grouping in green, character classes in orange, quantifiers and other special tokens in blue, and escaped characters in gray.

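In short: a regular expression is called a regex. On that site, regexes are marked in red (for example, regex), matches found in the text are marked in blue, and the sample strings are marked in green.

Characters with special meanings get their own colors. For example, in (?x)([Rr]egexp?)? the meta token (?x) is purple, the grouping () is green, the character class [Rr] is orange, quantifiers and other special tokens are blue, and escaped characters are gray.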

Literal Characters

The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. If the string is Jack is a boy, it matches the a after the J.
This regex can match the second a too. It only does so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its “Find Next” or “Search Forward” function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match.

Twelve characters have special meanings in regular expressions: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a special meaning.

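The most basic regex is a single literal character, such as a. It matches the first occurrence of that character in the string: in Jack is a boy, it matches the a after the J.

The engine only looks for the next a when you tell it to. In a text editor this is the built-in "Find Next" function; in a programming language it is usually a separate function that you have to call yourself.

There are 12 characters with special meanings: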

the backslash \,
the caret ^,
the dollar sign $,
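the period or dot ., which matches any character (except line breaks in most flavors),
the vertical bar or pipe symbol |,
the question mark ?, which matches the preceding token at most once (handy for finding calls such as foo() with at most one character between the parentheses: \(.?\)). Note that ? on its own is still greedy; it only makes matching lazy when placed after another quantifier such as * or +,
the asterisk or star *, which matches zero or more times,
the plus sign +, which matches one or more times,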
the opening parenthesis (,
the closing parenthesis ),
the opening square bracket [,
and the opening curly brace {.
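Using these characters on their own may cause errors you did not intend. By the way, they are called metacharacters.

If you want to use one of these special characters as a literal in a regex, escape it with a backslash.
For example, to match 1+1=2 the correct regex is 1\+1=2; otherwise the plus sign has a special meaning (described later).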
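
To make the points about literals, "find next", and escaping concrete, here is a small sketch in Python's re module (my choice of language; the notes themselves are language-neutral). re.search, re.finditer, and re.escape are standard library calls; the variable names are only illustrative.

```python
import re

text = "Jack is a boy"

# A single literal character matches its first occurrence.
first_pos = re.search(r"a", text).start()

# "Find next": iterate over every match instead of stopping at the first.
positions = [m.start() for m in re.finditer(r"a", text)]

# Unescaped, + is a quantifier, so 1+1=2 means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", "1+1=2")
escaped = re.search(r"1\+1=2", "1+1=2")

# re.escape builds an escaped pattern from literal text.
auto = re.escape("1+1=2")

print(first_pos, positions)   # 1 [1, 8]
print(unescaped)              # None
print(escaped.group())        # 1+1=2
```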


Character Classes or Character Sets

A “character class” matches only one out of several characters. To match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. A character class matches only a single character. gr[ae]y does not match graay, graey or any such thing. The order of the characters inside a character class does not matter.
You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X.
Typing a caret after the opening square bracket negates the character class. The result is that the character class matches any character that is not in the character class. q[^x] matches qu in question. It does not match Iraq since there is no character after the q for the negated character class to match.

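A character class lets you match one character out of several. Note that [] is effectively a multiple-choice question: [abcdefg] offers several fine-grained candidates and keeps exactly one of them each time. PS: the coarse-grained counterpart is alternation, as in (cat|dog) food.
To match an a or an e, use [ae]. For example, gr[ae]y matches both gray and grey.
Remember that a character class matches only a single character, so gr[ae]y cannot match graay or graey.
The order of the characters inside a character class does not matter.

You can use a hyphen - to specify a range inside a character class. For example, [0-9] matches a single digit from 0 to 9. You can combine several ranges: [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can also combine ranges with single characters: [0-9a-fxA-FX] matches a hexadecimal digit or the letter X.

Typing a caret ^ right after the opening bracket negates the character class, so it matches any single character that is not in the class. For example, q[^x] matches qu in question; it does not match Iraq, because nothing follows the q.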
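
The character class examples above can be checked directly. This is a sketch in Python's re module; the test strings other than gray, grey, question, and Iraq are my own.

```python
import re

# [ae] picks exactly one character, so graay and graey are not matched.
gray = re.findall(r"gr[ae]y", "gray grey graay graey")

# Ranges combined: one hexadecimal digit at a time.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F, g")

# A negated class still has to consume one character.
in_question = re.search(r"q[^x]", "question")
in_iraq = re.search(r"q[^x]", "Iraq")

print(gray)                 # ['gray', 'grey']
print(hex_digits)           # ['0', '1', 'F']
print(in_question.group())  # qu
print(in_iraq)              # None
```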

2017/12/15 10:14

Shorthand characters let you search for digits, word characters, and whitespace
Shorthand Character Classes

\d matches a single character that is a digit, \w matches a “word character” (alphanumeric characters plus underscore), and \s matches a whitespace character (includes tabs and line breaks). The actual characters matched by the shorthands depends on the software you’re using. In modern applications, they include non-English letters and numbers.

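\d matches a digit, \w matches any letter, digit, or underscore, and \s matches a whitespace character such as a tab or a line break (a plain space counts too). Note: the exact set depends on your software; modern regex implementations may include non-English letters and digits.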
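
A short sketch of the shorthands in Python's re module, including the note about non-English letters. The sample strings are made up; re.ASCII is the standard flag that restricts \w, \d, and \s to ASCII in Python.

```python
import re

sample = "id_42\tok\n"

digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)      # underscore counts as a word character
spaces = re.findall(r"\s", sample)      # tab and line feed

# In Python 3, \w on str patterns is Unicode-aware unless re.ASCII is given.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, flags=re.ASCII)

print(digits)       # ['4', '2']
print(words)        # ['id_42', 'ok']
print(ascii_words)  # ['na', 've', 'caf']
```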

Finding non-printable characters: tab, carriage return, line feed, escape
Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.
If your application supports Unicode, use \uFFFF or \x{FFFF} to insert a Unicode character. \u20AC or \x{20AC} matches the euro currency sign.
If your application does not support Unicode, use \xFF to match a specific character by its hexadecimal index in the character set. \xA9 matches the copyright symbol in the Latin-1 character set.
All non-printable characters can be used directly in the regular expression, or as part of a character class.

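Of these non-printable characters, Unix normally uses a single 0x0A ("\n") for a line break, Windows normally uses the two characters 0x0D 0x0A ("\r\n"), and classic Mac OS (before OS X) used a carriage return CR ("\r").
There are also some more exotic ones: \a is the bell (0x07); \e is the escape character; \f is form feed; \v is described as matching a vertical tab, but in my test it also matched characters like \n (in some flavors, such as PCRE, \v is a shorthand for any vertical whitespace, which includes \n).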
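
The line-ending differences can be handled with one alternation. This sketch uses Python's re module with made-up input. As far as I know, Python's re accepts \uFFFF but not \x{FFFF} or \e, and its \v matches only the vertical tab, so the \v behaviour noted above is flavor-specific.

```python
import re

# Hypothetical text mixing the three line-ending conventions.
mixed = "unix\nwindows\r\nold mac\rend"

# Try \r\n first so a Windows line ending counts as one separator.
lines = re.split(r"\r\n|\r|\n", mixed)

# \t matches a tab character.
tab_count = len(re.findall(r"\t", "a\tb\tc"))

# \u20AC matches the euro sign.
euro = re.search(r"\u20AC", "price: 5\u20ac")

# In Python, \v is only the vertical tab (0x0B), not a line feed.
v_matches_newline = re.search(r"\v", "\n") is not None

print(lines)              # ['unix', 'windows', 'old mac', 'end']
print(v_matches_newline)  # False
```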

Anchors let you quickly locate the start and end of lines and words
Anchors
Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string. Most regex engines have a “multi-line” mode that makes ^ match after any line break, and $ before any line break. E.g. ^b matches only the first b in bob.
\b matches at a word boundary. A word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w. \b also matches at the start and/or end of the string if the first and/or last characters in the string are word characters. \B matches at every position where \b cannot match.

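Anchors do not match any characters; as the name suggests, they match positions.
^ is the start of a line (strictly, the start of the string unless multi-line mode is on).
$ is the end of a line (likewise, the end of the string unless multi-line mode is on).
\b and \B are a bit confusing at first: \b matches at a word boundary, that is, between a \w character and a non-\w character or at the edge of the string next to a word character; \B matches at every position where \b does not.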
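
Since \b and \B are the confusing part, it may help to print the positions where \b matches. This is a sketch in Python's re module (Python 3.7 or later, for the handling of empty matches); the strings other than bob are my own.

```python
import re

# ^ without multi-line mode: only the very start of the string.
single = re.findall(r"^b", "bob\nbill")
multi = re.findall(r"^b", "bob\nbill", flags=re.MULTILINE)

# \b is zero-width: list the positions where it matches in "hi there".
boundaries = [m.start() for m in re.finditer(r"\b", "hi there")]

# \bcat\b finds cat only as a whole word.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", "cat concat cats cat.")]

print(single, multi)  # ['b'] ['b', 'b']
print(boundaries)     # [0, 2, 3, 8]
print(whole_words)    # [0, 16]
```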

Alternation: | gives you an "or" operation
Alternation
Alternation is the regular expression equivalent of “or”. cat|dog matches cat in About cats and dogs. If the regex is applied again, it matches dog. You can add as many alternatives as you want: cat|dog|mouse|fish.
Alternation has the lowest precedence of all regex operators. cat|dog food matches cat or dog food. To create a regex that matches cat food or dog food, you need to group the alternatives: (cat|dog) food.
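Or!
Alternation is the coarse-grained version of []. Note that cat|dog food matches cat or dog food; only (cat|dog) food matches cat food or dog food.

E.g., to help a colleague (宇环) find memory allocation and release calls, enter: (SSC_SDN_MALLOC\()|(SSC_SDN_FREE\()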
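
The precedence point is easy to see by running both variants. This is a sketch in Python's re module; the C-like line containing SSC_SDN_MALLOC and SSC_SDN_FREE is a made-up sample, not code from the original project.

```python
import re

pets = "cat food, dog food"

# | has the lowest precedence: this means "cat" OR "dog food".
loose = [m.group(0) for m in re.finditer(r"cat|dog food", pets)]

# Grouping limits the alternation to the two animal names.
grouped = [m.group(0) for m in re.finditer(r"(cat|dog) food", pets)]

# Hypothetical source line for the allocation/release search.
code = "p = SSC_SDN_MALLOC(16); SSC_SDN_FREE(p);"
calls = [m.group(0) for m in re.finditer(r"(SSC_SDN_MALLOC\()|(SSC_SDN_FREE\()", code)]

print(loose)    # ['cat', 'dog food']
print(grouped)  # ['cat food', 'dog food']
print(calls)    # ['SSC_SDN_MALLOC(', 'SSC_SDN_FREE(']
```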

Repetition: some things always share the same pattern; locate them
Repetition
The question mark makes the preceding token in the regular expression optional. colou?r matches colour or color.
The question mark means optional: u? means the u may or may not be present.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times.
The star means zero or more times.

The plus tells the engine to attempt to match the preceding token once or more.
The plus means one or more times.

<[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.
Inside the brackets, the hyphen - specifies a range.

Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.

Use {} to specify the number of repetitions.

E.g., to remove all function comments (`/**/` block comments of no more than 10 lines):

/\*\*.*(\s.*\*.*\s){1,10}

As follows:
[image]
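
The quantifiers from this section, run against small inputs. This is a sketch in Python's re module; the input strings are mine, chosen to include a few near misses.

```python
import re

# ? makes the u optional, but does not allow two of them.
colors = re.findall(r"colou?r", "color colour colouur")

# * after the first letter: a letter followed by zero or more letters or digits.
strict_tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", "<b> <h1> <1>")
loose_tags = re.findall(r"<[A-Za-z0-9]+>", "<b> <h1> <1>")

# {3}: exactly three more digits, so only 1000 to 9999 match.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")

# {2,4}: two to four more digits, so 100 to 99999 match.
ranged = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

print(colors)       # ['color', 'colour']
print(strict_tags)  # ['<b>', '<h1>']
print(loose_tags)   # ['<b>', '<h1>', '<1>']
print(four_digit)   # ['1000', '9999']
print(ranged)       # ['100', '99999']
```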

Greedy and lazy: greedy is always the default; a trailing ? makes a quantifier lazy
Greedy and Lazy Repetition
The repetition operators or quantifiers are greedy. They expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex. The regex <.+> matches <EM>first</EM> in This is a <EM>first</EM> test.
Place a question mark after the quantifier to make it lazy. <.+?> matches <EM> in the above string.
A better solution is to follow my advice to use the dot sparingly. Use <[^<>]+> to quickly match an HTML tag without regard to attributes. The negated character class is more specific than the dot, which helps the regex engine find matches quickly.

<.+> is greedy; result: <EM>first</EM>, that is, everything from the first < to the last >.
<.+?> is lazy; result: <EM>, that is, only the first tag.
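
The greedy and lazy results can be reproduced as follows. This is a sketch in Python's re module, using the quick start's sample string with its <EM> tags.

```python
import re

s = "This is a <EM>first</EM> test"

# Greedy: .+ runs to the end of the string, then backs off to the last >.
greedy = re.search(r"<.+>", s).group()

# Lazy: .+? stops at the first > it can reach.
lazy = re.search(r"<.+?>", s).group()

# Negated class: never crosses a < or >, so each tag is matched separately.
tags = re.findall(r"<[^<>]+>", s)

print(greedy)  # <EM>first</EM>
print(lazy)    # <EM>
print(tags)    # ['<EM>', '</EM>']
```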

Wrap a pattern in parentheses, then capture it
Grouping and Capturing
Place parentheses around multiple tokens to group them together. You can then apply a quantifier to the group. E.g. Set(Value)? matches Set or SetValue.
Parentheses create a capturing group. The above example has one group. After the match, group number one contains nothing if Set was matched. It contains Value if SetValue was matched. How to access the group’s contents depends on the software or programming language you’re using. Group zero always contains the entire regex match.
Use the special syntax Set(?:Value)? to group tokens without creating a capturing group. This is more efficient if you don’t plan to use the group’s contents. Do not confuse the question mark in the non-capturing group syntax with the quantifier.

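Wrap what you want to search for in () and put the quantifier outside the (). E.g., to match all citation markers like [34], use: \[([0-9])+\] (to capture the whole number rather than only its last digit, put the quantifier inside the group: \[([0-9]+)\])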
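
How group contents are read depends on the language. This sketch shows it with Python's re module, where group(0) is the whole match and an unmatched optional group is None; the citation sentence is a made-up sample.

```python
import re

# Optional group: empty (None in Python) for Set, "Value" for SetValue.
set_only = re.fullmatch(r"Set(Value)?", "Set")
set_value = re.fullmatch(r"Set(Value)?", "SetValue")

# Non-capturing group: same match, but no numbered group is created.
non_capturing = re.fullmatch(r"Set(?:Value)?", "SetValue")

# Quantifier outside the group keeps only the last repetition.
last_digit = re.search(r"\[([0-9])+\]", "see [34]").group(1)

# Quantifier inside the group captures the whole number.
numbers = re.findall(r"\[([0-9]+)\]", "see [34] and [7]")

print(set_only.group(1), set_value.group(1))  # None Value
print(non_capturing.groups())                 # ()
print(last_digit, numbers)                    # 4 ['34', '7']
```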

Backreferences
Within the regular expression, you can use the backreference \1 to match the same text that was matched by the capturing group. ([abc])=\1 matches a=a, b=b, and c=c. It does not match anything else. If your regex has multiple capturing groups, they are numbered counting their opening parentheses from left to right.
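
A quick check of the backreference example, sketched in Python's re module. The extra input a=b is mine, added to show a non-match.

```python
import re

# \1 must repeat exactly what group 1 captured.
same_on_both_sides = re.compile(r"([abc])=\1")

results = {s: same_on_both_sides.fullmatch(s) is not None
           for s in ("a=a", "b=b", "c=c", "a=b")}

print(results)  # {'a=a': True, 'b=b': True, 'c=c': True, 'a=b': False}
```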

Named Groups and Backreferences
If your regex has many groups, keeping track of their numbers can get cumbersome. Make your regexes easier to read by naming your groups. (?<x>[abc])=\k<x> is identical to ([abc])=\1, except that you can refer to the group by its name.
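
Named-group syntax varies between flavors. As a sketch, Python's re module writes the same idea as (?P<x>...) for the group and (?P=x) for the backreference instead of \k<x>; the group name x is kept from the example above.

```python
import re

# Python spelling of a named group and a named backreference.
named = re.compile(r"(?P<x>[abc])=(?P=x)")

hit = named.fullmatch("b=b")
miss = named.fullmatch("a=b")

print(hit.group("x"))  # b
print(miss)            # None
```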

Unicode Properties
\p{L} matches a single character that is in the given Unicode category. L stands for letter. \P{L} matches a single character that is not in the given Unicode category. You can find a complete list of Unicode categories in the tutorial.

Lookaround
Lookaround is a special kind of group. The tokens inside the group are matched normally, but then the regex engine makes the group give up its match and keeps only the result. Lookaround matches a position, just like anchors. It does not expand the regex match.
q(?=u) matches the q in question, but not in Iraq. This is positive lookahead. The u is not part of the overall regex match. The lookahead matches at each position in the string before a u.
q(?!u) matches q in Iraq but not in question. This is negative lookahead. The tokens inside the lookahead are attempted, their match is discarded, and the result is inverted.
To look backwards, use lookbehind. (?<=a)b matches the b in abc. This is positive lookbehind. (?<!a)b fails to match abc.
You can use a full-fledged regular expression inside lookahead. Most applications only allow fixed-length expressions in lookbehind.

Lookaround
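
The four lookaround forms from the quick start, sketched in Python's re module. The string cbc is my own addition to show a case where the negative lookbehind succeeds.

```python
import re

# Positive lookahead: q only when a u follows; the u is not part of the match.
q_before_u = re.search(r"q(?=u)", "question")
no_u_after = re.search(r"q(?=u)", "Iraq")

# Negative lookahead: q only when no u follows.
q_without_u = re.search(r"q(?!u)", "Iraq")

# Positive lookbehind: b only when preceded by a.
b_after_a = re.search(r"(?<=a)b", "abc")

# Negative lookbehind: fails on abc, succeeds on cbc.
fails = re.search(r"(?<!a)b", "abc")
succeeds = re.search(r"(?<!a)b", "cbc")

print(q_before_u.group(), no_u_after)  # q None
print(b_after_a.start())               # 1
print(fails, succeeds.start())         # None 1
```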

Free-Spacing Syntax
Many applications have an option that may be labeled “free-spacing” or “ignore whitespace” or “comments” that makes the regular expression engine ignore unescaped spaces and line breaks and that makes the # character start a comment that runs until the end of the line. This allows you to use whitespace to format your regular expression in a way that makes it easier for humans to read and thus makes it easier to maintain.

In this mode, whitespace is ignored automatically and comments are allowed.
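
In Python's re module the free-spacing option is the re.VERBOSE flag. The sketch below rewrites the earlier 1000 to 9999 pattern in that style; the test string is my own.

```python
import re

# Same pattern as \b[1-9][0-9]{3}\b, laid out with comments.
four_digit = re.compile(r"""
    \b          # start at a word boundary
    [1-9]       # first digit cannot be zero
    [0-9]{3}    # exactly three more digits
    \b          # end at a word boundary
""", re.VERBOSE)

found = four_digit.findall("999 1000 9999 10000")

print(found)  # ['1000', '9999']
```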
