Demystifying RegEx with Practical Examples

culi4814

Original article: https://www.sitepoint.com/demystifying-regex-with-practical-examples/

A regular expression is a sequence of characters that defines a pattern for parsing and manipulating strings. Regular expressions are often used to search text, replace substrings and validate string data. This article provides tips, tricks, resources and a step-by-step walkthrough of some intricate regular expressions.

If you don’t have the basic skillset under your belt, you can learn regex with our beginner’s guide. As arcane as regular expressions look, it won’t take you long to learn the concepts.

There are many books, articles and websites, as well as the official PHP documentation, that explain regular expressions, so instead of writing another explanation I’d prefer to go straight to more practical examples. You can find a useful cheat sheet at this link.

Along with a host of useful resources, there is also a conference video by Lea Verou at the bottom of this post. It’s a bit long, but it does an excellent job of breaking down RegEx.

How to build a good regex

Regular expressions are part of a developer’s daily routine: log analysis, form submission validation, find and replace, and so on. That’s why every good developer should know how to use them, but what is the best practice for building a good regex?

1. Define a scenario

Using natural language to define the problem will give you a better idea of the approach to use. The words must and could are useful in such a definition: must describes a mandatory constraint or assertion, while could describes an optional one.

Below is an example:

The string must start with ‘h’ and finish with ‘o’ (e.g. hello, halo).
The string could be wrapped in parentheses.
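
The scenario above translates into a pattern almost word for word. What follows is a minimal sketch using Python's re module; the language, the pattern itself and the helper name matches_scenario are assumptions made for illustration and are not part of the original article. The mandatory part (h ... o) appears in both alternatives, and the optional parentheses are accepted only as a matching pair.

```python
import re

# Either "(h...o)" with both parentheses, or "h...o" with none.
# \w* allows any run of word characters between the 'h' and the 'o'.
SCENARIO = re.compile(r"\(h\w*o\)|h\w*o")


def matches_scenario(text):
    """Return True if the whole string satisfies the scenario."""
    return SCENARIO.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["hello", "halo", "(hello)", "(hello", "help"]:
        print(candidate, matches_scenario(candidate))
```

Writing the two alternatives explicitly is a design choice: a shorter pattern such as \(?h\w*o\)? would also accept a string with only one of the two parentheses.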

2. Develop a plan

Once the problem is well defined, we can identify the kinds of elements involved in our regular expression:

What are the types of characters allowed (word, digit, new line, range, …)?
How many times must a character appear (one or more, once, …)?
Are there some constraints to follow (optionals, lookahead/lookbehind, if-then-else, …)?

3. Implement/Test/Refactor

It’s very important to have a real-time test environment to test and improve your regular expression. Websites like regex101.com, regexr.com and debuggex.com provide some of the best environments.

To improve the efficiency of the regex, you could try to answer some of these additional questions:

Are the character classes correctly defined for the specific domain?
Should I write more test strings to cover more use cases?
Is it possible to find and isolate some problems and test them separately?
Should I refactor my expression with subpatterns, groups, conditions, etc., to make it smaller, clearer and more flexible?
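
One possible way to act on these questions is to keep the pattern next to a table of strings that must and must not match, and to rerun the table after every refactoring. The sketch below assumes Python's re module and reuses the h ... o scenario from step 1; the pattern and the names CASES and failing_cases are hypothetical and only illustrate the idea.

```python
import re

# re.VERBOSE lets the pattern be split into commented pieces,
# which makes a later refactoring easier to review.
PATTERN = re.compile(
    r"""
    \( h \w* o \)   # wrapped in a matching pair of parentheses
    |               # or
    h \w* o         # bare string starting with 'h' and ending with 'o'
    """,
    re.VERBOSE,
)

# (test string, expected result)
CASES = [
    ("hello", True),
    ("halo", True),
    ("(hello)", True),
    ("hello)", False),
    ("hell", False),
    ("", False),
]


def failing_cases(pattern, cases):
    """Return the cases whose actual result differs from the expected one."""
    return [
        (text, expected)
        for text, expected in cases
        if (pattern.fullmatch(text) is not None) != expected
    ]


if __name__ == "__main__":
    print(failing_cases(PATTERN, CASES))
```

An empty list means every test string behaved as expected; adding a new use case is a one-line change to the table.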

Practical examples

The goal of the following examples is not to write an expression that merely solves the problem, but to write the most effective expression for the specific use case, using important elements like character ranges, assertions, conditions, groups and so on.

Matching a password

Scenario:

6 to 12 characters in length
Must have at least one uppercase letter
Must have at least one lowercase letter
Must have at least one digit
Could contain other characters

Pattern:

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$

This expression is based on multiple positive lookaheads, (?=(regex)). A lookahead asserts that the declared (regex) can be matched from the current position, without consuming any characters. The order of the conditions doesn’t affect the result. Lookaround expressions are very useful when there are several conditions. We could also use the negative lookahead (?!(regex)) to exclude some character ranges. For example, I could exclude the # character with (?!.*#).

Let’s explain each part of the above expression:

^ asserts position at start of the string
(?=.*[a-z]) positive lookahead, asserts that the regex .*[a-z] can be matched:
- .* matches any character (except newline) between zero and unlimited times
- [a-z] matches a single character in the range between a and z (case sensitive)
(?=.*[A-Z]) positive lookahead, asserts that the regex .*[A-Z] can be matched:
- .* matches any character (except newline) between zero and unlimited times
- [A-Z] matches a single character in the range between A and Z (case sensitive)
(?=.*\d) positive lookahead, asserts that the regex .*\d can be matched:
- .* matches any character (except newline) between zero and unlimited times
- \d matches a digit [0-9]
.{6,12} matches any character (except newline) between 6 and 12 times
$ asserts position at end of the string
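
The breakdown above can be tried out with a few lines of code. This is a minimal sketch that assumes Python's re module rather than the PHP functions the article refers to; the function name is_valid_password and its allow_hash parameter are hypothetical. The second pattern adds the negative lookahead (?!.*#) mentioned earlier.

```python
import re

# The pattern from the article, unchanged.
PASSWORD = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$")

# Variant with an extra negative lookahead that rejects any '#'.
PASSWORD_NO_HASH = re.compile(
    r"^(?!.*#)(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$"
)


def is_valid_password(candidate, allow_hash=True):
    """Check a candidate against the password scenario."""
    pattern = PASSWORD if allow_hash else PASSWORD_NO_HASH
    # fullmatch also rejects a trailing newline, which '$' alone
    # would tolerate in Python.
    return pattern.fullmatch(candidate) is not None


if __name__ == "__main__":
    for candidate in ["Passw0rd", "password1", "Pw0rd", "Passw0rd#"]:
        print(candidate, is_valid_password(candidate))
    print("Passw0rd#", is_valid_password("Passw0rd#", allow_hash=False))
```

Because every lookahead starts from the same position (the start of the string), the three conditions are checked independently, which is why their order does not matter.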

Matching a URL

Scenario:

Must start with http or https or ftp followed by ://
Must match a valid domain name
Could contain a port specification (http://www.sitepoint.com:80)
Could contain digits, letters, dots, hyphens and forward slashes, multiple times

Pattern:

^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)

The first requirement is pretty easy to satisfy with ^(http|https|ftp):[\/]{2}. To match the domain name we need to bear in mind that, to be valid, it can only contain letters, digits, hyphens and dots. In my example, I limited the number of characters after the last dot to between 2 and 4, but this could be extended for newer domains like .rocks or .codes. The domain name is matched by ([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}).

The optional port specification is matched by the simple (:[0-9]+)?.

A URL can contain multiple slashes and many other characters repeated many times (see RFC3986). This is matched by using a range of characters in a group, ([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*). It’s really useful to wrap every important element in a capturing group (), because the match will then return exactly the parts we need. Remember that certain characters need to be escaped with \.

Below, every single subpattern is explained:

^ asserts position at start of the string
capturing group (http|https|ftp), captures http or https or ftp
: matches the character : literally
[\/]{2} matches the escaped character / exactly 2 times
capturing group ([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}):
- [a-zA-Z0-9\-\.]+ matches, between one and unlimited times, a character in the ranges a-z, A-Z and 0-9, the character - literally or the character . literally
- \. matches the character . literally
- [a-zA-Z]{2,4} matches, between 2 and 4 times, a single character in the range a-z or A-Z (case sensitive)
capturing group (:[0-9]+)?:
- quantifier ? matches the group zero or one time
- : matches the character : literally
- [0-9]+ matches a single character between 0 and 9 one or more times
\/? matches the character / literally zero or one time
capturing group ([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*):
- [a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]* matches, between zero and unlimited times, a single character in the ranges a-z, A-Z and 0-9 or one of the characters -._?,'/\+&%$#=~
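
To see what the four capturing groups return, the pattern can be run against a few URLs. The sketch below assumes Python's re module instead of PHP; the helper parse_url and the sample URLs other than the sitepoint.com one are made up for illustration. The pattern is the article's, only split across lines so each group can carry a comment.

```python
import re

# The article's pattern, split over several lines for readability only.
URL = re.compile(
    r"^(http|https|ftp):[\/]{2}"                 # scheme and ://
    r"([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})"         # domain name
    r"(:[0-9]+)?"                                # optional port
    r"\/?"                                       # optional slash
    r"([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)"   # rest of the URL
)


def parse_url(url):
    """Return (scheme, domain, port, path), or None if the URL does not match."""
    match = URL.match(url)
    if match is None:
        return None
    scheme, domain, port, path = match.groups()
    # Group 3 includes the leading ':'; drop it when a port is present.
    port = port[1:] if port else None
    return scheme, domain, port, path


if __name__ == "__main__":
    print(parse_url("http://www.sitepoint.com:80/path/page.php?id=1"))
    print(parse_url("ftp://example.org"))
    print(parse_url("mailto:someone@example.com"))
```

The {2,4} limit shows its cost here: for http://example.rocks the pattern still matches, but it reports the domain as example.rock and pushes the trailing s into the last group, which is a good reason to widen the limit when newer top-level domains matter.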

Matching an HTML tag

Scenario:

The start tag must begin with < followed by one or more characters and end with >
The end tag must start with </ followed by one or more characters and end with >
We must match the content inside a tag element

Pattern:

<([\w]+).*>(.*?)<\/\1>

Matching the start tag and the content inside it is pretty easy with <([\w]+).*> and (.*?), but in the pattern above I have added a useful thing: a reference to a capturing group. Every capturing group defined by parentheses () can be referred to by its position number, (first)(second)(third), which allows further operations. The expression above could be explained as:

Start with <
Capture the tag name
Followed by zero or more characters (for example, attributes)
Capture the content inside the tag
The closing tag must be </tag name captured before>

Including only two capturing groups in the expression, the tag name and the content, will return a very clear result: a list of tag names with their related content.

Let’s dig a little deeper and explain the subpatterns:

< matches the character < literally
capturing group ([\w]+) matches any word character a-zA-Z0-9_ one or more times
.* matches any character (except newline) zero or more times
> matches the character > literally
capturing group (.*?) matches any character (except newline) zero or more times, as few times as possible (lazy)
< matches the character < literally
\/ matches the character / literally
\1 matches the same text matched by the first capturing group: ([\w]+)
> matches the character > literally
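
The effect of the backreference \1 is easiest to see on a few inputs. This is a small sketch that assumes Python's re module; the helper extract_tags and the sample HTML snippets are hypothetical. It is only meant to show the two capturing groups at work: a single regex like this one does not handle nested tags of the same name or content that spans several lines.

```python
import re

# The article's pattern: group 1 is the tag name, group 2 the content,
# and \1 forces the closing tag to repeat the captured name.
TAG = re.compile(r"<([\w]+).*>(.*?)<\/\1>")


def extract_tags(html):
    """Return a list of (tag name, inner content) pairs."""
    return TAG.findall(html)


if __name__ == "__main__":
    print(extract_tags('<p class="intro">Hello</p>'))
    print(extract_tags("<b>bold</b> and <i>italic</i>"))
    print(extract_tags("<b>bold</i>"))  # closing tag does not match the opening one
```

With only two groups in the pattern, each result is a compact (name, content) pair that is easy to pass on to further processing.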

Matching duplicated words

Scenario:

The words are space separated
We must match every duplication, including non-consecutive ones

Pattern:

\b(\w+)\b(?=.*\1)

This regular expression seems challenging but uses some of the concepts previously shown. The pattern also introduces the concept of word boundaries.

A word boundary \b matches a position rather than a character. It matches between a word character (e.g. abcDE) and a non-word character (e.g. -~,!), or between a word character and the start or end of the string. Below you can find some example uses of the word boundary to make it clearer:

- Given the phrase Regular expressions are awesome
- The pattern \bare\b matches are
- The pattern \w{3}\b could match the last three letters of the words: lar, ons, are, ome
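
The word boundary examples can be checked directly. The snippet below is a short sketch assuming Python's re module; the second phrase, We are aware, is made up to show what changes when \b is left out.

```python
import re

PHRASE = "Regular expressions are awesome"

# \bare\b only matches "are" as a whole word.
whole_word = re.findall(r"\bare\b", PHRASE)

# \w{3}\b matches three word characters that end at a word boundary,
# i.e. the last three letters of each word.
word_endings = re.findall(r"\w{3}\b", PHRASE)

# Without \b the pattern also matches inside "aware".
unanchored = re.findall(r"are", "We are aware")
anchored = re.findall(r"\bare\b", "We are aware")

if __name__ == "__main__":
    print(whole_word)    # ['are']
    print(word_endings)  # ['lar', 'ons', 'are', 'ome']
    print(unanchored)    # ['are', 'are']
    print(anchored)      # ['are']
```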

The expression above could be explained as:

Match every whole word, i.e. a run of word characters delimited by word boundaries (in our case, spaces)
Check whether the matched word appears again later in the string

Below you will find the explanation for each subpattern:

\b word boundary
capturing group (\w+) matches any word character a-zA-Z0-9_ one or more times
\b word boundary
(?=.*\1) positive lookahead, asserts that the following can be matched:
- .* matches any character (except newline) zero or more times
- \1 matches the same text as the first capturing group

The expression will make more sense if we return all the matches instead of returning only the first one. See the PHP function preg_match_all for more information.
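
Returning all the matches could look like the sketch below. It assumes Python, where re.findall is used as a rough counterpart of PHP's preg_match_all; the helper find_duplicates, the sample sentences and the stricter second pattern are my own additions for illustration and do not come from the article.

```python
import re

# The article's pattern: a word that is followed, somewhere later, by the same text.
DUPLICATES = re.compile(r"\b(\w+)\b(?=.*\1)")

# Stricter variant (an assumption, not from the article): the repeated
# text must itself be a whole word.
DUPLICATE_WORDS = re.compile(r"\b(\w+)\b(?=.*\b\1\b)")


def find_duplicates(text, pattern=DUPLICATES):
    """Return every word that occurs again later in the text."""
    return pattern.findall(text)


if __name__ == "__main__":
    sentence = "the quick fox and the lazy dog and cat"
    print(find_duplicates(sentence))                     # ['the', 'and']
    print(find_duplicates("is this"))                    # ['is']
    print(find_duplicates("is this", DUPLICATE_WORDS))   # []
```

The second call shows a limitation worth knowing: inside the lookahead, \1 is not surrounded by word boundaries, so is counts as duplicated because it also appears inside this. Adding \b around \1 is one possible refinement.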

Final thoughts

Regular expressions are double-edged swords. The more complexity you add to an expression, the harder it becomes to solve the problem with it. That’s why it’s sometimes hard to find a single regular expression that matches all the cases, and it’s better to use several smaller regexes instead.
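
As an illustration of the "several smaller regexes" idea, the password scenario from earlier could be split into one tiny pattern per rule. This is only a sketch under assumptions: it uses Python's re module, and the names RULES and failed_rules are hypothetical.

```python
import re

# One small, independently testable pattern per rule of the password scenario.
RULES = [
    ("length between 6 and 12", re.compile(r"^.{6,12}$")),
    ("at least one lowercase letter", re.compile(r"[a-z]")),
    ("at least one uppercase letter", re.compile(r"[A-Z]")),
    ("at least one digit", re.compile(r"\d")),
]


def failed_rules(candidate):
    """Return the descriptions of all rules the candidate does not satisfy."""
    return [name for name, pattern in RULES if pattern.search(candidate) is None]


if __name__ == "__main__":
    print(failed_rules("Passw0rd"))  # []
    print(failed_rules("password"))  # ['at least one uppercase letter', 'at least one digit']
    print(failed_rules("Ab1"))       # ['length between 6 and 12']
```

Compared with the single lookahead-based expression, this version is longer, but each rule can be tested on its own and a failure says which constraint was violated.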

Having a good scenario for the problem can be very helpful, and will allow you to start thinking about the character ranges, constraints, assertions, repetitions, optional values, etc. Paying more attention to capturing groups will make the matches useful for further processing. Feel free to improve the expressions in the examples, and let us know how you do!

Useful resources

Below you can find further information and resources to help your regex skills grow. Feel free to add a comment to the article if you find something useful that isn’t listed.

Lea Verou – /Reg(exp){2}lained/: Demystifying Regular Expressions

https://www.youtube.com/watch?v=EkluES9Rvak

PHP libraries

Name	Description
RegExpBuilder	Creates regex using human-readable chains of methods
NooNooFluentRegex	Builds regex expressions using fluent setters and English-language terms, like the library above
Hoa\Regex	Provides tools to analyze regex and generate strings
Regex reverse	Given a regular expression, generates a string

Websites

URL	Description
regex101.com	PCRE online regex tester
regextester.com	PCRE online regex tester
rexv.org	PCRE online regex tester
debuggex.com	Supports PCRE and provides a very useful visual regex debugger
regexper.com	JavaScript-style regex, but useful for debugging
phpliveregex.com	Online tester for preg functions
regxlib.com	Database of ready-to-use regular expressions
regular-expressions.info	Regex tutorials, book reviews, examples

Books

Title	Description	Author	Publisher
Mastering Regular Expressions	The must-have regex book	Jeffrey Friedl	O’Reilly
Regular Expression Pocket Reference	Regular Expressions for Perl, Ruby, PHP, Python, C, Java and .NET	Tony Stubblebine	O’Reilly
