Learn Regex: A Beginner's Guide

culi3182

Original article: https://www.sitepoint.com/learn-regex/

In this guide, you’ll learn regex, or regular expression syntax. By the end, you’ll be able to apply regex solutions in most scenarios that call for it in your web development work.

Regular expressions have many use cases, including:

form input validation
web scraping
search and replace
filtering for information in massive text files such as logs

Regular expressions, or regex as they’re commonly called, look complicated and intimidating for new users. Take a look at this example:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/

It just looks like garbled text. But don’t despair, there’s method behind this madness.

Credit: xkcd

I’ll show you how to master regular expressions in no time. First, let’s clarify the terminology used in this guide:

pattern: regular expression pattern
string: test string used to match the pattern
digit: 0-9
letter: a-z, A-Z
symbol: !$%^&*()_+|~-=`{}[]:";'<>?,./
space: single white space, tab
character: refers to a letter, digit or symbol

Basics

To learn regex quickly with this guide, visit Regex101, where you can build regex patterns and test them against strings (text) that you supply.

When you open the site, you’ll need to select the JavaScript flavor, as that’s what we’ll be using for this guide. (Regex syntax is mostly the same for all languages, but there are some minor differences.)

Next, you need to disable the global and multi line flags in Regex101. We’ll cover them in the next section. For now, we’ll look at the simplest form of regular expression we can build. Input the following:

regex input field: cat
test string: rat bat cat sat fat cats eat tat cat mat CAT

Take note that regular expressions in JavaScript start and end with /. If you were to write a regular expression in JavaScript code, it would look like this: /cat/ without any quotation marks. In the above state, the regular expression matches the string “cat”. However, as you can see in the image above, there are several “cat” strings that are not matched. In the next section, we’ll look at why.

Global and Case Insensitive Regex Flags

By default, a regex pattern will only return the first match it finds. If you’d like to return additional matches, you need to enable the global flag, denoted as g. Regex patterns are also case sensitive by default. You can override this behavior by enabling the insensitive flag, denoted by i. The updated regex pattern is now fully expressed as /cat/gi. As you can see below, all “cat” strings have been matched including the one with a different case.
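
The effect of the two flags can be checked with a short JavaScript sketch. It reuses the test string from this guide; the variable names are only illustrative.

```javascript
// Test string from the guide.
const text = "rat bat cat sat fat cats eat tat cat mat CAT";

// No flags: match() returns only the first match, with its position.
const firstOnly = text.match(/cat/);
console.log(firstOnly[0], firstOnly.index);

// g flag: every case-sensitive match is returned.
const allLower = text.match(/cat/g);
console.log(allLower.length);

// g and i flags together: the upper-case "CAT" is matched as well.
const allAnyCase = text.match(/cat/gi);
console.log(allAnyCase.length, allAnyCase[allAnyCase.length - 1]);
```

Note that the "cat" inside "cats" counts as a match too, since nothing in the pattern says the word has to end there.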

Character Sets

In the previous example, we learned how to perform exact case-sensitive matches. What if we wanted to match “bat”, “cat” and “fat”? We can do this by using character sets, denoted with []. Basically, you put in the characters that you want to allow at that position. For example, [bcf]at will match “bat”, “cat” and “fat”.

Character sets also work with digits.
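
As a small sketch, here is the [bcf]at pattern run against the guide's test string, plus a digit set. The second sample string is made up for illustration.

```javascript
// Test string from the guide.
const text = "rat bat cat sat fat cats eat tat cat mat CAT";

// Only words whose first letter is b, c or f are picked up.
const setMatches = text.match(/[bcf]at/g);
console.log(setMatches);

// Character sets work the same way with digits (hypothetical sample).
const digitMatches = "a1 a2 a3 a4".match(/a[13]/g);
console.log(digitMatches);
```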

Ranges

Let’s assume we want to match all words that end with at. We could supply the full alphabet inside the character set, but that would be tedious. The solution is to use a range, like this: [a-z]at.

Here’s the full string that’s being tested: rat bat cat sat fat cats eat tat cat dog mat CAT.

As you can see, all words are matching as expected. I’ve added the word dog just to throw in an invalid match. Here are other ways you can use ranges:

Partial range: selections such as [a-f] or [g-p].
Capitalized range: [A-Z].
Digit range: [0-9].
Symbol set: for example, [#$%&@] (strictly a list of individual symbols rather than a range).
Mixed range: for example, [a-zA-Z0-9] includes all digits, lower and upper case letters. Do note that a range only specifies multiple alternatives for a single character in a pattern.

To further understand how to define a range, it’s best to look at the full ASCII table in order to see how characters are ordered.
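
A brief sketch of ranges in JavaScript, using the test string above. The last sample string is an assumption added for illustration.

```javascript
// Test string from the guide, including the non-matching word "dog".
const text = "rat bat cat sat fat cats eat tat cat dog mat CAT";

// Any lowercase letter followed by "at".
const lowerMatches = text.match(/[a-z]at/g);
console.log(lowerMatches.length);

// Adding the i flag also picks up "CAT".
const anyCaseMatches = text.match(/[a-z]at/gi);
console.log(anyCaseMatches.length);

// Two ranges side by side: one capital letter followed by one digit.
const codeMatches = "Room 101, Block B7".match(/[A-Z][0-9]/g);
console.log(codeMatches);
```

Each bracketed range stands for exactly one character, which is why [A-Z][0-9] needs two sets to describe two positions.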

Repeating Characters

Let’s say you’d like to match all three-letter words. You’d probably do it like this:

[a-z][a-z][a-z]

This would match all three-letter words. But what if you want to match a five- or eight-character word? The above method is tedious. There’s a better way to express such a pattern using the {} curly braces notation. All you have to do is specify the number of repeating characters. Here are examples:

a{5} will match “aaaaa”.
n{3} will match “nnn”.
[a-z]{4} will match any four-letter word such as “door”, “room” or “book”.
[a-z]{6,} will match any word with six or more letters.
[a-z]{8,11} will match any word between eight and 11 letters. Basic password validation can be done this way.
[0-9]{11} will match an 11-digit number. Basic international phone validation can be done this way.
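
The curly brace forms above can be tried out as follows. This is a minimal sketch; all sample strings are made up for illustration.

```javascript
// Exact count: four lowercase letters at a time.
const fourLetters = "door room book".match(/[a-z]{4}/g);
console.log(fourLetters);

// Open-ended count: six or more letters.
const sixOrMore = "regular expression".match(/[a-z]{6,}/g);
console.log(sixOrMore);

// Bounded count: the match is greedy but stops at the upper limit of 11.
const bounded = "abcdefghijklmnop".match(/[a-z]{8,11}/)[0];
console.log(bounded);

// Exactly 11 digits.
const elevenDigits = "call 25412345678 now".match(/[0-9]{11}/g);
console.log(elevenDigits);
```

As the bounded case shows, these patterns happily match part of a longer string, so on their own they do not check the length of the whole input.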

Metacharacters

Metacharacters allow you to write regular expression patterns that are even more compact. Let’s go through them one by one:

\d matches any digit; it is the same as [0-9]
\w matches any letter, digit or underscore character
\s matches a whitespace character, such as a space, tab or newline
\t matches a tab character only

From what we’ve learned so far, we can write regular expressions like this:

\w{5} matches any five-letter word or five-digit number (or any other run of five letters, digits and underscores)
\d{11} matches an 11-digit number such as a phone number
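
Here is a short sketch of the metacharacters in JavaScript. The sample strings are assumptions chosen for illustration.

```javascript
// Sample containing letters, digits, an underscore, spaces and a tab.
const sample = "Order_7 costs 42 dollars\tpaid";

// \d picks out each digit individually.
const digits = sample.match(/\d/g);
console.log(digits);

// \s matches the three spaces and the tab; \t matches the tab alone.
const whitespaceCount = sample.match(/\s/g).length;
const tabCount = sample.match(/\t/g).length;
console.log(whitespaceCount, tabCount);

// \w{5} accepts any mix of letters, digits and underscores.
const fiveWordChars = "hello 12345 a_b_c".match(/\w{5}/g);
console.log(fiveWordChars);
```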

Special Characters

Special characters take us a step further into writing more advanced pattern expressions:

+: One or more quantifier (preceding character must exist and can be optionally duplicated). For example, the expression c+at will match “cat”, “ccat” and “ccccccccat”. You can repeat the preceding character as many times as you like and you’ll still get a match.
?: Zero or one quantifier (preceding character is optional). For example, the expression c?at will only match “cat” or “at”.
*: Zero or more quantifier (preceding character is optional and can be optionally duplicated). For example, the expression c*at will match “at”, “cat” and “ccccccat”. It’s like the combination of + and ?.
\: this “escape character” is used when we want to use a special character literally. For example, c\* will exactly match “c*” and not “ccccccc”.
[^]: this “negate” notation is used to indicate a character that should not be matched within a range. For example, the expression b[^a-c]ld will not match “bald” or “bbld” because the second letter must not be a, b or c. However, the pattern will match “beld”, “bild”, “bold” and so forth.
.: this “dot” notation will match any digit, letter or symbol except newline. For example, .{8} will match an eight-character password consisting of letters, numbers and symbols. For example, “password” and “P@ssw0rd” will both match.
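
The behaviour of each special character can be compared side by side with a quick sketch. The sample strings are made up for illustration.

```javascript
const sample = "at cat ccat";

// +: at least one "c" is required, so the bare "at" is skipped.
const plus = sample.match(/c+at/g);
// ?: at most one "c" is consumed, so "ccat" only yields "cat".
const optional = sample.match(/c?at/g);
// *: any number of "c" characters, including none.
const star = sample.match(/c*at/g);
console.log(plus, optional, star);

// \: the escaped asterisk is matched literally.
const escaped = "c* ccc".match(/c\*/g);
// [^a-c]: the second letter may be anything except a, b or c.
const negated = "bald bbld bold bild".match(/b[^a-c]ld/g);
// .: any eight characters, symbols included.
const anyEight = "P@ssw0rd".match(/.{8}/)[0];
console.log(escaped, negated, anyEight);
```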

From what we’ve learned so far, we can create an interesting variety of compact but powerful regular expressions. For example:

.+ matches one or more characters of any kind. For example, “c”, “cc” and “bcd#.670” will all match.
[a-z]+ will match all lowercase letter words irrespective of length, as long as they contain at least one letter. For example, “book” and “boardroom” will both match.

Groups

All the special characters we just mentioned only affect a single character or a range set. What if we wanted the effect to apply to a section of the expression? We can do this by creating groups using round brackets, (). For example, the pattern book(\.com)? will match both “book” and “book.com”, since we’ve made the “.com” part optional. (The dot is escaped so that it matches a literal “.” rather than any character.)

Here’s a more complex example that would be used in a realistic scenario such as email validation:

pattern: @\w+\.\w{2,3}(\.\w{2,3})?
test string: abc.com abc@mail @mail.com @mail.co.ke
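
Running the pattern above against its test string in JavaScript gives a feel for what the optional group does. This is only a sketch; the second example uses a made-up string.

```javascript
// Pattern and test string from the guide.
const text = "abc.com abc@mail @mail.com @mail.co.ke";
const domainMatches = text.match(/@\w+\.\w{2,3}(\.\w{2,3})?/g);
console.log(domainMatches);

// The optional group lets one pattern cover both forms of the name.
const bookMatches = "book book.com".match(/book(\.com)?/g);
console.log(bookMatches);
```

“abc.com” is skipped because it has no @, and “abc@mail” is skipped because no dot and suffix follow the domain name.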

Alternate Characters

In regex, we can specify alternate characters using the “pipe” symbol — |. This is different from the special characters we showed earlier as it affects all the characters on each side of the pipe symbol. For example, the pattern sat|sit will match both “sat” and “sit” strings. We can rewrite the pattern as s(a|i)t to match the same strings.
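
A minimal sketch of both spellings of the alternation, run against a made-up sample string:

```javascript
const sample = "sat sit set";

// The pipe separates two complete alternatives.
const whole = sample.match(/sat|sit/g);

// Inside a group, the pipe only chooses between "a" and "i".
const grouped = sample.match(/s(a|i)t/g);

console.log(whole, grouped);
```

Both patterns find the same two words, and “set” is left out in each case.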

Starting and Ending Patterns

You may have noticed that some positive matches are a result of partial matching. For example, if I wrote a pattern to match the string “boo”, the string “book” will get a positive match as well, despite not being an exact match. To remedy this, we’ll use the following notations:

^: placed at the start of a pattern, it requires the match to begin at the start of the string.
$: placed at the end of a pattern, it requires the match to finish at the end of the string.

To fix the above situation, we can write our pattern as boo$. This will ensure that the last three characters match the pattern. However, there’s one problem we haven’t considered yet, as the following image shows:

The string “sboo” gets a match because it still fulfills the current pattern matching requirements. To fix this, we can update the pattern as follows: ^boo$. This will strictly match the word “boo”. If you use both of them, both rules are enforced. For example, ^[a-z]{5}$ strictly matches a five-letter word. If the string has more than five letters, the pattern doesn’t match.
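
The progression from boo to ^boo$ can be reproduced with test(), as in this small sketch. The checks variable is just an illustrative name.

```javascript
// Each entry records whether the pattern matches the given string.
const checks = {
  partial: /boo/.test("book"),          // unanchored: matches inside "book"
  endOnly: /boo$/.test("book"),         // must end with "boo"
  endOnlySboo: /boo$/.test("sboo"),     // "sboo" still ends with "boo"
  bothSboo: /^boo$/.test("sboo"),       // must also start with "boo"
  bothBoo: /^boo$/.test("boo"),
  fiveLetters: /^[a-z]{5}$/.test("hello"),
  sixLetters: /^[a-z]{5}$/.test("helloo"),
};
console.log(checks);
```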

Regex in JavaScript

// Example 1: regex literal
const regex1 = /[a-z]/ig

// Example 2: RegExp constructor, equivalent to Example 1
const regex2 = new RegExp(/[a-z]/, 'ig')

If you have Node.js installed on your machine, open a terminal and execute the command node to launch the Node.js shell interpreter. Next, execute as follows:
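
The original session is not reproduced here. As a stand-in, the following sketch shows the kind of statements you might try in the Node.js shell, using only standard string and RegExp methods. The sample text and variable names are assumptions.

```javascript
const sample = "Hello World 42";

// test() answers yes or no. (No g flag here, because test() on a global
// regex remembers its position between calls.)
const hasLetter = /[a-z]/i.test(sample);

// match() with the g and i flags returns every letter.
const letters = sample.match(/[a-z]/gi);

// replace() swaps each run of digits for a placeholder.
const masked = sample.replace(/\d+/g, "#");

console.log(hasLetter, letters.length, masked);
```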

Feel free to play with more regex patterns. When done, use the command .exit to quit the shell.

Real World Example: Email Validation

As we conclude this guide, let’s look at a popular usage of regex, email validation. (For example, we might want to check that an email address a user has entered into a form is a valid email address.)

This subject is more complicated than you might think. The email address syntax is quite simple: {name}@{domain}. In theory, an email address can contain a limited number of symbols such as #-@&%. etc. However, the placement of these symbols matters. Mail servers also have different rules on the use of symbols. For example, some servers treat the + symbol as invalid. In other mail servers, the symbol is used for email subaddressing.

As a challenge to test your knowledge, try to build a regular expression pattern that matches only the valid email addresses marked below:

# invalid email
abc
abc.com

# valid email address
abc@mail.com
abc@mail.nz
abc@mail.co.nz
abc123@mail.com
abc.def@music.com

# invalid email prefix
abc-@mail.com
abc..def@mail.com
.abc@mail.com
abc#def@mail.com

# valid email prefix
abc-d@mail.com
abc.def@mail.com
abc@mail.com
abc_def@mail.com

# invalid domain suffix
abc.def@mail.c
abc.def@mail#archive.com
abc.def@mail
abc.def@mail..com

# valid domain suffix
abc.def@mail.cc
abc.def@mail-archive.com
abc.def@mail.org
abc.def@mail.com
fully-qualified-domain@example.com

Do note some email addresses marked as valid may be invalid for certain organizations, while some that are marked as invalid may actually be allowed in other organizations. Either way, learning to build custom regular expressions for the organizations you work for is paramount in order to cater for their needs. In case you get stuck, you can look at the following possible solutions. Do note that none of them will give you a 100% match on the above valid email test strings.

Possible Solution 1:

^\w*(\-\w)?(\.\w*)?@\w*(-\w*)?\.\w{2,3}(\.\w{2,3})?$

Possible Solution 2:

^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$
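
To see where each solution falls short, you can run both against a few of the test strings above. The sketch below uses a small hand-picked subset of the list; the samples array and the mismatches helper are hypothetical names introduced for illustration.

```javascript
// The two possible solutions from above, written as regex literals.
const solution1 = /^\w*(\-\w)?(\.\w*)?@\w*(-\w*)?\.\w{2,3}(\.\w{2,3})?$/;
const solution2 = /^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;

// A subset of the guide's test strings with their intended verdicts.
const samples = [
  { email: "abc@mail.com", valid: true },
  { email: "abc.def@mail-archive.com", valid: true },
  { email: "fully-qualified-domain@example.com", valid: true },
  { email: "abc.com", valid: false },
  { email: "abc-@mail.com", valid: false },
  { email: ".abc@mail.com", valid: false },
  { email: "abc.def@mail.c", valid: false },
];

// Returns the addresses where the pattern disagrees with the intended verdict.
function mismatches(pattern) {
  return samples
    .filter(({ email, valid }) => pattern.test(email) !== valid)
    .map(({ email }) => email);
}

console.log(mismatches(solution1));
console.log(mismatches(solution2));
```

On this subset, the first solution rejects the valid address with two hyphens in its prefix and accepts the one starting with a dot, while the second accepts the prefix that ends with a hyphen.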

Summary

I hope you’ve now learned the basics of regular expressions. We haven’t covered all regex features in this quick beginner guide, but you should have enough information to tackle most problems that call for a regex solution. To learn more, read our guide on best practices for the practical application of regex in real-world scenarios.
