An Introduction to Regex in Python

Original article: https://scotch.io/tutorials/an-introduction-to-regex-in-python

A regular expression is simply a sequence of characters that defines a pattern.

When you want to match a string to perhaps validate an email or password, or even extract some data, a regex is an indispensable tool.

Everything in a regex is a character, even the period (.).

While Unicode characters can be used to match any international text, most patterns use plain ASCII (letters, digits, punctuation and keyboard symbols like $@%#!.).

Why should I learn regex?

Regular expressions are everywhere. Here are some of the reasons why you should learn them:

They do a lot with less: a few characters can do something that might otherwise take dozens of lines of code.
Standing out from the crowd: most programmers don't know regex. Once you learn it, you leave that category.
They are fast: a regex pattern written with performance in mind takes very little time to execute. Backtracking can cost time, but a carefully written pattern keeps it to a minimum.
They are portable: most regex syntax works the same way across a variety of programming languages.
You should learn them for the same reason I do: they make your work a lot easier.

Are there any real world applications?

Common applications of regex are:

Input validation (emails, usernames, passwords)
Web scraping
Data wrangling
Simple parsing

Also, regex is used for text matching in spreadsheets, text editors, IDEs and Google Analytics.

Let the coding begin!

We are going to use Python to write some regex. Python is known for its readability, which makes regular expressions easier to put to work.

In Python, the re module provides full support for regular expressions. A GitHub repo contains the code and concepts we'll use here.

Our first regex pattern

Regular expressions in Python are usually written in raw string notation: r"write-expression-here". A raw string leaves backslashes untouched, so they reach the regex engine exactly as written. First, we'll import the re module. Then we write out the regex pattern.

import re

pattern = re.compile(r"")

The compile method turns the regex pattern into a regular expression object that can be used for matching later. It's advisable to compile a regex when it will be used several times in your program: saving the resulting object and reusing it is more efficient than compiling the pattern again each time.
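
As a rough illustration of that advice, the sketch below compiles a pattern once and reuses it for several inputs. The names WORD and first_word are made up for this example. The re module also keeps a cache of recently used patterns, so the gain is usually modest and shows up mostly in tight loops.

```python
import re

# Compile once at module level, then reuse the pattern object.
WORD = re.compile(r"\w+")


def first_word(text):
    """Return the run of word characters at the start of text, or None."""
    found = WORD.match(text)
    return found.group() if found else None


words = [first_word(s) for s in ["regex is awesome", "007 James Bond", "!!!"]]
print(words)  # ['regex', '007', None]
```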

To add some regular expression inside the raw string notation, we'll put some special sequences to make our work easier.

So, what are special sequences?

They are simply sequences of characters that begin with a backslash \. For instance, \d matches one digit [0-9], and \w matches one word character: a letter, a digit or an underscore [a-zA-Z0-9_]. With Python 3 string patterns, both also match non-ASCII digits and letters unless the re.ASCII flag is given.

It's important to know them since they help us write simpler and shorter regex.

Here's a table with more special sequences:

Element	Description
.	Matches any character except \n
\d	Matches any digit [0-9]
\D	Matches any non-digit character [^0-9]
\s	Matches any whitespace character [ \t\n\r\f\v]
\S	Matches any non-whitespace character [^ \t\n\r\f\v]
\w	Matches any word character (letter, digit or underscore) [a-zA-Z0-9_]
\W	Matches any non-word character [^a-zA-Z0-9_]

Points to note:

[0-9] is the same as [0123456789]
\d is short for [0-9]
\w is short for [a-zA-Z0-9_]
[7-9] is the same as [789]
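
To see a few of these sequences side by side, here is a small sketch. The sample string is invented for illustration, and it uses re.findall, which is covered later in this article, to list every match.

```python
import re

sample = "Room 42, floor_3\tB"

digits = re.findall(r"\d", sample)       # every single digit
non_space = re.findall(r"\S+", sample)   # runs of non-whitespace characters
word_chars = re.findall(r"\w+", sample)  # runs of letters, digits and underscores

print(digits)      # ['4', '2', '3']
print(non_space)   # ['Room', '42,', 'floor_3', 'B']
print(word_chars)  # ['Room', '42', 'floor_3', 'B']
```

Note how the comma survives in the \S+ result but not in the \w+ result, while the underscore counts as a word character in both.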

Having learned something about special sequences, let's continue with our coding. Write down and run the code below.

import re

pattern = re.compile(r"\w")
# Let's feed in a string to match
string = "regex is awesome!"
# Then call a matching method to match our pattern
result = pattern.match(string)
print(result.group())  # will print out 'r'

The match method returns a match object, or None if no match was found. We print result.group(); group() is a match object method that returns the entire match. If there had been no match, result would be None, and calling result.group() would raise an AttributeError.

You may wonder why the output is only a letter and not the whole word. It's simply because \w matches exactly one character, here the first character of the string. We've just written our first regex program!
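
Since match can return None, it is usually safer to check the result before calling group(). A minimal sketch, with the second input chosen here so that the match fails:

```python
import re

pattern = re.compile(r"\w")

for text in ["regex is awesome!", "!regex"]:
    result = pattern.match(text)
    if result is None:
        # "!" is not a word character, so nothing matches at position 0
        print(repr(text), "-> no match at the start")
    else:
        print(repr(text), "->", result.group())
```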

Let's do more than that

We want to do more than simply match a single letter, so we amend our code to look like this:

# Replace the pattern variable with this
pattern = re.compile(r"\w+") # Notice the plus sign we just added

The + in our second pattern is what we call a quantifier.

Quantifiers specify how many times the element just before them must occur.

Here are some other regex quantifiers and how to use them.

Quantifier	Description	Example	Sample match
+	one or more times	\w+	ABCDEF097
{2}	exactly 2 times	\d{2}	01
{1,}	one or more times	\w{1,}	smiling
{2,4}	2, 3 or 4 times	\w{2,4}	1234
*	0 or more times	A*B	AAAAB
?	0 or 1 time	colou?r	color
+?	one or more times, as few as possible (lazy)	\d+?	1 in 12345
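
The quantifiers can be tried out directly. The sketch below uses made-up inputs and contrasts a greedy quantifier with its lazy form.

```python
import re

exactly_two = re.match(r"\d{2}", "0123").group()
two_to_four = re.match(r"\w{2,4}", "123456").group()
zero_or_more = re.match(r"A*B", "B").group()       # zero A's is acceptable
optional = re.match(r"colou?r", "color").group()   # the u may be absent
greedy = re.match(r"\d+", "12345").group()         # takes as much as it can
lazy = re.match(r"\d+?", "12345").group()          # takes as little as it can

print(exactly_two)   # 01
print(two_to_four)   # 1234
print(zero_or_more)  # B
print(optional)      # color
print(greedy)        # 12345
print(lazy)          # 1
```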

Let's write some more quantifiers in our program!

import re

def regex(string):
    """Return the run of digits at the start of string, or None."""
    pattern = re.compile(r"\d{1,}") # For brevity, this is the same as r"\d+"
    result = pattern.match(string)
    if result:
        return result.group()
    return None

# Call our function, passing in our string
print(regex("007 James Bond"))

The above regex uses a quantifier to match at least one digit. For this input, the function returns '007'.

What are ^ and $ ?

You may have noticed that a regex often contains the ^ and $ characters. For example, r"^\w+$". Here's why.

^ and $ are boundaries, or anchors. ^ matches at the start of the string and $ at the end of it (or just before a trailing newline), so r"^\w+$" only matches a string made up entirely of word characters.

However, when ^ is the first character inside square brackets, [^ ... ], it means not. For example, [^\s] tells the regex to match any character that is not whitespace. Note that $ loses its special meaning inside brackets: [^\s$] matches any character that is neither whitespace nor a literal dollar sign.
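
A short sketch of both uses, with inputs invented for this example: anchors around a whole pattern, and ^ as negation inside brackets. It relies on re.search and re.findall, both described later in the article.

```python
import re

whole_word = re.compile(r"^\w+$")
only_word = bool(whole_word.search("dreamer"))      # the whole string is word characters
two_words = bool(whole_word.search("day dreamer"))  # the space is not a word character
print(only_word)  # True
print(two_words)  # False

no_space = re.findall(r"[^\s]+", "dance more")
print(no_space)  # ['dance', 'more']

# Inside brackets, $ is a literal dollar sign rather than an anchor.
no_space_or_dollar = re.findall(r"[^\s$]+", "cost $5")
print(no_space_or_dollar)  # ['cost', '5']
```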

Let's write some code to prove this:

import re

line = "dance more"
# One or more characters that are neither digits nor whitespace
result = re.match(r"[^\d\s]+", line)
print(result.group())     # Prints out 'dance'

First, notice there's no re.compile this time. Programs that use only a few regular expressions don't have to compile them explicitly, so we call the module-level re.match() instead. It takes the pattern as its first argument and the string to match as its second, so we passed it the line variable. Moving on swiftly!

Let's look at a new concept: search.

Searching versus Matching

The match method checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.

Let's write some search functionality.

import re

string = "\n  dreamer"
result = re.search(r"\w+", string, re.MULTILINE)
print(result.group())  # Prints out 'dreamer'

The search method, like the match method, can also take an extra flags argument. re.MULTILINE makes ^ and $ match at the start and end of every line, not only at the start and end of the whole string. The pattern above has no anchors, so the flag changes nothing here: search scans past the newline character either way.
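
To make the effect of re.MULTILINE visible, the pattern needs an anchor. A minimal sketch with a two-line string invented for this example:

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ only matches at the very start of the string.
single = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r"^\w+", text, re.MULTILINE)
# search looks past newlines regardless of the flag.
found = re.search(r"second", text).group()

print(single)  # ['first']
print(multi)   # ['first', 'second']
print(found)   # second
```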

Let's take a look at another example of how search works:

import re

pattern = re.compile(r"^<html>")
result = pattern.search("<html></html>")
print(result.group())

This will print out <html>.

Splitting

re.split() splits a string into a list, using the passed pattern as the delimiter. For example, consider names read from a file that we want to put in a list:

text = "John Doe\nJane Doe\nJin Du\nChin Doe"

We can use split to break the text at each newline and collect the lines in a list:

import re

results = re.split(r"\n+", text)
print(results)     # will print: ['John Doe', 'Jane Doe', 'Jin Du', 'Chin Doe']
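
Text read from a real file is rarely this tidy. The sketch below assumes a messier input, invented here, with a blank line, stray spaces and a trailing newline, and shows one way to clean up what re.split returns.

```python
import re

raw = "John Doe\n\n  Jane Doe  \nJin Du\n"

pieces = re.split(r"\n+", raw)
print(pieces)  # ['John Doe', '  Jane Doe  ', 'Jin Du', '']

# Strip surrounding spaces and drop the empty piece left by the trailing newline.
names = [piece.strip() for piece in pieces if piece.strip()]
print(names)  # ['John Doe', 'Jane Doe', 'Jin Du']
```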

Finding it all

But what if we wanted to find all instances of words in a string? Enter re.findall.

re.findall() finds every non-overlapping occurrence of a pattern, not just the first one as re.search() does. Unlike search, which returns a match object, findall returns a list of the matched strings. Let's write and run this functionality.

import re

def finder(string):
    """This function finds all the words in a given string."""
    result_list = re.findall(r"\w+", string)
    return result_list

# Call finder function, passing in the string argument
print(finder("finding dory"))

The output will be a list: ['finding', 'dory']

Let's say we want to search for people with 5 or 6-figure salaries. Regex will make it easy for us. Let's try it out:

import re
salaries = "120000   140000   10000   1000   200"

result_list = re.findall(r"\d{5,6}", salaries)
print(result_list)     # prints out: ['120000', '140000', '10000']
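
One caveat: \d{5,6} will also match the first six digits of a longer number. The sketch below changes the data to include a seven-figure salary (an assumption made for this example) and uses the word boundary \b, which the article has not introduced, to reject it.

```python
import re

salaries = "120000   1400000   10000   1000   200"

# Matches inside the seven-digit number as well.
loose = re.findall(r"\d{5,6}", salaries)
# \b requires the digits to start and end at a word boundary.
strict = re.findall(r"\b\d{5,6}\b", salaries)

print(loose)   # ['120000', '140000', '10000']
print(strict)  # ['120000', '10000']
```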

Manipulating that string

Suppose we wanted to do some string replacement. The re.sub method will help us do that. It returns a copy of the string in which every match of the pattern has been replaced.

Let's write code to do some string replacement:

import re

pattern = re.compile(r"[0-9]+")
result = pattern.sub("__", "there is only 1 thing 2 do")
print(result)

The program replaces every run of digits in the string with two underscores. The printed output is therefore: there is only __ thing __ do

Let's try out another example. Write down the following code:

import re

pattern = re.compile(r"\w+") # Match runs of word characters
input_string = "Lorem ipsum with steroids"
result = pattern.sub("regex", input_string) # replace each word with the word regex
print(result)  # prints 'regex regex regex regex'

We have managed to replace every word in the input string with the word "regex". Regex is very powerful for string manipulation.
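
If only some of the matches should be replaced, sub accepts an optional count argument. A minimal sketch reusing the same input:

```python
import re

pattern = re.compile(r"\w+")
text = "Lorem ipsum with steroids"

# count=1 stops after the first replacement.
first_only = pattern.sub("regex", text, count=1)
print(first_only)  # regex ipsum with steroids
```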

Look ahead!

Sometimes you might encounter (?=...) in a regex. This syntax defines a lookahead: it asserts that whatever comes next matches the enclosed pattern, without making that text part of the match. For instance, r"a(?=b)" matches a only if it is followed by b, and the match itself is just a.

Let's write some code to illustrate that.

import re

pattern = re.compile(r'\w+(?=\sfox)')
result = pattern.search("The quick brown fox")
print(result.group()) # prints 'brown'

The pattern matches the first word that is immediately followed by a space character and the word fox.

Let's look at another example. Go ahead and write this snippet:

"""
Match any word followed by a comma.
The example below is not the same as re.compile(r"\w+,")
For this will result in [ 'Me,' , 'myself,' ]
"""
import re

pattern = re.compile(r"\w+(?=,)")
res = pattern.findall("Me, myself, and I")
print(res)

The above regex matches every run of word characters that is followed by a comma, without including the comma itself. When we run this, it prints the list [ 'Me', 'myself' ].
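
The difference between consuming the comma and only looking ahead for it can be checked side by side. The sketch also shows the negative form (?!...), which appears in the final example of this article, and uses \b (a word boundary, not covered in the article) so that the negative case only considers whole words.

```python
import re

sentence = "Me, myself, and I"

consuming = re.findall(r"\w+,", sentence)         # the comma is part of the match
lookahead = re.findall(r"\w+(?=,)", sentence)     # the comma is only checked
negative = re.findall(r"\b\w+\b(?!,)", sentence)  # whole words NOT followed by a comma

print(consuming)  # ['Me,', 'myself,']
print(lookahead)  # ['Me', 'myself']
print(negative)   # ['and', 'I']
```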

When to escape

What if you wanted to match a string that contains some of these special regex characters? A backslash gives ordinary characters a special meaning in regex (as in \d), and it also takes the special meaning away from characters that have one. So to match a special character literally, we escape it by putting a backslash \ in front of it.

Here's an example.

import re

pattern = re.compile('\\\\')  # matches one literal backslash; same as r'\\'
result = pattern.match("\\author")
print(result.group())    # will print \

Let's try it one more time to make sure we've got it. Suppose we want to match a literal + (normally a quantifier) in a string. We'll do something like this:

import re

pattern = re.compile(r"\w+\+") # match word characters followed by a + character
result = pattern.search("file+")
print(result.group()) # will print out file+

We have successfully escaped the + character, so the regex treats it as a literal plus sign rather than as a quantifier.
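
When the text to match comes from somewhere else and may contain several special characters, escaping by hand gets error-prone. The re module provides re.escape for this. A minimal sketch with an invented input:

```python
import re

literal = "1+1=2?"  # contains + and ?, both special in regex

escaped = re.escape(literal)  # backslashes are added where needed
print(escaped)

found = re.search(escaped, "Is 1+1=2? Yes.")
print(found.group())  # 1+1=2?
```

The exact escaped form is printed without an expected value here because the set of characters re.escape touches has changed between Python versions.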

Can we monetize it?

For a real world application, here's a function that monetizes a number using thousands separator commas.

import re


def monetizer(number):
    """This function adds a thousands separator using comma characters."""
    number = str(number)
    try:
        int(number)  # raises ValueError if the input is not an integer
    except ValueError:
        return "Not a Number"
    # Format into groups of three from the right to the left
    pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
    # Append a comma to each matched group, then return
    return pattern.sub(r'\g<0>,', number)


number = input("Enter your number\n")
# Function call, passing in number as an argument
print(monetizer(number))

As you might have noticed, the pattern uses a lookahead. \d{1,3} matches a group of one to three digits, and the lookahead (?=(\d{3})+(?!\d)) only lets it match where the digits that remain form complete groups of three. Each such group then gets a comma appended. For example, the number 1223456 will become 1,223,456.
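
As a sanity check on that pattern, the sketch below compares it against Python's built-in thousands separator, format(n, ","). The names GROUPING and add_commas are made up for this example, and the test values are arbitrary.

```python
import re

GROUPING = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")


def add_commas(digits):
    """Append a comma to every digit group that is followed by full groups of three."""
    return GROUPING.sub(r"\g<0>,", digits)


for n in [0, 999, 1000, 1223456, 1000000000]:
    # format(n, ",") is the built-in way to get the same result
    assert add_commas(str(n)) == format(n, ",")

print(add_commas("1223456"))  # 1,223,456
```

In everyday code the built-in formatter is the simpler choice; the regex version is mainly useful as an exercise in lookaheads.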

Conclusion

Congratulations on making it to the end of this intro! From special sequences, matching and searching, to finding every match, using lookaheads and manipulating strings with regex, we've covered quite a lot.

There are some advanced concepts in regex, such as backtracking and performance optimization, which we can continue to learn as we grow. A good resource for the more intricate details is the re module documentation.

Great job learning something that many consider difficult! If you found this helpful, spread the word.