Regular Expressions in Python

Introduction

A regular expression (regex) is a sequence of characters that describes a search pattern, used to check whether that pattern occurs in a given text (string), for example, to find out whether "123" occurs in "Practical123DataScie". The regex parser treats each character of "123" as an ordinary character that matches only itself. The real power of regular expressions shows when a pattern contains special characters called metacharacters. These have a special meaning to the regex matching engine and greatly extend what a search can express.

Regex functionality resides in a module named re. It is part of Python's standard library, so we only need to import it to start working with it:

import re

This tutorial covers two very useful functions in the re module: search(), for finding a pattern in a string, and split(), for splitting a string on a pattern. You will also learn to create more complex matching patterns with metacharacters.

I) .search() function

A regular expression search is typically written as:

re.search(pattern, string)

This function scans through the string and returns a match object for the first location where the pattern matches. If there is no match, it returns None. Let us look at the following example:

s1 = "Practical123DataScie"
re.search("123", s1)
Output: <re.Match object; span=(9, 12), match='123'>
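
The fields shown in that printed match object can also be read from code. The following is a minimal sketch, not part of the original tutorial; the variable names match and missing are only illustrative. It uses the span() and group() methods of the match object and checks for None before using the result.

```python
import re

s1 = "Practical123DataScie"

match = re.search("123", s1)
if match is not None:
    start, end = match.span()   # start and end index of the match
    print(match.group())        # the matched text
    print(s1[start:end])        # slicing the string with the span gives the same text
else:
    print("no match")

# A pattern that does not occur yields None instead of raising an error.
missing = re.search("999", s1)
print(missing is None)
```

Checking for None first matters because calling group() or span() on None raises an AttributeError.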

The printed match object carries a lot of information. It tells you that there is a match and that it is located at s1[9:12]. This is an easy case, and we often need to search for more complex patterns. Imagine now that you want to look for three consecutive digits such as "456" or "789". Here a literal string is not enough, because we do not know in advance which digits they are. They could be "124", "052", and so on. How can we do that?

s2 = "PracticalDataScie052"
re.search('[0-9][0-9][0-9]', s2)
Output: <re.Match object; span=(17, 20), match='052'>

There are several concepts to discuss here. The pattern used is '[0-9][0-9][0-9]'. First, let us talk about square brackets ([]). A pattern of the form [...] defines a character class: it matches any single character listed inside the square brackets. For example:

re.search('[0]', s2)
Output: <re.Match object; span=(17, 18), match='0'>

The pattern '[0]' locates the first character 0 in s2 and returns a match object if one is found. To locate several characters in a row, such as three specific digits, I can write:

re.search('[0][5][2]', s2)
Output: <re.Match object; span=(17, 20), match='052'>

Admittedly, I could just use '052' as the pattern to locate it in s2, but this is where things get interesting. Inside square brackets, a hyphen (-) between two characters denotes a range, so a single class can match any character in that range. For example:

re.search('[0-9]', s2)
Output: <re.Match object; span=(17, 18), match='0'>

This finds any single digit from zero to nine in s2. Now let us get back to our question of locating three consecutive digits. To do that, I can simply write:

re.search('[0-9][0-9][0-9]', s2)
Output: <re.Match object; span=(17, 20), match='052'>

Each bracketed range matches one digit in s2, so three of them in a row match three consecutive digits. Ranges work for letters as well. For example:

re.search('[a-z][0-9]', s2)
Output: <re.Match object; span=(16, 18), match='e0'>

This pattern locates two characters: the first is any lowercase letter and the second is a digit. The output ('e0') is exactly what we expect. The shorthand '\d' also matches a digit and, for ASCII text, is equivalent to '[0-9]'. It is best written in a raw string (r'...') so that Python passes the backslash to the regex engine unchanged. So, for the previous example, I can also use:

re.search(r'[a-z]\d', s2)
Output: <re.Match object; span=(16, 18), match='e0'>
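
To make the relationship between a digit range and '\d' concrete, here is a small sketch that is not part of the original tutorial. It compares the two patterns on s2 and then shows one case where they differ: for str patterns in Python 3, '\d' also matches decimal digits outside ASCII unless the re.ASCII flag is given. The variable names and the Arabic-Indic sample digit are chosen only for illustration.

```python
import re

s2 = "PracticalDataScie052"

# Raw strings (r"...") keep the backslash in \d from being read as a string escape.
by_range = re.search(r"[a-z][0-9]", s2)
by_shorthand = re.search(r"[a-z]\d", s2)
print(by_range.group(), by_shorthand.group())  # both find the same two characters

# A non-ASCII decimal digit: ARABIC-INDIC DIGIT THREE.
arabic_three = "\u0663"

# \d on a str pattern matches it ...
print(re.search(r"\d", arabic_three) is not None)
# ... unless re.ASCII restricts \d to the characters 0-9,
print(re.search(r"\d", arabic_three, flags=re.ASCII) is None)
# which is what the explicit range always does.
print(re.search(r"[0-9]", arabic_three) is None)
```

For plain ASCII input such as s2, the two forms are interchangeable, so the choice between them is mostly a matter of readability.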

II) .split() function

Similar to the search function, a regular expression split is typically written as:

re.split(pattern, string)

This function splits the string wherever the pattern matches, using each match as a delimiter, and returns the substrings as a list. Let us look at the following example:

re.split('[;]', 'Data;Science and; Data Analysis;courses')
Output: ['Data', 'Science and', ' Data Analysis', 'courses']

In this example, the pattern is [;], which means that the semicolon (;) is the delimiter. Wherever a semicolon occurs in the string, the string is split at that location, and the pieces are returned in a list. Note that the space after the second semicolon is kept at the start of ' Data Analysis'. A pattern can also describe more than one delimiter. Let us look at a more complex example.

string = "Data12Science567programbyAWS025GoogleCloud"
re.split(r'\d+', string)
Output: ['Data', 'Science', 'programbyAWS', 'GoogleCloud']

In this example, our pattern is '\d+'. As we have seen, '\d' matches any digit (0 to 9). Adding '+' after it makes the pattern match one or more consecutive digits. Therefore, every run of consecutive digits is treated as a single delimiter, and the substrings between them are returned in a list.
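
The effect of the '+' is easiest to see by splitting the same string with and without it. The sketch below is an illustration added to the tutorial's example; the variable names and the second sample string are assumptions, not from the original. It also shows a character class listing two delimiters at once.

```python
import re

text = "Data12Science567programbyAWS025GoogleCloud"

# One digit per delimiter: adjacent digits leave empty strings between them.
single = re.split(r"\d", text)
print(single)

# One or more digits per delimiter: each run of digits is a single separator.
runs = re.split(r"\d+", text)
print(runs)

# A character class can list several delimiter characters at once.
mixed = re.split(r"[;,]", "Data;Science,Analysis")
print(mixed)
```

Without the '+', the list contains an empty string for every pair of neighboring digits, which is rarely what is wanted when digits act as separators.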

Let us consider the following string. It lists two courses in the format "[Course Number] [Programming Language] [Course Name]". The string spans two lines, and the spacing between the words is not uniform.

string = '''101              Python       DataScience 
102 R DataAnalysis'''
re.split(r'\s+', string)
Output: ['101', 'Python', 'DataScience', '102', 'R', 'DataAnalysis']

In this example, the '\s' pattern matches any whitespace character, including spaces, tabs, and newlines. Adding a plus sign '+' after it makes the pattern match one or more consecutive whitespace characters, so both the uneven spacing and the line break act as single delimiters.
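
Splitting the whole string at once gives a flat list, which loses the grouping into courses. One possible way to keep each course together, sketched here under the assumption that every line has exactly three fields, is to split into lines first and then split each line on whitespace. The names catalog and courses are illustrative only.

```python
import re

# Same text as above, with the line break written explicitly as \n.
catalog = "101              Python       DataScience \n102 R DataAnalysis"

courses = []
for line in catalog.splitlines():
    # strip() removes leading/trailing whitespace so that re.split does not
    # produce an empty string at either end of the list.
    number, language, name = re.split(r"\s+", line.strip())
    courses.append((int(number), language, name))

print(courses)

# Without strip(), whitespace at the edges yields empty strings.
print(re.split(r"\s+", " a b "))
```

Stripping each line first is a small design choice that keeps the three-way unpacking from failing on trailing spaces such as the one after DataScience.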

III) Conclusion

This tutorial discussed the search and split functions of the re module. Using metacharacters to build different patterns can be very beneficial in text mining.

Translated from: https://medium.com/@s.sadathosseini/regular-expressions-in-python-2f79e37f8dff
