Learning Python: Regular Expressions

sentimental_dog

Category: Machine Learning

Article link: https://blog.csdn.net/sentimental_dog/article/details/52588904

slash: the / character

backslash: the \ character

Official Python documentation on regular expressions: http://python.usyiyi.cn/translate/python_278/library/index.html

http://python.usyiyi.cn/documents/python_278/howto/regex.html#regex-howto

Common regular expression usage (image from CSDN)

Notes on regular expressions

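(1) Greedy and non-greedy quantifiers

Regular expressions are typically used to find matching strings in text. In Python, quantifiers are greedy by default (a few languages may default to non-greedy): they try to match as many characters as possible. Non-greedy quantifiers do the opposite and try to match as few characters as possible. For example, the regular expression "ab*" applied to "abbbc" finds "abbb", whereas the non-greedy "ab*?" finds only "a".

Note: for extraction we usually use non-greedy mode.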
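The difference can be checked directly in the interpreter. The following is a minimal sketch: the first part reuses the "abbbc" example from the text, and the second part uses a made-up HTML-like string (an assumption for illustration) to show why non-greedy quantifiers are usually preferred for extraction.

```python
import re

text = "abbbc"

# Greedy: b* takes as many 'b' characters as it can.
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? takes as few 'b' characters as it can (here, none).
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a

# Illustrative sample (not from the original article).
html = "<b>one</b><b>two</b>"

# Greedy .* runs to the last closing tag, so both items are merged.
greedy_items = re.findall(r"<b>(.*)</b>", html)

# Non-greedy .*? stops at the first closing tag it can.
lazy_items = re.findall(r"<b>(.*?)</b>", html)

print(greedy_items)  # ['one</b><b>two']
print(lazy_items)    # ['one', 'two']
```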

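(2) The backslash problem

Like most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the string literal turns each pair into one backslash, leaving two, and the regular expression engine then reads those two as one literal backslash.

Python's raw strings solve this problem neatly: the regular expression in this example can be written r"\\". Likewise, "\\d", which matches a digit, can be written r"\d". With raw strings you no longer have to worry about a missing backslash, and the expressions are easier to read.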
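To make the equivalence concrete, here is a small sketch comparing the ordinary and raw spellings. The sample strings "C:\temp" and "a1b22" are made up for illustration.

```python
import re

# The source literal "C:\\temp" is the 7-character string C:\temp.
text = "C:\\temp"

# Ordinary string literal: four backslashes in source code.
plain_pattern = "\\\\"
# Raw string literal: two backslashes in source code.
raw_pattern = r"\\"

# Both literals are the same two-character pattern, which the regex
# engine reads as "one literal backslash".
same_pattern = plain_pattern == raw_pattern
print(same_pattern)  # True

found = re.search(raw_pattern, text)
print(found.start())  # 2

# The same holds for \d: "\\d" and r"\d" are the same pattern.
digits_plain = re.findall("\\d", "a1b22")
digits_raw = re.findall(r"\d", "a1b22")
print(digits_raw)  # ['1', '2', '2']
```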

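(3) On groups

Groups are marked by the "(" and ")" metacharacters. "(" and ")" have much the same meaning as they do in mathematical expressions: they group the expression inside them into a single unit. For example, you can repeat the contents of a group with a repeating qualifier such as *, +, ?, or {m,n}; (ab)* matches zero or more repetitions of "ab".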
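A short sketch of repeating a group, using a made-up input string "ababab!". It also shows what happens without the parentheses, where the quantifier applies only to the preceding character.

```python
import re

text = "ababab!"

# The quantifier applies to the whole group "ab".
grouped = re.match(r"(ab)*", text)
print(grouped.group())   # ababab
# For a repeated group, group(1) holds the last repetition only.
print(grouped.group(1))  # ab

# Without parentheses, * applies only to "b".
ungrouped = re.match(r"ab*", text)
print(ungrouped.group())  # ab

# {m,n}-style repetition works on groups too.
twice = re.match(r"(ab){2}", text)
print(twice.group())  # abab

# "Zero or more" includes zero: the match succeeds with an empty string.
empty = re.match(r"(ab)*", "xyz")
print(repr(empty.group()))  # ''
```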

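About m.group

Groups serve two purposes. One is to treat the enclosed content as a single unit. More importantly, once a match completes, the content of each group can be extracted. For example, suppose we need to extract D3231 from "station_train_code":"D3231","start_station_telecode":. How do we do that with re?

Compile the pattern with a=re.compile(r'station_train_code":"(\w+?)"') and call a.findall(text); this returns the D3231 item. (If the pattern has several pairs of parentheses and you want several results, see below.)

For treating the content as a single unit, see below.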
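The extraction can be sketched as follows. The first input is the fragment quoted in the text; the second input and its key names are made up for illustration, to show what findall returns when a pattern has more than one group.

```python
import re

# Fragment quoted in the article.
text = '"station_train_code":"D3231","start_station_telecode":'

# One group: findall returns a list of the group's contents.
pattern = re.compile(r'station_train_code":"(\w+?)"')
codes = pattern.findall(text)
print(codes)  # ['D3231']

# Hypothetical sample with two key/value pairs (not from the article).
sample = '"code":"D3231","seat":"A12"'

# Two groups: findall returns a list of tuples, one tuple per match.
pairs = re.findall(r'"(\w+)":"(\w+)"', sample)
print(pairs)  # [('code', 'D3231'), ('seat', 'A12')]
```

Note that the quote after the colon has to appear in the pattern, because \w does not match a quote character.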

Now you can query the match object for information about the matching string. match object instances also have several methods and attributes; the most important ones are:

Method/Attribute	Purpose
`group()`	Return the string matched by the RE
`start()`	Return the starting position of the match
`end()`	Return the ending position of the match
`span()`	Return a tuple containing the (start, end) positions of the match

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right. (Group 0 is always present; it is the whole match.)

Group numbering depends only on the compiled regular expression, not on the string being matched.

>>> import re
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
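As a small sketch continuing the same pattern, groups() and the position methods from the table above can be used like this. The optional-group case at the end is an extra illustration that is not in the original text.

```python
import re

m = re.match(r"(a(b)c)d", "abcd")

# groups() covers group 1 upward; group 0 is not included.
all_groups = m.groups()
print(all_groups)  # ('abc', 'b')

# group() accepts several numbers and returns a tuple.
print(m.group(1, 2))  # ('abc', 'b')

# start(), end() and span() default to group 0 but accept a group number.
print(m.span())   # (0, 4)
print(m.span(2))  # (1, 2)

# A group that did not take part in the match is reported as None.
optional = re.match(r"(a)(b)?", "a")
print(optional.groups())  # ('a', None)
```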

match() versus search()

The match() function only checks if the RE matches at the beginning of the string, while search() will scan forward through the string for a match. It’s important to keep this distinction in mind. Remember, match() will only report a successful match which will start at 0; if the match wouldn’t start at zero, match() will not report it.

Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.

Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.
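The distinction is easy to see on a pair of sample words (chosen here for illustration). The sketch also shows a side effect of the .* workaround: the reported match then starts at 0, so the position of the interesting part is lost unless a group is added.

```python
import re

# match() succeeds only when the pattern matches at position 0.
at_start = re.match(r"super", "superstition")
print(at_start.span())  # (0, 5)

not_at_start = re.match(r"super", "insuperable")
print(not_at_start)  # None

# search() scans forward and reports where the match was found.
found = re.search(r"super", "insuperable")
print(found.span())  # (2, 7)

# The .* workaround matches, but the span now begins at 0.
workaround = re.match(r".*super", "insuperable")
print(workaround.span())  # (0, 7)
```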

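Enabling non-greedy mode

Matching is greedy by default; adding a question mark ? directly after a quantifier makes it non-greedy. (Quantifiers include *, ?, and +.)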
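A compact sketch of each quantifier next to its non-greedy form, run against the made-up input "aaaa". The {m,n} case is included as well, since the same rule applies to it.

```python
import re

text = "aaaa"

# (greedy pattern, non-greedy pattern) pairs.
pattern_pairs = [
    (r"a*", r"a*?"),
    (r"a+", r"a+?"),
    (r"a?", r"a??"),
    (r"a{2,3}", r"a{2,3}?"),
]

results = {}
for greedy_pat, lazy_pat in pattern_pairs:
    greedy_match = re.match(greedy_pat, text).group()
    lazy_match = re.match(lazy_pat, text).group()
    results[greedy_pat] = (greedy_match, lazy_match)
    print(greedy_pat, repr(greedy_match), "|", lazy_pat, repr(lazy_match))

# Printed lines:
# a* 'aaaa' | a*? ''
# a+ 'aaaa' | a+? 'a'
# a? 'a' | a?? ''
# a{2,3} 'aaa' | a{2,3}? 'aa'
```

The non-greedy forms match the minimum the quantifier allows: zero characters for *? and ??, one for +?, and m for {m,n}?.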