Python Basics: Regular Expressions

小帅的私人空间 (Xiaoshuai's private space)

Article link: https://blog.csdn.net/joshuajinxiaoshuai/article/details/82157075

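A friend gave me a hard time about this, so I decided to face the problem instead of dodging it. I read the documentation carefully and recorded my notes below.
First: there is a great website for trying patterns online with explanations: https://regexr.com/
Second: the Python 3 documentation.
The tutorial is https://docs.python.org/3/howto/regex.html
The full reference is https://docs.python.org/3/library/re.html#re-syntax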

1. Metacharacter basics

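Metacharacters are reserved characters (they are not matched literally unless escaped):
. ^ $ * + ? { } [ ] \ | ( )
[] character class
[abc] will match any of the characters a, b, or c; note that it matches any single one of them
[abc] = [a-c]
So to match letters you normally write a range; for uppercase letters that is [A-Z]
Metacharacters (except \) are not active inside classes. Inside [] a reserved character such as $ loses its special meaning and quietly behaves as an ordinary character
^ as the first character inside [] means "anything except": [^abc] matches any character other than a, b, or c.
\ backslash, the most important one; a lowercase escape matches a class of characters, and its uppercase version matches everything outside that class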
\d
Matches any decimal digit; this is equivalent to the class [0-9].
\D
Matches any non-digit character; this is equivalent to the class [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
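To see the shorthand classes side by side, here is a small sketch. The sample text and variable names are made up for illustration. One caveat: for `str` patterns in Python 3 these escapes are Unicode-aware by default, so the `[0-9]`-style equivalences above hold exactly only with `re.ASCII` (or for bytes patterns).

```python
import re

text = "Room 42,\tfloor_3\n"   # hypothetical sample text

digits = re.findall(r"\d", text)       # each decimal digit
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)       # each whitespace character
non_spaces = re.findall(r"\S+", text)  # runs of non-whitespace characters

# A negated class: every character except a, b, c
not_abc = re.findall(r"[^abc]", "aXbYc")

# \d also matches non-ASCII digits in str patterns unless re.ASCII is set
unicode_digit = re.findall(r"\d", "\u0663")            # ARABIC-INDIC DIGIT THREE
ascii_only = re.findall(r"\d", "\u0663", re.ASCII)

print(digits, words, spaces, non_spaces, not_abc, unicode_digit, ascii_only)
```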
. A dot matches any single character except a newline (with re.DOTALL it matches newlines too)
| or
Alternation, or the "or" operator. Crow|Servo will match either 'Crow' or 'Servo', not 'Cro', a 'w' or an 'S', and 'ervo'
^ matches at the beginning of the string:

import re
print(re.search('^From', 'From Here to Eternity'))
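A quick sketch of the dot, alternation, and `^` in action. The sample strings are my own, except the two `^From` inputs, which follow the HOWTO.

```python
import re

# "." matches any character except a newline
dot_hits = re.findall(r"a.c", "abc a\nc a-c")

# "|" chooses between whole alternatives: "Crow" or "Servo"
alt_hits = re.findall(r"Crow|Servo", "Tom Servo and Crow T. Robot")

# "^" only succeeds at the start of the string
at_start = re.search(r"^From", "From Here to Eternity")
not_at_start = re.search(r"^From", "Reciting From Memory")

print(dot_hits, alt_hits, at_start, not_at_start)
```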

2. Repeating things

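Asterisk * (also a reserved character in Markdown, lol)
For example, ca*t will match 'ct' (0 'a' characters), 'cat' (1 'a'), 'caaat' (3 'a' characters), and so forth.
Example: a[bcd]*b matches all of these: abcb, acbb, acccccccbbbb
Plus sign +
Note the difference: * matches zero or more times, while + requires at least one occurrence, so + is the stricter of the two
Question mark ?
Zero or one time. Think of the question mark as asking "is it there?": if not, that is 0 times; if so, exactly once
Use {m,n} for exact counts; note that the range is inclusive at both ends, and write it without a space after the comma
It means at least m and at most n repetitions; for example, {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?
a/{1,3}b will match 'a/b', 'a//b', and 'a///b'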
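The quantifiers above can be compared directly. This sketch uses `re.fullmatch` so that each sample must match in its entirety; the list names are just for illustration.

```python
import re

samples = ["ct", "cat", "caaat"]

star = [s for s in samples if re.fullmatch(r"ca*t", s)]      # zero or more a
plus = [s for s in samples if re.fullmatch(r"ca+t", s)]      # one or more a
optional = [s for s in samples if re.fullmatch(r"ca?t", s)]  # zero or one a

# {1,3}: between one and three slashes, inclusive at both ends
slashes = ["ab", "a/b", "a//b", "a///b", "a////b"]
bounded = [s for s in slashes if re.fullmatch(r"a/{1,3}b", s)]

print(star, plus, optional, bounded)
```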

3. Using Python 3's re package

1. compile()

import re
p = re.compile('ab*', re.IGNORECASE)
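As a rough illustration of what the compiled pattern does, the sketch below reuses it on a few inputs I made up. Because of `re.IGNORECASE`, the case of the letters does not matter.

```python
import re

p = re.compile("ab*", re.IGNORECASE)

lower = p.match("abbb")       # plain lowercase input
upper = p.match("ABBB")       # matches thanks to IGNORECASE
mixed = p.match("aBbB xyz")   # match stops where the b's end

print(lower.group(), upper.group(), mixed.group())
```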

2. The Backslash Plague (haha, a plague...)

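Backslashes can really make your head spin, especially when the text you want to match itself starts with a backslash. The regex needs that backslash doubled to escape it, and an ordinary Python string literal then needs each of those backslashes doubled again. The end result:
\section Text string to be matched
\\section Escaped backslash for re.compile()
"\\\\section" Escaped backslashes for a string literal

The cure for the backslash curse is the r prefix (raw strings). For example, r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular string literals and their raw-string equivalents:
"ab*" -> r"ab*"
"\\\\section" -> r"\\section"
"\\w+\\s+\\1" -> r"\w+\s+\1"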
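A small check of the table above, as a sketch: the ordinary and raw literals produce the same pattern string, and that pattern finds a literal `\section` in the text. The sample text is my own.

```python
import re

text = r"\section{Intro}"   # contains a single literal backslash

plain = "\\\\section"   # ordinary literal: four backslashes typed
raw = r"\\section"      # raw literal: two backslashes typed

same_pattern = plain == raw   # both are the 9-character regex \\section
m = re.search(raw, text)

newline_len = len("\n")       # one character: a newline
raw_newline_len = len(r"\n")  # two characters: backslash and n

print(same_pattern, m.group(), newline_len, raw_newline_len)
```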

3. Methods for matching and searching

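match() matches from the start and returns a match object, or None. Determine if the RE matches at the beginning of the string.
search() scans for a match anywhere and returns a match object, or None. Scan through a string, looking for any location where this RE matches.
findall() returns a list. Find all substrings where the RE matches, and returns them as a list.
finditer() returns an iterator of match objects. Find all substrings where the RE matches, and returns them as an iterator.
It looks like there is no ready-made function for my problem, so I will have to write my own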
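Here is a minimal sketch of the four methods on one compiled pattern. The sample text is hypothetical; the point is the return types: match objects (or `None`), a list of strings, and an iterator of match objects.

```python
import re

p = re.compile(r"\d+")
text = "12 drummers drumming, 11 pipers piping"   # made-up sample

at_start = p.match(text)             # match object: "12" sits at position 0
not_at_start = p.match("x12")        # None: the string does not start with a digit
first = p.search("x12 y34")          # match object for the first hit, "12"
all_strings = p.findall(text)        # list of matched substrings
all_spans = [m.span() for m in p.finditer(text)]  # iterate over match objects

print(at_start, not_at_start, first, all_strings, all_spans)
```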

4. The API in practice

>>> import re
>>> p = re.compile('[a-z]+')
>>> p.match("")
>>> print(p.match(""))
None
>>> m = p.match('tempo')

Methods of the match object:
group() Return the string matched by the RE
start() Return the starting position of the match
end() Return the ending position of the match
span() Return a tuple containing the (start, end) positions of the match

>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
>>> p = re.compile(r'\d+')
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable_iterator object at 0x...>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
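Since match() returns None when nothing matches, calling group() on the result blindly raises an AttributeError. A common pattern, sketched here with a hypothetical helper named `describe`, is to test the result first.

```python
import re

p = re.compile(r"[a-z]+")

def describe(text):
    """Hypothetical helper: report the match at the start of text, if any."""
    m = p.match(text)
    if m is None:
        return "no match"
    return f"{m.group()!r} at {m.span()}"

found = describe("tempo")
missing = describe("123")
print(found)
print(missing)
```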

5. Going further: functions in the module itself, Module-Level Functions

match(), search(), findall(), sub()
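These module-level functions take the pattern as their first argument, so no explicit compile() step is needed. A short sketch follows; the `From` inputs come from the HOWTO, the rest are made up.

```python
import re

first = re.match(r"From\s+", "From amk Thu May 14")   # matches "From "
none = re.match(r"From\s+", "Fromage amk")            # None: no whitespace after "From"
found = re.search(r"\d+", "order 66 shipped")
numbers = re.findall(r"\d+", "3 apples, 14 pears")
masked = re.sub(r"\d", "#", "PIN 1234")

print(first, none, found, numbers, masked)
```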
Compilation Flags: there are a lot of them, and they are hard to remember

Flag Meaning

ASCII, A Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. This one has a big effect
DOTALL, S Make . match any character, including newlines.
Yes, with this flag the dot . matches newlines as well, which it normally does not
IGNORECASE, I Do case-insensitive matches.
Letter case is ignored; uppercase and lowercase both match
LOCALE, L Do a locale-aware match.
Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale instead of the Unicode database.
MULTILINE, M Multi-line matching, affecting ^ and $.
For multi-line text: ^ normally matches only at the start of the string, and $ only at the end (or just before a trailing newline); with this flag they also match at the start and end of each line
VERBOSE, X (for 'extended') Enable verbose REs, which can be organized more cleanly and understandably.
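The flags are easier to remember after seeing them change a result. This is a small sketch on a two-line string of my own; flags can be combined with `|`.

```python
import re

text = "first line\nSecond line"   # made-up two-line sample

dot_default = re.search(r"line.Second", text)              # "." stops at the newline
dot_all = re.search(r"line.Second", text, re.DOTALL)       # "." now crosses it

starts_default = re.findall(r"^\w+", text)                 # start of string only
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)     # start of every line
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)       # end of every line

any_case = re.findall(r"second", text, re.IGNORECASE)
combined = re.findall(r"^s\w+", text, re.I | re.M)         # short names, combined

print(dot_default, dot_all, starts_default, starts_multi, ends_multi, any_case, combined)
```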

5. Grouping - this turns out to be really useful

()

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)

>>> p = re.compile(r'\b(\w+)\s+\1\b')
>>> p.search('Paris in the the spring').group()
'the the'
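Beyond repeating a sub-pattern, groups let you pull pieces out of a match. The sketch below uses a deliberately simplistic email-like pattern and made-up inputs; it shows numbered groups, named groups with `(?P<name>...)`, and the captured word behind the `\1` backreference.

```python
import re

# Numbered groups: group(0) is the whole match, 1 and 2 are the parentheses
m = re.match(r"(\w+)@(\w+)\.com", "alice@example.com")
whole = m.group(0)
user, domain = m.group(1), m.group(2)
both = m.groups()

# Named groups can be read by name or collected into a dict
named = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Jane Doe")
as_dict = named.groupdict()

# \1 refers back to whatever group 1 captured
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring")

print(whole, user, domain, both, as_dict, doubled.group(1))
```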

6. Lookahead Assertions

http://blog.51cto.com/cnn237111/749047 (an explanation in Chinese)
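For a concrete feel, here is a sketch of both kinds of lookahead. The filename pattern is the one from the Python HOWTO; the currency example and all sample data are my own. A lookahead checks what comes next without consuming it.

```python
import re

# Negative lookahead (?!...): keep filenames whose extension is not exactly "bat"
not_bat = re.compile(r".*[.](?!bat$)[^.]*$")
files = ["autoexec.bat", "sendmail.cf", "printers.conf", "foo.batch"]
kept = [f for f in files if not_bat.match(f)]

# Positive lookahead (?=...): numbers only when followed by " USD";
# the " USD" itself is not part of the match
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

print(kept, usd_amounts)
```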

7. Modifying Strings

split()

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']

sub()
subn()
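A sketch of sub() and subn(), mostly following the colour example in the HOWTO. sub() returns the new string, subn() returns a (new_string, count) tuple, and the replacement may use backreferences or be a function; the last two examples are my own.

```python
import re

p = re.compile(r"(blue|white|red)")
s = "blue socks and red shoes"

replaced_all = p.sub("colour", s)
replaced_one = p.sub("colour", s, count=1)     # replace only the first hit
with_count = p.subn("colour", s)               # (new string, number of replacements)

# Backreferences in the replacement: swap two words
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# A function as replacement receives each match object
shouted = p.sub(lambda m: m.group(0).upper(), s)

print(replaced_all, replaced_one, with_count, swapped, shouted)
```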

8. Tips

Use String Methods

Prefer plain string methods where they are enough; don't reach for the re API at every turn
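A sketch of where the line falls, using the word/deed example from the HOWTO: a fixed-string replacement is a job for str.replace(), but it also rewrites the text inside longer words, and that is where a regex with `\b` earns its keep.

```python
import re

s = "swordfish and word"

# Fixed string, no pattern needed: the string method is simpler
literal = s.replace("word", "deed")

# Whole words only: this needs the \b word boundary, so use re
whole_word = re.sub(r"\bword\b", "deed", s)

# Simple prefix checks also need no regex
starts = "From Here to Eternity".startswith("From")

print(literal, whole_word, starts)
```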

match() versus search()

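match() succeeds only if the pattern matches at the very beginning of the string;
search() succeeds if the pattern matches anywhere in the string;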
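The difference in one sketch, using the "super" example from the HOWTO:

```python
import re

anchored = re.match("super", "superstition")   # matches at position 0
no_match = re.match("super", "insuperable")    # None: "super" is not at the start
found = re.search("super", "insuperable")      # found further into the string

print(anchored.span(), no_match, found.span())
```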

Using re.VERBOSE

With this flag we can spread a regexp over several lines and add comments, which makes it simpler and easier to read

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

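To try the verbose pattern, here is a sketch that applies it to a made-up header line and compares it with the same regex squeezed onto one line. Note that the captured value keeps its leading space, because the pattern does not skip whitespace after the colon.

```python
import re

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value (lazy, so trailing spaces are left out)
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

# The same pattern without re.VERBOSE, for comparison
compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = "  Content-Type: text/html  "   # hypothetical input
m = pat.match(line)
header = m.group("header")
value = m.group("value")               # leading space after the colon is kept
same = compact.match(line).groupdict() == m.groupdict()

print(repr(header), repr(value), same)
```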
No Pain No Gain!
http://www.runoob.com/python/python-reg-expressions.html

Note to self: next time I plan to write about yield, a blind spot I discovered recently.