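After getting teased by a friend, I decided to face the problem head-on instead of dodging it. I read the documentation carefully and took the notes below.
First: there is a great website for trying patterns online, with explanations: https://regexr.com/
Second: the documentation (Python 3) is here:
The tutorial is https://docs.python.org/3/howto/regex.html
The full reference is https://docs.python.org/3/library/re.html#re-syntax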
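1. Metacharacter basics
- Metacharacters are reserved characters (they do not match themselves literally unless escaped):
. ^ $ * + ? { } [ ] \ | ( )
- [] character class
[abc] will match any of the characters a, b, or c. Note: any single one of them.
[abc] = [a-c]
So letters are usually matched with a range; for uppercase you would normally write [A-Z].
Most metacharacters are not active inside classes: inside [] they lose their special meaning and quietly behave as ordinary characters. The exceptions are \, a leading ^, and a - between two characters.
^ as the first character inside [] means negation: [^abc] matches any character except a, b, and c.
- \ backslash, the most important one. A lowercase shorthand matches the named set; the uppercase version matches everything outside it.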
\d
Matches any decimal digit; this is equivalent to the class [0-9].
\D
Matches any non-digit character; this is equivalent to the class [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W
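Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
Note: for str patterns in Python 3 these shorthands also match the Unicode equivalents (for example, \d matches non-ASCII digits); the bracketed classes above are exact for bytes patterns or when re.ASCII is set.
- . A dot matches any character except a newline, so whatever content is on the current line, it matches.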
- | or
Alternation, or the "or" operator. Crow|Servo will match either 'Crow' or 'Servo', not 'Cro', a 'w' or an 'S', and 'ervo'.
- ^ usually goes at the start of a pattern and matches at the beginning of the string:
print(re.search('^From', 'From Here to Eternity'))
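A quick sketch to check the notes in this section against a real interpreter. The sample string is made up for illustration, and the variable names are just mine.

```python
import re

text = "Order 66: call Crow or Servo\nat 9am"

# [A-Z] matches one uppercase letter; [^a-z\s] matches one character that
# is neither a lowercase letter nor whitespace.
uppercase = re.findall(r"[A-Z]", text)
not_lower_or_space = re.findall(r"[^a-z\s]", text)

# \d matches a single digit.
digits = re.findall(r"\d", text)

# . matches anything except a newline, so .* stops at the end of line 1.
first_line = re.match(r".*", text).group()

# | tries the left alternative first, then the right one.
robots = re.findall(r"Crow|Servo", text)

# ^ anchors the match to the start of the string.
starts_with_order = re.search(r"^Order", text) is not None
starts_with_call = re.search(r"^call", text) is not None

print(uppercase, digits, robots, starts_with_order, starts_with_call)
```

Note that 'call' does appear in the text, but not at the very beginning, so the anchored search finds nothing.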
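2. Repeating things
- Asterisk * (also a reserved character in markdown, lol)
For example, ca*t will match 'ct' (0 'a' characters), 'cat' (1 'a'), 'caaat' (3 'a' characters), and so forth.
Example: a[bcd]*b can match strings like these: abcb, acbb, acccccccbbbb
- Plus +
Note the difference: * matches zero or more times, while + requires at least one occurrence, so + is the stricter of the two.
- Question mark ?
Zero or one time. Think of ? as asking "is it there?": absent means 0 times, present means once.
- Use {m,n} for precise counts (written without a space); note that the range is inclusive at both ends
It means at least m and at most n repetitions; for example {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?
a/{1,3}b will match 'a/b', 'a//b', and 'a///b'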
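To see the four quantifiers side by side, here is a minimal sketch. `full_matches` is a hypothetical helper of my own (not part of `re`); it keeps only the strings that a pattern matches completely, using `re.fullmatch`.

```python
import re

samples = ["ct", "cat", "caaat"]

def full_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

star = full_matches(r"ca*t", samples)          # zero or more 'a'
plus = full_matches(r"ca+t", samples)          # one or more 'a'
question = full_matches(r"ca?t", samples)      # zero or one 'a'
bounded = full_matches(r"ca{2,3}t", samples)   # two or three 'a'
slashes = full_matches(r"a/{1,3}b",
                       ["ab", "a/b", "a//b", "a///b", "a////b"])

print(star, plus, question, bounded, slashes)
```

Only `*` accepts 'ct', and `{1,3}` rejects both zero slashes and four slashes, which confirms that the range is closed at both ends.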
3. Using the re package in Python 3
1. compile
import re
p = re.compile('ab*', re.IGNORECASE)
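A small sketch of what the compiled pattern above does with and without the flag. The test strings are my own.

```python
import re

# Compile once, reuse many times; IGNORECASE makes 'a' and 'b' match
# in either case.
p = re.compile('ab*', re.IGNORECASE)

lower = p.match('abbb').group()
upper = p.match('ABBB').group()
mixed = p.match('aBbB').group()

# Without the flag, uppercase input does not match and match() gives None.
strict = re.compile('ab*').match('ABBB')

print(lower, upper, mixed, strict)
```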
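2. The Backslash Plague (haha, a plague...)
Backslashes really make your head spin, especially when the text you want to match itself starts with a backslash. The regex needs a double backslash to match one literal backslash, but in a regular Python string literal that is not enough: each of those backslashes has to be doubled again. The end result:
\section Text string to be matched
\\section Escaped backslash for re.compile()
"\\\\section" Escaped backslashes for a string literal
The cure for the backslash curse is the r prefix. For example, r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular string versus raw string:
"ab*" -> r"ab*"
"\\\\section" -> r"\\section"
"\\w+\\s+\\1" -> r"\w+\s+\1"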
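The table is easier to trust after checking it in code. A minimal sketch; the sample sentence is made up.

```python
import re

# "\n" is one character (a newline); r"\n" is two: a backslash and 'n'.
lengths = (len("\n"), len(r"\n"))

# Both literals below spell the same regex, \\section.
regular = "\\\\section"
raw = r"\\section"
same_pattern = regular == raw

# That regex matches the literal text \section.
text = r"see \section for details"
found = re.search(raw, text).group()

print(lengths, same_pattern, found)
```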
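3. The matching/searching methods
match() matches from the start and returns a match object, or None if there is no match. Determine if the RE matches at the beginning of the string.
search() returns a match object for the first location that matches, or None. Scan through a string, looking for any location where this RE matches.
findall() returns a list. Find all substrings where the RE matches, and returns them as a list.
finditer() returns an iterator. Find all substrings where the RE matches, and returns them as an iterator.
It looks like there is no ready-made function for my problem, so I will have to write it myself.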
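Here is how the four methods differ on the same input, as a rough sketch with a made-up string. The key point is that match() and search() return a match object or None, not a boolean.

```python
import re

p = re.compile(r"\d+")
text = "tel 555 ext 42"

at_start = p.match(text)        # None: the string starts with 't'
first = p.search(text)          # match object for the first number
all_numbers = p.findall(text)   # plain list of matched strings
spans = [m.span() for m in p.finditer(text)]  # one match object per hit

# A match object is truthy and None is falsy, so this idiom is common.
if first:
    print("found", first.group(), "at", first.start())
print(at_start, all_numbers, spans)
```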
4. The API in practice
>>> import re
>>> p = re.compile('[a-z]+')
>>> p.match("")
>>> print(p.match(""))
None
>>> m = p.match('tempo')
Methods of the match object:
group() Return the string matched by the RE
start() Return the starting position of the match
end() Return the ending position of the match
span() Return a tuple containing the (start, end) positions of the match
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
>>> p = re.compile(r'\d+')
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
5. Going further: the functions in the actual module, Module-Level Functions
- match(), search(), findall(), sub()
- Compilation Flags: so many, and so hard to remember
| Flag | Meaning |
| --- | --- |
| ASCII, A | Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property. |
| DOTALL, S | Make . match any character, including newlines. |
| IGNORECASE, I | Do case-insensitive matches. |
| LOCALE, L | Do a locale-aware match. |
| MULTILINE, M | Multi-line matching, affecting ^ and $. |
| VERBOSE, X (for 'extended') | Enable verbose REs, which can be organized more cleanly and understandably. |
Notes on each flag:
ASCII, A
Make \w, \W, \b, \B, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. This one makes a big difference.
DOTALL, S
Yes, with this flag the dot . also matches a newline, which it normally does not.
IGNORECASE, I
Ignores letter case; uppercase and lowercase both match.
LOCALE, L
Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale instead of the Unicode database.
MULTILINE, M
Multi-line matching, affecting ^ and $. ^ is normally used at the start and $ at the end; with this flag, ^ also matches at the start of every line and $ at the end of every line.
VERBOSE, X (for 'extended')
Enable verbose REs, which can be organized more cleanly and understandably.
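Since the flags are hard to remember, a before/after comparison helps me more than the table. This is only a sketch; the sample strings and the little date-like pattern are made up.

```python
import re

text = "first line\nSecond line"

# Without DOTALL the dot stops at the newline; with it, the dot crosses it.
plain_dot = re.match(r".*", text).group()
dotall = re.match(r".*", text, re.DOTALL).group()

# Without MULTILINE, ^ only matches at the start of the whole string.
plain_caret = re.findall(r"^\w+", text)
multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Flags combine with |. IGNORECASE lets 'second' match 'Second'.
combined = re.findall(r"^second", text, re.MULTILINE | re.IGNORECASE)

# ASCII restricts \w to [a-zA-Z0-9_], so the accented letter is excluded.
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", re.ASCII)

# VERBOSE ignores whitespace and # comments inside the pattern.
verbose = re.compile(r"""
    \d{4}   # four digits
    -       # a literal hyphen
    \d{2}   # two digits
""", re.VERBOSE)
verbose_match = verbose.match("2024-05").group()

print(plain_caret, multiline, combined, ascii_word, verbose_match)
```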
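5. Grouping: I found this one really useful
() groups part of a pattern, so a quantifier applies to the whole group; \1 refers back to whatever the first group matched.
>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)
>>> p = re.compile(r'\b(\w+)\s+\1\b')
>>> p.search('Paris in the the spring').group()
'the the'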
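Groups are also how you pull pieces out of a match. A minimal sketch; the email-like string and its pattern are hypothetical and far too simple for real addresses.

```python
import re

# Two groups, numbered by the order of their opening parentheses.
m = re.match(r"(\w+)@(\w+)\.com", "alice@example.com")
whole = m.group(0)    # group 0 is always the entire match
user = m.group(1)
domain = m.group(2)
both = m.groups()     # all captured groups as a tuple

# \1 must repeat exactly what group 1 captured, which finds doubled words.
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring")
word = doubled.group(1)

print(whole, user, domain, both, word)
```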
6. Lookahead Assertions
An explanation in Chinese: http://blog.51cto.com/cnn237111/749047
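Since I only left a link here, a short sketch of the two lookahead forms may help. (?=...) succeeds only if the sub-pattern matches next, (?!...) only if it does not, and neither consumes any text. The strings are made up; the filename pattern follows the example in the Python regex HOWTO.

```python
import re

text = "foobar foobaz"

# Positive lookahead: 'foo' only when 'bar' comes right after it.
# The match itself is still just 'foo'.
positive = [m.span() for m in re.finditer(r"foo(?=bar)", text)]

# Negative lookahead: 'foo' only when 'bar' does NOT come right after it.
negative = [m.span() for m in re.finditer(r"foo(?!bar)", text)]

# Keep filenames whose extension is not 'bat'.
filenames = ["autoexec.bat", "notes.txt", "sendmail.cf"]
not_bat = [f for f in filenames if re.match(r".*[.](?!bat$)[^.]*$", f)]

print(positive, negative, not_bat)
```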
7. Modifying Strings
- split()
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
- sub()
- subn()
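I did not write examples for sub() and subn() above, so here is a rough sketch. sub() returns the new string, and subn() returns the new string together with the number of replacements. The sample text is only for illustration.

```python
import re

p = re.compile(r"(blue|white|red)")
text = "blue socks and red shoes"

replaced = p.sub("colour", text)               # replace every match
first_only = p.sub("colour", text, count=1)    # replace at most one
with_count = p.subn("colour", text)            # (new_string, how_many)

# In the replacement string, \1 and \2 refer to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(replaced, first_only, with_count, swapped, sep="\n")
```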
8. Tips
Use String Methods
Reach for the plain string methods first; do not pull out the re API for every little thing.
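A sketch of where the line falls, assuming a made-up sentence: for a fixed string str.replace() is enough, and re only earns its keep once you need something like a word boundary.

```python
import re

text = "a word about swordfish"

# str.replace() is simple for fixed strings, but it also rewrites
# the 'word' inside 'swordfish'.
plain = text.replace("word", "deed")

# \b limits the replacement to whole words.
whole_words = re.sub(r"\bword\b", "deed", text)

print(plain)
print(whole_words)
```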
match() versus search()
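match() only succeeds if the pattern matches at the very beginning of the string;
search() succeeds if a match exists anywhere in the string.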
Using re.VERBOSE
With this flag we can spread a regexp over several lines with comments, which makes it simpler and easier to read.
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
No Pain No Gain!
http://www.runoob.com/python/python-reg-expressions.html
Note to self: next time I plan to write about yield, a blind spot I discovered recently.