Python basics: regular expressions

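A friend teased me about this, so I decided to face the problem instead of dodging it. I read the documentation carefully and recorded my notes below.
First: there is a great website for trying patterns online, with explanations: https://regexr.com/
Second: the Python 3 documentation is here:
The tutorial is https://docs.python.org/3/howto/regex.html
The full reference is https://docs.python.org/3/library/re.html#re-syntax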

1. Metacharacter basics

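  1. Metacharacters are reserved characters (they do not match themselves literally unless escaped)
    . ^ $ * + ? { } [ ] \ | ( )
  2. [] class
    [abc] will match any of the characters a, b, or c; note that it matches any single one of them
    [abc] = [a-c]
    So letters are usually matched with a range; uppercase letters are usually written as [A-Z]
    Metacharacters (except \) are not active inside classes: inside [] they lose their special function and act as ordinary characters
    ^ as the first character inside [] means complement: [^abc] matches any character except a, b, or c.
  3. \ backslash, the most important one; a lowercase escape matches a set of characters, and the uppercase version matches everything outside that set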
    \d
    Matches any decimal digit; this is equivalent to the class [0-9].
    \D
    Matches any non-digit character; this is equivalent to the class [^0-9].
    \s
    Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
    \S
    Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
    \w
    Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
    \W
    Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
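  4. . A dot matches any character except a newline; that is, it matches whatever character appears on the current line
  5. | or
    Alternation, or the “or” operator. Crow|Servo will match either ‘Crow’ or ‘Servo’, not ‘Cro’, a ‘w’ or an ‘S’, and ‘ervo’
  6. ^ matches at the beginning of the string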

    print(re.search('^From', 'From Here to Eternity'))
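To check my reading of the metacharacters above, here is a minimal sketch. The sample strings are my own and are not from the documentation.

```python
import re

# Character class: any single one of a, b, or c.
in_class = re.findall(r"[abc]", "a1b2c3d4")

# Negated class: any character except a, b, or c.
not_in_class = re.findall(r"[^abc]", "abcd!")

# \d matches digits, \D matches everything that is not a digit.
digits = re.findall(r"\d", "a1b2c3")
non_digits = re.findall(r"\D", "a1b2c3")

# Alternation: either whole word matches.
either = re.search(r"Crow|Servo", "Tom Servo").group()

# The dot skips the newline by default.
dot_chars = re.findall(r".", "a\nb")

# ^ anchors at the start, so 'From' in the middle is not found.
anchored = re.search(r"^From", "Reciting From Memory")

print(in_class, not_in_class, digits, non_digits, either, dot_chars, anchored)
```
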

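2. Repeating things: matching repeated characters

  1. Asterisk * (also a reserved character in Markdown, lol)
    For example, ca*t will match ‘ct’ (0 ‘a’ characters), ‘cat’ (1 ‘a’), ‘caaat’ (3 ‘a’ characters), and so forth.
    Example: a[bcd]*b can match these: abcb, acbb, acccccccbbbb
  2. Plus sign +
    Note the difference: the asterisk matches zero or more times, while the plus sign requires at least one occurrence, so + is stricter
  3. Question mark ?
    0 or 1 times; think of the question mark as asking "is it there?": if not, that is 0 times, and if so, exactly once
  4. Use {m,n} for precise counts (written without spaces); note that the interval is closed at both ends
    It means repeating at least m and at most n times; for example {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?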
    a/{1,3}b will match ‘a/b’, ‘a//b’, and ‘a///b’
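The four quantifiers can be compared side by side. This is a small sketch; the helper `full` is a name I made up here, and it uses `re.fullmatch` so that the whole string has to match.

```python
import re

def full(pattern, text):
    """Return True when the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

samples = ("ct", "cat", "caaat")

star = [full(r"ca*t", s) for s in samples]      # zero or more 'a'
plus = [full(r"ca+t", s) for s in samples]      # at least one 'a'
optional = [full(r"ca?t", s) for s in samples]  # zero or one 'a'

# {1,3} is a closed interval: one, two, or three slashes.
bounded = [full(r"a/{1,3}b", s) for s in ("ab", "a/b", "a///b", "a////b")]

print(star, plus, optional, bounded)
```
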

3. Using the Python 3 re package

1. compile()
import re
p = re.compile('ab*', re.IGNORECASE)
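2. The Backslash Plague (haha, a plague…)

The backslash really can make your head spin, especially when the string you want to match itself starts with a backslash. The regex needs a doubled backslash to match one literal backslash, and in an ordinary Python string literal each of those backslashes must be doubled again. The end result:
\section Text string to be matched
\\section Escaped backslash for re.compile()
"\\\\section" Escaped backslashes for a string literal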

The cure for the backslash curse is the r prefix (raw strings). For example, r"\n" is a two-character string containing ‘\’ and ‘n’, while "\n" is a one-character string containing a newline. Some regular strings and their raw-string equivalents:
"ab*" -> r"ab*"
"\\\\section" -> r"\\section"
"\\w+\\s+\\1" -> r"\w+\s+\1"
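A quick sketch to convince myself that the regular literal and the raw literal really are the same pattern. The sample text `\section{Intro}` is my own.

```python
import re

# The text to search contains exactly one literal backslash.
text = r"\section{Intro}"

plain = "\\\\section"   # regular string literal: four backslashes in the source
raw = r"\\section"      # raw string literal: two backslashes in the source

# Both literals produce the same 9-character pattern string.
same_pattern = plain == raw

# The pattern matches a single backslash followed by 'section'.
found = re.search(raw, text).group()

# r"\n" is two characters, "\n" is one newline character.
lengths = (len(r"\n"), len("\n"))

print(same_pattern, found, lengths)
```
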

3. The match/search functions

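match() matches from the start and returns a match object (or None). Determine if the RE matches at the beginning of the string.
search() returns a match object for the first match anywhere (or None). Scan through a string, looking for any location where this RE matches.
findall() returns a list. Find all substrings where the RE matches, and returns them as a list.
finditer() returns an iterator of match objects. Find all substrings where the RE matches, and returns them as an iterator.
It looks like there is no ready-made function for my problem, so I will have to write it myself.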
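The four functions differ mainly in what they return. This is a small sketch on a sample string of my choosing.

```python
import re

p = re.compile(r"[a-z]+")
text = "12 drummers drumming"

# match() only looks at the start of the string, which is a digit here.
at_start = p.match(text)

# search() scans forward and returns the first match object it finds.
anywhere = p.search(text)

# findall() returns the matched substrings as a list of strings.
words = p.findall(text)

# finditer() yields match objects, so positions are available too.
spans = [m.span() for m in p.finditer(text)]

print(at_start, anywhere.group(), words, spans)
```

Since match() and search() return None on failure, the usual idiom is to test the result with `if m:` before calling `group()`.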

4. The API in practice

>>> import re
>>> p = re.compile('[a-z]+')
>>> p.match("")
>>> print(p.match(""))
None
>>> m = p.match('tempo')

Methods of a match object:
group() Return the string matched by the RE
start() Return the starting position of the match
end() Return the ending position of the match
span() Return a tuple containing the (start, end) positions of the match

>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
>>> p = re.compile(r'\d+')  # switch to a digit pattern; the spans below are the positions of the numbers
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable_iterator object at 0x...>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)

5. Going a step further: the functions in the module itself, Module-Level Functions

  1. match(), search(), findall(), sub()
  2. Compilation Flags: there are many, and they are hard to remember
Flag Meaning

ASCII, A Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
Make \w, \W, \b, \B, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. This one has a big impact.
DOTALL, S Make . match any character, including newlines.
Yes, with this flag the dot . also matches a newline, which it does not by default.
IGNORECASE, I Do case-insensitive matches.
Ignores letter case, so uppercase and lowercase both match.
LOCALE, L Do a locale-aware match.
Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale instead of the Unicode database.
MULTILINE, M Multi-line matching, affecting ^ and $.
With multiple lines, ^ matches at the start of each line and $ matches at the end of each line.
VERBOSE, X (for ‘extended’) Enable verbose REs, which can be organized more cleanly and understandably.
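Since the flags are hard to remember, here is a small sketch that exercises four of them through the module-level functions. The test strings are my own; LOCALE is left out because it only applies to bytes patterns.

```python
import re

# IGNORECASE: 'ab*' also matches uppercase letters.
ignore_case = re.findall(r"ab*", "AbB ab", re.IGNORECASE)

# DOTALL: the dot normally refuses to match a newline.
default_dot = re.search(r"a.b", "a\nb")
dotall = re.search(r"a.b", "a\nb", re.DOTALL)

# MULTILINE: ^ matches at the start of every line, not just the string.
text = "From: a\nFrom: b"
single = re.findall(r"^From", text)
multi = re.findall(r"^From", text, re.MULTILINE)

# ASCII: \w stops treating non-ASCII letters as word characters.
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", re.ASCII)

print(ignore_case, default_dot, dotall.group() == "a\nb")
print(single, multi, unicode_words, ascii_words)
```

Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.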

5. Grouping - I found this really useful

()

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)

>>> p = re.compile(r'\b(\w+)\s+\1\b')
>>> p.search('Paris in the the spring').group()
'the the'
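Groups become more useful once you pull out the captured pieces. This is a minimal sketch; the named-group pattern and the name "Jane Doe" are made up for illustration.

```python
import re

# Numbered group: \1 refers back to whatever (\w+) captured.
m = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring")
whole = m.group(0)    # the entire match
word = m.group(1)     # only the first captured group
groups = m.groups()   # all captured groups as a tuple

# Named groups: (?P<name>...) lets you fetch a group by name.
n = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Jane Doe")
first = n.group("first")
as_dict = n.groupdict()

print(whole, word, groups, first, as_dict)
```
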

6. Lookahead Assertions

http://blog.51cto.com/cnn237111/749047 an explanation in Chinese
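As far as I understand it, a lookahead checks what comes next without consuming it. The sketch below uses a filename filter in the style of the official HOWTO, and a positive lookahead on a sample string of my own.

```python
import re

# Negative lookahead (?!...): accept file names whose extension is not 'bat'.
not_bat = re.compile(r".*[.](?!bat$)[^.]*$")
names = ["foo.bar", "autoexec.bat", "sendmail.cf", "printers.conf"]
kept = [name for name in names if not_bat.match(name)]

# Positive lookahead (?=...): match a word only when a colon follows,
# without including the colon in the match.
key = re.search(r"\w+(?=:)", "header: value").group()

print(kept, key)
```
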

7. Modifying Strings

  • split()
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
  • sub()
  • subn()
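The notes above only show split(), so here is a small sketch of sub() and subn(). The colour example follows the one in the official HOWTO; the swap example is my own.

```python
import re

p = re.compile(r"(blue|white|red)")
text = "blue socks and red shoes"

# sub() replaces every match and returns the new string.
all_replaced = p.sub("colour", text)

# count limits how many replacements are made.
first_only = p.sub("colour", text, count=1)

# subn() returns a tuple of (new string, number of replacements).
with_count = p.subn("colour", text)
no_match = p.subn("colour", "no colours at all")

# Backreferences in the replacement: swap two words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(all_replaced, first_only, with_count, no_match, swapped)
```
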

8. Tips

Use String Methods

Prefer the built-in string methods where they are enough; do not reach for the re API at every turn
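A rough illustration of where a string method is enough and where it is not. The swordfish example comes from the official HOWTO; the rest is my own sketch.

```python
import re

text = "swordfish and word"

# Fixed-string replacement: str.replace() is enough, but it also
# rewrites 'word' inside 'swordfish'.
replaced = text.replace("word", "deed")

# Replacing only the standalone word needs a regex with \b boundaries.
whole_word = re.sub(r"\bword\b", "deed", text)

# Simple prefix and membership tests do not need re at all.
starts = text.startswith("sword")
contains = "fish" in text

print(replaced, whole_word, starts, contains)
```
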

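match() succeeds only if the pattern matches at the very beginning of the string;
search() succeeds if the pattern matches anywhere in the string;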

Using re.VERBOSE

With this flag a regexp can be written across multiple lines with comments, which makes it simpler and easier to read

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

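To see the verbose pattern in action, here is a sketch that repeats the pattern so the block runs on its own and applies it to a header line I made up.

```python
import re

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value, non-greedy
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

m = pat.match("  Content-Type: text/html  ")
header = m.group("header")
value = m.group("value")

# The non-greedy value drops trailing whitespace, but the space right
# after the colon is still part of it, so strip() tidies that up.
clean_value = value.strip()

print(repr(header), repr(value), repr(clean_value))
```
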
No Pain No Gain!
http://www.runoob.com/python/python-reg-expressions.html

Note to self: next time I plan to write about yield, a blind spot I discovered recently.
