Python Regular Expressions Review 1

thomas-23

Article link: https://blog.csdn.net/u011546806/article/details/46620411

Metacharacters

. ^ $ * + ? { } [ ] \ | ( )

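Square brackets and special characters

[^...]: negation; matches any character not listed in the brackets
[$]: inside brackets, $ loses its metacharacter meaning
[|]: inside brackets, | loses its metacharacter meaning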
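A small sketch of the bracket rules above, assuming Python 3. The sample strings are made up for illustration.

```python
import re

# Inside a character class, $ and | are ordinary characters.
literal_hits = re.findall(r'[$|]', 'cost: $5 | $7')
print(literal_hits)   # ['$', '|', '$']

# A leading ^ negates the class: keep every character that is not a digit.
non_digits = re.findall(r'[^0-9]', 'a1b2')
print(non_digits)     # ['a', 'b']
```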

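\d [0-9]
\D [^0-9]
\s [ \t\n\r\f\v] matches any whitespace character
\S [^ \t\n\r\f\v] matches any non-whitespace character
\w [a-zA-Z0-9_] matches word characters, including the underscore
\W [^a-zA-Z0-9_] the complement of \w
Note: in Python 3 these equivalences are exact only with re.ASCII (or for bytes patterns); for str patterns the classes also match the corresponding Unicode characters.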
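The shorthand classes can be tried out as follows. This is a minimal sketch assuming Python 3; the sample strings are invented, and the last two lines show how re.ASCII narrows \w to the bracket form listed above.

```python
import re

sample = 'id_42 ok\t7'
digits = re.findall(r'\d+', sample)   # runs of digits
words = re.findall(r'\w+', sample)    # runs of word characters (underscore included)
gaps = re.findall(r'\s', sample)      # single whitespace characters
print(digits, words, gaps)

# With re.ASCII, \w is exactly [a-zA-Z0-9_], so the accented letter splits the word.
ascii_words = re.findall(r'\w+', 'na\u00efve', re.ASCII)
unicode_words = re.findall(r'\w+', 'na\u00efve')
print(ascii_words, unicode_words)
```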

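\A matches at the start of the string. Unlike ^, which in multiline mode (re.MULTILINE) matches at the start of every line, \A matches only at the very start of the string
\Z matches at the end of the string
\b word boundary
\B the complement of \b (a position that is not a word boundary)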
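The difference between ^ and \A only shows up with re.MULTILINE. The sketch below, assuming Python 3 and made-up sample text, illustrates that along with \b and \B.

```python
import re

text = 'first line\nsecond line'

caret_default = re.findall(r'^\w+', text)                   # start of string only
caret_multiline = re.findall(r'^\w+', text, re.MULTILINE)   # start of every line
string_start = re.findall(r'\A\w+', text, re.MULTILINE)     # \A ignores MULTILINE
print(caret_default, caret_multiline, string_start)

# \b requires a word boundary on each side, so the "cat" inside "concat" is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')
# \B requires the opposite: "cat" must not start at a word boundary.
inner = re.search(r'\Bcat', 'cat concat cat.')
print(whole_words, inner.start())
```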

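The character-class sequences (\d, \s, \w and their complements) can be used inside square brackets: [\s,\.] matches a whitespace character, a comma, or a period.

The dot matches any character except a newline; with re.DOTALL it matches every character, newlines included.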
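A quick way to see the effect of re.DOTALL, as a sketch assuming Python 3 with an invented two-line string:

```python
import re

text = 'a\nb'
default_match = re.search(r'a.b', text)             # the dot refuses the newline
dotall_match = re.search(r'a.b', text, re.DOTALL)   # the dot accepts the newline

print(default_match)                  # None
print(dotall_match.group() == text)   # True
```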

Meaning of the metacharacters

* repeat zero or more times (greedy)
+ one or more times
? zero or one time
{m,n} at least m and at most n times
() grouping

* {0,}
+ {1,}
? {0,1}
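The equivalences above can be checked mechanically. The following is a sketch assuming Python 3; full_matches and the sample list are hypothetical helpers introduced only for this illustration.

```python
import re

samples = ['a', 'ab', 'abb', 'abbbb']

def full_matches(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# Each shorthand quantifier next to its {m,n} spelling; the last column should be True.
for short, explicit in [('ab*', 'ab{0,}'), ('ab+', 'ab{1,}'), ('ab?', 'ab{0,1}')]:
    print(short, full_matches(short), full_matches(short) == full_matches(explicit))

# {2,3}: at least two and at most three b's.
bounded = full_matches('ab{2,3}')
print(bounded)
```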

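Modes

? When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible.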
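Greedy versus non-greedy matching is easiest to see on text with repeated delimiters. A minimal sketch, assuming Python 3 and an invented HTML-like string:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.*>', html)   # .* runs to the last '>' in the string
lazy = re.findall(r'<.*?>', html)    # .*? stops at the first '>' it can

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```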

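(pattern) Matches pattern and captures the match.

(?:pattern) Matches pattern without capturing the result; this is a non-capturing group, so the match is not stored for later use.

(?=pattern) Positive lookahead: succeeds at a position only if pattern matches starting there, without consuming any characters. It is non-capturing, so the match is not stored for later use.

(?!pattern) Negative lookahead, the opposite of (?=pattern): succeeds only if pattern does not match at that position.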
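The four group forms can be compared side by side. This is a sketch assuming Python 3; the sample strings are made up.

```python
import re

# Capturing groups are stored; non-capturing groups are matched but not stored.
captured = re.search(r'(\d+)-(\d+)', '10-20').groups()
partly_captured = re.search(r'(?:\d+)-(\d+)', '10-20').groups()
print(captured, partly_captured)

# Positive lookahead: a name counts only if ".py" follows; ".py" itself is not consumed.
py_names = re.findall(r'\w+(?=\.py)', 'a.py b.txt c.py')

# Negative lookahead: a whole number counts only if "%" does not follow.
plain_numbers = re.findall(r'\b\d+\b(?!%)', '50% 70')
print(py_names, plain_numbers)
```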

The re module

Compiling a pattern into a regular expression object lets you reuse it, which can make repeated matching more efficient
p = re.compile('ab*', re.IGNORECASE)
p = re.compile('ab*')
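To show what the re.IGNORECASE flag changes for the two compiled patterns above, here is a short sketch assuming Python 3. The variable names and the sample string are illustrative only.

```python
import re

insensitive = re.compile('ab*', re.IGNORECASE)
sensitive = re.compile('ab*')

sample = 'AB abb Abbb a'
insensitive_hits = insensitive.findall(sample)   # upper and lower case both match
sensitive_hits = sensitive.findall(sample)       # only lowercase 'a' can start a match
print(insensitive_hits)   # ['AB', 'abb', 'Abbb', 'a']
print(sensitive_hits)     # ['abb', 'a']
```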

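A regular expression object has 4 main matching methods
match()     matches at the start of the string
search()    scans the whole string for the first match
The difference between match and search:
match only matches at the start of the string, while search looks for a match anywhere in the string,
e.g. re.match('foo', 'seafood') fails, while re.search('foo', 'seafood') succeeds

findall()   finds all matches and returns them as a list
finditer()  finds all matches and returns them as an iterator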
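The 'seafood' comparison described above can be run directly. A minimal sketch assuming Python 3; the compiled pattern 'o' is an extra illustration not taken from the text.

```python
import re

failed = re.match('foo', 'seafood')     # 'foo' is not at the start, so this is None
found = re.search('foo', 'seafood')     # search may match anywhere in the string
anchored = re.match('sea', 'seafood')   # match succeeds when the pattern starts the string
print(failed, found.span(), anchored.group())

pattern = re.compile('o')
all_hits = pattern.findall('seafood')                         # list of strings
positions = [m.start() for m in pattern.finditer('seafood')]  # iterator of match objects
print(all_hits, positions)
```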

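A match object has 4 main methods
Method/Attribute    Purpose
group() returns the matched string
start() returns the start position of the match
end()   returns the end position of the match
span()  returns the pair (start position, end position)

group(0) the entire match; this is the default
group('name') returns the group with that name; groups can be named with (?P<name>...)
groups() returns a tuple of all captured groups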
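The match object methods listed above can be exercised on one match. This is a sketch assuming Python 3; the group names year and month and the sample string are hypothetical.

```python
import re

m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2015-06')

print(m.group())                    # whole match; same as m.group(0)
print(m.group('year'), m.group(2))  # a group by name, and a group by number
print(m.groups())                   # tuple of all captured groups
print(m.start(), m.end(), m.span()) # positions of the whole match
```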

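Note: with re.VERBOSE, whitespace and # comments inside the pattern are ignored, so you can annotate a regular expression and make it easier to understand when you come back to it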

    charref = re.compile(r"""
     &[#]                # Start of a numeric entity reference
     (
         0[0-7]+         # Octal form
       | [0-9]+          # Decimal form
       | x[0-9a-fA-F]+   # Hexadecimal form
     )
     ;                   # Trailing semicolon
    """, re.VERBOSE)

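Methods that modify strings

split() splits the string at each match and returns a list
sub()   finds all matches and replaces them
subn()  same as sub(), but returns a tuple of the new string and the number of replacements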
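The three methods can be compared on small inputs. A minimal sketch assuming Python 3, with invented sample strings:

```python
import re

# split(): cut the string wherever the pattern matches.
parts = re.split(r',\s*', 'a, b,c')

# sub(): replace every match; subn(): the same, plus the number of replacements.
replaced = re.sub(r'\d+', '#', 'room 12, floor 3')
replaced_with_count = re.subn(r'\d+', '#', 'room 12, floor 3')

print(parts)                 # ['a', 'b', 'c']
print(replaced)              # room #, floor #
print(replaced_with_count)   # ('room #, floor #', 2)
```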

Examples

Locating the matched text with a match object

    import re

    pattern = 'this'
    text = 'Does this text match the pattern?'
    match = re.search(pattern, text)
    s = match.start()
    e = match.end()

    print('Found "%s" in "%s" from %d to %d ("%s")' % (match.re.pattern, match.string, s, e, text[s:e]))

Result
Found "this" in "Does this text match the pattern?" from 5 to 9 ("this")

Regular expression objects

    # compile() turns a pattern string into a RegexObject, i.e. a regular expression object.
    # A compiled object skips the module's cache lookup, and compiling when the module loads
    # moves the compilation cost to import time instead of the moment the program must respond.
    regexes = [re.compile(p) for p in ['this', 'that']]
    text = 'Does this text match the pattern?'
    print('Text: %r\n' % text)
    for regex in regexes:
        print('Seeking "%s" ->' % regex.pattern, end=' ')
        # look for the pattern in the text with search()
        if regex.search(text):
            print('match!')
        else:
            print('no match')

Result
Text: 'Does this text match the pattern?'

Seeking "this" -> match!
Seeking "that" -> no match

findall

    # multiple matches with findall
    text = 'abbaaabbbbaaaaa'
    pattern = 'ab'
    for match in re.findall(pattern, text):
        print('Found "%s"' % match)

Result
Found "ab"
Found "ab"

finditer

    # finditer yields Match instances, whereas findall returns strings
    for match in re.finditer(pattern, text):
        s = match.start()
        e = match.end()
        print('Found "%s" at %d:%d' % (text[s:e], s, e))

Result
Found "ab" at 0:2
Found "ab" at 5:7

group

    import re

    TEST1 = '''
    test.com/geo/search.php?lang=zh-Hans&reallogin=0
    '''

    TEST2 = '''
    /geo/search.php?lang=zh-Hans&reallogin=0
    '''

    TEST3 = '''
    https://test.com/geo/search.php?lang=zh-Hans&reallogin=0
    '''

    # A raw string keeps backslashes such as \w and \. intact for the regex engine
    URIREX = re.compile(r'''
       (?P<proto> # protocol, http:// or https://; optional
       ((http)s?://)?)
       (?P<domain> # domain name, e.g. api.jiayuan.com; each label is at most 63 characters; optional
       ((\w\w{0,61}\w\.)+\w{2,})?)
       (?P<uri> # URI path, e.g. /impressionSearchAndriod.php; required
       (/[\w,\.,-]{2,})+)
       (?P<param> # query parameters, e.g. ?lang=zh-Hans&reallogin=0&ver=5.3
       (\?\S+=\S+&?)+)
    ''',  re.VERBOSE)

    # print the value of each named group
    for temp in (TEST1, TEST2, TEST3):
        print('-'*50)
        p = URIREX.search(temp)

        if p:
            print(p.group('proto'))
            print(p.group('domain'))
            print(p.group('uri'))
            print(p.group('param'))

Result (an empty line means that the optional group matched the empty string)
--------------------------------------------------

test.com
/geo/search.php
?lang=zh-Hans&reallogin=0
--------------------------------------------------


/geo/search.php
?lang=zh-Hans&reallogin=0
--------------------------------------------------
https://
test.com
/geo/search.php
?lang=zh-Hans&reallogin=0
