Regular Expressions in Python

weixin_34194551

Original link: http://www.cnblogs.com/lazy0/p/5720517.html

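We have already worked out how to fetch a page's content, but one step remains: how do we extract and tidy up the text we want when it is buried in so much messy markup? This is where a very powerful tool comes in: regular expressions!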

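### What is a regular expression?

A regular expression is a kind of logical formula for operating on strings. It is built from a set of predefined special characters and combinations of them, so that a single pattern string can describe and match a whole family of strings that follow some syntactic rule.

Put simply, it is a "rule string" for searching, matching and processing text, and that rule string expresses a filtering logic over strings.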

### Why do we need regular expressions?

We want to pull out just the content we care about from the page content that was returned.

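### How does a regular expression match?

    1. Characters are taken from the expression and from the text in turn and compared.
    2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
    3. If the expression contains quantifiers or boundaries, the process differs slightly.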
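The three steps above can be sketched with a few calls to the standard `re` module. The sample strings and variable names below are made up for illustration; this is a rough demonstration of the observable behavior, not a description of how the engine is implemented internally.

```python
import re

# Steps 1-2: a literal pattern is compared character by character.
literal = re.match(r'cat', 'catalog')      # 'c', 'a', 't' all match
mismatch = re.match(r'cat', 'cot')         # fails at the second character

# Step 3: a quantifier lets one pattern element consume several characters.
with_quantifier = re.match(r'ca+t', 'caaat')

# Step 3: a boundary matches a position between characters, not a character.
with_boundary = re.search(r'\bcat\b', 'a cat sat')
no_boundary = re.search(r'\bcat\b', 'concatenate')

print(literal.group())          # cat
print(mismatch)                 # None
print(with_quantifier.group())  # caaat
print(with_boundary.span())     # (2, 5)
print(no_boundary)              # None
```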

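### Basic regular expression syntax

    .      matches any character except a newline
    ^      matches the start of the string
    $      matches the end of the string
    []     matches any one character from the specified character class
    ?      repeats the preceding character 0 or 1 times
    *      repeats the preceding character 0 or more times
    {m}    repeats the preceding character exactly m times
    {m,n}  repeats the preceding character m to n times
    \d     matches a digit, equivalent to [0-9]
    \D     matches any non-digit character, equivalent to [^0-9]
    \s     matches any whitespace character, equivalent to [ \t\n\r\f\v]
    \S     matches any non-whitespace character, equivalent to [^ \t\n\r\f\v]
    \w     matches any letter, digit or underscore, equivalent to [a-zA-Z0-9_]
    \W     matches any character that is not a letter, digit or underscore, equivalent to [^a-zA-Z0-9_]

The bracketed equivalents hold for ASCII text. In Python 3, \d, \s and \w in str patterns also match Unicode digits, whitespace and word characters unless the re.ASCII flag is set.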
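To make the table concrete, here is a small sketch that exercises several of the entries with `re.findall`, `re.match` and `re.search`. The sample text and the variable names are invented for this illustration.

```python
import re

# Made-up sample text for illustration.
text = 'Order 66: 3 apples, 12 pears\n'

digits = re.findall(r'\d', text)                   # every single digit
chunks = re.findall(r'\S+', text)                  # runs of non-whitespace characters
vowels = re.findall(r'[aeiou]', 'pears')           # one character from a class
optional = re.findall(r'colou?r', 'color colour')  # ? makes the u optional
starts = re.match(r'^Order', text) is not None     # ^ anchors at the start
ends = re.search(r'pears$', text) is not None      # $ also matches just before a final newline

print(digits)    # ['6', '6', '3', '1', '2']
print(chunks)    # ['Order', '66:', '3', 'apples,', '12', 'pears']
print(vowels)    # ['e', 'a']
print(optional)  # ['color', 'colour']
print(starts, ends)  # True True
```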
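### Some points worth remembering

    \b            matches the start or end of a word
    \d+           matches one or more consecutive digits
    \d*           matches digits repeated any number of times (possibly zero)
    ^             matches the start of the string being searched
    $             matches the end
    ^\d{5,12}$    matches a string of 5 to 12 digits
    ^\s{5,12}$    matches a string of 5 to 12 whitespace characters (use \S for non-whitespace characters)
    ^\w{5,12}$    matches a string of 5 to 12 letters, digits or underscores (and, with Unicode matching, Chinese characters)
    \W \D \B \S   mean exactly the opposite of \w \d \b \s
    [^x]          matches any character except x
    [^aed]        matches any character except a, e or d

    To match a literal . * or \ use \. \* \\
    *      0 or more times
    +      1 or more times
    ?      0 or 1 time
    {n}    exactly n times
    {n,}   n or more times
    {n,m}  n to m times

    \d --> [0-9]
    \w --> [a-zA-Z0-9_]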
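The quantifier forms and the escaping rule listed above can be checked with a short sketch. The sample strings and variable names are illustrative assumptions; `re.escape` is the standard-library helper that produces the escaped form of a literal string.

```python
import re

sample = 'a aa aaa aaaa'

exactly_two = re.findall(r'\ba{2}\b', sample)     # {n}: exactly n times
two_or_more = re.findall(r'\ba{2,}\b', sample)    # {n,}: n or more times
two_to_three = re.findall(r'\ba{2,3}\b', sample)  # {n,m}: n to m times

# An unescaped dot matches any character; an escaped dot matches only '.'.
any_char = re.findall(r'1.2', '1.2 1x2')
literal_dot = re.findall(r'1\.2', '1.2 1x2')

# re.escape adds the backslashes for you.
escaped = re.escape('1.2*')

print(exactly_two)   # ['aa']
print(two_or_more)   # ['aa', 'aaa', 'aaaa']
print(two_to_three)  # ['aa', 'aaa']
print(any_char)      # ['1.2', '1x2']
print(literal_dot)   # ['1.2']
print(escaped)       # 1\.2\*
```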

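### A small example

    \(?0\d{2}\)?[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a 3-digit area code.
    The area code may or may not be wrapped in parentheses, and the area code and the local
    number may be separated by a hyphen or a space, or by nothing at all.
    Note that because each parenthesis is optional on its own, an unbalanced form such as
    (010 12345678 also matches. As an exercise, try using alternation (branch conditions)
    to extend this expression so that it also supports 4-digit area codes.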
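One possible way to do the 4-digit extension with alternation is sketched below. It is only an illustration under stated assumptions: a 4-digit area code is assumed to be followed by a 7-digit local number, the parentheses are required to be balanced, and the candidate numbers and the names `phone`, `candidates` and `valid` are made up.

```python
import re

# Four branches: 3-digit area code with/without parentheses,
# then 4-digit area code with/without parentheses.
# Assumption: 4-digit area codes go with 7-digit local numbers.
phone = re.compile(
    r'\(0\d{2}\)[- ]?\d{8}'
    r'|0\d{2}[- ]?\d{8}'
    r'|\(0\d{3}\)[- ]?\d{7}'
    r'|0\d{3}[- ]?\d{7}'
)

candidates = [
    '(010)88886666',
    '022-22334455',
    '029 12345678',
    '(0376) 2233445',
    '0376-2233445',
    '(010 12345678',   # unbalanced parenthesis, should be rejected
]

# fullmatch requires the whole string to match, not just a prefix.
valid = [c for c in candidates if phone.fullmatch(c)]
print(valid)
```

Using `fullmatch` instead of `search` keeps a valid number embedded in a longer, malformed string from being accepted.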

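### Commonly used regex functions

    import re
    print(re.match('www', 'www.baidu.com').span())    # matches at the start: (0, 3)
    print(re.search('com', 'www.baidu.com').span())   # matches away from the start: (10, 13)

re.match only matches at the beginning of the string; if the beginning does not fit the pattern, the match fails and the function returns None. re.search scans the whole string until it finds a match.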
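Because both functions return None when nothing matches, calling `.span()` directly on the result raises an AttributeError in that case. A minimal sketch of a guard follows; the helper name `first_span` and its `anchored` parameter are hypothetical, introduced only for this example.

```python
import re

def first_span(pattern, text, anchored=False):
    """Return the span of the first match, or None when there is no match."""
    finder = re.match if anchored else re.search
    m = finder(pattern, text)
    return m.span() if m else None

at_start = first_span('www', 'www.baidu.com', anchored=True)
not_at_start = first_span('com', 'www.baidu.com', anchored=True)
anywhere = first_span('com', 'www.baidu.com')

print(at_start)      # (0, 3)
print(not_at_start)  # None, because 'com' is not at the beginning
print(anywhere)      # (10, 13)
```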

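Searching with regular expressions in Python

The re module provides several functions for searching an input string. The ones discussed here are:

    re.match()
    re.search()
    re.findall()

match matches at the start of the string; search matches anywhere in the string.

    >>> import re
    >>> match = re.search(r'dog', 'dog cat dog')
    >>> match.group(0)
    'dog'

The search function I use most in Python is findall(), which returns all non-overlapping matches as a list:

    >>> re.findall(r'dog', 'dog cat dog')
    ['dog', 'dog']
    >>> re.findall(r'cat', 'dog cat dog')
    ['cat']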


    >>> contactInfo = 'Doe, John: 555-1212'
    >>> match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)
    >>> match.group(0)
    'Doe, John: 555-1212'
    >>> match.group(1)
    'Doe'
    >>> match.group(2)
    'John'
    >>> match.group(3)
    '555-1212'
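Numbered groups get hard to read as a pattern grows. As a sketch of an alternative, the same contact string can be parsed with named groups; the group names `last`, `first` and `phone` are chosen here for illustration and are not part of the original example.

```python
import re

contact_info = 'Doe, John: 555-1212'

# (?P<name>...) gives each group a name in addition to its number.
m = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contact_info)

first_name = m.group('first')
all_groups = m.groups()
by_name = m.groupdict()

print(first_name)  # John
print(all_groups)  # ('Doe', 'John', '555-1212')
print(by_name)     # {'last': 'Doe', 'first': 'John', 'phone': '555-1212'}
```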

### Practical examples

Example 1:

    import re
    pattern = re.compile('hello')
    match = pattern.match('hello world')
    print(match.group())    # hello

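Example 2:

    import re
    match = re.findall('hello', 'hello world')
    print(match)    # ['hello']

The re module provides regular expression support:
a pattern given as a string is compiled into a Pattern instance;
the Pattern instance is then used to process text and obtain the match results.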
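The compile-then-use workflow pays off when one pattern is applied to many strings. The sketch below assumes a made-up list of input lines and a hypothetical pattern name `word_pattern`; `re.compile`, `Pattern.search` and `Pattern.findall` are standard-library calls.

```python
import re

# Compile once; flags are passed at compile time.
word_pattern = re.compile(r'[a-z]+', re.IGNORECASE)

lines = ['Hello world', '123', 'Regex 101']

# Reuse the compiled Pattern instance for every line.
first_words = []
for line in lines:
    m = word_pattern.search(line)
    first_words.append(m.group() if m else None)

all_words = word_pattern.findall('Hello world')

print(first_words)  # ['Hello', None, 'Regex']
print(all_words)    # ['Hello', 'world']
```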

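Example 3:

    import re
    word = 'http://www.baidu.com python_1.2'
    key = re.findall(r'h.', word)
    print(key)    # ['ht', 'ho']
    # . matches any single character

Example 4:

    import re
    word = 'http://www.baidu.com python_1.2'
    key = re.findall(r'\.', word)
    print(key)    # ['.', '.', '.']
    # \. matches the escaped (literal) dot

Example 5:

    import re
    word = 'http://www.baidu.com python_1.2'
    key = re.findall(r'\d\.\d', word)
    print(key)    # ['1.2']
    # matches a digit, a literal dot, and another digit

Example 6:

    import re
    word = 'httphttp://www.baidu.com python_1.2'
    key = re.findall(r'http*', word)
    print(key)    # ['http', 'http']
    # * applies only to the preceding p: matches htt followed by zero or more p

Example 7:

    import re
    word = 'httphttp://www.baidu.com python_1.2'
    key = re.findall(r't{2}', word)
    print(key)    # ['tt', 'tt']
    # matches every run of exactly two t characters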

Example 8:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    html = '''
        <div class="one"><div class="aaa" title="白帽子" onclick=......
        '''
    # the non-greedy (.*?) captures the value of the title attribute
    title = re.findall(r'<div class="aaa" title="(.*?)" onclick', html)

    for i in title:
        print(i)    # 白帽子
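The `?` after `.*` in the example above is what keeps the capture short. The difference only shows up when the page contains more than one matching tag, so here is a sketch with a made-up two-tag `html` string; the attribute values are invented for illustration.

```python
import re

# Hypothetical markup with two similar tags on one line.
html = ('<div class="aaa" title="first" onclick="a()">'
        '<div class="aaa" title="second" onclick="b()">')

# Greedy: .* runs to the last '" onclick' and swallows everything in between.
greedy = re.findall(r'title="(.*)" onclick', html)

# Non-greedy: .*? stops at the first '" onclick' after each title.
lazy = re.findall(r'title="(.*?)" onclick', html)

print(len(greedy))  # 1
print(lazy)         # ['first', 'second']
```

For anything beyond quick extraction from markup whose shape is known in advance, an HTML parser is usually the more robust choice.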

### Some code exercising the syntax above

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    pattern = re.compile(r'hello')
    match = pattern.match('hello world')
    if match:
        print(match.group())    # hello

    m = re.match(r'aaa', 'aaaaaaa efe')
    print(m.group())            # aaa


    import re
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

    # attributes of the match object
    print("m.string:", m.string)          # hello world!
    print("m.re:", m.re)
    print("m.pos:", m.pos)                # 0
    print("m.endpos:", m.endpos)          # 12
    print("m.lastindex:", m.lastindex)    # 3
    print("m.lastgroup:", m.lastgroup)    # sign

    # methods of the match object
    print("m.group(1,2):", m.group(1, 2))     # ('hello', 'world')
    print("m.groups():", m.groups())          # ('hello', 'world', '!')
    print("m.groupdict():", m.groupdict())    # {'sign': '!'}
    print("m.start(2):", m.start(2))          # 6
    print("m.end(2):", m.end(2))              # 11
    print("m.span(2):", m.span(2))            # (6, 11)
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))    # world hello!


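    p = re.compile(r'=')    # split on a special character
    a = p.split('cookie=fwefwvb,password=fwefwefwefw')
    print(a)       # ['cookie', 'fwefwvb,password', 'fwefwefwefw']


    p = re.compile(r'\d+')  # split on runs of digits
    a = p.split('dwedw1fwefwvb2fwefwe4fwefw')
    print(a)       # ['dwedw', 'fwefwvb', 'fwefwe', 'fwefw']
    print(a[3])    # fwefw
    b = p.findall('dwedw1fwefwvb2fwefwe4fwefw')
    print(b)       # ['1', '2', '4']
    print(b[1])    # 2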


    #!/usr/bin/python
    import re

    line = "Cats are smarter than dogs"

    # (.*) is greedy, (.*?) is non-greedy; re.M and re.I are multi-line and ignore-case flags
    matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

    if matchObj:
        print("matchObj.group() : ", matchObj.group())      # Cats are smarter than dogs
        print("matchObj.group(1) : ", matchObj.group(1))    # Cats
        print("matchObj.group(2) : ", matchObj.group(2))    # smarter
    else:
        print("No match!!")

    #!/usr/bin/python
    import re

    phone = "2004-959-559 # This is Phone Number"

    # Delete Python-style comments
    num = re.sub(r'#.*$', "", phone)
    print("Phone Num : ", num)

    # Remove anything other than digits
    num = re.sub(r'\D', "", phone)
    print("Phone Num : ", num)

    '''
    f=open('module/test.txt','r')
    for line in f.readlines():
        payload=line.strip()
        #print(type(payload))
        if len(payload)!=0:
            re_telepone=re.match(r'^(\d{3})-(\d{3,20})$', payload)
            print(re_telepone.group(2))
            p=open('module/test2.txt','w')
            p.write(re_telepone.group(2))
        else:
            break
    f.close()
    p.close()
    '''

    '''
    test='010-12345'
    if re.match(r'\d{3}-\d{3,8}$',test):
        print('ok')
    else:
        print('fail')

    a=re.split(r'[\s\,]+','a,b,ccc   dd')
    print(a)
    a=re.split(r'[\s\,\;]+','a,b,ccc;;;   dd')
    print(a)
    reg='Cookie=aaaaa;falg=ddddd'
    a=re.split(r'[\s\,\;]+',reg)
    print(a[1])
    a[1]=re.split(r'[\s\,\=]',a[1])
    print(a[1])
    print(a[1][1])

    reg='Cookie=aaaaa;falg=ddddd'
    a=re.split(r'[\s\,\;,\=]+',reg)
    print(a[3])

    print(re.match(r'^(\d+)(0*)$','12300').groups())    # greedy: ('12300', '')

    aa=re.match(r'^(\d+?)(0*)$','12300').groups()       # non-greedy: ('123', '00')
    print(aa)
    print(aa[1])
    '''

### More to be added later!