17-python_Regular Expressions

forwardNow

Article link: https://blog.csdn.net/wuqinfei_cs/article/details/12313247

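Regular Expressions
- Import the re module
- Define a pattern as a raw string: patternName = r"abc..."

1. Concept

- A regular expression (RE) is a small, highly specialized language
- It is embedded in Python and made available through the re module

2. Purpose

Processing strings:
- Matching
- Replacing
- Splitting

3. Character matching

- Ordinary characters match themselves
- Metacharacters have a special meaning:
. ^ $ * + ? {} [] \ | ()

# Matching ordinary characters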
>>> import re
>>> pattern = r"ab"
>>> re.findall(pattern, "123abc")
['ab']
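
To match one of the metacharacters literally, it has to be escaped. The following is a minimal sketch in Python 3 syntax (the transcripts in this article come from Python 2.7); the sample text is made up for illustration.

```python
import re

text = "a.b axb"

# An unescaped '.' is a metacharacter, so it also matches the 'x'.
loose = re.findall(r"a.b", text)

# Escaping by hand, or with re.escape(), matches the literal dot only.
by_hand = re.findall(r"a\.b", text)
escaped = re.findall(re.escape("a.b"), text)

print(loose, by_hand, escaped)
```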

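4. Metacharacters

4.0 .
- Matches any single character except a newline (the re.S flag lifts that restriction)

4.1 []
- Matches exactly one character from the listed set
- Commonly used to specify a character set: [abc], [0-9], [a-zA-Z]
- Most metacharacters are treated as ordinary characters inside a set: [abc$]
  (the exceptions are \, ], a leading ^, and a - between two characters)
- Complement set: [^a-z]

>>> import re
# Character set
>>> pattern = "[a-z]"
>>> re.findall(pattern, "abc")
['a', 'b', 'c']
# Complement set
>>> pattern = "[^a-z]"
>>> re.findall(pattern, "abc")
[]
# Special characters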
>>> pattern = "[a^$]"
>>> re.findall(pattern, "abc^$")
['a', '^', '$']
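
Whether ^ and - are special inside a set depends on where they stand. A small sketch of that rule, in Python 3 syntax and with an invented sample string:

```python
import re

text = "ab1-^"

# '^' negates only when it is the first character of the set.
negated = re.findall(r"[^a-z]", text)       # everything that is not a-z
literal_caret = re.findall(r"[a^]", text)   # 'a' or a literal '^'

# '-' forms a range between two characters, but is literal at the edge.
literal_dash = re.findall(r"[a-]", text)    # 'a' or a literal '-'

print(negated, literal_caret, literal_dash)
```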

4.2 ^
- Matches the start of the string (with re.M, also the start of each line)
>>> pattern = "^a"
>>> re.findall(pattern, "baaa")
[]
>>> re.findall(pattern, "abbb")
['a']

4.3 $
- Matches the end of the string, or just before a trailing newline (with re.M, also the end of each line)

>>> pattern = "a$"
>>> re.findall(pattern, "aaab")
[]
>>> re.findall(pattern, "bbba")
['a']
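
By default ^ and $ anchor to the whole string, and the re.M flag (section 8) extends them to every line. A short sketch in Python 3 syntax; the two-line sample text is an assumption made for the example.

```python
import re

text = "ab\nab\n"

start_default = re.findall(r"^a", text)          # start of the string only
start_multiline = re.findall(r"^a", text, re.M)  # start of every line
end_default = re.findall(r"b$", text)            # end of string, before the final newline
end_multiline = re.findall(r"b$", text, re.M)    # end of every line

print(start_default, start_multiline, end_default, end_multiline)
```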

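4.4 \ - escape character
- Cancels the special meaning of a metacharacter so that it is treated as an ordinary character
- Special sequences (equivalents shown for ASCII text)
- \d <==> [0-9] , matches a decimal digit
- \D <==> [^0-9], matches any non-digit character
- \s <==> [ \t\n\r\f\v] , matches a whitespace character
- \S <==> [^ \t\n\r\f\v], matches any non-whitespace character
- \w <==> [a-zA-Z0-9_], matches a letter, digit, or underscore
- \W <==> [^a-zA-Z0-9_], matches any other character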
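
The special sequences can be tried out as follows. This is an illustrative sketch in Python 3 syntax; the sample string is invented.

```python
import re

text = "id_42 = 3.14\tok"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits, underscores
spaces = re.findall(r"\s", text)    # individual whitespace characters
dots = re.findall(r"\.", text)      # escaped '.', matches a literal dot only

print(digits, words, spaces, dots)
```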

4.5 Repetition

4.5.1 *
- Repeat count: [0, +infinity)

>>> pattern = r"ab*"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abb")
['abb']
>>> re.findall(pattern, "abbbbbbbbbb")
['abbbbbbbbbb']

4.5.2 +
- Repeat count: [1, +infinity)

>>> pattern = r"ab+"
>>> re.findall(pattern, "a")
[]
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbbb")
['abbbbbb']

4.5.3 ?
- Repeat count: [0, 1], i.e. present or absent

>>> pattern = r"ab?"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbb")
['ab']

4.5.4 {m,n}
- {m,n} repeat count: [m, n]
- {m} repeat count: exactly m
- {m,} repeat count: [m, +infinity)
- m defaults to 0 when omitted: {,n} means [0, n]

>>> pattern = r"\d{1,3}"
>>> re.findall(pattern, "1234")
['123', '4']
>>> pattern = r"\d{1,}"
>>> re.findall(pattern, "1234")
['1234']
>>> pattern = r"\d{1}"
>>> re.findall(pattern, "1234")
['1', '2', '3', '4']
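
The shorthand quantifiers *, + and ? are special cases of the brace form. The sketch below (Python 3 syntax, made-up sample text) checks that equivalence and also shows {,n} with m omitted.

```python
import re

text = "a ab abb abbb"

star = re.findall(r"ab*", text)            # same as ab{0,}
plus = re.findall(r"ab+", text)            # same as ab{1,}
optional = re.findall(r"ab?", text)        # same as ab{0,1}
at_most_two = re.findall(r"ab{,2}", text)  # m omitted, so m = 0

same_as_braces = (
    star == re.findall(r"ab{0,}", text)
    and plus == re.findall(r"ab{1,}", text)
    and optional == re.findall(r"ab{0,1}", text)
)

print(star, plus, optional, at_most_two, same_as_braces)
```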

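5. Compiling regular expressions

5.1 Compiling
- The re module provides an interface to the regular expression engine
  that compiles a pattern string into a pattern object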

>>> import re
>>> telPatternString = r"\d{3}"
>>> telPattern = re.compile(telPatternString)
>>> telPattern
<_sre.SRE_Pattern object at 0x01806170>
>>> telPattern.findall("1")
[]
>>> telPattern.findall("123")
['123']
>>> telPattern.findall("1234")
['123']
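
A compiled pattern is typically created once and then reused across many strings. A minimal sketch in Python 3 syntax; the list of lines is hypothetical.

```python
import re

tel_pattern = re.compile(r"\d{3}")

# Hypothetical input lines, used only for illustration.
lines = ["tel 123", "no digits", "4567 89"]

# The same pattern object is applied to every line.
found = [tel_pattern.findall(line) for line in lines]

print(tel_pattern.pattern)  # the original pattern string
print(found)
```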

5.2 Passing flags when compiling
- Ignore case

>>> import re
>>> namePatternString = r"[a-z]{3}"
>>> namePattern = re.compile( namePatternString, re.IGNORECASE )
>>> namePattern.findall("abc")
['abc']
>>> namePattern.findall("abC")
['abC']

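5.3 The backslash problem
- Prefix the string with "r" (a raw string) and Python leaves the backslashes untouched, so the regex engine receives them exactly as written
- Without "r", "\\" is a single backslash, which the regex engine rejects as an incomplete escape

>>> pattern = r"\\"
>>> re.findall(pattern, r"c:\dirA")
['\\']
>>> pattern = "\\"
>>> re.findall(pattern, r"c:\dirA")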
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python\install-2.7\lib\re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
File "D:\Python\install-2.7\lib\re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
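
The difference between the two spellings can be checked directly. The sketch below is in Python 3 syntax, where the exception is exposed as re.error and the message text differs from the Python 2.7 traceback above.

```python
import re

raw = r"\\"      # two characters: a regex that matches one literal backslash
plain = "\\\\"   # the same two characters written without the r prefix

path = "c:\\dirA"  # the string c:\dirA
found = re.findall(raw, path)

# "\\" without r is a single backslash, which is an incomplete regex escape.
try:
    re.compile("\\")
    compiled = True
except re.error:
    compiled = False

print(raw == plain, len(raw), found, compiled)
```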

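6. Some methods of the Regex object

- match() matches only at the start of the string; returns None on failure
- search() finds the first match at any position in the string; returns None on failure
- findall() returns all non-overlapping matches as a list
- finditer() returns an iterator over Match objects, one per match
- sub() replaces matches and returns the new string
- subn() replaces matches and returns (new string, number of replacements)
- split() splits the string at each match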

6.1 match()
>>> import re
>>> pattern = r"a"
>>> re.match(pattern, "abc")
<_sre.SRE_Match object at 0x01407FA8>
>>> re.match(pattern, "ba")
>>> re.match(pattern, "bac")
>>>

6.2 search()
>>> import re
>>> pattern = r"a"
>>> re.search(pattern, "abc")
<_sre.SRE_Match object at 0x01419330>
>>> re.search(pattern, "bac")
<_sre.SRE_Match object at 0x01407FA8>
>>> re.search(pattern, "bca")
<_sre.SRE_Match object at 0x01419330>
>>>

6.3 findall()
>>> import re
>>> pattern = r"a"
>>> re.findall(pattern, "abacad")
['a', 'a', 'a']

6.4 finditer()
>>> import re
>>> pattern = r"[0-9]"
>>> re.finditer(pattern, "1234")
<callable-iterator object at 0x017FE990>
>>> for x in re.finditer(pattern, "1234") :
... print x
...
<_sre.SRE_Match object at 0x01407FA8>
<_sre.SRE_Match object at 0x01419330>
<_sre.SRE_Match object at 0x01407FA8>
<_sre.SRE_Match object at 0x01419330>
>>>
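
The iterator yields Match objects, so the matched text and its position have to be read from each one. A small sketch in Python 3 syntax with an invented input string:

```python
import re

results = []
for m in re.finditer(r"[0-9]+", "a12b345"):
    # group() is the matched text, span() its (start, end) position.
    results.append((m.group(), m.span()))

print(results)
```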

6.5 sub() subn()
- subn(pattern, repl, string, count=0, flags=0)

>>> re.sub(r"a", "x", "abca")
'xbcx'
>>> re.subn(r"a", "x", "abca")
('xbcx', 2)
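
Two further uses of the same signature, sketched in Python 3 syntax with made-up strings: the count argument limits the number of replacements, and repl may be a function that receives each Match object.

```python
import re

# count=1 replaces only the first match.
first_only = re.sub(r"a", "x", "abca", count=1)

# repl as a function: double every number found in the text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

# subn() also reports how many replacements were made.
collapsed = re.subn(r"\s+", " ", "a  b \t c")

print(first_only, doubled, collapsed)
```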

6.6 split()
- split(pattern, string, maxsplit=0, flags=0)

>>> re.split(r"[^\d]", "1999-09/19 23:34:59")
['1999', '09', '19', '23', '34', '59']
>>> re.split(r"[^\d ]", "1 + 2 + 3 - 4 * 5")
['1 ', ' 2 ', ' 3 ', ' 4 ', ' 5']
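
The maxsplit argument caps the number of splits, and a capturing group in the pattern keeps the separators in the result. An illustrative sketch in Python 3 syntax:

```python
import re

# Stop after two splits; the rest of the string stays in one piece.
limited = re.split(r"\D", "1999-09/19 23:34:59", maxsplit=2)

# A capturing group around the operator keeps it in the output list.
tokens = re.split(r"\s*([+*-])\s*", "1 + 2 - 3 * 4")

print(limited, tokens)
```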

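7. Some methods of the Match object

- group() returns the matched substring: obj.group()
- start() start index of the match
- end() end index of the match (one past the last matched character)
- span() the tuple (start, end)
- Check whether the returned Match object is None to tell whether the match succeeded.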
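
These methods can be sketched together as follows (Python 3 syntax; the sample text is an assumption made for the example):

```python
import re

text = "abc 123 def"

m = re.search(r"\d+", text)
if m is not None:                 # search() returns None when nothing matches
    matched = m.group()           # the matched substring
    start, end = m.start(), m.end()
    span = m.span()               # (start, end)
    sliced = text[start:end]      # end is exclusive, so slicing gives the match back

no_match = re.match(r"\d+", text)  # match() only tries at the start of the string

print(matched, start, end, span, sliced, no_match)
```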

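8. re attributes

- Compilation flags
- DOTALL/S makes . match every character, including a newline
- IGNORECASE/I ignores case
- LOCALE/L locale-dependent matching
- MULTILINE/M multi-line matching, affects ^ and $
- VERBOSE/X ignores whitespace and # comments in the pattern, so it can be spread over several lines in a """ string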

# re.S
>>> re.findall(r"a.b", "a\nb")
[]
>>> re.findall(r"a.b", "a\nb", re.S)
['a\nb']

# re.M
>>> s = """
... line1: a1
... line2: a2
... line3: a3
... """
>>> s
'\nline1: a1\nline2: a2\nline3: a3\n'
>>> re.findall(r"^line[0-9]", s)
[]
>>> re.findall(r"^line[0-9]", s, re.M)
['line1', 'line2', 'line3']

# re.X
>>> telPatternStr = r"""
... \d{3,4}
... -?
... \d{7}
... """
>>> telPatternStr
'\n\\d{3,4}\n-?\n\\d{7}\n'
>>> re.findall(telPatternStr, "011-1234567")
[]
>>> re.findall(telPatternStr, "011-1234567", re.X)
['011-1234567']
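
With re.X the pattern can also carry comments, and several flags can be combined with |. A sketch in Python 3 syntax; the comments inside the pattern and the second sample string are additions for illustration.

```python
import re

tel_pattern = re.compile(r"""
    \d{3,4}   # area code
    -?        # optional dash
    \d{7}     # local number
""", re.X)

with_dash = tel_pattern.findall("011-1234567")
without_dash = tel_pattern.findall("0111234567")

# Flags are combined with the bitwise OR operator.
line_pattern = re.compile(r"^ line \d", re.X | re.M | re.I)
lines = line_pattern.findall("Line1\nLINE2")

print(with_dash, without_dash, lines)
```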

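9. Grouping - ()

- ( pattern1 | pattern2 ) alternation: matches either pattern1 or pattern2
- When a pattern contains a group, findall() returns the group's content instead of the whole match

# Scraping URLs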
>>> s = """
... <a href="www.baidu.com">baidu</a>
... <a href="www.sina.com.cn">sina</a>
... """
>>> print s

<a href="www.baidu.com">baidu</a>
<a href="www.sina.com.cn">sina</a>

>>> re.findall( r"<a href=\".+\">.+</a>", s )
['<a href="www.baidu.com">baidu</a>', '<a href="www.sina.com.cn">sina</a>']
>>> re.findall( r"<a href=\"(.+)\">.+</a>", s )
['www.baidu.com', 'www.sina.com.cn']
>>>
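
Alternation, and the way findall() treats groups, can be sketched like this (Python 3 syntax). The gray/grey strings are invented, and (?:...) is the standard non-capturing form of a group.

```python
import re

html = (
    '<a href="www.baidu.com">baidu</a>\n'
    '<a href="www.sina.com.cn">sina</a>\n'
)

# Two groups: findall() returns one tuple per match.
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# Alternation inside a capturing group: only the group is returned.
captured = re.findall(r"gr(a|e)y", "gray grey groy")

# (?:...) groups without capturing, so the whole match is returned.
whole = re.findall(r"gr(?:a|e)y", "gray grey groy")

print(links, captured, whole)
```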

10. A small crawler

- Downloads the images from a Tieba thread or a QQ Zone page

- GrapPicture.py

'''
Created on 2013-10-4

@author: WuQinfei
'''


import re
import urllib

# url : address of the web page
# return : the HTML source of the page
def getHtml(url):
    page = urllib.urlopen(url)  # open a connection to the URL
    html = page.read()          # read the response body
    return html

# html : HTML source code
# return : a list of jpg URLs taken from src attributes
def getImg(html):
    reg = r'src="(http://[^\s]*\.jpg)" width'
    imgRe = re.compile(reg)
    imgUrlList = imgRe.findall(html)
    return imgUrlList

# url : address of the file to download
# name : file name (path) to save it under
def downByUrl(url, name):
    urllib.urlretrieve(url, name)



################################################
if __name__ == "__main__":
    html = getHtml("http://tieba.baidu.com/p/2306540022")

    imgUrlList = getImg(html)

    count = 1
    stopNum = 10  # download at most this many pictures
    for imgUrl in imgUrlList:
        print "download....", imgUrl
        pictureName = "E:\\desktop\\python\\py_src\\jpg\\%s.jpg" % count
        downByUrl(imgUrl, pictureName)
        count += 1
        if count > stopNum:
            break

    print "the number of pictures =", count - 1
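
The script above targets Python 2. As a rough sketch of how the same idea might look in Python 3, the extraction step can be exercised offline on an inline HTML string. The snake_case names, the sample HTML, and the example.com URLs are assumptions for illustration; only the regular expression is taken from the script.

```python
import re
import urllib.request

# Same pattern as getImg() in the script above.
IMG_RE = re.compile(r'src="(http://[^\s]*\.jpg)" width')


def get_img(html):
    """Return the jpg URLs captured by the group in IMG_RE."""
    return IMG_RE.findall(html)


def down_by_url(url, name):
    """Download url and save it under the file name `name` (not called below)."""
    urllib.request.urlretrieve(url, name)


# Hypothetical page source; the example.com URLs are placeholders.
sample_html = (
    '<img src="http://example.com/a.jpg" width="100">\n'
    '<img src="http://example.com/b.png" width="100">\n'
    '<img src="http://example.com/c.jpg" width="100">\n'
)
img_urls = get_img(sample_html)
print(img_urls)
```

In Python 3, urllib.request.urlopen(url).read() returns bytes, so a real page would need to be decoded before it is matched against a str pattern like this one.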
