Web Scraping Basics (5)

Latest recommended article published on 2020-04-21 15:25:07

Fergus awsl

Category: Data Analysis

Article link: https://blog.csdn.net/weixin_43650411/article/details/91127187

Concepts
The re library
Practice code
Note: MOOC course "Python网络爬虫与信息提取" (Python web crawlers and information extraction)

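Concepts
Regular expression: a set of rules, written in a descriptive language, for concisely expressing and matching strings.
Note: strings are among the most frequently encountered data types in programming.
The re library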

Regular expressions are written as raw strings, so that backslashes reach the regex engine unchanged: r'text'
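A minimal sketch of why the raw string form matters. The sample text and variable names are made up for illustration; the point is only that `"\b"` in a normal string literal is a backspace character, while `r"\b"` keeps the backslash so that re sees a word boundary.

```python
import re

# In a normal string literal, "\b" is a single backspace character.
normal = "\bcat\b"
# In a raw string the backslash is kept, so re receives \b (word boundary).
raw = r"\bcat\b"

text = "a cat sat"
normal_hit = re.search(normal, text)  # looks for backspace + "cat" + backspace
raw_hit = re.search(raw, text)        # looks for the whole word "cat"

print(len(normal), len(raw))  # 5 7
print(normal_hit)             # None
print(raw_hit.group(0))       # cat
```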
Main functions

Parameters & flags

import re
re.search(pattern, string, flags=0)
re.match(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
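re.split(pattern, string, maxsplit=0, flags=0)  # maxsplit: maximum number of splits; the remainder is returned as the last element
re.finditer(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)  # repl: string that replaces each match   count: maximum number of replacements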
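The `flags` parameter in the signatures above changes how a pattern is interpreted. As a rough sketch, here are three flags that exist in the standard re module (`re.I`, `re.M`, `re.S`); the sample text is invented for the example.

```python
import re

text = "Python\npython\nPYTHON"

# Default: case-sensitive, and ^ only matches at the start of the whole string.
default_hits = re.findall(r'^python', text)

# re.I (re.IGNORECASE): ignore case.
ignore_case_hits = re.findall(r'^python', text, flags=re.I)

# re.M (re.MULTILINE): ^ also matches at the start of every line.
multiline_hits = re.findall(r'^python', text, flags=re.I | re.M)

# re.S (re.DOTALL): . also matches the newline character.
dot_default = re.search(r'Python.python', text)
dot_all = re.search(r'Python.python', text, flags=re.S)

print(default_hits)      # []
print(ignore_case_hits)  # ['Python']
print(multiline_hits)    # ['Python', 'python', 'PYTHON']
print(dot_default)       # None
print(dot_all is not None)  # True
```

Flags are combined with `|`, as in `re.I | re.M` above.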

Equivalent usage
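The original illustration for this heading is not available, so the following is only a guess at what it showed: calling a module-level function such as `re.search` gives the same result as compiling the pattern with `re.compile` and calling the method on the resulting pattern object. The name `zip_pattern` is hypothetical.

```python
import re

text = 'BIT 100081'

# Functional usage: a one-off call, the pattern is passed in each time.
m1 = re.search(r'[1-9]\d{5}', text)

# Object usage: compile once, then reuse the pattern object.
zip_pattern = re.compile(r'[1-9]\d{5}')
m2 = zip_pattern.search(text)

print(m1.group(0), m2.group(0))  # 100081 100081
```

The compiled form is convenient when the same pattern is applied to many strings.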
Match object
Greedy matching & minimal matching
By default, re matches greedily, i.e. it returns the longest string that matches.

# Appending ? to an operator switches it to minimal (non-greedy) matching.
*?
+?
??
{m,n}?
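A small sketch of how each minimal-match operator differs from its greedy counterpart. The sample strings are chosen only for illustration.

```python
import re

text = 'PYANBNCNDN'

# * is greedy: it runs to the last N. *? stops at the first N.
greedy = re.search(r'PY.*N', text).group(0)
lazy = re.search(r'PY.*?N', text).group(0)

# {m,n} takes as many repetitions as allowed; {m,n}? takes as few as allowed.
bounded_greedy = re.search(r'\d{2,4}', '123456').group(0)
bounded_lazy = re.search(r'\d{2,4}?', '123456').group(0)

# ? prefers one occurrence; ?? prefers zero occurrences.
optional_greedy = re.search(r'ab?', 'ab').group(0)
optional_lazy = re.search(r'ab??', 'ab').group(0)

print(greedy, lazy)                   # PYANBNCNDN PYAN
print(bounded_greedy, bounded_lazy)   # 1234 12
print(optional_greedy, optional_lazy) # ab a
```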

Practice code

import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))
100081

match = re.match(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))

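match.group(0)  # match is None here, because re.match only matches at the start of the string; check with if before using the match object, as above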
Traceback (most recent call last):
  File "<ipython-input-12-4d972d6c40f1>", line 1, in <module>
    match.group(0)
AttributeError: 'NoneType' object has no attribute 'group'




ls = re.findall(r'[1-9]\d{5}', 'BIT 100081 TSU100084')
ls
Out[13]: ['100081', '100084']

re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
Out[14]: ['BIT', ' TSU', '']

re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)
Out[15]: ['BIT', ' TSU100084']

for match in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if match:
        print(match.group(0))
100081
100084

re.sub(r'[1-9]\d{5}', ':zip', 'BIT100081 TSU100084')
Out[18]: 'BIT:zip TSU:zip'

# Match object example
match = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
match.string
Out[19]: 'BIT100081 TSU100084'

match.re
Out[20]: re.compile(r'[1-9]\d{5}', re.UNICODE)

match.pos
Out[21]: 0

match.endpos
Out[22]: 19

match.group(0)
Out[23]: '100081'

match.start()  # indices start at 0
Out[26]: 3

match.end()
Out[27]: 9

match.span()
Out[28]: (3, 9)


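# Greedy matching: \d+ consumes every digit, leaving nothing for 0*
re.match(r'^(\d+)(0*)$', '102300').groups()
Out[35]: ('102300', '')
# Minimal matching: \d+? takes as few digits as possible, so 0* gets the trailing zeros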
re.match(r'^(\d+?)(0*)$', '102300').groups()
Out[34]: ('1023', '00')

Applications

# Splitting a string on whitespace, commas, and semicolons
re.split(r'[\s\,\;]+', 'a b,c  d;e')
Out[29]: ['a', 'b', 'c', 'd', 'e']

# Grouping
match = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
if match:
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))
    print(match.groups())
010-12345
010
12345
('010', '12345')
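As an optional variation on the grouping example, groups can also be given names with the `(?P<name>...)` syntax, which re supports. The group names `area` and `number` are made up for this sketch and do not come from the original post.

```python
import re

# Same pattern as above, but each group has a name.
match = re.match(r'^(?P<area>\d{3})-(?P<number>\d{3,8})$', '010-12345')

# groupdict() returns the named groups as a dictionary.
parts = match.groupdict() if match else {}

if match:
    print(match.group('area'))    # 010
    print(match.group('number'))  # 12345
    print(parts)                  # {'area': '010', 'number': '12345'}
```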