Web Crawlers: Regular Expressions, XPath & BeautifulSoup

Latest recommended article published on 2022-11-24 21:57:16

step-forward

Article link: https://blog.csdn.net/weixin_45734982/article/details/105700635

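A crawler involves four main steps:

Define the target (know which sites or range you intend to search)
Crawl (download the full content of those sites)
Extract (discard the data that is of no use to us)
Process the data (store and use it in the way we want)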

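In the earlier posts 047_爬虫_网络数据采集_requests库 (collecting web data with the requests library) and 048_爬虫案例_360搜索信息爬取 (case study: crawling 360 search results), we covered defining the target and crawling. The downloaded data, however, is a mix of what we need and what we do not, so parsing it further is an essential part of any crawler.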
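The four steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the page content, the URLs and the function names crawl, extract and process are all hypothetical, and the page is hard-coded so that the sketch runs without network access.

```python
import re

# Step 1: the target. A hypothetical page is hard-coded here so the sketch
# runs offline; in practice this text would come from an HTTP request.
PAGE = """
<html><body>
  <h1>Search results</h1>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
"""


def crawl():
    # Step 2: crawl. Stand-in for downloading the whole page.
    return PAGE


def extract(page):
    # Step 3: extract. Keep only (url, title) pairs and drop everything else.
    return re.findall(r'<a href="(.+?)">(.+?)</a>', page)


def process(items):
    # Step 4: process. Store the data in the shape we want, here a dict.
    return {title: url for url, title in items}


links = process(extract(crawl()))
print(links)
```

The extraction step is where regular expressions, XPath or BeautifulSoup come in.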

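1. Regular expressions
A regular expression (regex) is a pattern used to search for, or replace, text that fits a given rule. Regular expressions are not specific to parsing crawled data: they are used very widely, including in the data preparation behind many kinds of data analysis such as machine learning. Online regex tester: http://tool.oschina.net/regex/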
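As a first taste of "search" and "replace", here is a minimal sketch using re.findall and re.sub. The sample text is made up, and the e-mail pattern is deliberately simplified rather than a real address validator.

```python
import re

text = "Contact: alice@example.com, bob@example.org"

# Search: find every substring that looks like an e-mail address.
# The pattern is simplified on purpose and is not a full validator.
emails = re.findall(r"[\w.]+@[\w.]+", text)
print(emails)   # ['alice@example.com', 'bob@example.org']

# Replace: mask the part before the @ sign.
masked = re.sub(r"[\w.]+@", "***@", text)
print(masked)   # Contact: ***@example.com, ***@example.org
```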

>>> import re
>>> re.match(r"..", "a")   # --> no match; returns None, so nothing is echoed
>>> re.match(r"..", "ac")
<re.Match object; span=(0, 2), match='ac'>  # --> match succeeded; the matched text is "ac"
>>> re.match(r"\d\w", "a0bc")
>>> re.match(r"\d\w", "1abc")
<re.Match object; span=(0, 2), match='1a'>
>>> re.match(r"6[a-z5-9]", "6f888")
<re.Match object; span=(0, 2), match='6f'>
>>> re.match(r"6[a-z5-9]", "68f88")
<re.Match object; span=(0, 2), match='68'>

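Quantifiers
A quantifier specifies how many times a given component of a regular expression (a character, a character set or a group) must occur for the match to succeed. There are six forms: *, +, ?, {n}, {n,} and {n,m}.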

>>> re.match(r"\d*", "123abc")
<re.Match object; span=(0, 3), match='123'>
>>> re.match(r"\d?", "123adc")
<re.Match object; span=(0, 1), match='1'>
>>> re.match(r"\d?", "adc123")
<re.Match object; span=(0, 0), match=''>
>>> re.match(r"\d{2}", "123abc")
<re.Match object; span=(0, 2), match='12'>
>>> re.match(r"\d{2,}", "123abc")
<re.Match object; span=(0, 3), match='123'>
>>> re.match(r"\d{2,3}", "123abc")
<re.Match object; span=(0, 3), match='123'>
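To compare the six quantifiers side by side, the sketch below applies each of them to the same made-up input. The names sample, patterns and matched are only for illustration.

```python
import re

sample = "aaaa"
patterns = [r"a*", r"a+", r"a?", r"a{2}", r"a{2,}", r"a{2,3}"]

# For each quantifier, record how much of the sample it consumes.
matched = {p: re.match(p, sample).group() for p in patterns}
for p, m in matched.items():
    print(f"{p:8} -> {m!r}")

# * and ? accept zero occurrences, so they still match (with an empty
# result) when the input does not start with "a"; + needs at least one.
print(re.match(r"a*", "b").group() == "")   # True
print(re.match(r"a+", "b") is None)         # True
```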

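Anchors and boundaries
Anchors and boundary markers pin a regular expression to a position: the start (^) or end ($) of the string, or the start or end of a word (\b).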

>>> re.match(r"^\w+\s\brl\b", "wo rl d")
<re.Match object; span=(0, 5), match='wo rl'>
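The effect of \b, ^ and $ is easier to see on a string in which the same letters appear both as a whole word and inside longer words. A small sketch with made-up text:

```python
import re

text = "cat concat category cat."

# Without boundaries, "cat" is found inside longer words as well.
anywhere = re.findall(r"cat", text)
print(len(anywhere))        # 4

# \b on both sides keeps only the occurrences that are whole words.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)          # ['cat', 'cat']

# ^ anchors to the start of the string, $ to the end.
starts_with_cat = re.search(r"^cat", text) is not None
ends_with_cat = re.search(r"cat$", text) is not None   # the text ends with "."
print(starts_with_cat, ends_with_cat)   # True False
```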

Grouping
Grouping is a regular expression feature that lets you bundle part of a pattern together and refer to it as a single item. Groups are created with parentheses ().

Example 1: matching an integer from 0 to 100

# Approach 1:
>>> re.match(r"[1-9]\d?$|0$|100$", "100")
<re.Match object; span=(0, 3), match='100'>
>>> re.match(r"[1-9]\d?$|0$|100$", "0")
<re.Match object; span=(0, 1), match='0'>
>>> re.match(r"[1-9]\d?$|0$|100$", "56")
<re.Match object; span=(0, 2), match='56'>
# Approach 2 (shorter, but it also accepts the empty string):
>>> re.match(r"[1-9]?\d?$|100$", "56")
<re.Match object; span=(0, 2), match='56'>
>>> re.match(r"[1-9]?\d?$|100$", "0")
<re.Match object; span=(0, 1), match='0'>
>>> re.match(r"[1-9]?\d?$|100$", "100")
<re.Match object; span=(0, 3), match='100'>
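The 0 to 100 check can also be written with an explicit group, so that the alternatives sit inside one pair of parentheses and the whole-string requirement is stated once. This is one possible variant, not the only way to do it: the (?:...) non-capturing group and the helper name is_percent are additions of this sketch, and re.fullmatch replaces the trailing $ in each alternative.

```python
import re

# (?:...) groups the alternatives without capturing them.
# fullmatch() requires the whole string to match, so no $ is needed.
PERCENT = re.compile(r"(?:100|[1-9]?\d)")


def is_percent(s):
    """Return True if s is an integer from 0 to 100 without leading zeros."""
    return PERCENT.fullmatch(s) is not None


for candidate in ["0", "56", "100", "", "05", "101"]:
    print(repr(candidate), is_percent(candidate))
```

Unlike the pattern [1-9]?\d?$|100$, this version rejects the empty string because \d is no longer optional.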

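Example 2: extracting the matched group

>>> res = re.match(r"<h1>(.*)</h1>", "<h1>matched group</h1>")
>>> res.group()   # --> group 0: the entire text matched by the pattern
'<h1>matched group</h1>'
>>> res.group(1)  # --> group 1: the text captured by the first pair of parentheses
'matched group'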

Example 3: referring back to a group by number (\num) or by name

>>> s = "<html><h1>python</h1></html>"
>>> re.match(r"<(.+)><(.+)>.+</\2></\1>", s)
<re.Match object; span=(0, 28), match='<html><h1>python</h1></html>'>
>>> re.match(r"<(?P<key1>.+)><(?P<key2>.+)>.+</(?P=key2)></(?P=key1)>", s)
<re.Match object; span=(0, 28), match='<html><h1>python</h1></html>'>
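Two more small uses of the same ideas, as a sketch with made-up inputs: a numbered backreference that detects doubled words, and named groups read back as a dictionary.

```python
import re

# \1 must repeat exactly what group 1 captured, so this finds doubled words.
text = "this is is a test test of of regex"
doubled = re.findall(r"\b(\w+) \1\b", text)
print(doubled)          # ['is', 'test', 'of']

# Named groups can be read back by name instead of by position.
m = re.match(r"(?P<key>\w+)=(?P<value>\w+)", "lang=python")
pair = m.groupdict()
print(pair)             # {'key': 'lang', 'value': 'python'}
print(m.group("key"))   # lang
```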

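Greedy and non-greedy modes

Greedy mode: match as many characters as possible, provided the expression as a whole still matches.
Non-greedy mode: match as few characters as possible, provided the expression as a whole still matches. Appending ? to *, +, ? or {n,m} turns on non-greedy mode.
In Python, quantifiers are greedy by default.
Example: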

>>> re.match(r"aa(\d+)", "aa1234bbb").group(1)
'1234'
>>> re.match(r"aa(\d+?)", "aa1234bbb").group(1)
'1'
>>> re.match(r"aa(\d+)bbb", "aa1234bbb").group(1)
'1234'
>>> re.match(r"aa(\d+?)bbb", "aa1234bbb").group(1)
'1234'

# Strip the unnecessary parameters from a URL
>>> url = "https://www.baidu.com/?tn=99669880_hao_pg"
>>> re.sub(r"(https://.+?/).*", lambda x: x.group(1), url)
'https://www.baidu.com/'
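The difference matters most when pulling repeated elements out of HTML, which is the typical crawler case. A minimal sketch on a made-up fragment:

```python
import re

html = "<li>one</li><li>two</li><li>three</li>"

# Greedy: .* runs to the last </li>, so everything lands in one match.
greedy = re.findall(r"<li>(.*)</li>", html)
print(greedy)   # ['one</li><li>two</li><li>three']

# Non-greedy: .*? stops at the first </li> it can, giving one match per item.
lazy = re.findall(r"<li>(.*?)</li>", html)
print(lazy)     # ['one', 'two', 'three']
```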

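1.2 The re module
Regular expressions in Python are provided by the re module.
The usual steps when working with the re module:

Use compile() to compile the regular expression string into a Pattern object.
Note: regular expressions use backslashes to escape special characters; writing the pattern as a raw string (just add an r prefix) avoids having to double them.
Match the text against the Pattern object to obtain the result, a Match object.
Use the attributes and methods of the Match object to get the information you need, then carry on with any further processing.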
Code example:

import re

text = """
    2020-10-10
    2020-11-11
    2030/12/12
"""

# 1. Use compile() to compile the regular expression string into a Pattern object.
# Note: write the pattern as a raw string (r prefix) so that backslashes do not need to be doubled.
# pattern = re.compile(r'\d{4}-\d{1,2}-\d{1,2}')    # e.g. 2020-4-11, no groups
# pattern = re.compile(r'(\d{4})-(\d{1,2})-(\d{1,2})')    # e.g. 2020-4-11, positional groups
pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})')  # e.g. 2020-4-11, named groups

# 2. Match the text against the Pattern object to obtain a Match object.
# search() scans the string for the first substring that fits the pattern and returns only that one.
result = pattern.search(text)
print(result)

# 3. Use the Match object's attributes and methods to get the information you need.
print("Matched text:", result.group())  # the matched text: 2020-10-10
print("Positional groups:", result.groups())  # ('2020', '10', '10')
print("Named groups:", result.groupdict())  # {'year': '2020', 'month': '10', 'day': '10'}

Once the regular expression has been compiled into a Pattern object, the Pattern's methods can be used to match and search text.
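A few of the Pattern methods side by side, as a minimal sketch. The sample text is made up, and this is a selection of methods that exist on compiled patterns rather than a complete list.

```python
import re

text = "2020-10-10 2020-11-11 2030/12/12"
pattern = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")

# match(): only succeeds if the pattern matches at the start of the string.
first = pattern.match(text)
print(first.group())                 # 2020-10-10

# search(): scans forward and returns the first match anywhere.
found = pattern.search("on 2020-11-11")
print(found.span())                  # (3, 13)

# findall(): every match; with groups it returns tuples of the groups.
all_dates = pattern.findall(text)
print(all_dates)                     # [('2020', '10', '10'), ('2020', '11', '11')]

# finditer(): like findall(), but yields Match objects with positions.
starts = [m.start() for m in pattern.finditer(text)]
print(starts)                        # [0, 11]
```

The date written with slashes is skipped by all four calls because the pattern only allows hyphens.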

import re

# **************************** split ***************************
text = '1+2*4+8-9/10'
# Compare the string method: '172.25.254.250'.split('.')   => ['172', '25', '254', '250']
pattern = re.compile(r'\+|-|\*|/')
# Split the string on any of +, -, * or /
result = re.split(pattern, text)
print("split:", result)  # ['1', '2', '4', '8', '9', '10']


# *********************** sub **************************************
def repl_string(matchObj):
    # Match object methods: group, groups, groupdict
    items = matchObj.groups()
    # print("captured groups: ", items)   # ('2019', '10', '10')
    return "-".join(items)


# 2019/10/10 ====> 2019-10-10
text = "2019/10/10 2020/12/12 2019-12-10  2020-11-10"
pattern = re.compile(r'(\d{4})/(\d{1,2})/(\d{1,2})')  # Note: do not put stray spaces inside the pattern
# Replace every match with the fixed string '2019-10-10'
# result = re.sub(pattern, '2019-10-10', text)
# Replace every match with 'year-month-day' built from its own groups
result = re.sub(pattern, repl_string, text)
print("sub:", result)  # 2019-10-10 2020-12-12 2019-12-10  2020-11-10

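Constants are values that never change and are generally used as flags. The four most commonly used constants of the re module are listed at the top of the following example.

Code example: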

"""
Commonly used regex flag constants and their short forms:
    "ASCII": 'A'
    "IGNORECASE": 'I'
    "MULTILINE":'M'
    "DOTALL":'S'
"""

import re

# ********************************   1. re.ASCII *****************************
text = "正则表达式re模块是python中的内置module."
# \w+ matches letters, digits and underscores, and by default Chinese characters too;
# pass flags=re.A to restrict it to ASCII.
result1 = re.findall(r'\w+', string=text, flags=re.A)
print("result1:", result1)  # ['re', 'python', 'module']

# ********************************   2. re.IGNORECASE *****************************
text = 'hello world heLLo westos Hello python'
# Find every he\w+o, ignoring case (re.I)
result2 = re.findall(r'he\w+o', text, re.I)
print("result2:", result2)  # ['hello', 'heLLo', 'Hello']

# ********************************   3. re.DOTALL *****************************
text = 'hello \n world'
# With re.S, "." also matches the newline, so the pattern can span both lines
result3 = re.findall(r'^he.*?ld$', text, re.S)
print("result3:", result3)  # ['hello \n world']


# ************************ Matching Chinese characters **********************
pattern = r'[\u4e00-\u9fa5]'
text = "正则表达式re模块是python中的内置module."
result4 = re.findall(pattern, text)
print("result4:", result4)  # each Chinese character as a separate list item
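The list of constants above includes MULTILINE, which the example does not exercise. The following sketch, on made-up text, shows what re.M changes and how flags can be combined with the | operator.

```python
import re

text = "hello world\nHello python\nsay hello"

# Without re.M, ^ matches only at the very start of the string.
default = re.findall(r"^\w+", text)
print(default)      # ['hello']

# With re.M, ^ also matches right after every newline.
multiline = re.findall(r"^\w+", text, re.M)
print(multiline)    # ['hello', 'Hello', 'say']

# Flags are combined with |: line-by-line anchoring plus case-insensitivity.
combined = re.findall(r"^hello", text, re.M | re.I)
print(combined)     # ['hello', 'Hello']
```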

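2. The XPath data-parsing library
What is XPath?
lxml is a Python parsing library that handles both HTML and XML, supports XPath queries, and parses very efficiently. ==> To use XPath you therefore need to install the lxml module.
XPath (XML Path Language) is a language for locating information in XML documents; it can be used to walk through the elements and attributes of XML and HTML documents.
XPath reference: https://www.w3school.com.cn/xpath/xpath_syntax.asp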
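To give a feel for what a path expression looks like, here is a minimal sketch. It deliberately swaps in the standard library's xml.etree.ElementTree, which understands only a limited subset of XPath and requires well-formed markup, so that it runs without installing anything; with lxml the same kind of query is normally passed to an element's xpath() method. The sample document and variable names are made up.

```python
import xml.etree.ElementTree as ET

# A small, well-formed document (hypothetical sample data).
doc = """
<html>
  <body>
    <ul>
      <li class="item"><a href="/page/1">first</a></li>
      <li class="item"><a href="/page/2">second</a></li>
      <li class="other"><a href="/page/3">third</a></li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(doc)

# .//li             -> every <li> element anywhere below the root
# [@class='item']   -> keep only those whose class attribute is "item"
# /a                -> then step down to their <a> children
links = root.findall(".//li[@class='item']/a")

texts = [a.text for a in links]
hrefs = [a.get("href") for a in links]
print(texts)   # ['first', 'second']
print(hrefs)   # ['/page/1', '/page/2']
```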

step1: install lxml

pip install lxml

step2: if etree still cannot be imported after installing lxml (a version-related problem), obtain it indirectly

# ********** instead of ********
from lxml import etree

# ********** get etree indirectly ********
from lxml import html
etree = html.etree

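step3: parsing HTML with lxml

from lxml import html

etree = html.etree

# 1. Parse a string containing HTML into an Element object.
#    lxml repairs the markup, e.g. it closes the unclosed <h2> tag.
text = """
    <h1>title 1</h1>
    <h2>title 2
"""
root = etree.HTML(text)  # named root so that it does not shadow the html module
result = etree.tostring(root)
print(result.decode())

# 2. Parse an HTML file into an ElementTree object.
#    etree.parse() uses the XML parser by default; HTMLParser() makes it tolerate ordinary HTML.
tree = etree.parse('hello.html', etree.HTMLParser())
result = etree.tostring(tree, pretty_print=True)
print(result.decode())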
