[Python Web Crawlers] Using Regular Expressions for Data Cleaning and Processing in Python Crawlers

Latest recommended article published on 2024-10-04 21:43:10

左手の明天

Category: Python Web Crawlers. Tags: python, programming languages, crawler, regular expressions

Article link: https://blog.csdn.net/ywsydwsbn/article/details/138198916

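This article shows how to use regular expressions for data cleaning in Python web crawlers, including finding matches, replacing text, extracting multiple matches, and extracting links from HTML. Regular expressions are an important tool for processing text data efficiently.

Summary generated by CSDN using AI technology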

🔗 Runtime environment: Python

🚩 Author: 左手の明天

🥇 Featured column: "python"

🔥 Recommended column: "Algorithm Research"

#### Anti-counterfeiting watermark: 左手の明天 ####

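💗 Hello everyone 🤗🤗🤗, I'm 左手の明天! Long time no see 💗

💗 Today's update in the [Python Web Crawlers] series: Getting Started with Python Web Crawlers 💗

📆 Last updated: April 29, 2024, the 328th original blog post by 左手の明天

📚 Updated in column: Python Web Crawlers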

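When cleaning and processing data in a Python web crawler, regular expressions are a powerful tool for extracting the information we need from messy text. In Python, the re module provides regular expression support.

The examples below show how to use regular expressions for data cleaning and processing in a Python web crawler:

Importing the re module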

import re

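Finding a match

Use re.search() or re.match() to look for a match in the text. re.search() scans the whole string and returns the first match, while re.match() only matches at the beginning of the string. Both return None when nothing matches.

text = "The price is 100 dollars"
match = re.search(r'\d+', text)  # Find the first run of digits
if match:
    print(match.group())  # Prints the matched number: 100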
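The difference between the two functions is easy to overlook, so here is a minimal sketch that runs both on the same sample text. The variable names at_start and anywhere are chosen here purely for illustration.

```python
import re

text = "The price is 100 dollars"

# re.match() only tries to match at position 0, and the text starts with "The"
at_start = re.match(r'\d+', text)

# re.search() scans forward and stops at the first run of digits
anywhere = re.search(r'\d+', text)

print(at_start)          # None
print(anywhere.group())  # 100
print(anywhere.span())   # (13, 16)
```

In crawler code, re.search() is usually the safer default, since the value of interest rarely sits at the very start of a scraped string.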

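Replacing text

Use re.sub() to replace every match in the text with a replacement string.

text = "Hello, World!"
cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove everything except letters and whitespace
print(cleaned_text)  # Output: Hello World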
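A common cleaning step for scraped text is stripping leftover tags and collapsing whitespace. The sketch below chains two re.sub() calls to do that. The sample string is made up for illustration, and the tag pattern is deliberately naive: it assumes no ">" appears inside an attribute value.

```python
import re

raw = "<p>  Hello,\n   World! </p>"

# Step 1: drop anything that looks like a tag (naive: assumes no '>' inside attributes)
no_tags = re.sub(r'<[^>]+>', '', raw)

# Step 2: collapse every run of whitespace (spaces, tabs, newlines) into one space
collapsed = re.sub(r'\s+', ' ', no_tags).strip()

print(collapsed)  # Hello, World!
```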

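Extracting multiple matches

Use re.findall() to extract all non-overlapping matches from the text. The results are returned as a list of strings.

text = "Prices are 100, 200, 300 dollars"
prices = re.findall(r'\d+', text)  # Extract every number
print(prices)  # Output: ['100', '200', '300']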
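When the pattern contains more than one capturing group, re.findall() returns a list of tuples instead of plain strings, which is handy for pulling out name/value pairs. A minimal sketch, assuming a made-up "name: price" text format:

```python
import re

text = "apple: 100 dollars, banana: 200 dollars, cherry: 300 dollars"

# Two capturing groups, so each result is a (name, value) tuple of strings
pairs = re.findall(r'(\w+): (\d+)', text)
print(pairs)  # [('apple', '100'), ('banana', '200'), ('cherry', '300')]

# findall() always returns strings, so convert before doing arithmetic
price_by_item = {name: int(value) for name, value in pairs}
print(sum(price_by_item.values()))  # 600
```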

Splitting text

Use re.split() to split the text wherever the regular expression matches.

text = "apple, banana, cherry"
fruits = re.split(r', ', text)  # Split on a comma followed by a space
print(fruits)  # Output: ['apple', 'banana', 'cherry']
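For a fixed separator like ", ", str.split() would do the same job. re.split() pays off when scraped data uses inconsistent delimiters. The sketch below assumes a made-up input that mixes commas, semicolons, and pipes with irregular spacing.

```python
import re

text = "apple,banana ;  cherry|  durian"

# Split on any one of , ; | and swallow the whitespace around it
fruits = re.split(r'\s*[,;|]\s*', text)

print(fruits)  # ['apple', 'banana', 'cherry', 'durian']
```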

Example: extracting links from HTML

Suppose we have an HTML string and want to extract all of the links in it:

import re
 
html = '''
<html>
<head></head>
<body>
    <a href="http://example.com/link1">Link 1</a>
    <a href="http://example.com/link2">Link 2</a>
    <a href="http://example.com/link3">Link 3</a>
</body>
</html>
'''
 
# Use a regular expression to extract the URL from each href attribute
links = re.findall(r'<a href="([^"]+)">', html)
 
for link in links:
    print(link)

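In this example, the regular expression <a href="([^"]+)"> matches the opening tag <a href="..."> of each link, and ([^"]+) is a capturing group that extracts the href attribute value. Because the pattern contains a single capturing group, re.findall() returns only the captured URLs rather than the whole tag. Note that the pattern only matches tags written in exactly this form: href must be the only attribute and must use double quotes.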
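Real pages rarely format every link the same way. As a hedged sketch, the pattern below tolerates extra attributes, single or double quotes, and uppercase tag names. The sample HTML and the name link_pattern are made up for illustration, and for anything beyond simple pages a real HTML parser is likely to be more reliable than a regular expression.

```python
import re

html = '''
<a class="nav" href="http://example.com/link1">Link 1</a>
<A HREF='http://example.com/link2' target="_blank">Link 2</A>
<a href="http://example.com/link3">Link 3</a>
'''

# Group 1 captures the opening quote, \1 requires the same quote to close;
# group 2 captures the URL between them
link_pattern = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1''',
    re.IGNORECASE,
)

links = [m.group(2) for m in link_pattern.finditer(html)]

for link in links:
    print(link)
```

On this sample input, the stricter pattern from the example above would only find the third link, because the first two tags have extra attributes or different quoting.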