Data Cleaning with Regular Expressions in Python (re)


若如初见kk


Article link: https://blog.csdn.net/Artificial_idiots/article/details/108719898


1. Common regular expressions

Common regular expressions in Python

2. Data cleaning with regular expressions
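As a quick refresher, here is a minimal sketch of the `re` functions that come up most often when cleaning data. The sample string and variable names are made up for illustration and are not from the original article.

```python
import re

# Hypothetical sample text, used only to illustrate the calls below.
sample = "Order 66 shipped on 2020-09-22; order 7 is pending."

# findall: every non-overlapping match, returned as a list of strings
numbers = re.findall(r"\d+", sample)

# sub: replace every match with the given replacement
masked = re.sub(r"\d", "#", sample)

# search: the first match anywhere in the string (or None); groups capture parts
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample)

# split: cut the string wherever the pattern matches
parts = re.split(r";\s*", sample)

print(numbers)
print(masked)
print(date.group(1), date.group(2), date.group(3))
print(parts)
```

The examples in section 2 rely on `re.sub` and `re.findall`.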

2.1 Extracting text from HTML tags in a web page

import re
text = "<div><p>\n你好\nPython：</p><p>Python是一种跨平台的计算机程序设计语言。 </p><p><br></p><p>是一个高层次的结合了解释性、编译性、互动性和面向对象的脚本语言。</p><p>最初被设计用于编写自动化脚本(shell)，随着版本的不断更新和语言新功能的添加，越多被用于独立的、大型项目的开发。&nbsp;</p><br><a>快来学习Python吧！</a></div>"
# Delete every tag (non-greedy <...>), the &nbsp; entity, and newline characters
result = re.sub(r"<.*?>|&nbsp;|\n", "", text)
print(result)

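Output:

你好Python：Python是一种跨平台的计算机程序设计语言。 是一个高层次的结合了解释性、编译性、互动性和面向对象的脚本语言。最初被设计用于编写自动化脚本(shell)，随着版本的不断更新和语言新功能的添加，越多被用于独立的、大型项目的开发。快来学习Python吧！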
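The pattern above handles only `&nbsp;` and glues lines together when it deletes newlines. A slightly more general sketch is shown here, assuming it is acceptable to decode all HTML entities with the standard-library `html.unescape` and to collapse whitespace into single spaces instead of deleting it. The function name `strip_html` and the sample string are illustrative, not from the original article. For real-world pages a proper HTML parser is usually a safer choice than regular expressions.

```python
import html
import re

# One tag: "<", anything except ">", then ">". [^>] also matches newlines,
# so tags that are broken across lines are still removed.
TAG_RE = re.compile(r"<[^>]*>")

def strip_html(raw):
    """Remove tags, decode entities, and collapse runs of whitespace."""
    no_tags = TAG_RE.sub("", raw)
    decoded = html.unescape(no_tags)  # "&nbsp;" -> "\xa0", "&amp;" -> "&"
    # For str patterns, \s also matches the non-breaking space "\xa0".
    return re.sub(r"\s+", " ", decoded).strip()

# Hypothetical sample input
sample = "<div><p>\nHello\nPython:</p><p>Tom &amp; Jerry&nbsp;</p><br><a>Learn more</a></div>"
print(strip_html(sample))
```

Decoding entities after removing tags means that escaped markup such as `&lt;b&gt;` survives as literal text instead of being stripped as a tag.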

2.2 Removing punctuation from English text and extracting words

import re
text = "This isn't to be alarmist. (Optimists point out that technological upheaval has benefited workers in the past.) The Industrial Revolution didn't go so well for Luddites whose jobs were displaced by mechanized looms, but it eventually raised living standards and created more jobs than it destroyed. Likewise, automation should eventually boost productivity, stimulate demand by driving down prices, and free workers from hard, boring work. But in the medium term, middle-class workers may need a lot of help adjusting."
# Replace every character that is not an ASCII letter with a space
result = re.sub(r"[^A-Za-z]", " ", text)
print(result)

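Output:

This isn t to be alarmist   Optimists point out that technological upheaval has benefited workers in the past   The Industrial Revolution didn t go so well for Luddites whose jobs were displaced by mechanized looms  but it eventually raised living standards and created more jobs than it destroyed  Likewise  automation should eventually boost productivity  stimulate demand by driving down prices  and free workers from hard  boring work  But in the medium term  middle class workers may need a lot of help adjusting 

Every non-letter character, including apostrophes and hyphens, becomes a space, so "isn't" is split into "isn" and "t". Calling result.split() turns the cleaned string into a list of words.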
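If contractions such as "isn't" should stay in one piece, one alternative is to match the words directly with `re.findall` rather than blanking out everything else. This is a minimal sketch under that assumption; the helper name `extract_words` is illustrative and not from the original article.

```python
import re

# A word is a run of ASCII letters, optionally followed by apostrophe + letters
# (so "isn't" and "didn't" stay whole). Hyphens still separate words.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def extract_words(text):
    """Return the words in text, in order of appearance."""
    return WORD_RE.findall(text)

sample = "This isn't to be alarmist. (Optimists point out.) Middle-class workers may need help."
words = extract_words(sample)
print(words)
```

Because the group is non-capturing (`(?:...)`), `findall` returns whole matches and not just the part after the apostrophe.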

2.3 Extracting email addresses ending in .com

import re
text = "adjhuw_@163.com, qux3349@qq.com, nihaomatt@126.xxx, asdfghj@sina.com, abc@139.org"
# Local part and domain name: letters, digits, underscore; the domain must end in ".com"
result = re.findall(r"[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.com", text)
print(result)

Output:

['adjhuw_@163.com', 'qux3349@qq.com', 'asdfghj@sina.com']
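The pattern above works for the sample data, but it would also return `bob@example.com` from the text `bob@example.company`, and it misses addresses with dots in the local part or with subdomains. A stricter sketch follows, assuming a simplified notion of a valid address (it does not implement the full email syntax). The extra addresses in the sample are hypothetical and added only to show the difference.

```python
import re

EMAIL_COM_RE = re.compile(
    r"[A-Za-z0-9_.+-]+"         # local part: letters, digits, and _ . + -
    r"@"
    r"(?:[A-Za-z0-9-]+\.)+"     # one or more domain labels, each followed by a dot
    r"com"
    r"(?!\.?[A-Za-z0-9-])"      # "com" must be the last label (rejects .company, .com.cn)
)

# The article's sample plus two hypothetical addresses
text = (
    "adjhuw_@163.com, qux3349@qq.com, nihaomatt@126.xxx, asdfghj@sina.com, "
    "abc@139.org, first.last@mail.example.com, bob@example.company"
)
result = EMAIL_COM_RE.findall(text)
print(result)
```

The negative lookahead still allows ordinary punctuation after the address, so an address at the end of a sentence, followed by a period, is matched.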
