1. Common regular expressions
Common regular expressions in Python
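As a quick illustration of the kind of patterns this heading refers to, the sketch below runs a few frequently used constructs through the standard `re` module. The sample strings and variable names are invented for illustration and do not come from the original article.

```python
import re

# Sample strings are hypothetical, chosen only to exercise each pattern.
digits = re.findall(r"\d+", "order 66 shipped in 3 days")   # runs of digits
words = re.findall(r"\w+", "hello, regex world")            # runs of word characters
starts = re.match(r"Py", "Python") is not None              # re.match anchors at the start
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released 2008-12-03")  # capture groups
parts = re.split(r"[,;]\s*", "a, b;c")                      # split on delimiters

print(digits)
print(words)
print(starts)
print(date.groups())
print(parts)
```

`re.findall` returns every non-overlapping match, `re.search` returns the first match anywhere in the string, and `re.match` only succeeds at the beginning of the string.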
2. Data cleaning with regular expressions
2.1 Extracting text from HTML tags in a web page
import re
text = "<div><p>\n你好\nPython:</p><p>Python是一种跨平台的计算机程序设计语言。 </p><p><br></p><p>是一个高层次的结合了解释性、编译性、互动性和面向对象的脚本语言。</p><p>最初被设计用于编写自动化脚本(shell),随着版本的不断更新和语言新功能的添加,越多被用于独立的、大型项目的开发。 </p><br><a>快来学习Python吧!</a></div>"
# Remove HTML tags (non-greedy <.*?>), spaces, and newlines
result = re.sub(r"<.*?>| |\n", "", text)
print(result)
Output:
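A note on the pattern used in this example: the `?` in `<.*?>` makes the match non-greedy, which is what keeps the text between tags intact. The following minimal sketch compares the two behaviours on a made-up string (`html` is a hypothetical sample, not the article's data).

```python
import re

html = "<p>Hello</p><p>Python</p>"  # hypothetical sample, not from the article

# Greedy: .* runs from the first "<" to the last ">", swallowing the text too.
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: .*? stops at the nearest ">", so only the tags are removed.
lazy = re.sub(r"<.*?>", "", html)

print(repr(greedy))  # ''
print(repr(lazy))    # 'HelloPython'
```

By default `.` does not match a newline, so a tag that spans several lines would be missed by this pattern. For real-world pages, a proper parser such as the standard library's `html.parser` is usually a more robust choice than a regular expression.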
2.2 Removing punctuation from English text to extract words
import re
text = "This isn't to be alarmist. (Optimists point out that technological upheaval has benefited workers in the past.) The Industrial Revolution didn't go so well for Luddites whose jobs were displaced by mechanized looms, but it eventually raised living standards and created more jobs than it destroyed. Likewise, automation should eventually boost productivity, stimulate demand by driving down prices, and free workers from hard, boring work. But in the medium term, middle-class workers may need a lot of help adjusting."
# Replace every character that is not an ASCII letter with a space
result = re.sub(r"[^A-Za-z]", " ", text)
print(result)
Output:
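The substitution in this example yields a single string in which contractions such as "isn't" are split into "isn" and "t". If a list of words is the goal, one possible follow-up is sketched below; the `sentence` sample and the `tokens` pattern are assumptions added for illustration, not part of the original article.

```python
import re

sentence = "It didn't go well, but middle-class workers adapted."  # hypothetical sample

# Same substitution as in the article, then split on whitespace to get a list.
words = re.sub(r"[^A-Za-z]", " ", sentence).split()

# Alternative sketch: keep internal apostrophes and hyphens as part of a word.
tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", sentence)

print(words)
print(tokens)
```

The first approach is simpler but breaks "didn't" and "middle-class" apart; the second keeps them whole at the cost of a slightly longer pattern.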
2.3 Extracting email addresses that end in .com
import re
text = "adjhuw_@163.com, qux3349@qq.com, nihaomatt@126.xxx, asdfghj@sina.com, abc@139.org"
# Match addresses whose domain is a single label followed by ".com"
result = re.findall(r"[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.com", text)
print(result)
Output:
['adjhuw_@163.com', 'qux3349@qq.com', 'asdfghj@sina.com']
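The pattern in this example works for the sample data, but it has two limits worth knowing: it stops as soon as it has seen ".com", so it would accept the front of a longer suffix, and it does not allow dotted subdomains. The sketch below shows both effects and one possible stricter variant. The `emails` string and the `strict` pattern are assumptions made for illustration, and neither pattern is a full email validator.

```python
import re

# Hypothetical sample: the middle two addresses are edge cases.
emails = "tom@163.com, amy@mail.qq.com, bob@site.community, joe@abc.org"

# Same pattern as in the article: matches as soon as ".com" has been seen.
loose = re.findall(r"[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.com", emails)

# Stricter sketch: allow dotted subdomains and require a word boundary after "com".
strict = re.findall(r"[a-zA-Z0-9_]+@(?:[a-zA-Z0-9_-]+\.)+com\b", emails)

print(loose)   # ['tom@163.com', 'bob@site.com']
print(strict)  # ['tom@163.com', 'amy@mail.qq.com']
```

The group in `strict` is non-capturing (`(?:...)`), so `re.findall` still returns whole addresses rather than the contents of the group.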