Text Cleaning

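Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/weixin_40639095/article/details/90260265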

Removing HTML tags

from bs4 import BeautifulSoup

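html_text = """
    <div id="app">
        <h3>First h3 tag</h3>
        <h3>h3 tag</h3>
        <input type="text" name="" v-color="'red'">
    </div>
"""

# get_text() drops the tags and returns only the text nodes
clean_html_target = BeautifulSoup(html_text, 'html.parser').get_text()

print(clean_html_target)

### Output (surrounding whitespace and blank lines omitted)
First h3 tag
h3 tag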
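
For cases where installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique rather than the approach shown above, and the names TextExtractor and strip_tags are made up for illustration. Note that this simple version also keeps the contents of script and style elements, which real pages usually need filtered out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; keep non-blank runs only
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def strip_tags(html: str) -> str:
    """Return the text of `html`, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "\n".join(parser.chunks)


sample = '<div id="app"><h3>First</h3> <h3>Second</h3><input type="text"></div>'
print(strip_tags(sample))
```

Because each text node is stripped before it is collected, the result has no leading indentation or blank lines.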

Removing punctuation

import re

text = "use, python. re/ clean! punctuation"

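# Replace every character that is not an ASCII letter with a space.
# Besides punctuation, this also replaces digits and non-English characters.
clean_punctuation = re.sub(r"[^a-zA-Z]", " ", text)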

print(clean_punctuation)

### Output
use  python  re  clean  punctuation
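
If digits and existing spacing should survive, one alternative is to delete only punctuation characters instead of everything that is not a letter. The sketch below uses str.translate with string.punctuation; the helper name remove_punctuation is made up for illustration, and only ASCII punctuation is covered, so full-width marks such as "," or "。" would need to be added to the table separately.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation; letters, digits and whitespace are kept."""
    return text.translate(PUNCT_TABLE)


print(remove_punctuation("use, python3. re/ clean! punctuation"))
```

Unlike the regex version, this removes the punctuation outright instead of replacing it with a space, so words keep their original single-space separation.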

Removing stop words

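# Load one stop word per line into a set for fast membership tests.
# The with-block closes the file; UTF-8 encoding is assumed here.
with open("./stopwords.txt", encoding="utf-8") as f:
    stop_words = {line.rstrip() for line in f}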

text = "python demo clean a b 1 2 3"
text_list = text.split(" ")

clean_stop_words = [w for w in text_list if w not in stop_words]
print(clean_stop_words)

### Output (when stopwords.txt contains a, b, 1, 2 and 3)
['python', 'demo', 'clean']
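
The filtering step can also be tried without a stopwords.txt file. The following is a minimal sketch with an inline stop-word set; the set contents and the helper name remove_stop_words are assumptions chosen to match the sample sentence, not a real stop-word list.

```python
# Hypothetical stop-word set, standing in for the contents of stopwords.txt
STOP_WORDS = {"a", "b", "1", "2", "3"}


def remove_stop_words(text: str, stop_words: set) -> list:
    """Split on whitespace and drop tokens found in `stop_words`."""
    # split() with no argument collapses repeated spaces, so no empty tokens;
    # lowercasing the token makes the comparison case-insensitive
    return [w for w in text.split() if w.lower() not in stop_words]


print(remove_stop_words("python demo clean a b 1 2 3", STOP_WORDS))
```

Calling split() without an argument avoids the empty strings that split(" ") produces when the text contains consecutive spaces, which matters when the input comes from the punctuation step above.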

Getting the file name at the end of a path

import os

os.path.basename("C:/demo/a.txt")
# Returns 'a.txt'
# If the path ends with /, an empty string is returned instead
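
When the last component is wanted even if the path ends with a slash, one option is to strip trailing slashes before calling os.path.basename. The helper name last_component is made up for illustration, and the sketch assumes forward-slash paths like the one above.

```python
import os


def last_component(path: str) -> str:
    """Return the final path component, ignoring trailing forward slashes."""
    return os.path.basename(path.rstrip("/"))


print(repr(os.path.basename("C:/demo/")))  # empty string because of the trailing slash
print(last_component("C:/demo/"))
print(last_component("C:/demo/a.txt"))
```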