Text Preprocessing

马克波罗的鸡腿

Category: Machine Learning. Tags: text preprocessing

Article link: https://blog.csdn.net/weixin_43845795/article/details/96329692

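Preprocessing English Text
1. Expand English contractions, e.g. it's -> it is, using text.replace("it's", "it is").
2. Convert to lowercase: text.lower()
3. Remove punctuation, digits, and special characters. This reduces the dimensionality of the feature space.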

import re
text = "disney has always been hit-or-miss when bringing beloved kids' books to the screen . . . tuck everlasting is a little of both ."
text = re.sub("[^a-zA-Z]", " ", text)
# Remove the extra spaces left by the substitution
' '.join(text.split())
Result:
'disney has always been hit or miss when bringing beloved kids books to the screen tuck everlasting is a little of both'
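
Steps 1 to 3 can be chained into one helper. The following is a minimal sketch, not a complete solution: the function name clean_english and the three-entry contraction table are hypothetical and chosen only for illustration. Contractions have to be expanded before the apostrophes are stripped, otherwise "don't" would be reduced to "don t".

```python
import re

# Hypothetical, deliberately tiny contraction table; extend it for real data.
CONTRACTIONS = {
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
}

def clean_english(text: str) -> str:
    """Apply steps 1-3: expand contractions, lowercase, strip non-letters."""
    # Normalize the typographic apostrophe so both apostrophe styles match.
    text = text.replace("\u2019", "'").lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace everything except ASCII letters with a space.
    text = re.sub("[^a-z]", " ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return " ".join(text.split())

print(clean_english("It\u2019s hit-or-miss, but I don't mind 2 tries!"))
# it is hit or miss but i do not mind tries
```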

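4. Tokenization:
If the words and punctuation marks in a sentence are already separated by spaces, use the split() method, e.g. text.split(), which splits on whitespace into individual elements.
If they are not separated by spaces, use the nltk library, e.g. word_tokenize(text) (imported with from nltk.tokenize import word_tokenize).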
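
To see why split() is not enough when punctuation is attached to words, the sketch below compares it with a small regex tokenizer. The regex tokenizer is only a rough standard-library stand-in used here for illustration; it is not nltk's word_tokenize and it splits contractions differently. The name regex_tokenize is hypothetical.

```python
import re
from typing import List

def regex_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Tuck Everlasting is a little of both, isn't it?"

# split() leaves punctuation glued to the neighbouring word.
print(text.split())
# ['Tuck', 'Everlasting', 'is', 'a', 'little', 'of', 'both,', "isn't", 'it?']

# The regex tokenizer emits punctuation as separate tokens.
print(regex_tokenize(text))
# ['Tuck', 'Everlasting', 'is', 'a', 'little', 'of', 'both', ',', 'isn', "'", 't', 'it', '?']
```
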
5. Stemming:
One option is SnowballStemmer:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("countries")  # Output: 'countri'
6. Remove stop words:

from nltk.corpus import stopwords
stop_words = stopwords.words("english")
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness"
words = [w for w in text.split() if w not in stop_words]
' '.join(words)

Output:
'part charm satin rouge avoids obvious humour lightness'
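
stopwords.words("english") returns a list, so every membership check scans it linearly. Converting the stop words to a set makes lookups constant time on average. The sketch below shows the idea with a small hand-written stop-word set so that it runs without the NLTK corpus; that set is an illustrative assumption, not NLTK's full English list, and remove_stop_words is a hypothetical helper name.

```python
# Illustrative stop-word subset (an assumption, not NLTK's full English list).
STOP_WORDS = {"of", "the", "is", "that", "it", "with", "and", "a", "an", "in", "to"}

def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Drop every whitespace-separated token that is a stop word."""
    # Set membership is O(1) on average, versus a linear scan for a list.
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)

text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness"
print(remove_stop_words(text))
# part charm satin rouge avoids obvious humour lightness
```

With NLTK available, the same function can be called with stop_words=set(stopwords.words("english")).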

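7. Chinese text preprocessing:
1. Remove punctuation and other unwanted characters.
The approach is the same as for English:
text = re.sub("[unwanted punctuation characters]", "", text)  # list the characters to remove inside the brackets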
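
As one concrete reading of the pattern above, the sketch below keeps only Chinese characters and drops everything else. This is an assumption about what counts as "unwanted": it treats anything outside the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as noise, which also removes Latin letters and digits. The name keep_chinese is hypothetical.

```python
import re

# Assumption: "unwanted" means anything outside the basic CJK ideograph range.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

def keep_chinese(text: str) -> str:
    """Remove punctuation, digits, Latin letters, and whitespace."""
    # Replace with the empty string: Chinese text has no spaces between words.
    return NON_CHINESE.sub("", text)

print(keep_chinese("今天，天气真好！Hello 123。"))
# 今天天气真好
```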

Word segmentation can be done with jieba.cut and jieba.cut_for_search. Both return an iterable generator, so you can use a for loop to obtain each segmented word (unicode), or call jieba.lcut and jieba.lcut_for_search to get a list directly.

import jieba
sentence = "..."  # the Chinese sentence to segment
list(jieba.cut(sentence, cut_all=False))
With cut_all=False, jieba uses accurate mode;
with cut_all=True, it uses full mode.
Search engine mode:
list(jieba.cut_for_search(sentence))

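2. Remove stop words:
stop_words = []  # list of Chinese stop words
Segment the text.
Keep only the words from the segmented list that are not in the stop word list; any word that appears in the stop word list is removed.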
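
The filtering step can be sketched as follows. To keep the example runnable without jieba, the token list is hard-coded and merely stands in for the output of a segmenter such as jieba.lcut; both the tokens and the five stop words are illustrative assumptions.

```python
# Assumption: `tokens` stands in for segmenter output (e.g. from jieba.lcut);
# it is hard-coded here so the example runs without jieba installed.
tokens = ["我", "的", "朋友", "在", "北京", "工作", "了", "三", "年"]

# Illustrative stop words only; a real list would usually be loaded from a file.
stop_words = {"的", "了", "在", "是", "和"}

# Keep a token only if it is not a stop word and not pure whitespace.
filtered = [w for w in tokens if w not in stop_words and w.strip()]
print(filtered)
# ['我', '朋友', '北京', '工作', '三', '年']
```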