Defining a text preprocessing function (using apply on a DataFrame)


只想做打工人


Categories: Data Analysis, Study. Tags: python, programming language

Article link: https://blog.csdn.net/weixin_43848469/article/details/113110172

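This article works with the data files from Problem C of the 2020 MCM (Mathematical Contest in Modeling). If you want the dataset, you can download it yourself or leave a comment.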

## Import libraries
import nltk
from nltk.corpus import stopwords

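The preprocessing function combines four simple steps. It is fairly crude, but it is good enough for getting started:
1. delete stop words (for example: this, is, it)
2. lowercase all letters
3. stem each word (reduce it to its root form)
4. use split() to break the text into words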
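
The four steps can be traced on a single sentence. The sketch below is only an illustration: `TOY_STOP_WORDS` and `toy_stem` are hypothetical stand-ins for the NLTK stop-word list and Snowball stemmer used in this article, chosen so the example runs without any downloads. The steps run in the order the function applies them (split, lowercase, filter, stem).

```python
# Stand-ins for illustration only: the article itself uses NLTK's English
# stop-word list and SnowballStemmer. These tiny substitutes are hypothetical.
TOY_STOP_WORDS = {"this", "is", "it", "the", "and", "a"}


def toy_stem(word):
    """Strip a few common suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


sentence = "This dryer is working and it dried the curls"

words = sentence.split()                                  # step 4: split into words
lowered = [w.lower() for w in words]                      # step 2: lowercase
kept = [w for w in lowered if w not in TOY_STOP_WORDS]    # step 1: drop stop words
stemmed = [toy_stem(w) for w in kept]                     # step 3: stem

print(kept)               # ['dryer', 'working', 'dried', 'curls']
print(" ".join(stemmed))  # dryer work dri curl
```

Note that a stem such as "dri" is not a dictionary word. Stemming only aims to map related word forms to the same string, which is usually enough for counting or vectorizing.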

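### Function definition
# stop_words and stemmer are defined in the full listing further down
def preprocess(text):  # add further steps here if your model needs them
    tokens = []
    words = text.split()  # split on whitespace (avoids empty tokens from repeated spaces)
    for token in words:
        token = token.lower()
        # keep only tokens that are not stop words, reduced to their stem
        if token not in stop_words:
            tokens.append(stemmer.stem(token))
    return " ".join(tokens)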

That is the function definition.

With apply, the function can then be run on every row of the column:

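# The stop-word list needs a one-time nltk.download("stopwords")
stop_words = set(stopwords.words("english"))  # a set gives fast membership tests
stemmer = nltk.SnowballStemmer("english")  # reduces words to their stems
print(stop_words)

# Preprocessing function: remove stop words and stem the remaining words
def preprocess(text):  # add further steps here if your model needs them
    tokens = []
    words = text.split()  # split on whitespace (avoids empty tokens from repeated spaces)
    for token in words:
        token = token.lower()
        # keep only tokens that are not stop words, reduced to their stem
        if token not in stop_words:
            tokens.append(stemmer.stem(token))
    return " ".join(tokens)  # one cleaned string per review

# `hair` is the reviews DataFrame, assumed to be loaded beforehand (loading is not shown here)
print(hair['review_body'][11467])  # before preprocessing, for testing
hair['review_body'] = hair['review_body'].apply(preprocess)  # apply runs preprocess on every row
print(hair['review_body'][11467])  # after preprocessing, for testing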

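After running it, the output looks roughly like this:
[Image: screenshot of the printed output]
You can see that stop words such as this, is, and it have been removed, so the basic preprocessing is done.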
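
One thing to watch for when calling apply on a text column is missing values: an empty review is not a string, so string methods such as split() would raise an error on it. The sketch below shows one possible way to guard against that, assuming pandas is installed. The three reviews are made up, and `STOP_WORDS` and `drop_stop_words` are simplified, hypothetical stand-ins for the NLTK-based function above.

```python
import pandas as pd

STOP_WORDS = {"this", "is", "it"}  # hypothetical stand-in for NLTK's list


def drop_stop_words(text):
    """Lowercase the text and remove stop words (no stemming in this sketch)."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


# Made-up reviews; None mimics a row with no review text.
reviews = pd.DataFrame({"review_body": ["This is GREAT", None, "It works well"]})

# Missing values are not strings, so replace them before calling str methods.
reviews["review_body"] = reviews["review_body"].fillna("").apply(drop_stop_words)

print(reviews["review_body"].tolist())  # ['great', '', 'works well']
```

Passing the function itself to apply behaves the same as wrapping it in a lambda, and is a little shorter to read.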
