Text Classification: Preprocessing

Author: 初学者wwl

Category: NLP

Article link: https://blog.csdn.net/qq_43708315/article/details/104038924

Preface:

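Text classification is a classic use case in NLP. It generally consists of three parts: feature engineering, a classifier, and result evaluation with feedback.
Feature engineering in turn consists of text preprocessing, feature extraction, and text representation.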

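This article focuses on text preprocessing: tokenization and text normalization, which make later operations on the text easier, followed by word-frequency counting.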
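
As a rough illustration of the flow described above (tokenize, normalize, then count word frequencies), here is a minimal standard-library sketch. The regex tokenizer, the tiny `STOP_WORDS` set, and the function names `preprocess` and `word_frequencies` are assumptions made for this example; they are not the article's NLTK-based code.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the article's code uses NLTK's English stop-word list instead.
STOP_WORDS = {"the", "a", "is", "and", "of", "on"}


def preprocess(text):
    """Tokenize, lowercase, and drop stop words (simplified sketch)."""
    # Tokenize: keep runs of ASCII letters (an apostrophe is allowed inside a word).
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Normalize: lowercase so that "The" and "the" count as the same word.
    tokens = [tok.lower() for tok in tokens]
    # Remove stop words.
    return [tok for tok in tokens if tok not in STOP_WORDS]


def word_frequencies(text):
    """Return a Counter mapping each remaining word to its frequency."""
    return Counter(preprocess(text))


if __name__ == "__main__":
    sample = "The cat sat on the mat. The mat is warm."
    print(word_frequencies(sample).most_common(2))
```

Lowercasing before the stop-word check matters here: without it, a sentence-initial "The" would not match the lowercase entry in the stop-word set.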

1. Code:

import nltk
from nltk.corpus import stopwords  # English stop-word list from the NLTK corpus
from nltk.tokenize import sent_tokenize  # sent_tokenize splits a passage into sentences
def read_file(filename):
    """
    Read the contents of a document.
    :param filename: name of the document
    :return: the text data as a string
    """
    # Without encoding='UTF-8', reading sometimes fails with
    # "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 11: ..."
    with open(filename, 'r', encoding='UTF-8') as obj_file:
        contents = obj_file.read()  # read the whole file into the string contents
    return contents

def del_stop_words(pre_words):
    """
    Text normalization: remove stop words.
    :param pre_words: list holding the words to process
    :return: list with the stop words removed
    """
    stop_words = set(stopwords.words('english'))  # set() returns a set, whose elements are unique
    # NLTK's English stop words are lowercase, so compare the lowercased word.
    wordsed = [word for word in pre_words if word.lower() not in stop_words]
    return wordsed

def word_fenci_to_words(pre_words):
    """
    Text normalization: split the text into words (tokenization).
    :param pre_words: the string to process
    :return: a list holding the words
    """
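
The body of `word_fenci_to_words` is not shown here. One way such a tokenization step could look is sketched below: split the text into sentences with `sent_tokenize`, then split each sentence into words with `word_tokenize`. The name `tokenize_text`, the regex fallback used when NLTK or its Punkt tokenizer data is unavailable, and the choice to keep only lowercase alphabetic tokens are all assumptions of this sketch, not the author's implementation.

```python
import re


def tokenize_text(text):
    """Split text into lowercase alphabetic word tokens (illustrative sketch)."""
    try:
        # NLTK path: needs the nltk package and its Punkt tokenizer data.
        from nltk.tokenize import sent_tokenize, word_tokenize
        tokens = []
        for sentence in sent_tokenize(text):
            tokens.extend(word_tokenize(sentence))
    except (ImportError, LookupError):
        # Assumed fallback: a plain regex split when NLTK or its data is missing.
        tokens = re.findall(r"[A-Za-z]+", text)
    # Keep purely alphabetic tokens (drops punctuation) and lowercase them.
    return [tok.lower() for tok in tokens if tok.isalpha()]


if __name__ == "__main__":
    print(tokenize_text("Hello world. This is a test."))
```

The resulting list can be passed straight to a stop-word filter such as `del_stop_words`, since the tokens are already lowercase.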
