Word segmentation (jieba), word vectors & bag of words (doc2bow, tfidf), and topic models (lda, lsi): usage conventions

Latest recommended article published on 2023-01-03 17:50:57

依概率收敛

Article link: https://blog.csdn.net/weixin_41341999/article/details/95888236

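This article shows how to segment Chinese text with jieba, including regex cleaning, stop-word handling, and part-of-speech filtering. It then covers the doc2bow bag-of-words model and the TF-IDF transformation, and applies the LDA and LSI topic models to compute text similarity and topic distributions.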

Word segmentation:

1. Read in the Chinese sample data and clean it with a regular expression

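# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
data.content = data.content.str.replace(r"[^\u4e00-\u9fa50-9]", "", regex=True)
# [\u4e00-\u9fa5] matches Chinese characters (the common CJK range); [0-9] matches digits
# [\u4e00-\u9fa50-9] matches any Chinese character or digit
# [^\u4e00-\u9fa50-9] matches every character that is neither Chinese nor a digit; ^ negates the class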
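
To see what this pattern does to a single string, here is a minimal sketch that uses the standard-library `re` module instead of pandas. The helper name `keep_chinese_and_digits` and the sample string are made up for illustration and are not part of the original code.

```python
import re

# Matches anything that is NOT a Chinese character in U+4E00..U+9FA5
# and NOT an ASCII digit.
NON_ZH_DIGIT = re.compile(r"[^\u4e00-\u9fa50-9]")

def keep_chinese_and_digits(text):
    """Drop every character except Chinese characters and the digits 0-9."""
    return NON_ZH_DIGIT.sub("", text)

# Hypothetical sample: Latin letters, spaces, and punctuation should all vanish.
sample = "Hello，世界！2023年 NLP~"
cleaned = keep_chinese_and_digits(sample)
print(cleaned)
```

Note that spaces and Latin letters are removed as well, so any English words in the data are lost at this step.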

2. Stop words, plus removal of punctuation

# Stop words: 'stopwords1.txt', 'stopwords2.txt', and 'custom_stopwords.txt' are plain-text stop-word lists
import codecs
stopwords_file = ['stopwords1.txt', 'stopwords2.txt', 'custom_stopwords.txt']
stopwords = []
for file in stopwords_file:
    # Open each file as UTF-8 (Python 3) and collect all entries into one list
    with codecs.open('./stopwords/' + file, 'r', encoding='utf-8') as f:
        stopwords.extend([word.rstrip() for word in f.readlines()])
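
A list works for storing stop words, but each membership test on a list scans it from the start. The sketch below shows one possible variation, assuming the stop words are collected into a `set` instead. The helper `load_stopwords`, the in-memory stand-in files, and the sample tokens are all invented for the example.

```python
import io

def load_stopwords(handles):
    """Collect one stop word per line from already-opened text files into a set."""
    words = set()
    for f in handles:
        for line in f:
            word = line.strip()
            if word:  # skip blank lines
                words.add(word)
    return words

# In-memory stand-ins for the stop-word files (contents invented for the example).
fake_files = [io.StringIO("的\n了\n"), io.StringIO("是\n\n的\n")]
stopword_set = load_stopwords(fake_files)

# Filter a hypothetical token list: anything in the set is dropped.
tokens = ["我", "的", "猫", "是", "黑色", "了"]
kept = [t for t in tokens if t not in stopword_set]
print(kept)
```

Duplicate entries across files collapse automatically in the set, which is convenient when several stop-word lists overlap.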

# Punctuation

import string
eng_punc = string.punctuation  # all ASCII punctuation characters
zh_punc = '！，。、‘’“”【】|·￥……（）？ '  # custom set of Chinese punctuation
punc = eng_punc + zh_punc  # both sets combined
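
One way to use `punc` is to delete those characters from a string with `str.translate`. This is only a sketch of that idea, not necessarily how the rest of the article uses the variable. The helper `strip_punc` and the sample string are hypothetical, and `zh_punc` is copied as-is from the code above.

```python
import string

eng_punc = string.punctuation
zh_punc = '！，。、‘’“”【】|·￥……（）？ '
punc = eng_punc + zh_punc

# The third argument of str.maketrans lists characters that translate() deletes.
punc_table = str.maketrans("", "", punc)

def strip_punc(text):
    """Remove every character of punc from text."""
    return text.translate(punc_table)

# Hypothetical sample mixing Chinese and ASCII punctuation.
result = strip_punc("你好，世界！(test)100%")
print(result)
```

Building the translation table once and reusing it avoids looping over every punctuation character for each input string.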

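3. Using the stop words, the punctuation, and jieba (optionally enhanced with a custom dictionary), keep only the tokens with the desired parts of speech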

def by_cut(data, pos=None):  # pos: the parts of speech to keep
    import jieba
    # jieba.load_userdict('custom_dict.txt')  # a hand-built dictionary, if available, improves segmentation
    import jieba.posseg as pseg
    res = []
    for line in data:
        try:
            segs = pseg.cut(line)
            for word, p in segs:  # word is one token from line; p is its part-of-speech tag; p[0] is the first letter of that tag
#                 print(seg)
                if p[0] in p