LDA is used for extracting topics from text. Having read a lot about the theory behind it, I now want to put it into practice in a Python environment. For the practice dataset, the English one used here is the Hillary Clinton email dataset:
Preparation:
1. Set up a Python environment
2. pip install gensim
3. Install the NLTK language data (see the setup sketch after this list)
4. Download the Hillary email dataset: HillaryEmails.csv
Those with CSDN credits can find it on CSDN.
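A minimal setup sketch, assuming pip and Python are already installed; the nltk.download call fetches the English stopword list used later for filtering:

# Setup sketch (pip commands shown as comments; run them in a shell):
#   pip install numpy pandas gensim nltk
import nltk
nltk.download('stopwords')  # English stopword list used below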
#coding=utf8
import numpy as np
import pandas as pd
import re
from gensim import corpora, models, similarities
import gensim
from nltk.corpus import stopwords
df = pd.read_csv("./input/HillaryEmails.csv")
df = df[['Id', 'ExtractedBodyText']].dropna()
def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split hyphenated words (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates carry no meaning for the topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times, not meaningful
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses, not meaningful
    text = re.sub(r"https?://[^\s]+|www\.[^\s]+", "", text)  # URLs, not meaningful
    pure_text = ''
    # In case other special characters (digits, etc.) remain, loop over the text and filter them out
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Drop the one-letter fragments left over after removing special characters,
    # so that only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text
docs = df['ExtractedBodyText']
docs = docs.apply(lambda s: clean_email_text(s))
doclist = docs.values
stop_words = set(stopwords.words('english'))  # avoid shadowing the imported stopwords module
texts = [[word for word in doc.lower().split() if word not in stop_words] for doc in doclist]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
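# (Optional sanity check, as a sketch: each bag-of-words entry is (token_id, count),
#  and dictionary[token_id] maps the id back to its word.)
# print(corpus[0][:10])
# print([(dictionary[i], n) for i, n in corpus[0][:10]])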
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
print(lda.print_topics(num_topics=20, num_words=5))
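Once trained, the model can also be queried with new text and persisted for later use. A minimal sketch (the sample sentence is made up for illustration; get_document_topics returns (topic_id, probability) pairs):

# Sketch: infer the topic distribution of a new, unseen document.
# The sentence is a made-up example; clean and tokenize it the same way as the training data.
new_doc = "we will discuss the schedule for the state department meeting"
new_bow = dictionary.doc2bow(clean_email_text(new_doc).lower().split())
print(lda.get_document_topics(new_bow))  # [(topic_id, probability), ...]

# The trained model can also be saved and reloaded (file name is arbitrary):
# lda.save("hillary_lda.model")
# lda = gensim.models.ldamodel.LdaModel.load("hillary_lda.model")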
However, for domestic (Chinese-language) applications, there is still