Implementing an LDA model with sklearn: practical knowledge points for LDA modeling

This article shows how to implement an LDA model with Python's sklearn library and tune its parameters with GridSearchCV. It covers data preprocessing, building the document-word matrix, constructing the LDA model, evaluating and optimizing the model, and predicting the topics of new texts, with detailed code examples showing the topic model obtained under the best parameters.

The 2019 Stata & Python Empirical Econometrics and Web-Scraping Analysis summer workshop starts in just a few days. I have shared LDA topic models on this public account several times before, but the problems considered were fairly simple. This time, in this notebook, I will work through the following tasks hands-on:

Extracting the keywords of each topic

Finding the best model parameters with grid search

Visualizing the topic model

Predicting the topics of newly input text

How to inspect the feature words of a topic

How to get the n most important feature words of each topic

1. Importing the data

Here we use the 20newsgroups dataset.

import pandas as pd

df = pd.read_json('newsgroups.json')
df.head()
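If you do not have the newsgroups.json file at hand, the same data can be pulled directly with sklearn. This is a minimal sketch, assuming the JSON is simply an export of the standard 20 Newsgroups corpus with content, target and target_names columns (the column names are taken from the code below):

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Download the corpus (sklearn caches it locally after the first call)
newsgroups = fetch_20newsgroups(subset='train')

# Rebuild a DataFrame with the columns used in the rest of this post
df = pd.DataFrame({
    'content': newsgroups.data,
    'target': newsgroups.target,
    'target_names': [newsgroups.target_names[i] for i in newsgroups.target],
})
df.head()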

Check which categories appear in target_names:

df.target_names.unique()

Run

array(['rec.autos', 'comp.sys.mac.hardware', 'rec.motorcycles',
       'misc.forsale', 'comp.os.ms-windows.misc', 'alt.atheism',
       'comp.graphics', 'rec.sport.baseball', 'rec.sport.hockey',
       'sci.electronics', 'sci.space', 'talk.politics.misc', 'sci.med',
       'talk.politics.mideast', 'soc.religion.christian',
       'comp.windows.x', 'comp.sys.ibm.pc.hardware', 'talk.politics.guns',
       'talk.religion.misc', 'sci.crypt'], dtype=object)

2. Cleaning the English text

Use regular expressions to remove email addresses and redundant whitespace such as newlines.

Tokenize with gensim's simple_preprocess to get a list of words.

Note:

Installing and configuring nltk and spacy can be troublesome; see the article "Installation and configuration of the NLP libraries nltk and spacy". The nltk corpora and spacy's English model are already included in the tutorial folder.
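If you prefer to install the resources yourself instead of using the tutorial folder, the downloads typically look like the sketch below (assuming the standard nltk stopwords corpus and spacy's small English model, which are what the cleaning function further down relies on):

import nltk
import spacy

# Download the nltk stopword list used by clean_text below
nltk.download('stopwords')

# Download spacy's small English model
# (equivalent to running: python -m spacy download en_core_web_sm)
spacy.cli.download('en_core_web_sm')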

import re
import nltk
import gensim
import spacy
from nltk import pos_tag
from nltk.corpus import stopwords

# Load spacy's English model; the parser and NER are disabled for speed
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def clean_text(text, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    text = re.sub(r'\S*@\S*\s?', '', text)  # remove email addresses
    text = re.sub(r'\s+', ' ', text)        # collapse consecutive spaces, newlines and tabs into one space
    # deacc=True transliterates some non-English letters to ASCII, e.g.
    # "Šéf chomutovských komunistů dostal poštou bílý prášek" becomes
    # u'Sef chomutovskych komunistu dostal postou bily prasek'
    words = gensim.utils.simple_preprocess(text, deacc=True)
    # Stopword removal could be added here (stpwords is prepared but not applied)
    stpwords = stopwords.words('english')
    # Keep only words whose part of speech is 'NOUN', 'ADJ', 'VERB' or 'ADV'
    doc = nlp(' '.join(words))
    text = " ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else ''
                     for token in doc if token.pos_ in allowed_postags])
    return text

test = "From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

clean_text(test)

Run

'where thing subject car be nntp post host rac wam umd edu organization university maryland college park line be wonder anyone out there could enlighten car see other day be door sport car look be late early be call bricklin door be really small addition front bumper be separate rest body be know anyone can tellme model name engine spec year production where car be make history info have funky look car mail thank bring neighborhood lerxst'

Batch-process the content column with the cleaning function clean_text:

df.content = df.content.apply(clean_text)
df.head()

3. Building the document-word matrix

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# vectorizer = TfidfVectorizer(min_df=10)  # a word must appear in at least 10 documents
vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,                        # minimum required occurrences of a word
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}')  # keep tokens of at least 3 characters
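The post is cut off here by the page, but the remaining steps announced at the top (fitting the vectorizer, grid-searching the LDA parameters, and reading off each topic's most important feature words) would typically look something like the sketch below. The parameter grid and the show_topics helper are illustrative assumptions, not the author's original code:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Document-word matrix from the vectorizer defined above
data_vectorized = vectorizer.fit_transform(df.content)

# Grid-search the number of topics and the learning decay (illustrative grid)
search_params = {'n_components': [10, 15, 20], 'learning_decay': [0.5, 0.7, 0.9]}
lda = LatentDirichletAllocation(max_iter=10, learning_method='online', random_state=100)
model = GridSearchCV(lda, param_grid=search_params)
model.fit(data_vectorized)

best_lda = model.best_estimator_
print("Best params:", model.best_params_)
print("Best log-likelihood score:", model.best_score_)
print("Perplexity:", best_lda.perplexity(data_vectorized))

# Hypothetical helper: the n most important feature words of each topic
def show_topics(vectorizer, lda_model, n_words=10):
    # use vectorizer.get_feature_names() on sklearn versions older than 1.0
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_idx = topic_weights.argsort()[::-1][:n_words]
        topic_keywords.append(keywords[top_idx].tolist())
    return topic_keywords

for i, words in enumerate(show_topics(vectorizer, best_lda, n_words=10)):
    print(f"Topic {i}: {words}")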
