Structural Topic Models (Part 1): The stm Package Workflow

Preface

This post walks through the workflow of the stm model from the paper "stm: An R Package for Structural Topic Models". The overall structure follows the original paper, but I offer my own thoughts on the order in which some of the code should be run. Owing to limited time, some problems remain open (e.g., choosing an appropriate number of topics; the later parts of the paper are covered only briefly); I will fill these in when time permits.
If you can offer useful corrections or solutions, I will respond as soon as possible. I hope this helps anyone working with structural topic models (STM) 😁

Problems encountered while reproducing the paper are collected in: Structural Topic Models (Part 2): Reproduction

Original paper, data, and code: stm: An R Package for Structural Topic Models

Official documentation for the stm package



3.0 Reading the data

The sample data poliblogs2008.csv is a collection of blog posts about American politics from the CMU 2008 Political Blog Corpus: American Thinker, Digby, Hot Air, Michelle Malkin, Think Progress, and Talking Points Memo. Each blog has its own political leaning, so every post carries metadata for its posting date and political ideology.

I suggest converting to and reading an xlsx file: the csv file uses commas as separators, which can cause parsing problems when it is read.

# Python: convert the csv to xlsx first
import pandas as pd
csv_file = pd.read_csv('poliblogs2008.csv', low_memory=False, encoding='utf-8')
csv_file.to_excel('poliblogs2008.xlsx', index=False)  # to_excel takes no encoding parameter

# R: read the converted file
library(readxl)
# data <- read.csv("./poliblogs2008.csv", sep = ",", quote = "", header = TRUE, fileEncoding = "UTF-8")
data <- read_excel(path = "./poliblogs2008.xlsx", sheet = "Sheet1", col_names = TRUE)

If your data are in Chinese, see the following articles for word segmentation and other preprocessing before continuing with the workflow:

  1. Reading text from Word files
  2. Chinese text preprocessing

The numbering starts at 3.0 to stay consistent with the original paper.

3.1 Ingest: Reading and processing text data

Ingest the data: process the raw data into the three pieces STM can analyze (documents, vocab, and meta), using the textProcessor() and readCorpus() functions.

The textProcessor() function is designed as a quick and convenient way to process relatively small amounts of text for analysis with the package. It is meant to ingest data in a simple form, such as a spreadsheet where each document sits in a single cell.

# run textProcessor with data$documents as the texts and data as the metadata
processed <- textProcessor(documents = data$documents, metadata = data, wordLengths = c(1, Inf))

The wordLengths = c(3, Inf) parameter of textProcessor() means that words shorter than the minimum length (3 characters by default) or longer than the maximum (Inf by default) are discarded. [User @qq_39172034] suggests setting wordLengths = c(1, Inf) so that single Chinese characters are not dropped.

The paper notes that textProcessor() can handle multiple languages by setting language = "en", customstopwords = NULL, and so on. As of version 0.5 the supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Swedish, and Turkish; Chinese is not supported.
See: textProcessor function - RDocumentation
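For illustration, a minimal sketch of passing these options (the custom stopwords here are hypothetical examples, not from the paper):

# a hypothetical run with an explicit language and custom stopwords
processedEn <- textProcessor(documents = data$documents, metadata = data,
                             language = "en",
                             customstopwords = c("blog", "post"))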

3.2 Prepare: Associating text with metadata

Prepare the data: convert the data format and remove low-frequency words according to a threshold, using the prepDocuments() and plotRemoved() functions.

plotRemoved() plots the number of documents, words, and tokens that would be removed at different thresholds.

pdf("output/stm-plot-removed.pdf")
plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100))
dev.off()

[Figure: documents, words, and tokens removed at different lower.thresh values]

Use the result in this pdf (output/stm-plot-removed.pdf) to choose the value of the lower.thresh parameter in prepDocuments(), which in turn determines the variables docs, vocab, and meta.

The paper notes that prepDocuments() will also re-index all metadata/document relationships if anything changes during processing. For example, if a document is removed entirely in preprocessing because it contains only low-frequency words, prepDocuments() also drops the corresponding row of the metadata. After reading and processing the text data, it is therefore important to inspect the documents and the associated vocabulary to make sure they were preprocessed correctly.

# remove words that appear in 15 or fewer documents
out <- prepDocuments(documents = processed$documents, vocab = processed$vocab, meta = processed$meta, lower.thresh = 15)

docs <- out$documents
vocab <- out$vocab
meta <- out$meta
  • docs: the documents, a list containing word indices and their associated counts.

  • vocab: a character vector containing the words associated with the word indices.

  • meta: a metadata matrix containing the document covariates.

The following shows two short documents: the first contains 5 words, found at positions 21, 23, 87, 98, and 112 of the vocab vector; the first word appears twice and the rest once each. The second document contains 3 words, read the same way.

[[1]]
     [,1] [,2] [,3] [,4] [,5]
[1,]   21   23   87   98  112
[2,]    2    1    1    1    1

[[2]]
     [,1] [,2] [,3]
[1,]   16   61   90
[2,]    1    1    1
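To sanity-check this structure on your own run, a minimal sketch:

docs[[1]]              # index/count matrix of the first document
vocab[docs[[1]][1, ]]  # map the first row (word indices) back to actual words
nrow(meta)             # one metadata row per retained document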

3.3 Estimate: Estimating the structural topic model

STM's key innovation is incorporating metadata into the topic-modeling framework. In STM, metadata can enter the topic model in two ways: **topical prevalence** and topical content. Metadata covariates for topical prevalence allow the observed metadata to affect how frequently a topic is discussed. Covariates in topical content allow the observed metadata to affect the word rates within a given topic, that is, how a particular topic is discussed. Both topical prevalence and topical content are estimated with the stm() function.

Topical prevalence captures how much each topic contributes to a document. Because different documents come from different sources, it is natural to want topical prevalence to vary with the metadata.

Specifically, the paper uses the variable rating (ideology: Liberal or Conservative) as a topical-prevalence covariate. Besides ideology, additional covariates can be added with +, such as the day variable from the raw data (the posting date).

The s() in s(day) is a spline function, a fairly flexible b-spline basis.

The day variable runs from the first to the last day of 2008, like panel data; entering time as a daily factor (365 panels) would cost more than 300 degrees of freedom, so the spline function is introduced to avoid that loss.

The stm package also includes a convenience function s(), which selects a fairly flexible b-spline basis. In the current example we allow for the variable day to be estimated with a spline.
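To see the degrees-of-freedom point concretely, one can compare design-matrix widths; a minimal sketch (assumes stm is loaded so s() is available; column counts are approximate):

# a daily factor uses one column per unique day (~365 columns),
# while the b-spline basis uses only a handful
ncol(model.matrix(~ factor(day), data = out$meta))
ncol(model.matrix(~ s(day), data = out$meta))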

poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab, K = 20, prevalence = ~rating + s(day), max.em.its = 75, data = out$meta, init.type = "Spectral")

In R, the topical-prevalence argument prevalence accepts a formula containing multiple covariates, which may be factorial or continuous. Standard transforms such as log() can be used, as can ns() and bs() from the splines package.

As the EM iterations proceed, the model is considered to have converged when the change in the approximate variational bound becomes sufficiently small.
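After fitting, the bound's trajectory is stored on the model object, so convergence can be inspected directly; a minimal sketch:

# plot the approximate variational bound recorded at each EM iteration;
# a flattening curve indicates convergence before max.em.its was reached
plot(poliblogPrevFit$convergence$bound, type = "l",
     xlab = "EM iteration", ylab = "Approximate variational bound")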

3.4 Evaluate: Model selection and search

  1. Model initialization for a fixed number of topics

Because the posterior of mixed-membership topic models is often non-convex and intractable, the solution depends on the starting values of the parameters (e.g., the word distribution for a specific topic). There are two ways to initialize the model:

  • Spectral initialization: init.type = "Spectral". This is the preferred option.
  • A collapsed Gibbs sampler for LDA.
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab, K = 20, prevalence = ~rating + s(day), max.em.its = 75, data = out$meta, init.type = "Spectral")
  2. Model selection for a fixed number of topics
poliblogSelect <- selectModel(out$documents, out$vocab, K = 20, prevalence = ~rating + s(day), max.em.its = 75, data = out$meta, runs = 20, seed = 8458159)

selectModel() first casts a net of candidate runs, running each model for a small number (fewer than 10) of E and M steps and discarding models with low likelihood; it then runs only the top 20% of models by likelihood until convergence or the maximum number of EM iterations (max.em.its) is reached.

Choose a model based on the semantic coherence and exclusivity displayed by plotModels(); higher values of both indicate a better model.

# plot the models' average scores, using a different plotting symbol for each model
plotModels(poliblogSelect, pch = c(1, 2, 3, 4), legend.position = "bottomright")
# choose model 3
selectedmodel <- poliblogSelect$runout[[3]]

[Figure: plotModels() output, semantic coherence vs. exclusivity of the candidate models]

  3. Model search across numbers of topics
storage <- searchK(out$documents, out$vocab, K = c(7, 10), prevalence = ~rating + s(day), data = meta)

# choose the number of topics by visual inspection
pdf("stm-plot-ntopics.pdf")
plot(storage)
dev.off()

# or inspect the stored runs directly
t7  <- storage$out[[1]]   # K = 7 model
t10 <- storage$out[[2]]   # K = 10 model

Compare two or more candidate numbers of topics through their semantic coherence (SemCoh) and exclusivity (Excl) to settle on a suitable number; see the sketch after the figure below.

[Figure: searchK() diagnostics for the candidate numbers of topics]
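The same two diagnostics can be computed directly on the stored runs; a minimal sketch (assuming searchK() kept the fitted models in storage$out as above; exclusivity() applies only to models without a content covariate):

# average semantic coherence and exclusivity for each candidate K
mean(semanticCoherence(t7, docs));  mean(exclusivity(t7))
mean(semanticCoherence(t10, docs)); mean(exclusivity(t10))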

3.5 Understand: Interpreting the STM by plotting and inspecting results

Once a model is chosen, the functions in the stm package can present its results. To stay consistent with the paper, the initial model poliblogPrevFit is used below, rather than the model chosen with selectModel().

High-probability words per topic: labelTopics() and sageLabels()

Both functions output the words associated with each topic; sageLabels() is meant only for models with a content covariate. In addition, sageLabels() produces more detailed output than labelTopics(), and by default covers the high-probability words of every topic.

# labelTopics() Label topics by listing top words for selected topics 1 to 5.
labelTopicsSel <- labelTopics(poliblogPrevFit, c(1:5))
sink("output/labelTopics-selected.txt", append=FALSE, split=TRUE)
print(labelTopicsSel)
sink()

# sageLabels() produces more detailed output than labelTopics()
sink("stm-list-sagelabel.txt", append=FALSE, split=TRUE)
print(sageLabels(poliblogPrevFit))
sink()

TODO: the outputs of the two functions differ.

Documents highly associated with a topic: findThoughts()

shortdoc <- substr(out$meta$documents, 1, 200)
# texts = shortdoc prints the first 200 characters of each document; n is the number of documents to retrieve
thoughts1 <- findThoughts(poliblogPrevFit, texts=shortdoc, n=2, topics=1)$docs[[1]]
pdf("findThoughts-T1.pdf")
plotQuote(thoughts1, width=40, main="Topic 1")
dev.off()

# how about more documents for more of these topics?
thoughts6 <- findThoughts(poliblogPrevFit, texts=shortdoc, n=2, topics=6)$docs[[1]]
thoughts18 <- findThoughts(poliblogPrevFit, texts=shortdoc, n=2, topics=18)$docs[[1]]
pdf("stm-plot-find-thoughts.pdf")
# mfrow = c(2, 1) lays the plots out in a 2-row, 1-column grid
par(mfrow = c(2, 1), mar = c(.5, .5, 1, .5))
plotQuote(thoughts6, width=40, main="Topic 6")
plotQuote(thoughts18, width=40, main="Topic 18")
dev.off()

Estimating relationships between metadata and topics/topical content: estimateEffect()

out$meta$rating <- as.factor(out$meta$rating)
# since we're preparing these covariates by estimating their effects, we call the estimated effects 'prep'
# we estimate effects across all 20 topics (1:20), using 'rating' and a spline of 'day', with the topic model poliblogPrevFit
# and the metadata in out$meta; "Global" propagates the full estimation uncertainty
# note: when estimating the effect of one covariate, the others are held at their means
prep <- estimateEffect(1:20 ~ rating + s(day), poliblogPrevFit, meta = out$meta, uncertainty = "Global")
summary(prep, topics=1)
summary(prep, topics=2)
summary(prep, topics=3)
summary(prep, topics=4)

The uncertainty argument takes one of "Global", "Local", or "None". The default, "Global", incorporates the estimation uncertainty of the topic proportions into the uncertainty estimates using the method of composition. Users who do not want to propagate the full amount of uncertainty, e.g., to reduce computation time, can choose uncertainty = "None", which generally yields narrower confidence intervals because the additional estimation uncertainty is excluded.
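For instance, a faster run that skips propagating that uncertainty might look like this; a minimal sketch:

# "None" skips the method of composition, so it runs faster but
# generally produces narrower (over-confident) confidence intervals
prepFast <- estimateEffect(1:20 ~ rating + s(day), poliblogPrevFit,
                           metadata = out$meta, uncertainty = "None")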

Output of summary(prep, topics = 1):

Call:
estimateEffect(formula = 1:20 ~ rating + s(day), stmobj = poliblogPrevFit, 
    metadata = meta, uncertainty = "Global")


Topic 1:

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.068408   0.011233   6.090 1.16e-09 ***
ratingLiberal -0.002513   0.002588  -0.971  0.33170    
s(day)1       -0.008596   0.021754  -0.395  0.69276    
s(day)2       -0.035476   0.012314  -2.881  0.00397 ** 
s(day)3       -0.002806   0.015696  -0.179  0.85813    
s(day)4       -0.030237   0.013056  -2.316  0.02058 *  
s(day)5       -0.026256   0.013791  -1.904  0.05695 .  
s(day)6       -0.010658   0.013584  -0.785  0.43269    
s(day)7       -0.005835   0.014381  -0.406  0.68494    
s(day)8        0.041965   0.016056   2.614  0.00897 ** 
s(day)9       -0.101217   0.016977  -5.962 2.56e-09 ***
s(day)10      -0.024237   0.015679  -1.546  0.12216    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

3.6 Visualize: Presenting STM results

Summary visualization

Bar chart of each topic's proportion in the corpus

# see PROPORTION OF EACH TOPIC in the entire CORPUS. Just insert your STM output
pdf("top-topic.pdf")
plot(poliblogPrevFit, type = "summary", xlim = c(0, .3))
dev.off()

Metadata/topic relationship visualization

Contrast of topical prevalence between covariate values

pdf("stm-plot-topical-prevalence-contrast.pdf")
plot(prep, covariate = "rating", topics = c(6, 13, 18),
     model = poliblogPrevFit, method = "difference",
     cov.value1 = "Liberal", cov.value2 = "Conservative",
     xlab = "More Conservative ... More Liberal",
     main = "Effect of Liberal vs. Conservative",
     xlim = c(-.1, .1), labeltype = "custom",
     custom.labels = c("Obama/McCain", "Sarah Palin", "Bush Presidency"))
dev.off()

Topics 6, 13, and 18 are given the custom labels "Obama/McCain", "Sarah Palin", and "Bush Presidency". Topics 6 and 13 sit near the middle ideologically, neither conservative nor liberal, while topic 18 leans conservative.

Topic prevalence over time
pdf("stm-plot-topic-prevalence-with-time.pdf")
plot(prep, "day", method = "continuous", topics = 13,
     model = poliblogPrevFit, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)")
monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")
monthnames <- months(monthseq)
# may warn: "There were 50 or more warnings (use warnings() to see the first 50)"
axis(1, at = as.numeric(monthseq) - min(as.numeric(monthseq)), labels = monthnames)
dev.off()

This code emits warnings, but still produces the figure below; the cause is unclear.

[Figure: prevalence of topic 13 over 2008 with month labels]

Topical content

Shows which words within a topic are more strongly associated with one covariate value than with another.

# TOPICAL CONTENT.
# STM can plot the influence of a covariate included as a topical content covariate.
# A topical content variable allows the vocabulary used to talk about a particular
# topic to vary. First, the STM must be fit with a variable specified in the content option.
# Instead of asking how prevalent a topic is across documents grouped by a metadata covariate,
# we ask how the topic's words are emphasized differently in documents from each category of the covariate.
# We estimate a new STM: the same as before, including the prevalence option, plus a content option.
poliblogContent <- stm(out$documents, out$vocab, K = 20, 
                       prevalence = ~rating + s(day), content = ~rating, 
                       max.em.its = 75, data = out$meta, init.type = "Spectral")
pdf("stm-plot-content-perspectives.pdf")
plot(poliblogContent, type = "perspectives", topics = 10)
dev.off()

[Figure: perspectives plot for topic 10, Liberal vs. Conservative]

Topic 10 concerns Guantanamo (Cuba). Its most frequent words are "detain, prison, court, illegal, torture, enforce, Cuba". The figure shows how liberals and conservatives view this topic differently: liberals emphasize "torture", while conservatives emphasize typical courtroom language such as "illegal" and "legal".

The paper lists its top FREX words as "detaine, prison, court, illeg, tortur, enforc, guantanamo"; these are word stems produced by the stemming step in textProcessor(), so "tortur" is the stem of "torture" rather than a typo.

Plotting vocabulary differences between topics
pdf("stm-plot-content-perspectives-16-18.pdf")
plot(poliblogPrevFit, type = "perspectives", topics = c(16, 18))
dev.off()

[Figure: perspectives plot contrasting topics 16 and 18]

Plotting covariate interactions
# Interactions between covariates can be examined, such that one variable may "moderate"
# the effect of another variable.
# First, we estimate an STM with the interaction.
poliblogInteraction <- stm(out$documents, out$vocab, K = 20,
                           prevalence = ~rating * day, max.em.its = 75,
                           data = out$meta, init.type = "Spectral")
# Prep covariates using the estimateEffect() function, only this time, we include the 
# interaction variable. Plot the variables and save as pdf files.
prep <- estimateEffect(c(16) ~ rating * day, poliblogInteraction,
                       metadata = out$meta, uncertainty = "None")
pdf("stm-plot-two-topic-contrast.pdf")
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Liberal", linecol = "blue", ylim = c(0, 0.12),
     printlegend = FALSE)
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Conservative", linecol = "red", add = TRUE,
     printlegend = FALSE)
legend(0, 0.06, c("Liberal", "Conservative"),
       lwd = 2, col = c("blue", "red"))
dev.off()

[Figure: prevalence of topic 16 over time for Liberal vs. Conservative]

The figure above plots the relationship between time (the day a blog post was published) and rating (Liberal vs. Conservative). The prevalence of topic 16 is plotted as a linear function of time, with rating held at 0 (Liberal) or 1 (Conservative).

3.7 Extend: Additional tools for interpretation and visualization

Word cloud
pdf("stm-plot-wordcloud.pdf")
cloud(poliblogPrevFit, topic = 13, scale = c(2, 0.25))
dev.off()

[Figure: word cloud for topic 13]

Topic correlations
# topicCorr()
# STM permits correlations between topics. Positive correlations between topics indicate
# that both topics are likely to be discussed within a document. A graphical network
# display shows how closely related topics are to one another (i.e., how likely they are
# to appear in the same document). Requires the 'igraph' package.
mod.out.corr <- topicCorr(poliblogPrevFit)
pdf("stm-plot-topic-correlations.pdf")
plot(mod.out.corr)
dev.off()

[Figure: topic correlation network]

stmCorrViz

The stmCorrViz package groups topics via hierarchical clustering, exports them to a JSON object, and visualizes the topic correlations with D3.js.

An encoding (garbled characters) problem remains unresolved.

# stmCorrViz() generates an interactive visualization of topic hierarchy/correlations in a
# structural topic model, via hierarchical clustering of topics exported to JSON and rendered with D3.

stmCorrViz(poliblogPrevFit, "stm-interactive-correlation.html", 
           documents_raw=data$documents, documents_matrix=out$documents)

[Figure: stmCorrViz interactive topic hierarchy]

4 Changing basic estimation defaults

This section explains how to change the default settings of the stm package's estimation commands.

It first discusses how to choose among the different methods for initializing model parameters, then how to set and evaluate convergence criteria, then describes a way to speed up convergence when analyzing tens of thousands of documents or more, and finally covers some variations of the content covariate model that let users control model complexity.
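A minimal sketch of where those knobs live in the stm() call (the values shown are illustrative, not recommendations):

mod <- stm(out$documents, out$vocab, K = 20,
           prevalence = ~rating + s(day), data = out$meta,
           init.type = "LDA",  # collapsed Gibbs LDA initialization instead of "Spectral"
           seed = 8458159,     # makes a non-spectral initialization reproducible
           emtol = 1e-5,       # convergence tolerance on the relative change in the bound
           ngroups = 5,        # memoized inference: process documents in 5 blocks to speed convergence
           max.em.its = 75)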


Addendum

In the Ingest section, the authors mention the quanteda package as an alternative for text processing: it makes it easy to import text and associated metadata, prepare the text for processing, and convert documents into a document-term matrix. Another package, readtext, contains very flexible tools for reading many text formats, such as plain text, XML, and JSON, from which a corpus can easily be built.

To read data produced by other text-processing programs, txtorg can be used. It creates three separate files: a metadata file, a vocabulary file, and a file with the original documents. The default export format is the LDA-C sparse matrix format, which can be read with readCorpus() by setting the "ldac" option.
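A minimal sketch (the file name is hypothetical):

# read an LDA-C sparse-matrix export (e.g., from txtorg) into stm's documents format
ldacDocs <- readCorpus("corpus.ldac", type = "ldac")$documents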


Open questions

How should an appropriate lower.thresh be chosen in Section 3.2?

How should the four plots produced by searchK() in Section 3.4 be used to determine the maximum number of topics?

In the covariate specification, what is the difference between prevalence = ~rating * day and prevalence = ~rating + day?

The stmCorrViz output contains garbled characters.

What is the difference between max.em.its and runs? max.em.its is the maximum number of EM iterations; does each of the runs = 20 candidate models iterate up to that maximum?



Paper: stm: An R Package for Structural Topic Models (harvard.edu)

Reference: "R软件 STM package实操" - bilibili (bilibili.com)

Related GitHub repositories:

JvH13/FF-STM: Web Appendix - Methodology for Structural Topic Modeling (github.com)

dondealban/learning-stm: Learning structural topic modeling using the stm R package. (github.com)

bstewart/stm: An R Package for the Structural Topic Model (github.com)
