Modeling Topic Trends in FOMC Meetings

HANDS-ON TUTORIAL

The Federal Open Market Committee (FOMC) is an important part of the US financial system. It meets 8 times per year and the minutes from these meetings are scrutinized the world over. Using topic modeling, an area of natural language processing, you can analyze trends in FOMC minutes over time. In this article, I show you how.

The Federal Open Market Committee (FOMC) sets monetary policy in the US. It has 12 members who meet 8 times per year to discuss interest rates and other economic matters. Investors pay close attention to the outcomes from these meetings — they can have significant consequences for US and global financial markets.

The minutes of FOMC meetings are released three weeks after each meeting. Through the minutes, investors can get a better understanding of the content of FOMC meetings. This helps with interpreting FOMC decisions and understanding the possible consequences for financial markets.

In this article I show how topic modeling, an area of natural language processing (NLP), can help to analyze the content of FOMC meetings. I use Latent Dirichlet Allocation (LDA), a popular topic modeling approach, to identify the key themes, or topics, discussed in the meeting minutes.

Topic modeling and LDA

Topic modeling is a form of unsupervised learning that can be applied to unstructured text data. It identifies groups of words or phrases that have similar meaning — topics — using statistical techniques.

LDA works by assuming that each document has a mix of underlying (latent) topics, and that each topic is made up of words from a specified dictionary. By observing the words within a set of documents, LDA infers the topics that fit with those words based on a probabilistic framework.

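To make this concrete, here is a minimal, self-contained sketch of LDA on a toy corpus using the gensim package (introduced below); the documents, the choice of two topics and the parameter values are invented purely for illustration:

# Toy illustration of LDA: four tiny 'documents' and two topics (invented for illustration only)
import gensim.corpora as corpora
from gensim.models import LdaModel

docs = [['inflation', 'price', 'wage', 'increase'],
        ['market', 'equity', 'price', 'volatility'],
        ['inflation', 'expectation', 'wage', 'price'],
        ['market', 'treasury', 'volatility', 'spread']]

dictionary = corpora.Dictionary(docs) # Map each word to an integer ID
bow = [dictionary.doc2bow(doc) for doc in docs] # Bag-of-words counts per document

toy_lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2, random_state=0, passes=50)

for topic_id in range(2):
    print(toy_lda.print_topic(topic_id)) # Each topic is a weighted mix of words

print(toy_lda.get_document_topics(bow[0], minimum_probability=0)) # Each document is a mix of topics

The first print shows each topic as a weighted mix of words; the second shows a single document as a mix of topics.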

The mix of topics in a chronological series of text documents, such as FOMC minutes, changes over time. With LDA, you can observe this changing mix. The changing mix of FOMC topics is of interest to investors and market observers as it indicates areas of relative focus in FOMC discussions.

To learn more about topic modeling and LDA, including a hands-on example, see this introductory article.

Analysis approach

I analyze trends in the FOMC minutes using the following approach:

  • Collect the FOMC minutes to be analyzed
  • Prepare the minutes for analysis
  • Run an LDA model on the minutes
  • Extract the changing mix of topics over time

The LDA model requires:

  • The set of minutes transcripts being analyzed — the corpus — which we’ll use for training the model and for topic analysis
  • The dictionary of words to form the model vocabulary — this can be derived from the corpus

I implement LDA using the gensim package in Python. This is a powerful yet accessible package for topic modeling.

My analysis draws upon the work of academic and industry research into FOMC topic modeling (Jegadeesh and Wu [1] and Saret and Mitra [2]). Researchers and practitioners are increasingly using NLP approaches to gain better insights into financial market dynamics. This article presents one such approach.

Implementation

In the following, I step through and explain the key sections of code to implement the analysis in Python (v3.7.7).

For a full listing of the code, please see the expanded version of this article.

Import libraries

We’ll need libraries for requesting and parsing minutes transcripts from the FOMC website (requests and BeautifulSoup), text pre-processing (regular expressions and SpaCy), analyzing and displaying results (pandas, numpy, wordcloud and matplotlib) and LDA modeling (gensim).

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import gensim
import gensim.corpora as corpora
from gensim import models
import matplotlib.pyplot as plt
import spacy
from pprint import pprint
from wordcloud import WordCloud

nlp = spacy.load("en_core_web_lg")
nlp.max_length = 1500000 # Ensure sufficient memory

Sourcing FOMC minutes

You can source FOMC minutes transcripts directly from the FOMC website.

Minutes transcripts on the FOMC website. Screenshot by Author.

For this analysis, I sourced minutes of FOMC meetings from October 2007 to July 2020 (the period for which the current HTML formatting applies). This provides a good sample — spanning the 2007–09 financial crisis through to the period following the onset of the COVID-19 pandemic in 2020.

I set up variables for the first part of the URL path (URLPath) and the URL extension (URLExt), as these are common elements of the URL path for all the minutes transcripts. I then create a list of dates for the minutes being analyzed (MinutesList).

# Define URLs for the specific FOMC minutes
URLPath = r'https://www.federalreserve.gov/monetarypolicy/fomcminutes' # From 2008 onward
URLExt = r'.htm'


# List for FOMC minutes from 2007 onward
MinutesList = ['20071031', '20071211', # 2007 FOMC minutes (part-year on new URL format)
           '20080130', '20080318', '20080430', '20080625', '20080805', '20080916', '20081029', '20081216', # 2008 FOMC minutes     
           '20090128', '20090318', '20090429', '20090624', '20090812', '20090923', '20091104', '20091216', # 2009 FOMC minutes 
           '20100127', '20100316', '20100428', '20100623', '20100810', '20100921', '20101103', '20101214', # 2010 FOMC minutes 
           '20110126', '20110315', '20110427', '20110622', '20110809', '20110921', '20111102', '20111213', # 2011 FOMC minutes 
           '20120125', '20120313', '20120425', '20120620', '20120801', '20120913', '20121024', '20121212', # 2012 FOMC minutes 
           '20130130', '20130320', '20130501', '20130619', '20130731', '20130918', '20131030', '20131218', # 2013 FOMC minutes 
           '20140129', '20140319', '20140430', '20140618', '20140730', '20140917', '20141029', '20141217', # 2014 FOMC minutes                   
           '20150128', '20150318', '20150429', '20150617', '20150729', '20150917', '20151028', '20151216', # 2015 FOMC minutes    
           '20160127', '20160316', '20160427', '20160615', '20160727', '20160921', '20161102', '20161214', # 2016 FOMC minutes
           '20170201', '20170315', '20170503', '20170614', '20170726', '20170920', '20171101', '20171213', # 2017 FOMC minutes
           '20180131', '20180321', '20180502', '20180613', '20180801', '20180926', '20181108', '20181219', # 2018 FOMC minutes
           '20190130', '20190320', '20190501', '20190619', '20190731', '20190918', '20191030', '20191211', # 2019 FOMC minutes
           '20200129', '20200315', '20200429', '20200610', '20200729'] # 2020 FOMC minutes

Setting up the corpus

The corpus is the collection of FOMC meeting transcripts that we’re analyzing. It’s also used for training our LDA model.

There are a number of steps in setting up the corpus:

  • Text pre-processing — we clean the transcripts by removing special characters, extra spaces, stop words and punctuation, then lemmatizing and selecting the parts-of-speech that we wish to retain (nouns, adjectives and verbs) (see this explanatory article to learn more about text pre-processing in natural language workflows)

  • Extract paragraphs from each of the transcripts — analysis by paragraph lends itself to better LDA results, but we ignore small paragraphs as they don’t add much value (I use a variable called minparalength to set the minimum length of paragraph extracted)

  • Set up lists — containing (i) the paragraphs from all the FOMC minutes (FOMCMinutes) as a ‘list of lists’, where each sub-list is a paragraph — this is a format suitable for input into the LDA model, (ii) a single list of all the paragraphs (FOMCWordCloud), ie. not a ‘list of lists’, which is used for generating a word cloud of the corpus, and (iii) a list containing the date and ‘weight’ of each paragraph in the corpus (FOMCTopix) — this is needed to aggregate the topic mixes of each paragraph into a combined topic mix for each meeting (see later).

To calculate the weight of each paragraph, I first calculate the total number of characters across the retained paragraphs of each minutes transcript and store this in a variable called cum_paras. Then, for each paragraph, I calculate the paragraph’s length (number of characters, len(para)) and divide it by cum_paras to arrive at the weight of that paragraph within the minutes transcript. I store the result in FOMCTopix.

Note that the total weights of all paragraphs in a given minutes transcript will sum to 1.

To set up the corpus, I define a function called PrepareCorpus which steps through each of the minutes transcripts, sources it and applies the above steps.

FOMCMinutes = [] # A list of lists to form the corpus
FOMCWordCloud = [] # Single list version of the corpus for WordCloud
FOMCTopix = [] # List to store minutes ID (date) and weight of each para


# Define function to prepare corpus
def PrepareCorpus(urlpath, urlext, minslist, minparalength):


    fomcmins = []
    fomcwordcloud = []
    fomctopix = []
    
    for minutes in minslist:
        
        response = requests.get(urlpath + minutes + urlext) # Get the URL response
        soup = BeautifulSoup(response.content, 'lxml') # Parse the response
        
        # Extract minutes content and convert to string
        minsTxt = str(soup.find("div", {"id": "content"})) # Contained within the 'div' tag
                    
        # Clean text - stage 1
        minsTxt = minsTxt.strip()  # Remove white space at the beginning and end
        minsTxt = minsTxt.replace('\r', '') # Replace the \r with null
        minsTxt = minsTxt.replace('&nbsp;', ' ') # Replace non-breaking space entities with a space
        minsTxt = minsTxt.replace('\xa0', ' ') # Replace non-breaking space characters with a space
        while '  ' in minsTxt:
            minsTxt = minsTxt.replace('  ', ' ') # Remove extra spaces


        # Clean text - stage 2, using regex (as SpaCy incorrectly parses certain HTML tags)    
        minsTxt = re.sub(r'(<[^>]*>)|' # Remove content within HTML tags
                         r'([_]+)|' # Remove series of underscores
                         r'(http[^\s]+)|' # Remove website addresses
                         r'((a|p)\.m\.)', # Remove "a.m." and "p.m."
                         '', minsTxt) # Replace with null


        # Find length of minutes document for calculating paragraph weights
        minsTxtParas = minsTxt.split('\n') # List of paras in minsTxt, where minsTxt is split based on new line characters
        cum_paras = 0 # Set up variable for cumulative character count across all retained paras in a given minutes document
        for para in minsTxtParas:
            if len(para)>minparalength: # Only including paragraphs larger than 'minparalength'
                cum_paras += len(para)
        
        # Extract paragraphs
        for para in minsTxtParas:
            if len(para)>minparalength: # Only extract paragraphs larger than 'minparalength'
                
                topixTmp = [] # Temporary list to store minutes date & para weight tuple
                topixTmp.append(minutes) # First element of tuple (minutes date)
                topixTmp.append(len(para)/cum_paras) # Second element of tuple (para weight), NB. Calculating weights based on pre-SpaCy-parsed text
                            
                # Parse cleaned para with SpaCy
                minsPara = nlp(para)
                
                minsTmp = [] # Temporary list to store individual tokens
                
                # Further cleaning and selection of text characteristics
                for token in minsPara:
                    if token.is_stop == False and token.is_punct == False and (token.pos_ == "NOUN" or token.pos_ == "ADJ" or token.pos_ =="VERB"): # Retain words that are not a stop word nor punctuation, and only if a Noun, Adjective or Verb
                        minsTmp.append(token.lemma_.lower()) # Convert to lower case and retain the lemmatized version of the word (this is a string object)
                        fomcwordcloud.append(token.lemma_.lower()) # Add word to WordCloud list
                fomcmins.append(minsTmp) # Add para to corpus 'list of lists'
                fomctopix.append(topixTmp) # Add minutes date & para weight tuple to list for storing
            
    return fomcmins, fomcwordcloud, fomctopix

We can now call the above function to prepare our corpus.

# Prepare corpus
FOMCMinutes, FOMCWordCloud, FOMCTopix = PrepareCorpus(urlpath=URLPath, urlext=URLExt, minslist=MinutesList, minparalength=200)

I set minparalength to 200 — this seems to be a good length for capturing paragraphs which have meaningful content. Shorter paragraphs tend to contain greetings or administrative comments.

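As a quick optional check on the prepared corpus, we can confirm that the paragraph weights within each minutes transcript sum to 1 (this assumes FOMCTopix has been populated by PrepareCorpus as above, with each entry a [date, weight] pair):

# Optional check: paragraph weights within each minutes transcript should sum to (approximately) 1
weight_check = pd.DataFrame(FOMCTopix, columns=['Date', 'Weight'])
print(weight_check.groupby('Date')['Weight'].sum().head()) # Expect values of ~1.0 for each meeting date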

Inspecting the corpus

We can see what the prepared corpus looks like by generating a word cloud.

# Generate and plot WordCloud for full corpus
wordcloud = WordCloud(background_color="white").generate(','.join(FOMCWordCloud)) # NB. 'join' method used to convert the list to text format
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Word cloud of FOMC minutes corpus. Image by Author.

We see a number of words in the corpus that are typical of FOMC meetings, including “federal”, “fund rate”, “market” and “monetary policy”.

Training the LDA model

To train our LDA model, we first need to form a dictionary from our corpus. We map the corpus to word IDs, then convert the words using a bag-of-words approach and finally apply TF-IDF to the result. This process is called text representation.

# Form dictionary by mapping word IDs to words
ID2word = corpora.Dictionary(FOMCMinutes)

# Set up Bag of Words and TFIDF
corpus = [ID2word.doc2bow(doc) for doc in FOMCMinutes] # Apply Bag of Words to all documents in corpus
TFIDF = models.TfidfModel(corpus) # Fit TF-IDF model
trans_TFIDF = TFIDF[corpus] # Apply TF-IDF model

We use TF-IDF as it produces better results than using bag-of-words alone. TF-IDF adjusts for words that appear frequently but have low semantic value, relative to words that appear infrequently but with higher semantic value. This favors words which have more meaning in the context of the corpus.

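To see this effect on our corpus, we can compare the raw bag-of-words counts with the TF-IDF weights for a single paragraph. The snippet below is an optional inspection step (the choice of the first paragraph and of ten words is arbitrary):

# Optional: compare raw counts with TF-IDF weights for the first paragraph in the corpus
sample_counts = dict(corpus[0]) # {word ID: raw count}
sample_weights = dict(trans_TFIDF[0]) # {word ID: TF-IDF weight}
for word_id in list(sample_counts)[:10]:
    print(ID2word[word_id], sample_counts[word_id], round(sample_weights.get(word_id, 0), 3))

Words that appear in nearly every paragraph end up with low TF-IDF weights relative to their raw counts.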

For more information about bag-of-words, TF-IDF and other text representations, see this explanatory article.

We next select the model parameters that we wish to use and run the model.

SEED = 130 # Set random seed
NUM_topics = 8 # Set number of topics
ALPHA = 0.15 # Set alpha
ETA = 1.25 # Set eta

# Train LDA model using the corpus
lda_model = gensim.models.LdaMulticore(corpus=trans_TFIDF, num_topics=NUM_topics, id2word=ID2word, random_state=SEED, alpha=ALPHA, eta=ETA, passes=100)

The parameters shown above are key inputs for an LDA model. NUM_topics, in particular, needs to be set by the user. The other parameters (SEED, ALPHA and ETA) help to produce better results with fine-tuning.

To learn more about LDA parameters and the process for fine-tuning them, please see the sections on ‘Model evaluation’ and ‘Model improvement’ in this introductory article.

How do we evaluate the results of our LDA model?

There are quantitative approaches for doing this, but when applied to a text corpus it’s helpful to produce results that have a sensible human interpretation. This requires judgement.

In terms of quantitative measures, a common measure for evaluating LDA models is the coherence score. This measures the semantic similarity (likeness of meaning) between the words in each topic of an LDA model. All else equal, a higher coherence score is better.

We can measure the coherence of our model using the CoherenceModel in gensim.

# Set up coherence model
coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=FOMCMinutes, dictionary=ID2word, coherence='c_v')

# Calculate coherence
coherence_lda = coherence_model_lda.get_coherence()

The coherence score for our model is:

Coherence Score: 0.63779581

This is a fairly good result, based on careful selection of the above parameters.

How did we arrive at our parameters?

In choosing the number of topics (NUM_topics), I was guided by Jegadeesh and Wu [1], who choose 8 topics for their LDA model — I chose the same, as it tended to produce sensible topic interpretations.

To select the other parameters, you can explore the effect of changing parameter values on model coherence. The following code does this for the ALPHA parameter — you can use this as a template to fine-tune any of the other parameters.

# Coherence values for varying alpha
def compute_coherence_values_ALPHA(corpus, dictionary, num_topics, seed, eta, texts, start, limit, step):
    coherence_values = []
    model_list = []
    for alpha in range(start, limit, step):
        model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=seed, eta=eta, alpha=alpha/20, passes=100)
        model_list.append(model)
        coherencemodel = gensim.models.CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


model_list, coherence_values = compute_coherence_values_ALPHA(dictionary=ID2word, corpus=trans_TFIDF, num_topics=NUM_topics, seed=SEED, eta=ETA, texts=FOMCMinutes, start=1, limit=20, step=1)


# Plot graph of coherence values by varying alpha
limit=20; start=1; step=1;
x_axis = []
for x in range(start, limit, step):
    x_axis.append(x/20)
plt.plot(x_axis, coherence_values)
plt.xlabel("Alpha")
plt.ylabel("Coherence score")
plt.legend(["coherence"], loc='best') # Pass labels as a list (a bare string would be split into individual characters)
plt.show()

The output of the above code is:

LDA model coherence by varying Alpha. Image by Author.

The above plot suggests that an alpha of 0.15 produces a high coherence — this is the value of alpha adopted.

Note that by setting the SEED parameter, we ensure that the model will produce the same results with repeated runs.

Analyzing the topic mix

Since we’re using the same corpus for training and analysis, we don’t need to run the trained LDA model on a new set of documents. Instead, we can go straight to analyzing the corpus.

Our goal is to calculate the topic mix for each of the minutes transcripts in the corpus. By plotting the topic mix over the chronological sequence of the meetings (since each meeting occurred on a given date), we can observe any trends or changes in the topic mix over time.

Recall that we divided our corpus into individual paragraphs, hence our model will produce a topic mix for each paragraph of each transcript in the corpus. The topic mix is a proportionate allocation to each of the (NUM_topics) topics in the model for each paragraph. The proportions across all topics in the paragraph will sum to 1.

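For example, we can inspect the topic mix of a single paragraph directly (an illustrative snippet; the first paragraph is chosen arbitrarily, and minimum_probability=0 tells gensim to return all topics, including those with negligible proportions):

# Inspect the topic mix of the first paragraph: a list of (topic ID, proportion) tuples
sample_mix = lda_model.get_document_topics(trans_TFIDF[0], minimum_probability=0)
print(sample_mix)
print(sum(proportion for _, proportion in sample_mix)) # Should be (approximately) 1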

We now need to convert these paragraph-level topic mixes to document-level topic mixes (ie. to create an aggregate topic mix for each meeting).

Once again I was guided by Jegadeesh and Wu [1]. They calculate document-level topic mixes (each document being a minutes transcript in our case) using a weighted sum of paragraph-level topic mixes — I adopt the same approach.

We’ve already calculated each paragraph’s weight when we set up our corpus and stored it in FOMCTopix. Let’s call these weights the ‘document-weight’ of each paragraph. Using this, we generate aggregate topic mixes as follows:

  • Extract the topic mix for each paragraph and store the result in FOMCTopix alongside the paragraph’s meeting date and its document-weight

  • Multiply each topic proportion in the paragraph’s topic mix by the paragraph’s document-weight

  • For each document (ie. individual meeting transcript), sum the topic proportions, topic by topic, across all paragraphs in the transcript

We will end up with the weighted topic mixes for each minutes transcript, where the weights are based on the relative lengths of the paragraphs within each transcript.

I extract each paragraph’s topic mix using the get_document_topics method of gensim and append the results to FOMCTopix, which is in a list format. I then convert this list to a data frame object and call the pivot_table method to sum the topic proportions across all paragraphs within each minutes transcript.

# Generate weighted topic proportions across all paragraphs in the corpus
para_no = 0 # Set paragraph counter
for para in FOMCTopix:
    TFIDF_para = TFIDF[corpus[para_no]] # Apply TF-IDF model to the individual paragraph
    # Generate and store weighted topic mix for each para
    # NB. minimum_probability=0 returns every topic for every paragraph, keeping the columns aligned below
    for topic_weight in lda_model.get_document_topics(TFIDF_para, minimum_probability=0): # List of tuples ("topic number", "topic proportion") for each para, where 'topic_weight' is the (iterating) tuple
        FOMCTopix[para_no].append(FOMCTopix[para_no][1]*topic_weight[1]) # Weights are the second element of the pre-appended list, topic proportions are the second element of each tuple
    para_no += 1


# Generate aggregate topic mix for each minutes transcript
# Form dataframe of weighted topic proportions (paragraphs) - include any chosen topic names
FOMCTopixDF = pd.DataFrame(FOMCTopix, columns=['Date', 'Weight', 'Inflation', 'Topic 2', 'Consumption', 'Topic 4', 'Market', 'Topic 6', 'Topic 7', 'Policy'])


# Aggregate topic mix by minutes documents (weighted sum of paragraphs)
TopixAggDF = pd.pivot_table(FOMCTopixDF, values=['Inflation', 'Topic 2', 'Consumption', 'Topic 4', 'Market', 'Topic 6', 'Topic 7', 'Policy'], index='Date', aggfunc=np.sum)

Note that I have assigned names to some of the topics in the above code — I’ll discuss this in the next section.

Interpreting topics

Although LDA topic modeling is a quantitative process, the identified topics may not always lend themselves to easy interpretation. You can apply judgment in how you label and select topics depending on your analysis.

You can explore topic contents by generating word clouds.

topic = 0 # Initialize counter
while topic < NUM_topics:
    # Get topics and frequencies and store in a dictionary structure
    topic_words_freq = dict(lda_model.show_topic(topic, topn=50))
    topic += 1

    # Generate Word Cloud for topic using frequencies
    wordcloud = WordCloud(background_color="white").generate_from_frequencies(topic_words_freq)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

In our case, I had set 8 topics in the modeling (NUM_topics = 8) based on guidance from Jegadeesh and Wu [1]. This resulted in topics that had a high coherence and good interpretability. Not all of the 8 topics were entirely useful however — some of them were a bit ‘noisy’ and didn’t add much to the analysis. I therefore chose 4 topics on which to focus.

Saret and Mitra [2] also used 8 topics in their modeling but reconfigured this into a smaller number of topics (6 in their analysis), based on what they considered to be sensible interpretations.

For the 4 topics which I chose, I assigned labels based on the words in the topics and also in comparison with labels used by Jegadeesh and Wu [1]. The topics and labels that I chose are:

  • Inflation
  • Consumption
  • Market
  • Policy

The word clouds for these topics are shown below.

INFLATION topic. Image by Author.
CONSUMPTION topic. Image by Author.
MARKET topic. Image by Author.
POLICY topic. Image by Author.

Topic mix results

We’re now ready to plot the topic mix of our chosen topics over time.

# Plot results - select which topics to print
TopixAggDF.plot(y=['Inflation', 'Consumption', 'Market', 'Policy'], kind='line', use_index=True)

Let’s see what this looks like:

Topic mix over time of FOMC minutes. Image by Author.

The shaded region (2008–09) represents a US recession period (NBER-designated). The x-axis shows the date of FOMC meetings and the y-axis shows the proportion of each meeting that each topic makes up, based on our analysis.

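The recession shading is a plotting detail rather than an output of the topic model. Below is one way it could be added with matplotlib; the conversion of the index to dates and the band boundaries (the NBER-dated recession of December 2007 to June 2009) are assumptions for illustration:

# Sketch: re-plot the topic mix with the 2008-09 US recession shaded
dates = pd.to_datetime(TopixAggDF.index, format='%Y%m%d') # Meeting dates stored as 'yyyymmdd' strings
fig, ax = plt.subplots()
for topic_name in ['Inflation', 'Consumption', 'Market', 'Policy']:
    ax.plot(dates, TopixAggDF[topic_name], label=topic_name)
ax.axvspan(pd.Timestamp('2007-12-01'), pd.Timestamp('2009-06-30'), color='grey', alpha=0.2) # NBER recession band
ax.set_ylabel('Topic proportion')
ax.legend(loc='best')
plt.show()

Using matplotlib directly (rather than the pandas plot call above) keeps the date axis explicit, which makes the shaded band straightforward to place.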

Interpreting the results

We observe the following from our plot of the FOMC topic mix over time:

  • There is a noticeable variation in the topic mix over time, particularly for the policy and consumption topics

  • There was a significant increase in the time allocated to discussing policy in the period following the 2008–09 recession, and this coincided with a significant reduction in the time spent discussing consumption

  • Time spent discussing inflation and the market has been more stable, albeit with a higher proportion during the recession and some divergence during 2014–2018 (ie. more time on inflation and less time on the market)

These observations appear to make sense in light of historical circumstances.

Policy discussions in particular have been an important area for the FOMC since the 2008–09 recession. This reflects the role that monetary policy has played in US financial markets since then, and the significant policy decisions that the FOMC has made (eg. cutting the federal funds target rate from over 5% in 2007 to below 0.25% by 2009).

The increase in time spent discussing inflation since 2014 (until recently) reflects the degree of concern the FOMC has had around the management of inflation (which has been persistently low).

Similarly, it’s not surprising that more time than usual has been spent discussing financial markets during the recession and also more recently (starting from a low level prior to 2007). There were important market operations that were initiated or amended during these times.

While subject to some judgment and interpretation, the LDA model has provided a useful quantitative synopsis of the FOMC topic mix over time. This illustrates the potential for NLP techniques to assist the money management process.

In the words of Saret and Mitra [2] (pp. 3–4, with slight modification):

“Market observers trying to glean insights from [FOMC] meeting minutes once needed to rely on the subjective interpretation of so-called expert “Fed watchers” or their own interpretation. Now, asset allocators can apply natural language processing techniques to extract insights from the FOMC’s published meeting minutes, turning qualitative inputs into more easily analyzed, quantitative data.”

Conclusion

Topic modeling is a versatile and evolving area of natural language processing. In this article, we’ve seen how topic modeling can be used to observe the changing mix of topics in a text corpus over time.

Such use cases open the door to a range of real-world applications. In relation to money management, topic modeling is already being added to the toolkit for making sense of financial markets (as evidenced by Saret and Mitra [2]).

This article was originally published in the High Demand Skills blog

Original article: https://towardsdatascience.com/modeling-topic-trends-in-fomc-meetings-a10cf3d8bac5
