Topic Modeling of Earnings Call Transcripts with Latent Dirichlet Allocation (LDA)

Most listed US companies host earnings calls every quarter. These are conference calls where management discusses financial performance and company updates with analysts, investors and the media. Earnings calls are important — they highlight valuable information for investors and provide an opportunity for interaction through Q&A sessions.

There are hundreds of earnings calls held each quarter, often with the release of detailed transcripts. But the sheer volume of those transcripts makes analyzing them a daunting task.

Topic modeling is a way to streamline this analysis. It’s an area of natural language processing that helps to make sense of large volumes of text data by identifying the key topics or themes within the data.

In this article, I show how to apply topic modeling to a set of earnings call transcripts. I use a popular topic modeling approach called Latent Dirichlet Allocation and implement the model using Python.

I also show how topic modeling can require some judgement, and how you can achieve better results by adjusting key parameters.

What is topic modeling?

Topic modeling is a form of unsupervised learning that can be applied to unstructured data. In the case of text documents, it identifies words or phrases that have a similar meaning and groups them into ‘topics’ using statistical techniques.

Topic modeling is useful for organizing text documents based on the topics within them, and for identifying the words that make up each topic. It can be helpful in automating a process for classifying documents or for uncovering concealed meaning (hidden semantic structures) within text data.

When applied to natural language, topic modeling requires interpretation of the identified topics — this is where judgment plays a role. The goal is to ensure that the topics and their allocations make sense for the context and purpose of the modeling exercise.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a popular approach for topic modeling. It works by identifying the key topics within a set of text documents, and the key words that make up each topic.

Under LDA, each document is assumed to have a mix of underlying (latent) topics, each topic with a certain probability of occurring in the document. Individual text documents can therefore be represented by the topics that make them up. In this way, LDA topic modeling can be used to categorize or classify documents based on their topic content.

Each LDA topic model requires:

  • A set of documents for training the model — the training corpus
  • A dictionary of words to form the vocabulary used in the model — this can be derived from the training corpus

Once a model has been trained, it can be applied to a new set of documents to identify the topics in those new documents.

In this article, I show how to implement LDA using the gensim package in Python. This is a powerful yet accessible package for topic modeling.

Model development, evaluation and deployment

In the following, I step through the process of training, evaluating, refining and applying an LDA topic model, with associated segments of code (Python v3.7.7).

For a full listing of the code, please see the expanded version of this article.

Importing libraries

We’ll need to first import libraries for requesting and parsing earnings call transcripts (requests and BeautifulSoup), text pre-processing (SpaCy), displaying results (matplotlib, pprint and wordcloud) and LDA (gensim).

import requests
from bs4 import BeautifulSoup
import gensim
import gensim.corpora as corpora
from gensim import models
import matplotlib.pyplot as plt
import spacy
from pprint import pprint
from wordcloud import WordCloud
from mpl_toolkits import mplot3d

nlp = spacy.load("en_core_web_lg")
nlp.max_length = 1500000  # Ensure sufficient memory for long transcripts

Sourcing earnings call transcripts

Earnings call transcripts are available from company websites or through third-party providers. One popular source is the Seeking Alpha website, from which recent transcripts are freely available.

Seeking Alpha earnings call transcripts. Image by Author.

Individual transcripts can be parsed directly through URL links. The following is an example for a Dell earnings call transcript. I store the resulting text in a variable called ECallTxt.

URL_text = r'
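The code listing above is truncated after the `URL_text` assignment. As a sketch of how a transcript can be requested and parsed with the requests and BeautifulSoup libraries imported earlier: the URL below is a hypothetical placeholder (substitute a real Seeking Alpha transcript link), and the assumption that the transcript body lives in the page's `<p>` tags is illustrative, since the actual page structure may differ:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical placeholder URL -- substitute a real transcript link.
URL_text = "https://seekingalpha.com/article/example-earnings-call-transcript"

def fetch_transcript(url):
    """Download a transcript page and return its paragraph text."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return extract_text(response.text)

def extract_text(html):
    """Join the text of all <p> tags, collapsing whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(p.get_text() for p in soup.find_all("p"))
    return " ".join(text.split())

# ECallTxt = fetch_transcript(URL_text)  # network call, run when ready

# Example on a small HTML snippet (no network needed):
sample = "<html><body><p>Good afternoon.</p><p>Revenue grew 5%.</p></body></html>"
print(extract_text(sample))  # Good afternoon. Revenue grew 5%.
```

The resulting text would be stored in `ECallTxt`, as the article describes, before being pre-processed with spaCy and passed to the topic model.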