Enron Dataset Analysis: Analyzing the Enron Accounting Scandal with Natural Language Processing


Intro

Natural Language Processing (NLP) has been gaining traction in recent years, allowing us to understand unstructured text data in ways that were not possible before. One of the promises of NLP is to detect fraud in companies and shed light on potential violations at an early stage.

About the dataset

I only managed to find two earnings call transcripts online, and only one of them is readable when converted from PDF to text. You can find the original document here.

The earnings call transcript used in this article is from Enron’s conference call held on November 14, 2001. Enron filed for bankruptcy on December 2, 2001.

Pre-processing the dataset

As you can see from the original earnings call PDF, the document is not digital and contains numbers in between the conversations.

[Image: A snapshot of Enron’s earnings call in PDF format.]

To feed the spoken sentences into R for analysis, I used Robotic Process Automation (RPA) to massage the text data into a more structured format. Below is a snapshot of the organized text data in CSV format.

[Image: Enron’s earnings call in CSV format.]

I then tokenized the text and removed common stop words from the dataset. To make the results more insightful, I also dropped all the numbers and a few filler words such as “um,” “uh,” etc. from the dataset. After cleaning the dataset, I wa
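The cleaning steps described above (tokenizing, removing stop words, dropping numbers and filler words) can be sketched roughly as follows. This is only an illustration, not the code used in the article: the analysis there was done in R, while this sketch uses plain Python with the standard library. The stop word list is a tiny made-up subset, the function name `clean_tokens` is hypothetical, and the example sentence is invented rather than taken from the transcript.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "for", "i", "in", "is",
    "it", "of", "on", "that", "the", "this", "to", "we", "with", "you",
}

# Filler words named in the article; extend as needed.
FILLER_WORDS = {"um", "uh"}

# A token is a run of letters/digits, optionally followed by an
# apostrophe part (e.g. "we're").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def clean_tokens(sentence):
    """Lowercase and tokenize a sentence, then drop stop words,
    filler words, and any token that contains a digit."""
    # Normalize curly apostrophes so contractions stay in one token.
    text = sentence.lower().replace("\u2019", "'")
    tokens = TOKEN_PATTERN.findall(text)
    return [
        t for t in tokens
        if t not in STOP_WORDS
        and t not in FILLER_WORDS
        and not any(ch.isdigit() for ch in t)
    ]


# Invented example sentence, not a quote from the earnings call.
example = "Um, we reported 2001 earnings, uh, in the third quarter."
print(clean_tokens(example))  # ['reported', 'earnings', 'third', 'quarter']
```

In practice the same function would be applied to each spoken sentence in the CSV, with a fuller stop word list swapped in for the small set used here.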
