Building an Artificially Intelligent System to Augment Financial Analysis

In the world of finance, you will never run out of reading material. Dozens of documents, reports, and studies show up in your inbox daily, demanding your attention. However, not all of them are relevant to your interests, especially if you are an analyst covering a specific product in a specific region. In this project, completed in October 2019, I was part of a team that trained and deployed a machine learning system to classify tax update documents by topic and location.

The Problem: More documents than you could shake a stick at

Our client, GIC, is a sovereign wealth fund established by the Singapore government, with a portfolio spanning dozens of countries and territories. They were one of the project sponsors in AISG’s 100 Experiments (100E) programme. We were engaged by their tax advisory team, whose job (among other things) was to keep track of changes in the tax code and to study their implications for their portfolio of investments. This was a time-consuming task, as they had to sift through mountains of documents in their inbox to identify information specific to changes in their specialised tax categories before they could even get started on the analysis.

Our solution was to build a document-labeling algorithm that would parse a document and identify both the specific tax topics it related to and the geographical region it affected. In machine learning this is known as a multi-label classification problem, since each document can cover multiple topics.
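
To make this concrete, a document’s labels can be encoded as a binary indicator vector with one position per topic. A minimal sketch with scikit-learn follows; the topic names are invented for illustration:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each document can carry several topic labels at once.
docs_labels = [
    {"withholding_tax", "singapore"},
    {"capital_gains", "transfer_pricing", "europe"},
    {"withholding_tax"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(docs_labels)

print(mlb.classes_)  # every label seen, in sorted order
print(Y)             # one row per document, one column per label
```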

Data drives the solution to every AI problem

Before we could train our machine learning model, we first needed data. Due to content sensitivity, our client could not simply give us a dump of their emails. Instead, we worked with them to construct an initial labeled dataset of 200 publicly available documents. This dataset was too small for any significant training, but it served as a ‘gold standard’ to help validate our model accuracy, and as a basis for some exploratory data analysis.

Our initial exploration of the data identified 10 main categories, and over 100 sub-categories that fell under them. In the course of our discovery process, we found that the 10 main categories were easily distinguishable; in fact, the updates the analysts received were already sorted according to these main categories. The real value thus lay in identifying which sub-categories each document belonged to, and this required a deeper understanding of each document.

To deal with the lack of training data, we went online to find documents for each sub-category, downloading every document whose title contained words matching that sub-category. This gave us a ‘weakly’ labeled training dataset of several thousand documents.
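
A rough sketch of this weak-labeling step, assuming (title, text) pairs already scraped from the web; the keyword lists stand in for the real sub-categories:

```python
# Keyword lists are illustrative, not the client's actual taxonomy.
SUBCATEGORY_KEYWORDS = {
    "withholding_tax": ["withholding tax"],
    "transfer_pricing": ["transfer pricing"],
    "gst": ["goods and services tax", "gst"],
}

def build_weak_dataset(documents):
    """documents: iterable of (title, text) pairs scraped from the web."""
    dataset = []
    for title, text in documents:
        for subcat, keywords in SUBCATEGORY_KEYWORDS.items():
            if any(kw in title.lower() for kw in keywords):
                dataset.append((text, subcat))
                break  # one weak label per document, matching our data
    return dataset
```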

One problem remained: our training data had one label for each document, but we were supposed to build a model that could predict multiple labels per document.

Training the machine learning model

I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times. – Bruce Lee

The problem of training on a multi-class dataset while producing a multi-label output was handled in the model design: instead of training one model that could predict 100 labels, I trained 100 models that each predicted one label. Each model trains on only one topic and becomes the expert at identifying whether that particular topic is present. When a new document is encountered, every model makes a prediction, and the results are collated to retrieve multiple labels for the document. This design had the added benefit of future-proofing the system: if I want to add categories after the model has been trained, I do not need to retrain the entire model. Instead, I only have to train a model on the additional category and add it to the original ensemble.
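
In code, this one-model-per-topic design (often called binary relevance, or one-vs-rest) might look like the sketch below, with TF-IDF features and logistic regression standing in for the actual classifiers:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_topic_experts(texts, labels):
    """texts: document strings; labels: one weak topic label per document."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    X = vectorizer.fit_transform(texts)
    experts = {}
    for topic in set(labels):
        # Binary target: is this document about `topic` or not?
        y = [int(lab == topic) for lab in labels]
        experts[topic] = LogisticRegression(max_iter=1000).fit(X, y)
    return vectorizer, experts

def predict_topics(text, vectorizer, experts, threshold=0.5):
    """Every expert votes; collate all topics that cross the threshold."""
    x = vectorizer.transform([text])
    return [topic for topic, clf in experts.items()
            if clf.predict_proba(x)[0, 1] >= threshold]
```

Adding a category later amounts to training one more expert and dropping it into the `experts` dictionary; nothing else changes.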

The model itself was actually an ensemble of several models. Some focused on how often each word occurs, weighted by how rare it is across documents (known as term frequency–inverse document frequency, or TF-IDF), while others tried to gain a semantic understanding of the document with a language model pretrained on English text.
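
One simple way an ensemble can combine these two views is to average their per-topic probabilities. In this sketch the two callables and the fixed weighting are assumptions for illustration, not the project’s actual blending rule:

```python
def ensemble_score(text, tfidf_expert, embedding_expert, weight=0.5):
    """Blend the frequency-based and the semantic view of a document.

    tfidf_expert / embedding_expert: callables returning the probability
    that the document covers one particular topic.
    """
    return weight * tfidf_expert(text) + (1 - weight) * embedding_expert(text)
```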

Additional features were generated in the following manner (both generators are sketched in code after the list):

  1. The spaCy matcher was used to highlight certain important keywords identified by subject matter experts

  2. An algorithm called k-means clustering was used to automatically group documents into unsupervised categories
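
Both feature generators can be sketched as follows; the matcher patterns and the cluster count are invented for illustration:

```python
import spacy
from spacy.matcher import Matcher
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

# 1. Keyword matcher; the patterns stand in for the expert-curated list.
matcher = Matcher(nlp.vocab)
matcher.add("TAX_TERMS", [
    [{"LOWER": "withholding"}, {"LOWER": "tax"}],
    [{"LOWER": "transfer"}, {"LOWER": "pricing"}],
])

def keyword_hits(text):
    """Count how many expert keywords occur in the document."""
    return len(matcher(nlp(text)))

# 2. k-means: group documents into unsupervised clusters and use the
#    cluster id of each document as an extra categorical feature.
def cluster_ids(texts, n_clusters=20):
    X = TfidfVectorizer(max_features=20_000).fit_transform(texts)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```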

Will this model be useful?

We decided to evaluate the performance of the model with a classic human vs. computer comparison, with the target that a satisfactory model should perform no worse than a human analyst. We collected a batch of unseen documents and had the model predict their labels. At the same time, three analysts worked to label the same documents.

With these data points, we could look both at how the model performed compared to humans, and at how the analysts compared to each other. This inter-analyst comparison was necessary because at this level of topic granularity, many topics overlap and there is some degree of subjectivity in the labels.

Our model achieved an F1 score of 0.65, essentially the same as the inter-analyst F1 score of 0.64. We had successfully built a model that performed no worse than an analyst at identifying document topics!
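
For reference, a multi-label F1 score can be computed from binary indicator matrices as below; micro-averaging is my assumption here, since the exact averaging scheme used in the project was not stated:

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows = documents, columns = topics. Tiny invented example,
# not the actual evaluation data.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

print(f1_score(y_true, y_pred, average="micro"))  # 0.75 on this toy data
```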

Deploying the model

[Figure: All incoming documents are automatically tagged, and the feedback given is used to retrain the model]

The model is deployed in a Docker container so that it works across different environments, and consists of three key services (the prediction API is sketched after the list):

  1. An automated training script that can be used to add additional categories or incorporate user feedback

  2. A prediction API that is triggered when a new document is added

  3. A feedback module, which collects feedback from analysts, accounts for conflicting feedback, and updates the database
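
The prediction service could look roughly like the sketch below; FastAPI, the endpoint name, and the stub predictor are assumptions rather than the actual implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Document(BaseModel):
    text: str

def predict_topics(text):
    # Stand-in for the ensemble of per-topic experts described above.
    return ["withholding_tax"] if "withholding" in text.lower() else []

@app.post("/predict")
def predict(doc: Document):
    """Triggered whenever a new document is added; returns its labels."""
    return {"labels": predict_topics(doc.text)}
```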

Conclusion

Deploying the model in this manner is what makes the system ‘intelligent’: it gets better over time by learning from user feedback and improving itself. This ensures that the model remains relevant as new tax topics are introduced, or as the discussion surrounding a particular topic changes over time (concept drift).

The model proved effective during user acceptance tests and has since been deployed into production for the client’s local and overseas offices. This is just one of the ways that artificial intelligence can be used to augment workflows and improve efficiency. Artificial intelligence is now at a point where many of the techniques originally found in research papers are ready to be adopted by industry.

Check out other cool AI projects and learn some machine learning at AI Singapore.

I’m on LinkedIn!

Translated from: https://medium.com/@likkhian/building-an-artificially-intelligent-system-to-augment-financial-analysis-5377fe0404ab
