Labelling Unstructured Text Data in Python

Labelled data is a crucial requirement for supervised machine learning, so much so that it has given rise to an industry of its own. Labelling is expensive and time-consuming, especially for unstructured text, which requires custom-made techniques and rules to assign appropriate labels.

With the advent of state-of-the-art ML models and frameworks such as TensorFlow and PyTorch, data science practitioners have come to depend on them for a wide range of problems. These models are only useful when given well-labelled training datasets, and both the cost and the quality of labelling rise with the involvement of subject matter experts (SMEs). These constraints have turned practitioners towards weak supervision: an alternative way of labelling training data that combines high-level supervision from SMEs with noisier labels derived from task-specific heuristics and regular expression patterns. These techniques are used in open-source labelling tools such as Snorkel, through its labelling functions, and in paid proprietary tools such as Groundtruth and Dataturks.

The solution proposed here was built for a multinational enterprise information technology client that develops a wide variety of hardware components and software-related services for consumers and businesses. The client runs a large service team that supports customers through after-sales services, and it recognized the need for an in-depth, automated, near-real-time analysis of customer communication logs. Such analysis helps identify product shortcomings proactively and pinpoint improvements for future product releases.

We developed a two-phase solution strategy to address the problem at hand.

The first task was a binary classification to separate customer calls into Operating System (OS) and Non-Operating System (Non-OS) calls. Since no labelled data was available, we used regular expressions for this classification. The regex matches also serve as labels, assigning each call to its category. In the second phase, we targeted the ‘Non-OS’ category to tag other features.

The stepwise solution approach is as follows:

Preprocessing:

1. Create a corpus of frequently used OS phrases and abbreviations (ex: windows install, windows activation, Deployment issue, windows, VMware)

2. Similarly, form a corpus of phrases and words that may occur alongside the OS phrases but indicate non-OS calls.

Core steps:

1. Standard text cleaning procedures such as:

a) Convert text to lowercase

b) Remove multiple spaces

c) Remove punctuation and special characters

d) Remove non-ASCII characters

2. In the first search pass, identify OS related words and phrases to tag the relevant calls as OS calls

3. In the second search pass, identify non-OS related words and phrases to tag calls related to features other than operating systems. This is needed as most call logs will keep a record of the configuration of the system which may lead to false tagging of the calls as OS
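
The two search passes can be sketched with Python's `re` module as follows. This is a minimal illustration, not the original implementation: the term lists are hypothetical and borrowed from the sample terms in this article, and the sketch assumes that a non-OS match overrides an OS match and that a call matching neither list counts as Non-OS.

```python
import re

# Hypothetical corpora for illustration (already lowercased, like the cleaned text).
OS_TERMS = ["rhel", "redhat", "os install", "no boot", "windows activation"]
NON_OS_TERMS = ["hw", "disk error"]


def build_pattern(terms):
    """Compile one regex that matches any term as a whole word or phrase."""
    # Longer terms first so that multi-word phrases are preferred in the alternation.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


OS_PATTERN = build_pattern(OS_TERMS)
NON_OS_PATTERN = build_pattern(NON_OS_TERMS)


def tag_call(cleaned_text):
    """Return 'OS' or 'Non-OS' for one cleaned call log."""
    # First pass: does the log mention any OS term?
    is_os = bool(OS_PATTERN.search(cleaned_text))
    # Second pass: a non-OS term overrides the OS tag (assumption of this sketch),
    # because logs often list the system configuration, OS included.
    if is_os and NON_OS_PATTERN.search(cleaned_text):
        is_os = False
    return "OS" if is_os else "Non-OS"


label_a = tag_call("rhel 7 no boot after kernel update")
label_b = tag_call("server runs rhel disk error on bay 2")
```

The `\b` word boundaries keep a short term such as `hw` from matching inside a longer word.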

Details for the phrase and word search:

a) Create a dictionary of all texts: split the text of each row into words and store the resulting list of words as the dictionary value, keyed by the text or its unique id.

b) Now split each phrase of the corpus into words and search for each word of the phrase in each element of the dictionary. If all the words of the phrase are present in a given element, tag the respective text or unique id accordingly.

c) Similarly, search for the single words of the corpus in all the texts and tag the matching calls accordingly.

Code Snippets

Text cleaning:
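
The original snippet is not available here, so the following is a minimal sketch of the four cleaning steps listed under the core steps. The function name `clean_text` is hypothetical, and the steps run in a slightly different order than listed (non-ASCII removal before punctuation, whitespace collapsing last) so that spaces introduced by earlier steps are also collapsed.

```python
import re


def clean_text(text):
    """Lowercase, drop non-ASCII, strip punctuation, collapse whitespace."""
    # a) convert to lowercase
    text = text.lower()
    # d) drop every character that cannot be encoded as ASCII
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # c) replace punctuation and special characters with a space, so that
    #    "no-boot" becomes "no boot" rather than "noboot"
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # b) collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text


cleaned = clean_text("  Windows   Install FAILED!! (café) ")
```

Replacing punctuation with a space instead of deleting it is a design choice of this sketch: it keeps hyphenated or underscore-joined terms searchable as separate words.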

Phrase search:
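
The original snippet is not available here either. Below is a sketch of the dictionary-based search described in the details above, assuming the texts have already been cleaned. The function names and the sample calls are hypothetical, and a set of words is stored per id instead of a list, which makes the "all words present" check a simple subset test.

```python
def build_token_index(texts):
    """Step a): map each unique id to the set of words in its text."""
    return {uid: set(text.split()) for uid, text in texts.items()}


def search_phrases(token_index, phrases, label):
    """Steps b) and c): tag an id when every word of some phrase occurs in its text.

    Single words are treated as one-word phrases. Word order and adjacency
    are ignored, as in the approach described above.
    """
    phrase_words = [set(phrase.lower().split()) for phrase in phrases]
    # Skip empty phrases: an empty set is a subset of everything.
    phrase_words = [words for words in phrase_words if words]
    tags = {}
    for uid, words in token_index.items():
        if any(needed <= words for needed in phrase_words):
            tags[uid] = label
    return tags


# Hypothetical call logs keyed by unique id.
calls = {
    "c1": "customer reports windows activation failed after install",
    "c2": "replaced fan and closed the case",
    "c3": "install of windows stuck at 40 percent",
}
index = build_token_index(calls)
os_tags = search_phrases(index, ["windows install", "windows activation", "vmware"], "OS")
```

Because the match ignores word order, `c3` is tagged even though "install" and "windows" are not adjacent; this raises recall at the cost of some false positives.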

Limitations

1. Currently, the text is searched only for the phrases of a single product and tagged accordingly. As an improvement, we could include phrases for multiple products and tag the calls in a similar fashion.

2. We could also add translation for foreign-language text and check for spelling mistakes.

3. Domain experts can help in creating an exclusive set of words and phrases for each product which can make the product more customizable for different industry segments.
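
As one possible way to handle the spelling mistakes mentioned in point 2, misspelled tokens could be mapped to the closest corpus word before searching. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name, the vocabulary, and the cutoff of 0.85 are assumptions chosen for illustration, not part of the original solution.

```python
import difflib


def correct_tokens(tokens, vocabulary, cutoff=0.85):
    """Replace each token with its closest vocabulary word, if one is similar enough."""
    vocab = sorted(vocabulary)  # fixed order keeps the results deterministic
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
            continue
        # Returns at most one candidate whose similarity ratio is >= cutoff.
        match = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return corrected


# Hypothetical corpus words and a misspelled call log.
vocabulary = {"windows", "activation", "vmware"}
fixed = correct_tokens("windws activaton failed".split(), vocabulary)
```

A high cutoff keeps unrelated words such as "failed" untouched; lowering it corrects more typos but risks rewriting valid words into corpus terms.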

Sample search results

1. OS Terms: RHEL, RedHat, OS install, no boot, subscription

2. Non-OS Terms: HW (Hardware), Disk Error

Proposed Future Enhancements

1. The labelled training data can be used to train an NLP-based binary classification model that classifies the call logs into OS and Non-OS classes.

2. Textual data needs to be converted into vectorized form, which can be achieved by using word embeddings for each token in the sentence. We can use pre-trained open-source embeddings like FastText, BERT, GloVe, etc.

3. Some of the state-of-the-art models, like Neural Nets, can be used for the classification task, with RNN/GRU/LSTM layers to learn representations for text sequences.
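
To make the vectorization in point 2 concrete, the sketch below averages per-token embeddings into one fixed-length vector per call. The three-dimensional vectors are invented for illustration; in practice they would be looked up in a pre-trained static embedding table such as GloVe or FastText. Mean pooling is a simplification chosen for this sketch: a sequence model as in point 3 would consume the per-token vectors directly instead.

```python
# Hypothetical toy embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "windows": [0.9, 0.1, 0.0],
    "install": [0.7, 0.2, 0.1],
    "disk": [0.1, 0.8, 0.3],
    "error": [0.2, 0.6, 0.5],
}


def sentence_vector(text, embeddings, dim=3):
    """Average the embeddings of all known tokens in a cleaned text."""
    vectors = [embeddings[token] for token in text.split() if token in embeddings]
    if not vectors:
        # No known token: fall back to a zero vector of the same size.
        return [0.0] * dim
    # zip(*vectors) groups the values dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


os_vec = sentence_vector("windows install failed", TOY_EMBEDDINGS)
hw_vec = sentence_vector("disk error on bay 2", TOY_EMBEDDINGS)
```

Unknown tokens such as "failed" are skipped here; a subword-based model like FastText could instead produce vectors for out-of-vocabulary words.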
