Enhance your text classification toolbox with sequence mining

Introduction

When looking at the incredible results obtained by recent language models such as GPT-3, the simple task of text classification may seem like a problem that can easily be solved using pre-trained models.

However, in my experience, I found that for a range of reasons, we often face particular instances of the text classification problem where the most advanced language models are not necessarily practical to use. The reasons could be the number of available labelled examples, their length or their variety. When this happens, some basic tools that are sometimes overlooked can have their merits.

Sequence mining is one such tool which, although it won’t replace your language models, may save you if you find yourself in a situation where state-of-the-art methods are not entirely solving your problem.

This is the second post of a two-part series on text classification, following the previous post here.

Use case

While working on a document classification tool, our team found that for particular labels, the documents were easily identifiable by a human as well as by simple heuristics, while more complex models such as RNNs failed.

There were two main reasons for this: the number of available labelled examples was quite low for those classes, and the majority of the document text was very similar to the text of other labels, except for a few subtle paragraphs. We were able to improve our classification rate by manually engineering features that would target those specific paragraphs.

If a simple heuristic works, then why would we want to use more complex approaches anyway? (see rule number 1 of Machine Learning).

One reason we built an ML model to solve this classification problem was to have a solution that would, hopefully, be maintained at reasonable costs by simply re-training the model to adjust for data drifts. By contrast, the heuristics would have had to be frequently checked and maintained using human review.

Our heuristics were based on word sequences, and the process of identifying the sequences automatically is what sequence mining does. In our case (a low number of examples and a large amount of text per example), sequence mining can outperform RNNs.

To avoid confusion, we will use the word ‘sequence’ when referring to the text sequence to be classified, and the word ‘pattern’ to describe the sub-sequence that we are looking for in the text sequence.

Illustration

Here is an illustration based on randomly generated data. Each example is a random sequence of integer tokens. We generate positive and negative examples based on the presence or absence of a certain pattern.

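For illustration, here is a minimal sketch of how such labelled data can be generated (the values and the helper below are made up for this illustration; the actual data in this post comes from the ExamplesGenerator used further down, and gaps between pattern tokens are ignored in this simplified version):

import random

VOCAB_SIZE = 1000        # illustrative vocabulary size
SEQ_LEN = 100            # illustrative sequence length
PATTERN = [10, 11, 12]   # the pattern whose presence defines a positive example

def make_example(positive: bool):
    """Generate a random token sequence and, for positive examples,
    overwrite a random position with the pattern."""
    seq = [random.randrange(VOCAB_SIZE) for _ in range(SEQ_LEN)]
    if positive:
        start = random.randrange(SEQ_LEN - len(PATTERN))
        seq[start:start + len(PATTERN)] = PATTERN
    return seq, int(positive)

examples = [make_example(i % 2 == 0) for i in range(200)]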

Positive example definition

In our real-world example, we found that certain successions of words, within a certain maximal distance from each other, were a very useful heuristic to classify certain documents (e.g. legal documents such as subpoenas).

For example: ‘local’, ‘time’, ‘is’, […], ‘all’, ‘your’, ‘belongings’, […], ‘thank’, ‘you’, ‘for’, ‘flying’ can be a pattern of words with certain gaps […] that identifies a flight attendant’s landing speech with a high degree of accuracy.

In this illustration, the randomly generated positive examples contain a certain pattern of predefined tokens in a given order, as well as a maximal distance between each token. The negative examples do not contain this pattern.

For example, the sequence of tokens 10, 11, 12, with between 0 and 5 random tokens in between each of them, can represent the pattern we find in a positive example.
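To make the gap constraint concrete, here is a small sketch of a check for such a pattern (a hypothetical helper, not part of any library used in this post): it returns True when the pattern tokens appear in order with at most max_gap other tokens between two consecutive pattern tokens.

def contains_gapped_pattern(seq, pattern, max_gap=5):
    """True if the tokens of `pattern` occur in `seq` in the given order,
    with at most `max_gap` other tokens between two consecutive pattern tokens."""
    def match_from(start, pat_idx, first):
        if pat_idx == len(pattern):
            return True
        # the first pattern token can be anywhere; later ones must stay within the gap
        end = len(seq) if first else min(start + max_gap + 1, len(seq))
        return any(seq[i] == pattern[pat_idx] and match_from(i + 1, pat_idx + 1, False)
                   for i in range(start, end))
    return match_from(0, 0, True)

# tokens 10, 11, 12 with at most 5 random tokens between them:
contains_gapped_pattern([3, 10, 7, 7, 11, 12, 42], [10, 11, 12], max_gap=5)  # True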

RNN results

We can try to experiment with different values of vocabulary size, sequence length and gap length within the pattern. In this case, we have a 5-token pattern in each example.

The sequence length ranges from 50 to 300, and the “gap” size between the tokens in the pattern ranges from 2 to 30.
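The exact architecture is not the point here, but a minimal sketch of the kind of RNN baseline used in these experiments could look like this (assuming TensorFlow/Keras; the layer sizes are illustrative, and the actual experiment code is linked at the end of the post):

import tensorflow as tf

VOCAB_SIZE = 1000
SEQ_LEN = 250

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),      # token ids -> dense vectors
    tf.keras.layers.LSTM(32),                       # read the whole sequence
    tf.keras.layers.Dense(1, activation="sigmoid"), # binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# X: (num_examples, SEQ_LEN) array of token ids, y: 0/1 labels
# model.fit(X, y, epochs=20, validation_split=0.2)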

The RNN performs well on “clean” data, where 100% of the positive examples contain the same pattern, and 100% of the negative examples don’t.

Once we make the problem a little bit more difficult (and more similar to our real-world problem) by introducing three different possible patterns in the positive examples, as well as a 5% FP and 5% FN rate, the neural network struggles to learn.
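As a reference point, the 5% FP / FN rate can be simulated by simply flipping a fraction of the generated labels (a hypothetical snippet, independent of the generator’s actual interface):

import random

def add_label_noise(labels, noise_rate=0.05, seed=111):
    """Flip a fraction of 0/1 labels to simulate false positives and false negatives."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in labels]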

The number of training examples is purposely limited to 200, because that is about the number of examples we had in our real-world case.

On paper, an RNN should be able to pick up those patterns: it has an architecture that is sufficiently complex and flexible to identify them. There is, however, no guarantee that your initial conditions and hyperparameters will converge to such a solution, which is why we want to supplement it with additional features, using sequence mining.

Sequence mining

We are not necessarily advocating for an outright replacement of RNNs with sequence mining and a classification algorithm. We are, however, suggesting using sequence mining to generate additional features, in the same way that, in our document classification task, images from the document, document metadata and pre-trained language models are used as high-level features for our classification algorithm.

There are multiple sequence mining algorithms, but the algorithm itself is not the topic of this post. For the purpose of this exercise, however, we used a Python implementation of the PrefixSpan algorithm, which can be found here.

In our simplified example, after a bit of cleanup, we quickly re-identify the three patterns that we put in our positive examples. Here is an example of usage:

from prefixspan import PrefixSpan

from data_sources.data_generator import ExamplesGenerator, get_multiple_patterns

VOCAB_SIZE = 1000
SEQ_LEN = 250
multiple_patterns = get_multiple_patterns(10)

NUM_EXAMPLES = 200
MIN_FREQ = 25   # minimal number of occurrences for a pattern to be kept
MIN_LEN = 5     # minimal pattern length (in tokens)
MIN_DIST = 3    # minimal number of differing tokens between two kept patterns

# generate labelled example sequences (label 1 = contains one of the patterns)
data_generator = ExamplesGenerator(seq_len=SEQ_LEN, vocab_size=VOCAB_SIZE, seed=111,
                                   multiple_patterns=multiple_patterns)
data_sequences = [next(data_generator()) for _ in range(NUM_EXAMPLES)]
positive_sequences = [s[0] for s in data_sequences if s[1] == 1]
negative_sequences = [s[0] for s in data_sequences if s[1] == 0]

# mine frequent patterns from the positive examples, keep the long ones,
# and sort them by frequency (PrefixSpan returns (frequency, pattern) tuples)
positive_seq = PrefixSpan(positive_sequences).frequent(MIN_FREQ)
long_seq = [s for s in positive_seq if len(s[1]) >= MIN_LEN]
seq_by_freq = sorted(long_seq, key=lambda x: x[0], reverse=True)


def distance_from_seqs(s: list, s_list: list):
    """Return the distance (in terms of number of different tokens) between the sequence s
    and the closest sequence in s_list."""
    if not s_list:
        s_list = [[]]
    dist_per_seq = [len(set(s) - set(s2)) for s2 in s_list]
    return min(dist_per_seq)


# keep only the patterns that differ enough from the ones already kept
most_freq_seq = []
for s in seq_by_freq:
    if distance_from_seqs(s[1], most_freq_seq) >= MIN_DIST:
        most_freq_seq.append(s[1])

print(most_freq_seq[0:10])
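The mined patterns in most_freq_seq can then be turned into simple presence features for a downstream classifier, alongside any other document features. Here is a minimal sketch of what that could look like (scikit-learn’s LogisticRegression is just one possible choice, and checking for presence on the same data the patterns were mined from is for illustration only):

from sklearn.linear_model import LogisticRegression

def contains_subsequence(seq, pattern):
    """True if `pattern` occurs in `seq` as an ordered (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(token in it for token in pattern)

def pattern_features(seq, patterns):
    """One binary feature per mined pattern: 1 if the pattern occurs in the sequence."""
    return [int(contains_subsequence(seq, p)) for p in patterns]

X = [pattern_features(s[0], most_freq_seq) for s in data_sequences]
y = [s[1] for s in data_sequences]
clf = LogisticRegression().fit(X, y)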

Note that the computation time will increase exponentially with the sequence length, so this would not work for long sequences, unless you break them down into smaller chunks. In our case, we were looking for specific paragraphs, so it was feasible to break down the long sequences.
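A simple way to do that chunking is to split each sequence into fixed-size, slightly overlapping windows before mining (the sizes below are arbitrary):

def split_into_chunks(seq, chunk_len=50, overlap=10):
    """Split a long token sequence into overlapping fixed-size chunks."""
    step = chunk_len - overlap
    return [seq[i:i + chunk_len] for i in range(0, max(len(seq) - overlap, 1), step)]

# mine on chunks instead of full sequences:
# chunked_positives = [c for s in positive_sequences for c in split_into_chunks(s)]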

The code for the ExamplesGenerator used above and the code for the RNN experiments can be found here and there.

Translated from: https://medium.com/@jcrousse/enhance-your-text-classification-toolbox-with-sequence-mining-89fbf38307a7
