Zero-Shot Text Classification with Hugging Face

Original article: https://towardsdatascience.com/zero-shot-text-classification-with-hugging-face-7f533ba83cd6

A few weeks ago I was implementing a POC, and one of the requirements was to detect text sentiment in an unsupervised way (without having training data in advance and without building a model). More specifically, it was about data extraction: based on some predefined topics, my task was to automate information extraction from text data. While researching the best ways to solve this problem, I found out that Hugging Face supports zero-shot text classification.

What is zero-shot text classification? Check this post: Zero-Shot Learning in Modern NLP. There is a live demo from the Hugging Face team, along with a sample Colab notebook. In simple words, a zero-shot model allows us to classify data that wasn't used to build the model. What I mean here is that the model was built by someone else, and we run it against our own data, with labels we choose ourselves.

I thought it would be a useful example, where I fetch Twitter messages and run classification to group messages into topics. This can be used as a starting point for more complex use cases.

I’m using the GetOldTweets3 library to scrape Twitter messages. Zero-shot classification with transformers is straightforward; I followed the Colab example provided by Hugging Face.

List of imports:

import GetOldTweets3 as got
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import pipeline

Getting the classifier from the transformers pipeline:

classifier = pipeline("zero-shot-classification")

I scrape the 500 latest messages from Twitter, based on a predefined query, “climate fight”. We are going to fetch messages related to the climate change fight into a Pandas data frame and then try to split them into topics using zero-shot classification:

txt = 'climate fight'
max_recs = 500
tweets_df = text_query_to_df(txt, max_recs)
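
The text_query_to_df helper is not shown in this post. As a rough sketch only, and not necessarily the author's implementation, it could look like the following. The column names "Datetime" and "Text" and the tweets_to_rows helper are assumptions made for illustration; the GetOldTweets3 calls are wrapped in a function with lazy imports so that the conversion step can be tried without network access.

```python
from collections import namedtuple


def tweets_to_rows(tweets):
    """Turn tweet-like objects (anything with .date and .text) into plain rows."""
    return [{"Datetime": t.date, "Text": t.text} for t in tweets]


def text_query_to_df(text_query, count):
    """Hypothetical helper: fetch up to `count` tweets matching `text_query`."""
    # Imported lazily so the rest of this sketch runs without these packages.
    import GetOldTweets3 as got
    import pandas as pd

    criteria = (got.manager.TweetCriteria()
                .setQuerySearch(text_query)
                .setMaxTweets(count))
    tweets = got.manager.TweetManager.getTweets(criteria)
    # Column names are an assumption, not taken from the article.
    return pd.DataFrame(tweets_to_rows(tweets), columns=["Datetime", "Text"])


# Offline check of the conversion step with a stand-in tweet object.
FakeTweet = namedtuple("FakeTweet", ["date", "text"])
rows = tweets_to_rows([FakeTweet("2020-09-01", "climate fight")])
print(rows)
```

Keeping the conversion separate from the fetch makes it easy to swap in another data source while producing the same data frame layout.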

In zero-shot classification, you can define your own labels and then run the classifier to assign a probability to each label. There is also an option for multi-label classification (multi_class=True in the pipeline at the time of writing, renamed multi_label in later transformers releases); in this case the scores are independent, and each falls between 0 and 1. I’m going to use the default option, where the pipeline assumes that only one of the candidate labels is true and returns one score per label, with the scores adding up to 1.
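
To make the difference between the two modes concrete, here is a minimal sketch of how the scores could be normalised. It assumes the underlying model is an NLI model that yields an entailment logit and a contradiction logit per candidate label, which is my understanding of how the pipeline works rather than something stated in this post. The logit values and function names are made up for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment_logits):
    # Default mode: labels compete with each other, so the scores sum to 1.
    return softmax(entailment_logits)


def multi_label_scores(logit_pairs):
    # Independent mode: each label is scored on its own by comparing its
    # entailment logit with its contradiction logit.
    return [softmax([contradiction, entailment])[1]
            for contradiction, entailment in logit_pairs]


# Made-up logits for three candidate labels (illustrative values only).
entailment = [2.0, 0.5, -1.0]
contradiction = [-1.5, 0.0, 2.0]

competing = single_label_scores(entailment)
independent = multi_label_scores(list(zip(contradiction, entailment)))
print([round(s, 3) for s in competing])
print([round(s, 3) for s in independent])
```

In the default mode a high score for one label necessarily pushes the others down, which is why checking only the top label is enough later on.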

The candidate labels are topics, which lets us understand what people are actually talking about in the climate change fight. Some messages are simply adverts, and we would like to ignore them. Zero-shot classification detects adverts pretty well, which helps to clean the data:

candidate_labels = ["renewable", "politics", "emission", "temperature", "emergency", "advertisement"]

I loop over the messages and classify each one:

res = classifier(sent, candidate_labels)

Then I check the classification result. It is enough to check the first label, because the pipeline returns the labels sorted by score in descending order and, with the default option, assumes only one of the candidate labels is true. If the top score is greater than 0.5, I count it for further processing:

if res['labels'][0] == 'renewable' and res['scores'][0] > 0.5:
    candidate_results[0] = candidate_results[0] + 1
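
The check above handles a single label. One way to generalise it to every candidate label is sketched below; this is an illustration under stated assumptions, not the author's code. tally_topics and fake_classifier are hypothetical names, and fake_classifier is only a stand-in that returns a dict with the same "labels" and "scores" keys as the pipeline result used above, so the loop can be tried without downloading a model.

```python
from collections import Counter


def tally_topics(messages, classify, labels, threshold=0.5):
    """Count messages whose top-ranked label scores above the threshold."""
    counts = Counter({label: 0 for label in labels})
    for message in messages:
        res = classify(message, labels)
        # Labels are expected to be sorted by score, highest first.
        top_label, top_score = res["labels"][0], res["scores"][0]
        if top_score > threshold:
            counts[top_label] += 1
    return counts


def fake_classifier(text, labels):
    # Hypothetical stand-in for the real classifier: a label scores high
    # only if it literally appears in the text. Scores sum to 1.
    hits = [1.0 if label in text.lower() else 0.1 for label in labels]
    total = sum(hits)
    ranked = sorted(zip(labels, hits), key=lambda pair: pair[1], reverse=True)
    return {"sequence": text,
            "labels": [label for label, _ in ranked],
            "scores": [hit / total for _, hit in ranked]}


labels = ["renewable", "politics", "emission"]
messages = ["Politics again", "Renewable energy is growing", "Nothing relevant here"]
counts = tally_topics(messages, fake_classifier, labels)
print(dict(counts))
```

With the real pipeline, the classifier object created earlier would be passed in place of fake_classifier, and messages below the threshold simply stay uncounted.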

From the result, we can see that the political topic dominates the climate change fight discussion, perhaps as expected. Topics related to emission and emergency are close to each other in popularity. There were around 20 adverts among the 500 scraped messages:

[Image: classification results by topic. Author: Andrej Baranovskij]

Let’s see an example for each topic.

renewable

Eco-friendly Hydrogen: The clean fuel of the future Germany is promoting the use of #eco-friendly hydrogen in the fight against climate change. Hydrogen can replace fossil fuels in virtually every situation, in an engine or fuel cell!

politics

This is so crazy and wrong. It’s as if the ACA isn’t better than what we had before, that the fight for voting rights doesn’t matter, or equal pay for women, or marriage equality, or the Paris climate agreement. Just because Biden isn’t what we want doesn’t mean Dems = GOP

emission

A simpler, more useful way to tax carbon to fight climate change - Vox

temperature

I've noticed any time someone tries to tell me global warming is not a big deal and how climate change has happened before, my body goes into fight or flight.

emergency

(+ the next few years are CRUCIAL in the fight against climate change. if we don't address it, we'll pass the point of IRREVERSIBLE damage. biden supports the green new deal. trump... well, ya know.)

advertisement

What is your favorite party game? Have a look on @ClumsyRush https://www.nintendo.com/games/detail/clumsy-rush-switch/ #party #game #NintendoSwitch

Classification results are very good; I think the Hugging Face zero-shot model does a really good job. The sample sentences above didn't mention the topic label directly, and still they were classified correctly.

Conclusion

Unsupervised text classification with a zero-shot model lets us solve text sentiment detection tasks when we don’t have training data to train a model. Instead, we rely on a large pretrained model from transformers. For specialized use cases, where the text is based on specific words or terms, it is better to go with a supervised classification model based on a training set. But for general topics, the zero-shot model works amazingly well.

Source code
