Before we start
Two of my previous posts might be helpful for getting a general understanding of the top solutions of this competition. Please feel free to check them out.
- Knowledge Distillation clearly explained
- Common Multilingual Language Modeling methods (M-Bert, LASER, MultiFiT, XLM)
Jigsaw Multilingual Toxic Comment Classification
Use TPUs to identify toxic comments across multiple languages.
Overview of the competition
Jigsaw Multilingual Toxic Comment Classification is the 3rd annual competition organized by the Jigsaw team. It follows Toxic Comment Classification Challenge, the original 2018 competition, and Jigsaw Unintended Bias in Toxicity Classification, which required the competitors to consider biased ML predictions in their new models. This year, the goal is to use English only training data to run toxicity predictions on foreign languages (tr, ru, it, fr, pt, es).
Kagglers are predicting the probability that a comment is toxic. A toxic comment receives a 1.0; a benign, non-toxic comment receives a 0.0. In the test set, every comment is labeled either 1.0 or 0.0. The whole test set was visible in this competition.
Data
jigsaw-toxic-comment-train.csv: from Jigsaw Toxic Comment Classification Challenge (2018).
| | 0.0 | 1.0 | total |
|---|---|---|---|
| count | 202165 | 21384 | 223549 |
jigsaw-unintended-bias-train.csv: from Jigsaw Unintended Bias in Toxicity Classification (2019).
| | 0.0 | 1.0 | total |
|---|---|---|---|
| count | 1789968 | 112226 | 1902194 |
validation.csv: comments from Wikipedia talk pages in different non-English languages.
test.csv: comments from Wikipedia talk pages in different non-English languages.
Here are the value counts in the validation data:
| | 0.0 | 1.0 | total |
|---|---|---|---|
| es | 2078 | 422 | 2500 |
| it | 2012 | 488 | 2500 |
| tr | 2680 | 320 | 3000 |
Here are the value counts in the test data:
As you can see, the test set contains comments in six non-English languages (tr, ru, it, fr, pt, es), while the validation set contains comments in only three of them (es, it, tr).
1st place solution
Ensembling to mitigate Transformer training variability
Since the performance of Transformer models is impacted heavily by initialization and data order, they went with an iterative blending approach, refining the test set predictions across submissions with a weighted average of the previous best submission and the current model’s predictions. They began with a simple average, and gradually increased the weight of the previous best submission.
Note: The predictions are an exponential moving average of all past model predictions and the current model’s prediction.
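The iterative blending described above can be sketched as follows. This is a minimal illustration, not the team's actual code; the prediction arrays and blending weights are made up for the example.

```python
import numpy as np

def blend(prev_best: np.ndarray, current: np.ndarray, w_prev: float) -> np.ndarray:
    """Weighted average of the previous best submission and the
    current model's predictions (hypothetical helper)."""
    return w_prev * prev_best + (1.0 - w_prev) * current

# Toy test-set predictions from two successive models.
round1 = np.array([0.2, 0.8, 0.5])
round2 = np.array([0.3, 0.7, 0.6])

# Start with a simple average (w_prev = 0.5); in later rounds the
# weight on the previous best submission would be increased, which
# makes the running blend behave like an exponential moving average
# of all past model predictions.
best = blend(round1, round2, w_prev=0.5)
```

Each new submission repeats this step, replacing `round1` with the current running blend and raising `w_prev`.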
Pseudo-Labeling
They observed a performance improvement when they used test-set predictions as training data, the intuition being that it helps models learn the test-set distribution. Using all test-set predictions as soft labels worked better than any other version of pseudo-labelling (e.g.
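The soft-label variant amounts to appending the test comments to the training data with the model's predicted probabilities as regression targets, rather than thresholded 0/1 labels. A minimal sketch, with made-up texts and predictions standing in for the real data:

```python
import numpy as np

# Labeled training data (hypothetical stand-ins).
train_texts = ["good comment", "bad comment"]
train_labels = np.array([0.0, 1.0])

# Unlabeled test comments and the model's predicted toxicity
# probabilities for them.
test_texts = ["unlabeled comment"]
test_soft_preds = np.array([0.73])

# Soft pseudo-labelling: keep the raw probability as the target
# instead of rounding it to a hard 0/1 label.
aug_texts = train_texts + test_texts
aug_labels = np.concatenate([train_labels, test_soft_preds])
```

The augmented set (`aug_texts`, `aug_labels`) is then used to retrain the model with a loss that accepts soft targets, such as binary cross-entropy against probabilities.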