Kaggle: Jigsaw Multilingual Toxic Comment Classification Top Solutions (A Summary of Gold-Medal Approaches)

This post summarizes the top solutions of the Kaggle Jigsaw Multilingual Toxic Comment Classification competition, covering ensembling strategies, pseudo-labeling, multilingual XLM-RoBERTa models, training monolingual Transformers, and post-processing. The first- and fourth-place teams used iterative blending, pseudo-label augmentation, language-specific pretrained models, and tuned post-processing to significantly improve prediction performance.

Links to previous articles

Before we start

Two of my previous posts might be helpful for getting a general understanding of the top solutions of this competition. Please feel free to check them out.

Jigsaw Multilingual Toxic Comment Classification

Use TPUs to identify toxicity comments across multiple languages.

Overview of the competition

Jigsaw Multilingual Toxic Comment Classification is the 3rd annual competition organized by the Jigsaw team. It follows the Toxic Comment Classification Challenge, the original 2018 competition, and Jigsaw Unintended Bias in Toxicity Classification, which required competitors to account for biased ML predictions in their models. This year, the goal is to use English-only training data to run toxicity predictions on comments in foreign languages (tr, ru, it, fr, pt, es).

Kagglers are predicting the probability that a comment is toxic. A toxic comment would receive a 1.0. A benign, non-toxic comment would receive a 0.0. In the test set, all comments are classified as either a 1.0 or a 0.0. The whole test set was visible in this competition.

Data

  • jigsaw-toxic-comment-train.csv: from Jigsaw Toxic Comment Classification Challenge (2018).

    label   0.0       1.0      total
    count   202165    21384    223549

  • jigsaw-unintended-bias-train.csv: from Jigsaw Unintended Bias in Toxicity Classification (2019).

    label   0.0        1.0       total
    count   1789968    112226    1902194
  • validation.csv: comments from Wikipedia talk pages in different non-English languages
  • test.csv: comments from Wikipedia talk pages in different non-English languages

Here are the value counts in the validation data:

    lang    0.0     1.0    total
    es      2078    422    2500
    it      2012    488    2500
    tr      2680    320    3000

The test set contains comments in six non-English languages (tr, ru, it, fr, pt, es), while the validation set covers only three of them (es, it, tr).
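To reproduce the per-language counts above, the labels in validation.csv can be cross-tabulated with pandas. A minimal sketch, assuming the Kaggle competition files are available locally and that validation.csv has `lang` and `toxic` columns as in the competition data:

```python
import pandas as pd

# Load the competition's validation file.
valid = pd.read_csv("validation.csv")

# Cross-tabulate language against the 0/1 toxicity label,
# then add a per-language total column.
counts = pd.crosstab(valid["lang"], valid["toxic"])
counts["total"] = counts.sum(axis=1)
print(counts)
```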

1st place solution

Ensembling to mitigate Transformer training variability

Since the performance of Transformer models is impacted heavily by initialization and data order, they went with an iterative blending approach, refining the test set predictions across submissions with a weighted average of the previous best submission and the current model’s predictions. They began with a simple average, and gradually increased the weight of the previous best submission.

Note: The predictions are an exponential moving average of all past model predictions and the current model’s prediction.
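Below is a minimal sketch of that blending step (my own illustration, not the team's code). `prev_best` holds the previous best submission's test-set probabilities, `current` holds the new model's probabilities, and `alpha` is the weight on the previous best, which the team gradually increased across submissions:

```python
import numpy as np

def blend(prev_best: np.ndarray, current: np.ndarray, alpha: float) -> np.ndarray:
    """Weighted average of the previous best submission and the current model."""
    return alpha * prev_best + (1.0 - alpha) * current

# Applied submission after submission, this amounts to an exponential moving
# average over all past model predictions:
#   blend_t = alpha * blend_{t-1} + (1 - alpha) * pred_t
```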

Pseudo-Labeling

They observed a performance improvement when they used test-set predictions as training data, the intuition being that it helps models learn the test-set distribution. Using all of the test-set predictions as soft labels worked better than any other version of pseudo-labelling they tried.
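Here is a sketch of how such soft pseudo-labels can be appended to the training data. The file and column names are assumptions based on the competition data, and `best_submission.csv` is a hypothetical file holding the current blended test predictions; this is not the team's actual pipeline:

```python
import pandas as pd

# English training data from the 2018 competition.
train = pd.read_csv("jigsaw-toxic-comment-train.csv")[["comment_text", "toxic"]]

# Use the blended test-set probabilities as soft labels; they are NOT rounded
# to 0/1, since soft labels worked best according to the write-up.
test = pd.read_csv("test.csv").rename(columns={"content": "comment_text"})
test["toxic"] = pd.read_csv("best_submission.csv")["toxic"].values

pseudo = test[["comment_text", "toxic"]]
train_plus_pl = pd.concat([train, pseudo], ignore_index=True)

# The model is then fine-tuned on train_plus_pl with a loss that accepts
# soft targets, e.g. binary cross-entropy with the probability as the target.
```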
