Before we start
Two of my previous posts might be helpful for getting a general understanding of the top solutions of this competition. Please feel free to check them out.
- Knowledge Distillation clearly explained
- Common Multilingual Language Modeling methods (M-Bert, LASER, MultiFiT, XLM)
Jigsaw Multilingual Toxic Comment Classification
Use TPUs to identify toxic comments across multiple languages.
Overview of the competition
Jigsaw Multilingual Toxic Comment Classification is the 3rd annual competition organized by the Jigsaw team. It follows Toxic Comment Classification Challenge, the original 2018 competition, and Jigsaw Unintended Bias in Toxicity Classification, which required the competitors to consider biased ML predictions in their new models. This year, the goal is to use English only training data to run toxicity predictions on foreign languages (tr, ru, it, fr, pt, es).
Kagglers are predicting the probability that a comment is toxic. A toxic comment receives a 1.0; a benign, non-toxic comment receives a 0.0. In the test set, every comment is labeled either 1.0 or 0.0. The whole test set was visible in this competition.
Data
jigsaw-toxic-comment-train.csv: from Jigsaw Toxic Comment Classification Challenge (2018).
| | 0.0 | 1.0 | total |
|---|---|---|---|
| count | 202165 | 21384 | 223549 |
jigsaw-unintended-bias-train.csv: from Jigsaw Unintended Bias in Toxicity Classification (2019).
| | 0.0 | 1.0 | total |
|---|---|---|---|
| count | 1789968 | 112226 | 1902194 |
validation.csv: comments from Wikipedia talk pages in different non-English languages.
test.csv: comments from Wikipedia talk pages in different non-English languages.
Here are the value counts in the validation data:
| | 0.0 | 1.0 | total |
|---|---|---|---|
| es | 2078 | 422 | 2500 |
| it | 2012 | 488 | 2500 |
| tr | 2680 | 320 | 3000 |
Here are the value counts in the test data:
As you can see, the test set contains comments in six non-English languages (tr, ru, it, fr, pt, es), while the validation set contains comments in only three of them (es, it, tr).
1st place solution
Ensembling to mitigate Transformer training variability
Since the performance of Transformer models is impacted heavily by initialization and data order, they went with an iterative blending approach, refining the test set predictions across submissions with a weighted average of the previous best submission and the current model’s predictions. They began with a simple average, and gradually increased the weight of the previous best submission.
Note: The predictions are an exponential moving average of all past model predictions and the current model’s prediction.
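The iterative blending described above can be sketched as follows. This is a minimal illustration, not the team's actual code; the prediction arrays and blending weights are made up for the example.

```python
import numpy as np

def blend(prev_best: np.ndarray, current: np.ndarray, w_prev: float) -> np.ndarray:
    """Weighted average of the previous best submission and the
    current model's predictions (hypothetical helper)."""
    return w_prev * prev_best + (1.0 - w_prev) * current

# Toy test-set predictions from two successive models.
round1 = np.array([0.2, 0.8, 0.5])
round2 = np.array([0.3, 0.7, 0.6])

# Start with a simple average (w_prev = 0.5); in later rounds the
# weight on the previous best submission would be increased, which
# makes the running blend behave like an exponential moving average
# of all past model predictions.
best = blend(round1, round2, w_prev=0.5)
```

Each new submission repeats this step, replacing `round1` with the current running blend and raising `w_prev`.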
Pseudo-Labeling
They observed a performance improvement when they used test-set predictions as training data, the intuition being that it helps models learn the test-set distribution. Using all test-set predictions as soft labels worked better than any other version of pseudo-labelling (e.g.
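The soft-label variant amounts to appending the test comments to the training data with the model's predicted probabilities as regression targets, rather than thresholded 0/1 labels. A minimal sketch, with made-up texts and predictions standing in for the real data:

```python
import numpy as np

# Labeled training data (hypothetical stand-ins).
train_texts = ["good comment", "bad comment"]
train_labels = np.array([0.0, 1.0])

# Unlabeled test comments and the model's predicted toxicity
# probabilities for them.
test_texts = ["unlabeled comment"]
test_soft_preds = np.array([0.73])

# Soft pseudo-labelling: keep the raw probability as the target
# instead of rounding it to a hard 0/1 label.
aug_texts = train_texts + test_texts
aug_labels = np.concatenate([train_labels, test_soft_preds])
```

The augmented set (`aug_texts`, `aug_labels`) is then used to retrain the model with a loss that accepts soft targets, such as binary cross-entropy against probabilities.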