Kaggle: Tweet Sentiment Extraction Solution Summary, Part 1/2: Common Methods

Links to previous posts

Note

This post is the first half of an overall summary of the competition. The second half is here.

Before we start

I took part in two NLP competitions in June, Tweet Sentiment Extraction and Jigsaw Multilingual Toxic Comment Classification, and I’m happy to say I’m a Kaggle Expert from now on 😃

Tweet Sentiment Extraction

Goal:

The objective of this competition is to “Extract support phrases for sentiment labels”. More precisely, the competition asks Kagglers to build a model that can figure out which word or phrase in a given tweet best supports the labeled sentiment. In other words, Kagglers attempt to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.).

For example:

text         : "Sooo SAD I will miss you here in San Diego!!!"
sentiment    : negative
selected_text: "Sooo SAD"

In this competition, state-of-the-art (SOTA) transformer models were already reasonably good at extracting the selected_text. The main challenge was capturing the “noise” in the dataset.

The organizer of this competition did not introduce the “noise” (the magic of the competition) on purpose; it was probably the result of a regex mistake (I’ll talk about the “noise” in the next section). When I analyzed the data, I found some weird selected_text values, as most other teams did. For example,

text         : "  ROFLMAO for the funny web portal  =D"
sentiment    : positive
selected_text: "e funny"
---------------------------------------------------------------
text         : " yea i just got outta one too....i want him 
                 back tho  but i feel the same way...i`m cool 
                 on dudes for a lil while"
sentiment    : positive
selected_text: "m cool"

However, most teams (including mine) did not dig into how such weird selected_text values were produced, or simply treated them as mistakes and chose to ignore them, fearing that trying to correct them would lead to overfitting.

This turned out to be a watershed of the competition. Teams that solved this problem remained among the top positions on the private leaderboard, while those that did not saw their ranks shake to some degree. I found that the scores of the top 30 teams were mostly stable, but beyond those, the private LB had a huge shake-up that was far beyond my expectation. The fun part is: I knew there would be a shake-up in this competition, so I named my team Hope no shake, but it didn’t help at all 😦 . My team was in the silver medal range on the public LB, but our rank dropped to almost 800th on the private LB! What a shame! Some teams were even more unfortunate than us and dropped from the top 50 to the bottom 200…

Anyway, there were still some fancy and interesting solutions among the top ranked teams, and they can be divided into three categories:

  • Solutions with a character-level model only (the first place solution! Awesome!).
  • Solutions with well-designed post-processing.
  • Solutions with both a character-level model and well-designed post-processing.

After the competition, I spent a week trying to understand their solutions and unique ideas, and I really learned a lot. So here I would like to share their ideas with those who are interested.

In the rest of the post, I summarize the top solutions and also add some of my own understanding. The references are at the bottom. So let’s get started!

What is the MAGIC?

This is just a bug introduced when the competition host created the task. Below is a representative example, and we call this the “noise” in the labels.

The given original annotation is “onna”, which is too weird to be intentional. The true annotation should be “miss” (this is a negative sentence). We think the host computed the slice on a normalized version of the text (with leading and consecutive spaces removed) and then applied it to the original text, which contains extra spaces, emojis, or emoticons.
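To make this concrete, here is a minimal sketch that reproduces the observed pattern on the earlier ROFLMAO example, under the assumption that the true annotation was “funny” and that the start index was computed on a whitespace-normalized copy of the text. The host’s actual procedure was never disclosed, so this is only one plausible reconstruction.

def make_noisy_label(text: str, true_phrase: str) -> str:
    """Hypothesised bug: the start index comes from a whitespace-normalised
    copy of the text, while the slice is taken from the original text."""
    start = text.find(true_phrase)            # char start in the original text
    end = start + len(true_phrase)            # char end (exclusive) in the original
    clean = " ".join(text.split())            # leading / repeated spaces removed
    shift = start - clean.find(true_phrase)   # characters dropped before the span
    return text[start - shift:end]

text = "  ROFLMAO for the funny web portal  =D"
print(make_noisy_label(text, "funny"))        # -> "e funny", matching the noisy label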

Here is how to solve it theoretically:

  • Recover true annotation from the buggy annotation (pre-processing).
  • Train model with true annotation.
  • Predict the right annotation.
  • Project back the right annotation to the buggy annotation (post-processing).

Here is the visualization:
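In code, the pre-processing and post-processing steps above might look roughly like this. This is a sketch under the same assumption as before (the shift equals the number of extra whitespace characters before the span); the function names are mine, and the top teams’ actual implementations handled more edge cases.

import re

def extra_spaces_before(text: str, pos: int) -> int:
    # whitespace characters that a " ".join(text.split())-style normalisation
    # would drop before position `pos`
    prefix = text[:pos]
    leading = len(prefix) - len(prefix.lstrip())
    doubled = sum(len(run) - 1 for run in re.findall(r"\s{2,}", prefix.lstrip()))
    return leading + doubled

def recover_true_span(text: str, noisy_selected: str) -> str:
    # step 1 (pre-processing): undo the hypothesised shift to get a clean training label
    start = text.find(noisy_selected)
    end = start + len(noisy_selected)
    return text[start + extra_spaces_before(text, start):end]

def project_back(text: str, pred_start: int, pred_end: int) -> str:
    # step 4 (post-processing): re-apply the shift so the prediction matches the noisy labels
    shift = extra_spaces_before(text, pred_start)
    return text[pred_start - shift:pred_end]

text = "  ROFLMAO for the funny web portal  =D"
print(recover_true_span(text, "e funny"))     # -> "funny"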

Common Methods

Most Kagglers used the following model structure (from a public notebook) as a baseline, and here is the illustration (I copied it from the author; the link is at the bottom of the post). This is the TensorFlow version:

We are given text, selected_text, and sentiment. For the roBERTa model, we prepare the question-answer input as <s> text </s></s> sentiment </s>. Note that the roBERTa tokenizer sometimes creates more than one token for one word. Take the example “Kaggle is a fun webplace!”: the word “webplace” will be split into two tokens, “[web][place]”, by the roBERTa tokenizer.

After converting text and selected_text into tokens, we can then determine the start index and end index of selected_text within text. We will one hot encode these indices. Below are the required inputs and targets for roBERTa. In this example, we have chosen roBERTa with max_len=16, so our input_ids have 2 <pad> tokens.
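As a rough illustration of this preparation step (not the exact code from the public notebook, which used the tokenizers library directly), the Hugging Face fast tokenizer’s offset mapping can be used to locate the start and end tokens. The selected phrase “fun” in the usage line is hypothetical.

import numpy as np
from transformers import RobertaTokenizerFast

MAX_LEN = 16   # matches the max_len used in the illustration
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def build_example(text, selected_text, sentiment):
    char_start = text.find(selected_text)            # character span of selected_text
    char_end = char_start + len(selected_text)

    enc = tokenizer(text, sentiment,                 # "<s> text </s></s> sentiment </s>"
                    max_length=MAX_LEN, padding="max_length",
                    truncation=True, return_offsets_mapping=True)

    # tokens of the first sequence whose character offsets overlap the selected span
    # (assumes the span survives truncation)
    seq_ids = enc.sequence_ids()
    span = [i for i, (a, b) in enumerate(enc["offset_mapping"])
            if seq_ids[i] == 0 and a < char_end and b > char_start]

    start_target = np.zeros(MAX_LEN)
    start_target[span[0]] = 1.0                      # one-hot encoded start index
    end_target = np.zeros(MAX_LEN)
    end_target[span[-1]] = 1.0                       # one-hot encoded end index
    return np.array(enc["input_ids"]), np.array(enc["attention_mask"]), start_target, end_target

ids, mask, start_t, end_t = build_example("Kaggle is a fun webplace!", "fun", "positive")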

We begin with the vanilla TFRobertaModel. This model takes our 16 (max_len) input_ids and outputs 16 vectors, each of length 768. Each vector is a representation of one input token.

We then apply tf.keras.layers.Conv1D(filters=1, kernel_size=1), which transforms this matrix of size (768, 16) into a vector of size (1, 16). Next we apply softmax to this length-16 vector and get a one hot encoding of our start_index. We build another head for our end_index.
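Putting the pieces together, a sketch of this baseline in TensorFlow might look as follows. It roughly follows the public notebook’s architecture; hyperparameters such as the learning rate are illustrative, not the author’s exact settings.

import tensorflow as tf
from transformers import TFRobertaModel

MAX_LEN = 16   # kept at 16 to match the illustration; real runs used longer sequences

def build_model():
    ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32)

    roberta = TFRobertaModel.from_pretrained("roberta-base")
    hidden = roberta(ids, attention_mask=att)[0]        # (batch, MAX_LEN, 768)

    # start head: Conv1D(1, 1) + softmax over the MAX_LEN positions
    x1 = tf.keras.layers.Conv1D(1, 1)(hidden)
    x1 = tf.keras.layers.Flatten()(x1)
    start_probs = tf.keras.layers.Activation("softmax", name="start")(x1)

    # end head: a second, independent Conv1D(1, 1) + softmax
    x2 = tf.keras.layers.Conv1D(1, 1)(hidden)
    x2 = tf.keras.layers.Flatten()(x2)
    end_probs = tf.keras.layers.Activation("softmax", name="end")(x2)

    model = tf.keras.Model(inputs=[ids, att], outputs=[start_probs, end_probs])
    model.compile(loss="categorical_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
    return model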

Most top ranked Kagglers implemented the following two methods: Label Smoothing and Multi-Sample Dropout. So I would like to talk about these methods first before going further.

Label Smoothing

When we apply the cross-entropy loss to a classification task, we expect the true label to be 1 and the others to be 0. In other words, we have no doubt that the true label is correct and the others are not. Is that always true in our case? As we saw with the noisy annotations, the ground-truth labels we place perfect belief in may actually be wrong.

One possible solution is to relax our confidence in the labels. For instance, we can slightly lower the target value from 1 to, say, 0.9, and correspondingly raise the target values of the other classes slightly above 0, so that the targets still sum to 1.
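In Keras this is a one-liner: CategoricalCrossentropy accepts a label_smoothing argument, which is equivalent to mixing the one-hot targets with a uniform distribution. A minimal sketch (ε = 0.1 is just a common choice, not a value from the top solutions):

import tensorflow as tf

# Built-in: smooth the one-hot targets inside the loss.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

# Equivalent manual form: a 1 becomes 1 - eps + eps/K, a 0 becomes eps/K.
def smooth_labels(one_hot, eps=0.1):
    one_hot = tf.convert_to_tensor(one_hot, dtype=tf.float32)
    k = tf.cast(tf.shape(one_hot)[-1], tf.float32)
    return one_hot * (1.0 - eps) + eps / k

In the baseline sketch above, this amounts to passing loss=loss_fn to model.compile instead of the plain categorical cross-entropy.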
