Kaggle: Tweet Sentiment Extraction Solution Summary, Part 1/2: Common Methods

Links to previous posts

Note

This post is the first half of an overall summary of the competition. The second half is here.

Before we start

I took part in two NLP competitions in June, Tweet Sentiment Extraction and Jigsaw Multilingual Toxic Comment Classification, and I’m happy to say I’m a Kaggle Expert from now on 😃

Tweet Sentiment Extraction

Goal:

The objective of this competition is to “Extract support phrases for sentiment labels”. More precisely, the competition asks Kagglers to build a model that can figure out which word or phrase in a given tweet best supports the labeled sentiment. In other words, Kagglers attempt to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.).

For example:

text         : "Sooo SAD I will miss you here in San Diego!!!"
sentiment    : negative
selected_text: "Sooo SAD"

In this competition, state-of-the-art (SOTA) transformer models were already reasonably good at extracting the selected_text. The main challenge was capturing the “noise” in the dataset.

The organizer of this competition did not introduce the “noise” (the magic of the competition) on purpose; it was probably the result of a regex mistake (I’ll talk about the “noise” in the next section). When I analyzed the data, I found some weird selected_text values, as most other teams did. For example,

text         : "  ROFLMAO for the funny web portal  =D"
sentiment    : positive
selected_text: "e funny"
---------------------------------------------------------------
text         : " yea i just got outta one too....i want him 
                 back tho  but i feel the same way...i`m cool 
                 on dudes for a lil while"
sentiment    : positive
selected_text: "m cool"

However, most teams (including mine) did not dig into how such weird selected_text values were produced, or simply treated them as mistakes and chose to ignore them, fearing that trying to correct them would lead to overfitting.

This turned out to be a watershed of the competition. Teams that solved this problem remained among the top positions on the private leaderboard, while those that did not saw their ranks shake to some degree. I found that the scores of the top 30 teams were mostly stable, but beyond those, the private LB had a huge shake-up that was far beyond my expectation. The fun part is: I knew there would be a shake-up in this competition, so I named my team Hope no shake, but it didn’t help at all 😦 . My team was in the silver medal range on the public LB, but our rank dropped to almost 800th on the private LB! What a shame! Some teams were even more unfortunate than us and dropped from the top 50 to the bottom 200…

Anyway, there were still some fancy and interesting solutions among the top ranked teams, and they can be divided into three categories:

  • Solutions with a character-level model only (the first place solution! Awesome!).
  • Solutions with well-designed post-processing.
  • Solutions with both a character-level model and well-designed post-processing.

After the competition, I spent a week trying to understand their solutions and unique ideas, and I really learned a lot. So here I would like to share their ideas with those who are interested.

In the rest of the post, I summarize the top solutions and also add some of my own understanding. The references are at the bottom. So let’s get started!

What is the MAGIC?

This is just a bug introduced when the competition host created the task. Below is a representative example, and we call this the “noise” in the labels.

The given original annotation is “onna”, which is too weird to be intentional. The true annotation should be “miss” (this is a negative sentence). We think the host computed the slice on a normalized version of the text (with leading and consecutive spaces removed) and then applied it to the original text, which contains extra spaces, emojis, or emoticons.
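To make this concrete, here is a minimal sketch that reproduces the observed pattern on the earlier ROFLMAO example, under the assumption that the true annotation was “funny” and that the start index was computed on a whitespace-normalized copy of the text. The host’s actual procedure was never disclosed, so this is only one plausible reconstruction.

def make_noisy_label(text: str, true_phrase: str) -> str:
    """Hypothesised bug: the start index comes from a whitespace-normalised
    copy of the text, while the slice is taken from the original text."""
    start = text.find(true_phrase)            # char start in the original text
    end = start + len(true_phrase)            # char end (exclusive) in the original
    clean = " ".join(text.split())            # leading / repeated spaces removed
    shift = start - clean.find(true_phrase)   # characters dropped before the span
    return text[start - shift:end]

text = "  ROFLMAO for the funny web portal  =D"
print(make_noisy_label(text, "funny"))        # -> "e funny", matching the noisy label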

Here is how to solve it theoretically:

  • Recover true annotation from the buggy annotation (pre-processing).
  • Train model with true annotation.
  • Predict the right annotation.
  • Project back the right annotation to the buggy annotation (post-processing).

Here is the visualization:
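In code, the pre-processing and post-processing steps above might look roughly like this. This is a sketch under the same assumption as before (the shift equals the number of extra whitespace characters before the span); the function names are mine, and the top teams’ actual implementations handled more edge cases.

import re

def extra_spaces_before(text: str, pos: int) -> int:
    # whitespace characters that a " ".join(text.split())-style normalisation
    # would drop before position `pos`
    prefix = text[:pos]
    leading = len(prefix) - len(prefix.lstrip())
    doubled = sum(len(run) - 1 for run in re.findall(r"\s{2,}", prefix.lstrip()))
    return leading + doubled

def recover_true_span(text: str, noisy_selected: str) -> str:
    # step 1 (pre-processing): undo the hypothesised shift to get a clean training label
    start = text.find(noisy_selected)
    end = start + len(noisy_selected)
    return text[start + extra_spaces_before(text, start):end]

def project_back(text: str, pred_start: int, pred_end: int) -> str:
    # step 4 (post-processing): re-apply the shift so the prediction matches the noisy labels
    shift = extra_spaces_before(text, pred_start)
    return text[pred_start - shift:pred_end]

text = "  ROFLMAO for the funny web portal  =D"
print(recover_true_span(text, "e funny"))     # -> "funny"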

Common Methods

Most Kagglers used the following model structure (from a public notebook) as a baseline, and here is the illustration (I copied it from the author; the link is at the bottom of the post). This is the TensorFlow version:

We are given text, selected_text, and sentiment. For the roBERTa model, we prepare the question-answer input as <s> text </s></s> sentiment </s>. Note that the roBERTa tokenizer sometimes creates more than one token for one word. Take the example “Kaggle is a fun webplace!”: the word “webplace” will be split into two tokens, “[web][place]”, by the roBERTa tokenizer.

After converting text and selected_text into tokens, we can then determine the start index and end index of selected_text within text. We will one hot encode these indices. Below are the required inputs and targets for roBERTa. In this example, we have chosen roBERTa with max_len=16, so our input_ids have 2 <pad> tokens.
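As a rough illustration of this preparation step (not the exact code from the public notebook, which used the tokenizers library directly), the Hugging Face fast tokenizer’s offset mapping can be used to locate the start and end tokens. The selected phrase “fun” in the usage line is hypothetical.

import numpy as np
from transformers import RobertaTokenizerFast

MAX_LEN = 16   # matches the max_len used in the illustration
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def build_example(text, selected_text, sentiment):
    char_start = text.find(selected_text)            # character span of selected_text
    char_end = char_start + len(selected_text)

    enc = tokenizer(text, sentiment,                 # "<s> text </s></s> sentiment </s>"
                    max_length=MAX_LEN, padding="max_length",
                    truncation=True, return_offsets_mapping=True)

    # tokens of the first sequence whose character offsets overlap the selected span
    # (assumes the span survives truncation)
    seq_ids = enc.sequence_ids()
    span = [i for i, (a, b) in enumerate(enc["offset_mapping"])
            if seq_ids[i] == 0 and a < char_end and b > char_start]

    start_target = np.zeros(MAX_LEN)
    start_target[span[0]] = 1.0                      # one-hot encoded start index
    end_target = np.zeros(MAX_LEN)
    end_target[span[-1]] = 1.0                       # one-hot encoded end index
    return np.array(enc["input_ids"]), np.array(enc["attention_mask"]), start_target, end_target

ids, mask, start_t, end_t = build_example("Kaggle is a fun webplace!", "fun", "positive")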

We begin with the vanilla TFRobertaModel. This model takes our 16 (max_len) input_ids and outputs 16 vectors, each of length 768. Each vector is a representation of one input token.

We then apply tf.keras.layers.Conv1D(filters=1, kernel_size=1), which transforms this matrix of size (768, 16) into a vector of size (1, 16). Next we apply softmax to this length-16 vector and get a one hot encoding of our start_index. We build another head for our end_index.
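Putting the pieces together, a sketch of this baseline in TensorFlow might look as follows. It roughly follows the public notebook’s architecture; hyperparameters such as the learning rate are illustrative, not the author’s exact settings.

import tensorflow as tf
from transformers import TFRobertaModel

MAX_LEN = 16   # kept at 16 to match the illustration; real runs used longer sequences

def build_model():
    ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32)

    roberta = TFRobertaModel.from_pretrained("roberta-base")
    hidden = roberta(ids, attention_mask=att)[0]        # (batch, MAX_LEN, 768)

    # start head: Conv1D(1, 1) + softmax over the MAX_LEN positions
    x1 = tf.keras.layers.Conv1D(1, 1)(hidden)
    x1 = tf.keras.layers.Flatten()(x1)
    start_probs = tf.keras.layers.Activation("softmax", name="start")(x1)

    # end head: a second, independent Conv1D(1, 1) + softmax
    x2 = tf.keras.layers.Conv1D(1, 1)(hidden)
    x2 = tf.keras.layers.Flatten()(x2)
    end_probs = tf.keras.layers.Activation("softmax", name="end")(x2)

    model = tf.keras.Model(inputs=[ids, att], outputs=[start_probs, end_probs])
    model.compile(loss="categorical_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
    return model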

Most top ranked Kagglers implemented the following two methods: Label Smoothing and Multi-Sample Dropout. So I would like to talk about these methods first before going further.

Label Smoothing

When we apply the cross-entropy loss to a classification task, we expect the true label to be 1 and the others to be 0. In other words, we have no doubt that the true label is correct and the others are not. Is that always true in our case? As we saw with the noisy annotations, the ground-truth labels we place perfect belief in may actually be wrong.

One possible solution is to relax our confidence in the labels. For instance, we can slightly lower the target value from 1 to, say, 0.9, and correspondingly raise the target values of the other classes slightly above 0, so that the targets still sum to 1.
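In Keras this is a one-liner: CategoricalCrossentropy accepts a label_smoothing argument, which is equivalent to mixing the one-hot targets with a uniform distribution. A minimal sketch (ε = 0.1 is just a common choice, not a value from the top solutions):

import tensorflow as tf

# Built-in: smooth the one-hot targets inside the loss.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

# Equivalent manual form: a 1 becomes 1 - eps + eps/K, a 0 becomes eps/K.
def smooth_labels(one_hot, eps=0.1):
    one_hot = tf.convert_to_tensor(one_hot, dtype=tf.float32)
    k = tf.cast(tf.shape(one_hot)[-1], tf.float32)
    return one_hot * (1.0 - eps) + eps / k

In the baseline sketch above, this amounts to passing loss=loss_fn to model.compile instead of the plain categorical cross-entropy.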
