Note
This post is the second part of an overall summary of the competition. The first half is here.
Noteworthy ideas in the 1st-place solution
Idea
First step:
Use transformers to extract token-level start and end probabilities.
Second step:
Feed these probabilities to a character-level model. This step gave the team a huge improvement on the final score, since it handled the “noise” in the data properly.
Last step:
Ensemble.
Second-level model architectures
The following three Char-NN architectures use character-level probabilities as input. The first-level models output token-level probabilities, and the code below converts them to character-level probabilities by assigning to each character the probability of the token that contains it.
import numpy as np

def token_level_to_char_level(text, offsets, preds):
    """Spread each token's probability over the characters it covers."""
    probas_char = np.zeros(len(text))
    for i, offset in enumerate(offsets):
        if offset[0] or offset[1]:  # skip padding and sentiment/special tokens, which have offset (0, 0)
            probas_char[offset[0]:offset[1]] = preds[i]
    return probas_char
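For illustration, a toy call might look like the following. The text, offsets, and predictions here are made up; in the actual pipeline the offsets come from the transformer tokenizer and the predictions from the first-level models.

text = "I love it"
offsets = [(0, 0), (0, 1), (2, 6), (7, 9)]   # (0, 0) marks a special/sentiment token
preds = [0.0, 0.1, 0.8, 0.3]                 # token-level start (or end) probabilities
char_probas = token_level_to_char_level(text, offsets, preds)
# char_probas -> [0.1, 0.0, 0.8, 0.8, 0.8, 0.8, 0.0, 0.3, 0.3]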
Things you need to know about nn.Embedding
The following architectures all train the embedding from scratch, so let us briefly discuss how nn.Embedding works.
nn.Embedding holds a tensor of dimension (vocab_size, vector_size), i.e., (the size of the vocabulary, the dimension of each embedding vector), and provides a method that does the lookup. When you create an embedding layer, this tensor is initialised randomly. You can also load pretrained weights with nn.Embedding.from_pretrained(weight).
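As a minimal sketch (the vocabulary size and embedding dimension here are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

# Randomly initialised embedding table: 10 ids, 4-dimensional vectors
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

token_ids = torch.tensor([[1, 2, 4, 5]])   # (batch, seq_len)
vectors = embedding(token_ids)             # (1, 4, 4): one vector per token id

# Starting from pretrained weights instead of a random init
weight = torch.randn(10, 4)                # stand-in for real pretrained vectors
pretrained = nn.Embedding.from_pretrained(weight, freeze=False)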
Architecture 1: RNN
In the following, the parameter len_voc is calculated by

tokenizer.fit_on_texts(df_train['text'].values)
len_voc = len(tokenizer.word_index) + 1
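Here tokenizer is presumably a character-level Keras Tokenizer fitted on the training texts; a sketch of that setup might look like this (the char_level, lower, and filters arguments are assumptions, not taken from the original post):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(char_level=True, lower=False, filters='')
tokenizer.fit_on_texts(df_train['text'].values)
len_voc = len(tokenizer.word_index) + 1   # +1 because Keras reserves index 0 for padding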
Compare the following code with the figure above.
class TweetCharModel(nn.Module):
    # check the config in the original code post
    def __init__(self, len_voc, use_msd=True,
                 embed_dim=64, lstm_dim=64, char_embed_dim=32, sent_embed_dim=32, ft_lstm_dim=32, n_models=1):
        super().__init__()
        self.use_msd = use_msd

        self.char_embeddings = nn.Embedding(len_voc, char_embed_dim)
        self.sentiment_embeddings = nn.Embedding(3, sent_embed_dim)  # 3 sentiments

        self.proba_lstm = nn.LSTM(n_models * 2, ft_lstm_dim, batch_first=True, bidirectional=True)
        self.lstm = nn.LSTM(char_embed_dim + ft_lstm_dim * 2 + sent_embed_dim, lstm_dim,
                            batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(lstm_dim * 2, lstm_dim, batch_first=True, bidirectional=True)

        self.logits = nn.Sequential(
            nn.Linear(lstm_dim * 4, lstm_dim),
            nn.ReLU(),
            nn.Linear(lstm_dim, 2))

        self.high_dropout = nn.Dropout(p=0.5)

    def forward(self, tokens, sentiment, start_probas, end_probas):
        bs, T = tokens.size()

        # Features extracted from the first-level start/end probabilities
        probas = torch.cat([start_probas, end_probas], -1)
        probas_fts, _ = self.proba_lstm(probas)

        char_fts = self.char_embeddings(tokens)

        sentiment_fts = self.sentiment_embeddings(sentiment).view(bs, 1, -1)
        sentiment_fts = sentiment_fts.repeat((1, T, 1))

        features = torch.cat([char_fts, sentiment_fts, probas_fts], -1)
        features, _ = self.lstm(features)
        features2, _ = self.lstm2(features)
        features = torch.cat([features, features2], -1)

        # Multi-sample dropout (MSD): average logits over several dropout masks
        if self.use_msd and self.training:
            logits = torch.mean(
                torch.stack(
                    [self.logits(self.high_dropout(features)) for _ in range(5)],
                    dim=0),
                dim=0)
        else:
            logits = self.logits(features)

        start_logits, end_logits = logits[:, :, 0], logits[:, :, 1]
        return start_logits, end_logits
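As a rough illustration of the expected shapes (the batch size, sequence length, and single first-level model below are assumptions for this sketch, not values from the original pipeline):

model = TweetCharModel(len_voc=len_voc, n_models=1)

bs, T = 4, 150                                 # batch of 4 tweets, padded to 150 characters
tokens = torch.randint(1, len_voc, (bs, T))    # character ids
sentiment = torch.randint(0, 3, (bs,))         # 0/1/2 for the three sentiments
start_probas = torch.rand(bs, T, 1)            # char-level start probas from one first-level model
end_probas = torch.rand(bs, T, 1)              # char-level end probas from one first-level model

start_logits, end_logits = model(tokens, sentiment, start_probas, end_probas)
# start_logits.shape == end_logits.shape == (bs, T)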
Architecture 2: CNN
class ConvBlock(nn.Module):
    # check the config in the original code post
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, dilation=1, padding="same", use_bn=True):
        super().__init__()
        if padding == "same":
            padding = kernel_size // 2 * dilation

        if use_bn:
            self.conv = nn.Sequential(
                nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding, stride=stride, dilation=dilation),
                nn.BatchNorm1d(out_channels),
                nn.ReLU())
        else:
            self.conv = nn.Sequential(
                nn.Conv1d(in_channels, out_channels, kernel_size, padding=