Kaggle: Tweet Sentiment Extraction Methods Summary Part 2/2: Summary of Gold-Medal Ideas

Index of previous posts

Note

This post is the second half of an overall summary of the competition. The first half is here.

Noteworthy ideas in 1st place solution

Idea

First step:

Use transformer models to extract token-level start and end probabilities.

Second step:

Feed these probabilities to a character-level model. This step gave the team a huge improvement in the final score, since it handled the "noise" in the data properly.

Last step:

Ensemble.

Second-level model architectures

The following three Char-NN architectures use character-level probabilities as input. The first-level models output token-level probabilities, and the code below converts those to character-level probabilities by assigning each character the probability of its corresponding token.

import numpy as np

def token_level_to_char_level(text, offsets, preds):
    # Spread each token's probability over the characters it covers.
    probas_char = np.zeros(len(text))
    for i, offset in enumerate(offsets):
        if offset[0] or offset[1]:  # skip special tokens (padding, sentiment), whose offsets are (0, 0)
            probas_char[offset[0]:offset[1]] = preds[i]

    return probas_char
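
As a quick sanity check, here is a hypothetical call; the text, offsets, and predictions are all made up, and the (0, 0) offset stands for a special token:

text = "so sad"
offsets = [(0, 0), (0, 2), (3, 6)]  # special token, "so", "sad"
preds = [0.1, 0.2, 0.9]             # token-level probabilities

print(token_level_to_char_level(text, offsets, preds))
# [0.2 0.2 0.  0.9 0.9 0.9] -- the space at index 2 keeps probability 0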

Things you need to know about nn.Embedding

The following architectures all train their embeddings from scratch, so let's briefly discuss how nn.Embedding works.

nn.Embedding holds a tensor of shape (vocab_size, vector_size), i.e., (the size of the vocabulary, the dimension of each embedding vector), and provides a method that does the lookup. When you create an embedding layer, the tensor is initialised randomly.

You can also load pretrained weights via nn.Embedding.from_pretrained(weight).
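
A minimal sketch with toy sizes (a vocabulary of 10 tokens and 4-dimensional vectors; all numbers here are made up):

import torch
import torch.nn as nn

# Toy embedding: vocabulary of 10 tokens, 4-dimensional vectors.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # weights initialised randomly

# The lookup turns a (batch, seq_len) tensor of token ids into a
# (batch, seq_len, embedding_dim) tensor of vectors.
token_ids = torch.tensor([[1, 2, 4], [4, 3, 9]])
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([2, 3, 4])

# Loading pretrained weights instead of training from scratch:
weight = torch.randn(10, 4)  # stand-in for real pretrained vectors
pretrained = nn.Embedding.from_pretrained(weight)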

Architecture 1: RNN

In the following, the parameter len_voc is computed with a Keras Tokenizer fitted on the training texts; char_level=True is an assumption here, since the second-level models operate on characters.

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True)  # assumption: character-level vocabulary
tokenizer.fit_on_texts(df_train['text'].values)
len_voc = len(tokenizer.word_index) + 1  # +1 for the padding index 0

Compare the following code with the architecture figure in the original solution post.

class TweetCharModel(nn.Module):
    # check the config in the original code post
    def __init__(self, len_voc, use_msd=True,
                 embed_dim=64, lstm_dim=64, char_embed_dim=32, sent_embed_dim=32, ft_lstm_dim=32, n_models=1):
        super().__init__()
        self.use_msd = use_msd
        
        self.char_embeddings = nn.Embedding(len_voc, char_embed_dim)
        self.sentiment_embeddings = nn.Embedding(3, sent_embed_dim) # 3 sentiments
        
        self.proba_lstm = nn.LSTM(n_models * 2, ft_lstm_dim, batch_first=True, bidirectional=True)
        
        self.lstm = nn.LSTM(char_embed_dim + ft_lstm_dim * 2 + sent_embed_dim, lstm_dim, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(lstm_dim * 2, lstm_dim, batch_first=True, bidirectional=True)

        self.logits = nn.Sequential(
            nn.Linear(lstm_dim * 4, lstm_dim),  # lstm and lstm2 outputs concatenated: 2 * (lstm_dim * 2)
            nn.ReLU(),
            nn.Linear(lstm_dim, 2))
        
        self.high_dropout = nn.Dropout(p=0.5)
    
    def forward(self, tokens, sentiment, start_probas, end_probas):
        bs, T = tokens.size()
        
        probas = torch.cat([start_probas, end_probas], -1)
        probas_fts, _ = self.proba_lstm(probas)

        char_fts = self.char_embeddings(tokens)
        
        sentiment_fts = self.sentiment_embeddings(sentiment).view(bs, 1, -1)
        sentiment_fts = sentiment_fts.repeat((1, T, 1))
        
        features = torch.cat([char_fts, sentiment_fts, probas_fts], -1)
        features, _ = self.lstm(features)
        features2, _ = self.lstm2(features)
        
        features = torch.cat([features, features2], -1)
        
        # Multi-sample dropout (MSD)
        if self.use_msd and self.training:
            logits = torch.mean(
                torch.stack(
                    [self.logits(self.high_dropout(features)) for _ in range(5)],
                    dim=0),
                dim=0)
        else:
            logits = self.logits(features)

        start_logits, end_logits = logits[:, :, 0], logits[:, :, 1]

        return start_logits, end_logits
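
A hedged smoke test with made-up shapes (batch of 2, sequences of 16 characters, probabilities from a single first-level model) to make the expected inputs and outputs concrete:

model = TweetCharModel(len_voc=50)
tokens = torch.randint(0, 50, (2, 16))   # character ids
sentiment = torch.randint(0, 3, (2,))    # sentiment id in {0, 1, 2}
start_probas = torch.rand(2, 16, 1)      # (batch, T, n_models)
end_probas = torch.rand(2, 16, 1)

start_logits, end_logits = model(tokens, sentiment, start_probas, end_probas)
print(start_logits.shape, end_logits.shape)  # torch.Size([2, 16]) twice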

Architecture 2: CNN

class ConvBlock(nn.Module):
    # check the config in the original code post
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, dilation=1, padding="same", use_bn=True):
        super().__init__()
        if padding == "same":
            padding = kernel_size // 2 * dilation
        
        if use_bn:
            self.conv = nn.Sequential(
                nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding, stride=stride, dilation=dilation),
                nn.BatchNorm1d(out_channels),
                nn.ReLU())
        else:
            self.conv = nn.Sequential(
                nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding, stride=stride, dilation=dilation),
                nn.ReLU())  # same as above, just without batch normalization

    def forward(self, x):
        # x: (batch, channels, length); "same" padding keeps the length unchanged
        return self.conv(x)
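
A quick shape check with hypothetical sizes, assuming the completed block above:

block = ConvBlock(in_channels=128, out_channels=64, kernel_size=3)
x = torch.randn(2, 128, 16)  # (batch, channels, length)
print(block(x).shape)        # torch.Size([2, 64, 16]) -- length preserved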