Note
This post is the second part of an overall summary of the competition. The first half is here.
Noteworthy ideas in the 1st-place solution
Idea
First step:
Use transformers to extract token-level start and end probabilities.
Second step:
Feed these probabilities to a character-level model. This step gave the team a huge improvement on the final score, since it handled the “noise” in the data properly.
Last step:
Ensemble.
Second-level model architectures
The following three Char-NN architectures use character-level probabilities as input. The first-level models output token-level probabilities, and the code below converts them to character-level probabilities by assigning to each character the probability of the token that contains it.
import numpy as np

def token_level_to_char_level(text, offsets, preds):
    """Spread each token's probability over the characters it covers."""
    probas_char = np.zeros(len(text))
    for i, offset in enumerate(offsets):
        if offset[0] or offset[1]:  # skip padding and sentiment/special tokens, which have offset (0, 0)
            probas_char[offset[0]:offset[1]] = preds[i]
    return probas_char
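For illustration, a toy call might look like the following. The text, offsets, and predictions here are made up; in the actual pipeline the offsets come from the transformer tokenizer and the predictions from the first-level models.

text = "I love it"
offsets = [(0, 0), (0, 1), (2, 6), (7, 9)]   # (0, 0) marks a special/sentiment token
preds = [0.0, 0.1, 0.8, 0.3]                 # token-level start (or end) probabilities
char_probas = token_level_to_char_level(text, offsets, preds)
# char_probas -> [0.1, 0.0, 0.8, 0.8, 0.8, 0.8, 0.0, 0.3, 0.3]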
Things you need to know about nn.Embedding
The following architectures all train the embedding from scratch, so let us briefly discuss how nn.Embedding works.
nn.Embedding holds a tensor of dimension (vocab_size, vector_size), i.e., (the size of the vocabulary, the dimension of each embedding vector), and provides a method that does the lookup. When you create an embedding layer, this tensor is initialised randomly. You can also load pretrained weights with nn.Embedding.from_pretrained(weight).
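As a minimal sketch (the vocabulary size and embedding dimension here are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

# Randomly initialised embedding table: 10 ids, 4-dimensional vectors
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

token_ids = torch.tensor([[1, 2, 4, 5]])   # (batch, seq_len)
vectors = embedding(token_ids)             # (1, 4, 4): one vector per token id

# Starting from pretrained weights instead of a random init
weight = torch.randn(10, 4)                # stand-in for real pretrained vectors
pretrained = nn.Embedding.from_pretrained(weight, freeze=False)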
Architecture 1: RNN
In the following, the parameter len_voc is calculated by

tokenizer.fit_on_texts(df_train['text'].values)
len_voc = len(tokenizer.word_index) + 1
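Here tokenizer is presumably a character-level Keras Tokenizer fitted on the training texts; a sketch of that setup might look like this (the char_level, lower, and filters arguments are assumptions, not taken from the original post):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(char_level=True, lower=False, filters='')
tokenizer.fit_on_texts(df_train['text'].values)
len_voc = len(tokenizer.word_index) + 1   # +1 because Keras reserves index 0 for padding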
Compare the following code with the figure above.
class TweetCharModel(nn.Module):
    # check the config in the original code post
    def __init__(self, len_voc, use_msd=True,
                 embed_dim=64, lstm_dim=64, char_embed_dim=32, sent_embed_dim=32, ft_lstm_dim=32, n_models=1):
        super().__init__()
        self.use_msd = use_msd

        self.char_embeddings = nn.Embedding(len_voc, char_embed_dim)
        self.sentiment_embeddings = nn.Embedding(3, sent_embed_dim)  # 3 sentiments

        self.proba_lstm = nn.LSTM(n_models * 2, ft_lstm_dim, batch_first=True, bidirectional=True)
        self.lstm = nn.LSTM(char_embed_dim + ft_lstm_dim * 2 + sent_embed_dim, lstm_dim,
                            batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(lstm_dim * 2, lstm_dim, batch_first=True, bidirectional=True)

        self.logits = nn.Sequential(
            nn.Linear(lstm_dim * 4, lstm_dim),
            nn.ReLU(),
            nn.Linear(lstm_dim, 2))

        self.high_dropout = nn.Dropout(p=0.5)

    def forward(self, tokens, sentiment, start_probas, end_probas):
        bs, T = tokens.size()

        # Features extracted from the first-level start/end probabilities
        probas = torch.cat([start_probas, end_probas], -1)
        probas_fts, _ = self.proba_lstm(probas)

        char_fts = self.char_embeddings(tokens)

        sentiment_fts = self.sentiment_embeddings(sentiment).view(bs, 1, -1)
        sentiment_fts = sentiment_fts.repeat((1, T, 1))

        features = torch.cat([char_fts, sentiment_fts, probas_fts], -1)
        features, _ = self.lstm(features)
        features2, _ = self.lstm2(features)
        features = torch.cat([features, features2], -1)

        # Multi-sample dropout (MSD): average logits over several dropout masks
        if self.use_msd and self.training:
            logits = torch.mean(
                torch.stack(
                    [self.logits(self.high_dropout(features)) for _ in range(5)],
                    dim=0),
                dim=0)
        else:
            logits = self.logits(features)

        start_logits, end_logits = logits[:, :, 0], logits[:, :, 1]
        return start_logits, end_logits
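As a rough illustration of the expected shapes (the batch size, sequence length, and single first-level model below are assumptions for this sketch, not values from the original pipeline):

model = TweetCharModel(len_voc=len_voc, n_models=1)

bs, T = 4, 150                                 # batch of 4 tweets, padded to 150 characters
tokens = torch.randint(1, len_voc, (bs, T))    # character ids
sentiment = torch.randint(0, 3, (bs,))         # 0/1/2 for the three sentiments
start_probas = torch.rand(bs, T, 1)            # char-level start probas from one first-level model
end_probas = torch.rand(bs, T, 1)              # char-level end probas from one first-level model

start_logits, end_logits = model(tokens, sentiment, start_probas, end_probas)
# start_logits.shape == end_logits.shape == (bs, T)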
Architecture 2: CNN
class ConvBlock(nn.Module):
    # check the config in the original code post
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, dilation=1, padding="same", use_bn=True):
        super().__init__()
        if padding == "same":
            padding = kernel_size // 2 * dilation

        if use_bn:
            self.conv = nn.Sequential(
                nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding, stride=stride, dilation=dilation),
                nn.BatchNorm1d(out_channels),
                nn.ReLU())
        else:
            self.conv = nn.Sequential(
                nn.Conv1d(in_channels, out_channels, kernel_size, padding=