1 Introduction
Figure 1: Architecture of the multi-expert model
As the diagram shows, the model categorizes mention pairs into 6 types and has a separate trainable scoring function for each category.
2 Six types of mention pairs
- PRON-PRON-C: compatible pronouns based on their attributes such as gender, number and animacy;
- PRON-PRON-NC: incompatible pronouns;
- ENT-PRON: a pronoun and another span;
- MATCH: non-pronoun spans with the same content words;
- CONTAINS: non-pronoun spans where one contains the content words of the other (see the example below);
- OTHER: all other pairs.
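CONTAINS example
He reportedly showed DeLay a videotape that made him weep. Tom DeLay then ...
Here the span "Tom DeLay" contains the content word of the earlier span "DeLay", so the pair (DeLay, Tom DeLay) falls into CONTAINS.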
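The six categories above can be illustrated with a small rule-based classifier. This is only a sketch under strong assumptions: the pronoun table, the stop-word list, and the function names (`pair_type`, `compatible`, `content_words`) are hypothetical simplifications, not the rules used in the paper.

```python
import re

# Hypothetical, heavily simplified pronoun attributes: (gender, number, animate).
# None means "unspecified" and is treated as compatible with any value.
PRONOUNS = {
    "he": ("m", "sg", True), "him": ("m", "sg", True), "his": ("m", "sg", True),
    "she": ("f", "sg", True), "her": ("f", "sg", True),
    "it": ("n", "sg", False), "its": ("n", "sg", False),
    "they": (None, "pl", None), "them": (None, "pl", None),
}
# Hypothetical stop-word list used to extract content words.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def content_words(span):
    """Lower-cased tokens of a span with stop words removed."""
    return {w for w in re.findall(r"[\w']+", span.lower()) if w not in STOP_WORDS}


def compatible(p1, p2):
    """Two pronouns are compatible if no specified attribute disagrees."""
    return all(a is None or b is None or a == b
               for a, b in zip(PRONOUNS[p1], PRONOUNS[p2]))


def pair_type(c, q):
    """Return the category of the mention pair (candidate c, query q)."""
    c_l, q_l = c.lower(), q.lower()
    c_pron, q_pron = c_l in PRONOUNS, q_l in PRONOUNS
    if c_pron and q_pron:
        return "PRON-PRON-C" if compatible(c_l, q_l) else "PRON-PRON-NC"
    if c_pron or q_pron:
        return "ENT-PRON"
    cw, qw = content_words(c), content_words(q)
    if cw and cw == qw:
        return "MATCH"
    if cw and qw and (cw <= qw or qw <= cw):
        return "CONTAINS"
    return "OTHER"


print(pair_type("DeLay", "Tom DeLay"))  # CONTAINS
print(pair_type("DeLay", "him"))        # ENT-PRON
```

Because the function is deterministic, every pair gets exactly one category, which is what lets the model route each pair to a single expert scorer.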
3 Mathematical expression
Shared scoring function (computes the pairwise scores)
$$f(c, q) = \begin{cases} f_m(c) + f_m(q) + f_a(c, q) & c \neq \varepsilon \\ 0 & c = \varepsilon \end{cases}$$
where
- $c$ is a candidate span;
- $q$ is a query span (mention); $c$ appears before $q$ in the text;
- $\varepsilon$ is the null antecedent;
- $f_m$ scores each individual span;
- $f_a$ scores the pairwise interaction; $f_m$ and $f_a$ are parameterized functions;
- $f(c, q)$ is the pairwise score for the candidate span $c$ and the mention $q$;
- $f$ is a shared function: the same function scores all mention pairs. The paper does not give explicit mathematical expressions for $f_m$ and $f_a$.
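Sum of probabilities
For each possible mention $q$, the learning objective optimizes the sum of probabilities over the true antecedents of $q$:
$$L(q) = \log \sum_{c \in C(q) \cap \mathrm{GOLD}(q)} P(c \mid q)$$
where
- $L(q)$ is the objective for the mention $q$, the log of the summed probabilities;
- $C(q)$ is the set of all candidate antecedents of $q$, including the null antecedent $\varepsilon$;
- $\mathrm{GOLD}(q)$ is the set of the true antecedents of $q$;
- $P(c \mid q)$ is the probability that the candidate span $c$ is the correct antecedent given the mention $q$:
$$P(c \mid q) = \frac{\exp f(c, q)}{\sum_{c' \in C(q)} \exp f(c', q)}$$
where
- $c$ is a specific candidate span;
- $c'$ ranges over all candidate spans in $C(q)$ considered for the mention $q$ (including $c$ itself).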
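A small numeric sketch may make this objective concrete. It assumes the probability is a softmax over the pairwise scores of all candidates, with the null antecedent's score fixed to 0. The function name and the `"EPS"` key are hypothetical.

```python
import math


def antecedent_log_likelihood(scores, gold):
    """Return log of the summed softmax probability of the gold antecedents.

    scores: dict mapping every candidate (including the null antecedent,
            whose score is fixed to 0) to its pairwise score with the mention.
    gold:   set of candidates that are true antecedents of the mention.
    """
    def log_sum_exp(values):
        m = max(values)  # subtract the maximum for numerical stability
        return m + math.log(sum(math.exp(v - m) for v in values))

    log_z = log_sum_exp(list(scores.values()))         # log of the softmax denominator
    log_gold = log_sum_exp([scores[c] for c in gold])  # log of the summed gold numerators
    return log_gold - log_z


# Three candidates with equal scores, two of them gold: probability mass 2/3.
toy_scores = {"EPS": 0.0, "DeLay": 0.0, "a videotape": 0.0}
value = antecedent_log_likelihood(toy_scores, {"EPS", "DeLay"})
print(math.isclose(value, math.log(2 / 3)))  # True
```

Summing inside the log means the model is rewarded for putting probability on any true antecedent, not on one specific antecedent.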
LINGMESS
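Pairwise scoring function
$$F(c, q) = \begin{cases} f(c, q) + f^{T(c, q)}(c, q) & c \neq \varepsilon \\ 0 & c = \varepsilon \end{cases}$$
where
- $T(c, q)$ is a deterministic, rule-based function that determines the category of the pair $(c, q)$;
- $f^{T(c, q)}(c, q)$ is the category-specific score of $c$ as the antecedent of $q$.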
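The routing idea can be sketched in a few lines, assuming the final score is the shared score plus the score of the one expert selected by the category function, and that the null antecedent always scores 0. All names here (`lingmess_score`, the toy scorers) are hypothetical; in the real model the scorers are trained neural functions.

```python
from typing import Callable, Dict, Optional

Scorer = Callable[[str, str], float]


def lingmess_score(c: Optional[str], q: str, shared: Scorer,
                   experts: Dict[str, Scorer],
                   categorize: Callable[[str, str], str]) -> float:
    """Final pairwise score: shared score plus the score of the selected expert."""
    if c is None:          # null antecedent: the score is fixed to 0
        return 0.0
    t = categorize(c, q)   # deterministic, rule-based category of the pair
    return shared(c, q) + experts[t](c, q)


# Toy stand-ins (hypothetical constants instead of trained scorers).
toy_experts = {"ENT-PRON": lambda c, q: 2.0, "OTHER": lambda c, q: -1.0}
toy_shared = lambda c, q: 0.5


def toy_categorize(c: str, q: str) -> str:
    return "ENT-PRON" if q.lower() in {"he", "him"} else "OTHER"


print(lingmess_score("DeLay", "him", toy_shared, toy_experts, toy_categorize))  # 2.5
print(lingmess_score(None, "him", toy_shared, toy_experts, toy_categorize))     # 0.0
```

Only one expert contributes to each pair, so each expert can specialize in its own category.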
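Training
$$L_{\mathrm{coref}}(q) = \log \sum_{c \in C(q) \cap \mathrm{GOLD}(q)} P(c \mid q)$$
where
- $L_{\mathrm{coref}}(q)$ is the objective function that the model optimizes for the mention $q$; here $P(c \mid q)$ is a softmax over the LINGMESS pairwise scores $F(c, q)$.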
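Training objective when each “expert” is also trained separately
$$L_t(q) = \log \sum_{c \in C_t(q) \cap \mathrm{GOLD}(q)} P_t(c \mid q), \qquad P_t(c \mid q) = \frac{\exp f^{t}(c, q)}{\sum_{c' \in C_t(q)} \exp f^{t}(c', q)}$$
where
- $P_t(c \mid q)$ stands for the model's predictions for the correct antecedents of the mention $q$ under a specific type $t$; $C_t(q)$ contains the candidates whose pair with $q$ has type $t$;
- $t$ is in {PRON-PRON-C, PRON-PRON-NC, ENT-PRON, MATCH, CONTAINS, OTHER}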
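The per-expert objective can be sketched by grouping the candidates of one mention by type and normalizing within each group. Two assumptions are made here and are not taken from the notes: the null antecedent (hypothetical key `"EPS"`, score 0) is added to every group, and a group without any true antecedent falls back to the null antecedent as its gold label.

```python
import math
from collections import defaultdict


def per_type_log_likelihoods(candidates, gold, null="EPS"):
    """Return {type: per-type objective} for one mention.

    candidates: list of (candidate, type, expert_score) triples.
    gold:       set of true antecedents of the mention.
    """
    by_type = defaultdict(dict)
    for cand, t, score in candidates:
        by_type[t][cand] = score

    objectives = {}
    for t, group in by_type.items():
        scores = dict(group)
        scores[null] = 0.0                       # assumption: null is in every group
        gold_t = (gold & set(scores)) or {null}  # assumption: fall back to null
        z = sum(math.exp(s) for s in scores.values())
        p_gold = sum(math.exp(scores[c]) for c in gold_t) / z
        objectives[t] = math.log(p_gold)
    return objectives


toy = [("DeLay", "ENT-PRON", 0.0), ("a videotape", "ENT-PRON", 0.0),
       ("He", "PRON-PRON-C", 0.0)]
result = per_type_log_likelihoods(toy, gold={"DeLay"})
print(math.isclose(result["ENT-PRON"], math.log(1 / 3)))     # True
print(math.isclose(result["PRON-PRON-C"], math.log(1 / 2)))  # True
```

Normalizing within a type trains each expert to rank candidates only against competitors of the same category.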
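Final objective for each mention span $q$ is
$$L(q) = L_{\mathrm{coref}}(q) + L_{\mathrm{tasks}}(q)$$
where
$L_{\mathrm{tasks}}(q)$ aggregates the per-type objectives $L_t(q)$ over the type $t$.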
4 Experiment
5 Debug
Using multiple GPUs
A single A6000 (48 GB) does not have enough memory to run the project, so we use nn.DataParallel to train the model on multiple GPUs. With an input batch_size=20, a single GPU has to process the whole batch. With nn.DataParallel the batch is split along the batch dimension into roughly equal slices (for example about 7, 7, and 6 samples on GPU 0, GPU 1, and GPU 2), which relieves the memory pressure on each GPU.
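To enable it, wrap the model:
import torch.nn as nn
model = nn.DataParallel(model)  # replicates the model on every visible GPU
Some details of its behavior are easiest to understand by stepping through the project in a debugger.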
Note: the outputs from every device must have the same shape in all dimensions except the batch dimension (batch_size may differ). Otherwise the gather operation, which runs when the outputs are collected from the replicas, fails with an error. Example (each line shows the shapes of the model outputs on one device):
# ok
torch.Size([]) torch.Size([1, 30, 30]) torch.Size([1, 30, 30]) cuda:2
torch.Size([]) torch.Size([2, 30, 30]) torch.Size([2, 30, 30]) cuda:1
torch.Size([]) torch.Size([2, 30, 30]) torch.Size([2, 30, 30]) cuda:0
# gather failed
torch.Size([]) torch.Size([1, 482, 482]) torch.Size([1, 482, 482]) cuda:1
torch.Size([]) torch.Size([1, 451, 451]) torch.Size([1, 451, 451]) cuda:0
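One possible workaround, sketched here under assumptions, is to pad the variable-size outputs to a fixed size inside `forward` before they are returned. The replicas cannot see each other's sizes, so the target size must be a constant known in advance. `pad_last_two_dims` and `MAX_SPANS` are hypothetical names, and `torch.cat` only stands in for the gather step that nn.DataParallel performs internally.

```python
import torch
import torch.nn.functional as F


def pad_last_two_dims(x: torch.Tensor, size: int, value: float = 0.0) -> torch.Tensor:
    """Pad a [batch, n, n] tensor to [batch, size, size] so all replicas agree on shape."""
    n_rows, n_cols = x.shape[-2], x.shape[-1]
    if n_rows > size or n_cols > size:
        raise ValueError(f"size={size} is smaller than the input shape {tuple(x.shape)}")
    # F.pad takes (left, right) for the last dim first, then for the second-to-last dim.
    return F.pad(x, (0, size - n_cols, 0, size - n_rows), value=value)


MAX_SPANS = 512  # hypothetical upper bound; must be the same constant on every replica

out_gpu0 = pad_last_two_dims(torch.randn(1, 451, 451), MAX_SPANS)
out_gpu1 = pad_last_two_dims(torch.randn(1, 482, 482), MAX_SPANS)
gathered = torch.cat([out_gpu0, out_gpu1], dim=0)  # stands in for DataParallel's gather
print(gathered.shape)  # torch.Size([2, 512, 512])
```

The padded entries carry no information, so downstream code needs a mask or the original sizes to ignore them.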