Actively maintained

A/B: Leaderboards
Leaderboard | Track | Top-1 Model | EM (exact match) | F1 | Acc | MRR | Score |
---|---|---|---|---|---|---|---|
GrailQA | Overall | ReTraCk | 58.136 | 65.285 | | | |
- | Compositional Generalization | ReTraCk | 61.499 | 70.911 | | | |
- | Zero-shot Generalization | ArcaneQ | 49.964 | 58.844 | | | |
PubMedQA | - | Baseline Model | 52.72 | 68.08 | | | |
AmbigQA | Standard setting | Refuel | 44.3(all) 34.8(multi) 15.9(bleu) 10.1 | | | | |
- | Zero-shot setting | SpanSeqGen | 42.2 | 30.8(all) 20.7(multi) | | | |
DREAM | - | ALBERT-xxlarge + DUMA + Multi-Task Learning | | | 91.8 | | |
MathQA | - | Seq2Prog+Cat | | | 37.4 | | |
LC-QuAD 2.0 | | | | | | | |
ComQA | - | | | 22.4 | | | |
QASC | - | UnifiedQA | | | 0.8957 | | |
Quoref | - | CorefRoBERTa | 0.8061 | 0.8670 | | | |
Physical IQA | - | UNICORN | | | 0.9013 | | |
Social IQA | - | UNICORN | | | 0.8315 | | |
CoQA | - | RoBERTa + AT + KD | | 91.4(in-domain) 89.2(out-of-domain) 90.7(overall) | | | |
DROP | - | QDGAT - ALBERT | 0.8704 | 0.9010 | | | |
ARC | - | UnifiedQA + ARC MC/DA + IR | | | 0.8140 | | |
CommonsenseQA | | | | | | | |
ComplexWebQuestions | | | | | | | |
HotpotQA | Distractor Setting | S2G+ | 70.72(ans) 64.30(sup) 48.60(joint) | 83.53(ans) 88.72(sup) 75.45(joint) | | | |
- | Fullwiki Setting | TPRR | 66.95(ans) 59.43(sup) 44.37(joint) | 79.50(ans) 84.25(sup) 70.83(joint) | | | |
OpenBookQA | - | UnifiedQA | | | 0.872 | | |
ProPara Dataset | - | KOALA | 0.704 | 0.777 | | | |
QuAC | - | RoR | | 74.9 | | | |
RACE | - | ALBERT-SingleChoice + transfer learning | | | 91.4 | | |
ReCoRD | - | LUKE | 90.64 | 91.21 | | | |
QAngaroo | WikiHop | RealFormer-large | | | 84.4 | | |
- | MedHop | MedKGQA | | | 64.8 | | |
ShARC | End-to-end Task | DGM | | | 0.774(micro) 0.812(macro) | | |
SWAG | - | DeBERTa | | | 0.9171 | | |
SQuAD | 2.0 | FPNet | 90.871 | 93.183 | | | |
- | 1.1 | LUKE | 90.202 | 95.379 | | | |
TriviaQA | | | | | | | |
Who-did-What | - | GA with word features | | | 0.712(who-did-what) 0.77(cnn) | | |
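The metric columns above (EM, F1, MRR) can be illustrated with a minimal sketch. These are toy implementations for intuition only, not the official evaluation scripts of any leaderboard; the function names and the simple whitespace/lowercase normalization are my own assumptions (official scripts typically also strip punctuation and articles):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # EM: 1.0 iff the normalized answer strings are identical.
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall
    # over tokens shared between prediction and reference.
    pred = prediction.strip().lower().split()
    ref = reference.strip().lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(first_hit_ranks) -> float:
    # MRR over questions; each entry is the 1-based rank of the first
    # correct answer in the ranked candidate list, or 0 if none is correct.
    return sum(1.0 / r for r in first_hit_ranks if r > 0) / len(first_hit_ranks)

print(exact_match("The Eiffel Tower", "the eiffel tower"))    # 1.0
print(token_f1("Eiffel Tower in Paris", "the Eiffel Tower"))  # 4/7 ≈ 0.571
print(mean_reciprocal_rank([1, 2, 0, 4]))                     # 0.4375
```

Accuracy (Acc) is simply the fraction of questions answered correctly, which is why multiple-choice benchmarks such as DREAM, RACE, and SWAG report a single Acc number.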