Method type: self-supervised learning (training data: unlabeled dataset)
1. Problem background
The sentence embeddings produced by BERT are collapsed (the embedding space is anisotropic rather than an isotropic sphere).
BERT-derived native sentence representations are somehow collapsed (Chen and He, 2020), which means almost all sentences are mapped into a small area and therefore produce high similarity.
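To make the collapse concrete, the sketch below computes pairwise cosine similarities of mean-pooled BERT sentence embeddings. This is a minimal illustration assuming the Hugging Face transformers and torch packages; the model name and sentences are chosen for illustration and are not from the paper.

```python
# Minimal sketch of the collapse phenomenon: mean-pooled BERT embeddings of
# unrelated sentences still tend to show high pairwise cosine similarity.
# Assumes Hugging Face transformers / torch; model name and sentences are
# illustrative, not from the paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "The cat sat on the mat.",
    "Quarterly revenue rose by twelve percent.",
    "Photosynthesis converts light into chemical energy.",
]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding tokens
    emb = (hidden * mask).sum(1) / mask.sum(1)     # mean pooling

emb = torch.nn.functional.normalize(emb, dim=-1)
# Off-diagonal entries tend to be high despite the unrelated topics.
print(emb @ emb.T)
```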
2. Contributions
- Uses contrastive learning to obtain better semantic representations.
- Uses data augmentation strategies to construct positive and negative examples from the dataset. (The paper mentions four data augmentation strategies: adversarial attack, token shuffling, cutoff, and dropout; see the sketch after this section.)
The abstract includes a statement about the model's effectiveness: "With only 1,000 unlabeled texts drawn from the target distribution (which is easy to collect in real-world applications), we achieve 35% relative performance gain over BERT."
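Below is a minimal sketch of how three of the four strategies (token shuffling, cutoff, and dropout) can be applied to a batch of token embeddings; the adversarial-attack variant requires gradient computation and is omitted. Tensor shapes and rates are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of three augmentation strategies applied at the
# token-embedding level: token shuffling, token cutoff, and dropout.
import torch

def token_shuffle(emb: torch.Tensor) -> torch.Tensor:
    # Randomly permute the token order of each sequence in the batch.
    B, T, H = emb.shape
    perm = torch.argsort(torch.rand(B, T), dim=1)  # one permutation per example
    return emb.gather(1, perm.unsqueeze(-1).expand(B, T, H))

def token_cutoff(emb: torch.Tensor, rate: float = 0.15) -> torch.Tensor:
    # Zero out a random subset of whole token rows.
    B, T, _ = emb.shape
    keep = (torch.rand(B, T, 1) > rate).float()
    return emb * keep

def embedding_dropout(emb: torch.Tensor, rate: float = 0.1) -> torch.Tensor:
    # Standard elementwise dropout on the embedding matrix.
    return torch.nn.functional.dropout(emb, p=rate, training=True)

# Two independent augmentations of the same batch form a positive pair;
# the other examples in the batch serve as negatives.
emb = torch.randn(4, 16, 768)  # (batch, tokens, hidden); illustrative shape
view1, view2 = token_shuffle(emb), token_cutoff(emb)
```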
3. Method
For the details of the method, see this post: https://blog.csdn.net/Hekena/article/details/129828996
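The contrastive objective typically used in this kind of framework is NT-Xent (normalized temperature-scaled cross-entropy). Below is a minimal sketch assuming the two augmented views have already been pooled into sentence vectors z1 and z2; the temperature value is an illustrative assumption, not necessarily the paper's setting.

```python
# Minimal sketch of the NT-Xent contrastive loss over two augmented views of a
# batch of sentence embeddings.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # (2B, H)
    sim = z @ z.T / temperature                          # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # Row i's positive is its other view: i+B for the first half, i-B for the second;
    # the remaining 2B-2 views in the batch act as negatives.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 768), torch.randn(8, 768))
```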