[NLP] Dataset Preparation: GSAP-NER

Article link: https://blog.csdn.net/Lily_2002_/article/details/135298816

GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets

Paper link

Four key contributions

Manually annotated dataset

Publication sampling

Summary and evaluation after annotating with the tag set

F1 score (screenshot from the "Watermelon Book")

Fine-grained tag set

Evaluation of baseline models

Minimum number of annotations

Conclusion

My takeaways

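Four key contributions

• A manually annotated dataset of 100 full-text computer science publications, containing more than 54,000 entity mentions across 25,857 sentences (Section 3).

• A fine-grained tag set designed to detect scholarly entities and concepts, reflecting how machine learning models and datasets are used and presented in scientific publications (Section 3.1).

• A comprehensive performance evaluation of baseline models on the ten defined entity types (Sections 4 and 5).

• An exploration of the minimum number of annotated publications needed to reach satisfactory performance on the fine-grained scholarly NER task, which can guide future annotation projects (Section 6.2).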

Manually annotated dataset

Publication sampling

(1) Factor 1: popularity

Source: Hugging Face

- the most frequently used models
- the publications linked from those models

(2) Factor 2: diversity

Source: arXiv

- keyword match (i.e., "cs.LG: Machine Learning")
- time frame (i.e., first upload between 2018 and 2022)
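
The two arXiv criteria amount to a simple filter over publication metadata. The sketch below is only an illustration of that idea: the record layout (`categories`, `first_upload`), the sample records, and the function name `select_candidates` are all hypothetical, and this post does not describe the paper's actual sampling pipeline.

```python
from datetime import date

def select_candidates(records, category="cs.LG", first_year=2018, last_year=2022):
    """Keep records in `category` whose first upload year lies in [first_year, last_year]."""
    selected = []
    for rec in records:
        in_category = category in rec["categories"]
        in_window = first_year <= rec["first_upload"].year <= last_year
        if in_category and in_window:
            selected.append(rec)
    return selected

# Made-up records, for illustration only (hypothetical field names).
records = [
    {"id": "paper-a", "categories": ["cs.LG"], "first_upload": date(2019, 5, 1)},
    {"id": "paper-b", "categories": ["cs.CV"], "first_upload": date(2020, 1, 15)},
    {"id": "paper-c", "categories": ["cs.LG", "stat.ML"], "first_upload": date(2017, 12, 31)},
]

print([r["id"] for r in select_candidates(records)])
```

Only "paper-a" passes both checks here: "paper-b" is in the wrong category and "paper-c" was first uploaded before 2018.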

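Summary and evaluation after annotating with the tag set

14% of the source publications were annotated jointly by three annotators. In the accompanying figure, the left-hand table reports the inter-annotator F1 score: one annotator's labels are treated as the reference and another's as the prediction, and then the roles are reversed. Partial match means that partially overlapping spans count as a match, so differing annotation boundaries are not treated as errors.

The right-hand table in the same figure gives statistics for the annotated dataset. "Unique" means a mention appears only once in the whole dataset; for example, if ABCmodel occurs only once across all the articles, it counts as unique.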
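
To make exact versus partial matching concrete, here is a minimal sketch of span-level F1 between two annotators. It rests on assumptions of mine, not on the paper's evaluation script: spans are `(start, end_exclusive, label)` tuples, a partial match requires the same label plus any overlap, and the label names and sample spans are invented for illustration.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share a label and at least one position."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def span_f1(reference, predicted, partial=False):
    """Span-level F1; `reference` plays the gold role, `predicted` the system role."""
    if partial:
        match = overlaps
    else:
        match = lambda a, b: a == b
    # Precision: share of predicted spans that match some reference span.
    matched_pred = sum(any(match(p, r) for r in reference) for p in predicted)
    # Recall: share of reference spans that are matched by some predicted span.
    matched_ref = sum(any(match(r, p) for p in predicted) for r in reference)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented annotations: the second spans differ only in their right boundary.
annotator_a = [(0, 2, "MLModel"), (5, 7, "Dataset"), (10, 11, "Dataset")]
annotator_b = [(0, 2, "MLModel"), (5, 8, "Dataset")]

exact = span_f1(annotator_a, annotator_b)
partial = span_f1(annotator_a, annotator_b, partial=True)
print(round(exact, 2), round(partial, 2))  # 0.4 0.8
```

With this definition, reversing the roles swaps precision and recall, so the F1 value itself stays the same. The boundary disagreement on the second span is penalized under exact match but forgiven under partial match.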

F1 score (screenshot from the "Watermelon Book", i.e. Zhou Zhihua's Machine Learning)
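
The screenshot is not reproduced here. For reference, the standard definition in terms of true positives (TP), false positives (FP), and false negatives (FN), with precision P and recall R, is:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.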

Fine-grained tag set

The tag set is listed in the table, and an annotated example is shown in the figure.
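
One way to picture an annotated sentence is as a list of labeled token spans turned into per-token tags. The sketch below uses BIO encoding, a common convention for flat NER that I am assuming here; this post does not say which encoding the paper uses. The label names `MLModel` and `Dataset` and the sample sentence are illustrative placeholders, and the helper assumes spans do not overlap.

```python
def spans_to_bio(tokens, spans):
    """Convert non-overlapping (start, end_exclusive, label) token spans to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

# Illustrative sentence and labels (placeholders, not taken from the corpus).
tokens = ["We", "fine-tune", "BERT", "on", "the", "SQuAD", "dataset", "."]
spans = [(2, 3, "MLModel"), (5, 6, "Dataset")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```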

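Evaluation of baseline models

Once the dataset exists, it needs baselines. How were the baseline models chosen? According to the authors' survey, most current work still follows the "pre-train, fine-tune, predict" paradigm, while models following the other popular trend, the "pre-train, prompt, predict" paradigm, do not perform as well. The authors therefore chose the "pre-train, fine-tune, predict" paradigm.

How were the pre-trained models chosen? The paper studies entity annotation for "models" and "datasets", both of which occur in scholarly articles, so the authors picked the pre-trained models listed in the table. Two scores were reported for them, and SciDeBERTa-CS performed very well. Why? It was pre-trained on computer science text, so it matches the subject of this paper more closely than the other pre-trained models, and its strong performance is no surprise.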
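
In the "pre-train, fine-tune, predict" paradigm, NER is usually fine-tuned as token classification, which means word-level labels have to be aligned with the model's subword tokens. The helper below sketches one common alignment scheme. It is my assumption for illustration, not the paper's documented preprocessing. The `word_ids` list mimics what Hugging Face fast tokenizers return from `BatchEncoding.word_ids()` (`None` for special tokens), and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def align_labels(word_ids, word_labels):
    """Give each word's label to its first subword; mask special tokens and later subwords."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)            # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])    # first subword of a new word
        else:
            aligned.append(IGNORE_INDEX)            # continuation subword
        previous = word_id
    return aligned

# Hypothetical example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 0]  # integer label ids, one per word

print(align_labels(word_ids, word_labels))  # [-100, 0, 1, -100, 0, -100]
```

Labeling only the first subword keeps the evaluation at word level; labeling every subword is an equally common alternative.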

Minimum number of annotations

As the figure shows, the answer is evident at a glance.
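
An experiment of this kind can be pictured as a learning curve: train on progressively more annotated publications and record the score each time. The sketch below shows only that loop. The function names, the threshold rule in `min_docs_for`, and especially `fake_train_and_eval` (a stand-in that trains no model) are hypothetical, and the numbers it produces have nothing to do with the paper's results.

```python
def learning_curve(train_docs, sizes, train_and_eval):
    """Return (n, score) for a model trained on the first n documents, for each n in sizes."""
    results = []
    for n in sizes:
        subset = train_docs[:n]
        results.append((n, train_and_eval(subset)))
    return results

def min_docs_for(results, target):
    """Smallest training size whose score reaches `target`, or None if none does."""
    for n, score in sorted(results):
        if score >= target:
            return n
    return None

def fake_train_and_eval(subset):
    """Stand-in scorer for illustration only; a real run would fine-tune and evaluate a model."""
    return len(subset) / (len(subset) + 10)

docs = [f"doc-{i}" for i in range(50)]
curve = learning_curve(docs, [10, 20, 30, 40, 50], fake_train_and_eval)
print(min_docs_for(curve, 0.7))  # 30 with the stand-in scorer
```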

Conclusion

The task distinguishes ML models from methods, and datasets from materials.

The paper names two limitations.

First, "our work suffers from low interrater agreement on certain entity types, and thus, the model performs poor on those types."

Second, the work is confined to the machine learning domain and lacks infrequent publication types: "the paper selection is conducted within the machine learning domain and does not include infrequent publication types, such as surveys or reproducibility studies."

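My takeaways

• I learned what the F1 score is.
• I learned the dataset preparation process: define the tag set (this seems quite complicated, because some cases are genuinely hard to delimit), select the raw material, annotate, evaluate, and then discuss and settle whatever problems come up.
• Annotation quality can be evaluated by having several people annotate the same article, treating one person as the ground truth and another as the system under evaluation, and computing the F1 score.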