title: 'Named Entity Recognition Study - Dataset Introduction - conll03'
date: 2020-07-14 22:46:05
tags:
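Named Entity Recognition Study - Dataset Introduction - conll03
CoNLL 2003 is the most common public dataset for named entity recognition. Its official site, https://www.clips.uantwerpen.be/conll2003/ner/,
describes it in detail.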
1 Number of categories
Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:
[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .
The shared task of CoNLL-2003 concerns language-independent named entity recognition.
We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
The participants of the shared task will be offered training and test data for two languages. They will use the data for developing a named-entity recognition system that includes a machine learning component. For each language, additional information (lists of names and non-annotated data) will be supplied as well. The challenge for the participants is to find ways of incorporating this information in their system.
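The passage above is quoted from the official site. Its third paragraph names the categories to be recognized, four in total: persons, locations, organizations, and miscellaneous entities.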
2 Dataset sample
This is an excerpt from its training set.
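According to the official site, the first column of the dataset is the word, the second is the part-of-speech tag, the third is the syntactic chunk tag, and the fourth is the named entity tag. For the NER task, only the first and fourth columns matter. Entity categories are annotated with the BIO scheme, which was introduced in an earlier post.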
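As a rough illustration of the four-column layout described above, the following Python sketch reads text in this format and keeps only the first and fourth columns. The function name `read_conll` and the `SAMPLE` string are hypothetical. The sample lines are built from the example sentence quoted earlier, and their POS and chunk tags are included only for illustration. Skipping lines that start with `-DOCSTART-` assumes the distributed files use such lines as document separators.

```python
from typing import List, Tuple

# Illustrative lines in the four-column layout: word, POS tag, chunk tag, NER tag.
# A blank line marks the end of a sentence.
SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL-style text into sentences of (word, ner_tag) pairs."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        # A blank line closes the current sentence; -DOCSTART- lines
        # (assumed document separators) are treated the same way.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Keep the first column (word) and the last column (NER tag).
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for word, tag in read_conll(SAMPLE)[0]:
        print(word, tag)
```

Taking the last column rather than a fixed index is a small design choice here: it keeps the sketch usable if a file variant carries a different number of middle columns, as long as the NER tag comes last.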
The official description follows:
The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the sa