Learning Named Entity Recognition: Dataset Introduction: CoNLL03


StarLib


Categories: NLP, python

Article link: https://blog.csdn.net/StarLib/article/details/107350559


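This article introduces the CoNLL03 dataset, which is commonly used for named entity recognition (NER). It covers the entity categories (persons, locations, organizations, and miscellaneous entities), a sample of the data, the data structure (BIO tagging), and the preprocessing steps, as groundwork for studying and building NER systems.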


date: 2020-07-14 22:46:05


CoNLL 2003 is among the most widely used public datasets for named entity recognition. Its official site, https://www.clips.uantwerpen.be/conll2003/ner/, describes it in detail.

1. Number of Categories

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:

[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task will be offered training and test data for two languages. They will use the data for developing a named-entity recognition system that includes a machine learning component. For each language, additional information (lists of names and non-annotated data) will be supplied as well. The challenge for the participants is to find ways of incorporating this information in their system.

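The passage above is quoted from the official site. The sentence beginning "We will concentrate on" names the categories to be recognized. There are four in total: persons, locations, organizations, and miscellaneous entities.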
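To make the bracketed notation in the quoted example concrete, here is a minimal sketch that splits such a sentence into plain tokens and typed entities. The helper name `parse_bracketed` is hypothetical and not part of any CoNLL tooling. The abbreviations PER, LOC, and ORG appear in the quoted example; MISC is the tag the dataset uses for miscellaneous entities.

```python
# Abbreviations for the four entity types of the shared task.
ENTITY_TYPES = {
    "PER": "persons",
    "LOC": "locations",
    "ORG": "organizations",
    "MISC": "miscellaneous entities",
}


def parse_bracketed(text):
    """Split a bracketed example into tokens and (type, phrase) entities.

    Hypothetical helper for illustration; it assumes every bracket and
    word is separated by whitespace, as in the quoted example.
    """
    tokens, entities = [], []
    current_type, current_words = None, []
    for piece in text.split():
        if piece.startswith("[") and piece[1:] in ENTITY_TYPES:
            # Opening marker such as "[ORG": start collecting a phrase.
            current_type, current_words = piece[1:], []
        elif piece == "]" and current_type is not None:
            # Closing marker: the collected words form one entity.
            entities.append((current_type, " ".join(current_words)))
            current_type = None
        else:
            tokens.append(piece)
            if current_type is not None:
                current_words.append(piece)
    return tokens, entities


tokens, entities = parse_bracketed(
    "[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] ."
)
print(tokens)
print(entities)
```

The sentence has seven tokens, three of which are entities, one for each of the first three categories. Everything outside the brackets counts as a non-entity token.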

2. Dataset Sample

This is an excerpt from its training set.

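According to the official site, the first column of the dataset is the word, the second is the part-of-speech tag, the third is the syntactic chunk tag, and the fourth is the named entity tag. For the NER task, only the first and fourth columns matter. Entity labels use the BIO tagging scheme, which an earlier post on this blog introduced.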
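The column layout described above can be read with a few lines of Python. This is a rough sketch rather than an official loader: the function name `read_conll` is hypothetical, the sample rows correspond to the bracketed example sentence from the task description (treat the POS and chunk values as illustrative), and skipping `-DOCSTART-` lines assumes the document markers found in the distributed English files.

```python
SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

"""


def read_conll(lines):
    """Yield (words, ner_tags) for each sentence in four-column data.

    Only the first column (word) and the last column (entity tag) are
    kept, since the POS and chunk columns are not needed for NER.
    """
    words, tags = [], []
    for line in lines:
        line = line.strip()
        # A blank line ends a sentence; -DOCSTART- lines mark documents
        # (an assumption about the distributed files).
        if not line or line.startswith("-DOCSTART-"):
            if words:
                yield words, tags
                words, tags = [], []
            continue
        columns = line.split()
        words.append(columns[0])
        tags.append(columns[-1])
    if words:
        # Flush the last sentence when the input lacks a trailing blank line.
        yield words, tags


sentences = list(read_conll(SAMPLE.splitlines()))
for words, tags in sentences:
    print(list(zip(words, tags)))
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to wherever the training file is stored.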

Here is the official site's description:

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the sa