Learning Named Entity Recognition - Dataset Introduction - CoNLL03

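This post introduces the CoNLL03 dataset, which is widely used for named entity recognition (NER). It covers the entity categories (persons, locations, organizations, and miscellaneous entities), a sample of the data, the data structure (BIO tagging), and preprocessing steps, as a foundation for studying and building NER systems.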

title: 'Learning Named Entity Recognition - Dataset Introduction - CoNLL03'
date: 2020-07-14 22:46:05
tags:




CoNLL 2003 is one of the most widely used public datasets for named entity recognition. Its official site, https://www.clips.uantwerpen.be/conll2003/ner/, describes it in detail.

1. Number of categories

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:

[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task will be offered training and test data for two languages. They will use the data for developing a named-entity recognition system that includes a machine learning component. For each language, additional information (lists of names and non-annotated data) will be supplied as well. The challenge for the participants is to find ways of incorporating this information in their system.

The passage above is quoted from the official site. It names the categories to be recognized, four in total: persons, locations, organizations, and miscellaneous entities.
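To make the bracketed example from the official site concrete, here is a minimal sketch that turns that notation into per-token labels. The function name `bracket_to_bio` is hypothetical, and the choice of B-/I- prefixes assumes the BIO convention this post refers to, not necessarily the exact tags stored in the dataset files.

```python
def bracket_to_bio(text):
    """Convert '[TYPE w1 w2 ] w3' notation into (token, BIO tag) pairs.

    Assumes brackets are space-separated tokens, as in the quoted example.
    """
    pairs = []
    current = None  # entity type of the currently open bracket, if any
    first = False   # True until the first token of that entity is emitted
    for piece in text.split():
        if piece.startswith("[") and len(piece) > 1:
            # Opening marker such as "[ORG": remember the entity type.
            current, first = piece[1:], True
        elif piece == "]":
            # Closing marker: we are outside any entity again.
            current = None
        elif current is None:
            pairs.append((piece, "O"))
        else:
            pairs.append((piece, ("B-" if first else "I-") + current))
            first = False
    return pairs


example = "[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] ."
pairs = bracket_to_bio(example)
for token, tag in pairs:
    print(token, tag)
```

Every token outside a bracket receives the tag O, so the classifier effectively chooses among the four entity types plus "no entity".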

2. Dataset sample

[Image image-20200714231551083: an excerpt from the training set.]

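According to the official site, the first column of the dataset is the word, the second is the part-of-speech tag, the third is the syntactic chunk tag, and the fourth is the named entity tag. For the NER task only the first and fourth columns matter. Entity categories are annotated with the BIO tagging scheme, which an earlier post on this blog introduced.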
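The column layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: `read_conll` is a hypothetical helper, the sample lines are illustrative rows written in the four-column format (not copied from the training file), and skipping lines that start with `-DOCSTART-` assumes the document-separator lines found in commonly distributed copies of the data.

```python
def read_conll(lines):
    """Group four-column lines into sentences of (word, entity_tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # A blank line ends a sentence; -DOCSTART- lines are assumed to be
        # document separators and carry no token.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Keep only the first column (word) and the last column (entity tag).
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample in the four-column format: word, POS, chunk, entity.
sample = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

Ekeus NNP I-NP I-PER
spoke VBD I-VP O
. . O O
""".splitlines()

sentences = read_conll(sample)
print(len(sentences))
print(sentences[0])
```

To read a real file, the same function can be given an open file handle instead of `sample`, since it only iterates over lines.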

The official site describes the format as follows:

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the sa
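As a rough illustration of how I-TYPE tags map back to entities, the sketch below groups consecutive tokens that share a type into one span. The helper `extract_entities` is hypothetical. The quoted description is cut off before it explains any other prefix, so treating a B- prefix as the start of a new entity is an assumption based on the common IOB convention, not something confirmed by the text above.

```python
def extract_entities(words, tags):
    """Return (type, text) pairs for each run of tokens with one entity type."""
    entities = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # A new entity begins on a B- prefix (assumed convention) or when
        # the type differs from the entity currently being collected.
        begins_new = tag != "O" and (prefix == "B" or label != etype)
        if etype is not None and (tag == "O" or begins_new):
            entities.append((etype, " ".join(words[start:i])))
            start, etype = None, None
        if begins_new:
            start, etype = i, label
    return entities


words = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
tags = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]
entities = extract_entities(words, tags)
print(entities)
```

Because the function reacts to both a type change and a B- prefix, it should work unchanged on data converted to the BIO scheme, where every entity starts with B-.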
