Named-Entity Recognition (NER) Using a Keras Bidirectional LSTM

Named-entity recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a subtask of information extraction that locates named entities mentioned in unstructured text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

In this project, we will work with a NER dataset provided by Kaggle. The dataset can be accessed here. It is an extract from the GMB corpus that has been tagged and annotated specifically to train a classifier to predict named entities such as names and locations. The dataset also includes an additional feature, the part-of-speech tag (POS), that can be used in classification. In this project, however, we work with only one feature: the words of each sentence.

1. Load the dataset

Let's begin by loading and visualising the dataset. To download ner_dataset.csv, go to this link on Kaggle.

We have to pass encoding='unicode_escape' when loading the data. unicode_escape is one of Python's standard codecs: it decodes the file's bytes as Latin-1 and interprets backslash escape sequences, so bytes that are not valid UTF-8 (the default encoding in pandas) can still be read.
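To make the codec's behaviour concrete, here is a minimal sketch that uses only the Python standard library. The byte strings are made-up examples, not content taken from ner_dataset.csv.

```python
# A byte string containing a Latin-1 encoded "é" (byte 0xE9), which is not valid UTF-8.
raw = b"caf\xe9"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# unicode_escape decodes the bytes as Latin-1 ...
decoded = raw.decode("unicode_escape")
# ... and also turns a literal backslash followed by "t" into a real tab character.
escaped = b"tab\\there".decode("unicode_escape")

print(utf8_ok, decoded == "caf\u00e9", repr(escaped))
```

If interpreting backslash sequences is not wanted, the latin1 codec decodes the same bytes without that extra step.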

import pandas as pd

# Load the Kaggle CSV; the encoding must be given explicitly
data = pd.read_csv('ner_dataset.csv', encoding='unicode_escape')
# Show the first five rows
data.head()

From the dataset we can see that the sentences are already broken into tokens in the 'Word' column, which will be our feature (X). The 'Sentence #' column shows the sentence number only on the first token of each sentence and is NaN until the next sentence begins. The 'Tag' column will be our label (y).
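Because the sentence number appears only once per sentence, a common first step is to forward-fill that column so every token row carries its sentence label. The sketch below shows this on a small hand-made frame with the same three columns. The rows and tag values are illustrative assumptions, not rows copied from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame in the same layout: the sentence label appears once, then NaN.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["London", "is", "big", "Paris", "too"],
    "Tag": ["B-geo", "O", "O", "B-geo", "O"],
})

# Forward-fill so that every token row carries its sentence label.
toy["Sentence #"] = toy["Sentence #"].ffill()

# Collect each sentence's tokens and tags into lists (one X and one y per sentence).
grouped = toy.groupby("Sentence #", sort=False)[["Word", "Tag"]].agg(list)
print(grouped)
```

The same two calls should apply to the full frame loaded from ner_dataset.csv, assuming its columns are named as described above.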

2. Extract mappings required for the neural network
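A neural network works on integers rather than strings, so tokens and tags are typically mapped to indices and back. The following is a minimal sketch of one way to build such mappings. The helper build_mappings, the names token2idx and tag2idx, and the toy word and tag lists are all assumptions made for illustration, not necessarily what the original project uses.

```python
def build_mappings(items):
    """Return (item -> index, index -> item) dictionaries for the unique items."""
    vocab = sorted(set(items))  # sorted so the index assignment is reproducible
    item2idx = {item: idx for idx, item in enumerate(vocab)}
    idx2item = {idx: item for item, idx in item2idx.items()}
    return item2idx, idx2item


# Illustrative tokens and tags (hypothetical, not taken from the dataset).
words = ["London", "is", "big", "Paris", "is", "too"]
tags = ["B-geo", "O", "O", "B-geo", "O", "O"]

token2idx, idx2token = build_mappings(words)
tag2idx, idx2tag = build_mappings(tags)

# Encode the sequences as integers, ready to be padded and fed to a model.
encoded_words = [token2idx[w] for w in words]
encoded_tags = [tag2idx[t] for t in tags]
print(encoded_words, encoded_tags)
```

Keeping the inverse dictionaries makes it possible to turn predicted indices back into readable tags later.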
