Named-entity recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a subtask of information extraction. It locates named entities mentioned in unstructured text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
In this project, we will work with an NER dataset provided on Kaggle. The dataset can be accessed here. It is an extract from the GMB corpus that has been tagged and annotated specifically for training a classifier to predict named entities such as names and locations. The dataset also includes an additional feature, the part-of-speech tag (POS), that can be used in classification. In this project, however, we work with only one feature: the words of each sentence.
1. Load the dataset
Let's begin by loading the dataset and taking a first look at it. To download ner_dataset.csv, go to this link on Kaggle.
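We have to pass encoding='unicode_escape' when loading the data, because the file contains bytes that are not valid UTF-8 and pandas' default decoding typically fails on them. 'unicode_escape' is a Python codec: it decodes bytes outside the ASCII range as Latin-1 and also interprets backslash escape sequences (such as \t or \u00e9) that appear in the text.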
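To make the effect of this codec concrete, here is a minimal sketch that uses only the standard library. The byte strings are made-up samples, not taken from ner_dataset.csv; they only illustrate how the same bytes behave under 'utf-8', 'latin-1', and 'unicode_escape'.

```python
# Made-up sample: Latin-1 encoded bytes that are NOT valid UTF-8.
raw = b"caf\xe9 costs \xa33"

# Decoding as UTF-8 fails, which mirrors the error a default load can raise.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Both codecs map each non-ASCII byte to the matching Latin-1 character.
latin1_text = raw.decode("latin-1")
escaped_text = raw.decode("unicode_escape")

# The difference: unicode_escape also interprets backslash sequences.
# rb"a\tb" is 4 bytes: 'a', a backslash, 't', 'b'.
with_backslash = rb"a\tb"
kept_literal = with_backslash.decode("latin-1")            # backslash kept
turned_into_tab = with_backslash.decode("unicode_escape")  # becomes a tab

print(utf8_ok, latin1_text, escaped_text)
print(len(kept_literal), len(turned_into_tab))
```

If interpreting backslash sequences is not wanted, passing encoding='latin-1' to pd.read_csv is a commonly used alternative for files like this one; which of the two is more appropriate depends on the data.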
import pandas as pd
data = pd.read_csv('ner_dataset.csv', encoding='unicode_escape')
data.head()
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/91686fe8d0e5994cfecd323aa8a5ab91.png)
From the dataset, we can see that the sentences are already split into tokens in the 'Word' column, which will be our feature (X). The 'Sentence #' column holds the sentence number only on the first token of each sentence and is NaN on every following row until the next sentence begins. The 'Tag' column will be our label (y).
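Because the sentence number appears only on the first token, one way to recover whole sentences is to forward-fill that column and then group by it. The sketch below shows the idea on a small hand-made frame with the same column names; the rows are invented for illustration, and the names toy, sentences, and labels are hypothetical, not part of the original project.

```python
import pandas as pd

# Toy frame with the same layout as ner_dataset.csv (values are made up).
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["London", "is", "big", "Paris", "too"],
    "Tag": ["B-geo", "O", "O", "B-geo", "O"],
})

# Propagate each sentence label down over the missing rows that follow it.
toy["Sentence #"] = toy["Sentence #"].ffill()

# Collect the words and the tags of each sentence, keeping the row order.
by_sentence = toy.groupby("Sentence #", sort=False)
sentences = by_sentence["Word"].apply(list).tolist()
labels = by_sentence["Tag"].apply(list).tolist()

print(sentences)
print(labels)
```

Applied to the real data, the same two steps would run on data instead of toy, giving one list of words (X) and one list of tags (y) per sentence.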