NLP: Information Extraction

Latest recommended article published on 2024-07-19 11:28:29

子颠三号倒四

Tags: natural language processing

Article link: https://blog.csdn.net/weixin_45629601/article/details/106040879

Introduction to Information Extraction

Extracting entities:
·General-purpose: person, location, time
·Domain-specific: e.g. the medical domain (proteins, diseases, drugs)

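Extracting relations:
·located in, work at, is part of
Methods: rule-based, supervised learning, bootstrapping, distant supervision, and unsupervised learning
·entity disambiguation ·entity unification ·coreference resolution ·syntactic parsing ·the CKY algorithm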
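
Of the relation-extraction methods listed above, the rule-based one is the easiest to illustrate. What follows is a minimal sketch, assuming hand-written surface patterns over plain English text. The `PATTERNS` table and the `extract_relations` helper are hypothetical names made up for this illustration, and the entity pattern only recognizes runs of capitalized words.

```python
import re

# Assumption: an entity is one or more capitalized words separated by spaces.
NAME = r"[A-Z]\w*(?:\s[A-Z]\w*)*"

# Hypothetical hand-written rules: relation name -> regex with two named groups.
PATTERNS = {
    "located_in": re.compile(rf"(?P<e1>{NAME}) is located in (?P<e2>{NAME})"),
    "work_at": re.compile(rf"(?P<e1>{NAME}) works at (?P<e2>{NAME})"),
    "is_part_of": re.compile(rf"(?P<e1>{NAME}) is part of (?P<e2>{NAME})"),
}

def extract_relations(text):
    """Return (entity1, relation, entity2) triples matched by the rules."""
    triples = []
    for relation, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            triples.append((m.group("e1"), relation, m.group("e2")))
    return triples

text = "Stanford University is located in California. Alice works at Google."
triples = extract_relations(text)
print(triples)
```

Rules like these tend to be precise but miss any phrasing they do not anticipate, which is one reason the learning-based methods in the list exist.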

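Applications:
·Building knowledge bases ·Google Scholar, CiteSeerX
·User databases: Rapleaf, Spoke ·Shopping engines, product search
·Patent analysis ·Securities analysis ·Question answering systems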

NER (Named Entity Recognition)

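Named entity recognition, also called "proper name recognition", means identifying the entities in a text that carry a specific meaning, mainly person names, place names, organization names, and other proper nouns.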

Applications:
In a chatbot's "domain-intent-slots" pipeline, entities have to be extracted during intent recognition.

English Toolkits:
·NLTK NE ·spaCy ·Stanford Parser

Chinese Toolkits:
·HanLP ·HIT NLP ·Fudan NLP ·or your own (built for a specific domain)

Create an NER Recognizer
·Define the entity types ·Prepare the training data ·Train the NER model
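
To make the "prepare the training data" step concrete: NER training data is commonly annotated token by token with BIO tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The sketch below assumes that scheme, which the text above does not spell out. `bio_to_spans` is a hypothetical helper and the sentence is invented.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from BIO-tagged tokens."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one in progress, if any.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current.append(token)
        else:
            # "O", or an "I-" tag with no matching open entity: treat as outside.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Barack", "Obama", "visited", "New", "York", "yesterday"]
tags = ["B-per", "I-per", "O", "B-geo", "I-geo", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```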

Evaluate NER Recognizer: Precision / Recall / F1 score
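
As a reminder of how the three metrics relate: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two. The following is a small sketch that scores predicted entities against gold entities by exact match. The function name and the toy data are assumptions for illustration only.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores: an entity counts as correct only on exact match."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: predicted and in gold
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = {("Barack Obama", "per"), ("New York", "geo"), ("Monday", "tim")}
predicted = {("Barack Obama", "per"), ("York", "geo")}
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here one of the two predictions is correct (precision 1/2) and one of the three gold entities is found (recall 1/3). Scores can also be computed per token rather than per entity, and the two views can give quite different numbers.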

Methods for NER:
·Rules (for example, regular expressions)
·Majority voting model
·Classification models
$\quad$ ·Non-sequential models: logistic regression, SVM, ...
$\quad$ ·Sequential models: HMM, CRF, LSTM-CRF
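
The first method in the list, rules such as regular expressions, can be sketched in a few lines. This is an illustrative example only: the `RULES` table, the labels, and the `rule_based_ner` helper are assumptions, and the patterns cover just ISO-style dates and `HH:MM` times.

```python
import re

# Hypothetical rule table: entity label -> pattern.
RULES = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}\b")),
]

def rule_based_ner(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(), label))
    # Sort by start offset so the output follows the text order.
    return [(entity, label) for _, entity, label in sorted(found)]

text = "The meeting is on 2020-05-10 at 14:30 in Beijing."
entities = rule_based_ner(text)
print(entities)
```

Regular expressions suit entities with a fixed surface form (dates, times, IDs). Open-ended classes such as person or place names ("Beijing" is not caught here) usually need a dictionary or a learned model.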

Rule-based Approach
Dataset

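import pandas as pd

# Load the token-level NER dataset (one row per word).
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
# Forward-fill missing values: each empty cell takes the last non-null value above it.
# (fillna(method="ffill") is deprecated in recent pandas; ffill() is the equivalent.)
data = data.ffill()
print(data.tail(10))
# Vocabulary: the distinct words in the corpus.
words = list(set(data["Word"].values))
n_words = len(words)
print(n_words)  # the original post reports 35178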

from sklearn.base import BaseEstimator, TransformerMixin

class MajorityVotingTagger(BaseEstimator, TransformerMixin):
    """Baseline tagger: each word gets the tag it carried most often in training."""

    def fit(self, X, y):
        """
        X: list of words
        y: list of tags
        """
        word2cnt = {}   # word -> {tag: count}
        self.tags = []  # distinct tags, in order of first appearance
        for x, t in zip(X, y):
            if t not in self.tags:
                self.tags.append(t)
            if x in word2cnt:
                if t in word2cnt[x]:
                    word2cnt[x][t] += 1
                else:
                    word2cnt[x][t] = 1
            else:
                word2cnt[x] = {t: 1}
        # Keep only the most frequent tag for each word.
        self.mjvote = {}
        for k, d in word2cnt.items():
            self.mjvote[k] = max(d, key=d.get)
        return self  # scikit-learn convention: fit returns the estimator

    def predict(self, X, y=None):
        """
        Predict the tag from memory. If the word is unknown, predict 'O'.
        """
        return [self.mjvote.get(x, 'O') for x in X]

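from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# Flatten the corpus into parallel lists of words and tags.
words = data["Word"].values.tolist()
tags = data["Tag"].values.tolist()
# 5-fold cross-validation: each token is predicted by a model fitted on the other folds.
pred = cross_val_predict(estimator=MajorityVotingTagger(), X=words, y=tags, cv=5)
# Per-tag precision, recall, and F1, computed token by token.
report = classification_report(y_pred=pred, y_true=tags)
print(report)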
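
To see what the majority-voting idea does on a small scale, here is a dependency-free sketch of the same logic on a toy corpus. The corpus and the function names `fit_majority` and `predict_majority` are invented for illustration; nothing here is taken from ner_dataset.csv.

```python
from collections import Counter, defaultdict

def fit_majority(words, tags):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in zip(words, tags):
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def predict_majority(model, words, default="O"):
    """Look up each word; unknown words get the default tag."""
    return [model.get(word, default) for word in words]

train_words = ["Paris", "is", "nice", "Paris", "Hilton", "Paris"]
train_tags = ["B-geo", "O", "O", "B-per", "I-per", "B-geo"]
model = fit_majority(train_words, train_tags)

toy_pred = predict_majority(model, ["Paris", "Hilton", "London"])
print(toy_pred)
```

"Paris" is tagged as a place because that reading is more frequent, even when followed by "Hilton", and the unseen word "London" falls back to 'O'. The tagger looks at each word in isolation, which is the limitation that the sequential models (HMM, CRF, LSTM-CRF) are meant to address.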