[NLP] spaCy Notes

YWP_2016

Category: Python. Tags: natural language processing, artificial intelligence, nlp

Article link: https://blog.csdn.net/ywp_2016/article/details/102914361

Reference

Quickly mastering natural language processing in Python with spaCy (with code and links)

Introduction

Basic overview

spaCy's architecture

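spaCy bills itself as an industrial-strength natural language processing toolkit. Its core data structures are Doc and Vocab. A Doc object holds a sequence of Tokens together with their annotations. The Vocab object is spaCy's vocabulary and stores data shared across a language: by keeping strings, word vectors, and lexical attributes in one central place, spaCy avoids storing multiple copies of the same data.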

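spaCy has four especially important classes:

Doc: a container for accessing linguistic annotations
Span: a slice of a Doc object
Token: an individual token, such as a word, a punctuation mark, or whitespace
Vocab: stores the vocabulary and data shared across a language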

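A Doc object is constructed by the Tokenizer and then modified in place by the components of the pipeline. The Language object coordinates these components: it takes raw text, sends it through the pipeline, and returns an annotated document. Text annotations are designed to have a single source of truth: the Doc object owns the data, and Span and Token are views into it.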
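
The flow described above (tokenizer first, then pipeline components, all coordinated by the Language object) can be sketched without downloading a trained model. This is a minimal illustration that assumes spaCy v3; the blank English pipeline and the sentencizer component are stand-ins chosen for this sketch, not something the original notes use.

```python
import spacy

# Assumes spaCy v3; a blank English pipeline needs no model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, used here as a stand-in component

text = "The rain in Spain falls mainly on the plain. It rarely rains elsewhere."

tokens_only = nlp.make_doc(text)  # tokenizer only: a Doc with no pipeline annotations
doc = nlp(text)                   # tokenizer plus every component in nlp.pipe_names

print(nlp.pipe_names)             # the components the Language object runs, in order
print([t.text for t in tokens_only] == [t.text for t in doc])
print([sent.text for sent in doc.sents])
```

Both calls produce the same tokens; only the Doc returned by nlp(text) carries the sentence boundaries added by the pipeline component.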

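Important spaCy classes

Token

In natural language processing, a single word, punctuation mark, whitespace character, and so on is called a token.

Token attributes usually come in pairs: the name without an underscore is the attribute's integer ID, and the name with a trailing underscore is its text form (for example, lemma and lemma_).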
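
As a small sketch of this naming convention, the snippet below compares the ID and text forms of two attributes on one token. It assumes spaCy v3 and uses a blank English pipeline so that no trained model is needed.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; no trained model required
doc = nlp("The rain in Spain")
token = doc[3]

# orth is the integer hash ID of the exact token text; orth_ is the string itself.
print(token.orth, token.orth_)
# lower / lower_ follow the same convention for the lowercased form.
print(token.lower, token.lower_)

# The StringStore on the shared Vocab maps between the two forms.
print(nlp.vocab.strings[token.orth] == token.orth_)
print(nlp.vocab.strings[token.lower_] == token.lower)
```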

Token · spaCy API Documentation

Doc

After a text has been tokenized, the Doc object is the resulting sequence of tokens, and a Span object is a slice of that Doc.
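
To make the Doc and Span relationship concrete, here is a short sketch (assuming spaCy v3 and a blank English pipeline) that slices a Doc and inspects the resulting Span.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The rain in Spain falls mainly on the plain.")

span = doc[3:5]  # slicing a Doc returns a Span, not a copy of the text
print(type(span).__name__, span.text)
print(span.start, span.end)  # token offsets into the parent Doc (end is exclusive)
print(span[0].i)             # a Token inside the Span keeps its Doc-level index
print(span.doc is doc)       # the Span is a view that points back to its Doc
```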

Span

Vocab

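The Vocab object stores the vocabulary and data shared across a language, so the same data can be shared between different Doc objects. The vocabulary is represented with Lexeme objects and a StringStore object.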

The Lexeme type

An entry in the vocabulary. A Lexeme has no string context – it’s a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).

A Lexeme object is an entry in the Vocab. The similarity() method computes the similarity between two entries from their word vectors:

import spacy
nlp = spacy.load('en_core_web_lg')  # note: this loads the large model, which ships with word vectors

apple = nlp.vocab['apple']
# print(apple) shows that apple is a Lexeme object: <spacy.lexeme.Lexeme object at 0x0000020DFF08EEE8>
orange = nlp.vocab['orange']
pig = nlp.vocab['pig']
apple_orange = apple.similarity(orange)
apple_pig = apple.similarity(pig)
# print(apple_orange)  # 0.56189173
# print(apple_pig)     # 0.31820506
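
The point that a Lexeme is a context-free word type can also be illustrated without word vectors. The sketch below assumes spaCy v3 and a blank English pipeline, so it does not call similarity(); it only shows that tokens from different sentences resolve to the same vocabulary entry.

```python
import spacy

nlp = spacy.blank("en")  # no vectors here, so similarity() is not demonstrated
doc_a = nlp("I ate an apple")
doc_b = nlp("apple pie recipe")

# Two tokens in different contexts resolve to the same vocabulary entry.
lex_a = nlp.vocab[doc_a[3].orth]
lex_b = nlp.vocab[doc_b[0].orth]
print(lex_a.text, lex_a.orth == lex_b.orth)

# Lexemes carry only context-independent attributes.
print(lex_a.is_alpha, lex_a.is_digit, lex_a.lower_)
```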

The StringStore type

StringStore is a string-to-int lookup table: it maps a string to its 64-bit hash value and resolves a 64-bit hash back to the string:

import spacy
from spacy.strings import StringStore
nlp=spacy.load("en_core_web_lg")
stringstore=StringStore(['apple'])
apple_hash=stringstore['apple']
#print(apple_hash) 8566208034543834098

apple_id=nlp.vocab.strings['apple']
# The strings attribute of a Vocab is a StringStore that holds the shared string data:
#print(apple_id) 8566208034543834098
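
The lookup works in both directions. The following sketch (assuming spaCy v3; the example strings are arbitrary) shows the round trip, and that a hash can only be resolved by a store that has already seen the string.

```python
from spacy.strings import StringStore

store = StringStore()
coffee_hash = store.add("coffee")  # add() returns the 64-bit hash

print(store[coffee_hash])               # hash -> string
print(store["coffee"] == coffee_hash)   # string -> hash
print("coffee" in store, "tea" in store)

# The hash depends only on the string, so independent stores agree on it.
other = StringStore(["coffee"])
print(other["coffee"] == coffee_hash)

# A hash can only be resolved by a store that has seen the string.
try:
    StringStore()[coffee_hash]
except KeyError:
    print("unknown hash")
```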

The Vocab class

When initializing a Vocab, the strings argument can be a list or a StringStore object:

from spacy.vocab import Vocab
vocab=Vocab(strings=['apple'])
print(vocab.strings['apple']) #8566208034543834098
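
As a sketch of how a Vocab is shared between documents, the example below builds two Doc objects by hand on the same Vocab. It assumes spaCy v3; constructing Doc objects directly from word lists is done here only for illustration.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab(strings=["apple"])

# Docs built on the same Vocab share its StringStore and lexemes.
doc_a = Doc(vocab, words=["apple", "pie"])
doc_b = Doc(vocab, words=["green", "apple"])

print(doc_a.vocab is doc_b.vocab)
print(doc_a[0].orth == doc_b[1].orth == vocab.strings["apple"])
print("pie" in vocab.strings)  # creating a Doc registers its words in the Vocab
```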

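Practice

Loading a language model

The language models spaCy uses are pretrained statistical models that predict linguistic features. For English there are three of them: en_core_web_sm, en_core_web_md, and en_core_web_lg. In spaCy v2 there was also a shortcut name, en, and installing it required running the download command with administrator privileges. Shortcut names were removed in spaCy v3, so load a model by its full package name: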

import spacy
nlp=spacy.load('en_core_web_sm')
# The nlp variable is your entry point to spaCy; it now holds the en_core_web_sm English model
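
If the model package is not installed, spacy.load raises an OSError. One way to handle that is sketched below, assuming spaCy v3; load_english is a hypothetical helper written for this note, not part of spaCy. The model itself is normally installed with python -m spacy download en_core_web_sm.

```python
import spacy


def load_english(name="en_core_web_sm"):
    """Load a trained pipeline, falling back to a blank English one.

    load_english is a hypothetical helper, not part of spaCy.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package is not installed.
        # A blank pipeline still tokenizes but has no tagger, parser or NER.
        return spacy.blank("en")


nlp = load_english()
print(nlp.lang, nlp.pipe_names)
```

With the fallback, code that only needs tokenization keeps working, while code that relies on lemmas, POS tags, or entities should check nlp.pipe_names first.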

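Creating a Doc

First we create a Doc (one of spaCy's data structures) from a text. It is a container that holds the document together with its annotations. We then iterate over the Doc to see what spaCy has parsed, and format the result for this sentence as a pandas DataFrame.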

import pandas as pd
import spacy
nlp=spacy.load('en_core_web_sm')
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

cols=('text','lemma','POS','explain','stopword')
rows=[]
for t in doc:
    row=[t.text,t.lemma_,t.pos_,spacy.explain(t.pos_),t.is_stop]
    rows.append(row)
# rows starts as an empty list; one row is appended per token during processing
df=pd.DataFrame(rows,columns=cols)
print(df)

Tokenization

text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)
for token in doc:
    print(token)

'''
The
rain
in
Spain
falls
mainly
on
the
plain
.
'''
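
Beyond the token text, each token also records where it came from. The sketch below (assuming spaCy v3 and a blank English pipeline, which is enough for tokenization) prints the character offsets and shows that the tokens reassemble into the original string.

```python
import spacy

nlp = spacy.blank("en")
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

for token in doc:
    # idx is the character offset of the token in the original string;
    # whitespace_ is the trailing space (if any) that followed it.
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))

# Tokenization is non-destructive: the pieces reassemble into the input.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```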

Lemmatization

for token in doc:
    print(token,token.lemma_)
'''
The the
rain rain
in in
Spain Spain
falls fall
mainly mainly
on on
the the
plain plain
. .
'''

Part-of-speech tagging (POS tagging)

for token in doc:
    print(token,token.pos_)
'''
The DET
rain NOUN
in ADP
Spain PROPN
falls VERB
mainly ADV
on ADP
the DET
plain NOUN
. PUNCT
'''
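
The tag abbreviations in this output can be expanded with spacy.explain, which the DataFrame example earlier already uses. A short sketch that needs no trained model:

```python
import spacy

# Tags copied from the output above; spacy.explain looks them up in spaCy's glossary.
tags = ["DET", "NOUN", "ADP", "PROPN", "VERB", "ADV", "PUNCT"]
explanations = {tag: spacy.explain(tag) for tag in tags}

for tag, meaning in explanations.items():
    print(tag, "->", meaning)
```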

Noun phrase extraction (noun_chunks)

text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)
for chunk in doc.noun_chunks:
    print(chunk.text)

'''
The rain
Spain
the plain
'''

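Named entity recognition (NER)

NER identifies named entities in text, typically proper nouns such as people, places, and organizations.

If you are working on knowledge graph (https://www.akbc.ws/2019/) applications or with other linked data (http://linkeddata.org/), one challenge is building links between the named entities in a document and other related information, a task known as entity linking (http://nlpprogress.com/english/entity_linking.html). Recognizing the named entities in a document is the first step in this kind of AI work.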

text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text,ent.label_)

#Spain GPE
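
The example above relies on the trained NER component in en_core_web_sm. When no trained model is available, the same doc.ents interface can be filled by spaCy's rule-based entity_ruler component instead. The sketch below assumes spaCy v3; it swaps statistical NER for a single hand-written pattern, so it shows the interface rather than what the model does.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Hand-written pattern for this sketch; a trained model learns such labels from data.
ruler.add_patterns([{"label": "GPE", "pattern": "Spain"}])

doc = nlp("The rain in Spain falls mainly on the plain.")
for ent in doc.ents:
    # start_char / end_char are character offsets into doc.text
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```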

spaCy - WordNet

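WordNet (https://wordnet.princeton.edu/) provides a lexical database for English; in other words, it is a computable thesaurus.

There is a spaCy integration for WordNet called spacy-wordnet (https://github.com/recognai/spacy-wordnet), written by Daniel Vila Suero (https://twitter.com/dvilasuero), an expert in natural language and knowledge graph research.

Personal note: not covered in detail for now.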