spaCy: Processing Pipelines

Latest recommended article published on 2023-10-22 15:33:22

jsgang9

Category: spaCy. Tags: python, json

Article link: https://blog.csdn.net/jsglzj/article/details/130899099

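This article walks through spaCy's processing pipeline for natural language processing, including built-in components such as the part-of-speech tagger, the dependency parser and the named entity recognizer. It also shows how to extend the pipeline with custom components for specific needs, such as reporting the length of a text or matching entities. Finally, it covers how to register and use extension attributes, and how to process large volumes of text efficiently with nlp.pipe.

Summary generated automatically by CSDN.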

What happens when you call nlp?

doc = nlp("This is a sentence.")
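As a rough sketch of what that call does: nlp first tokenizes the text into a Doc and then passes the Doc through each pipeline component in order. The example assumes a blank English pipeline with only the rule-based `sentencizer` added (an assumption made so that it runs without downloading a trained model).

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based sentencizer,
# chosen only so the sketch runs without a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# What nlp(text) does, spelled out step by step:
manual_doc = nlp.make_doc(text)        # 1. tokenizer: str -> Doc
for name, component in nlp.pipeline:   # 2. each component, in order
    manual_doc = component(manual_doc)

doc = nlp(text)
manual_sents = [sent.text for sent in manual_doc.sents]
auto_sents = [sent.text for sent in doc.sents]
print(manual_sents == auto_sents)
```

Both routes should give the same sentences, which is why the order of the components matters: each one sees whatever the previous ones wrote to the Doc.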

Built-in pipeline components

Name	Description	Creates
tagger	Part-of-speech tagger	`Token.tag`, `Token.pos`
parser	Dependency parser	`Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks`
ner	Named entity recognizer	`Doc.ents`, `Token.ent_iob`, `Token.ent_type`
textcat	Text classifier	`Doc.cats`

Under the hood

The pipeline is defined, in order, in the model's config.cfg file.
Built-in components need binary data, such as trained model weights, to make predictions.

nlp.pipe_names: list of pipeline component names

print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

nlp.pipeline: list of (name, component) tuples

print(nlp.pipeline)

[('tok2vec', <spacy.pipeline.Tok2Vec>),
 ('tagger', <spacy.pipeline.Tagger>),
 ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>),
 ('attribute_ruler', <spacy.pipeline.AttributeRuler>),
 ('lemmatizer', <spacy.pipeline.Lemmatizer>)]

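Custom pipeline components

Why custom components?

Make a function execute automatically when you call nlp
Add your own metadata to documents and tokens
Update built-in attributes such as doc.ents

A component is a function that takes a doc, modifies it and returns it.
It is registered with the Language.component decorator.
It can then be added to the pipeline with nlp.add_pipe.

from spacy.language import Language

@Language.component("custom_component")
def custom_component_function(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe("custom_component")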

Anatomy of a component

@Language.component("custom_component")
def custom_component_function(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe("custom_component")

Argument	Description	Example
`last`	If `True`, add the component last	`nlp.add_pipe("component", last=True)`
`first`	If `True`, add the component first	`nlp.add_pipe("component", first=True)`
`before`	Add before the named component	`nlp.add_pipe("component", before="ner")`
`after`	Add after the named component	`nlp.add_pipe("component", after="tagger")`
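A minimal sketch of how these positioning arguments affect the order, assuming a blank English pipeline and three hypothetical no-op components (`demo_a`, `demo_b`, `demo_c`) that exist only for illustration:

```python
import spacy
from spacy.language import Language

# Hypothetical no-op components, used only to show the ordering.
@Language.component("demo_a")
def demo_a(doc):
    return doc

@Language.component("demo_b")
def demo_b(doc):
    return doc

@Language.component("demo_c")
def demo_c(doc):
    return doc

nlp = spacy.blank("en")                   # empty pipeline, no model needed
nlp.add_pipe("demo_a")                    # default: appended at the end
nlp.add_pipe("demo_b", first=True)        # moved to the front
nlp.add_pipe("demo_c", before="demo_a")   # placed right before demo_a
print(nlp.pipe_names)  # -> ['demo_b', 'demo_c', 'demo_a']
```

Only one of the four positioning arguments is meant to be used per call.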

Simple components

Complete the component function using the length of the doc.
Add "length_component" to the existing pipeline as its first component.
Try out the new pipeline by processing any text with the nlp object, for example "这是一个句子。".

import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the length of the doc
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc

# Load the small Chinese pipeline
nlp = spacy.load("zh_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("这是一个句子。")

Output:

['length_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner']
This document is 4 tokens long.

Complex components

Define the custom component and apply the matcher to the doc.
Create a Span for each match, assign it the label ID "ANIMAL", and overwrite doc.ents with the new spans.
Process the text and print the entity text and entity label for every entity in doc.ents.

import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("zh_core_web_sm")
animals = ["金毛犬", "猫", "乌龟", "老鼠"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span labelled "ANIMAL" for each match
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite doc.ents with the matched spans
    doc.ents = spans
    return doc

# Add the component to the pipeline right after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label of each entity in doc.ents
doc = nlp("我养了一只猫和一条金毛犬。")
print([(ent.text, ent.label_) for ent in doc.ents])

Output:

animal_patterns: [金毛犬, 猫, 乌龟, 老鼠]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner', 'animal_component']
[('猫', 'ANIMAL'), ('金毛犬', 'ANIMAL')]
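The component above replaces doc.ents outright. A possible variation, sketched here under different assumptions (a blank English pipeline, a hypothetical term list, and a component name invented for this example), keeps any entities that are already set and uses `spacy.util.filter_spans` to drop overlapping matches, since doc.ents does not accept overlapping spans:

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # assumption: blank pipeline, no trained NER

# Hypothetical term list; "retriever" overlaps with "golden retriever".
terms = ["golden retriever", "retriever", "cat"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("ANIMAL", [nlp.make_doc(term) for term in terms])

@Language.component("animal_merge_component")
def animal_merge_component(doc):
    matched = [Span(doc, start, end, label="ANIMAL")
               for _, start, end in matcher(doc)]
    # Keep existing entities, then drop overlaps (the longest span wins).
    doc.ents = filter_spans(list(doc.ents) + matched)
    return doc

nlp.add_pipe("animal_merge_component")
doc = nlp("I have a cat and a Golden Retriever.")
animal_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(animal_ents)
```

Whether overwriting or merging is the better choice depends on whether the entities predicted earlier in the pipeline should survive.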

Extension attributes

Setting custom attributes

Add custom metadata to documents, tokens and spans
Accessible via the ._ property

doc._.title = "My document"
token._.is_color = True
span._.has_color = False

Extensions are registered on the global Doc, Token or Span class using the set_extension method.

# Import the global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension("title", default=None)
Token.set_extension("is_color", default=False)
Span.set_extension("has_color", default=False)
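Because the registration is global, registering the same name a second time raises an error unless `force=True` is passed. A small sketch of that behaviour, using a hypothetical attribute name `demo_title` that is not part of the examples above:

```python
from spacy.tokens import Doc

# "demo_title" is a hypothetical attribute name used only for this sketch.
if not Doc.has_extension("demo_title"):
    Doc.set_extension("demo_title", default=None)

# Registering the same name again raises ValueError ...
try:
    Doc.set_extension("demo_title", default="untitled")
    raised = False
except ValueError:
    raised = True

# ... unless force=True is passed to overwrite the earlier registration.
Doc.set_extension("demo_title", default="untitled", force=True)
print(raised, Doc.has_extension("demo_title"))
```

This matters mostly in notebooks and interactive sessions, where the same cell may be executed more than once.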

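Extension attribute types

Attribute extensions
Property extensions
Method extensions

Attribute extensions

Set a default value that can be overwritten.

from spacy.tokens import Token

# Set an extension on the Token with a default value
Token.set_extension("is_color", default=False)

doc = nlp("天空是蓝色的。")

# Overwrite the default value of the extension attribute
doc[2]._.is_color = True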

Property extensions (1)

Define a getter function and an optional setter function.
The getter is only called when you retrieve the attribute value.

from spacy.tokens import Token

# Define the getter function
def get_is_color(token):
    colors = ["红色", "黄色", "蓝色"]
    return token.text in colors

# Set an extension on the Token with a getter
Token.set_extension("is_color", getter=get_is_color)

doc = nlp("天空是蓝色的。")
print(doc[2]._.is_color, "-", doc[2].text)

True - 蓝色
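The optional setter is not shown above. One way it might look, assuming a blank English pipeline and a hypothetical `demo_heading` property that stores its value in doc.user_data (both names are made up for this sketch):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # assumption: blank pipeline, no model needed

# Hypothetical property: a heading stored in doc.user_data.
def get_demo_heading(doc):
    # Called every time doc._.demo_heading is read
    return doc.user_data.get("demo_heading", "").upper()

def set_demo_heading(doc, value):
    # Called every time doc._.demo_heading is assigned
    doc.user_data["demo_heading"] = value.strip()

Doc.set_extension("demo_heading", getter=get_demo_heading,
                  setter=set_demo_heading, force=True)

doc = nlp("The sky is blue.")
doc._.demo_heading = "  weather report "
print(doc._.demo_heading)
```

The setter receives the object and the assigned value, so it can normalise or validate input before storing it.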

Property extensions (2)

Span extensions should almost always use a getter: spans are created on the fly, so a fixed default value would otherwise have to be set by hand on every possible span.

from spacy.tokens import Span

# Define the getter function
def get_has_color(span):
    colors = ["红色", "黄色", "蓝色"]
    return any(token.text in colors for token in span)

# Set an extension on the Span with a getter
Span.set_extension("has_color", getter=get_has_color)

doc = nlp("天空是蓝色的")
print(doc[1:4]._.has_color, "-", doc[1:4].text)
print(doc[0:2]._.has_color, "-", doc[0:2].text)

True - 是蓝色的
False - 天空是

Method extensions

Assign a function that becomes available as an instance method
Lets you pass arguments to the extension function

from spacy.tokens import Doc

# Define a method that takes arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set a method extension on the Doc
Doc.set_extension("has_token", method=has_token)

doc = nlp("天空是蓝色的。")
print(doc._.has_token("蓝色"), "- 蓝色")
print(doc._.has_token("云朵"), "- 云朵")

True - 蓝色
False - 云朵

Setting extension attributes

Use Token.set_extension to register "is_country" (default False).
Update it for the token "新加坡" (Singapore), then print the attribute for all tokens.

import spacy
from spacy.tokens import Token

nlp = spacy.load("zh_core_web_sm")

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False)

# Process the text and set is_country to True for the token "新加坡"
doc = nlp("我住在新加坡。")
doc[2]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

Output:

[('我', False), ('住在', False), ('新加坡', True), ('。', False)]

Use Token.set_extension to register "reversed" (with the getter function get_reversed).
Print its value for every token.

import spacy
from spacy.tokens import Token

nlp = spacy.blank("zh")

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]

# Register the Token extension attribute "reversed" with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("我说的所有话都是假的，包括这一句。")
for token in doc:
    print("reversed:", token._.reversed)

Output (a blank "zh" pipeline segments text into single characters by default, so each token reads the same when reversed):

reversed: 我
reversed: 说
reversed: 的
reversed: 所
reversed: 有
reversed: 话
reversed: 都
reversed: 是
reversed: 假
reversed: 的
reversed: ，
reversed: 包
reversed: 括
reversed: 这
reversed: 一
reversed: 句
reversed: 。

Complete the get_has_number function.
Use Doc.set_extension to register "has_number" (with the getter function get_has_number) and print its value.

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("zh")

# Define the getter function
def get_has_number(doc):
    # Return whether token.like_num is True for any token in the doc
    return any(token.like_num for token in doc)

# Register the Doc extension attribute "has_number" with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("这家博物馆在2012年关了五个月。")
print("has_number:", doc._.has_number)

Output:

has_number: True

Use Span.set_extension to register "to_html" (with the method to_html).
Call it on doc[0:3] with the tag "strong".

import spacy
from spacy.tokens import Span

nlp = spacy.blank("zh")

# Define the method
def to_html(span, tag):
    # Wrap the span text in an HTML tag and return it
    return f"<{tag}>{span.text}</{tag}>"

# Register the Span method extension "to_html" with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call to_html on the span with the tag name "strong"
doc = nlp("大家好，这是一个句子。")
span = doc[0:3]
print(span._.to_html("strong"))

Output:

<strong>大家好</strong>

Entities and extensions

Complete the get_wikipedia_url getter so that it only returns a URL if the span's label is in the list of labels.
Set the Span extension "wikipedia_url" using the getter get_wikipedia_url.
Iterate over the entities in the doc and print their Wikipedia URLs.

import spacy
from spacy.tokens import Span

nlp = spacy.load("zh_core_web_sm")

def get_wikipedia_url(span):
    # Build a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOCATION"):
        entity_text = span.text.replace(" ", "_")
        return "https://zh.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url with the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp(
    "出道这么多年，周杰伦已经成为几代年轻人共同的偶像。"
)
for ent in doc.ents:
    # Print the entity text and its Wikipedia URL
    print(ent.text, ent._.wikipedia_url)

Output:

周杰伦 https://zh.wikipedia.org/w/index.php?search=周杰伦

Components with extensions

The matcher variable already holds a phrase matcher that matches all countries. The CAPITALS variable holds a dictionary that maps country names to their capital cities.

Complete countries_component_function and create a Span with the label "GPE" (geopolitical entity) for all matches.
Add the component to the pipeline.
Register the Span extension attribute "capital" with the getter get_capital.
Process the text and print the entity text, entity label and capital city for each entity in doc.ents.

import json
import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("exercises/zh/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

with open("exercises/zh/capitals.json", encoding="utf8") as f:
    CAPITALS = json.loads(f.read())

nlp = spacy.blank("zh")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

@Language.component("countries_component")
def countries_component_function(doc):
    # Create an entity Span with the label "GPE" for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc

# Add the component to the pipeline
nlp.add_pipe("countries_component")
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with this getter
Span.set_extension("capital", getter=get_capital, force=True)

# Process the text and print the entity text, label and capital attribute
doc = nlp("新加坡可能会和马来西亚一起建造高铁。")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

Output:

['countries_component']
[('新加坡', 'GPE', '新加坡'), ('马来西亚', 'GPE', '吉隆坡')]

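Scaling and performance

Processing large volumes of text

Use the nlp.pipe method
It processes texts as a stream and yields Doc objects
It is much faster than calling nlp on each text, because the texts are processed in batches

Bad:

docs = [nlp(text) for text in LOTS_OF_TEXTS]

Good:

docs = list(nlp.pipe(LOTS_OF_TEXTS))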
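To make the streaming behaviour concrete, here is a small sketch. It assumes a blank English pipeline and a generated list of 1000 short texts as a stand-in for LOTS_OF_TEXTS; the `batch_size` value of 100 is an arbitrary choice for illustration, not a recommendation.

```python
import spacy

nlp = spacy.blank("en")  # assumption: blank pipeline so no model is needed

# Hypothetical stand-in for LOTS_OF_TEXTS.
texts = [f"This is text number {i}." for i in range(1000)]

# nlp.pipe is lazy: nothing is processed until the generator is consumed.
doc_stream = nlp.pipe(texts, batch_size=100)
first_doc = next(doc_stream)
remaining_docs = list(doc_stream)
print(first_doc.text, len(remaining_docs))
```

Because the result is a generator, wrapping it in list() is only needed when all Doc objects have to be held in memory at once.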

Passing in context (1)

Setting as_tuples=True on nlp.pipe lets you pass in a sequence of (text, context) tuples.
It yields a sequence of (doc, context) tuples.
This is useful for associating metadata with the doc.

data = [
    ("这是一段文本", {"id": 1, "page_number": 15}),
    ("以及另一段文本", {"id": 2, "page_number": 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context["page_number"])

这是一段文本 15
以及另一段文本 16

Passing in context (2)

from spacy.tokens import Doc

Doc.set_extension("id", default=None)
Doc.set_extension("page_number", default=None)

data = [
    ("这是一段文本", {"id": 1, "page_number": 15}),
    ("以及另一段文本", {"id": 2, "page_number": 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]

Using only the tokenizer (1)

Don't run the whole pipeline!

Using only the tokenizer (2)

Use nlp.make_doc to turn a text into a Doc object

Bad:

doc = nlp("Hello world")

Good:

doc = nlp.make_doc("Hello world!")
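A short sketch of the difference, assuming a blank English pipeline with one hypothetical component (`demo_mark_processed`) whose only job is to record on a hypothetical `demo_processed` attribute that it ran:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical attribute and component names, for illustration only.
Doc.set_extension("demo_processed", default=False, force=True)

@Language.component("demo_mark_processed")
def demo_mark_processed(doc):
    # Record that a pipeline component has touched this doc
    doc._.demo_processed = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("demo_mark_processed")

full_doc = nlp("Hello world!")               # tokenizer + components
tokens_only = nlp.make_doc("Hello world!")   # tokenizer only
print(full_doc._.demo_processed, tokens_only._.demo_processed)
print([token.text for token in tokens_only])
```

The Doc returned by nlp.make_doc is fully tokenized, but none of the components have been applied to it.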

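Disabling pipeline components

Use nlp.select_pipes to temporarily disable one or more pipeline components.

# Disable the tagger and the parser
with nlp.select_pipes(disable=["tagger", "parser"]):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)

The components are re-enabled after the with block
While they are disabled, spaCy only runs the remaining components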
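The restore behaviour can be checked directly by looking at nlp.pipe_names inside and outside the block. The sketch assumes a blank English pipeline with the rule-based `sentencizer` and a hypothetical no-op component `demo_noop` standing in for trained components; it also shows the `enable` argument, which keeps only the listed components.

```python
import spacy
from spacy.language import Language

# Hypothetical no-op component standing in for a trained one.
@Language.component("demo_noop")
def demo_noop(doc):
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("demo_noop")

with nlp.select_pipes(disable=["sentencizer"]):
    inside = list(nlp.pipe_names)   # only the components still enabled
after = list(nlp.pipe_names)        # everything is restored

# enable= is the inverse: keep only the listed components.
with nlp.select_pipes(enable=["demo_noop"]):
    enabled_only = list(nlp.pipe_names)

print(inside, after, enabled_only)
```

Disabling only makes sense for components that the remaining ones do not depend on.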

Processing streams

Rewrite this example using nlp.pipe. Instead of iterating over the texts and processing them one by one, iterate over the doc objects yielded by nlp.pipe.

import json
import spacy

nlp = spacy.load("zh_core_web_sm")

with open("exercises/zh/weibo.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

Output:

[]

['老']

[]

Rewrite this example using nlp.pipe. Remember to call list() on the result to turn it into a list.

import json
import spacy

nlp = spacy.load("zh_core_web_sm")

with open("exercises/zh/weibo.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

Output:

(麦当劳,) (麦当劳, 汉堡, 汉堡) (麦当劳,) (中国, 麦当劳, 北京) (麦当劳,) (今天, 早上, 麦当劳, 一整天)

Rewrite this example using nlp.pipe. Remember to call list() on the result to turn it into a list.

import spacy

nlp = spacy.blank("zh")

people = ["周杰伦", "庞麦郎", "诸葛亮"]

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))
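One way such a pattern list might then be used, sketched with nlp.pipe building the patterns in a single batch. The "PERSON" match key and the sample sentence are assumptions made for this illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("zh")  # the default segmenter splits into single characters
people = ["周杰伦", "庞麦郎", "诸葛亮"]

# Build all patterns in one batch with nlp.pipe
patterns = list(nlp.pipe(people))
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PERSON", patterns)

# Hypothetical sentence to run the matcher on
doc = nlp("周杰伦和诸葛亮都很有名。")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Since only the tokenizer output is needed for phrase patterns, a blank pipeline is enough here.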

Processing data with context

The variable DATA holds a list of example [text, context] pairs. The texts are quotes from famous books, and the contexts are dictionaries with the keys "author" and "book".

Use the set_extension method to register the custom attributes "author" and "book" on the Doc, with the default value None.
Process the [text, context] pairs in DATA using nlp.pipe with as_tuples=True.
Overwrite doc._.book and doc._.author with the corresponding information passed in as context.

import json
import spacy
from spacy.tokens import Doc

with open("exercises/en/bookquotes.json", encoding="utf8") as f:
    DATA = json.loads(f.read())

nlp = spacy.blank("en")

# Register the Doc extension "author" (default None)
Doc.set_extension("author", default=None)

# Register the Doc extension "book" (default None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context["book"]
    doc._.author = context["author"]

    # Print the text and the custom attribute data
    print(f"{doc.text}\n — '{doc._.book}' by {doc._.author}\n")

Output:

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.

— 'Metamorphosis' by Franz Kafka

I know not all that may be coming, but be it what it will, I'll go to it laughing.

— 'Moby-Dick or, The Whale' by Herman Melville

It was the best of times, it was the worst of times.

— 'A Tale of Two Cities' by Charles Dickens

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.

— 'On the Road' by Jack Kerouac

It was a bright cold day in April, and the clocks were striking thirteen.

— '1984' by George Orwell

Nowadays people know the price of everything and the value of nothing.

— 'The Picture Of Dorian Gray' by Oscar Wilde

Selective processing

Rewrite the code to only tokenize the text using nlp.make_doc.

import spacy

nlp = spacy.load("zh_core_web_sm")
text = (
    "在300多年的风雨历程中，历代同仁堂人始终恪守“炮制虽繁必不敢省人工，品味虽贵必不敢减物力”的古训，"
    "树立“修合无人见，存心有天知”的自律意识，造就了制药过程中兢兢小心、精益求精的严细精神。"
)

# Only tokenize the text
doc = nlp.make_doc(text)
print([token.text for token in doc])

Output:

['在', '300多', '年', '的', '风雨', '历程', '中', '，', '历代', '同仁', '堂人', '始终', '恪守', '“', '炮制', '虽', '繁必', '不', '敢', '省', '人工', '，', '品味', '虽', '贵必', '不', '敢', '减物力', '”', '的', '古训', '，', '树立', '“', '修合', '无', '人', '见', '，', '存心', '有', '天知', '”', '的', '自律', '意识', '，', '造就', '了', '制药', '过程', '中', '兢兢小心', '、', '精益求精', '的', '严细', '精神', '。']

Disable the tagger and the parser using the nlp.select_pipes method.
Process the text and print all the entities in the doc.

import spacy

nlp = spacy.load("zh_core_web_sm")
text = (
    "在300多年的风雨历程中，历代同仁堂人始终恪守“炮制虽繁必不敢省人工，品味虽贵必不敢减物力”的古训，"
    "树立“修合无人见，存心有天知”的自律意识，造就了制药过程中兢兢小心、精益求精的严细精神。"
)

# Disable the tagger and the parser
with nlp.select_pipes(disable=["tagger", "parser"]):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

Output:

(300多年,)
