spacy简单使用

lllhhhv

已于 2022-04-14 10:45:31 修改

阅读量7.6k

点赞数 10

分类专栏： nlp 文章标签： nlp

于 2022-03-07 17:58:48 首次发布

本文链接：https://blog.csdn.net/lllhhhv/article/details/123335675

版权

nlp 专栏收录该内容

3 篇文章 1 订阅

订阅专栏

spacy官方: Install spaCy · spaCy Usage Documentation

3.词性标注 (Part-of-speech tagging)

4.识别停用词 (Stop words)

5.命名实体识别 (Named Entity Recognization)

6.依存分析 (Dependency Parsing)

7.词性还原 (Lemmatization)

8.提取名词短语 (Noun Chunks)

9.指代消解 (Coreference Resolution)

三、可视化

简介:

spacy 可以用于进行分词，命名实体识别，词性识别等等

一、安装

pip install spacy

1. 训练模型

安装之后还要下载官方的训练模型, 不同的语言有不同的训练模型,这里只用对应中文的模型演示:

python -m spacy download zh_core_web_sm

代码中使用:

import spacy
nlp = spacy.load("zh_core_web_sm")

模型官方文档:

Trained Models & Pipelines · spaCy Models Documentation

每种语言也会有几种不同的模型,例如中文的模型除了刚才下载的 zh_core_web_sm 外,还有zh_core_web_trf、zh_core_web_md 等,它们的区别在于准确度和体积大小, zh_core_web_sm 体积小,准确度相比zh_core_web_trf差,zh_core_web_trf相对就体积大。这样可以适应不同场景。

这里以模型 zh_core_web_sm 做一个介绍

Trained Models & Pipelines · spaCy Models Documentation

LANGUAGE : 那种语言的模型
SIZE : 模型体积的大小
COMPONENTS(组件): 模型具备的功能

tok2vec: 分词

tagger: 词性标注

parser: 依存分析

senter: 分句

ner: 命名实体识别

attribute_ruler: 更改属性映射(没有具体了解)

PIPELINE (管道) : 管道组件, 模型会按照管道组件去执行

模型会中指明包含哪些词性、依存分析、实体种类:

这是一些词性名称的解释:

IP：简单从句
NP：名词短语
VP：动词短语
PU：断句符，通常是句号、问号、感叹号等标点符号
LCP：方位词短语
PP：介词短语
CP：由‘的’构成的表示修饰性关系的短语
DNP：由‘的’构成的表示所属关系的短语
ADVP：副词短语
ADJP：形容词短语
DP：限定词短语
QP：量词短语
NN：常用名词
NR：固有名词
NT：时间名词
PN：代词
VV：动词
VC：是
CC：表示连词
VE：有
VA：表语形容词
AS：内容标记（如：了）
VRD：动补复合词
CD: 表示基数词
DT: determiner 表示限定词
EX: existential there 存在句
FW: foreign word 外来词
IN: preposition or conjunction, subordinating 介词或从属连词
JJ: adjective or numeral, ordinal 形容词或序数词
JJR: adjective, comparative 形容词比较级
JJS: adjective, superlative 形容词最高级
LS: list item marker 列表标识
MD: modal auxiliary 情态助动词
PDT: pre-determiner 前位限定词
POS: genitive marker 所有格标记
PRP: pronoun, personal 人称代词
RB: adverb 副词
RBR: adverb, comparative 副词比较级
RBS: adverb, superlative 副词最高级
RP: particle 小品词
SYM: symbol 符号
TO:”to” as preposition or infinitive marker 作为介词或不定式标记
WDT: WH-determiner WH限定词
WP: WH-pronoun WH代词
WP$: WH-pronoun, possessive WH所有格代词
WRB:Wh-adverb WH副词

官方关于词性、依存关系、实体的名词解释:

def explain(term):
    """Get a description for a given POS tag, dependency label or entity type.
    term (str): The term to explain.
    RETURNS (str): The explanation, or `None` if not found in the glossary.
    EXAMPLE:
        >>> spacy.explain(u'NORP')
        >>> doc = nlp(u'Hello world')
        >>> print([w.text, w.tag_, spacy.explain(w.tag_) for w in doc])
    """
    if term in GLOSSARY:
        return GLOSSARY[term]


GLOSSARY = {
    # POS tags
    # Universal POS Tags
    # http://universaldependencies.org/u/pos/
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CONJ": "conjunction",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
    "EOL": "end of line",
    "SPACE": "space",
    # POS tags (English)
    # OntoNotes 5 / Penn Treebank
    # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    ".": "punctuation mark, sentence closer",
    ",": "punctuation mark, comma",
    "-LRB-": "left round bracket",
    "-RRB-": "right round bracket",
    "``": "opening quotation mark",
    '""': "closing quotation mark",
    "''": "closing quotation mark",
    ":": "punctuation mark, colon or ellipsis",
    "$": "symbol, currency",
    "#": "symbol, number sign",
    "AFX": "affix",
    "CC": "conjunction, coordinating",
    "CD": "cardinal number",
    "DT": "determiner",
    "EX": "existential there",
    "FW": "foreign word",
    "HYPH": "punctuation mark, hyphen",
    "IN": "conjunction, subordinating or preposition",
    "JJ": "adjective (English), other noun-modifier (Chinese)",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "LS": "list item marker",
    "MD": "verb, modal auxiliary",
    "NIL": "missing tag",
    "NN": "noun, singular or mass",
    "NNP": "noun, proper singular",
    "NNPS": "noun, proper plural",
    "NNS": "noun, plural",
    "PDT": "predeterminer",
    "POS": "possessive ending",
    "PRP": "pronoun, personal",
    "PRP$": "pronoun, possessive",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "RBS": "adverb, superlative",
    "RP": "adverb, particle",
    "TO": 'infinitival "to"',
    "UH": "interjection",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
    "WP": "wh-pronoun, personal",
    "WP$": "wh-pronoun, possessive",
    "WRB": "wh-adverb",
    "SP": "space (English), sentence-final particle (Chinese)",
    "ADD": "email",
    "NFP": "superfluous punctuation",
    "GW": "additional word in multi-word expression",
    "XX": "unknown",
    "BES": 'auxiliary "be"',
    "HVS": 'forms of "have"',
    "_SP": "whitespace",
    # POS Tags (German)
    # TIGER Treebank
    # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
    "$(": "other sentence-internal punctuation mark",
    "$,": "comma",
    "$.": "sentence-final punctuation mark",
    "ADJA": "adjective, attributive",
    "ADJD": "adjective, adverbial or predicative",
    "APPO": "postposition",
    "APPR": "preposition; circumposition left",
    "APPRART": "preposition with article",
    "APZR": "circumposition right",
    "ART": "definite or indefinite article",
    "CARD": "cardinal number",
    "FM": "foreign language material",
    "ITJ": "interjection",
    "KOKOM": "comparative conjunction",
    "KON": "coordinate conjunction",
    "KOUI": 'subordinate conjunction with "zu" and infinitive',
    "KOUS": "subordinate conjunction with sentence",
    "NE": "proper noun",
    "NNE": "proper noun",
    "PAV": "pronominal adverb",
    "PROAV": "pronominal adverb",
    "PDAT": "attributive demonstrative pronoun",
    "PDS": "substituting demonstrative pronoun",
    "PIAT": "attributive indefinite pronoun without determiner",
    "PIDAT": "attributive indefinite pronoun with determiner",
    "PIS": "substituting indefinite pronoun",
    "PPER": "non-reflexive personal pronoun",
    "PPOSAT": "attributive possessive pronoun",
    "PPOSS": "substituting possessive pronoun",
    "PRELAT": "attributive relative pronoun",
    "PRELS": "substituting relative pronoun",
    "PRF": "reflexive personal pronoun",
    "PTKA": "particle with adjective or adverb",
    "PTKANT": "answer particle",
    "PTKNEG": "negative particle",
    "PTKVZ": "separable verbal particle",
    "PTKZU": '"zu" before infinitive',
    "PWAT": "attributive interrogative pronoun",
    "PWAV": "adverbial interrogative or relative pronoun",
    "PWS": "substituting interrogative pronoun",
    "TRUNC": "word remnant",
    "VAFIN": "finite verb, auxiliary",
    "VAIMP": "imperative, auxiliary",
    "VAINF": "infinitive, auxiliary",
    "VAPP": "perfect participle, auxiliary",
    "VMFIN": "finite verb, modal",
    "VMINF": "infinitive, modal",
    "VMPP": "perfect participle, modal",
    "VVFIN": "finite verb, full",
    "VVIMP": "imperative, full",
    "VVINF": "infinitive, full",
    "VVIZU": 'infinitive with "zu", full',
    "VVPP": "perfect participle, full",
    "XY": "non-word containing non-letter",
    # POS Tags (Chinese)
    # OntoNotes / Chinese Penn Treebank
    # https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports
    "AD": "adverb",
    "AS": "aspect marker",
    "BA": "把 in ba-construction",
    # "CD": "cardinal number",
    "CS": "subordinating conjunction",
    "DEC": "的 in a relative clause",
    "DEG": "associative 的",
    "DER": "得 in V-de const. and V-de-R",
    "DEV": "地 before VP",
    "ETC": "for words 等, 等等",
    # "FW": "foreign words"
    "IJ": "interjection",
    # "JJ": "other noun-modifier",
    "LB": "被 in long bei-const",
    "LC": "localizer",
    "M": "measure word",
    "MSP": "other particle",
    # "NN": "common noun",
    "NR": "proper noun",
    "NT": "temporal noun",
    "OD": "ordinal number",
    "ON": "onomatopoeia",
    "P": "preposition excluding 把 and 被",
    "PN": "pronoun",
    "PU": "punctuation",
    "SB": "被 in short bei-const",
    # "SP": "sentence-final particle",
    "VA": "predicative adjective",
    "VC": "是 (copula)",
    "VE": "有 as the main verb",
    "VV": "other verb",
    # Noun chunks
    "NP": "noun phrase",
    "PP": "prepositional phrase",
    "VP": "verb phrase",
    "ADVP": "adverb phrase",
    "ADJP": "adjective phrase",
    "SBAR": "subordinating conjunction",
    "PRT": "particle",
    "PNP": "prepositional noun phrase",
    # Dependency Labels (English)
    # ClearNLP / Universal Dependencies
    # https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
    "acl": "clausal modifier of noun (adjectival clause)",
    "acomp": "adjectival complement",
    "advcl": "adverbial clause modifier",
    "advmod": "adverbial modifier",
    "agent": "agent",
    "amod": "adjectival modifier",
    "appos": "appositional modifier",
    "attr": "attribute",
    "aux": "auxiliary",
    "auxpass": "auxiliary (passive)",
    "case": "case marking",
    "cc": "coordinating conjunction",
    "ccomp": "clausal complement",
    "clf": "classifier",
    "complm": "complementizer",
    "compound": "compound",
    "conj": "conjunct",
    "cop": "copula",
    "csubj": "clausal subject",
    "csubjpass": "clausal subject (passive)",
    "dative": "dative",
    "dep": "unclassified dependent",
    "det": "determiner",
    "discourse": "discourse element",
    "dislocated": "dislocated elements",
    "dobj": "direct object",
    "expl": "expletive",
    "fixed": "fixed multiword expression",
    "flat": "flat multiword expression",
    "goeswith": "goes with",
    "hmod": "modifier in hyphenation",
    "hyph": "hyphen",
    "infmod": "infinitival modifier",
    "intj": "interjection",
    "iobj": "indirect object",
    "list": "list",
    "mark": "marker",
    "meta": "meta modifier",
    "neg": "negation modifier",
    "nmod": "modifier of nominal",
    "nn": "noun compound modifier",
    "npadvmod": "noun phrase as adverbial modifier",
    "nsubj": "nominal subject",
    "nsubjpass": "nominal subject (passive)",
    "nounmod": "modifier of nominal",
    "npmod": "noun phrase as adverbial modifier",
    "num": "number modifier",
    "number": "number compound modifier",
    "nummod": "numeric modifier",
    "oprd": "object predicate",
    "obj": "object",
    "obl": "oblique nominal",
    "orphan": "orphan",
    "parataxis": "parataxis",
    "partmod": "participal modifier",
    "pcomp": "complement of preposition",
    "pobj": "object of preposition",
    "poss": "possession modifier",
    "possessive": "possessive modifier",
    "preconj": "pre-correlative conjunction",
    "prep": "prepositional modifier",
    "prt": "particle",
    "punct": "punctuation",
    "quantmod": "modifier of quantifier",
    "rcmod": "relative clause modifier",
    "relcl": "relative clause modifier",
    "reparandum": "overridden disfluency",
    "root": "root",
    "vocative": "vocative",
    "xcomp": "open clausal complement",
    # Dependency labels (German)
    # TIGER Treebank
    # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
    # currently missing: 'cc' (comparative complement) because of conflict
    # with English labels
    "ac": "adpositional case marker",
    "adc": "adjective component",
    "ag": "genitive attribute",
    "ams": "measure argument of adjective",
    "app": "apposition",
    "avc": "adverbial phrase component",
    "cd": "coordinating conjunction",
    "cj": "conjunct",
    "cm": "comparative conjunction",
    "cp": "complementizer",
    "cvc": "collocational verb construction",
    "da": "dative",
    "dh": "discourse-level head",
    "dm": "discourse marker",
    "ep": "expletive es",
    "hd": "head",
    "ju": "junctor",
    "mnr": "postnominal modifier",
    "mo": "modifier",
    "ng": "negation",
    "nk": "noun kernel element",
    "nmc": "numerical component",
    "oa": "accusative object",
    "oc": "clausal object",
    "og": "genitive object",
    "op": "prepositional object",
    "par": "parenthetical element",
    "pd": "predicate",
    "pg": "phrasal genitive",
    "ph": "placeholder",
    "pm": "morphological particle",
    "pnc": "proper noun component",
    "rc": "relative clause",
    "re": "repeated element",
    "rs": "reported speech",
    "sb": "subject",
    "sbp": "passivized subject (PP)",
    "sp": "subject or predicate",
    "svp": "separable verb prefix",
    "uc": "unit component",
    "vo": "vocative",
    # Named Entity Recognition
    # OntoNotes 5
    # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "FACILITY": "Buildings, airports, highways, bridges, etc.",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "PRODUCT": "Objects, vehicles, foods, etc. (not services)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "LAW": "Named documents made into laws.",
    "LANGUAGE": "Any named language",
    "DATE": "Absolute or relative dates or periods",
    "TIME": "Times smaller than a day",
    "PERCENT": 'Percentage, including "%"',
    "MONEY": "Monetary values, including unit",
    "QUANTITY": "Measurements, as of weight or distance",
    "ORDINAL": '"first", "second", etc.',
    "CARDINAL": "Numerals that do not fall under another type",
    # Named Entity Recognition
    # Wikipedia
    # http://www.sciencedirect.com/science/article/pii/S0004370212000276
    # https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
    "PER": "Named person or family.",
    "MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
    # https://github.com/ltgoslo/norne
    "EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
    "PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
    "DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
    "GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
    "GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}

url: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

二、功能

import spacy


s = "小米董事长叶凡决定投资华为。在2002年，他还创作了<遮天>。"

nlp = spacy.load("zh_core_web_sm")
doc = nlp(s)

1. 分句 (sentencizer)

# 1. 分句 (sentencizer)
for i in doc.sents:
    print(i)



"""
小米董事长叶凡决定投资华为。
在2002年，他还创作了<遮天>。

"""

2.分词 (Tokenization)

# 2. 分词 (Tokenization)
print([w.text for w in doc])


"""
['小米', '董事长', '叶凡', '决定', '投资', '华为', '。', '在', '2002年', '，', '他', '还', '创作', '了', '<遮天>', '。']
"""

3.词性标注 (Part-of-speech tagging)

细粒度

print([(w.text, w.tag_) for w in doc])


"""
[('小米', 'NR'), ('董事长', 'NN'), ('叶凡', 'NR'), ('决定', 'VV'), ('投资', 'VV'), ('华为', 'NR'), ('。', 'PU'), ('在', 'P'), ('2002年', 'NT'), ('，', 'PU'), ('他', 'PN'), ('还', 'AD'), ('创作', 'VV'), ('了', 'AS'), ('<遮天>', 'NN'), ('。', 'PU')]
"""

粗粒度

print([(w.text, w.pos_) for w in doc])

"""
[('小米', 'PROPN'), ('董事长', 'NOUN'), ('叶凡', 'PROPN'), ('决定', 'VERB'), ('投资', 'VERB'), ('华为', 'PROPN'), ('。', 'PUNCT'), ('在', 'ADP'), ('2002年', 'NOUN'), ('，', 'PUNCT'), ('他', 'PRON'), ('还', 'ADV'), ('创作', 'VERB'), ('了', 'PART'), ('<遮天>', 'NOUN'), ('。', 'PUNCT')]

"""

4.识别停用词 (Stop words)

print([(w.text, w.is_stop) for w in doc])


"""
[('小米', False), ('董事长', False), ('叶凡', False), ('决定', True), ('投资', False), ('华为', False), ('。', True), ('在', True), ('2002年', False), ('，', True), ('他', True), ('还', True), ('创作', False), ('了', True), ('<遮天>', False), ('。', True)]
"""

5.命名实体识别 (Named Entity Recognization)

# 命名实体识别 (Named Entity Recognization)
print([(e.text, e.label_) for e in doc.ents])



"""
[('小米', 'PERSON'), ('叶凡', 'PERSON'), ('2002年', 'DATE')]
"""

6.依存分析 (Dependency Parsing)

print([(w.text, w.dep_) for w in doc])


"""
[('小米', 'nmod:assmod'), ('董事长', 'appos'), ('叶凡', 'nsubj'), ('决定', 'ROOT'), ('投资', 'ccomp'), ('华为', 'dobj'), ('。', 'punct'), ('在', 'case'), ('2002年', 'nmod:prep'), ('，', 'punct'), ('他', 'nsubj'), ('还', 'advmod'), ('创作', 'ROOT'), ('了', 'aux:asp'), ('<遮天>', 'dobj'), ('。', 'punct')]
"""

7.词性还原 (Lemmatization)

这个模型没有这个功能,用英文模型演示下

找到单词的原型，即词性还原，将am, is, are, have been 还原成be，复数还原成单数(cats -> cat)，过去时态还原成现在时态 (had -> have)。

import spacy
nlp = spacy.load('en_core_web_sm')

txt = "A magnetic monopole is a hypothetical elementary particle."
doc = nlp(txt)

lem = [token.lemma_ for token in doc]
print(lem)


"""
['a', 'magnetic', 'monopole', 'be', 'a', 'hypothetical', 'elementary', 'particle', '.']
"""

8.提取名词短语 (Noun Chunks)

这个模型没有这个功能,用英文模型演示下

noun_chunks = [nc for nc in doc.noun_chunks]
print(noun_chunks)

"""
[A magnetic monopole, a hypothetical elementary particle]
"""

9.指代消解 (Coreference Resolution)

指代消解，寻找句子中代词 he，she，it 所对应的实体。为了使用这个模块，需要使用神经网络预训练的指代消解系数，如果前面没有安装，可运行命令：pip install neuralcoref

这个模型没有这个功能,用英文模型演示下

txt = "My sister has a son and she loves him."

# 将预训练的神经网络指代消解加入到spacy的管道中
import neuralcoref
neuralcoref.add_to_pipe(nlp)

doc = nlp(txt)
doc._.coref_clusters

"""
[My sister: [My sister, she], a son: [a son, him]]
"""

三、可视化

from spacy import displacy



# 可视化依存关系
html_str = displacy.render(doc, style="dep")


#可视化命名名称实体
# html_str = displacy.render(doc, style="ent")


with open("D:\\data\\ss.html", "w", encoding="utf8") as f:
    f.write(html_str)

html_str 是一个html格式的字符串, 保存到本地 ss.html文件,浏览器打开效果:

依存关系