1. Install spaCy as described here: https://spacy.io/usage
My choices: "Select pipeline for efficiency" corresponds to en_core_web_sm, and "Select pipeline for accuracy" corresponds to en_core_web_trf.
2. Download the spaCy model archives manually, then install them with:
pip install en_core_web_sm-3.0.0.tar.gz
pip install en_core_web_trf-3.0.0.tar.gz
en_core_web_sm archive: https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.0.0
en_core_web_trf archive: https://github.com/explosion/spacy-models/releases/tag/en_core_web_trf-3.0.0
3. A typical NLP processing pipeline:
Zhihu overview: https://zhuanlan.zhihu.com/p/63110761
Official docs: https://spacy.io/usage/linguistic-features
4. spaCy label reference:
POS tagging labels: https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/POS-Keywords.html
Dependency labels: https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
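Instead of looking labels up in the references above, spacy.explain returns a human-readable description for any POS tag or dependency label:

```python
import spacy

# Map label abbreviations to descriptions from spaCy's built-in glossary.
print(spacy.explain("NN"))     # noun, singular or mass
print(spacy.explain("nsubj"))  # nominal subject
print(spacy.explain("dobj"))   # direct object
```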
5. Token attributes in spaCy: https://www.jianshu.com/p/488e29470755
6. Inspecting and printing the spaCy dependency tree:
import spacy
from nltk import Tree

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is some sentence that spacy will not appreciate .")

# Print each token, its head, all of its children, its left children
# (children whose index precedes the token), and its right children
# (children whose index follows it).
# Note: children are directly attached tokens; indirectly attached
# tokens are descendants, not children.
for token in doc:
    print(token.text, token.head.text, [child for child in token.children],
          [left_child for left_child in token.lefts],
          [right_child for right_child in token.rights])

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    return node.orth_

for sent in doc.sents:
    to_nltk_tree(sent.root).pretty_print()
7. Batch-processing texts: https://spacy.io/usage/processing-pipelines
texts = ["This is a text", "These are lots of texts", "..."]
docs = list(nlp.pipe(texts))