An Introduction to the Basics of the Python spaCy Library (an NLP Processing Library)

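spaCy is an efficient, industrial-strength natural language processing (NLP) library focused on processing and analyzing text data. Unlike NLTK, spaCy is designed for production use and provides high-performance pretrained models and a concise API.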

Install spaCy:
```
pip install spacy
```
Download a pretrained model (using the English model as an example):
```
python -m spacy download en_core_web_sm
```
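- Model naming convention: [language]_[type]_[genre]_[size], where the type describes the pipeline's capabilities and the genre describes the kind of text it was trained on (for example, en_core_web_sm is a small, general-purpose English model trained on web text).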
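The naming convention can be illustrated with a small helper that splits a model name into its four parts. This is only a sketch: `parse_pipeline_name` is a hypothetical function written for this post, not part of spaCy, and it assumes the name follows the four-part pattern exactly.

```python
def parse_pipeline_name(name):
    """Split a model name such as 'en_core_web_sm' into its components.

    Hypothetical helper for illustration; spaCy does not ship this function.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"Expected 4 underscore-separated parts, got {len(parts)}: {name!r}")
    language, type_, genre, size = parts
    return {"language": language, "type": type_, "genre": genre, "size": size}


info = parse_pipeline_name("en_core_web_sm")
print(info)
```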

import spacy

# Load the pretrained model
nlp = spacy.load("en_core_web_sm")

# Process the text
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

Tokenization:

for token in doc:
    print(token.text)  # Print the text of each token

Output:

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
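Beyond the text itself, each token carries lexical attributes. The sketch below uses `spacy.blank("en")`, a pipeline that contains only the English tokenizer, so it should run without downloading a model; that choice is an assumption made for this example, and the variable names are illustrative.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Print the index, character offset, and a few lexical flags for each token
for token in doc:
    print(token.i, token.idx, token.text, token.is_alpha, token.is_punct, token.like_num)

tokens = [token.text for token in doc]
punctuation = [token.text for token in doc if token.is_punct]
numbers = [token.text for token in doc if token.like_num]
```

Tokenization rules are part of the language data rather than the trained model, which is why the blank pipeline splits the sentence the same way as the output above.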

Part-of-speech (POS) tagging:

for token in doc:
    print(f"{token.text} → {token.pos_} → {token.tag_}")  # Coarse-grained POS and fine-grained tag

Example output:

Apple → PROPN → NNP
is → AUX → VBZ
looking → VERB → VBG
...

Named entity recognition (NER):

for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")  # Entity text and entity type

Output:

Apple → ORG
U.K. → GPE
$1 billion → MONEY

Dependency parsing:

for token in doc:
    print(f"{token.text} → {token.dep_} → {token.head.text}")

Example output:

Apple → nsubj → looking
is → aux → looking
looking → ROOT → looking
...
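Once heads and labels are set, the tree can be navigated through `token.head` and `token.children`. The sketch below builds a tiny `Doc` by hand from the three rows shown above, so it does not need a trained parser; constructing the `Doc` manually is an assumption made for this example (it relies on the spaCy v3 `Doc` constructor, where `heads` are absolute token indices). With a loaded model, the same attributes are filled in by the parser.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations copied from the example output above
words = ["Apple", "is", "looking"]
heads = [2, 2, 2]                 # index of each token's head; the root points to itself
deps = ["nsubj", "aux", "ROOT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the token that is its own head
root = next(token for token in doc if token.head.i == token.i)

# Direct dependents of the root, in document order
children = [child.text for child in root.children]

# Tokens attached as nominal subjects
subjects = [token.text for token in doc if token.dep_ == "nsubj"]

print(root.text, children, subjects)
```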

spaCy provides the displacy module for visualizing the results of text analysis.

from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)  # Show the dependency tree in Jupyter

displacy.render(doc, style="ent", jupyter=True)  # Show the named entities in Jupyter
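Outside Jupyter, `displacy.render` can return the markup as a string when called with `jupyter=False`. The sketch below is a minimal example under two assumptions: it uses a blank pipeline, and the entities are set by hand to match the NER output shown earlier instead of being predicted by a model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Hand-set entity spans (token start, token end, label); a trained model would predict these
doc.ents = [
    Span(doc, 0, 1, label="ORG"),     # Apple
    Span(doc, 5, 6, label="GPE"),     # U.K.
    Span(doc, 8, 11, label="MONEY"),  # $1 billion
]

# page=True wraps the markup in a full HTML page; jupyter=False returns it as a string
html = displacy.render(doc, style="ent", page=True, jupyter=False)
print(len(html))
```

The returned string can be written to an `.html` file and opened in a browser.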

When processing a large number of texts, use nlp.pipe to process them in batches for better throughput:

texts = ["This is a sentence.", "Another example text."]
docs = list(nlp.pipe(texts))

# Can be combined with multiprocessing for speed (use with caution)
docs = list(nlp.pipe(texts, n_process=2))
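When each text has metadata attached, `nlp.pipe` can carry it through with `as_tuples=True`. The following is a minimal sketch, assuming a blank English pipeline and made-up record IDs; `batch_size` controls how many texts are buffered at a time.

```python
import spacy

nlp = spacy.blank("en")

# (text, context) pairs; the "id" values are invented for this example
records = [
    ("This is a sentence.", {"id": 1}),
    ("Another example text.", {"id": 2}),
]

token_counts = {}
# as_tuples=True yields (doc, context) pairs in the same order as the input
for doc, context in nlp.pipe(records, as_tuples=True, batch_size=50):
    token_counts[context["id"]] = len(doc)

print(token_counts)
```

Because `nlp.pipe` is a generator, results can be consumed one at a time without holding every `Doc` in memory.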

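Supported models:
- English: en_core_web_sm, en_core_web_md, en_core_web_lg (small/medium/large).
- Chinese: zh_core_web_sm.
- Other languages: German (de), French (fr), Spanish (es), and others.
Custom models:
spaCy lets users train their own models, which requires annotated data.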
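As a rough sketch of what preparing annotated data can look like, the example below converts character-offset entity annotations into a `DocBin`, the binary format used by spaCy v3 training. The `TRAIN_DATA` sample and the `build_docbin` helper are hypothetical and written for this post; the annotations reuse the entities from the NER output earlier.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Hypothetical annotated sample: (text, [(start_char, end_char, label), ...])
TRAIN_DATA = [
    (
        "Apple is looking at buying U.K. startup for $1 billion.",
        [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")],
    ),
]


def build_docbin(examples):
    """Convert (text, annotations) pairs into a DocBin of Docs with entities set."""
    doc_bin = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)  # tokenize only
        spans = []
        for start, end, label in annotations:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None means the offsets do not align with token boundaries
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin


doc_bin = build_docbin(TRAIN_DATA)
docs = list(doc_bin.get_docs(nlp.vocab))
entities = [(ent.text, ent.label_) for ent in docs[0].ents]
print(entities)
```

The result can be saved with `doc_bin.to_disk("train.spacy")` (the file name is an assumption) and passed to the `python -m spacy train` command together with a training config.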