Summary of spaCy Usage Examples

When using spaCy for natural language processing, common use cases include tokenization, named entity recognition, part-of-speech tagging, and dependency parsing. Below are some common examples with the corresponding code:

Tokenization

Split text into basic units such as words and punctuation marks.

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")
# Text to tokenize
text = "This is a sample sentence."
doc = nlp(text)

# Print the tokens
for token in doc:
    print(token.text)

Output:

This
is
a
sample
sentence
.
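
Tokens carry more than their surface text. As a small extension of the example above, the sketch below also prints each token's lemma, punctuation flag, and stop-word flag (all standard Token attributes):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")

# Print the surface text, lemma, punctuation flag, and stop-word flag
for token in doc:
    print(token.text, token.lemma_, token.is_punct, token.is_stop)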

Named Entity Recognition

Identify named entities in the text, such as person names, place names, and organizations.

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")
# Input text
text = "Apple is a big company, headquartered in Cupertino, California."
# Process the text
doc = nlp(text)
# Extract the named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Output:

Apple ORG
Cupertino GPE
California GPE
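
If a label such as GPE is unfamiliar, spacy.explain() returns a short human-readable description:

import spacy

# Look up descriptions for entity labels
print(spacy.explain("ORG"))  # Companies, agencies, institutions, etc.
print(spacy.explain("GPE"))  # Countries, cities, states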

Part-of-Speech Tagging

Tag each word in the text with its part of speech.

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "This is a sample sentence."

# Process the text
doc = nlp(text)

# Print the part-of-speech tag for each token
for token in doc:
    print(token.text, token.pos_)

Output:

This PRON
is AUX
a DET
sample NOUN
sentence NOUN
. PUNCT
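
token.pos_ is the coarse universal tag; token.tag_ holds the fine-grained tag, which spacy.explain() can decode as well. A small sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence.")

# Print the coarse tag, the fine-grained tag, and its description
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))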

Dependency Parsing

Analyze the dependency relations between the words in the text.

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text
doc = nlp(text)

# Print each token's dependency label, head, head POS, and children
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

Output:

Apple nsubj looking VERB []
is aux looking VERB []
looking ROOT looking VERB [Apple, is, at, startup]
at prep looking VERB [buying]
buying pcomp at ADP [U.K.]
U.K. dobj buying VERB []
startup dep looking VERB [for]
for prep startup NOUN [billion]
$ quantmod billion NUM []
1 compound billion NUM []
billion pobj for ADP [$, 1]
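
The parse tree is easier to read visually. spaCy's built-in displacy module can render it; the sketch below writes the tree to an HTML file (the file name dep_tree.html is arbitrary):

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Render the dependency tree as a standalone HTML page and save it
html = displacy.render(doc, style="dep", page=True)
with open("dep_tree.html", "w", encoding="utf-8") as f:
    f.write(html)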

Sentence Segmentation

Split the text into sentences.

import spacy
nlp = spacy.load("en_core_web_sm")
# Optional here: the model's parser already sets sentence boundaries,
# and the rule-based sentencizer does not override them
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sentence in doc.sents:
    print(sentence)

Output:

This is a sentence.
This is another sentence.
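
If only sentence boundaries are needed, loading a full model is unnecessary; a blank pipeline with just the rule-based sentencizer is much faster. A minimal sketch:

import spacy

# Blank English pipeline with only the sentencizer
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")
for sentence in doc.sents:
    print(sentence.text)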

Keyword Extraction

A simple approach: keep the nouns and proper nouns in the text as keyword candidates.

import spacy

nlp = spacy.load("en_core_web_sm")
text = """
    Please ignore that NLLB is not made to translate this large number of tokens at once. Again, I am more interest in the computational limits I have.

I already use torch.no_grad() and put the model in evaluation mode which I read online should safe some memory. My full code to run the inference looks like this:
    """

doc = nlp(text)
# Keep nouns and proper nouns as keyword candidates
keywords = [token.text for token in doc if token.pos_ in ['NOUN', 'PROPN']]
print(keywords)

Output:

['NLLB', 'number', 'tokens', 'interest', 'limits', 'torch.no_grad', 'model', 'evaluation', 'mode', 'memory', 'code', 'inference']
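
A common refinement is to rank candidates by frequency rather than list every occurrence, for example with collections.Counter over lowercased noun lemmas. A sketch under that assumption:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("Please ignore that NLLB is not made to translate this "
          "large number of tokens at once.")

# Count lemmatized nouns and proper nouns, skipping stop words
counts = Counter(
    token.lemma_.lower()
    for token in doc
    if token.pos_ in ("NOUN", "PROPN") and not token.is_stop
)
print(counts.most_common(5))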

Sentence Similarity Comparison

Compare the semantic similarity of sentences using word vectors.

import spacy

# The large model includes word vectors, which similarity comparison requires
nlp = spacy.load("en_core_web_lg")

doc1 = nlp("the person wear red T-shirt")
doc2 = nlp("this person is walking")
doc3 = nlp("the boy wear red T-shirt")

# Compare the documents pairwise
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))

Output:

0.7003971105290047
0.9671912343259517
0.6121211244876517
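
Doc.similarity on vector-based models defaults to the cosine similarity of the averaged word vectors, so the first number above can be reproduced manually with numpy (already a spaCy dependency):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
doc1 = nlp("the person wear red T-shirt")
doc2 = nlp("this person is walking")

# Cosine similarity between the averaged document vectors
cos = np.dot(doc1.vector, doc2.vector) / (
    np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector)
)
print(cos)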

