NLP: spaCy

EnjoyFailure

Last modified: 2023-10-03 16:00:25

Category: NLP. Tags: natural language processing, deep learning, artificial intelligence

First published: 2023-10-02 18:06:39

Article link: https://blog.csdn.net/S_5922/article/details/133498385

Importing the package (spaCy and the matching language model must be installed first)

import spacy
nlp = spacy.load('en_core_web_sm')  # load the matching language model
doc = nlp('Jack is learning NLP')  # the sentence to process
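
Both pieces are usually installed with `pip install spacy` followed by `python -m spacy download en_core_web_sm`. If the model is missing, the load call fails, so a small guard can help. The following is only a minimal sketch: the helper names `model_available` and `load_pipeline` are made up for illustration, and it assumes the model was installed as a regular Python package, which is what the download command does.

```python
import importlib.util


def model_available(name):
    """Return True if a package called `name` can be imported.

    Assumption: the spaCy model was installed as a pip package, e.g. via
    `python -m spacy download en_core_web_sm`.
    """
    return importlib.util.find_spec(name) is not None


def load_pipeline(name="en_core_web_sm"):
    """Load a spaCy pipeline, failing with a readable hint if it is missing."""
    if not model_available(name):
        raise RuntimeError(
            f"Model {name!r} not found; try: python -m spacy download {name}"
        )
    import spacy  # imported lazily so the check above works without spaCy

    return spacy.load(name)
```

With this in place, `nlp = load_pipeline()` could stand in for the direct `spacy.load` call.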

Part-of-speech tagging

for token in doc:  # doc has already been tokenized
    print('{}-{}'.format(token, token.pos_))  # each token with its part-of-speech tag
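
Beyond printing each tag, it is often useful to summarize how many tokens fall under each one. Here is a rough sketch using `collections.Counter`. The function name `pos_counts` is hypothetical, and the stand-in tokens carry hand-written tags so the snippet runs without spaCy; they are illustrative values, not real model output.

```python
from collections import Counter
from types import SimpleNamespace


def pos_counts(tokens):
    """Count how often each coarse part-of-speech tag occurs."""
    return Counter(token.pos_ for token in tokens)


# Hand-written stand-ins so the sketch runs without spaCy installed.
# The tags are illustrative, not real model output.
fake_doc = [
    SimpleNamespace(text="Jack", pos_="PROPN"),
    SimpleNamespace(text="is", pos_="AUX"),
    SimpleNamespace(text="learning", pos_="VERB"),
]
print(pos_counts(fake_doc))
```

With a real pipeline, `pos_counts(doc)` should work the same way, since iterating over a spaCy `doc` yields tokens that expose `pos_`.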

Named entity recognition

for ent in doc.ents:
    print('{}-{}'.format(ent, ent.label_))  # each named entity with its label
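
When a text contains many entities, grouping them by label can be easier to read than a flat printout. A small sketch of that idea follows. `entities_by_label` is a made-up helper name, and the stand-in entities are hand-written so the snippet runs without a model; only the `text` and `label_` attributes are assumed.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_label(ents):
    """Group entity texts under their label, e.g. {'PERSON': ['Jack']}."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Hand-written stand-ins; the labels are illustrative, not model output.
fake_ents = [
    SimpleNamespace(text="Jack", label_="PERSON"),
    SimpleNamespace(text="London", label_="GPE"),
    SimpleNamespace(text="Jill", label_="PERSON"),
]
print(entities_by_label(fake_ents))
```

With a real pipeline the call would be `entities_by_label(doc.ents)`.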

Example: finding all the characters in a book and how often they appear

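# Example: find all the characters in a book and how often they appear
from collections import Counter
def find_person(processed_text):  # processed_text is text already processed by the spaCy nlp pipeline
    c = Counter()  # counter
    for ent in processed_text.ents:
        if ent.label_ == 'PERSON':  # use the NER label to pick out entities that meet a specific requirement
            # Count by ent.text: each Span is a distinct key, so c[ent] would never accumulate repeated names
            c[ent.text] += 1
    return c
print(find_person(doc).most_common(3))  # print the three most frequent characters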
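
For a whole book, passing the full text to `nlp` in one call may run into spaCy's length limit (`nlp.max_length`, 1,000,000 characters by default). One possible approach, sketched below, is to split the text into paragraphs and stream them through `nlp.pipe`. The helper names `split_paragraphs` and `count_people` are hypothetical, and splitting on blank lines is an assumption about how the book text is laid out.

```python
from collections import Counter


def split_paragraphs(text):
    """Split raw text on blank lines and drop empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def count_people(nlp, paragraphs):
    """Count PERSON entities across many short texts.

    `nlp` is expected to be a loaded spaCy pipeline; only its `pipe`
    method is used here.
    """
    counts = Counter()
    for doc in nlp.pipe(paragraphs):
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                counts[ent.text] += 1  # key by string so repeats accumulate
    return counts
```

A call might look like `count_people(nlp, split_paragraphs(book_text)).most_common(3)`, where `book_text` is the raw text of the book as a string.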

Supplement:

The pandas library for data analysis

It can be used here to display the counting results for analysis.

import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# With no index or columns passed, the resulting index is the union of the Series indexes, and the columns are the dict keys
df = pd.DataFrame(d)
print(df)

# With index specified, data with matching labels is taken from each Series; labels with no match get NaN
df = pd.DataFrame(d, index=["d", "b", "a"])
print(df)

# With both index and columns specified, likewise: if the dict has no key matching a requested column label, every value in that column is NaN
df = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print(df)

# Output
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
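
To tie pandas back to the character counts, a `Counter` of names can be turned into a small table. This is only a sketch under stated assumptions: the helper names `counts_to_rows` and `counts_to_frame` and the column names `name` and `count` are chosen here for illustration, and the counter is assumed to use plain strings as keys.

```python
from collections import Counter


def counts_to_rows(counter, top_n=None):
    """Turn a Counter into a list of {'name', 'count'} dicts, most frequent first."""
    return [
        {"name": name, "count": count}
        for name, count in counter.most_common(top_n)
    ]


def counts_to_frame(counter, top_n=None):
    """Build a two-column DataFrame from the rows above."""
    import pandas as pd  # imported lazily; requires pandas to be installed

    return pd.DataFrame(counts_to_rows(counter, top_n), columns=["name", "count"])
```

For example, `print(counts_to_frame(find_person(doc), 3))` would show the three most frequent characters as a table.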