For the source code, see: 自然语言处理练习 — code written while learning natural language processing (gitee.com)
4. Using the spaCy Toolkit
4.1 Installing spaCy
The spaCy toolkit claims to do everything NLTK can do, only faster, to integrate better with deep learning, and, most importantly, to provide a Chinese language model!
For certain reasons better left unsaid, installing via the official instructions rarely succeeds, so I recommend installing the bundled conda packages directly.
Run
conda install spacy
conda install -c conda-forge spacy-model-en_core_web_sm
and the installation should succeed.
If it does not, you can look online for an offline spaCy installation package; this article may help:
安装spaCy(最简单的教程)_spacy安装_御用厨师的博客-CSDN博客
4.2 Loading a Model
You can install whichever model you need and then load it with a single call; here I use the English model as a demonstration.
Example:
import spacy

# Load the model
nlp = spacy.load("en_core_web_sm")
4.3 Tokenization
spaCy can perform tokenization as well.
Example:
# Load the text
doc = nlp('Weather is good, very windy and sunny. We have no classes in the afternoon')
# Print each token
for token in doc:
    print(token)
4.4 Sentence Segmentation
spaCy also provides sentence segmentation.
Example:
# Split the document into sentences
for sent in doc.sents:
    print(sent)
4.5 Part-of-Speech Tagging
Like NLTK, spaCy can analyze part-of-speech tags.
Example:
# Print each token with its coarse-grained POS tag
for token in doc:
    print('{}-{}'.format(token, token.pos_))
For the meaning of each tag, consult a POS tag reference table.
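Instead of an external reference table, tag meanings can also be looked up programmatically with spacy.explain, which maps a label to a short description. A small sketch using coarse-grained POS labels:

```python
import spacy

# spacy.explain returns a human-readable description of a tag
print(spacy.explain('NOUN'))  # noun
print(spacy.explain('VERB'))  # verb
print(spacy.explain('ADJ'))   # adjective
```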
4.6 Named Entity Recognition
spaCy also provides named entity recognition.
Example:
# Named entity recognition
doc_2 = nlp("I went to Paris where I met my old friend Jack from uni")
for ent in doc_2.ents:
    print('{}-{}'.format(ent, ent.label_))
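The entity labels printed this way are abbreviations; spacy.explain works for them as well. For instance, for the three labels used later in this chapter:

```python
import spacy

# Describe the NER labels used in this chapter
for label in ['PERSON', 'ORG', 'GPE']:
    print('{}-{}'.format(label, spacy.explain(label)))
```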
The results can also be visualized:
# Render the entities as HTML and save to a file
import os
from pathlib import Path
from spacy import displacy

doc = nlp("I went to Paris where I met my old friend Jack from uni")
html = displacy.render(doc, style='ent')
output_path = Path(os.path.join("./", "sentence.html"))
with output_path.open('w', encoding="utf-8") as f:
    f.write(html)
4.7 Finding All the Characters' Names in a Book
Using Pride and Prejudice as the corpus, a hands-on example of finding all the characters' names.
Example:
# Find all the characters' names in the book
import os
from collections import Counter

def read_file(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        return f.read()

text = read_file(os.path.join('./', 'data/Pride and Prejudice.txt'))
processed_text = nlp(text)

sentences = [s for s in processed_text.sents]
print(len(sentences))
print(sentences[:5])

# Count PERSON entities and return the ten most frequent
def find_person(doc):
    c = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            c[ent.lemma_] += 1
    return c.most_common(10)

print(find_person(processed_text))
4.8 Terrorist Attack Analysis
Using terrorist-attack reports downloaded from a global counter-terrorism organization's website, analyze how many times particular groups carried out attacks in particular locations.
Example:
# Terrorist attack analysis
import os
from collections import Counter, defaultdict

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def read_file_to_list(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        return f.readlines()

terrorist_articles = read_file_to_list(os.path.join('./', 'data/rand-terrorism-dataset.txt'))
print(terrorist_articles[:5])

# Lowercase each article and run it through the pipeline
terrorist_articles_nlp = [nlp(art.lower()) for art in terrorist_articles]

common_terrorist_groups = [
    'taliban',
    'al-qaeda',
    'hamas',
    'fatah',
    'plo',
    'bilad al-rafidayn'
]
common_locations = [
    'iraq',
    'baghdad',
    'kirkuk',
    'mosul',
    'afghanistan',
    'kabul',
    'basra',
    'palestine',
    'gaza',
    'israel',
    'istanbul',
    'beirut',
    'pakistan'
]

# For each article, collect group (PERSON/ORG) and location (GPE) entities,
# keep only the ones in the lists above, and count each (group, location) pair
location_entity_dict = defaultdict(Counter)
for article in terrorist_articles_nlp:
    article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ == 'PERSON' or ent.label_ == 'ORG']
    article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']
    terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
    location_common = [ent for ent in article_locations if ent in common_locations]
    for found_entity in terrorist_common:
        for found_location in location_common:
            location_entity_dict[found_entity][found_location] += 1
print(location_entity_dict)

# Build a DataFrame (groups as columns, locations as rows) and plot a heatmap
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict), dtype=int)
location_entity_df = location_entity_df.fillna(value=0).astype(int)
print(location_entity_df)

plt.figure(figsize=(12, 10))
hmap = sns.heatmap(location_entity_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False)
plt.title("Global Incidents by Terrorist group")
plt.xticks(rotation=30)
plt.show()