For the source code, see: 自然语言处理练习 — code written while learning natural language processing (gitee.com)
4. Using the spaCy Toolkit
4.1 Installing spaCy
The spaCy toolkit claims to do everything NLTK can do, only faster, to integrate better with deep learning, and, most importantly, to provide a Chinese language model!
For certain reasons better left unsaid, installing via the official instructions rarely succeeds, so I recommend installing the bundled conda packages directly.
Run
conda install spacy
conda install -c conda-forge spacy-model-en_core_web_sm
and the installation should succeed.
If it does not, you can look online for an offline spaCy installation package; this article may help:
安装spaCy(最简单的教程)_spacy安装_御用厨师的博客-CSDN博客
4.2 Loading a Model
You can install whichever model you need and then load it with a single call; here I use the English model as a demonstration.
Example:
import spacy

# Load the model
nlp = spacy.load("en_core_web_sm")
4.3 Tokenization
spaCy can perform tokenization as well.
Example:
# Load the text
doc = nlp('Weather is good, very windy and sunny. We have no classes in the afternoon')
# Print each token
for token in doc:
    print(token)
4.4 Sentence Segmentation
spaCy also provides sentence segmentation.
Example:
# Split the document into sentences
for sent in doc.sents:
    print(sent)
4.5 Part-of-Speech Tagging
Like NLTK, spaCy can analyze part-of-speech tags.
Example:
# Print each token with its coarse-grained POS tag
for token in doc:
    print('{}-{}'.format(token, token.pos_))
For the meaning of each tag, consult a POS tag reference table.
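Instead of an external reference table, tag meanings can also be looked up programmatically with spacy.explain, which maps a label to a short description. A small sketch using coarse-grained POS labels:

```python
import spacy

# spacy.explain returns a human-readable description of a tag
print(spacy.explain('NOUN'))  # noun
print(spacy.explain('VERB'))  # verb
print(spacy.explain('ADJ'))   # adjective
```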
4.6 Named Entity Recognition
spaCy also provides named entity recognition.
Example:
# Named entity recognition
doc_2 = nlp("I went to Paris where I met my old friend Jack from uni")
for ent in doc_2.ents:
    print('{}-{}'.format(ent, ent.label_))
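The entity labels printed this way are abbreviations; spacy.explain works for them as well. For instance, for the three labels used later in this chapter:

```python
import spacy

# Describe the NER labels used in this chapter
for label in ['PERSON', 'ORG', 'GPE']:
    print('{}-{}'.format(label, spacy.explain(label)))
```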
The results can also be visualized:
# Render the entities as HTML and save to a file
import os
from pathlib import Path
from spacy import displacy

doc = nlp("I went to Paris where I met my old friend Jack from uni")
html = displacy.render(doc, style='ent')
output_path = Path(os.path.join("./", "sentence.html"))
with output_path.open('w', encoding="utf-8") as f:
    f.write(html)
4.7 Finding All the Characters' Names in a Book
Using Pride and Prejudice as the corpus, a hands-on example of finding all the characters' names.
Example:
# Find all the characters' names in the book
import os
from collections import Counter

def read_file(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        return f.read()

text = read_file(os.path.join('./', 'data/Pride and Prejudice.txt'))
processed_text = nlp(text)

sentences = [s for s in processed_text.sents]
print(len(sentences))
print(sentences[:5])

# Count PERSON entities and return the ten most frequent
def find_person(doc):
    c = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            c[ent.lemma_] += 1
    return c.most_common(10)

print(find_person(processed_text))
4.8 Terrorist Attack Analysis
Using terrorist-attack reports downloaded from a global counter-terrorism organization's website, analyze how many times particular groups carried out attacks in particular locations.
Example:
# Terrorist attack analysis
import os
from collections import Counter, defaultdict

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def read_file_to_list(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        return f.readlines()

terrorist_articles = read_file_to_list(os.path.join('./', 'data/rand-terrorism-dataset.txt'))
print(terrorist_articles[:5])

# Lowercase each article and run it through the pipeline
terrorist_articles_nlp = [nlp(art.lower()) for art in terrorist_articles]

common_terrorist_groups = [
    'taliban',
    'al-qaeda',
    'hamas',
    'fatah',
    'plo',
    'bilad al-rafidayn'
]
common_locations = [
    'iraq',
    'baghdad',
    'kirkuk',
    'mosul',
    'afghanistan',
    'kabul',
    'basra',
    'palestine',
    'gaza',
    'israel',
    'istanbul',
    'beirut',
    'pakistan'
]

# For each article, collect group (PERSON/ORG) and location (GPE) entities,
# keep only the ones in the lists above, and count each (group, location) pair
location_entity_dict = defaultdict(Counter)
for article in terrorist_articles_nlp:
    article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ == 'PERSON' or ent.label_ == 'ORG']
    article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']
    terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
    location_common = [ent for ent in article_locations if ent in common_locations]
    for found_entity in terrorist_common:
        for found_location in location_common:
            location_entity_dict[found_entity][found_location] += 1
print(location_entity_dict)

# Build a DataFrame (groups as columns, locations as rows) and plot a heatmap
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict), dtype=int)
location_entity_df = location_entity_df.fillna(value=0).astype(int)
print(location_entity_df)

plt.figure(figsize=(12, 10))
hmap = sns.heatmap(location_entity_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False)
plt.title("Global Incidents by Terrorist group")
plt.xticks(rotation=30)
plt.show()