NLP Study Notes (Part 1)

Latest recommended article published 2022-11-15 16:57:20

宋建国



Category: Natural Language Processing

Article link: https://blog.csdn.net/hot7732788/article/details/89281763


1. The NLTK module


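from nltk.book import *  # loads text1 ... text9 (needs the NLTK "book" data collection)

def lexical_diversity(text):  # average number of times each distinct token is used (repetition rate)
    return len(text) / len(set(text))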
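
The ratio is easy to check on a hand-made token list, with no corpus download needed. This is a small sketch; the sample tokens and the variable name `tokens` are made up for illustration and do not come from any NLTK text.

```python
def lexical_diversity(text):
    # Average number of times each distinct token is used.
    return len(text) / len(set(text))

# Hypothetical toy input, not taken from an NLTK corpus.
tokens = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]

ratio = lexical_diversity(tokens)
# 8 tokens in total, 5 distinct tokens, so each is used 1.6 times on average.
print(len(tokens), len(set(tokens)), ratio)
```

A value close to 1 means almost every token is unique; larger values mean more repetition.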

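text1.concordance("monstrous")  # show every occurrence of the word with its surrounding context

text1.similar("monstrous")  # list words that appear in similar contexts

text2.common_contexts(["monstrous", "very"])  # contexts shared by both words

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])  # plot where each word occurs across the text (requires matplotlib)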
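
To make the idea behind a concordance concrete, here is a minimal pure-Python sketch. It is not how NLTK implements `concordance`; the helper name `simple_concordance`, its `width` parameter, and the sample sentence are all assumptions made for illustration.

```python
def simple_concordance(tokens, target, width=2):
    """Return each occurrence of target with `width` tokens of context on both sides."""
    lines = []
    for i, tok in enumerate(tokens):
        # Case-insensitive match in this sketch.
        if tok.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines

# Hypothetical toy sentence, already split into tokens.
tokens = "a most monstrous size and a monstrous long whale".split()
hits = simple_concordance(tokens, "monstrous")
for line in hits:
    print(line)
```

Each printed line is one hit of the target word together with the two tokens before and after it.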


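fdist1 = FreqDist(text1)  # frequency distribution keyed by the tokens themselves

# keys() is not ordered by frequency in NLTK 3, so use most_common() instead
vocabulary1 = [word for word, count in fdist1.most_common(50)]
print(vocabulary1)  # the 50 most frequent tokens

print(fdist1.hapaxes())  # tokens that occur exactly once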
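
In NLTK 3, `FreqDist` builds on Python's `collections.Counter`, so the two operations above can be tried out with the standard library alone. The following is a sketch under that assumption; the toy token list is invented, and the list comprehension is only a stand-in for what `hapaxes()` returns.

```python
from collections import Counter

# Hypothetical toy input, not taken from an NLTK corpus.
tokens = "the whale the sea the ship a whale".split()
counts = Counter(tokens)

# The two most frequent tokens with their counts.
top2 = counts.most_common(2)

# Tokens that occur exactly once (the "hapaxes").
hapaxes = [w for w, c in counts.items() if c == 1]

print(top2)
print(hapaxes)
```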

V = set(text4)
long_words = sorted(w for w in V if len(w) > 15)  # words longer than 15 characters, sorted alphabetically
print(long_words)

fdist5 = FreqDist(text5)
print(sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7))  # longer than 7 characters and occurring more than 7 times
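
The same "long and frequent" filter can be sketched on toy data with `collections.Counter`. The thresholds here are scaled down (length above 4, count above 1) so the small made-up list yields a result; both the data and the thresholds are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy input: a few words repeated a known number of times.
tokens = ["hello"] * 3 + ["goodbye"] * 2 + ["hi"] * 5 + ["wonderful"]
counts = Counter(tokens)

# Keep words that are both long enough and frequent enough.
selected = sorted(w for w in set(tokens) if len(w) > 4 and counts[w] > 1)
print(selected)
```

Here "hi" is frequent but too short, and "wonderful" is long but occurs only once, so neither passes both conditions.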

from nltk.util import bigrams  # word pairs (bigrams)
print(list(bigrams(['more', 'is', 'said', 'than', 'done'])))
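
A bigram list is just every pair of adjacent tokens. As a rough sketch of the idea (the helper name `simple_bigrams` is hypothetical and this is not NLTK's own implementation), the pairs can be produced with `zip`:

```python
def simple_bigrams(tokens):
    # Pair each token with the one that follows it.
    return list(zip(tokens, tokens[1:]))

pairs = simple_bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```

For a list of n tokens this gives n - 1 pairs, and a single-token list gives none.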


text4.collocations()  # prints frequent bigram collocations itself; it returns None, so no print() is needed

Build a frequency distribution keyed by word length, then output the keys (the distinct word lengths) and the count recorded for each length.

fdist = FreqDist([len(w) for w in text1])  # frequency distribution of word lengths
print(fdist)
print(fdist.keys())  # the keys, i.e. the distinct word lengths

print(fdist.items())  # (length, count) pairs
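
The word-length distribution can also be worked through by hand on a short token list. This sketch uses `collections.Counter` in place of `FreqDist`; the sample tokens and variable names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy input, already lowercased and split into tokens.
tokens = "call me ishmael some years ago never mind how long".split()

# Key the distribution by word length instead of by the word itself.
length_counts = Counter(len(w) for w in tokens)

by_length = sorted(length_counts.items())        # (length, count) pairs, ordered by length
most_common_length = length_counts.most_common(1)[0][0]
share = length_counts[most_common_length] / len(tokens)

print(by_length)
print(most_common_length, share)
```

Four of the ten tokens have length 4, so that length accounts for 0.4 of the sample.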
