Chapter 1: Language Processing and Python

Miracle_520

Category: NLP

Article link: https://blog.csdn.net/Miracle_520/article/details/90722165

import nltk
nltk.download()
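# Load everything from NLTK's book module
from nltk.book import *
# Search the text: a concordance shows every occurrence of a word together with its context
text1.concordance('monstrous')
# Which other words appear in similar contexts?
text1.similar('monstrous')
# Examine the contexts shared by two or more words
text2.common_contexts(['monstrous','very'])
# Show where words occur in the text, measured as the number of words from the beginning.
# A dispersion plot displays this: each stripe marks one occurrence of a word, and each row spans the entire text
text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])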
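# Count the tokens (words and punctuation marks) in the text
len(text3)
# Get the vocabulary: the sorted set of distinct tokens
sorted(set(text3))
len(set(text3))
# Measure lexical richness: the average number of times each distinct token is used
# (Python 3 performs true division by default, so no __future__ import is needed)
len(text3) / len(set(text3))
# Count how often a word occurs, and compute what percentage of the text a given word takes up
text3.count('smote')
100 * text4.count('a') / len(text4)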
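The two measurements above can be tried without downloading any corpus. The following is a minimal sketch in plain Python; the helper names `lexical_diversity` and `percentage` and the toy token list are assumptions made for illustration and do not appear in the original post.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))


def percentage(count, total):
    """Share of `total` represented by `count`, as a percentage."""
    return 100 * count / total


# Toy token list standing in for an NLTK text (an assumption for illustration)
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

print(lexical_diversity(tokens))
print(percentage(tokens.count("the"), len(tokens)))
```

The same functions should accept an NLTK text such as `text3`, since they rely only on `len()`, `set()`, and `count()`.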
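# Frequency distribution: find the 50 most common words
fdist1 = FreqDist(text1)
print(fdist1)
# In NLTK 3, keys() is not ordered by frequency, so take the words from most_common()
vocabulary1 = [word for word, count in fdist1.most_common()]
print(vocabulary1)
print(vocabulary1[:50])
print(fdist1['whale'])
# Plot the cumulative frequency of these words
fdist1.plot(50, cumulative=True)
# In Moby Dick, the 50 most common words account for nearly half of all tokens.
# If the frequent words don't help us, what about the words that occur only once (the so-called hapaxes)?
fdist1.hapaxes()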
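To see what a frequency distribution computes, here is a rough sketch using `collections.Counter` from the standard library in place of NLTK's `FreqDist`. The sample sentence is invented for illustration; this is not how the post itself does it.

```python
from collections import Counter

# Invented sample text, split on whitespace (an assumption for illustration)
tokens = "the whale and the sea and the ship sailed on".split()
counts = Counter(tokens)

# Most frequent tokens first, comparable to FreqDist.most_common()
top_two = counts.most_common(2)

# Tokens that occur exactly once, comparable to FreqDist.hapaxes()
hapaxes = [word for word, n in counts.items() if n == 1]

print(top_two)
print(hapaxes)
```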
# Neither the high-frequency words nor the low-frequency words help much
# Select words in a more fine-grained way
V = set(text1)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))
fdist5 = FreqDist(text5)
print(sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7]))
# Collocations: bigrams (pairs of adjacent words) that occur unusually often
text4.collocations()
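As a simplified illustration of the idea behind collocations, the sketch below only counts raw adjacent word pairs. This is a deliberately simpler technique than `text4.collocations()`, which, as far as I know, also filters and scores the pairs rather than ranking them by raw count. The sample tokens are invented.

```python
from collections import Counter

# Invented sample tokens (an assumption for illustration)
tokens = "united states of america and the united states senate".split()

# Pair each token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

# The most frequent adjacent pair and its count
print(bigram_counts.most_common(1))
```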
# Count how often each word length occurs
fdist = FreqDist([len(w) for w in text1])
print(fdist.keys())
print(fdist.items())
print(fdist.max())
print(fdist[3])
print(fdist.freq(3))
# Filter out all non-alphabetic items to remove numbers and punctuation from the vocabulary, and lowercase the rest
print(len(set([word.lower() for word in text1 if word.isalpha()])))
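The effect of this normalization is easier to see on a small input. A minimal sketch, assuming an invented token list:

```python
# Invented tokens mixing case variants, a number, and punctuation (an assumption)
tokens = ["This", "is", "it", ".", "this", "IS", "42", "It", "!"]

# Keep alphabetic tokens only and fold case, so "This" and "this" count once
vocab = {t.lower() for t in tokens if t.isalpha()}

print(len(vocab))
print(sorted(vocab))
```

Without the `isalpha()` filter and the `lower()` call, the same list would contain nine distinct items instead of three.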
# Human-machine dialogue systems (chatbots)
nltk.chat.chatbots()
