How to install chatdet in Python: NLTK installation, objects, corpora, word segmentation, POS tagging, and chunking

Latest recommended article published on 2022-10-25 20:44:15

力扣（LeetCode）


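Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_30402009/article/details/112957932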


Python version

NLTK requires Python 2.7 or 3.4+ (the requirement at the time of writing).

Install with pip

pip install -U nltk

Install the NLTK data

import nltk

nltk.download()

# Import the Brown Corpus

from nltk.corpus import brown

brown.words()

If NLTK cannot find the data after the download, set the NLTK_DATA environment variable to the directory that contains the data.
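Besides the environment variable, a directory can also be added to NLTK's search list from inside Python. The following is a minimal sketch: the helper name `register_data_dir` and the `~/nltk_data` location are assumptions made for this example and are not part of NLTK. It works on any list, so the logic can be checked without NLTK installed.

```python
import os


def register_data_dir(data_dir, search_path):
    """Append data_dir to search_path if it is not already listed.

    Returns the expanded, absolute form of data_dir.
    """
    resolved = os.path.abspath(os.path.expanduser(data_dir))
    if resolved not in search_path:
        search_path.append(resolved)
    return resolved


# Stand-in for nltk.data.path, the list of directories NLTK searches.
demo_path = []
added = register_data_dir("~/nltk_data", demo_path)  # hypothetical location
register_data_dir("~/nltk_data", demo_path)          # second call changes nothing
print(demo_path == [added])
```

With NLTK installed, passing `nltk.data.path` as `search_path` makes NLTK also search that directory for the rest of the current process.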

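Text objects

from nltk.book import *

# Print every occurrence of the given word together with its surrounding context

text1.concordance('monstrous')

# Print other words that appear in the same contexts as the given word

text1.similar('monstrous')

# Takes a list of words and prints the contexts shared by all of them

text1.common_contexts(['monstrous', 'gamesome'])

# Plot where each word occurs across the text

text4.dispersion_plot(['freedom', 'America'])

# Return the number of times the word occurs in the text

text1.count('monstrous')

# Print the bigrams (collocations) that occur frequently in the text

text1.collocations()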
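To make the idea behind `concordance` concrete, here is a small pure-Python sketch that collects each occurrence of a word with a few tokens of context on either side. It is an illustration only, not NLTK's implementation; the function name `concordance_lines`, the `width` parameter, and the sample sentence are made up for this example.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of word with up to `width` tokens of context."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        # Case-insensitive match is a choice made for this sketch.
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


# Made-up sample tokens; with NLTK, list(text1) could be passed instead.
tokens = "the monstrous whale and a most monstrous size".split()
for line in concordance_lines(tokens, "monstrous", width=2):
    print(line)
```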

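FreqDist objects

import nltk

from nltk.book import *

''' Build a FreqDist object. FreqDist is a dict subclass: the keys are words and the values are the total number of times each word occurs. The FreqDist constructor accepts any list. '''

fdist1 = FreqDist(text1)

# Plot the most frequent words

fdist1.plot(10)

# Print the 15 most frequent items as a table

fdist1.tabulate(15)

# Return a list of the 15 most frequent items

# [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ...

fdist1.most_common(15)

# Return the list of hapaxes, i.e. items that occur exactly once

# ['whalin', 'comforts', 'footmanism', 'peacefulness', 'incorruptible', ...]

fdist1.hapaxes()

# Return the item with the highest count

fdist1.max()

# Words longer than 7 characters that occur more than 7 times in the text

words = set(text1)

long_words = [w for w in words if len(w) > 7 and fdist1[w] > 7]

print(sorted(long_words))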
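In NLTK 3, `FreqDist` builds on `collections.Counter`, so the core of the operations above can be tried with the standard library alone. The sketch below uses a made-up token list and reproduces the ideas behind `most_common`, `hapaxes`, and `max`; it is an illustration and not a replacement for `FreqDist`, which adds plotting and tabulation.

```python
from collections import Counter

# Made-up sample tokens; with NLTK, text1 could be counted the same way.
tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

# The two most frequent items with their counts.
top_two = counts.most_common(2)

# Hapaxes: items that occur exactly once.
hapaxes = sorted(w for w, n in counts.items() if n == 1)

# The single most frequent item.
most_frequent = max(counts, key=counts.get)

print(top_two, hapaxes, most_frequent)
```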

Chinese word segmentation

# -*- coding:utf-8 -*-

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

segmenter = StanfordSegmenter(

    path_to_jar="stanford-segmenter-3.7.0.jar",

    path_to_slf4j="slf4j-simple-1.7.25.jar",

    path_to_sihan_corpora_dict="./data",

    path_to_model="./data/pku.gz",

    path_to_dict="./data/dict-chris6.ser.gz"

)

sentence = u"这是斯坦福中文分词器测试"

# Input sentence: 这是斯坦福中文分词器测试 ("This is a test of the Stanford Chinese word segmenter")

# print() with a single argument works on both Python 2.7 and Python 3

print(segmenter.segment(sentence))

print(segmenter.segment_file("test.simp.utf8"))
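The Stanford segmenter needs the Java jars and model files listed above. To show only the shape of a segmenter's output (a sequence of word tokens), here is a toy dictionary-based segmenter using forward maximum matching. This is a deliberately swapped-in, much simpler technique and not the method the Stanford segmenter uses; the function name `forward_max_match`, the `max_len` parameter, and the tiny vocabulary are all assumptions made for this example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary for illustration only; a real segmenter uses a trained model.
vocab = {"斯坦福", "中文", "分词器", "测试"}
print(" ".join(forward_max_match(u"这是斯坦福中文分词器测试", vocab)))
```

Characters that are not covered by any dictionary entry come out as single-character tokens, which is the usual fallback for this kind of matcher.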

Corpora

import nltk

# Gutenberg corpus. gutenberg, webtext and inaugural are PlaintextCorpusReader instances

from nltk.corpus import gutenberg

# Return the list of file identifiers in the corpus

# ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', '...

gutenberg.fileids()

# Takes one or more file identifiers and returns the text as a list of words

# ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

emma = gutenberg.words("austen-emma.txt")

# Takes one or more file identifiers and returns the text as one raw string

# '[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, ...'

emma_str = gutenberg.raw("austen-emma.txt")

# Takes one or more file identifiers and returns the text as a list of sentences

emma_sents = gutenberg.sents("austen-emma.txt")

print(emma_sents)

# Web text corpus

# ['firefox.txt', 'grail.txt', 'overheard.txt', 'pirates.txt', 'singles.txt', 'wine.txt']

from nltk.corpus import webtext

print(webtext.fileids())

# Inaugural address corpus

from nltk.corpus import inaugural

print(inaugural.fileids())

# Instant-messaging chat session corpus. nps_chat is an NPSChatCorpusReader object

from nltk.corpus import nps_chat

print(nps_chat.fileids())

# Return a list of posts, where each post is itself a list of words

chat_room = nps_chat.posts()
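The three views of a text shown above (raw string, word list, sentence list) can be combined into simple per-text statistics. The sketch below is a minimal illustration written for Python 3 that runs on made-up sample data, so it does not need the corpus downloads; the function name `corpus_stats` and the sample text are assumptions for this example.

```python
def corpus_stats(raw, words, sents):
    """Return (chars per word, words per sentence, words per distinct word)."""
    vocab = set(w.lower() for w in words)
    return (
        len(raw) / len(words),    # average characters per token (incl. spaces)
        len(words) / len(sents),  # average tokens per sentence
        len(words) / len(vocab),  # how often each distinct token is used
    )


# Made-up stand-ins for the raw / words / sents views of one corpus file.
raw = "The cat sat. The dog ran."
words = ["The", "cat", "sat", ".", "The", "dog", "ran", "."]
sents = [["The", "cat", "sat", "."], ["The", "dog", "ran", "."]]
stats = corpus_stats(raw, words, sents)
print(stats)
```

With the Gutenberg corpus downloaded, the same function can be called with `gutenberg.raw(fileid)`, `gutenberg.words(fileid)` and `gutenberg.sents(fileid)` for each `fileid` in `gutenberg.fileids()`.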
