NLTK(text)

This blog series contains my study notes for the book "Natural Language Processing with Python".

import nltk
from nltk.book import *
text1
Out[64]: <Text: Moby Dick by Herman Melville 1851>

text2
Out[65]: <Text: Sense and Sensibility by Jane Austen 1811>

Text::concordance(word)

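This method takes a word as a string and prints every occurrence of that word in the text together with its surrounding context. Looking at the contexts a word appears in can help us work out its part of speech.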

text1.concordance('monstrous')
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
text1.concordance("imperial")
Displaying 10 of 10 matches:
ALING NOT RESPECTABLE ? Whaling is imperial ! By old English statutory law , t
 fortress . Curious to tell , this imperial negro , Ahasuerus Daggoo , was the
h outward homage as if he wore the imperial purple , and not the shabbiest of 
f geographical empire encircles an imperial brain ; then , the plebeian herds 
or profoundly dines with the seven Imperial Electors , so these cabin meals we
 overlording Rome , having for the imperial colour the same imperial hue ; and
g for the imperial colour the same imperial hue ; and though this pre - eminen
 could have furnished him . A most imperial and archangelical apparition of th
ts he should only be treated of in imperial folio . Not to tell over again his
with archangelic shrieks , and his imperial beak thrust upwards , and his whol
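To make the idea concrete, here is a minimal sketch of a keyword-in-context search in plain Python. It is not NLTK's implementation; the function name concordance_lines, the width parameter, and the sample sentence are my own assumptions for illustration.

```python
def concordance_lines(tokens, word, width=30):
    """Return one keyword-in-context line per case-insensitive match of `word`."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Keep at most `width` characters of context on each side of the match.
        left = " ".join(tokens[:i])[-width:]
        right = " ".join(tokens[i + 1:])[:width]
        # Right-align the left context so the matched word lines up in a column.
        lines.append(f"{left:>{width}} {token} {right}")
    return lines


# Hypothetical sample tokens, loosely based on the output above.
tokens = "Whaling is imperial ! A most imperial and archangelical apparition".split()
for line in concordance_lines(tokens, "imperial", width=20):
    print(line)
```

Aligning the matched word in a fixed column is what makes the surrounding words easy to scan by eye.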

Text::similar(word)

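This method takes a word as a string and prints other words that appear in the same contexts as the input word, in other words, words that are used similarly to it. For example, monstrous occurs in the context the_Pictures, so similar looks for other words that also occur in the the_Pictures context. By default it prints the 20 best matches.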

text1.similar('monstrous')
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate
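The idea of "words sharing a context" can be sketched in a few lines of plain Python. This is a simplification under my own assumptions: a context is just the pair (previous word, next word), and words are ranked by the number of contexts they share with the input word. NLTK's own scoring differs, and the name similar_words and the sample tokens are hypothetical.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=20):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    tokens = [t.lower() for t in tokens]
    word = word.lower()
    contexts = defaultdict(set)  # word -> set of (previous word, next word)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur].add((prev, nxt))
    target = contexts.get(word, set())
    scores = Counter()
    for other, ctxs in contexts.items():
        shared = len(ctxs & target)
        if other != word and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


# Hypothetical sample: "monstrous" shares the_pictures with "curious"
# and most_and with "imperial".
tokens = ("the monstrous pictures of whales and the curious pictures of ships "
          "a most monstrous and most imperial and grand size").split()
print(sorted(similar_words(tokens, "monstrous")))
```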

Text::common_contexts(words)

This method takes a list of words and prints the contexts that all of the words in the list share.

text1.common_contexts(['monstrous', 'imperial'])
most_and
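As a rough illustration of what "shared context" means here, the following plain-Python sketch intersects the (previous word, next word) pairs of each input word. It is not NLTK's code; the function below and its sample tokens are assumptions made for this example.

```python
from collections import defaultdict


def common_contexts(tokens, words):
    """Return the contexts (formatted as prev_next) shared by every word in `words`."""
    if not words:
        return []
    tokens = [t.lower() for t in tokens]
    contexts = defaultdict(set)  # word -> set of (previous word, next word)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur].add((prev, nxt))
    # Keep only the contexts that occur with all of the given words.
    shared = set.intersection(*(contexts.get(w.lower(), set()) for w in words))
    return sorted(f"{prev}_{nxt}" for prev, nxt in shared)


# Hypothetical sample tokens, loosely based on the concordance lines above.
tokens = ("that flood most monstrous and most mountainous "
          "a most imperial and archangelical apparition").split()
print(common_contexts(tokens, ["monstrous", "imperial"]))  # prints ['most_and']
```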

Text::dispersion_plot(words)

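This method takes a list of words and plots where each word occurs across the text, with position measured in words from the beginning of the text. Drawing the plot requires matplotlib.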
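The data behind such a plot is simply the list of token positions at which each word occurs. Here is a minimal sketch that computes those positions and draws a crude text version of the plot. The helper names word_offsets and text_strip and the sample sentence are my own assumptions, and matching is case-sensitive in this sketch.

```python
def word_offsets(tokens, words):
    """Map each target word to the token positions where it occurs."""
    offsets = {w: [] for w in words}
    for i, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(i)
    return offsets


def text_strip(positions, length):
    """Draw one row of the plot: '|' where the word occurs, '.' elsewhere."""
    marks = set(positions)
    return "".join("|" if i in marks else "." for i in range(length))


# Hypothetical sample tokens.
tokens = "the whale saw Ahab and Ahab saw the whale".split()
offsets = word_offsets(tokens, ["whale", "Ahab"])
for word, positions in offsets.items():
    print(f"{word:>6} {text_strip(positions, len(tokens))}")
```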

Text::collocations()

This method prints the bigrams (pairs of adjacent words) that occur frequently in the text.

text1.collocations()
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
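A much simplified version of this idea is to count adjacent word pairs and keep the most frequent ones. The sketch below does only that. As far as I know, NLTK itself ranks pairs with a statistical association measure and filters out stopwords rather than using raw counts, so treat the function frequent_bigrams, its parameters, and the sample tokens as assumptions for illustration.

```python
from collections import Counter


def frequent_bigrams(tokens, num=20, min_word_len=3):
    """Return the most frequent adjacent word pairs, skipping punctuation and short words."""
    def keep(tok):
        return tok.isalpha() and len(tok) >= min_word_len

    pairs = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if keep(a) and keep(b)
    )
    return [f"{a} {b}" for (a, b), _ in pairs.most_common(num)]


# Hypothetical sample tokens.
tokens = "Sperm Whale ; Moby Dick ; the Sperm Whale and Moby Dick ; Sperm Whale".split()
print("; ".join(frequent_bigrams(tokens, num=2)))  # prints Sperm Whale; Moby Dick
```

Raw counts favour pairs made of common words, which is one reason a real implementation would score pairs by how much more often they co-occur than chance would predict.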
First, install the nltk and sumy libraries (sumy is the package that provides a TextRank summarizer). In an IPython or Jupyter session, nltk can be installed with the following command:

```python
!pip install nltk
```

sumy can be installed with the following command:

```python
!pip install sumy
```

Next, we use nltk to split a text into word tokens:

```python
import nltk
from nltk.tokenize import word_tokenize

# Text
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

# Word tokenization (needs the punkt tokenizer data, e.g. via nltk.download('punkt'))
tokens = word_tokenize(text)
print(tokens)
```

The output is:

```
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
```

Next, we split the same text with sumy's tokenizer and parser:

```python
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

# Text
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

# Sentence splitting
tokenizer = Tokenizer("english")
parser = PlaintextParser.from_string(text, tokenizer)
tokens = [str(sentence).strip() for sentence in parser.document.sentences]
print(tokens)
```

The output is:

```
['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.']
```

As the output shows, the sumy parser returns sentences rather than words, and because the sample text is a single sentence, the whole text comes back as one item. For sentence-level tokenization with nltk, use its sent_tokenize function.