Reading Notes on "Natural Language Processing with Python", Part 001

Latest recommended article published 2019-05-14 20:14:25

bright_silmarillion

Categories: NLTK, Reading Notes. Tags: NLP, Python3, Reading Notes.

Article link: https://blog.csdn.net/bright_silmarillion/article/details/81141866

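Chinese editions of this book covering Python 2 are easy to find online, but there are hardly any for the later Python 3 update, so I can only read the English e-book on the official site. I'll treat it as practice anyway.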

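The official site explicitly lists a few updates about NLTK 3.0, which indirectly suggests they are important or frequently used, much as print is for Python.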

NLTK also includes some pervasive changes:

many types are initialised from strings using a fromstring() method
many functions now return iterators instead of lists
ContextFreeGrammar is now called CFG and WeightedGrammar is now called PCFG
batch_tokenize() is now called tokenize_sents(); there are corresponding changes for batch taggers, parsers, and classifiers
some implementations have been removed in favour of external packages, or because they could not be maintained adequately

Details: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0
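
A minimal sketch of three of the changes listed above, assuming NLTK 3.x is installed (no corpus downloads should be needed). The toy grammar and the sample sentences are my own illustrations, not taken from the book or the porting guide.

```python
from nltk import CFG, ChartParser
from nltk.tokenize import TreebankWordTokenizer

# Grammars are now built from a string via fromstring(),
# and ContextFreeGrammar has been renamed to CFG.
# (Toy grammar made up for this illustration.)
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> 'I' | 'books'
    VP -> V NP
    V -> 'read'
""")

# parse() hands back an iterator rather than a list,
# so wrap it in list() when all results are needed at once.
trees = ChartParser(grammar).parse(['I', 'read', 'books'])
parsed = list(trees)
for tree in parsed:
    print(tree)

# batch_tokenize() has become tokenize_sents().
tokenizer = TreebankWordTokenizer()
sents = tokenizer.tokenize_sents(['I read books.', 'After all is said and done.'])
print(sents)
```

Code written for NLTK 2 that indexes or takes len() of a parser result directly would need the list() wrapper shown here.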

Chapter 1 has little that is new, apart from an added concordance method.

>>> text5.concordance('lol')
Displaying 25 of 25 matches:
ast PART 24 / m boo . 26 / m and sexy lol U115 boo . JOIN PART he drew a girl w
ope he didnt draw a penis PART ewwwww lol & a head between her legs JOIN JOIN s
a bowl i got a blunt an a bong ...... lol JOIN well , glad it worked out my cha
e " PART Hi U121 in ny . ACTION would lol @ U121 . . . but appearently she does
30 make sure u buy a nice ring for U6 lol U7 Hi U115 . ACTION isnt falling for 
 didnt ya hear !!!! PART JOIN geeshhh lol U6 PART hes deaf ppl here dont get it
es nobody here i wanna misbeahve with lol JOIN so read it . thanks U7 .. Im hap
ies want to chat can i talk to him !! lol U121 !!! forwards too lol JOIN ALL PE
k to him !! lol U121 !!! forwards too lol JOIN ALL PErvs ... redirect to U121 '
 loves ME the most i love myself JOIN lol U44 how do u know that what ? jerkett
ng wrong ... i can see it in his eyes lol U20 = fiance Jerketts lmao wtf yah I 
cooler by the minute what 'd I miss ? lol noo there too much work ! why not ?? 
 that mean I want you ? U6 hello room lol U83 and this .. has been the grammar 
 the rule he 's in PM land now though lol ah ok i wont bug em then someone wann
flight to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 80
ht to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 808265
082653953 K-Fed got his ass kicked .. Lol . ACTION laughs . i got a first class
 . i got a first class ticket to hell lol U7 JOIN any texas girls in here ? any
 . whats up U155 i was only kidding . lol he 's a douchebag . Poor U121 i 'm bo
 ??? sits with U30 Cum to my shower . lol U121 . ACTION U1370 watches his nads 
 ur nad with a stick . ca u U23 ewwww lol *sniffs* ewwwwww PART U115 ! owww spl
ACTION is resisting . ur female right lol U115 beeeeehave Remember the LAst tim
pm's me . charge that is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLO
 is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLOLLL U12 thats not nic
s . lmao no U115 Check my record . :) Lol lick em U7 U23 how old r u lol Way to
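
To get a feel for what concordance is doing, here is a minimal pure-Python sketch of a keyword-in-context view. It is not NLTK's implementation: the function name simple_concordance, its width parameter, and the sample tokens are all hypothetical.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on both sides.

    Hypothetical helper for illustration; matching ignores case.
    """
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [tok] + right))
    return lines


# Made-up sample tokens, not the text5 chat corpus.
sample = ['well', ',', 'glad', 'it', 'worked', 'out', 'lol', 'JOIN',
          'PART', 'LOL', 'it', 'was', 'me']
matches = simple_concordance(sample, 'lol')
for line in matches:
    print(line)
```

The sketch matches regardless of case because the output above also lists LOL and Lol for the query 'lol', which suggests concordance does the same.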

Experimenting shows that dispersion_plot is case-sensitive, a small hint that letter case needs careful attention throughout NLP processing.
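
A dispersion plot marks the positions at which each word occurs, so case sensitivity comes down to how tokens are compared. A small sketch under that assumption; word_offsets and the sample tokens are hypothetical, and this is not NLTK's plotting code.

```python
def word_offsets(tokens, word, ignore_case=False):
    """Return the positions at which `word` occurs in `tokens` (hypothetical helper)."""
    if ignore_case:
        word = word.lower()
        return [i for i, tok in enumerate(tokens) if tok.lower() == word]
    return [i for i, tok in enumerate(tokens) if tok == word]


# Made-up sample tokens for illustration.
tokens = ['America', 'is', 'large', ';', 'america', 'and', 'AMERICA', 'differ']
exact = word_offsets(tokens, 'America')
folded = word_offsets(tokens, 'America', ignore_case=True)
print(exact)   # positions found by an exact, case-sensitive comparison
print(folded)  # positions found after lowercasing both sides
```
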
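As for the generate function, judging from https://github.com/nltk/nltk/issues/736 it still has not been fixed. The most recent reply was, surprisingly, on the 18th, yet most of the other replies offer no real answer either; they amount to "nothing can be done, leave it alone". I tried several different approaches myself without getting a good result, so I am setting it aside for now. The book says it comes up again in Chapter 3, so we will revisit it in Part 3 of these notes.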

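token is translated as 标识符 (never mind how the second character is pronounced). A combination of brackets and punctuation apparently counts as a token too, which is rather interesting.
word type (词类型): when the vocabulary contains punctuation symbols, its entries are generally called simply types (unique items) rather than word types. In other words, only a pure word list consists of word types.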
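
The distinction can be sketched with plain Python sets. The token list is made up for illustration, and using isalpha() to decide what counts as a word is my own simplifying assumption.

```python
# Made-up token list for illustration; ':)' is treated as a single token.
tokens = ['to', 'be', 'or', 'not', 'to', 'be', ',', 'that', 'is',
          'the', 'question', ':)']

types = set(tokens)                             # unique items, punctuation included
word_types = {t for t in types if t.isalpha()}  # purely alphabetic items only

print(len(tokens))                 # number of tokens
print(len(types))                  # number of distinct types
print(sorted(types - word_types))  # the items that are not words
```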

Section 1.3 opens with this saying list without explaining what it is, and the middle of it is just a row of dots...

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
['said', 'than']

"Taken purely at face value"


When calling the hapaxes method, IDLE may freeze for a short while, but it recovers after a moment. There are more than 9,000 words, after all.
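
A hapax is a word that occurs exactly once. A minimal sketch of the idea using collections.Counter from the standard library rather than NLTK's own frequency distribution, applied to the saying list from earlier in these notes:

```python
from collections import Counter

# The saying list from earlier in these notes.
saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

counts = Counter(saying)
# Keep only the words whose count is exactly one.
hapaxes = [word for word, n in counts.items() if n == 1]
print(hapaxes)
```

Here 'is', 'said' and 'done' each occur twice, so they are left out of the result.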

Collocations is translated as 搭配, which seems fine.

Counting only lowercase words is bound to cause problems: country names, place names and so on...
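
A quick sketch of the concern, using made-up tokens. It contrasts keeping only tokens that are already lowercase with lowercasing every word; both variants are my own illustrations rather than the book's code.

```python
# Made-up tokens for illustration.
tokens = ['China', 'and', 'France', 'signed', 'the', 'treaty', 'in', 'Paris', '.']

# Keeping only tokens that are already lowercase drops the proper nouns.
only_lower = [w for w in tokens if w.islower()]

# Lowercasing the alphabetic tokens instead keeps every word, at the cost of
# no longer distinguishing names from ordinary words.
normalized = [w.lower() for w in tokens if w.isalpha()]

print(only_lower)
print(normalized)
```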

The babelize_shell() function is no longer in use. The e-book on the official site gives this explanation:

Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings, which are starkly revealed by translating a sentence back and forth between a pair of languages until equilibrium is reached, e.g.:

0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences become nonsensical (but notice the various phrasings indicated by the commas, and the change from jump to leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the grammatical structure.

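As the earlier discussion concluded, the output of many current translators drifts: a sentence translated into another language and then translated back does not come out the same as the original. This is perhaps another hard problem that NLP faces today.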
