Reading Notes on "Natural Language Processing with Python", Issue 001

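Chinese editions of this book that target Python 2 are easy to find online, but almost nothing exists for the later Python 3 update, so I am reading the English e-book on the official site instead and treating it as practice.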

The official site explicitly lists a few changes in NLTK 3.0, which suggests they are important or commonly used, much as print is for Python.

NLTK also includes some pervasive changes:

  • many types are initialised from strings using a fromstring() method
  • many functions now return iterators instead of lists
  • ContextFreeGrammar is now called CFG and WeightedGrammar is now called PCFG
  • batch_tokenize() is now called tokenize_sents(); there are corresponding changes for batch taggers, parsers, and classifiers
  • some implementations have been removed in favour of external packages, or because they could not be maintained adequately

Details: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0
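
The "iterators instead of lists" item is the one most likely to bite when porting old code. Below is a minimal sketch in plain Python, not NLTK itself: bigrams_as_iterator is a hypothetical stand-in for any function that now yields its results lazily. It shows that such a result can only be walked once unless it is wrapped in list().

```python
def bigrams_as_iterator(words):
    """Hypothetical stand-in for a function that yields results lazily."""
    for left, right in zip(words, words[1:]):
        yield (left, right)

words = ['how', 'long', 'before', 'the', 'next', 'flight']
pairs = bigrams_as_iterator(words)

first_pass = list(pairs)    # walking the iterator consumes it
second_pass = list(pairs)   # nothing is left the second time

print(first_pass[:2])
print(second_pass)
```

If old code indexes into the result or loops over it twice, wrapping the call in list() once and reusing that list is usually the simplest fix.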

Chapter 1 has little that is new, apart from the concordance method:

>>> text5.concordance('lol')
Displaying 25 of 25 matches:
ast PART 24 / m boo . 26 / m and sexy lol U115 boo . JOIN PART he drew a girl w
ope he didnt draw a penis PART ewwwww lol & a head between her legs JOIN JOIN s
a bowl i got a blunt an a bong ...... lol JOIN well , glad it worked out my cha
e " PART Hi U121 in ny . ACTION would lol @ U121 . . . but appearently she does
30 make sure u buy a nice ring for U6 lol U7 Hi U115 . ACTION isnt falling for 
 didnt ya hear !!!! PART JOIN geeshhh lol U6 PART hes deaf ppl here dont get it
es nobody here i wanna misbeahve with lol JOIN so read it . thanks U7 .. Im hap
ies want to chat can i talk to him !! lol U121 !!! forwards too lol JOIN ALL PE
k to him !! lol U121 !!! forwards too lol JOIN ALL PErvs ... redirect to U121 '
 loves ME the most i love myself JOIN lol U44 how do u know that what ? jerkett
ng wrong ... i can see it in his eyes lol U20 = fiance Jerketts lmao wtf yah I 
cooler by the minute what 'd I miss ? lol noo there too much work ! why not ?? 
 that mean I want you ? U6 hello room lol U83 and this .. has been the grammar 
 the rule he 's in PM land now though lol ah ok i wont bug em then someone wann
flight to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 80
ht to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 808265
082653953 K-Fed got his ass kicked .. Lol . ACTION laughs . i got a first class
 . i got a first class ticket to hell lol U7 JOIN any texas girls in here ? any
 . whats up U155 i was only kidding . lol he 's a douchebag . Poor U121 i 'm bo
 ??? sits with U30 Cum to my shower . lol U121 . ACTION U1370 watches his nads 
 ur nad with a stick . ca u U23 ewwww lol *sniffs* ewwwwww PART U115 ! owww spl
ACTION is resisting . ur female right lol U115 beeeeehave Remember the LAst tim
pm's me . charge that is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLO
 is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLOLLL U12 thats not nic
s . lmao no U115 Check my record . :) Lol lick em U7 U23 how old r u lol Way to
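
To understand what the output above is doing, here is a rough sketch of a concordance in plain Python. It is my own simplification, not NLTK's implementation: concordance_lines and its width parameter are made-up names, and the token list is a short made-up sample. Like the output above, it matches regardless of case, so LOL and Lol would count as hits for 'lol'.

```python
def concordance_lines(tokens, word, width=40):
    """Return one context line per match of `word`, ignoring case."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        # Keep only the characters closest to the match on each side.
        left = ' '.join(tokens[:i])[-width:]
        right = ' '.join(tokens[i + 1:])[:width]
        lines.append(f"{left:>{width}} {tok} {right}")
    return lines

# Made-up sample tokens, for illustration only.
tokens = "well , glad it worked out lol JOIN so read it . LOL lol U7 it was me".split()
lines = concordance_lines(tokens, 'lol', width=20)
for line in lines:
    print(line)
```

Right-aligning the left context is what makes the matched word line up in a single column.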

Experimenting shows that dispersion_plot is case-sensitive, which hints that letter case needs careful attention throughout NLP processing.
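
A dispersion plot essentially marks the positions at which a word occurs, so the effect of case can be sketched without plotting anything. The offsets helper and the token list below are my own illustrative assumptions, not part of NLTK.

```python
def offsets(tokens, word, ignore_case=False):
    """Positions at which `word` occurs: the x-values a dispersion plot would mark."""
    if ignore_case:
        return [i for i, tok in enumerate(tokens) if tok.lower() == word.lower()]
    return [i for i, tok in enumerate(tokens) if tok == word]

# Made-up sample tokens, for illustration only.
tokens = ['LOL', 'lol', 'U7', 'it', 'was', 'me', 'Lol', '.']

exact = offsets(tokens, 'lol')                       # exact spelling only
folded = offsets(tokens, 'lol', ignore_case=True)    # all three spellings

print(exact)
print(folded)
```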
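As for the generate function, https://github.com/nltk/nltk/issues/736 indicates that it is still not fixed. The most recent reply there is dated the 18th, and the other replies offer no real answer either; they amount to "nothing can be done, leave it alone". I tried a few different approaches myself without good results, so I am setting it aside for now. The book says generate comes up again in Chapter 3, so I will return to it in the third issue.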

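The Chinese edition translates token as 标识符 (never mind how the second character is pronounced). A combination of brackets and punctuation marks apparently counts as a token too, which is interesting.
word type: items that include punctuation are generally not called word types but item types. In other words, only a vocabulary made up of pure words consists of word types.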
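
The difference between tokens, item types and word types is easier to see with counts. This is a small sketch in plain Python; the sample list is my own (a saying with two punctuation tokens added), and using str.isalpha() to decide what counts as a "pure word" is an assumption made for illustration.

```python
# Illustrative sample: a saying plus two punctuation tokens.
sample = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

item_types = set(sample)                          # distinct items, punctuation included
word_types = {t for t in sample if t.isalpha()}   # distinct purely alphabetic items

print(len(sample), len(item_types), len(word_types))
```

Here the 13 tokens collapse to 10 item types, of which 8 are word types.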

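Section 1.3 opens with this saying list, and at first I did not know what it was because of the row of dots in the middle (the ... in the book's listing appears to be the interpreter's continuation prompt rather than omitted items):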

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done','more', 'is', 'said', 'than', 'done']
>>> tokens=set(saying)
>>> tokens=sorted(tokens)
>>> tokens[-2:]
['said', 'than']

"Taken at face value"


When using the hapaxes method, IDLE may freeze for a short while, but it recovers after a moment; after all, there are more than 9,000 words.
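
A hapax is simply a word that occurs exactly once, so the idea can be sketched with collections.Counter. This is my own minimal version rather than NLTK's code, and it reuses the saying list from section 1.3 as input.

```python
from collections import Counter

def hapaxes(tokens):
    """Return the tokens that occur exactly once."""
    counts = Counter(tokens)
    return [word for word, n in counts.items() if n == 1]

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

once_only = sorted(hapaxes(saying))
print(once_only)
```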

Collocations is translated as 搭配 ("pairing"), which seems fine.
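
As a rough intuition for collocations, here is a sketch that just counts adjacent word pairs and keeps the ones that repeat. This is a deliberate simplification and an assumption on my part: real collocation finders score pairs more carefully than raw frequency. The frequent_bigrams helper and the sentence are made up for illustration.

```python
from collections import Counter

def frequent_bigrams(tokens, min_count=2):
    """Return adjacent word pairs seen at least `min_count` times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

# Made-up sample text, for illustration only.
tokens = ("the red wine was good but the red wine was warm "
          "and the wine list was short").split()

repeated = frequent_bigrams(tokens)
print(repeated)
```

Raw counts also surface pairs such as ('the', 'red'), which is one reason a real scorer weighs how rare the individual words are.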

Counting only lowercased words is bound to cause problems, for example with country names and place names…
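
A tiny sketch of the problem, using a made-up token list: once everything is lowercased, a country name and an ordinary word with the same spelling become a single vocabulary entry.

```python
# Made-up sample tokens, for illustration only.
tokens = ['Turkey', 'exports', 'turkey', 'to', 'the', 'US', 'and', 'to', 'us']

case_sensitive = set(tokens)
lowercased = {t.lower() for t in tokens}

# Lowercased forms that more than one distinct spelling collapsed into.
merged = sorted(
    low for low in lowercased
    if sum(1 for t in case_sensitive if t.lower() == low) > 1
)

print(len(case_sensitive), len(lowercased))
print(merged)
```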

The babelize_shell() function is no longer available; the official e-book explains:

Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings, which are starkly revealed by translating a sentence back and forth between a pair of languages until equilibrium is reached, e.g.:

0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences become nonsensical (but notice the various phrasings indicated by the commas, and the change from jump to leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the grammatical structure.

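As the earlier discussion concluded, the output of many current translators tends to drift: a sentence translated into another language and then back again does not match the original. This is perhaps another open problem that NLP faces today.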
