Reading Notes on "Natural Language Processing with Python", Part 002

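Chapter 2 opens by walking through the corpora bundled with nltk, but personally I don't think that is the point of this chapter. The real point comes later: how to build your own corpus. After all, for ordinary training work you will almost certainly be using the data you have on hand.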

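There is not much here that needs special care, except for encoding: it is best to convert everything to UTF-8 up front, so that your files match the default encoding of PlaintextCorpusReader.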

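# Constructor signature of PlaintextCorpusReader as quoted in these notes
# (note the encoding default; defaults may differ across NLTK versions)
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):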
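
If your own files are not UTF-8 yet, one way to unify them before building the corpus might look like the sketch below. It uses only the standard library; the helper name convert_to_utf8 and the assumption that every file in the folder shares one known source encoding (for example GBK) are mine, not something from the book.

```python
from pathlib import Path


def convert_to_utf8(src_dir, dst_dir, source_encoding):
    """Re-encode every .txt file in src_dir as UTF-8 under dst_dir.

    Hypothetical helper: assumes all files share one known source encoding.
    Returns the names of the files that were written.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.txt")):
        # Decode with the original encoding, then write back as UTF-8.
        text = src.read_text(encoding=source_encoding)
        dst = dst_dir / src.name
        dst.write_text(text, encoding="utf-8")
        written.append(dst.name)
    return written
```

The output directory can then be passed as root to PlaintextCorpusReader, with a fileids pattern such as r'.*\.txt' and the encoding argument left at its 'utf8' default.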

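Interactive programming is really useful for testing code or writing short snippets like this one; getting comfortable with IDLE is enough. As for how, there are plenty of tips and snippets online, so find them yourself.
Why? Because an IDLE session does not seem to need a run from the top the way a script launched from an ordinary editor does. Whatever you have already loaded, such as a corpus, stays in memory, which saves a lot of loading time.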

                         0    1    2    3    4    5    6    7    8    9 
            Chickasaw    0  411  510  551  619  710  799  876  946  995 
              English    0  185  525  883  997 1166 1283 1440 1558 1638 
       German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275 
Greenlandic_Inuktikut    0  139  150  151  154  175  182  241  259  283 
     Hungarian_Magyar    0  302  431  503  655  767  881  972 1081 1171 
          Ibibio_Efik    0  228  440  915 1418 1705 1867 1974 2049 2074

           Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday
   news        54        43        22        20        41        33        51 
romance         2         3         3         1         3         4         5 
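
Both tables above look like tabulated conditional frequency counts; the first one appears to be cumulative (each row only grows from left to right). As a rough sketch of what such a table computes, here is a standard-library version on made-up toy data. The helper names conditional_counts and cumulative_row are hypothetical, and this is not how NLTK implements it.

```python
from collections import Counter, defaultdict
from itertools import accumulate


def conditional_counts(pairs):
    """Map each condition to a Counter of the samples seen under it."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table


def cumulative_row(counter, samples):
    """Running totals of the counts, in the order given by samples."""
    return list(accumulate(counter[s] for s in samples))


# Toy data, invented for illustration: words grouped by a fake language label.
toy_words = {
    "lang_a": ["a", "to", "the", "of", "word", "a"],
    "lang_b": ["ab", "abc", "abcd", "abcd", "x"],
}
# Each pair is (condition, sample) = (language, word length).
pairs = [(lang, len(w)) for lang, ws in toy_words.items() for w in ws]
table = conditional_counts(pairs)

lengths = range(5)
print(" " * 8 + "".join(f"{n:>4}" for n in lengths))
for lang in sorted(table):
    row = cumulative_row(table[lang], lengths)
    print(f"{lang:>8}" + "".join(f"{v:>4}" for v in row))
```

Dropping the accumulate step gives plain per-sample counts, which is the shape of the second table (day names as samples, genres as conditions).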

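Zipf's Law is rendered in Chinese as 齐夫定律. Quoting the theory section of the Baidu Baike entry:
This "law" was published in 1949 by George Kingsley Zipf, a linguist at Harvard University. For example, in the Brown corpus, "the" is the most common word, accounting for about 7% of the corpus (69,971 occurrences out of 1 million words). Just as Zipf's law describes, the second most frequent word, "of", accounts for about 3.5% of the corpus (36,411 occurrences), followed by "and" (28,852 occurrences). Just 135 vocabulary items account for half of the Brown corpus.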

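Zipf's law is an empirical law, not a theoretical one. Zipfian distributions can be observed in many phenomena, and what causes them in the real world is still a matter of debate. Zipf's law is easy to check with a scatter plot whose axes are log(rank) and log(frequency). For example, "the" as described above becomes the point x = log(1), y = log(69971). If all the points lie close to a straight line, the data follows Zipf's law. The simplest example of Zipf's law is the "1/f function": given a set of Zipf-distributed frequencies sorted from most to least common, the second most common frequency is 1/2 of the most common one, the third is 1/3 of it, and the n-th is 1/n of it. This cannot hold exactly, because every item must occur an integer number of times; a word cannot occur 2.5 times. Still, over a wide range and with suitable approximation, many natural phenomena follow Zipf's law.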
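
To make the log-log check concrete, here is a small standard-library sketch. It compares the three Brown counts quoted above with the 1/n prediction (roughly 35,000 for rank 2 and 23,300 for rank 3) and fits a least-squares line through log(rank) and log(frequency). The function names rank_frequencies and loglog_slope are my own; for an ideal 1/n distribution the fitted slope would be -1.

```python
import math
from collections import Counter


def rank_frequencies(tokens):
    """Frequencies sorted from most to least common."""
    return [count for _, count in Counter(tokens).most_common()]


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    freqs must be sorted from most to least common and hold at least two values.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Brown corpus counts quoted above for "the", "of", "and".
brown_top3 = [69971, 36411, 28852]
predicted = [brown_top3[0] / n for n in (1, 2, 3)]
for n, (seen, want) in enumerate(zip(brown_top3, predicted), start=1):
    print(f"rank {n}: observed {seen}, 1/n prediction {want:.0f}")
print(f"slope over these three points: {loglog_slope(brown_top3):.2f}")
```

For a real check you would feed rank_frequencies the full token list of a corpus and look at the slope over many ranks, not just three points.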

Inverse proportionality? Kind of interesting; it feels like a latent fact being rediscovered.
