Some errors in Section 6.2 of Natural Language Processing with Python

I have been reading Natural Language Processing with Python recently, and while working through the Sentence Segmentation part of Section 6.2 I found quite a few errors. I am recording them here in case they help someone else, and I may tidy them up later and send them to the book's authors.

Before getting to the problems, here are the software versions I am using:

Python 2.5

NLTK 2.0b2

Overview of this section

Sentence segmentation splits a piece of text into a list of sentences. Because sentences usually end with distinctive punctuation, sentence segmentation can be treated as a classification problem over punctuation marks: whenever we meet a punctuation mark, we decide whether it marks the end of a sentence.

The method this section uses for sentence segmentation is supervised learning with a naive Bayes classifier. As with most supervised learning methods, the basic steps are:

1. Data preprocessing: put the data into a format that makes the next step, feature extraction, convenient.

2. Feature extraction: extract features with good discriminative power.

3. Prepare the training and test data.

4. Training: train a classifier with the naive Bayes algorithm.

5. Testing: evaluate how well the trained classifier performs.

So how do we use this classifier to segment a piece of text into sentences? The approach taken in this section is to examine each punctuation mark, decide whether it is a sentence boundary, and, if it is, split the text at that mark.
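Since the problems below refer to specific pieces of this code, here is a rough sketch of the pipeline the book builds in this section (reconstructed from the book and my notes, so variable names and small details may differ from the printed text):

import nltk

# 1. Data preprocessing: merge the Treebank sentences into one token list
#    and remember the offsets at which sentences end.
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)

# 2. Feature extraction for the punctuation token at position i.
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

# 3. One labelled example per punctuation token: is it a sentence boundary?
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens) - 1)
               if tokens[i] in '.?!']

# 4./5. Train a naive Bayes classifier on 90% of the data, test on the rest.
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)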

 

Errors and how to fix them

Below I point out the errors in the code of this section and give a fix for each.

Problem 1: the feature extraction function can index out of bounds.

The feature extraction function contains this line:

'next-word-capitalized': tokens[i+1][0].isupper()

This can go out of bounds: if i is the index of the last token in the tokens sequence, the line above raises an index error. And this happens all the time, because the last token of a piece of (English) text is usually one of '.', '?' or '!', so this line is guaranteed to be executed for it.

Fix: since a '.', '?' or '!' at the end of a passage normally marks the end of a sentence, we can catch the exception and set 'next-word-capitalized' to True.
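A minimal sketch of this fix; the feature dictionary keeps the structure used in the book, and only the try/except handling of the last token is my addition:

def punct_features(tokens, i):
    # When i is the last index, tokens[i+1] raises an IndexError.  A '.',
    # '?' or '!' at the very end of the text almost always ends a sentence,
    # so we treat the missing next word as if it were capitalized.
    try:
        next_word_capitalized = tokens[i+1][0].isupper()
    except IndexError:
        next_word_capitalized = True
    return {'next-word-capitalized': next_word_capitalized,
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}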

Problem 2 and its fix: a small misprint; the book forgets to apply feature extraction to the data being classified.

In line 4 of the segment_sentence function, "classifier.classify(words, i) == True" should be changed to "classifier.classify(punct_features(words, i))".

Problem 3: after segment_sentence finishes running, how do I get the segmentation result? The sents variable that holds the result is local to the function, so once it returns we cannot reach the result from outside, and the function does not print it either.

Fix: add "return sents" at the end of the function so that it returns the segmentation result; a corrected version of the whole function is sketched below.
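Putting the fixes for Problem 2 and Problem 3 together, the corrected function would look roughly like this (the body otherwise follows the version quoted in this post; classifier is the naive Bayes classifier trained earlier):

def segment_sentence(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # Problem 2: pass a feature dictionary to classify(), not (words, i).
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])
    # Problem 3: return the result instead of leaving it in a local variable.
    return sents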

Problem 4: when I call "classifier.show_most_informative_features()", I get the following error:

File "E:/编程工具/py2.5/lib/site-packages/nltk/classify/naivebayes.py", line 144, in show_most_informative_features
TypeError: 'bool' object is unsubscriptable

Tracking down this error, I opened naivebayes.py; its line 144 is:

print ('%24s = %-14r %6s : %-6s = %s : 1.0' %
       (fname, fval, l1[:6], l0[:6], ratio))

Fix: our classification task has only two labels, True and False, which this code refers to as l1 and l0; line 144 applies a slicing operation to them, which causes the error above. I suspect the NLTK authors assumed all labels would be strings and sliced them here only to keep the output columns short. So my fix is to convert l1 and l0 to strings before slicing, i.e. to change the code above to:

print ('%24s = %-14r %6s : %-6s = %s : 1.0' %
       (fname, fval, str(l1)[:6], str(l0)[:6], ratio))

With these problems solved, I can move on to the later chapters with peace of mind.
