Natural Language Processing With Python (2)

Chapter 3:

This chapter covers the skills needed to process raw text.

Some important points:

1. Accessing text from the web and from disk: APIs such as urlopen(), open(), read() and write(), plus some string operations. There are also tools for processing HTML text.
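A minimal sketch of point 1, using only the standard library. The helper names (read_url, TextExtractor, html_to_text) and the sample HTML are my own, not from the book, and read_url is defined but not called so the example runs offline. The book itself works with NLTK and other HTML tools; html.parser is used here only as a stand-in.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


def read_url(url, encoding="utf-8"):
    """Fetch a URL and decode the response bytes into a str (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Round trip through a file on disk with open(), write() and read().
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.html"
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body><h1>Title</h1><p>Some <b>raw</b> text.</p></body></html>")
    with open(path, encoding="utf-8") as f:
        raw = f.read()

text = html_to_text(raw)
print(text)  # Title Some raw text.
```

For a real page, read_url("http://...") would supply the raw HTML in place of the file read.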

2. Text processing with Unicode: file/terminal (specific encoding) -> in-memory program, including Python processing (Unicode) -> file/terminal (specific encoding). Text is decoded into Unicode on the way in and encoded again on the way out.
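The decode -> process -> encode pipeline can be sketched as follows. The sample string and the choice of Latin-2 as the input encoding and UTF-8 as the output encoding are assumptions for illustration.

```python
# Bytes as they might arrive from a Latin-2 (ISO-8859-2) encoded file.
raw_bytes = "Pruska Biblioteka Państwowa".encode("latin2")

# Decode: bytes in a specific encoding -> str (Unicode) in memory.
text = raw_bytes.decode("latin2")

# All processing happens on the Unicode string.
upper = text.upper()

# Encode: str -> bytes in whatever encoding the output file/terminal expects.
out_bytes = upper.encode("utf-8")

print(len(raw_bytes), len(out_bytes))  # 27 28
```

The byte counts differ because "ń" takes one byte in Latin-2 but two bytes in UTF-8, which is why the encoding has to be stated explicitly at both ends.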

 

3. Regular expressions: re.search(), re.findall(), re.sub() for replacement, re.split() and so on (remember to add the r prefix so that the pattern is written as a raw string).

Another API in NLTK is nltk.regexp_tokenize(), which is similar to re.findall().

Regular expressions are useful for finding word stems and searching tokenized text.
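A short sketch of these regular expression calls with the standard re module. The sample sentence is made up, and the stem() function is a simple suffix-stripping pattern in the spirit of the chapter, not a real stemmer. nltk.regexp_tokenize() is left out so the example needs no extra install.

```python
import re

text = "The cats were running quickly, and the dog walked slowly."

# re.search: is there a word ending in "ing"?
has_ing = re.search(r"ing\b", text) is not None

# re.findall: pull out every run of word characters (a crude tokenizer).
words = re.findall(r"\w+", text)

# re.sub: replace every vowel with nothing.
no_vowels = re.sub(r"[aeiou]", "", "processing")

# re.split: split on runs of spaces, commas and periods.
parts = re.split(r"[ ,.]+", "a, b c.d")


def stem(word):
    """Strip one common suffix with a regular expression (naive sketch)."""
    pattern = r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$"
    match = re.match(pattern, word)
    return match.group(1)


print([stem(w) for w in ["running", "cats", "quickly", "walked"]])
# ['runn', 'cat', 'quick', 'walk']
```

Note that "running" becomes "runn": the pattern only strips the suffix and knows nothing about doubled consonants, which is one reason real stemmers exist.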

 

4. Normalizing text and segmentation: stemmers, lemmatization, sentence segmentation, word segmentation.
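A rough, standard-library-only sketch of point 4. In practice NLTK provides nltk.PorterStemmer, nltk.WordNetLemmatizer and nltk.sent_tokenize() for these tasks. The functions below (normalize, split_sentences, segment) are simplified stand-ins of my own, and the sample strings are made up. Word boundaries are encoded as a string of 0s and 1s, where a 1 at position i means a word ends after character i.

```python
import re


def normalize(tokens):
    """Lowercase every token so that 'The' and 'the' count as the same word."""
    return [t.lower() for t in tokens]


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace.

    This breaks on abbreviations such as 'Mr.'; a trained tokenizer handles those.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


def segment(text, segs):
    """Cut text into words; segs[i] == '1' marks a word boundary after text[i]."""
    words = []
    last = 0
    for i, flag in enumerate(segs):
        if flag == "1":
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words


print(normalize(["The", "Cat", "the"]))
# ['the', 'cat', 'the']
print(split_sentences("Text comes in many forms. Is it tokenized? Not yet!"))
# ['Text comes in many forms.', 'Is it tokenized?', 'Not yet!']
print(segment("doyouseethekitty", "010010010010000"))
# ['do', 'you', 'see', 'the', 'kitty']
```

The boundary string has one character fewer than the text, since there is nothing to mark after the last character.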
