Notes in progress: Natural Language Processing with Python

# Downloading the NLTK book collection
>>> import nltk
>>> nltk.download()
Use nltk.download() to browse the available packages. The downloader's Collections tab shows how the packages are grouped. Select the row labeled book to get all the data needed for the book's examples and exercises. The data comprises about 30 compressed files and requires roughly 100MB of disk space. The complete dataset (all in the downloader) was about five times that size at the time of writing, and is still growing.

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

================================================
# Searching text
## Look up the word monstrous in Moby Dick:
text1.concordance("monstrous")

## Search Sense and Sensibility for the word affection:
text2.concordance("affection")

## Search Genesis to find out how long some people lived:
text3.concordance("lived")

## Search the Chat Corpus for net-speak words such as im, ur, lol:
text5.concordance("im")
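A concordance shows every occurrence of a word together with its surrounding context. The idea can be sketched in plain Python on any token list (a simplified illustration of the concept, not NLTK's actual implementation):

```python
def mini_concordance(tokens, word, window=3):
    """Collect each occurrence of `word` with `window` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append("%s [%s] %s" % (left, tok, right))
    return lines

tokens = "the whale was a monstrous size and a monstrous sight".split()
for line in mini_concordance(tokens, "monstrous"):
    print(line)
# whale was a [monstrous] size and a
# size and a [monstrous] sight
```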

## A concordance lets us see a word in context; similar() goes a step further and finds other words that appear in a similar range of contexts. Words used similarly to monstrous in Moby Dick:
text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate

## Words that appear in contexts similar to monstrous in Sense and Sensibility:
text2.similar("monstrous")
very exceedingly so heartily a great good amazingly as sweet
remarkably extremely vast
### Observe that we get different results for different texts. Austen (the English novelist) uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier, like the word very.


## Examine the contexts shared by two or more words, such as monstrous and very:
text2.common_contexts(["monstrous", "very"])
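common_contexts() finds the frames, i.e. (left word, right word) pairs, in which both target words occur. A minimal pure-Python sketch of the idea (the function name and toy sentence are mine, not NLTK's):

```python
def shared_contexts(tokens, word1, word2):
    """Return the set of (left, right) neighbor pairs in which both words occur."""
    def contexts(word):
        return {
            (tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] == word
        }
    return contexts(word1) & contexts(word2)

tokens = "a monstrous deal and a very deal of a monstrous lot".split()
print(shared_contexts(tokens, "monstrous", "very"))  # {('a', 'deal')}
```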


## A word's location in a text: how many words from the beginning it occurs. This positional information can be displayed with a dispersion plot.
## Some notable patterns of word usage over the past 220 years (in an artificial text formed by joining the inaugural addresses end to end):
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
### This produces a dispersion plot over the U.S. presidential inaugural addresses, which can be used to investigate changes in language use over time.
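The plot itself needs matplotlib, but the data behind it is just each word's positional offsets, measured in tokens from the start of the text. Computing those offsets is a one-liner (a sketch of the underlying computation, not NLTK's code):

```python
def offsets(tokens, word):
    """Positions (token counts from the start of the text) at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

tokens = "freedom and democracy demand freedom".split()
print(offsets(tokens, "freedom"))  # [0, 4]
```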


## Generate random text. This runs slowly because it has to gather statistics about word sequences; each run produces different output.
text3.generate()
'''
Page 28 of Natural Language Processing with Python gives the command text3.generate(), which is meant to produce random text in the style of text3.
Running it under NLTK 3.0.4 and Python 2.7.6 raises the error: 'Text' object has no attribute 'generate'.
A little digging reveals the cause: opening text.py inside the nltk package shows that the newer releases dropped the generate() method (the author commented out text.generate() in the demo). Unpacking the nltk 2.0.1 release and opening its text.py shows that the old version still has the method (the book was written against NLTK 2.0). So if you want this feature, install nltk 2.0.1; the 3.x releases current at the time of writing do not have it. (A generate() method was later restored in NLTK 3.4+, though its behavior differs from the 2.x version.)
'''
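The old generate() built an n-gram language model over the text and then did a random walk through it. The same idea can be sketched with a bigram model in plain Python (an illustration of the technique, not NLTK's implementation; the toy sentence and function names are mine):

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each word to the list of words that follow it somewhere in the text."""
    model = defaultdict(list)
    for w1, w2 in zip(tokens, tokens[1:]):
        model[w1].append(w2)
    return model

def generate_text(model, seed_word, length=10, rng=None):
    """Random walk through the bigram model, starting from seed_word."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    out = [seed_word]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:           # dead end: nothing ever follows this word
            break
        out.append(rng.choice(followers))
    return " ".join(out)

tokens = "in the beginning god created the heaven and the earth".split()
model = build_bigram_model(tokens)
print(generate_text(model, "the", length=5))
```

Words that follow a frequent token more often are proportionally more likely to be chosen, which is why the output loosely imitates the source text's style.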

================================================
# Counting vocabulary
## Get the length of Genesis (in tokens):
len(text3)
44764

## Get the vocabulary: in Python, the command set(text3) gives the vocabulary of text3.
set(text3)
## Sort the vocabulary
sorted(set(text3))
## Get the size of the vocabulary
len(set(text3))
2789
### Although the novel has 44,764 tokens, it contains only 2,789 distinct words, or "word types". A word type is the unique form or spelling of a word in a text; that is, the word is unique in the vocabulary. Since our count of 2,789 items includes punctuation symbols, these are more properly called unique item types rather than word types.

## Measure the lexical richness of the text
len(text3) / len(set(text3))
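One caveat: under Python 2 (the interpreter the NLTK 2.x era of the book targets), `/` between two ints truncates, so this ratio would print 16 rather than about 16.05 unless you first run `from __future__ import division`. Under Python 3, `/` is already true division:

```python
# Python 3: / always performs true (float) division
tokens_count = 44764   # len(text3)
types_count = 2789     # len(set(text3))
print(tokens_count / types_count)    # ~16.05: each word is used about 16 times on average
print(tokens_count // types_count)   # 16: floor division, the old Python 2 int/int behavior
```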


## Count how many times a word occurs in a text, and compute the percentage of the text taken up by a specific word:
text3.count("smote")

100 * text4.count('a') / len(text4)

## Define counting functions
### Lexical diversity (richness): the closer to 1, the more diverse (1 <= lexical_diversity <= len(text))
def lexical_diversity(text):
    return len(text) / len(set(text))
### Percentage of the text taken up by a word
def percentage(count, total):
    return 100 * count / total
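The two helpers in use on a small hand-made token list (the values below assume Python 3's true division):

```python
def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

tokens = ["the", "whale", "the", "sea", "the", "whale"]
print(lexical_diversity(tokens))                      # 2.0: each distinct word appears twice on average
print(percentage(tokens.count("the"), len(tokens)))   # 50.0
```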

Reposted from: https://my.oschina.net/u/614290/blog/742031
