[Reading Notes] Python Natural Language Processing by Jalaj Thanaki

Corpus analysis

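NLTK includes four types of corpora:

  1. Isolate corpus: a collection of text or natural language, for example gutenberg and webtext.
  2. Categorized corpus: the texts are grouped into different categories, for example brown, which contains categories such as news, hobbies, and humor.
  3. Overlapping corpus: the categories overlap with one another, for example reuters. If you treat coconut as a category, you will also find coconut-oil as a subcategory of it, and there is a cotton-oil category as well, so these categories overlap.
  4. Temporal corpus: a collection that records how natural language was used over a period of time, for example inaugural, which contains U.S. presidential inaugural addresses from different years.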

Exercise code on GitHub:

https://github.com/jalajthanaki/NLPython/tree/master/ch2
https://nbviewer.jupyter.org/github/jalajthanaki/NLPython/blob/master/ch2/2_1_Basic_corpus_analysis.html

These exercises are a good way to get familiar with the NLTK API; they use the brown and gutenberg corpora.
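As a minimal sketch of this kind of exploration (not the code from the book's notebook), the helper below summarizes any corpus object that exposes NLTK's fileids() and words() reader methods. The names corpus_overview and demo are hypothetical. demo assumes nltk is installed and downloads the brown and gutenberg data on first use.

```python
from collections import Counter


def corpus_overview(corpus, top_n=5):
    """Summarize an object with NLTK-style fileids() and words() methods."""
    # Keep alphabetic tokens only and lowercase them before counting.
    words = [w.lower() for w in corpus.words() if w.isalpha()]
    return {
        "files": len(corpus.fileids()),
        "tokens": len(words),
        "vocabulary": len(set(words)),
        "most_common": Counter(words).most_common(top_n),
    }


def demo():
    """Run the overview on brown and gutenberg (requires nltk and network access)."""
    import nltk
    from nltk.corpus import brown, gutenberg

    nltk.download("brown", quiet=True)
    nltk.download("gutenberg", quiet=True)

    # brown is a categorized corpus, so it also exposes categories().
    print(brown.categories())
    for name, corpus in [("brown", brown), ("gutenberg", gutenberg)]:
        print(name, corpus_overview(corpus))
```

Calling demo() prints the brown categories and a small summary for each corpus. Lowercasing and dropping non-alphabetic tokens is a design choice made here for readability, not something the notes prescribe.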

A few websites for finding datasets:

https://github.com/caesar0301/awesome-public-datasets
https://www.kaggle.com/datasets
https://www.reddit.com/r/datasets/

Chapter 3 Understanding the Structure of a Sentences

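The author recommends several libraries for the different branches of linguistics:

  1. POS (part-of-speech) tagging: nltk, pycorenlp
  2. Morphological analysis (presumably this refers to word formation): nltk, polyglot
  3. Generating parse trees: nltk, spaCy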

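Context-free grammar (CFG). A CFG consists of four components:

  1. A set of non-terminal symbols N
  2. A set of terminal symbols T
  3. A start symbol S, which is a non-terminal
  4. A set of production rules P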
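The four components can be written out directly in plain Python. The toy grammar below is made up for illustration and does not come from the book. Because it has no recursive rules, the sketch can simply enumerate every sentence the grammar generates; a real grammar would need a proper parser instead.

```python
from itertools import product

# A toy grammar G = (N, T, S, P); every symbol here is illustrative only.
N = {"S", "NP", "VP", "Det", "Noun", "Verb"}           # non-terminals
T = {"the", "a", "dog", "cat", "sees", "likes"}        # terminals
S = "S"                                                # start symbol
P = {                                                  # production rules
    "S": [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"]],
    "Det": [["the"], ["a"]],
    "Noun": [["dog"], ["cat"]],
    "Verb": [["sees"], ["likes"]],
}


def expand(symbol):
    """Yield every terminal string (a tuple of words) derivable from symbol.

    This terminates only because the toy grammar has no recursive rules.
    """
    if symbol in T:
        yield (symbol,)
        return
    for rhs in P[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        for parts in product(*(expand(s) for s in rhs)):
            yield tuple(word for part in parts for word in part)


language = set(expand(S))
print(("the", "dog", "sees", "a", "cat") in language)
print(("dog", "the", "sees") in language)
```

The first check succeeds because the sentence can be derived from S, and the second fails because no sequence of rules produces that word order.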

Morphological analysis

Examples are available on GitHub:

https://github.com/jalajthanaki/NLPython/blob/master/ch3/

Lexical analysis (part-of-speech analysis)

Examples are on the same page:

https://github.com/jalajthanaki/NLPython/blob/master/ch3/

Syntactic analysis (parse trees)
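To make the idea of a parse tree concrete, here is a small sketch that stores a hand-written tree as nested tuples and prints it in the bracketed notation commonly used for syntax trees. The sentence, the labels, and the helper names leaves and bracketed are assumptions for illustration. NLTK ships its own Tree class, which this plain-Python sketch deliberately does not use.

```python
# A hand-written parse tree for "the dog sees a cat".
# Each node is a (label, child, ...) tuple; leaves are plain strings.
tree = (
    "S",
    ("NP", ("Det", "the"), ("Noun", "dog")),
    ("VP", ("Verb", "sees"), ("NP", ("Det", "a"), ("Noun", "cat"))),
)


def leaves(node):
    """Return the words at the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def bracketed(node):
    """Render the tree as a bracketed string, e.g. (NP (Det the) (Noun dog))."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [bracketed(c) for c in node[1:]]) + ")"


print(" ".join(leaves(tree)))
print(bracketed(tree))
```

Reading the leaves from left to right recovers the original sentence, while the bracketed form shows which words group into phrases.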

Semantic analysis(语义分析)

Python Natural Language Processing by Jalaj Thanaki
English | 31 July 2017 | ISBN: 1787121429 | ASIN: B072B8YWCJ | 486 Pages | AZW3 | 11.02 MB

Key Features

  1. Implement Machine Learning and Deep Learning techniques for efficient natural language processing
  2. Get started with NLTK and implement NLP in your applications with ease
  3. Understand and interpret human languages with the power of text analysis via Python

Book Description

This book starts off by laying the foundation for Natural Language Processing and why Python is one of the best options to build an NLP-based expert system with advantages such as Community support, availability of frameworks and so on. Later it gives you a better understanding of available free forms of corpus and different types of dataset. After this, you will know how to choose a dataset for natural language processing applications and find the right NLP techniques to process sentences in datasets and understand their structure. You will also learn how to tokenize different parts of sentences and ways to analyze them.

During the course of the book, you will explore the semantic as well as syntactic analysis of text. You will understand how to solve various ambiguities in processing human language and will come across various scenarios while performing text analysis.

You will learn the very basics of getting the environment ready for natural language processing, move on to the initial setup, and then quickly understand sentences and language parts. You will learn the power of Machine Learning and Deep Learning to extract information from text data.

By the end of the book, you will have a clear understanding of natural language processing and will have worked on multiple examples that implement NLP in the real world.

What you will learn

  1. Focus on Python programming paradigms, which are used to develop NLP applications
  2. Understand corpus analysis and different types of data attribute.
  3. Learn NLP using Python libraries such as NLTK, Polyglot,