Chapter 3 - Processing Raw Text (Natural Language Processing with Python, 2nd edition)

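Questions addressed

  1. How can we write programs to access text from local files and from the Web, so that we have an unlimited range of language material to work with?
  2. How can we split documents into individual words and punctuation symbols, so that we can carry out the same kinds of text corpus analysis as in the earlier chapters?
  3. How can we write programs to produce formatted output and save it in a file?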

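Accessing Text from the Web and from Disk

1. Electronic Books

1) Getting raw text and handling its type

1. Read a txt file from Gutenberg (the download was too large to complete here, so a local copy is read instead; the result is a string)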

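from urllib.request import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
# urlopen().read() returns bytes; decode it to get a str
raw = urlopen(url).read().decode("utf8")

# Alternative: read a local copy (assumes the file is UTF-8 encoded)
raw = open("./text.txt", encoding="utf8").read()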
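The two ways of reading differ in what they return: `urlopen(...).read()` gives bytes, while `open(...)` in text mode gives a str. The following is a minimal sketch of a helper that hides that difference. `load_text` is a hypothetical name (it is not part of NLTK), UTF-8 is an assumed encoding, and the demo uses a small made-up temporary file so that it runs without network access.

```python
import os
import tempfile
from urllib.request import urlopen


def load_text(source, encoding="utf-8"):
    """Return the contents of a URL or a local file as a str.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    if source.startswith(("http://", "https://")):
        # urlopen().read() returns bytes, so decode explicitly.
        with urlopen(source) as response:
            return response.read().decode(encoding)
    # Text-mode open() decodes while reading and returns a str.
    with open(source, encoding=encoding) as handle:
        return handle.read()


# Demo on a temporary local file (sample content, not the real book).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("PART I\nSample text standing in for the book.")
    raw = load_text(path)

print(type(raw).__name__, len(raw))
```

Either branch hands back a str, so the tokenization step that follows does not need to know where the text came from.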

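2. Tokenize the text that was read (split the string into words and punctuation); the result is a list

import nltk
# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first
tokens = nltk.word_tokenize(raw)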
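To make concrete what tokenization produces, here is a rough sketch using only a regular expression. It is a simplified stand-in, not how `nltk.word_tokenize` works internally, and `simple_tokenize` is a hypothetical name; NLTK's tokenizer treats contractions, abbreviations and similar cases more carefully.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Simplified illustration only; nltk.word_tokenize is more careful.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']
```

The whitespace disappears and each punctuation mark becomes its own list element, which is the shape of result the later steps expect.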

3. Wrap the token list in an nltk.Text object

text = nltk.Text(tokens)

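2) Some operations on the text

1. Print pairs of words that frequently occur together (collocations)

text.collocations()

2. Find the index at which a string first occurs (-1 if it is absent)

raw.find("PART I")

3. Find the index at which a string last occurs (-1 if it is absent)

raw.rfind("PART I")

4. With find() and rfind() you can slice out the part of the text that contains a given string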
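A small sketch of that slicing, on a made-up string rather than the real book. The marker strings and the sample text are assumptions chosen for illustration.

```python
# Sample text standing in for a downloaded book with header and footer.
raw = (
    "Distributor header\n"
    "PART I\nFirst chapter text.\n"
    "PART II\nSecond chapter text.\n"
    "End of the book\n"
    "License footer\n"
)

start = raw.find("PART I")          # index of the first occurrence
end = raw.rfind("End of the book")  # index of the last occurrence

# Both methods return -1 when the marker is missing, so check first.
if start == -1 or end == -1:
    raise ValueError("marker not found")

# Keep only the text between the two markers.
body = raw[start:end]
print(body)
```

Checking for -1 matters because a slice with a negative index would silently cut the text in the wrong place instead of failing.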

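2. Dealing with HTML

1) Getting HTML and handling its type

1. Read the HTML file (same as for plain text)

html = open("./html.html").read()

2. Convert the HTML to raw text (on Python 3 this needs the bs4 package, plus html5lib for the parser used here)

from bs4 import BeautifulSoup
raw = BeautifulSoup(html, "html5lib").get_text()

3. From here on, process it in the same way as raw text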
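When bs4 is not installed, the standard library's `html.parser` can do a rougher version of the same markup stripping. The sketch below is an alternative, not what `get_text()` does internally; `TextExtractor` and `html_to_text` are hypothetical names and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse the whitespace left by removed tags.
    return " ".join(" ".join(parser.parts).split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
)
text = html_to_text(sample)
print(text)  # Title Some bold text.
```

The result is a plain string, so it can be passed to the tokenizer exactly like text read from a txt file.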

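3. Processing search engine results

Advantages of using search engines as a corpus

1) Size 2) Ease of use

Disadvantages of using search engines as a corpus
1) The allowable search patterns are severely restricted: search engines generally only allow searches for individual words or strings of words, sometimes with wildcards.
2) Search engines give inconsistent results: the same query can return different figures at different times or from different locations.
3) Finally, the markup in the results returned by a search engine may change unpredictably, which breaks any pattern-based method of locating particular content (using a search engine API mitigates this problem).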

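4. Processing RSS Feeds

The blogosphere is an important source of text. With the Universal Feed Parser library we can access the content of a blog feed.

1) Getting a blog and handling its type

Requires the feedparser package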

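import feedparser
import nltk
from bs4 import BeautifulSoup
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
# The entry content is HTML: strip the markup, then tokenize
tokens = nltk.word_tokenize(BeautifulSoup(llog.entries[2].content[0].value, "html5lib").get_text())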

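2) Some operations on the blog

llog['feed']['title']            # title of the feed
post = llog.entries[2]           # the third post in the feed
post.title                       # title of that post
content = post.content[0].value  # HTML body of that post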
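The fields accessed above correspond to elements of the Atom XML that the feed serves. As a rough illustration of that structure, the sketch below reads a made-up Atom document with the standard library instead of feedparser. The feed content is invented for the example, and this does not show how feedparser is implemented.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up feed; a real one would be downloaded from the blog.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Log</title>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;Hello &lt;b&gt;feeds&lt;/b&gt;&lt;/p&gt;</content>
  </entry>
</feed>"""

root = ET.fromstring(sample_feed)

feed_title = root.findtext(ATOM + "title")   # like llog['feed']['title']
entries = root.findall(ATOM + "entry")       # like llog.entries
post_title = entries[0].findtext(ATOM + "title")
content = entries[0].findtext(ATOM + "content")  # HTML, still with tags

print(feed_title, "|", post_title, "|", content)
```

As with feedparser, the entry content comes back as HTML, which is why it still has to go through the markup-stripping step before tokenization.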

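5. Reading local files

1) Accessing a local file

import os
import nltk
os.listdir('.')  # list the files in the current directory
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()  # the 'rU' mode was removed in Python 3.11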
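Combining `os.listdir` with `open` makes it possible to load a whole folder of texts. A minimal sketch follows; `read_text_files` is a hypothetical helper, UTF-8 is an assumed encoding, and the demo files are created in a temporary directory.

```python
import os
import tempfile


def read_text_files(directory, encoding="utf-8"):
    """Return {filename: contents} for every .txt file in directory."""
    texts = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            file_path = os.path.join(directory, name)
            # A with-block closes the file even if reading fails.
            with open(file_path, encoding=encoding) as handle:
                texts[name] = handle.read()
    return texts


# Demo: two text files and one file that should be ignored.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("a.txt", "first file"), ("b.txt", "second file"),
               ("notes.md", "skipped")]
    for name, body in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    texts = read_text_files(tmp)

print(texts)
```

Passing the encoding explicitly avoids depending on the platform default, which differs between systems.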

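6. Extracting text from PDF, MSWord and other binary formats

PDF and MSWord files can only be opened with specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats, and extracting text from multi-column documents is particularly challenging. For a one-off conversion of a few documents, it is simpler to open the document in a suitable application, save it as text on the local drive, and read it from there. If the document is already on the Web, you can enter its URL in Google's search box. The search results often include a link to an HTML version of the document, which you can save as text.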

7. Capturing user input

import nltk
s = input("Enter some text: ")
print("You typed", len(nltk.word_tokenize(s)), "words.")

8. The NLP Pipeline
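Chaining the steps from the sections above gives a small pipeline: bytes are decoded to a string, markup is removed, the string is tokenized, and the tokens are lowercased to build a vocabulary. The sketch below uses only the standard library, with regular expressions as crude stand-ins for BeautifulSoup and `nltk.word_tokenize`. `pipeline` is a hypothetical name and the input is a made-up snippet.

```python
import re


def pipeline(html_bytes, encoding="utf-8"):
    """bytes -> str -> text without markup -> tokens -> vocabulary.

    Illustrative only: the regexes are rough substitutes for
    BeautifulSoup(...).get_text() and nltk.word_tokenize().
    """
    html = html_bytes.decode(encoding)        # bytes to str
    raw = re.sub(r"<[^>]+>", " ", html)       # crude tag removal
    tokens = re.findall(r"\w+|[^\w\s]", raw)  # words and punctuation
    words = [t.lower() for t in tokens if t.isalpha()]
    vocab = sorted(set(words))                # unique lowercased words
    return tokens, vocab


tokens, vocab = pipeline(b"<p>The cat saw the <b>other</b> cat.</p>")
print(tokens)  # ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']
print(vocab)   # ['cat', 'other', 'saw', 'the']
```

With NLTK installed, the token list could additionally be wrapped in `nltk.Text(tokens)` to get operations such as `collocations()`.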
