Python and Natural Language Processing - Reading Notes 3

The goal of this chapter is to answer the following questions:
1. Accessing local and web text: How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
2. Tokenization: How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. Formatting output (and removing noise): How can we write programs to produce formatted output and save it in a file?

Modules needed for this chapter

>>> from __future__ import division  # note: two underscores before and after 'future'; only needed on Python 2, harmless on Python 3
>>> import nltk, re, pprint
>>> from nltk import word_tokenize

Texts from Project Gutenberg

Browse the catalog at http://www.gutenberg.org/catalog
Fetching the text

>>> from urllib import request
>>> url = "http://www.gutenberg.org/files/2554/2554-0.txt"
>>> response = request.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> type(raw)
<class 'str'>
>>> len(raw)
1176893  # this counts every character, including whitespace and line breaks, which is why we still need to tokenize
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

This is the raw content of the book, including many details we are not interested in, such as whitespace, line breaks, and blank lines.

>>> tokens = word_tokenize(raw)
>>> type(tokens)
<class 'list'>
>>> len(tokens)
254354
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

.find() plays the same role here as Ctrl+F does in Word.
Analyze the text, delete the Project Gutenberg metadata, and slice out just the body of the book.

>>> raw.find("PART I")
5338
>>> raw.rfind("End of Project Gutenberg's Crime")
1157743
>>> raw = raw[5338:1157743] 
>>> raw.find("PART I")
0

Web pages

To scrape the content of web pages, you need the BeautifulSoup package.

Searching the web

The main benefit is the sheer size of the collection.

The web can be thought of as a huge corpus of unannotated text.
The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in.

The drawbacks are: (1) the range of search patterns is severely restricted; (2) results are inconsistent; (3) the markup changes unpredictably.

Unfortunately, search engines have some significant shortcomings.
First, the allowable range of search patterns is severely restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent results, and can give widely different figures when used at different times or in different geographical regions. When content has been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine may change unpredictably, breaking any pattern-based method of locating particular content (a problem which is ameliorated by the use of search engine APIs).

Reading local files

Method 1

>>> f = open('LICENSE.txt')
>>> raw = f.read()
>>> print(raw)  # note: calling f.read() a second time would return '' because the file pointer is already at the end of the file

To list all the files in a directory:

>>> import os
>>> os.listdir('.')
>>> path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
>>> raw = open(path).read()  # the book writes open(path, 'rU'); the 'U' (universal newlines) mode is deprecated in Python 3

Reading local files involves different ways of opening the file, and either setting a default working directory or giving a full path.
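
A minimal sketch of the more idiomatic way to read a file, using a with statement so the file is closed automatically (the file name document.txt is only a placeholder):

>>> with open('document.txt', encoding='utf8') as f:  # 'document.txt' is a placeholder name
...     raw = f.read()  # the whole file as a single string; the file is closed when the block ends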

PDF

Text often comes in binary formats — like PDF and MSWord — that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats.
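
A minimal sketch of pulling the text out of a PDF with pypdf (the file name document.pdf is a placeholder; the quality of the extracted text varies from PDF to PDF):

>>> from pypdf import PdfReader  # third-party: pip install pypdf
>>> reader = PdfReader('document.pdf')  # placeholder file name
>>> raw = '\n'.join(page.extract_text() for page in reader.pages)  # concatenate the text of every page
>>> tokens = word_tokenize(raw)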

Scraping web pages

Download the web page, slice out the part of interest, extract the text from the HTML, tokenize it, and lowercase the words (see the sketch below).

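A minimal sketch of that pipeline, assuming the third-party BeautifulSoup package (bs4) is installed; the URL is a placeholder and can be any page you want to scrape:

>>> from urllib import request
>>> from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
>>> url = "http://example.com"  # placeholder URL
>>> html = request.urlopen(url).read().decode('utf8')
>>> raw = BeautifulSoup(html, 'html.parser').get_text()  # strip the HTML markup, keep the text
>>> tokens = word_tokenize(raw)
>>> words = [w.lower() for w in tokens]  # normalize case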

string ----(tokenize)----> list (the code below illustrates the different properties and capabilities of strings and lists: a list's elements can be added to or changed, etc.)

>>> tokens = word_tokenize(raw)
>>> type(tokens)
<class 'list'>
>>> words = [w.lower() for w in tokens]
>>> type(words)
<class 'list'>
>>> vocab = sorted(set(words))
>>> type(vocab)
<class 'list'>

Things to note when working with strings

>>> circus = 'Monty Python\'s Flying Circus'  # backslash-escape the quote: it is a possessive apostrophe, not the end of the string
>>> circus
"Monty Python's Flying Circus"  # the output
>>> # a trailing backslash, enclosing parentheses, or triple quotes '''  ''' tell Python that the string continues on the next line
>>> couplet = "Shall I compare thee to a Summer's day?"\
...           "Thou art more lovely and more temperate:"
>>> print(couplet)
Shall I compare thee to a Summer's day?Thou art more lovely and more temperate:

Related ideas: string concatenation; a variable simply holds a string value.

Strings are indexed, starting from zero. The slice [m,n] contains the characters from position m through n-1. Lists have the added power that you can change their elements. This is because strings are immutable: you can't change a string once you have created it. Lists, however, are mutable, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.
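
A small illustration of the difference (the strings and lists here are made up for demonstration):

>>> s = 'monty'
>>> s[0]  # strings are indexed from zero
'm'
>>> s[1:4]  # the slice [m,n]: characters at positions 1 through 3
'ont'
>>> s[0] = 'M'  # strings are immutable, so assignment fails
Traceback (most recent call last):
  ...
TypeError: 'str' object does not support item assignment
>>> beatles = ['John', 'Paul', 'George', 'Ringo']
>>> beatles[0] = 'JOHN'  # lists are mutable: this modifies the original list
>>> beatles
['JOHN', 'Paul', 'George', 'Ringo']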

3.3 Text Processing with Unicode

What is Unicode?
Unicode supports over a million characters. Each
character is assigned a number, called a code point. In Python, code
points are written in the form \uXXXX, where XXXX is the number
in 4-digit hexadecimal form.
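
For example (the character is chosen arbitrarily):

>>> print('\u0144')  # code point U+0144, LATIN SMALL LETTER N WITH ACUTE
ń
>>> ord('ń')  # ord() returns the code point as an integer
324
>>> hex(324)  # 0x144, i.e. the XXXX part of \u0144
'0x144'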

The Python open() function can read encoded data into Unicode strings, and write out Unicode strings in encoded form.

>>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
>>> f = open(path, encoding='latin2')
>>> for line in f:
...    line = line.strip()
...    print(line)

Unicode decoding (bytes to str) and encoding (str to bytes)
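
A minimal sketch of going back and forth between str and bytes (the character is just an example):

>>> 'ń'.encode('utf8')  # encoding: str -> bytes
b'\xc5\x84'
>>> b'\xc5\x84'.decode('utf8')  # decoding: bytes -> str
'ń'
>>> 'ń'.encode('latin2')  # the same character in the Latin-2 encoding used above
b'\xf1'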

3.4 Regular Expressions

Wildcards and metacharacters:
$ matches the end of the word
^ matches the start of the word
. matches any single character
? the preceding character or group is optional (zero or one occurrence)
\ strips the following metacharacter of its special meaning: the following character is deprived of its special powers and must literally match a specific character in the word. Thus, while . is special, \. only matches a period.

The next two operators constrain how many times the preceding element may occur; they are called Kleene closures:
* zero or more occurrences, e.g. [a-z]*
+ one or more occurrences ("one or more instances of the preceding item")

Brackets:
[] matches any single character from the set inside; the order of the characters does not matter.
{} repetition counts: {n,} at least n repetitions, {,m} at most m repetitions, {n,m} between n and m repetitions.
() groups an expression, and marks the scope of a disjunction.
<> separates one token from the next. The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK's findall() method for texts).
| disjunction: match either the expression on the left or the one on the right.

The re.search(p, s) function checks whether the pattern p can be found somewhere inside the string s.
$: the dollar sign has a special behavior in the context of regular expressions, in that it matches the end of the word:

>>> [w for w in wordlist if re.search('ed$', w)]
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]

the ? symbol specifies that the previous character is optional.
Thus «^e-?mail$» will match both email and e-mail. We could count the total number of occurrences of this word (in either spelling) in a text using sum(1 for w in text if re.search('^e-?mail$', w)).
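
For instance, on a made-up list of tokens:

>>> import re
>>> text = ['Send', 'me', 'an', 'e-mail', 'or', 'an', 'email', 'today']  # toy data
>>> sum(1 for w in text if re.search('^e-?mail$', w))
2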

>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']

The re.findall() ("find all") method finds all (non-overlapping) matches of the given regular expression and returns them directly as a list. Compare this with re.search(p, s), which only tells you whether the pattern occurs somewhere; to collect the matching words with re.search you have to loop over the words yourself, as in the list comprehensions above.

# find all sequences of two or more vowels in some text, and determine their relative frequency
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                       for vs in re.findall(r'[aeiou]{2,}', word))  # find every run of two or more adjacent vowels in every word of the wsj vocabulary; each match is bound to vs
>>> fd.most_common(12)  # most_common() lists the most frequent items from the FreqDist counts
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95)]
>>> chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))  # collect the word types and sort them
>>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]  # compare with «^m*i*n*e*$»: + means one or more, * means zero or more
['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine',
'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
>>> [w for w in chat_words if re.search('^[ha]+$', w)]  # note: inside square brackets, the order of the characters does not matter
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',
'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa',
'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', ...]

A rough piece of code that strips out vowels (quite fun!)

# matches initial vowel sequences, final vowel sequences, and all consonants. Steps: define the pattern; define a compress() function that uses re.findall to keep only the pieces of each word that match; then loop over all the words, running each one through compress(w)
>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)  # ''.join() concatenates the matched pieces back into a single string
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words     
              for cv in re.findall(r'[ptksvr][aeiou]', w)]  # study these two lines carefully: for every word, find every consonant-vowel pair inside it
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
    a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49

Finding Word Stems: a rough attempt at stripping off suffixes. NLTK has built-in stemmers; this passage is just an exercise in using regular expressions.

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']
>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]

This is because the parentheses have a second function: to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions.

>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")  # find the words that occur between 'a' and 'man'
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>") 
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}") 
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la

3.6 Normalizing Text

Normalizing covers several things: converting to lowercase; stripping off affixes (stemming); and making sure that what remains is a word found in a dictionary (lemmatization).
By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this, and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary,a task known as lemmatization.

>>> raw = " i love listen to music. And how about you? Have you been loving listening to music? "
>>> tokens = word_tokenize(raw)  # these two lines define the data
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()  # these two lines create two different stemmers
>>> porter.stem(t) for t in tokens
SyntaxError: invalid syntax  # the line above needs square brackets around it to form a list comprehension
>>> [porter.stem(t) for t in tokens]
['i', 'love', 'listen', 'to', 'music', '.', 'and', 'how', 'about', 'you', '?', 'have', 'you', 'been', 'love', 'listen', 'to', 'music', '?']
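
For lemmatization, NLTK provides WordNetLemmatizer; a minimal sketch (note that by default it treats every word as a noun, so verb forms need pos='v'):

>>> wnl = nltk.WordNetLemmatizer()
>>> wnl.lemmatize('women')  # irregular plural noun -> singular
'woman'
>>> wnl.lemmatize('loving')  # treated as a noun by default, so nothing changes
'loving'
>>> wnl.lemmatize('loving', pos='v')  # tell the lemmatizer it is a verb
'love'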

3.7 Tokenizing

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data.

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone though), 'I won't have any pepper in my kitchen AT ALL. Soup does very well without--Maybe it's always pepper that makes people hot-tempered,'..."""
>>> re.split(r' ', raw) 
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper','in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw)  #[ \t\n]+ matches one or more space, tab (\t) or newline (\n).
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

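NLTK also provides a regular-expression-based tokenizer, nltk.regexp_tokenize(); a minimal sketch with a simplified pattern (not the full pattern from the book):

>>> pattern = r'''(?x)       # (?x) = verbose mode: whitespace and comments inside the pattern are ignored
...     \w+(?:-\w+)*         # words, with optional internal hyphens
...   | \$?\d+(?:\.\d+)?     # currency amounts and numbers, e.g. $12.40
...   | \.\.\.               # ellipsis
... '''
>>> nltk.regexp_tokenize('That poster-print costs $12.40...', pattern)
['That', 'poster-print', 'costs', '$12.40', '...']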

3.8 Segmentation

Sentence segmentation and word segmentation.
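
A minimal sketch of both, using NLTK's built-in Punkt sentence tokenizer (the sample text is made up):

>>> text = 'Hello world. This is NLTK. Sentence segmentation is easy!'  # made-up sample text
>>> sents = nltk.sent_tokenize(text)  # sentence segmentation with the pre-trained Punkt tokenizer
>>> sents
['Hello world.', 'This is NLTK.', 'Sentence segmentation is easy!']
>>> [word_tokenize(s) for s in sents]  # word segmentation within each sentence
[['Hello', 'world', '.'], ['This', 'is', 'NLTK', '.'], ['Sentence', 'segmentation', 'is', 'easy', '!']]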
