
The goal of this chapter is to answer the following questions:
1. Accessing local and web text: How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
2. Tokenization: How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. Formatting and noise reduction: How can we write programs to produce formatted output and save it in a file?


>>> from __future__ import division  # note: there are two underscores on each side of "future"
>>> import nltk, re, pprint
>>> from nltk import word_tokenize


>>> from urllib import request
>>> url = ""
>>> response = request.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> type(raw)
<class 'str'>
>>> len(raw)
1176893  # this count includes whitespace, line breaks and so on, which is why the text needs to be tokenized
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

This is the raw content of the book, including many details we are not interested in, such as whitespace, line breaks and blank lines.

>>> tokens = word_tokenize(raw)
>>> type(tokens)
<class 'list'>
>>> len(tokens)
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']


>>> raw.find("PART I")
>>> raw.rfind("End of Project Gutenberg's Crime")
>>> raw = raw[5338:1157743] 
>>> raw.find("PART I")
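
The find() / rfind() / slice steps above can be tried on a small stand-in string without downloading anything. This is only a sketch: the sample text and the helper name trim_boilerplate are made up for illustration, and only the "PART I" marker is taken from the session above.

```python
def trim_boilerplate(text, start_marker, end_marker):
    """Return text from the first start_marker up to (not including) the last end_marker."""
    start = text.find(start_marker)    # index of the first occurrence, or -1
    end = text.rfind(end_marker)       # index of the last occurrence, or -1
    if start == -1:
        start = 0                      # no start marker: keep from the beginning
    if end == -1:
        end = len(text)                # no end marker: keep to the end
    return text[start:end]

# A tiny, made-up stand-in for a downloaded e-book.
sample = (
    "The Project Gutenberg EBook of Some Title\r\n"
    "License notes...\r\n"
    "PART I\r\n"
    "On an exceptionally hot evening...\r\n"
    "End of Project Gutenberg's Some Title\r\n"
)

body = trim_boilerplate(sample, "PART I", "End of Project Gutenberg's")
print(body.find("PART I"))   # 0, because the body now starts at the marker
```

Looking the offsets up with find() and rfind() each time, instead of hard-coding them, keeps the slice valid if the header or footer of the file changes.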





The web can be thought of as a huge corpus of unannotated text.
The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in.

Drawbacks, in brief: (1) the range of search patterns is severely restricted; (2) results are inconsistent; (3) the markup of results changes unpredictably.

Unfortunately, search engines have some significant shortcomings.
First, the allowable range of search patterns is severely restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent results, and can give widely different figures when used at different times or in different geographical regions. When content has been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine may change unpredictably, breaking any pattern-based method of locating particular content (a problem which is ameliorated by the use of search engine APIs).



>>> f = open('LICENSE.txt')
>>> raw = f.read()
>>> print(raw)


>>> import os
>>> os.listdir('.')
>>> path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
>>> raw = open(path, 'r').read()  # the old 'rU' mode was removed in Python 3.11
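
A self-contained way to try local file access is to write a small throwaway file first and then read it back. The file name, directory and contents below are invented for the example; the with statement is used so the file is closed automatically.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on any corpus.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "document.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Time flies like an arrow.\nFruit flies like a banana.\n")

# Read the whole file into a single string.
with open(path, encoding="utf-8") as f:
    raw = f.read()

# Or read it one line at a time, stripping the trailing newline.
with open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]

print(os.listdir(tmp_dir))
print(lines)
```

Passing encoding explicitly avoids depending on the platform default, which matters as soon as the text contains non-ASCII characters.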



Text often comes in binary formats — like PDF and MSWord — that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats.





>>> tokens = word_tokenize(raw)
>>> type(tokens)
<class 'list'>
>>> words = [w.lower() for w in tokens]
>>> type(words)
<class 'list'>
>>> vocab = sorted(set(words))
>>> type(vocab)
<class 'list'>


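>>> circus = 'Monty Python\'s Flying Circus'  # the backslash escapes the quote: it is a possessive apostrophe, not the end of the string
>>> circus
"Monty Python's Flying Circus"

A string can be continued across lines with a trailing backslash, by wrapping the whole expression in parentheses, or by using triple quotes ('''  '''). A comment must not follow the backslash.

>>> couplet = "Shall I compare thee to a Summer's day?"\
...           "Thou are more lovely and more temperate:"
>>> print(couplet)
Shall I compare thee to a Summer's day?Thou are more lovely and more temperate: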

variable value

Strings are indexed, starting from zero.
The slice [m:n] contains the characters from position m through n-1.
Lists have the added power that you can change their elements, which strings do not allow.
This is because strings are immutable: you can't change a
string once you have created it. However, lists are mutable,
and their contents can be modified at any time. As a result, lists
support operations that modify the original value rather than producing a new value.
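
The difference between the two types is easy to check directly. A short sketch with made-up values:

```python
query = "Who knows?"
beatles = ["John", "Paul", "George", "Ringo"]

# Indexing starts at zero; the slice [m:n] runs from position m through n-1.
print(query[0])      # W
print(query[4:9])    # knows

# Lists are mutable: these operations change the original object.
beatles[0] = "John Lennon"
del beatles[-1]

# Strings are immutable: item assignment raises TypeError.
try:
    query[0] = "F"
    changed = True
except TypeError:
    changed = False

# "Changing" a string really means building a new one.
new_query = "F" + query[1:]
print(changed, query, new_query)
```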

3.3 Text Processing with Unicode

What is Unicode?
Unicode supports over a million characters. Each
character is assigned a number, called a code point. In Python, code
points are written in the form \uXXXX, where XXXX is the number
in 4-digit hexadecimal form.

The Python open() function can read encoded data into Unicode strings, and write out Unicode strings in encoded form.

>>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
>>> f = open(path, encoding='latin2')
>>> for line in f:
...    line = line.strip()
...    print(line)

unicode decoding and encoding
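
Encoding and decoding can be tried on a short string without any corpus file. The sketch below uses an invented Polish phrase containing ń (code point U+0144), written as an escape so that the example does not depend on the encoding of the source file.

```python
text = "Pruska Biblioteka Pa\u0144stwowa"   # \u0144 is the letter n with an acute accent

# ord() gives the code point of a character.
code_point = ord("\u0144")
print(hex(code_point))                       # 0x144

# Encoding turns a str into bytes; the bytes depend on the encoding chosen.
as_utf8 = text.encode("utf-8")               # U+0144 takes two bytes in UTF-8
as_latin2 = text.encode("latin2")            # and a single byte in Latin-2
print(len(text), len(as_utf8), len(as_latin2))

# Decoding with the matching encoding recovers the original string.
round_trip = as_latin2.decode("latin2")
print(round_trip == text)
```

Decoding the Latin-2 bytes with a different codec would either fail or silently produce the wrong characters, which is why open() needs the correct encoding argument.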

3.4 Regular Expressions

Wildcards and other metacharacters:
. matches any single character.
\ strips the following character of its special powers, so that it must literally match that character in the word. Thus, while . is special, \. only matches a period.

The next two symbols constrain how many times the preceding item may occur; they are called Kleene closures:
* means "zero or more instances of the preceding item", e.g. [a-z]*
+ means "one or more instances of the preceding item"
<> separates one token from the next. The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK's findall() method for texts).
re.search(p, s) is a function that checks whether the pattern p can be found somewhere inside the string s.
$ has a special behavior in the context of regular expressions: it matches the end of the word.
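
These symbols can be tried on a handful of strings. The word list and the helper name matching are invented for the example; only re.search() comes from the notes above.

```python
import re

words = ["abandoned", "a.b", "axb", "email", "bed", "edit", "aaa", ""]

def matching(pattern, items):
    """Return the items in which pattern is found somewhere (via re.search)."""
    return [w for w in items if re.search(pattern, w)]

print(matching(r"a.b", words))    # . matches any single character
print(matching(r"a\.b", words))   # \. matches only a literal period
print(matching(r"ed$", words))    # $ anchors the match at the end of the word
print(matching(r"^a+$", words))   # + requires at least one 'a'
print(matching(r"^a*$", words))   # * also accepts zero, so the empty string matches
```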

>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
>>> [w for w in wordlist if re.search('ed$', w)]
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]

The ? symbol specifies that the previous character is optional.
Thus «^e-?mail$» will match both email and e-mail. We could count the total number of occurrences of this word (in either spelling) in a text using sum(1 for w in text if re.search('^e-?mail$', w)).
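
The count of both spellings can be checked on a short, made-up token list:

```python
import re

text = ["Send", "me", "an", "e-mail", "or", "email", "me", "at", "emails", "dot", "com"]

# ^ and $ anchor the pattern to the whole token, so 'emails' is not counted.
count = sum(1 for w in text if re.search(r"^e-?mail$", w))
print(count)   # 2
```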

>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']

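The re.findall() ("find all") method finds all (non-overlapping)
matches of the given regular expression and returns them directly as a list. By contrast, re.search(p, s) only tests whether a match exists, which is why it is used as the condition in a for expression that loops over the words w.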
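
The contrast between the two functions shows up clearly on a single word. A small sketch using only the standard re module:

```python
import re

word = "supercalifragilisticexpialidocious"

# re.findall returns every non-overlapping match as a list of strings.
vowels = re.findall(r"[aeiou]", word)
print(len(vowels))      # 16

# re.search returns a match object or None, so it suits yes/no filtering.
has_vowel_run = bool(re.search(r"[aeiou]{2,}", word))

# With findall, the matched sequences themselves come back.
vowel_runs = re.findall(r"[aeiou]{2,}", word)
print(has_vowel_run, vowel_runs)
```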

# Find all sequences of two or more vowels in some text, and determine their relative frequency.
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                       for vs in re.findall(r'[aeiou]{2,}', word))  # each run of two or more adjacent vowels in each word of wsj is bound to vs and counted
>>> fd.most_common(12)  # most_common() lists the most frequent items counted by FreqDist
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95)]
>>> chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))  # collect all words, reduce them to unique types, and sort
>>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]  # try «^m*i*n*e*$» instead: + means one or more, * means zero or more
['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine',
>>> [w for w in chat_words if re.search('^[ha]+$', w)]  # note: order does not matter inside square brackets
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',
'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa',
'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', ...]


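# Match initial vowel sequences, final vowel sequences, and all consonants. Steps: define the pattern; define a function compress() that uses re.findall() to extract the matching pieces of a word; then loop over all the words, passing each one through compress(w).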
>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)  # ''.join() concatenates the pieces
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words
...        for cv in re.findall(r'[ptksvr][aeiou]', w)]  # these two lines are worth studying closely
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
    a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49
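
To see how the table is built, the same counting can be sketched with the standard library alone. The three words below are invented and are not from the Rotokas lexicon; a defaultdict of Counter objects stands in for nltk.ConditionalFreqDist, on the assumption that each two-character match is treated as a (consonant, vowel) pair.

```python
import re
from collections import Counter, defaultdict

words = ["kasuari", "pita", "kopi"]   # made-up sample data

# Every consonant-vowel pair found in every word.
cvs = [cv for w in words for cv in re.findall(r"[ptksvr][aeiou]", w)]

# Count vowels per consonant; a 2-character string unpacks into two items.
table = defaultdict(Counter)
for consonant, vowel in cvs:
    table[consonant][vowel] += 1

# Print a small table: one row per consonant, one column per vowel.
vowels = "aeiou"
print("  " + " ".join(f"{v:>3}" for v in vowels))
for c in sorted(table):
    print(f"{c} " + " ".join(f"{table[c][v]:>3}" for v in vowels))
```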

Finding Word Stems

A crude piece of code for stripping suffixes. NLTK has built-in stemmers; this is only an experiment with regular expressions.

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']
>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]

In the first findall() call above, only the suffix is returned, because the parentheses have a second function: to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions.
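
Putting these pieces together gives a small regular-expression stemmer. This is a sketch built from the last pattern in the session above (non-greedy stem, optional suffix); the function name regex_stem is chosen here for illustration.

```python
import re

SUFFIXES = r"ing|ly|ed|ious|ies|ive|es|s|ment"

def regex_stem(word):
    """Strip one known suffix, using a non-greedy stem and an optional suffix."""
    pattern = r"^(.*?)(" + SUFFIXES + r")?$"
    root, suffix = re.findall(pattern, word)[0]   # two groups give one (root, suffix) tuple
    return root

# Capturing parentheses select what findall returns ...
captured = re.findall(r"^.*(" + SUFFIXES + r")$", "processing")
# ... while (?:...) only groups the alternatives.
grouped = re.findall(r"^.*(?:" + SUFFIXES + r")$", "processing")

print(captured, grouped)
print([regex_stem(w) for w in ["processing", "processes", "language"]])
```

The non-greedy .*? matters for 'processes': a greedy .* would take as much as it can and leave only the final 's' for the suffix group.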

>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")  # find the word that occurs between "a" and "man"
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>") 
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}") 
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la

3.6 Normalizing Text

By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this, and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.

>>> raw = " i love listen to music. And how about you? Have you been loving listening to music? "
>>> tokens = word_tokenize(raw)  # the two lines above define the data
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()  # the two lines above create two different stemmers
>>> porter.stem(t) for t in tokens
SyntaxError: invalid syntax  # the expression above must be wrapped in square brackets
>>> [porter.stem(t) for t in tokens]
['i', 'love', 'listen', 'to', 'music', '.', 'and', 'how', 'about', 'you', '?', 'have', 'you', 'been', 'love', 'listen', 'to', 'music', '?']

3.7 Tokenizing

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data.

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""
>>> re.split(r' ', raw) 
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper','in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw)  #[ \t\n]+ matches one or more space, tab (\t) or newline (\n).
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

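Splitting on whitespace leaves punctuation attached to the words. One possible refinement, sketched here on a short made-up sentence, is to describe the tokens themselves with re.findall(). The pattern is an illustrative assumption, not the rule NLTK's word_tokenize applies.

```python
import re

raw = "I won't go--it's hot-tempered."

# Splitting on runs of whitespace keeps punctuation glued to the words.
by_space = re.split(r"\s+", raw)

# Matching tokens instead: words with internal hyphens or apostrophes,
# a lone apostrophe, runs of -, . or (, or any other non-space character
# followed by word characters.
pattern = r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"
tokens = re.findall(pattern, raw)

print(by_space)
print(tokens)
```

Because the alternatives are tried from left to right, the first branch keeps "won't" and "hot-tempered" whole, while the double hyphen falls through to the punctuation branch.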

3.8 Segmentation


