Natural Language Processing with Python study notes (9): 2.1 Accessing Text Corpora

 
First update: 2011-8-6

 

CHAPTER 2   

Accessing Text Corpora and Lexical Resources


 

Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora. The goal of this chapter is to answer the following questions:

 

1. What are some useful text corpora and lexical resources, and how can we access them with Python?

2. Which Python constructs are most helpful for this work?

3. How do we avoid repeating ourselves when writing Python code?

 

This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically. Don't worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and, if you're game, modify it by substituting some part of the code with a different text or word. This way you will associate a task with a programming idiom, and learn the hows and whys later.

 

2.1 Accessing Text Corpora

 

As just mentioned, a text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres. We examined some small text collections in Chapter 1, such as the speeches known as the US Presidential Inaugural Addresses. This particular corpus actually contains dozens of individual texts (one per address), but for convenience we glued them end-to-end and treated them as a single text. Chapter 1 also used various predefined texts that we accessed by typing from nltk.book import *. However, since we want to be able to work with other texts, this section examines a variety of text corpora. We'll see how to select individual texts, and how to work with them.

 

Gutenberg Corpus

 

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 (now about 36,000) free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

 


>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

 

Let’s pick out the first of these texts—Emma by Jane Austen—and give it a short name, emma, then find out how many words it contains:

 


>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427

 

In Section 1.1, we showed how you could carry out concordancing of a text such as text1 with the command text1.concordance(). However, this assumes that you are using one of the nine texts obtained as a result of doing from nltk.book import *. Now that you have started examining data from nltk.corpus, as in the previous example, you have to employ the following pair of statements to perform concordancing and other tasks from Section 1.1:


>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")

 

When we defined emma, we invoked the words() function of the gutenberg object in NLTK's corpus package. But since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows:

 


>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

 

Let's write a short program to display other information about each text, by looping over all the values of fileid corresponding to the gutenberg file identifiers listed earlier and then computing statistics for each text. For a compact output display, we will make sure that the numbers are all integers, using int().


>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid)) ①
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
...     print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid
...
4 21 26 austen-emma.txt
4 23 16 austen-persuasion.txt
4 24 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 17 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 18 24 edgeworth-parents.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 12 8 shakespeare-caesar.txt
4 13 7 shakespeare-hamlet.txt
4 13 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt

 

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3, not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
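As a quick sanity check on that claim, here is a minimal sketch (my own, not from the book) that recomputes average word length over the tokens themselves, so the space characters counted by raw() are excluded; the exact value will vary from text to text:

from nltk.corpus import gutenberg

# Average word length computed from the tokens only, so no space
# characters enter into the character count.
emma_words = gutenberg.words('austen-emma.txt')
avg_len = sum(len(w) for w in emma_words) / float(len(emma_words))
print avg_len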

 

The previous example also showed how we can access the "raw" text of the book, not split up into tokens. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many letters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words:


 

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
['Actus', 'Primus', '.'], ...]
>>> macbeth_sentences[1037]
['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']
>>> longest_len = max([len(s) for s in macbeth_sentences])
>>> [s for s in macbeth_sentences if len(s) == longest_len]
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling',
'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', ...], ...]

 

Most NLTK corpus readers include a variety of access methods apart from words(), raw(), and sents(). Richer linguistic content is available from some corpora, such as part-of-speech tags, dialogue tags, syntactic trees, and so forth; we will see these in later chapters.

 

Web and Chat Text

 

Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Carribean, personal advertisements, and wine reviews:

 

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], '...'
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there!  [clop...
overheard.txt White guy: So, do you have any plans for this evening? Asian girl...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb...

 

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The filename contains the date, chat-room, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.
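Since the post count is embedded in each filename, a small sketch (my own, not from the book) can parse it back out and compare it with the number of posts the reader actually returns; the string handling assumes every filename follows the date-chatroom_NNNposts.xml pattern described above:

from nltk.corpus import nps_chat

# e.g. '10-19-20s_706posts.xml' -> the '706' between '_' and 'posts.xml'
for fileid in nps_chat.fileids():
    claimed = int(fileid.split('_')[1].replace('posts.xml', ''))
    actual = len(nps_chat.posts(fileid))
    print fileid, claimed, actual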

 


>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',',
'I', 'can', 'look', 'in', 'a', 'mirror', '.']

 

Brown Corpus

 

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 2-1 gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html).

 

ID   File  Genre            Description
A16  ca16  news             Chicago Tribune: Society Reportage
B02  cb02  editorial        Christian Science Monitor: Editorials
C17  cc17  reviews          Time Magazine: Reviews
D12  cd12  religion         Underwood: Probing the Ethics of Realtors
E36  ce36  hobbies          Norling: Renting a Car in Europe
F25  cf25  lore             Boroff: Jewish Teenage Culture
G22  cg22  belles_lettres   Reiner: Coping with Runaway Technology
H15  ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17  cj19  learned          Mosteller: Probability with Statistical Applications
K04  ck04  fiction          W.E.B. Du Bois: Worlds of Color
L13  cl13  mystery          Hitchens: Footsteps in the Night
M01  cm01  science_fiction  Heinlein: Stranger in a Strange Land
N14  cn15  adventure        Field: Rattlesnake Ridge
P12  cp12  romance          Callaghan: A Passion in Rome
R06  cr06  humor            Thurber: The Future, If Any, of Comedy

 

                Table 2-1. Example document for each section of the Brown Corpus

 

We can access the corpus as a list of words or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

 


>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', ...], ['The', 'jury', 'further', ...], ...]

 

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:

 


>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

 

Your Turn: Choose a different section of the Brown Corpus, and adapt the preceding example to count a selection of wh words, such as what, when, where, who and why.
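One possible answer to this Your Turn, using the romance category and an arbitrary selection of wh-words (both choices are just examples, not the book's prescribed solution):

from nltk.corpus import brown
import nltk

# Count some wh-words in the 'romance' section of the Brown Corpus.
romance_text = brown.words(categories='romance')
fdist = nltk.FreqDist([w.lower() for w in romance_text])
wh_words = ['what', 'when', 'where', 'who', 'why']
for w in wh_words:
    print w + ':', fdist[w],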

 

Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in Section 2.2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output.


>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                  can could  may might must will
           news    93    86   66    38   50  389
       religion    82    59   78    12   54   71
        hobbies   268    58  131    22   83  264
science_fiction    16    49    4    12    8   16
        romance    74   193   11    51   45   43
          humor    16    30    8     8    9   13

 

Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in Chapter 6.
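Raw counts are also affected by how large each genre is, so a fairer comparison scales each count by the size of the genre. A sketch of one way to do that (the per-10,000-words scaling is an arbitrary choice of mine, not something the book prescribes):

from nltk.corpus import brown
import nltk

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for genre in genres:
    words = [w.lower() for w in brown.words(categories=genre)]
    fdist = nltk.FreqDist(words)
    # occurrences of each modal per 10,000 words of this genre
    rates = [round(10000.0 * fdist[m] / len(words), 1) for m in modals]
    print genre, rates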

 

Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document, as we will see in Chapter 6.

 


>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',
'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn',
'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]

 

Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.


>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',
'test/15287', 'test/15341', 'test/15618', 'test/15618', 'test/15648', ...]

 

Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored as uppercase.

 

>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS',
'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

 

Inaugural Address Corpus

 

In Section 1.1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in Figure 1-2 used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension (the recent addresses, including Obama's, are in there too):

 


>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

 

Notice that the year of each text appears in its filename. To get the year out of the filename, we extracted the first four characters, using fileid[:4].

 

Let’s look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks whether they start with either of the “targets” america or citizen using startswith(). Thus it will count words such as American’s and Citizens. We’ll learn about conditional frequency distributions in Section 2.2; for now, just consider the output, shown in Figure 2-1.


>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target)) ①
>>> cfd.plot()

 

My first attempt at this failed with a TypeError, because I had typed file[:4] instead of fileid[:4]: in Python 2, file is a built-in type, so subscripting it produces the traceback below. With fileid[:4], as shown above, the plot is produced correctly.

Traceback (most recent call last):
  File "E:/Test/NLTK/2.1.py", line 6, in <module>
    for fileid in inaugural.fileids()
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 1740, in __init__
    for (cond, sample) in cond_samples:
  File "E:/Test/NLTK/2.1.py", line 9, in <genexpr>
    if w.lower().startswith(target))
TypeError: 'type' object is unsubscriptable

Figure 2-1. Plot of a conditional frequency distribution: All words in the Inaugural Address Corpus that begin with america or citizen are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.
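The caption points out that the counts are not normalized for document length. A rough sketch of my own showing how that normalization could be done, printing proportions per address rather than plotting them (since ConditionalFreqDist itself stores integer counts):

from nltk.corpus import inaugural

# Proportion of tokens in each address that begin with 'america' or 'citizen'.
for fileid in inaugural.fileids():
    words = [w.lower() for w in inaugural.words(fileid)]
    for target in ['america', 'citizen']:
        count = len([w for w in words if w.startswith(target)])
        print fileid[:4], target, round(float(count) / len(words), 5)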

 

Annotated Text Corpora

 

Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. Table 2-2 lists some of the corpora. For information about downloading them, see http://www.nltk.org/data. For more examples of how to access NLTK corpora, please consult the Corpus HOWTO at http://www.nltk.org/howto.

 

                                    Table 2-2. Some of the corpora and corpus samples distributed with NLTK

Corpus | Compiler | Contents
Brown Corpus | Francis, Kucera | 15 genres, 1.15M words, tagged, categorized
CESS Treebanks | CLiC-UB | 1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files | Pereira & Warren | World Geographic Database
CMU Pronouncing Dictionary | CMU | 127k entries
CoNLL 2000 Chunking Data | CoNLL | 270k words, tagged and chunked
CoNLL 2002 Named Entity | CoNLL | 700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel) | CoNLL | 150k words, dependency parsed (Basque, Catalan)
Dependency Treebank | Narad | Dependency parsed version of Penn Treebank sample
Floresta Treebank | Diana Santos et al | 9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists | Various | Lists of cities and countries
Genesis Corpus | Misc web sources | 6 texts, 200k words, 6 languages
Gutenberg (selections) | Hart, Newby, et al | 18 texts, 2M words
Inaugural Address Corpus | CSpan | US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus | Kumaran et al | 60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus | NILC, USP, Brazil | 1M words, tagged (Brazilian Portuguese)
Movie Reviews | Pang, Lee | 2k movie reviews with sentiment polarity classification
Names Corpus | Kantrowitz, Ross | 8k male and female names
NIST 1999 Info Extr (selections) | Garofolo | 63k words, newswire and named-entity SGML markup
NPS Chat Corpus | Forsyth, Martell | 10k IM chat posts, POS-tagged and dialogue-act tagged
PP Attachment Corpus | Ratnaparkhi | 28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank | Palmer | 113k propositions, 3300 verb frames
Question Classification | Li, Roth | 6k questions, categorized
Reuters Corpus | Reuters | 1.3M words, 10k news documents, categorized
Roget's Thesaurus | Project Gutenberg | 200k words, formatted text
RTE Textual Entailment | Dagan et al | 8k sentence pairs, categorized
SEMCOR | Rus, Mihalcea | 880k words, part-of-speech and sense tagged
Senseval 2 Corpus | Pedersen | 600k words, part-of-speech and sense tagged
Shakespeare texts (selections) | Bosak | 8 books in XML format
State of the Union Corpus | CSPAN | 485k words, formatted text
Stopwords Corpus | Porter et al | 2,400 stopwords for 11 languages
Swadesh Corpus | Wiktionary | comparative wordlists in 24 languages
Switchboard Corpus (selections) | LDC | 36 phonecalls, transcribed, parsed
Univ Decl of Human Rights | United Nations | 480k words, 300+ languages
Penn Treebank (selections) | LDC | 40k words, tagged and parsed
TIMIT Corpus (selections) | NIST/LDC | audio files and transcripts for 16 speakers
VerbNet 2.1 | Palmer et al | 5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus | OpenOffice.org et al | 960k words and 20k affixes for 8 languages
WordNet 3.0 (English) | Miller, Fellbaum | 145k synonym sets

 

Corpora in Other Languages

 

NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora (see Section 3.3).

 

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]
>>> nltk.corpus.indian.words('hindi.pos')
['\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xa3',
'\xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xac\xe0\xa4\x82\xe0\xa4\xa7', ...]
>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1',
'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1',
'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...]
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]

 

The last of these corpora, udhr, contains the Universal Declaration of Human Rights in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1. Let's use a conditional frequency distribution to examine the differences in word lengths for a selection of languages included in the udhr corpus. The output is shown in Figure 2-2 (run the program yourself to see a color plot). Note that True and False are Python's built-in Boolean values.

 


>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...           (lang, len(word))
...           for lang in languages
...           for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)

  

 

Figure 2-2. Cumulative word length distributions: Six translations of the Universal Declaration of Human Rights are processed; this graph shows that words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.

 

Your Turn: Pick a language of interest in udhr.fileids(), and define a variable raw_text = udhr.raw('Language-Latin1'). Now plot a frequency distribution of the letters of the text using nltk.FreqDist(raw_text).plot().

I don't know why Chinese_Mandarin-UTF8 doesn't work here; leaving this question open to come back to.
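For the record, one way to do this Your Turn for English (the choice of language is arbitrary, and plot() needs matplotlib installed). The problem with Chinese_Mandarin-UTF8 is possibly a character-encoding issue of the kind Section 3.3 discusses, but I have not verified that:

from nltk.corpus import udhr
import nltk

# Frequency of each character in the raw English text of the UDHR.
raw_text = udhr.raw('English-Latin1')
nltk.FreqDist(raw_text).plot()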

 

Unfortunately, for many languages, substantial corpora are not yet available. Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or reuse. Some languages have no established writing system, or are endangered. (See Section 2.7 for suggestions on how to locate language resources.)

  

Text Corpus Structure

 

We have seen a variety of corpus structures so far; these are summarized in Figure 2-3. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories, as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example. NLTK's corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora. Table 2-3 lists functionality provided by the corpus readers.

 

Figure 2-3. Common structures for text corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories, such as genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus). Four different kinds of corpus structure.


Table 2-3. Basic corpus functionality defined in NLTK: More documentation can be found using help(nltk.corpus.reader) and by reading the online Corpus HOWTO at http://www.nltk.org/howto.

Example                    Description
fileids()                  The files of the corpus
fileids([categories])      The files of the corpus corresponding to these categories
categories()               The categories of the corpus
categories([fileids])      The categories of the corpus corresponding to these files
raw()                      The raw content of the corpus
raw(fileids=[f1,f2,f3])    The raw content of the specified files
raw(categories=[c1,c2])    The raw content of the specified categories
words()                    The words of the whole corpus
words(fileids=[f1,f2,f3])  The words of the specified fileids
words(categories=[c1,c2])  The words of the specified categories
sents()                    The sentences of the whole corpus
sents(fileids=[f1,f2,f3])  The sentences of the specified fileids
sents(categories=[c1,c2])  The sentences of the specified categories
abspath(fileid)            The location of the given file on disk
encoding(fileid)           The encoding of the file (if known)
open(fileid)               Open a stream for reading the given corpus file
root()                     The path to the root of the locally installed corpus
readme()                   The contents of the README file of the corpus
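A quick look at a couple of the metadata methods from Table 2-3, using the Gutenberg corpus as an example (the output depends on where your NLTK data is installed, so none is shown here):

from nltk.corpus import gutenberg

print gutenberg.abspath('austen-emma.txt')   # where this corpus file lives on disk
print gutenberg.readme()[:200]               # start of the corpus README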

 

We illustrate the difference between some of the corpus access methods here:

 

>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]     # slicing raw() counts individual characters
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]   # slicing words() counts individual word, number, and punctuation tokens
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]   # by sentence; why is 'I' a sentence on its own? (it is the Roman-numeral chapter heading, so the sentence splitter keeps it separate)
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]

 

Loading Your Own Corpus

 

If you have your own collection of text files that you would like to access using the methods discussed earlier, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict (a Linux path). Whatever the location, set this to be the value of corpus_root. The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see Section 3.4 for information about regular expressions).

 

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict' ①
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') ②
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
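The same recipe works for a folder of your own .txt files on Windows; the path below is purely hypothetical, and the r'.*\.txt' pattern limits the reader to files ending in .txt:

from nltk.corpus import PlaintextCorpusReader

corpus_root = r'C:\mytexts'   # hypothetical folder containing plain .txt files
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print wordlists.fileids()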

 

As another example, suppose you have your own local copy of Penn Treebank (release 3), in C:\corpora. We can use the BracketParseCorpusReader to access this corpus. We specify the corpus_root to be the location of the parsed Wall Street Journal component of the corpus, and give a file_pattern that matches the files contained within its subfolders (using forward slashes).

 

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj" ①
>>> file_pattern = r".*/wsj_.*\.mrg" ②
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']