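Natural Language Processing (2): Text Corpora
1. Accessing text corpora
This chapter starts with an example text corpus, nltk.corpus.gutenberg, and uses it to learn how text corpora work. Calling help() on it shows what type of object it is: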
>>> import nltk
>>> help(nltk.corpus.gutenberg)
Help on PlaintextCorpusReader in module nltk.corpus.reader.plaintext object:

class PlaintextCorpusReader(nltk.corpus.reader.api.CorpusReader)
 |  Reader for corpora that consist of plaintext documents. Paragraphs
 |  are assumed to be split using blank lines. Sentences and words can
 |  be tokenized using the default tokenizers, or by custom tokenizers
 |  specificed as parameters to the constructor.
 |
 |  This corpus reader can be customized (e.g., to skip preface
 |  sections of specific document formats) by creating a subclass and
 |  overriding the ``CorpusView`` class variable.
 |
 |  Method resolution order:
 |      PlaintextCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |
 |  Methods defined here:
 |
 |  __init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding=None)
 |      Construct a new plaintext corpus reader for a set of documents
 |      located at the given root directory. Example usage:
 |
 |          >>> root = '/usr/local/share/nltk_data/corpora/webtext/'
 |          >>> reader = PlaintextCorpusReader(root, '.*\.txt')
 |
 |      :param root: The root directory for this corpus.
 |      :param fileids: A list or regexp specifying the fileids in this corpus.
 |      :param word_tokenizer: Tokenizer for breaking sentences or
 |          paragraphs into words.
 |      :param sent_tokenizer: Tokenizer for breaking paragraphs
 |          into words.
 |      :param para_block_reader: The block reader used to divide the
 |          corpus into paragraph blocks.
 |
 |  paras(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of
 |          paragraphs, each encoded as a list of sentences, which are
 |          in turn encoded as lists of word strings.
 |      :rtype: list(list(list(str)))
 |
 |  raw(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a single string.
 |      :rtype: str
 |
 |  sents(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of
 |          sentences or utterances, each encoded as a list of word
 |          strings.
 |      :rtype: list(list(str))
 |
 |  words(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of words
 |          and punctuation symbols.
 |      :rtype: list(str)
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  CorpusView = <class 'nltk.corpus.reader.util.StreamBackedCorpusView'>
 |      A 'view' of a corpus file, which acts like a sequence of tokens:
 |      it can be accessed by index, iterated over, etc. However, the
 |      tokens are only constructed as-needed -- the entire corpus is
 |      never stored in memory at once.
 |
 |      The constructor to ``StreamBackedCorpusView`` takes two arguments:
 |      a corpus fileid (specified as a string or as a ``PathPointer``);
 |      and a block reader. A "block reader" is a function that reads
 |      zero or more tokens from a stream, and returns them as a list. A
 |      very simple example of a block reader is:
 |
 |          >>> def simple_block_reader(stream):
 |          ...     return stream.readline().split()
 |
 |      This simple block reader reads a single line at a time, and
 |      returns a single token (consisting of a string) for each
 |      whitespace-separated substring on the line.
 |
 |      When deciding how to define the block reader for a given
 |      corpus, careful consideration should be given to the size of
 |      blocks handled by the block reader. Smaller block sizes will
 |      increase the memory requirements of the corpus view's internal
 |      data structures (by 2 integers per block). On the other hand,
 |      larger block sizes may decrease performance for random access to
 |      the corpus. (But note that larger block sizes will *not*
 |      decrease performance for iteration.)
 |
 |      Internally, ``CorpusView`` maintains a partial mapping from token
 |      index to file position, with one entry per block. When a token
 |      with a given index *i* is requested, the ``CorpusView`` constructs
 |      it as follows:
 |
 |        1. First, it searches the toknum/filepos mapping for the token
 |           index closest to (but less than or equal to) *i*.
 |
 |        2. Then, starting at the file position corresponding to that
 |           index, it reads one block at a time using the block reader
 |           until it reaches the requested token.
 |
 |      The toknum/filepos mapping is created lazily: it is initially
 |      empty, but every time a new block is read, the block's
 |      initial token is added to the mapping. (Thus, the toknum/filepos
 |      map has one entry per block.)
 |
 |      In order to increase efficiency for random access patterns that
 |      have high degrees of locality, the corpus view may cache one or
 |      more blocks.
 |
 |      :note: Each ``CorpusView`` object internally maintains an open file
 |          object for its underlying corpus file. This file should be
 |          automatically closed when the ``CorpusView`` is garbage collected,
 |          but if you wish to close it manually, use the ``close()``
 |          method. If you access a ``CorpusView``'s items after it has been
 |          closed, the file object will be automatically re-opened.
 |
 |      :warning: If the contents of the file are modified during the
 |          lifetime of the ``CorpusView``, then the ``CorpusView``'s behavior
 |          is undefined.
 |
 |      :warning: If a unicode encoding is specified when constructing a
 |          ``CorpusView``, then the block reader may only call
 |          ``stream.seek()`` with offsets that have been returned by
 |          ``stream.tell()``; in particular, calling ``stream.seek()`` with
 |          relative offsets, or with offsets based on string lengths, may
 |          lead to incorrect behavior.
 |
 |      :ivar _block_reader: The function used to read
 |          a single block from the underlying file stream.
 |      :ivar _toknum: A list containing the token index of each block
 |          that has been processed. In particular, ``_toknum[i]`` is the
 |          token index of the first token in block ``i``. Together
 |          with ``_filepos``, this forms a partial mapping between token
 |          indices and file positions.
 |      :ivar _filepos: A list containing the file position of each block
 |          that has been processed. In particular, ``_toknum[i]`` is the
 |          file position of the first character in block ``i``. Together
 |          with ``_toknum``, this forms a partial mapping between token
 |          indices and file positions.
 |      :ivar _stream: The stream used to access the underlying corpus file.
 |      :ivar _len: The total number of tokens in the corpus, if known;
 |          or None, if the number of tokens is not yet known.
 |      :ivar _eofpos: The character position of the last character in the
 |          file. This is calculated when the corpus view is initialized,
 |          and is used to decide when the end of file has been reached.
 |      :ivar _cache: A cache of the most recently read block. It
 |          is encoded as a tuple (start_toknum, end_toknum, tokens), where
 |          start_toknum is the token index of the first token in the block;
 |          end_toknum is the token index of the first token not in the
 |          block; and tokens is a list of the tokens in the block.
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  __repr__(self)
 |
 |  abspath(self, fileid)
 |      Return the absolute path for the given file.
 |
 |      :type file: str
 |      :param file: The file identifier for the file whose path
 |          should be returned.
 |      :rtype: PathPointer
 |
 |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
 |      Return a list of the absolute paths for all fileids in this corpus;
 |      or for the given list of fileids, if specified.
 |
 |      :type fileids: None or str or list
 |      :param fileids: Specifies the set of fileids for which paths should
 |          be returned. Can be None, for all fileids; a list of
 |          file identifiers, for a specified set of fileids; or a single
 |          file identifier, for a single file. Note that the return
 |          value is always a list of paths, even if ``fileids`` is a
 |          single file identifier.
 |
 |      :param include_encoding: If true, then return a list of
 |          ``(path_pointer, encoding)`` tuples.
 |
 |      :rtype: list(PathPointer)
 |
 |  encoding(self, file)
 |      Return the unicode encoding for the given corpus file, if known.
 |      If the encoding is unknown, or if the given file should be
 |      processed using byte strings (str), then return None.
 |
 |  fileids(self)
 |      Return a list of file identifiers for the fileids that make up
 |      this corpus.
 |
 |  open(self, file, sourced=False)
 |      Return an open stream that can be used to read the given file.
 |      If the file's encoding is not None, then the stream will
 |      automatically decode the file's contents into unicode.
 |
 |      :param file: The file identifier of the file to read.
 |
 |  readme(self)
 |      Return the contents of the corpus README file, if it exists.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  root
 |      The directory where this corpus is stored.
 |
 |      :type: PathPointer
PlaintextCorpusReader provides many of the methods used in this article's examples, such as fileids() and words().
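The "block reader" idea from the CorpusView docstring can be tried without NLTK. The sketch below reuses the docstring's simple_block_reader on an in-memory stream. The helper read_token is a hypothetical name introduced here: it rescans from the start of the stream on every call, whereas the docstring describes a token-index/file-position map that lets the real view skip ahead. Treat it as an illustration of the concept, not NLTK's implementation.

```python
import io


def simple_block_reader(stream):
    """Read one line and return its whitespace-separated tokens (one block)."""
    return stream.readline().split()


def read_token(stream, index, block_reader=simple_block_reader):
    """Return the token at `index`, reading one block at a time.

    Hypothetical helper for illustration: it always starts at position 0
    instead of consulting a cached toknum/filepos mapping.
    """
    stream.seek(0)
    seen = 0  # number of tokens read so far
    while True:
        pos_before = stream.tell()
        block = block_reader(stream)
        if stream.tell() == pos_before:  # nothing was consumed: end of file
            raise IndexError(index)
        if index < seen + len(block):
            return block[index - seen]
        seen += len(block)


sample = io.StringIO("Emma by Jane Austen\n\nVOLUME I\nCHAPTER I\n")
print(simple_block_reader(sample))  # ['Emma', 'by', 'Jane', 'Austen']
print(read_token(sample, 4))        # VOLUME
```

A blank line yields an empty block, which is why the end-of-file check compares stream positions rather than testing for an empty list.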
1.1 fileids() returns the corpus's file identifiers
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
1.2 words() returns a file's list of words
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> gutenberg.words('austen-emma.txt')
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
>>> len(gutenberg.words('austen-emma.txt'))
192427
Use concordance() to search for a word in the text
>>> emma = nltk.Text(gutenberg.words('austen-emma.txt'))
>>> emma
<Text: Emma by Jane Austen 1816>
>>> emma.concordance('surperize')
Building index...
No matches
>>> emma.concordance('surprize')
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
expected by the best judges , for surprize -- but there was great joy . Mr .
sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
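To make the output above less of a black box, here is a minimal stdlib-only sketch of what a concordance does: find each case-insensitive match in a token list and show it with a fixed amount of context on both sides. The function name, the padding scheme, and the sample sentence are assumptions made for this example; NLTK's own ConcordanceIndex is not reproduced here.

```python
def concordance(tokens, word, width=79, lines=25):
    """Return up to `lines` strings showing `word` centred in its context.

    Matching ignores case; each row is roughly `width` characters wide.
    """
    target = word.lower()
    half = max(1, (width - len(word) - 2) // 2)  # context characters per side
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        # Every token is at least one character long, so `half` tokens on
        # each side always provide enough text to fill the context window.
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        rows.append("{:>{w}} {} {:<{w}}".format(left, tok, right, w=half))
    return rows[:lines]


tokens = "She was taken by surprize . What a Surprize it was !".split()
for row in concordance(tokens, "surprize", width=39):
    print(row)
```

Returning the rows instead of printing them keeps the function easy to test and reuse. That matches the advice in the Text docstring below to call analysis functions directly when writing a program.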
This example uses the nltk.Text class. Calling help() on it and reading through its methods shows how useful the class is.
class Text(__builtin__.object)
 |  A wrapper around a sequence of simple (string) tokens, which is
 |  intended to support initial exploration of texts (via the
 |  interactive console). Its methods perform a variety of analyses
 |  on the text's contexts (e.g., counting, concordancing, collocation
 |  discovery), and display the results. If you wish to write a
 |  program which makes use of these analyses, then you should bypass
 |  the ``Text`` class, and use the appropriate analysis function or
 |  class directly instead.
 |
 |  A ``Text`` is typically initialized from a given document or
 |  corpus. E.g.:
 |
 |  >>> import nltk.corpus
 |  >>> from nltk.text import Text
 |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
 |
 |  Methods defined here:
 |
 |  __getitem__(self, i)
 |
 |  __init__(self, tokens, name=None)
 |      Create a Text object.
 |
 |      :param tokens: The source text.
 |      :type tokens: sequence of str
 |
 |  __len__(self)
 |
 |  __repr__(self)
 |      :return: A string representation of this FreqDist.
 |      :rtype: string
 |
 |  collocations(self, num=20, window_size=2)
 |      Print collocations derived from the text, ignoring stopwords.
 |
 |      :seealso: find_collocations
 |      :param num: The maximum number of collocations to print.
 |      :type num: int
 |      :param window_size: The number of tokens spanned by a collocation (default=2)
 |      :type window_size: int
 |
 |  common_contexts(self, words, num=20)
 |      Find contexts where the specified words appear; list
 |      most frequent common contexts first.
 |
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.common_contexts()
 |
 |  concordance(self, word, width=79, lines=25)
 |      Print a concordance for ``word`` with the specified context window.
 |      Word matching is not case-sensitive.
 |      :seealso: ``ConcordanceIndex``
 |
 |  count(self, word)
 |      Count the number of times this word appears in the text.
 |
 |  dispersion_plot(self, words)
 |      Produce a plot showing the distribution of the words through the text.
 |      Requires pylab to be installed.
 |
 |      :param words: The words to be plotted
 |      :type word: str
 |      :seealso: nltk.draw.dispersion_plot()
 |
 |  findall(self, regexp)
 |      Find instances of the regular expression in the text.
 |      The text is a list of tokens, and a regexp pattern to match
 |      a single token must be surrounded by angle brackets. E.g.
 |
 |      >>> from nltk.book import text1, text5, text9
 |      >>> text5.findall("<.*><.*><bro>")
 |      you rule bro; telling you bro; u twizted bro
 |      >>> text1.findall("<a>(<.*>)<man>")
 |      monied; nervous; dangerous; white; white; white; pious; queer; good;
 |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
 |      pale; furious; better; certain; complete; dismasted; younger; brave;
 |      brave; brave; brave
 |      >>> text9.findall("<th.*>{3,}")
 |      thread through those; the thought that; that the thing; the thing
 |      that; that that thing; through these than through; them that the;
 |      through the thick; them that they; thought that the
 |
 |      :param regexp: A regular expression
 |      :type regexp: str
 |
 |  generate(self, length=100)
 |      Print random text, generated using a trigram language model.
 |
 |      :param length: The length of text to generate (default=100)
 |      :type length: int
 |      :seealso: NgramModel
 |
 |  index(self, word)
 |      Find the index of the first occurrence of the word in the text.
 |
 |  plot(self, *args)
 |      See documentation for FreqDist.plot()
 |      :seealso: nltk.prob.FreqDist.plot()
 |
 |  readability(self, method)
 |
 |  similar(self, word, num=20)
 |      Distributional similarity: find other words which appear in the
 |      same contexts as the specified word; list most similar words first.
 |
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.similar_words()
 |
 |  vocab(self)
 |      :seealso: nltk.prob.FreqDist
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
1.3 Differences between raw, sents, and words
The following example shows how raw(), sents(), and words() differ: raw() returns the file as one string, words() returns a list of word tokens, and sents() returns a list of sentences.
#!/usr/bin/env python
from __future__ import print_function   # keeps print() consistent on Python 2 and 3
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))     # number of characters (spaces included)
    num_words = len(gutenberg.words(fileid))   # number of word tokens
    num_sents = len(gutenberg.sents(fileid))   # number of sentences
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))   # number of distinct words
    print(num_chars // num_words, num_words // num_sents, num_words // num_vocab, fileid)

Output (columns: average word length, average number of words per sentence, average number of times each distinct word is used, file ID):
4 21 26 austen-emma.txt
4 23 16 austen-persuasion.txt
4 23 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 17 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 17 24 edgeworth-parents.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt
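The three ratios can be reproduced on a small in-memory string, which makes it easier to see what each one measures. The sketch below is stdlib-only and rests on two assumptions: words are split with the same regular expression that appears as the WordPunctTokenizer pattern in the help output above, and sentences are split naively after '.', '!' or '?' (NLTK uses a trained Punkt tokenizer instead). The name corpus_stats is made up for this example.

```python
import re


def corpus_stats(raw):
    """Return (chars per word, words per sentence, words per distinct word)."""
    # Word tokens: runs of word characters, or runs of punctuation.
    words = re.findall(r"\w+|[^\w\s]+", raw)
    # Naive sentence splitter (an assumption for this sketch, not Punkt).
    sents = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
    vocab = {w.lower() for w in words}
    return (len(raw) // len(words),
            len(words) // len(sents),
            len(words) // len(vocab))


text = "The cat sat. The cat ran! Did the dog run?"
print(corpus_stats(text))  # (3, 4, 1)
```

Because len(raw) counts whitespace as well as letters, the first ratio overstates the true average word length. The same applies to the script above.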
Find and display the longest sentence in shakespeare-macbeth.txt
#!/usr/bin/env python
from __future__ import print_function   # keeps print() consistent on Python 2 and 3
from nltk.corpus import gutenberg

macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')    # list of sentences
print(macbeth_sentences)
print(macbeth_sentences[1037])
longest_len = max(len(s) for s in macbeth_sentences)              # length of the longest sentence
print([s for s in macbeth_sentences if len(s) == longest_len])    # the longest sentence(s)

Output:
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
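The "find the maximum length, then filter" pattern works on any list of tokenized sentences, so it can be tried without downloading a corpus. The helper below (longest_sentences, a name invented here) is a small sketch of that pattern. It returns every sentence that ties for the maximum and copes with an empty input.

```python
def longest_sentences(sents):
    """Return all sentences (token lists) whose length equals the maximum."""
    longest_len = max((len(s) for s in sents), default=0)
    return [s for s in sents if len(s) == longest_len]


demo = [["Actus", "Primus", "."],
        ["Good", "night", ","],
        ["Exeunt", "."]]
print(longest_sentences(demo))  # both three-token sentences are returned
```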
1.4 The NPSChatCorpusReader class
Next comes another reader class. NLTK provides a second example object, nltk.corpus.nps_chat. Calling help() on it shows that its class derives from XMLCorpusReader, so it reads XML files.
nps_chat = class NPSChatCorpusReader(nltk.corpus.reader.xmldocs.XMLCorpusReader)
 |  Method resolution order:
 |      NPSChatCorpusReader
 |      nltk.corpus.reader.xmldocs.XMLCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |
 |  Methods defined here:
 ...
>>> from nltk.corpus import nps_chat
>>> nps_chat.fileids()
['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml']
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
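To show how an XML-backed reader can turn a file into lists of tokens like the post above, here is a sketch using only xml.etree.ElementTree. The XML layout in SAMPLE (Session, Posts, Post, terminals, t with a word attribute) is an assumption made for this example and has not been checked against the real NPS Chat files. The function name posts merely mirrors the method used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element and attribute names are assumed.
SAMPLE = """<Session>
  <Posts>
    <Post class="Greet" user="User7">hi all
      <terminals><t pos="UH" word="hi"/><t pos="DT" word="all"/></terminals>
    </Post>
    <Post class="Statement" user="User9">i don't
      <terminals>
        <t pos="PRP" word="i"/><t pos="VBP" word="do"/><t pos="RB" word="n't"/>
      </terminals>
    </Post>
  </Posts>
</Session>"""


def posts(xml_text):
    """Return one list of word strings per <Post> element."""
    root = ET.fromstring(xml_text)
    return [[t.get("word") for t in post.iter("t")]
            for post in root.iter("Post")]


print(posts(SAMPLE))  # [['hi', 'all'], ['i', 'do', "n't"]]
```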
1.5 The CategorizedTaggedCorpusReader class
This section uses the brown corpus object as an example of the CategorizedTaggedCorpusReader class.
>>> from nltk.corpus import brown
>>> help(brown)
class CategorizedTaggedCorpusReader(nltk.corpus.reader.api.CategorizedCorpusReader, TaggedCorpusReader)
 |  A reader for part-of-speech tagged corpora whose documents are
 |  divided into categories based on their file identifiers.
 |
 |  Method resolution order:
 |      CategorizedTaggedCorpusReader
 |      nltk.corpus.reader.api.CategorizedCorpusReader
 |      TaggedCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |
 |  Methods defined here:
 |
 |  __init__(self, *args, **kwargs)
 |      Initialize the corpus reader. Categorization arguments
 |      (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
 |      the ``CategorizedCorpusReader`` constructor. The remaining arguments
 |      are passed to the ``TaggedCorpusReader``.
 |
 |  paras(self, fileids=None, categories=None)
 |
 |  raw(self, fileids=None, categories=None)
 |
 |  sents(self, fileids=None, categories=None)
 |
 |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  words(self, fileids=None, categories=None)
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 |
 |  categories(self, fileids=None)
 |      Return a list of the categories that are defined for this corpus,
 |      or for the file(s) if it is given.
 |
 |  fileids(self, categories=None)
 |      Return a list of file identifiers for the files that make up
 |      this corpus, or that make up the given category(s) if specified.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  __repr__(self)
 |
 |  abspath(self, fileid)
 |      Return the absolute path for the given file.
 |
 |      :type file: str
 |      :param file: The file identifier for the file whose path
 |          should be returned.
 |      :rtype: PathPointer
 |
 |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
 |      Return a list of the absolute paths for all fileids in this corpus;
 |      or for the given list of fileids, if specified.
 |
 |      :type fileids: None or str or list
 |      :param fileids: Specifies the set of fileids for which paths should
 |          be returned. Can be None, for all fileids; a list of
 |          file identifiers, for a specified set of fileids; or a single
 |          file identifier, for a single file. Note that the return
 |          value is always a list of paths, even if ``fileids`` is a
 |          single file identifier.
 |
 |      :param include_encoding: If true, then return a list of
 |          ``(path_pointer, encoding)`` tuples.
 |
 |      :rtype: list(PathPointer)
 |
 |  encoding(self, file)
 |      Return the unicode encoding for the given corpus file, if known.
 |      If the encoding is unknown, or if the given file should be
 |      processed using byte strings (str), then return None.
 |
 |  open(self, file, sourced=False)
 |      Return an open stream that can be used to read the given file.
 |      If the file's encoding is not None, then the stream will
 |      automatically decode the file's contents into unicode.
 |
 |      :param file: The file identifier of the file to read.
 |
 |  readme(self)
 |      Return the contents of the corpus README file, if it exists.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  root
 |      The directory where this corpus is stored.
 |
 |      :type: PathPointer
Now look at the contents of brown: how to list the categories and files of the Brown corpus.
>>> from nltk.corpus import brown
>>> brown.categories()   # categories (genres) of the Brown corpus
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.fileids()[1:10]   # file IDs in the Brown corpus (items 1 through 9)
['ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']
>>> brown.words(categories='news')   # the 'news' category, tokenized into words
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])   # the file 'cg22', tokenized into words
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news','editorial','reviews'])   # several categories, split into sentences
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
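Count particular words within one genre of brown (here, modal verbs in the news genre):
from __future__ import print_function   # keeps print() consistent on Python 2 and 3
import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')              # the 'news' category, tokenized into words
fdist = nltk.FreqDist(w.lower() for w in news_text)     # frequency distribution of lowercased words
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')                   # count of each modal verb
Output:
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389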
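For simple counting like this, the standard library's collections.Counter behaves much like a frequency distribution: it maps each item to its count and returns 0 for items it has never seen. The sketch below uses it on a small inline token list. The function name modal_counts and the sample sentence are invented for illustration, and this is not a replacement for nltk.FreqDist's other features.

```python
from collections import Counter


def modal_counts(tokens, modals):
    """Count how often each word in `modals` occurs in `tokens`, ignoring case."""
    fdist = Counter(w.lower() for w in tokens)
    return {m: fdist[m] for m in modals}  # missing words count as 0


tokens = "Can you come ? I can , and I will ; she may not .".split()
print(modal_counts(tokens, ["can", "may", "must", "will"]))
# {'can': 2, 'may': 1, 'must': 0, 'will': 1}
```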
Count the same words across several genres at once:
import nltk
from nltk.corpus import brown

# One (genre, word) pair for every word in every category
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

Output:
                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13
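A conditional frequency distribution is, in essence, one counter per condition. The stdlib-only sketch below builds that structure from (condition, sample) pairs and prints a small table. The helper names conditional_freq and tabulate, the column width, and the toy data are assumptions for this example rather than NLTK's implementation. A condition that was never observed simply yields a row of zeros.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Build {condition: Counter(sample -> count)} from (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


def tabulate(cfd, conditions, samples):
    """Format the counts as a table: one row per condition, one column per sample."""
    width = max(len(c) for c in conditions)
    lines = [" " * width + "".join("{:>6}".format(s) for s in samples)]
    for c in conditions:
        counts = cfd.get(c, Counter())  # unknown condition: all zeros
        cells = "".join("{:>6}".format(counts[s]) for s in samples)
        lines.append("{:>{w}}".format(c, w=width) + cells)
    return "\n".join(lines)


pairs = [("news", "will"), ("news", "can"), ("news", "will"), ("romance", "could")]
cfd = conditional_freq(pairs)
print(tabulate(cfd, ["news", "romance", "new"], ["can", "could", "will"]))
```

Using cfd.get() rather than cfd[c] in tabulate avoids adding empty entries to the defaultdict as a side effect of printing.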
1.6 The CategorizedPlaintextCorpusReader class
Unlike brown (CategorizedTaggedCorpusReader), the categories in reuters (CategorizedPlaintextCorpusReader) overlap, because a single news story often covers several topics. You can look up the topics covered by one or more documents, or the documents that belong to one or more categories.
>>> from nltk.corpus import reuters
>>> reuters.fileids()[1:10]
['test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865','training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.categories('training/9880')
['money-fx']
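For comparison, similar calls on brown. Note that categories() takes file IDs and fileids() takes category names, so passing the wrong kind of argument matches nothing and returns an empty list:
>>> from nltk.corpus import brown
>>> brown.categories(['news','reviews'])   # category names passed where file IDs are expected
[]
>>> brown.fileids(['cr05','cr06'])   # file IDs passed where category names are expected
[]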
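The two lookup directions (file IDs to categories, and categories to file IDs) can be sketched with a plain dictionary. The mapping FILE_CATS below copies the two Reuters documents from the session above. The function names and the "accept a string or a list" convenience are assumptions for this example, not NLTK's code.

```python
# Category lists for two documents, taken from the reuters session above.
FILE_CATS = {
    "training/9865": ["barley", "corn", "grain", "wheat"],
    "training/9880": ["money-fx"],
}


def _as_list(value):
    """Allow a single string wherever a list of strings is accepted."""
    return [value] if isinstance(value, str) else list(value)


def categories_of(file_cats, fileids):
    """Sorted union of the categories attached to the given file IDs."""
    found = set()
    for fileid in _as_list(fileids):
        found.update(file_cats.get(fileid, []))  # unknown IDs contribute nothing
    return sorted(found)


def fileids_in(file_cats, categories):
    """Sorted file IDs that carry at least one of the given categories."""
    wanted = set(_as_list(categories))
    return sorted(f for f, cats in file_cats.items() if wanted & set(cats))


print(categories_of(FILE_CATS, ["training/9865", "training/9880"]))
print(fileids_in(FILE_CATS, "grain"))
print(categories_of(FILE_CATS, ["grain"]))  # a category name is not a file ID: []
```

The last call mirrors the brown example: passing a category name where a file ID is expected matches nothing, so the result is an empty list.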
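1.7 Basic corpus functions
Example | Description |
fileids() | The files of the corpus |
fileids([categories]) | The files of the corpus that belong to these categories |
categories() | The categories of the corpus |
categories([fileids]) | The categories of the corpus that these files belong to |
raw() | The raw content of the corpus |
raw(fileids=[f1,f2,f3]) | The raw content of the specified files |
raw(categories=[c1,c2]) | The raw content of the specified categories |
words() | The words of the whole corpus |
words(fileids=[f1,f2,f3]) | The words of the specified files |
words(categories=[c1,c2]) | The words of the specified categories |
sents() | The sentences of the whole corpus |
sents(fileids=[f1,f2,f3]) | The sentences of the specified files |
sents(categories=[c1,c2]) | The sentences of the specified categories |
abspath(fileid) | The location of the given file on disk |
encoding(fileid) | The encoding of the file (if known) |
open(fileid) | Open a stream for reading the given corpus file |
root | The path to the root directory of the locally installed corpus |
readme() | The contents of the corpus's README file |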
1.8 Loading your own corpus
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/Users/rcf/workspace/python/python_test/NLP_WITH_PYTHON/chapter_2'
>>> wordlist = PlaintextCorpusReader(corpus_root, '.*')   # corpus_root: corpus directory; '.*': regexp selecting which files to include
>>> wordlist.fileids()
['1.py', '2.py', '3.py', '4.py']
>>> wordlist.words('3.py')
['from', 'nltk', '.', 'corpus', 'import', 'brown', ...]
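The two ingredients of this example, selecting files under a root directory with a regular expression and then tokenizing one of them, can be sketched with the standard library alone. The function names find_fileids and words are made up here. The sketch assumes the pattern must match the whole relative path, and it reuses the WordPunctTokenizer pattern shown in the help output in section 1. It writes its own temporary files so it runs anywhere.

```python
import os
import re
import tempfile


def find_fileids(root, pattern):
    """Return sorted '/'-separated paths under `root` that fully match `pattern`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")
            if re.fullmatch(pattern, rel):
                found.append(rel)
    return sorted(found)


def words(root, fileid):
    """Tokenize one file into runs of word characters or runs of punctuation."""
    with open(os.path.join(root, fileid), encoding="utf-8") as fh:
        return re.findall(r"\w+|[^\w\s]+", fh.read())


with tempfile.TemporaryDirectory() as corpus_root:
    with open(os.path.join(corpus_root, "3.py"), "w", encoding="utf-8") as fh:
        fh.write("from nltk.corpus import brown\n")
    print(find_fileids(corpus_root, r".*"))  # ['3.py']
    print(words(corpus_root, "3.py"))
    # ['from', 'nltk', '.', 'corpus', 'import', 'brown']
```

A narrower pattern such as r'.*\.txt' would restrict the corpus to text files, which is usually preferable to '.*' when the directory also contains scripts.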