Chapter 2: Accessing Text Corpora and Lexical Resources
What are useful text corpora and lexical resources, and how do we access them?
Which Python constructs are well suited to this work?
How do we write code that avoids repeating work?
Contents
2.1 Accessing text corpora
- Gutenberg Corpus: listing files, selecting texts, concordance, raw()/words()/sents(), vocabulary
- Web and chat text: webtext, nps_chat
- Brown Corpus: categories, comparing modal verbs across genres
- Reuters Corpus: training/test split, querying overlapping categories
- Inaugural Address Corpus
- Universal Declaration of Human Rights Corpus (udhr)
- Function reference; loading your own corpus
2.2 Conditional frequency distributions: ConditionalFreqDist, counting conditions, tabulate()/plot(), generating text with bigrams
Reusing Python code
2.1 Accessing Text Corpora
Gutenberg Corpus
1. Listing the files in the corpus: nltk.corpus.gutenberg.fileids()
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
2. Selecting a text from the corpus: emma=nltk.corpus.gutenberg.words("austen-emma.txt")
>>> emma=nltk.corpus.gutenberg.words("austen-emma.txt")
>>> len(emma)
192427
There is also a simpler way, which requires changing what is imported:
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> emma=gutenberg.words("austen-emma.txt")
>>> len(emma)
192427
3. Investigating the text as in Chapter 1: emma=nltk.Text(emma); emma.concordance("surprise")
>>> emma=gutenberg.words("austen-emma.txt")
>>> emma.concordance("surprise")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'StreamBackedCorpusView' object has no attribute 'concordance'
##As shown above, trying to work with the text exactly as in Chapter 1 raises an error, so the word list must first be wrapped with nltk.Text()
>>> emma=nltk.Text(emma)
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
that Emma could not but feel some surprise , and a little displeasure , on he
4. Getting the characters of a text and their count: gutenberg.raw("austen-emma.txt")
>>> gutenberg.raw("austen-emma.txt")
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress...'
>>> len(gutenberg.raw("austen-emma.txt"))
887071
Note that what we get here is not a list but a string, with spaces and newlines included, whereas text1-text9 and sent1-sent8 in Chapter 1 are lists and contain no raw whitespace.
Note: the length of a string is the number of characters it contains, including newlines and spaces.
5. Getting the words of a text and their count: gutenberg.words("austen-emma.txt")
>>> gutenberg.words("austen-emma.txt")
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
>>> len(gutenberg.words("austen-emma.txt"))
192427
6. Getting the sentences of a text and their count: gutenberg.sents("austen-emma.txt")
>>> gutenberg.sents("austen-emma.txt")
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]
>>> len(gutenberg.sents("austen-emma.txt"))
7752
7. Getting the vocabulary of a text and its size: [w.lower() for w in gutenberg.words("austen-emma.txt")]
>>> [w.lower() for w in gutenberg.words("austen-emma.txt")]
['[', 'emma', 'by', 'jane', 'austen', '1816', ']', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with',...]
>>> len(set([w.lower() for w in gutenberg.words("austen-emma.txt")]))
7344
(Case-insensitive: every word is lowercased before the set is built.)
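The four kinds of access above (raw characters, words, sentences, and the lowercased vocabulary) can be combined into a per-text summary. The following is a small sketch, using only the gutenberg functions already shown, that prints average word length, average sentence length, and the number of tokens per vocabulary item for every file:
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))    # characters, including spaces and newlines
...     num_words = len(gutenberg.words(fileid))  # tokens
...     num_sents = len(gutenberg.sents(fileid))  # sentences
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # case-insensitive vocabulary
...     print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
...
For austen-emma.txt the counts shown above give roughly 4.6 characters per word (887071/192427), 24.8 words per sentence (192427/7752), and 26.2 tokens per vocabulary item (192427/7344).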
Web and Chat Text
1. Accessing the web text corpus: from nltk.corpus import webtext; for fileid in webtext.fileids(): ...
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
... print(fileid,webtext.raw(fileid)[:50])
...
firefox.txt Cookie Manager: "Don't allow sites that set remove
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Who
overheard.txt White guy: So, do you have any plans for this even
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted
singles.txt 25 SEXY MALE, seeks attrac older single lady, for
wine.txt Lovely delicate, fragrant Rhone wine. Polished lea
>>>
2. Accessing the instant messaging chat corpus (NPS Chat, originally collected for research on detecting Internet predators): from nltk.corpus import nps_chat; chatroom=nps_chat.posts("10-19-20s_706posts.xml")
>>> from nltk.corpus import nps_chat
>>> chatroom=nps_chat.posts("10-19-20s_706posts.xml")
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
###The corpus is organized by date and age group: 10-19-20s_706 means 706 posts collected on 10/19/2006 from the chat room for people in their 20s
Brown Corpus
A corpus of roughly a million words containing news, editorials, literature, and other genres. The Brown Corpus is a good resource for studying systematic differences between genres.
1. Basic operations on the corpus
>>> from nltk.corpus import brown
>>> brown.categories() # list the available categories
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories="news")
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=["cg22"]) #cg22是纯文学文体的标号
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=["news","editorial","reviews"])
[['The', 'Fulton', 'County', ...],['The', 'jury', 'further', ...],...]
2. Comparing modal verb usage across genres
>>> news_text=brown.words(categories="news")
>>> fiction_text=brown.words(categories="fiction")
>>> from nltk import FreqDist
>>> fdist1=FreqDist([w.lower() for w in news_text])
>>> fdist2=FreqDist([w.lower() for w in fiction_text])
>>> modals=["can","could","may","might","must","will"]
>>> for m in modals:
... print(m+" num in news_text:"+str(fdist1[m])+" freq:"+str(100*fdist1[m]/len(news_text)))
... print(m+" num in fiction_text:"+str(fdist2[m])+" freq:"+str(100*fdist2[m]/len(fiction_text)))
...
can num in news_text:94 freq:0.09348210911550013
can num in fiction_text:39 freq:0.05694428221002219
could num in news_text:87 freq:0.08652067545796288
could num in fiction_text:168 freq:0.24529844644317253
may num in news_text:93 freq:0.09248761859299481
may num in fiction_text:10 freq:0.014601098002569793
might num in news_text:38 freq:0.03779063985520218
might num in fiction_text:44 freq:0.06424483121130709
must num in news_text:53 freq:0.05270799769278199
must num in fiction_text:55 freq:0.08030603901413386
will num in news_text:389 freq:0.38685681325456966
will num in fiction_text:56 freq:0.08176614881439084
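The same comparison can be expressed much more compactly with a conditional frequency distribution, which is introduced in section 2.2 below. A sketch (lowercasing the words so the counts match the loop above):
>>> from nltk import ConditionalFreqDist
>>> from nltk.corpus import brown
>>> cfd = ConditionalFreqDist(
...     (genre, word.lower())
...     for genre in ["news", "fiction"]
...     for word in brown.words(categories=genre))
>>> modals = ["can", "could", "may", "might", "must", "will"]
>>> cfd.tabulate(conditions=["news", "fiction"], samples=modals)
The printed table should contain the same counts as the loop above, e.g. 94 occurrences of can in news and 168 of could in fiction.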
Reuters Corpus
It consists entirely of news documents and is split into "test" and "training" sets, so it can be used to train and evaluate algorithms.
1. Accessing the corpus
>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840','test/14841','test/14842','test/14843',...,'training/8210','training/8211','training/8212',...]
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
2. Querying the corpus
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865','training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', 'test/18024', 'test/18263', 'test/18908', 'test/19275', 'test/19668', 'training/10175', 'training/1067', 'training/11208', 'training/11316', 'training/11885', 'training/12428', 'training/13099', 'training/13744', 'training/13795', 'training/13852', 'training/13856', 'training/1652', 'training/1970', 'training/2044', 'training/2171', 'training/2172', 'training/2191', 'training/2217', 'training/2232', 'training/3132', 'training/3324', 'training/395', 'training/4280', 'training/4296', 'training/5', 'training/501', 'training/5467', 'training/5610', 'training/5640', 'training/6626', 'training/7205', 'training/7579', 'training/8213', 'training/8257', 'training/8759', 'training/9865', 'training/9958']
>>> reuters.fileids(['barley','corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648', 'test/15649', 'test/15676', 'test/15686', 'test/15720', 'test/15728', 'test/15845', 'test/15856', 'test/15860', 'test/15863', 'test/15871', 'test/15875', ..., 'training/9865', 'training/9958', 'training/9989']
Because the Reuters news categories overlap, a query may ask for the documents that belong to one or more categories, and likewise for the categories covered by one or more documents.
3. Viewing the words of categories or documents
>>> reuters.words(categories=['barley','corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]
>>> reuters.words(['training/9865','training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
Inaugural Address Corpus
Viewing the corpus
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']
Universal Declaration of Human Rights Corpus
It contains the Universal Declaration of Human Rights in more than 300 languages; the fileids include the character encoding used, such as UTF8 or Latin1.
Comparing the letter frequency distribution for one of the languages:
>>> from nltk.corpus import udhr
>>> from nltk import FreqDist
>>> raw_text=udhr.raw("English-Latin1")
>>> FreqDist(raw_text).plot()
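Beyond letter frequencies within a single language, the word-length distributions of several languages can be compared with a conditional frequency distribution (see section 2.2 below). A small sketch, assuming the files "English-Latin1" and "German_Deutsch-Latin1" are present in the udhr corpus:
>>> from nltk import ConditionalFreqDist
>>> from nltk.corpus import udhr
>>> languages = ["English-Latin1", "German_Deutsch-Latin1"]
>>> cfd = ConditionalFreqDist(
...     (lang, len(word))          # condition = language, event = word length
...     for lang in languages
...     for word in udhr.words(lang))
>>> cfd.plot(cumulative=True)      # cumulative word-length distribution per language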
Corpus function reference
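A summary of the corpus reader functions demonstrated in this chapter (most also accept fileids= or categories= arguments, as shown above):
- fileids(): the files of the corpus
- categories(): the categories of the corpus, for categorized corpora such as brown and reuters
- raw(fileid): the contents of a file as a single string
- words(fileid): the words of a file as a list of strings
- sents(fileid): the sentences of a file as a list of word lists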
Loading your own corpus
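A minimal sketch of loading your own corpus, assuming your plain-text files live in a directory such as /home/user/my_corpus (a hypothetical path):
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = "/home/user/my_corpus"          # hypothetical directory containing .txt files
>>> my_corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")
>>> my_corpus.fileids()                           # lists the .txt files found under corpus_root
>>> my_corpus.words(my_corpus.fileids()[0])       # the usual words()/sents()/raw() methods then apply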
2.2 Conditional Frequency Distributions
Generally used to compare texts under different conditions (genre, time period, category, etc.) with respect to different events (word length, modal verb frequency, etc.). A tiny example follows this list.
- ConditionalFreqDist: its input is a list of pairs, where the first element of each pair is the condition and the second is the event,
e.g. [("news","aa"),("news","bb"),("news","cc"),("news","dd"),("fiction","aa"),("fiction","bb"),("fiction","cc"),("fiction","dd")]
- FreqDist: its input is a plain list,
e.g. ["aa","bb","cc","dd"]
1. Generating the pairs: genre_word=[(genre,word) for genre in ["news","romance"] for word in brown.words(categories=genre)]
>>> genre_word=[(genre,word)
... for genre in ["news","romance"]
... for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
2. Creating a ConditionalFreqDist from the list of pairs: cfd=ConditionalFreqDist(genre_word)
>>> from nltk import *
>>> cfd=ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
>>> cfd["news"]
FreqDist({'the': 5580, ',': 5188, '.': 4030, 'of': 2849, 'and': 2146, 'to': 2116, 'a': 1993, 'in': 1893, 'for': 943, 'The': 806, ...})
>>> cfd["romance"]
FreqDist({',': 3899, '.': 3736, 'the': 2758, 'and': 1776, 'to': 1502, 'a': 1335, 'of': 1186, '``': 1045, "''": 1044, 'was': 993, ...})
>>> list(cfd["romance"])
[',', '.', 'the', 'and', 'to', 'a', 'of',...]
>>> cfd["romance"]["could"]
193
As the output shows, a conditional frequency distribution built from (condition, event) pairs simply computes, for each condition, the frequency of each event, as in the table below.

|         | Condition 1   | Condition 2   |
|---------|---------------|---------------|
| Event 1 | frequency 1.1 | frequency 2.1 |
| Event 2 | frequency 1.2 | frequency 2.2 |
| Event 3 | frequency 1.3 | frequency 2.3 |
| Event 4 | frequency 1.4 | frequency 2.4 |
| Event 5 | frequency 1.5 | frequency 2.5 |
| Event 6 | frequency 1.6 | frequency 2.6 |

Taking any single condition on its own gives an ordinary FreqDist, as the code above shows.
3. Adding a counting condition (with an if clause)
>>> cdf=ConditionalFreqDist(
... (target,file[:4])
... for target in ['america','citizen']
... for file in inaugural.fileids()
... for w in inaugural.words(file)
... if w.lower().startswith(target))
>>> cdf["america"]
FreqDist({'1993': 33, '1997': 31, '2005': 30, '1921': 24, '1973': 23, '1985': 21, '2001': 20, '1981': 16, '2009': 15, '1909': 12, ...})
>>> cdf["citizen"]
FreqDist({'1841': 38, '1821': 15, '1817': 14, '1885': 13, '1889': 12, '1929': 12, '1845': 11, '2001': 11, '1805': 10, '1893': 10, ...})
As shown, the event can be renamed where the pairs are defined (here file[:4], i.e. the year), and an if clause can be added on top of the conditions so that only items satisfying it are counted.
4. Plotting and tabulating distributions
In the plot() and tabulate() methods, the conditions= parameter specifies which conditions are displayed and the samples= parameter limits which samples (events) are displayed; this avoids overly large tables or plots.
>>> cdf.tabulate(samples=["1988","1989","1990"])
1988 1989 1990
america 0 11 0
citizen 0 3 0
>>> cdf.tabulate(samples=["1988","1989","1990","2000","2001"],cumulative=True)
1988 1989 1990 2000 2001
america 0 11 11 11 31
citizen 0 3 3 3 14
>>> cdf.plot(samples=["1988","1989","1990","2000","2001"],cumulative=True)
>>> cdf.plot()
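The conditions= parameter mentioned above works the same way; a short sketch restricting the table to a single condition (the expected counts, 33, 31, and 30, can be read off the cdf["america"] output shown earlier):
>>> cdf.tabulate(conditions=["america"], samples=["1993","1997","2005"])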
5. Generating random text with bigrams
The idea: bigrams() produces word pairs (word1, word2). Applying it to a text yields all of that text's bigrams; counting them with word1 as the condition and word2 as the event gives, for each word, the frequency distribution of the words that follow it. In the text-generation model below, num is the length of the generated text and word is the start word; at each step the most frequent follower of the current word is chosen. For example, the words following living are {'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1}; creature is the most frequent, so in the generated text living is followed by creature.
>>> def generate_model(cfdist,word,num=10):
... for i in range(num):
... print(word)
... word=cfdist[word].max()
...
>>> from nltk import bigrams
>>> from nltk.corpus import genesis
>>> text=genesis.words("english-kjv.txt")
>>> word_pairs=bigrams(text)          # (word1, word2) pairs; avoid reusing the name "bigrams"
>>> cfd=ConditionalFreqDist(word_pairs)
>>> cfd["living"]
FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1})
>>> generate_model(cfd,"living")
living
creature
that
he
said
,
and
the
land
of
Reusing Python Code
1. Use a text editor rather than the interactive prompt to write programs.
2. Create functions so that functionality can be reused.
3. Create modules that collect related functions; later, whenever a function is needed, import it from the module, and modify it there as well (e.g. a file func.py, then from func import function1). A sketch follows this list.
4. Variables and functions are grouped into modules, modules into packages, and packages into libraries.
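A minimal sketch of point 3, using a hypothetical module file text_utils.py:

# text_utils.py (hypothetical module)
def lexical_diversity(text):
    """Number of tokens per distinct (lowercased) word type."""
    return len(text) / len(set(w.lower() for w in text))

Any script or interactive session can then reuse it:
>>> from text_utils import lexical_diversity
>>> from nltk.corpus import gutenberg
>>> lexical_diversity(gutenberg.words("austen-emma.txt"))   # roughly 26.2, matching the counts in section 2.1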