Python and Natural Language Processing - Reading Notes 2

Control structures

The if control structure
The for loop

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print(xyzzy)
...
Call
Ishmael
>>>
>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky: # loop over the words containing 'cie' or 'cei'
...     print(word, end=' ') # end=' ' prints the words separated by spaces rather than newlines
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>

Natural language understanding

Mainly an introduction to:
word sense disambiguation
pronoun resolution
generating language output, including question answering and machine translation

Chapter 2

Source URL
The chapter sets out to answer the following questions:

What are some useful text corpora and lexical resources, and how can we access them with Python?
Which Python constructs are most helpful for this work?
How do we avoid repeating ourselves when writing Python code?

Task: write a program that computes the average word length, the average sentence length, and the average number of times each word appears (a lexical diversity score).

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid)) # raw() gives the file contents before any linguistic processing
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     print(int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)

1. The Gutenberg corpus, representative of the Literature genre

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()

Accessing the corpus and running a concordance: this differs from Chapter 1, where text1.concordance() could be called directly. Code below:

# Method 1: access the corpus via nltk.corpus.gutenberg.words()
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
# Method 2: import the gutenberg corpus, then call it directly
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

Web text

This section has nothing to do with web crawling; it covers the web text collections included in NLTK.
Source:

NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65], '...')
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]

The Brown Corpus

List of genres

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.

>>> from nltk.corpus import brown # load the corpus
>>> news_text = brown.words(categories='news') # select the news category and get its words
>>> fdist = nltk.FreqDist(w.lower() for w in news_text) # frequency distribution, case-insensitive (every word lowercased)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will'] # the modal verbs to search for, as a list
>>> for m in modals: # loop over the modal verbs
...     print(m + ':', fdist[m], end=' ') # end=' ' keeps all the results on one line
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Exercise: pick any genre and count the frequencies of the wh-words.

>>> humor_text = brown.words(categories="humor")
>>> fdist = nltk.FreqDist(w.lower() for w in humor_text)
>>> wh_words = ['what', 'when', 'where', 'who', 'why']
>>> for x in wh_words:
...     print(x + ':', fdist[x], end=' ')
...

Extended exercise: count the use of modal verbs in student compositions.

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r"D:\composition" # directory containing the student compositions
>>> scomposition = PlaintextCorpusReader(corpus_root, '.*')
>>> mytest = scomposition.words()
>>> fdist = nltk.FreqDist([w.lower() for w in mytest])
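
To complete the exercise, the same modal-verb loop used above for the Brown news category can be applied to this distribution (a minimal sketch, reusing the fdist just built):

>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')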

Conditional frequency distributions: frequency counts for particular words within particular genres. This is pretty cool!!!

>>> cfd = nltk.ConditionalFreqDist((genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals) # conditions and samples restrict which rows (genres) and columns (words) are tabulated, e.g. conditions=['news', 'religion']
                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13

The Reuters Corpus

Corpus overview: news texts, token count, classification scheme. The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called “training” and “test”.
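
A minimal sketch of how this corpus can be queried in NLTK; the document id 'training/9865' and the topic 'barley' are illustrative values taken from the book's examples:

>>> from nltk.corpus import reuters
>>> reuters.fileids()        # document ids, split between 'test/...' and 'training/...'
>>> reuters.categories()     # the 90 topic labels
>>> reuters.categories('training/9865')    # topics assigned to a single document
>>> reuters.words(categories='barley')     # words from all documents on a given topic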

Inaugural Address Corpus

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]
>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target)) # w.lower() converts the word to lowercase; startswith() checks whether it begins with one of the two targets above
>>> cfd.plot()

Annotated Text Corpora

http://nltk.org/data (data download)
http://nltk.org/howto

1.9 Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict' # set the location of the corpus directory
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') # read the file list; the second argument is a regular expression (see http://www.nltk.org/book/ch03.html#sec-regular-expressions-word-patterns)
>>> wordlists.fileids() # list the file names
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives') # list the words in one of the files
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

Question: which file formats does this kind of loading support? Can Excel or Word files be used?

A downloaded copy of the Penn Treebank can be loaded with the BracketParseCorpusReader tool:

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj" 
>>> file_pattern = r".*/wsj_.*\.mrg" 
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]

Functions

A function is just a named block of code that performs some well-defined task, as we saw in Chapter 1. A function is usually defined to take some inputs, using special variables known as parameters, and it may produce a result, also known as a return value.

For example, a function that turns a word into its plural form:

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
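
A quick check of the function (these two calls and their results follow directly from the rules above):

>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'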

Modules

Question: I did not fully understand this section.

It makes life a lot easier if you can collect your work into a single place, and access previously defined functions without making copies.

A collection of variable and function definitions in a file is called a Python module. A collection of related modules is called a package.
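
A minimal sketch of what this means in practice (the file name textproc.py is just an example, not prescribed by the book): save the plural() function from above in its own file, and other code can then reuse it without copying it.

# textproc.py  (example module holding the plural() function defined earlier)
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

Then, from the interpreter in the same directory:

>>> from textproc import plural
>>> plural('wish')
'wishes'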

4. Lexical resources

4.1 Wordlist corpora

# Natural Language Toolkit: code_unusual
# Find the words of a text that are not in the standard English wordlist
# (unusual or possibly misspelled words).
import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses',
'accents', 'accepting', 'accommodations', 'accompanied', 'accounted', 'accounts',
'accustomary', 'aches', 'acknowledging', 'acknowledgment', 'acknowledgments', ...]
>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack',
'acros', 'actualy', 'adams', 'adds', 'adduser', 'adjusts', 'adoted', 'adreniline',
'ads', 'adults', 'afe', 'affairs', 'affari', 'affects', 'afk', 'agaibn', 'ages', ...]

The stopwords corpus

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
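
A common use of the stopword list, along the lines of the book's example: compute what fraction of a text's tokens are content words, i.e. not stopwords (a minimal sketch; nltk is assumed to be imported and the reuters corpus downloaded):

>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())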

Study the following code.
Names that appear in both the male and female name lists:

>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]

A conditional frequency distribution over the last letter of each name:

>>> cfd = nltk.ConditionalFreqDist(
...           (fileid, name[-1])
...           for fileid in names.fileids()
...           for name in names.words(fileid))
>>> cfd.plot()

4.2 A pronouncing dictionary

Skipped; not read.

4.3 Comparative wordlists

4.4 Shoebox and Toolbox Lexicons

5 WordNet

Further reading

Significant sources of published corpora are the Linguistic Data Consortium (LDC) and the European Language Resources Agency (ELRA). Hundreds of annotated text and speech corpora are available in dozens of languages. Non-commercial licences permit the data to be used in teaching and research. For some corpora, commercial licenses are also available (but for a higher fee).

A good tool for creating annotated text corpora is called Brat, available from http://brat.nlplab.org/.

These and many other language resources have been documented using OLAC Metadata, and can be searched via the OLAC homepage at http://www.language-archives.org/. Corpora List is a mailing list for discussions about corpora, and you can find resources by searching the list archives or posting to the list.

The most complete inventory of the world's languages is Ethnologue, http://www.ethnologue.com/. Of 7,000 languages, only a few dozen have substantial digital resources suitable for use in NLP.

This chapter has touched on the field of Corpus Linguistics. Other useful books in this area include (Biber, Conrad, & Reppen, 1998), (McEnery, 2006), (Meyer, 2002), (Sampson & McCarthy, 2005), (Scott & Tribble, 2006). Further readings in quantitative data analysis in linguistics are: (Baayen, 2008), (Gries, 2009), (Woods, Fletcher, & Hughes, 1986).
