Corpora and Python: what is the difference between a corpus and a lexicon in NLTK (Python)?

Can someone tell me the difference between corpora, a corpus, and a lexicon in NLTK?

What is the movie review dataset?

What is WordNet?

Solution

Corpora is the plural of corpus.

Corpus basically means a body, and in the context of Natural Language Processing (NLP), it means a body of text.

A lexicon is a vocabulary: a list of words, a dictionary (source: https://www.google.com.sg/search?q=lexicon).

In NLTK, any lexicon is considered a corpus, since a list of words is also a body of text. For example, the list of English stopwords is available through the NLTK corpus API:

>>> from nltk.corpus import stopwords

>>> print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

The movie review dataset in NLTK (canonically known as the Movie Reviews Corpus) is a collection of 2,000 movie reviews labeled with sentiment polarity, 1,000 positive and 1,000 negative (source: http://www.nltk.org/book/ch02.html).

It is often used in introductory tutorials on NLP and sentiment analysis; see http://www.nltk.org/book/ch06.html and nltk NaiveBayesClassifier training for sentiment analysis.

WordNet is a lexical database for the English language; it is like a lexicon/dictionary, but with word-to-word relations (source: https://wordnet.princeton.edu/).

NLTK also incorporates the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), which allows you to query words in other languages.

Since WordNet is also a list of words (in this case with many other things included: relations, lemmas, POS tags, etc.), it is likewise accessed through nltk.corpus in NLTK.

The canonical idiom for using WordNet in NLTK is:

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('dog')

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

The easiest way to learn the NLP jargon and the basics is to work through the tutorials in the NLTK book: http://www.nltk.org/book/
