Corpora and Python: what is the difference between a corpus and a lexicon in NLTK (Python)?

Can someone tell me the difference between corpora, a corpus, and a lexicon in NLTK?

What is the movie review dataset?

What is WordNet?

Solution

Corpora is the plural of corpus.

Corpus basically means a body, and in the context of Natural Language Processing (NLP), it means a body of text.

A lexicon is a vocabulary: a list of words, a dictionary (source: https://www.google.com.sg/search?q=lexicon).

In NLTK, any lexicon is considered a corpus, since a list of words is also a body of text. For example, the list of English stopwords is available through the NLTK corpus API:

>>> from nltk.corpus import stopwords

>>> print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

The movie review dataset in NLTK (canonically known as the Movie Reviews Corpus) is a collection of 2,000 movie reviews labeled with sentiment polarity, 1,000 positive and 1,000 negative (source: http://www.nltk.org/book/ch02.html).

It is often used in introductory tutorials on NLP and sentiment analysis; see http://www.nltk.org/book/ch06.html and nltk NaiveBayesClassifier training for sentiment analysis.

WordNet is a lexical database for the English language; it is like a lexicon/dictionary, but with word-to-word relations (source: https://wordnet.princeton.edu/).

NLTK also incorporates the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), which allows you to query words in other languages.

Since WordNet is also a list of words (in this case with many other things included: relations, lemmas, POS tags, etc.), it is likewise accessed through nltk.corpus in NLTK.

The canonical idiom for using WordNet in NLTK is:

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('dog')

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

The easiest way to learn the NLP jargon and the basics is to work through the tutorials in the NLTK book: http://www.nltk.org/book/
