[Text_Mining]notes_2

An introduction to NLTK

NLTK: Natural Language Toolkit

An open-source library in Python

 

>>>import nltk

 

Frequency of words

>>>from nltk.book import text7

>>>dist = nltk.FreqDist(text7)

>>>len(dist)

>>>vocab1 = dist.keys()

>>>freqwords = [w for w in vocab1 if len(w)>5 and dist[w]>100]
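As a minimal sketch of the same idea on data that needs no corpus download, the snippet below builds a FreqDist from a small hand-made token list and applies the same kind of length and count filter. The tokens and the thresholds are made up for illustration and are not from text7.

```python
from nltk import FreqDist

# Hypothetical toy tokens standing in for text7.
tokens = ("the market rose while the dollar fell and the market "
          "closed higher as investors watched the dollar").split()

dist = FreqDist(tokens)   # maps each distinct token to its count
vocab = dist.keys()       # the distinct tokens

# Keep tokens longer than 5 characters that occur more than once.
freqwords = sorted(w for w in vocab if len(w) > 5 and dist[w] > 1)

print(len(dist))      # number of distinct tokens
print(dist['the'])    # count of one token
print(freqwords)
```

On a real corpus the thresholds would be much larger, as in the len(w)>5 and dist[w]>100 filter above.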

 

Normalization and Stemming

Different forms of the same ‘word’

>>>input1 = 'List listed lists listing listings'

>>>words1 = input1.lower().split(' ')

Stemming finds the root form of a given word

>>>porter = nltk.PorterStemmer()

>>>[porter.stem(t) for t in words1]
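Put together as one script, the normalization and stemming steps might look like the sketch below. It only uses the Porter stemmer, which ships with NLTK and needs no extra data.

```python
from nltk.stem import PorterStemmer

input1 = 'List listed lists listing listings'

# Normalization: lowercase, then split on single spaces.
words1 = input1.lower().split(' ')

# Stemming: reduce each word form to its root form.
porter = PorterStemmer()
stems = [porter.stem(t) for t in words1]
print(stems)
```

All five word forms should reduce to the same stem, list.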

 

Lemmatization

Stemming can produce stems that are not real words; lemmatization makes the output words actually meaningful.

>>>udhr = nltk.corpus.udhr.words('English-Latin1')

udhr is the Universal Declaration of Human Rights corpus.

Lemmatization: like stemming, but the resulting stems are all valid words

>>>WNlemma = nltk.WordNetLemmatizer()

>>>[WNlemma.lemmatize(t) for t in udhr[:20]]
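To see the difference between the two, the sketch below runs both on a few sample words. The word list is chosen for illustration only. The lemmatizer depends on the WordNet data, which is an assumption about the local setup, so that part is guarded and falls back to None when the data is missing.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ['universal', 'declaration', 'rights']  # hypothetical sample words

# Stemming: the results are not guaranteed to be dictionary words.
porter = PorterStemmer()
stems = [porter.stem(w) for w in words]
print(stems)

# Lemmatization: needs the WordNet corpus to be installed.
try:
    wn_lemma = WordNetLemmatizer()
    lemmas = [wn_lemma.lemmatize(w) for w in words]
except LookupError:
    # WordNet data not found; nltk.download('wordnet') would fetch it.
    lemmas = None
print(lemmas)
```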

 

Tokenization:

Recall splitting a sentence into words/tokens

 

NLTK has a built-in tokenizer

>>>nltk.word_tokenize(text11)
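The sketch below contrasts plain splitting on spaces with a real tokenizer. It uses TreebankWordTokenizer instead of nltk.word_tokenize, because that rule-based tokenizer ships with NLTK and needs no downloaded model; that swap is an assumption made to keep the example self-contained. The sample sentence is the one used for text11 in these notes.

```python
from nltk.tokenize import TreebankWordTokenizer

text = "Children shouldn't drink a sugary drink before bed."

# Splitting on spaces leaves "shouldn't" whole and glues the period to "bed".
by_space = text.split(' ')
print(by_space)

# A tokenizer separates the contraction and the final period.
tokens = TreebankWordTokenizer().tokenize(text)
print(tokens)
```

The tokenizer yields more tokens than the split because the negation and the sentence-final period become tokens of their own.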

 

Sentence splitting

How would you split a long text string into sentences?

A period does not always end a sentence, e.g. in "$2.99" or "U.S."

NLTK has a built-in sentence splitter too!

>>>sentences = nltk.sent_tokenize(text12)

>>>len(sentences)
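To see why a dedicated sentence splitter is needed, here is a sketch of the naive approach in plain Python, treating every period, question mark, or exclamation mark as a sentence end. The sample text is a made-up stand-in for text12 with four sentences.

```python
import re

# Hypothetical sample text containing "U.S." and "$2.99".
text12 = ('This is the first sentence. A gallon of milk in the U.S. '
          'costs $2.99. Is this the third sentence? Yes, it is!')

# Naive rule: every '.', '?' or '!' ends a sentence.
naive = [s.strip() for s in re.split(r'[.?!]', text12) if s.strip()]

for piece in naive:
    print(piece)
print(len(naive))
```

The naive rule cuts inside "U.S." and "$2.99", so it reports more pieces than there are sentences. nltk.sent_tokenize is meant to handle such cases.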

 

NLP Tasks

Counting words, counting frequency of words

Finding sentence boundaries

Part of speech tagging

Parsing the sentence structure

Identifying semantic roles

Named Entity Recognition

Co-reference and pronoun resolution

 

Part-of-speech (POS) Tagging

>>>import nltk

>>>nltk.help.upenn_tagset('MD')

 

POS tagging with NLTK

Recall splitting a sentence into words/tokens

>>>text11 = "Children shouldn't drink a sugary drink before bed."

>>>text13 = nltk.word_tokenize(text11)

NLTK’s Tokenizer

>>>nltk.pos_tag(text13)

 

Ambiguity in POS Tagging

Ambiguity is common in English

 

Parsing Sentence Structure

Making sense of sentences is easy if they follow a well-defined grammatical structure

Ambiguity may exist even if sentences are grammatically correct!   

 

Ambiguity in Parsing

>>>text16 = nltk.word_tokenize('I saw the man with a telescope')

>>>grammar1 = nltk.data.load('mygrammar1.cfg')

>>>grammar1

<Grammar with 13 productions>

>>>parser = nltk.ChartParser(grammar1)

>>>trees = parser.parse_all(text16)

>>>for tree in trees:

...    print(tree)
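The contents of mygrammar1.cfg are not shown in these notes. As a sketch, the grammar below is an assumed stand-in with 13 productions, defined inline with nltk.CFG.fromstring so that no file is needed. The sentence is split on spaces to avoid depending on tokenizer data.

```python
import nltk

# Assumed grammar, not the actual contents of mygrammar1.cfg.
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> DT N | DT N PP | 'I'
DT -> 'a' | 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

text16 = 'I saw the man with a telescope'.split()

parser = nltk.ChartParser(grammar1)
trees = parser.parse_all(text16)

# One tree per reading of the sentence.
for tree in trees:
    print(tree)
print(len(trees))
```

Under this grammar the sentence has two parses: in one, "with a telescope" attaches to the verb phrase (the seeing was done with a telescope); in the other, it attaches to the noun phrase (the man has the telescope).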

NLTK and Parse Tree Collection

>>>from nltk.corpus import treebank

>>>text17 = treebank.parsed_sents('wsj_0001.mrg')[0]

>>>print(text17)
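Parsed sentences in the treebank are nltk.Tree objects. As a sketch that does not need the treebank data, the snippet below builds a tree from a hand-written bracketed string (a made-up sentence, not taken from wsj_0001.mrg) and inspects it.

```python
from nltk import Tree

# Hypothetical bracketed parse in Penn Treebank style.
bracketed = '(S (NP (DT the) (NN man)) (VP (VBD saw) (NP (DT a) (NN boat))))'
tree = Tree.fromstring(bracketed)

print(tree)            # bracketed form of the tree
print(tree.label())    # label of the root node
print(tree.leaves())   # the words of the sentence
print(tree.pos())      # (word, POS tag) pairs read off the tree
```

The same methods apply to the trees returned by treebank.parsed_sents.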

 

POS Tagging & Parsing Complexity

Uncommon usages of words

>>>text18 = nltk.word_tokenize('The old man the boat')

>>>nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

This tagging is not correct: in this sentence 'old' is a noun and 'man' is a verb.
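One way to see why uncommon usages are hard: a tagger that only knows the most likely tag for each word cannot use context at all. The sketch below uses a lookup-based nltk.UnigramTagger with a hand-written tag table. This is not the tagger behind nltk.pos_tag, and the table is a made-up assumption for illustration.

```python
import nltk

# Hypothetical table: the most common tag for each word.
likely_tags = {'the': 'DT', 'old': 'JJ', 'man': 'NN', 'boat': 'NN'}

# Look each word up in the table; fall back to 'NN' for unknown words.
tagger = nltk.UnigramTagger(model=likely_tags,
                            backoff=nltk.DefaultTagger('NN'))

tokens = 'the old man the boat'.split()
tagged = tagger.tag(tokens)
print(tagged)
```

Because 'man' is far more often a noun than a verb, a purely word-based lookup always tags it NN, whatever the sentence.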

Well-formed sentences may still be meaningless

 

Take Home Concepts

POS tagging provides insights into the word classes/types in a sentence.

Parsing the grammatical structures helps derive meaning

Both tasks are difficult; linguistic ambiguity increases the difficulty even more.

NLTK provides access to tools and data for training.
