[Text_Mining]notes_2

Latest recommended article published 2021-11-14 14:37:27

哥斯拉不会打撸啊撸

Tags: text classification, coursera

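Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.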

Article link: https://blog.csdn.net/weixin_38536057/article/details/78751426

An introduction to NLTK

NLTK: Natural Language Toolkit

Open source library in Python

>>> import nltk

Frequency of words

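>>> from nltk import FreqDist

>>> dist = FreqDist(text7)  # text7: a sample text, loaded via `from nltk.book import *`

>>> len(dist)  # number of distinct tokens

>>> vocab1 = dist.keys()

>>> freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]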
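The same filter can be tried on a tiny token list without loading any corpus. This is a minimal sketch: the token list, the variable names and the lower threshold (2 instead of 100) are illustrative assumptions, not part of the original notes.

```python
from nltk import FreqDist

# Hypothetical token list standing in for a real corpus such as text7.
tokens = ("the market rallied as investors bought shares while "
          "other investors sold shares and the market closed higher").split()

dist = FreqDist(tokens)          # maps each token to its count
vocab = dist.keys()              # the distinct tokens

# Keep words longer than 5 characters that occur at least twice.
frequent_long_words = sorted(w for w in vocab if len(w) > 5 and dist[w] >= 2)

print(len(tokens), len(dist))    # total tokens vs. distinct tokens
print(frequent_long_words)
```

FreqDist behaves like a dictionary of counts, so `dist[w]` is the number of times `w` occurs and `len(dist)` is the vocabulary size.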

Normalization and Stemming

Different forms of the same ‘word’

>>> input1 = 'List listed lists listing listings'

>>> words1 = input1.lower().split(' ')

Stemming finds the root form of a given word.

>>> porter = nltk.PorterStemmer()

>>> [porter.stem(t) for t in words1]
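As a self-contained version of the steps above, the sketch below lowercases the example string, stems each word with the Porter stemmer and collects the distinct stems. The variable names are illustrative.

```python
from nltk.stem import PorterStemmer

# Different surface forms of the same word, normalized to lowercase.
words = "List listed lists listing listings".lower().split()

porter = PorterStemmer()
stems = [porter.stem(w) for w in words]

# All five forms should collapse to a single stem.
unique_stems = set(stems)
print(stems)
print(unique_stems)
```

The Porter stemmer is rule-based and needs no downloaded data, which makes it convenient for quick normalization.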

Lemmatization

Lemmatization aims for output forms that are actual, meaningful words.

>>> udhr = nltk.corpus.udhr.words('English-Latin1')

udhr: the Universal Declaration of Human Rights corpus.

Lemmatization: like stemming, but the resulting stems are all valid words.

>>> WNlemma = nltk.WordNetLemmatizer()

>>> [WNlemma.lemmatize(t) for t in udhr[:20]]

Tokenization:

Recall splitting a sentence into words/tokens.

NLTK has a built-in tokenizer.

>>> nltk.word_tokenize(text11)  # text11: a sentence string
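To show what a tokenizer does beyond splitting on spaces, here is a small sketch comparing `str.split` with NLTK's `TreebankWordTokenizer`. The Treebank tokenizer is used here as a stand-in because it needs no downloaded models; `nltk.word_tokenize` is built on a similar Treebank-style tokenizer but also requires the Punkt sentence model, so its output may differ in details.

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "Children shouldn't drink a sugary drink before bed."

# Naive approach: punctuation stays glued to words ("bed.").
whitespace_tokens = sentence.split()

# Treebank-style tokenization: splits off the final period
# and separates the contraction into "should" + "n't".
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(whitespace_tokens)
print(treebank_tokens)
```

Separating "n't" and the final period matters for later steps such as POS tagging, where negation and punctuation are treated as tokens of their own.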

Sentence splitting

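How would you split a long text string into sentences?

Splitting on periods alone is not enough: "$2.99" and "U.S." contain periods that do not end a sentence.

NLTK has a built-in sentence splitter too!

>>> sentences = nltk.sent_tokenize(text12)  # text12: a multi-sentence string

>>> len(sentences)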
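To illustrate why sentence splitting needs more than a punctuation rule, here is a naive baseline that uses only Python's `re` module. It is not the algorithm behind `nltk.sent_tokenize`; the sample text and names are made up for illustration.

```python
import re

# Hypothetical sample with four real sentences.
text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Naive rule: split at whitespace that follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
```

The rule leaves "$2.99" intact because that period is not followed by whitespace, but it wrongly breaks after "U.S.", so it returns five pieces instead of four. Handling abbreviations like this is what a trained sentence splitter is for.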

NLP Tasks

Counting words, counting frequency of words

Finding sentence boundaries

Part of speech tagging

Parsing the sentence structure

Identifying semantic roles

Named Entity Recognition

Co-reference and pronoun resolution

Part-of-speech (POS) Tagging

>>> import nltk

>>> nltk.help.upenn_tagset('MD')

POS tagging with NLTK

Recall splitting a sentence into words/tokens.

>>> text11 = "Children shouldn't drink a sugary drink before bed."

>>> text13 = nltk.word_tokenize(text11)

NLTK's Tokenizer

>>> nltk.pos_tag(text13)

Ambiguity in POS Tagging

Ambiguity is common in English.

Parsing Sentence Structure

Making sense of sentences is easy if they follow a well-defined grammatical structure.

Ambiguity may exist even if sentences are grammatically correct!

Ambiguity in Parsing

>>> text16 = nltk.word_tokenize('I saw the man with a telescope')

>>> grammar1 = nltk.data.load('mygrammar1.cfg')

>>> grammar1

<Grammar with 13 productions>

>>> parser = nltk.ChartParser(grammar1)

>>> trees = parser.parse_all(text16)

>>> for tree in trees:

...     print(tree)
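The file mygrammar1.cfg is not reproduced in these notes. As a sketch, the grammar below is a hypothetical stand-in defined inline with `nltk.CFG.fromstring`; its rules are an assumption chosen so that the prepositional phrase can attach either to the verb phrase or to the noun phrase.

```python
import nltk

# Hypothetical toy grammar (not necessarily the contents of mygrammar1.cfg).
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> DT N | DT N PP | 'I'
DT -> 'a' | 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

tokens = "I saw the man with a telescope".split()
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(tokens)

# One tree attaches "with a telescope" to the verb (seeing with a telescope),
# the other attaches it to "the man" (the man who has a telescope).
for tree in trees:
    print(tree)
```

Both trees are grammatically valid under this grammar, which is exactly the ambiguity the sentence is meant to demonstrate.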

NLTK and Parse Tree Collection

>>> from nltk.corpus import treebank

>>> text17 = treebank.parsed_sents('wsj_0001.mrg')[0]

>>> print(text17)
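The treebank sample has to be downloaded before the lines above will run. To get a feel for the same tree objects without it, a bracketed parse can be built by hand with `nltk.Tree.fromstring`. The sentence below is made up for illustration and is not from the treebank.

```python
from nltk import Tree

# Hypothetical bracketed parse in Penn Treebank style.
tree = Tree.fromstring("(S (NP (DT The) (NN dog)) (VP (VBD barked)))")

root_label = tree.label()    # label of the root node
words = tree.leaves()        # the tokens, left to right
tagged = tree.pos()          # (word, POS tag) pairs read off the tree

print(root_label)
print(words)
print(tagged)
```

Parsed sentences returned by `treebank.parsed_sents` are `Tree` objects of the same kind, so `leaves()` and `pos()` apply to them as well.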

POS Tagging & Parsing Complexity

Uncommon usages of words

>>> text18 = nltk.word_tokenize('The old man the boat')

>>> nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

This tagging is not correct: in this sentence "old" is a noun and "man" is a verb.
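One way to see the intended reading is to let a grammar decide. In the sketch below, a hypothetical toy grammar allows "old" to be an adjective or a noun and "man" to be a noun or a verb; only one combination yields a complete sentence. The grammar and its tag names are assumptions made for this illustration.

```python
import nltk

# Hypothetical toy grammar: "old" may be JJ or N, "man" may be N or V.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT N | DT JJ N
VP -> V NP
DT -> 'The' | 'the'
JJ -> 'old'
N -> 'old' | 'man' | 'boat'
V -> 'man'
""")

tokens = "The old man the boat".split()
trees = list(nltk.ChartParser(grammar).parse(tokens))

# Read the word classes off the single complete parse.
tags = [(word, str(tag)) for word, tag in trees[0].pos()]
print(tags)
```

Reading "The old man" as a noun phrase leaves "the boat" with no verb, so the only full parse treats "old" as the noun and "man" as the verb.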

Well-formed sentences may still be meaningless

Take Home Concepts

POS tagging provides insights into the word classes/types in a sentence.

Parsing the grammatical structure helps derive meaning.

Both tasks are difficult; linguistic ambiguity makes them harder still.

NLTK provides access to tools and data for training.
