NLP: Learning Basic NLTK Methods

  1. Tokenization

Tokenization means splitting text, according to the needs of the task, into a sequence of strings (each element is usually called a token, or word). In general, each element of the sequence should carry some meaning.

Sentence Tokenization

The following example uses NLTK to split text into sentences:

import nltk
ulysses = "Mrkgnao! the cat said loudly. She blinked up out of her avid shameclosing eyes, mewing \
plaintively and long, showing him her milkwhite teeth. He watched the dark eyeslits narrowing \
with greed till her eyes were green stones. Then he went to the dresser, took the jug Hanlon's \
milkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor. \
— Gurrhr! she cried, running to lap."
doc = nltk.sent_tokenize(ulysses)   # nltk.download('punkt') for this
for s in doc:
    print(">",s)

Word Tokenization

There are different methods for tokenizing text into words, such as:

1. TreebankWordTokenizer

2. WordPunctTokenizer

3. WhitespaceTokenizer

Tokenizing the same sentence with different tokenizers:

import nltk
from nltk import word_tokenize

sentence = "Mary had a little lamb it's fleece was white as snow."
# Default Tokenization
tree_tokens = word_tokenize(sentence)   # nltk.download('punkt') for this

# Other Tokenizers
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)

print("DEFAULT: ", tree_tokens)
print("PUNCT  : ", punct_tokens)
  2. Part-of-speech tagging

POS tagging assigns a part-of-speech tag to every token in a sentence. Reusing the tokens produced in the tokenization example above:

sentence = "Mary had a little lamb it's fleece was white as snow."
# Default Tokenization
tree_tokens = word_tokenize(sentence)   # nltk.download('punkt') for this

# Other Tokenizers
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)

print("DEFAULT: ", tree_tokens)
print("PUNCT  : ", punct_tokens)

pos = nltk.pos_tag(tree_tokens)   # nltk.download('averaged_perceptron_tagger') for this
print(pos)
pos_punct = nltk.pos_tag(punct_tokens)
print(pos_punct)

import re
regex = re.compile("^N.*")  # matches POS tags that start with N, i.e. nouns
nouns = []
for l in pos:
    if regex.match(l[1]):
        nouns.append(l[0])
print("Nouns:", nouns)
  3. Stemming

Stemming strips affixes from a word to reduce it to its root form (stem).

Common affixes include noun plural endings, the progressive "-ing", past-participle endings, and so on.

Chinese text usually does not need this step, since Chinese words carry no such affixes; for English, several different stemming algorithms are available:

import nltk
from nltk import word_tokenize

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer("english")


sentence2 = "When I was going into the woods I saw a bear lying asleep on the forest floor"
tokens2 = word_tokenize(sentence2)

print("\n",sentence2)
for stemmer in [porter, lancaster, snowball]:
    print([stemmer.stem(t) for t in tokens2])
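
Note that a stem does not have to be a valid English word. A small illustration with the Porter stemmer defined above (the word list is chosen purely for illustration):

for w in ["running", "ponies", "caresses", "cats"]:
    print(w, "->", porter.stem(w))
# typically: running -> run, ponies -> poni, caresses -> caress, cats -> cat
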
  4. Lemmatisation

Lemmatisation uses a dictionary to reduce a word's inflected forms to its base form (lemma).

Unlike stemming, it does not simply strip affixes; the word is converted according to the dictionary, so, for example, "drove" becomes "drive".

nltk.download('wordnet')    # WordNet dictionary data used by the lemmatizer
nltk.download('omw-1.4')    # Open Multilingual WordNet (needed by recent NLTK versions)

# tokens2 is the token list from the stemming example above
wnl = nltk.WordNetLemmatizer()
tokens2_pos = nltk.pos_tag(tokens2)  # nltk.download("averaged_perceptron_tagger") for this

# Without a pos argument, lemmatize() treats every token as a noun
print([wnl.lemmatize(t) for t in tokens2])
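
The tagger output tokens2_pos is not actually used above, and without a pos argument the lemmatizer falls back to treating every word as a noun. Below is a minimal sketch of passing the POS tags through to lemmatize(); the get_wordnet_pos helper is not part of NLTK and is introduced here only for illustration.

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the POS constants WordNetLemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # the lemmatizer's own default

print([wnl.lemmatize(tok, get_wordnet_pos(tag)) for tok, tag in tokens2_pos])
# e.g. "was" should now become "be" and "lying" become "lie", since they are tagged as verbs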