NLTK Stemming
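In NLP, once a sentence or document has been split into tokens, the tokens are usually stemmed. Stemming strips inflectional endings, such as noun plurals and verb tense suffixes, so that the variants of a word reduce to one common stem.
For English tokens this mainly means reducing plural nouns to the singular and reducing other verb forms to the base form.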
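As a minimal sketch of that idea, the snippet below stems a few inflected forms and collects the distinct stems per group. The word groups are hypothetical examples chosen for illustration, and `PorterStemmer` is just one of the stemmers NLTK provides.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word groups for illustration: each list holds
# inflected forms of a single word.
groups = {
    "learn": ["learn", "learns", "learned", "learning"],
    "room": ["room", "rooms"],
}

for base, forms in groups.items():
    # A set keeps only the distinct stems produced for this group.
    stems = {stemmer.stem(form) for form in forms}
    print(base, "->", stems)
```

Each group should collapse to a single stem, which is what lets later steps treat the variants as the same word.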
Two stemmers commonly used in NLTK are Porter and Snowball.
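The two are closely related: Snowball's English stemmer is a later revision of Porter's algorithm, and Snowball also covers other languages. The sketch below lists the languages NLTK's `SnowballStemmer` accepts and prints both stemmers' results side by side. The sample words are assumptions picked for illustration, not taken from the text above.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Snowball is a family of stemmers; the accepted language names
# are exposed as a class attribute.
print(SnowballStemmer.languages)

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Illustrative words chosen for this sketch.
sample_words = ["generously", "fairly", "drawing", "rooms"]

for word in sample_words:
    print(f"{word:12} porter={porter.stem(word):10} snowball={snowball.stem(word)}")
```

On plain plurals and "-ing" forms the two usually agree. Any differences tend to show up on other suffixes, so it is worth running both on a sample of your own vocabulary before choosing one.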
import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# Naive whitespace tokenization; enough for this punctuation-free sentence.
words = word_data.split(" ")

porterStemmer = nltk.stem.PorterStemmer()
snowballStemmer = nltk.stem.SnowballStemmer('english')

def stem_tokens(tokens, stemmer):
    """Return the stem of each token, in order."""
    stemmed = []
    for token in tokens:
        stemmed.append(stemmer.stem(token))
    return stemmed

print(word_data)
print(stem_tokens(words, porterStemmer))
print(stem_tokens(words, snowballStemmer))
Output:
It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms
['It', 'origin', 'from', 'the', 'idea', 'that', 'there', 'are', 'reader', 'who', 'prefer', 'learn', 'new', 'skill', 'from', 'the', 'comfort', 'of', 'their', 'draw', 'room']
['it', 'origin', 'from', 'the', 'idea', 'that', 'there', 'are', 'reader', 'who', 'prefer', 'learn', 'new', 'skill', 'from', 'the', 'comfort', 'of', 'their', 'draw', 'room']
The two results differ only in the first token: the Porter output keeps the capital in 'It', while Snowball lowercases it. Recent NLTK releases also lowercase in PorterStemmer.stem by default (to_lowercase=True), so a current install may print 'it' in both lists.