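1. Stemming
In English, the same word can appear in several forms: singular and plural nouns, present and past tense verbs, and so on. When processing English text we therefore need to reduce words to their stems. This section calls the two stemmers that ship with NLTK directly.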
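To make the point about word forms concrete, here is a minimal sketch that runs a few inflected words through both stemmers side by side. The helper name `compare_stems` and the sample word list are assumptions introduced only for illustration; the stemmer classes themselves are the NLTK ones used in this section, and they need no extra data downloads.

```python
from nltk.stem import PorterStemmer, LancasterStemmer


def compare_stems(words):
    """Return (word, porter_stem, lancaster_stem) tuples for each input word."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    return [(w, porter.stem(w), lancaster.stem(w)) for w in words]


# Hypothetical sample: plural nouns and -ing verb forms.
sample = ["swords", "ponds", "lying", "distributing"]

for word, p_stem, l_stem in compare_stems(sample):
    # Print one aligned row per word: original, Porter stem, Lancaster stem.
    print(f"{word:<14}{p_stem:<14}{l_stem}")
```

The two stemmers do not always agree, so printing them in adjacent columns is a quick way to decide which one suits a given task. The passage-level example that follows applies the same two stemmers to tokenized text.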
import re
import nltk
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw) # tokenize; if this call fails, run nltk.download('punkt') first (newer NLTK releases may ask for 'punkt_tab' instead)
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in tokens])
lancaster = nltk.LancasterStemmer()
print([lancaster.stem(t) for t in tokens])
The output is as follows:
porter:['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', &