Installing the corpora
import nltk
nltk.download()  # opens the NLTK downloader UI; nltk.download('punkt') fetches a single package
Tokenization
- English: nltk.word_tokenize()  # splits text into word tokens
- Chinese: jieba.cut()
Word-form processing
- Stemming: strips affixes and keeps the longest shared root of a word
NLTK ships several stemmer implementations:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('maximum')
# output: 'maximum'
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem('maximum')
# output: 'maxim'
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
snowball_stemmer.stem('maximum')
# output: 'maximum'
- Lemmatization: maps each inflected form of a word back to one base form (via WordNet)
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
'church'
>>> wordnet_lemmatizer.lemmatize('aardwolves')
'aardwolf'
>>> wordnet_lemmatizer.lemmatize('abaci')
'abacus'
>>> wordnet_lemmatizer.lemmatize('hardrock')
'hardrock'
Removing stopwords
from nltk.corpus import stopwords
# tokenize first to get a word_list
# ...
# then filter
filtered_words = [word for word in word_list if word not in stopwords.words('english')]