1. Tokenization
NLTK's built-in tokenizers:
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize
LineTokenizer splits a string into lines:
lTokenizer = LineTokenizer()
print("output:", lTokenizer.tokenize("first line\nsecond line"))  # sample text (assumption)
SpaceTokenizer splits on space characters:
rawText = "NLTK makes tokenization easy"  # sample text (assumption)
sTokenizer = SpaceTokenizer()
print("output:", sTokenizer.tokenize(rawText))
TweetTokenizer handles special characters such as hashtags, @mentions, and emoticons:
tTokenizer = TweetTokenizer()
print("output:", tTokenizer.tokenize("Trying out NLTK #NLP :-)"))  # sample tweet (assumption)
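TweetTokenizer also takes options that a plain word tokenizer lacks: strip_handles drops @mentions and reduce_len caps runs of a repeated character at three. A minimal sketch (the tweet text is an invented sample):

```python
from nltk.tokenize import TweetTokenizer

# strip_handles=True removes @mentions; reduce_len=True shortens
# exaggerated character runs (e.g. "waaaaayyyy" -> "waaayyy")
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tknzr.tokenize("@remy This is waaaaayyyy too much for you!!!!!!"))
```

Note that punctuation is split into individual tokens, while the emoticon-aware rules keep things like ":-)" intact.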
2. Stemming
from nltk import PorterStemmer, LancasterStemmer, word_tokenize
raw = "stemming reduces words to their stems"  # sample text (assumption)
tokens = word_tokenize(raw)  # tokenize first, then stem each token