nltk.sent_tokenize()
for r in reader:
    print(r[0])
    print(nltk.sent_tokenize(r[0].lower()))
    print('\n')
Output:
They wont nerf it. I just hope people decide to run fun decks once TGT hits and stop being assholes.
['they wont nerf it.', 'i just hope people decide to run fun decks once tgt hits and stop being assholes.']
Seemed to start by a lot of falling over each other.
['seemed to start by a lot of falling over each other.']
That whole show was powerful. Landed a spot in my top 5
['that whole show was powerful.', 'landed a spot in my top 5']
nltk.sent_tokenize() splits each comment into sentences at sentence-ending punctuation.
nltk.word_tokenize()
for r in reader:
    print(r[0])
    print(nltk.word_tokenize(r[0].lower()))
    print('\n')
Output:
Well, about that Ninth Circle...
['well', ',', 'about', 'that', 'ninth', 'circle', '...']
Goddamn you're retarded.
['goddamn', 'you', "'re", 'retarded', '.']
I'm in Tampa, you piece of shit. Come visit me.
['i', "'m", 'in', 'tampa', ',', 'you', 'piece', 'of', 'shit', '.', 'come', 'visit', 'me', '.']
nltk.word_tokenize() splits the text into individual word tokens, treating punctuation marks and contractions (e.g. "'re", "'m") as separate tokens.
nltk.FreqDist()
word_freq = nltk.FreqDist(itertools.chain(*sent_words))
for w in word_freq:
    print(w, word_freq[w])
Output:
degrasse 1
hanks 1
marajuana 1
anti-vaxxers 1
felicidades 1
loader 1
nltk.FreqDist() counts how many times each word occurs across all the tokenized sentences.
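FreqDist behaves like a Counter keyed by token. A runnable sketch, where `sent_words` is a made-up stand-in for the list of token lists produced by the word-tokenization step above:

```python
import itertools
import nltk

# Hypothetical tokenized comments, standing in for the real sent_words
sent_words = [['they', 'wont', 'nerf', 'it', '.'],
              ['they', 'hope', 'it', 'is', 'fun', '.']]

# Flatten the list of token lists into one stream and count each token
word_freq = nltk.FreqDist(itertools.chain(*sent_words))

print(word_freq['they'])  # 2
print(word_freq['nerf'])  # 1
# most_common() ranks tokens by frequency, highest first
print(word_freq.most_common(2))
```

Because FreqDist supports dictionary-style lookup and `most_common()`, it is a convenient basis for building the vocabulary of the most frequent words later on.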