NLP (VII): Text Sentiment Classification with sklearn (Part 2)
In this part we use gensim to turn words into vectors.
Tokenizing with spacy
import numpy as np
import spacy

# Combine the training and test tweets so they are tokenized in one pass.
all_texts = np.array(twitter_train_df['text']).tolist() + np.array(twitter_test_df['text']).tolist()
all_tokenized_texts = []  # one list of lowercased tokens per tweet
token_freq_dict = {}      # token -> number of occurrences over all tweets
nlp = spacy.load("en_core_web_sm")
for twitt in all_texts:
    doc = nlp(twitt)
    token_twitt = []
    for token in doc:
        token = token.text.lower()
        token_twitt.append(token)
        if token in token_freq_dict:
            token_freq_dict[token] += 1
        else:
            token_freq_dict[token] = 1
    all_tokenized_texts.append(token_twitt)
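The frequency-counting part of the loop above can also be written with collections.Counter from the standard library. The following is a minimal sketch, assuming the tweets have already been tokenized; the sample token lists are made up for illustration, and count_tokens is a hypothetical helper that is not part of the original code.

```python
from collections import Counter

def count_tokens(tokenized_texts):
    """Count how often each token occurs across all tokenized texts."""
    freq = Counter()
    for tokens in tokenized_texts:
        freq.update(tokens)
    return freq

# Made-up sample data standing in for all_tokenized_texts.
sample_tokenized_texts = [
    ["i", "love", "this", "movie"],
    ["i", "hate", "this", "weather"],
]
sample_freq = count_tokens(sample_tokenized_texts)
print(sample_freq["i"], sample_freq["movie"])  # prints: 2 1
```

Unlike a plain dict, a Counter returns 0 for a token it has never seen, so a later frequency check does not need a separate membership test.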
Vectorizing tokens with gensim
For details on how to use the gensim package, see the official documentation:
https://radimrehurek.com/gensim/models/word2vec.html
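from gensim.models import Word2Vec
# Train 300-dimensional word vectors. gensim >= 4.0 calls this parameter vector_size (it was size in 3.x).
# With the default min_count=5, tokens seen fewer than 5 times are left out of the vocabulary.
model = Word2Vec(all_tokenized_texts, vector_size=300)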
The vector representation of each tweet can be computed by averaging the vectors of all its tokens:
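# Tokens seen fewer than 5 times are skipped: Word2Vec's default min_count=5 leaves them out of model.wv.
all_vec_tweets = []
for tweet in all_tokenized_texts:
    tw_vecs = []
    for token in tweet:
        if token_freq_dict[token] >= 5:
            tw_vecs.append(model.wv[token].tolist())
    if len(tw_vecs) == 0:
        # No token has a vector: fall back to a zero vector.
        all_vec_tweets.append(np.zeros(300).tolist())
    else:
        all_vec_tweets.append(np.mean(np.array(tw_vecs), 0).tolist())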
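The averaging step can be checked by hand on a toy example. The sketch below assumes a small hand-made embedding table (toy_vectors, a plain dict standing in for model.wv) with 3-dimensional vectors instead of 300; average_vector is a hypothetical helper, and the numbers are made up for illustration.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`; return zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(np.array(found), axis=0)

# Made-up 3-dimensional vectors standing in for trained word vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

print(average_vector(["good", "movie", "unseen"], toy_vectors, dim=3))  # [2. 1. 1.]
print(average_vector(["unseen"], toy_vectors, dim=3))                   # [0. 0. 0.]
```

Tokens without a vector are ignored rather than counted as zeros, so they do not pull the average toward the origin. Only a tweet with no known tokens at all ends up as the zero vector.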
Training the model with sklearn
This step is the same as in the previous part.
from sklearn.linear_model import LogisticRegression
# The first len(twitter_train_df) vectors belong to the training tweets, the rest to the test tweets.
train_X = np.array(all_vec_tweets[:len(twitter_train_df)])
train_y = twitter_train_df['sentiment']
test_X = np.array(all_vec_tweets[len(twitter_train_df):])
test_y = twitter_test_df['sentiment']
clf = LogisticRegression(random_state=0).fit(train_X, train_y)
print("The accuracy of the trained classifier is " + str(clf.score(test_X, test_y) * 100) + "%")
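For a classifier, clf.score returns the mean accuracy, that is, the fraction of test samples whose predicted label equals the true label. The sketch below spells this out in plain Python; the accuracy helper is hypothetical and the labels are made up for illustration, not taken from the tweet data.

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up labels: three of the four predictions match.
sample_predicted = ["positive", "negative", "positive", "negative"]
sample_actual = ["positive", "negative", "negative", "negative"]
sample_accuracy = accuracy(sample_predicted, sample_actual)
print(str(sample_accuracy * 100) + "%")  # prints: 75.0%
```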