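Data source: 360,000 Weibo posts, each already labeled with a sentiment. In the source data, label 0 means happy and labels 1-3 mean low or sad moods. This post only considers positive versus negative polarity, so labels 1-3 are all treated as negative samples.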
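The label merging described above can be sketched as follows. This is only an illustration: the helper name `to_polarity` and the choice of 1 for positive and 0 for negative are assumptions, not taken from the original code.

```python
def to_polarity(label):
    """Map the four source labels to a binary polarity (assumed encoding).

    0 (happy) -> 1 (positive); 1, 2, 3 (low or sad) -> 0 (negative).
    """
    if label not in (0, 1, 2, 3):
        raise ValueError("unexpected label: %r" % (label,))
    return 1 if label == 0 else 0

# Toy labels standing in for the label column of the data set.
toy_labels = [0, 1, 2, 3, 0]
polarity = [to_polarity(l) for l in toy_labels]
```

With a pandas DataFrame, the same mapping could be applied to the label column in one step, but the column name would need to be checked against the actual CSV.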
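Approach: after word segmentation, train word vectors with gensim.models.word2vec, represent each training text with those word vectors, and train a random forest model with sklearn. The resulting AUC is 0.86.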
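One common way to turn word vectors into a fixed-length text representation is to average the vectors of the words in the text. The sketch below assumes that averaging scheme. The toy embeddings, the function name `text_to_vector`, and the 3-dimensional size are all hypothetical; in the real pipeline the vectors would come from the trained word2vec model.

```python
import numpy as np

# Hypothetical 3-dimensional toy embeddings standing in for a trained model.
toy_vectors = {
    '今天': np.array([1.0, 0.0, 0.0]),
    '开心': np.array([0.0, 1.0, 0.0]),
}

def text_to_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have an embedding.

    Tokens without an embedding are skipped; if none are known,
    a zero vector of length `dim` is returned.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# '很' has no embedding here, so only the other two tokens are averaged.
doc_vec = text_to_vector(['今天', '很', '开心'], toy_vectors, dim=3)
```

Each text then becomes one row of a feature matrix that a classifier such as a random forest can consume.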
Load the required Python packages:
import jieba
import re
import pandas as pd
from gensim.models import word2vec
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier as RF
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module
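Training word vectors with word2vec
The word vectors are trained on these 360,000 Weibo posts. word2vec requires the corpus to be segmented into words first.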
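To illustrate what a segmented corpus looks like on disk, here is a minimal sketch that writes one sentence per line with tokens separated by spaces, then reads it back. The toy sentences and the file name `toy_corpus.txt` are assumptions for illustration. In the post itself jieba produces the tokens.

```python
import os
import tempfile

# Already-segmented toy sentences (hypothetical data).
segmented = [['今天', '很', '开心'], ['有点', '难过']]

# Write one sentence per line, tokens joined by single spaces.
path = os.path.join(tempfile.mkdtemp(), 'toy_corpus.txt')
with open(path, 'w', encoding='utf-8') as f:
    for tokens in segmented:
        f.write(' '.join(tokens) + '\n')

# Splitting each line on whitespace recovers one token list per sentence,
# which is the shape of input a word2vec sentence iterator yields.
with open(path, encoding='utf-8') as f:
    sentences = [line.split() for line in f]
```

A file in this layout is what gensim's `word2vec.LineSentence` is designed to read, so the segmentation step only has to produce space-separated tokens line by line.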
data = pd.read_csv('F:/weibo_4_moods.csv',delimiter=',',header=0,encoding='utf-8')
file_train = 'F:/word_train.txt'
def get_word_train(filename):
    # Segment every post in data['review'] with jieba (precise mode)
    with open(filename,'w',encoding='utf-8') as f:
        for line in data['review']:
            word_l = ' '.join(jieba.cut(line,cut_all=False))
            word_l.replace(u',',u''</