[code review] LSTM - sentiment analysis

Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture, a kind of artificial neural network, published in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Like most RNNs, an LSTM network is universal in the sense that, given enough network units, it can compute anything a conventional computer can compute, provided it has the proper weight matrix, which may be viewed as its program.


An LSTM helps us remember the earlier part of the sequence (the history), which gives the model the context it needs to generate the next prediction.
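To make the idea of "remembering history" more concrete, here is a minimal sketch of one LSTM time step for a single unit, written in plain Python. It assumes the commonly used variant with a forget gate and is not taken from any library's implementation; the weights in `params` are made-up values chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM; p maps names to scalar weights."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the past, add new input
    h = o * math.tanh(c)     # hidden state passed on to the next step
    return h, c


def run_lstm(xs, p):
    """Feed a sequence through the unit and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, p)
    return h


# Hypothetical weights: every input/recurrent weight is 0.5, every bias is 0.
params = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
params.update({name: 0.0 for name in ("bf", "bi", "bo", "bg")})

print(run_lstm([0.0, 0.0, 0.0], params))  # no signal anywhere in the sequence
print(run_lstm([1.0, 0.0, 0.0], params))  # signal only at the first step
```

The two sequences differ only in their first element, yet the final hidden states differ, because the cell state `c` carries part of the first input forward through the later steps.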

The code below follows these steps:


1.  read the training data

2.  segment each sentence into terms

3.  compute the frequency of each term (a 'term' is a word produced by segmentation)

4.  generate an id for each term

5.  generate the training data: convert each sentence's terms to a sequence of ids

6.  add an embedding layer, which turns positive integers (indexes) into dense vectors of fixed size

7.  add an LSTM layer

8.  compile the model:

for two classes (binary), use the loss function 'binary_crossentropy'
for more than two classes, use the loss function 'categorical_crossentropy'

9.  train the model with model.fit
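Steps 3 to 5 can be sketched without pandas, using only the standard library. This is a simplified illustration, not the code used in this post: the pre-segmented English sentences are toy data standing in for jieba output, the helper names `build_vocab` and `to_ids` are hypothetical, and silently skipping unknown terms is an assumption.

```python
from collections import Counter


def build_vocab(segmented_sentences):
    """Map each term to an integer id; id 1 goes to the most frequent term."""
    counts = Counter(term for sent in segmented_sentences for term in sent)
    # most_common() sorts by descending frequency.
    # Ids start at 1 so that 0 stays free for padding.
    return {term: idx
            for idx, (term, _) in enumerate(counts.most_common(), start=1)}


def to_ids(sentence, vocab):
    """Convert a list of terms to a list of ids, skipping unknown terms."""
    return [vocab[term] for term in sentence if term in vocab]


# Toy data: each sentence is already segmented into terms.
sentences = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

vocab = build_vocab(sentences)
encoded = [to_ids(s, vocab) for s in sentences]

print(vocab)
print(encoded)
```

Assigning ids by frequency rank mirrors what the script does with `value_counts()` followed by `range(1, len(dict)+1)`.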


Things to watch out for:

Before training the model, the training data must be shuffled.

After training the model:


Save the dict and the model. The dict can be saved with pickle: https://www.saltycrane.com/blog/2008/01/saving-python-dict-to-file-using-pickle/

# open the file in a with block so it is closed after the dict is written
with open('w2id.dict','wb') as output:
    pickle.dump(dict,output)
model.save('lstm_santiment.model')
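For completeness, here is a small sketch of the full pickle round trip, including loading the mapping back before prediction. It assumes a plain Python dict named `w2id` with made-up entries (the script in this post pickles a pandas DataFrame instead) and writes to a temporary directory so nothing in the working directory is overwritten.

```python
import os
import pickle
import tempfile

# Hypothetical toy mapping from term to id.
w2id = {"good": 1, "movie": 2, "bad": 3}

path = os.path.join(tempfile.mkdtemp(), "w2id.dict")

# Save: pickle needs the file opened in binary write mode.
with open(path, "wb") as f:
    pickle.dump(w2id, f)

# Load: binary read mode; the result is an equal but separate object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == w2id)
```

The same mapping has to be used at prediction time, since the ids the model was trained on only make sense together with the dict that produced them. Only unpickle files from a source you trust.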


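# shuffle the data before splitting it into a training set and a validation set
# data and labels are NumPy arrays with one row per sample
# data.shape[0] is the number of rows, so indices holds every row index
indices = np.arange(data.shape[0])
# shuffle the row indices in place
np.random.shuffle(indices)
# reorder data according to the shuffled row indices
data = data[indices]
# reorder labels with the same indices so each label stays with its row
labels = labels[indices]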
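The snippet above only shuffles; the split itself could then look like the sketch below. The toy arrays, the seeded generator, and the `validation_split` value of 0.2 are assumptions for illustration and do not come from the script in this post.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded so the shuffle is repeatable

# Toy data: row r is [2r, 2r+1] and its label is r % 2.
data = np.arange(20).reshape(10, 2)
labels = np.arange(10) % 2

# Shuffle rows and labels with the same permutation.
indices = rng.permutation(data.shape[0])
data, labels = data[indices], labels[indices]

# Hold out the last 20% of the shuffled rows for validation.
validation_split = 0.2                # hypothetical value
n_val = int(validation_split * data.shape[0])
n_train = data.shape[0] - n_val

x_train, y_train = data[:n_train], labels[:n_train]
x_val, y_val = data[n_train:], labels[n_train:]

print(x_train.shape, x_val.shape)
```

Shuffling first matters here because the corpus is built by concatenating all positive samples and then all negative ones; taking the tail of an unshuffled array would give a validation set with a single class.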

import pandas as pd # import pandas
import numpy as np # import NumPy
import jieba # import jieba (Chinese word segmentation)
import pickle

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU

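# read_excel has no 'index' argument, so it is dropped here
neg=pd.read_excel('comments/neg.xls',header=None)
pos=pd.read_excel('comments/pos.xls',header=None) # finished reading the training corpus
pos['mark']=1
neg['mark']=0 # label the training corpus
pn=pd.concat([pos,neg],ignore_index=True) # merge the two corpora
neglen=len(neg)
poslen=len(pos) # count the samples in each corpus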

cw = lambda x: list(jieba.cut(x)) # define the segmentation function
pn['words'] = pn[0].apply(cw)

comment = pd.read_excel('comments/1.xls') # read the review content
comment = comment[comment[u'短评'].notnull()] # keep only non-empty reviews
comment['words'] = comment[u'短评'].apply(cw) # segment the reviews

d2v_train = pd.concat([pn['words'], comment['words']], ignore_index = True)

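w = [] # collect all terms into one list
for i in d2v_train:
  w.extend(i)

# note: the name dict shadows the Python builtin of the same name
dict = pd.DataFrame(pd.Series(w).value_counts()) # count how often each term occurs
del w,d2v_train
dict['id']=list(range(1,len(dict)+1)) # ids start at 1; 0 is left for padding

get_sent = lambda x: list(dict['id'][x])
pn['sent'] = pn['words'].apply(get_sent) # this step is very slow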

maxlen = 50

print("Pad sequences (samples x time)")
pn['sent'] = list(sequence.pad_sequences(pn['sent'], maxlen=maxlen))

x = np.array(list(pn['sent']))[::2] # training set (even rows)
y = np.array(list(pn['mark']))[::2]
xt = np.array(list(pn['sent']))[1::2] # test set (odd rows)
yt = np.array(list(pn['mark']))[1::2]
xa = np.array(list(pn['sent'])) # full set
ya = np.array(list(pn['mark']))

print('Build model...')
model = Sequential()
model.add(Embedding(len(dict)+1, 256))
model.add(LSTM(output_dim=32, activation='sigmoid', inner_activation='hard_sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(input_dim = 32, output_dim = 1))
model.add(Activation('sigmoid'))
print ('Model build complete...')

# class_mode is a deprecated compile argument and is not needed alongside metrics
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print ("Model compile complete ...")
for iteration in range(1, 3):
    print('Iteration=', iteration)
    model.fit(x, y, batch_size=16, nb_epoch=1,validation_data=(xt, yt))

classes = model.predict_classes(xa)
#for c in classes:
#    print c

score = model.evaluate(xt, yt, verbose=1)
print ("Test accuracy:",score[1])
#acc = np_utils.accuracy(classes, yt)
#print('Test accuracy:', acc)
# save the term-to-id table and the trained model; the with block closes the file
with open('w2id.dict','wb') as output:
    pickle.dump(dict,output)
model.save('lstm_santiment.model')
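The script relies on `sequence.pad_sequences` to bring every id sequence to `maxlen` entries. As a rough illustration of what that does with its default settings (as I understand them: pad at the front, truncate at the front, fill with 0), here is a plain-Python sketch. The helper name `pad_pre` is hypothetical and this is not Keras's own implementation.

```python
def pad_pre(ids, maxlen, value=0):
    """Left-pad or left-truncate a list of ids to exactly maxlen entries.

    Assumes maxlen >= 1.
    """
    ids = ids[-maxlen:]                         # keep only the last maxlen ids
    return [value] * (maxlen - len(ids)) + ids  # fill the front with value


print(pad_pre([1, 2, 3], 5))           # short sequence: padded at the front
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # long sequence: front is cut off
```

This is also why the term ids start at 1 and the embedding layer is created with `len(dict)+1` rows: id 0 is reserved for the padding value, so the embedding needs one more row than there are terms.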

