[code review] lstm - sentiment analysis


xiewenbo


Article link: https://blog.csdn.net/xiewenbo/article/details/74560619


Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture (an artificial neural network) published in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Like most RNNs, an LSTM network is universal in the sense that, given enough network units, it can compute anything a conventional computer can compute, provided it has the proper weight matrix, which may be viewed as its program.

An LSTM helps the model remember the earlier items in a sequence (characters or words), which gives it the context needed to generate the next prediction.

The following code:

1. Reads the training data.

2. Segments each sentence into words.

3. Computes the frequency of each term (a 'term' is a word produced by segmentation).

4. Generates an id for each term.

5. Generates the training data by converting each sentence's terms into a series of ids.

6. Adds an embedding layer, which turns positive integers (indexes) into dense vectors of fixed size.

7. Adds an LSTM layer.

8. Compiles the model:

For binary classification, use the loss function 'binary_crossentropy'.
For multi-class classification, use the loss function 'categorical_crossentropy'.

9. Trains the model with model.fit.
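Steps 3 to 5 can be sketched in plain Python without pandas or Keras. This is only an illustration under stated assumptions: the helper names `build_vocab` and `to_ids` and the toy sentences are hypothetical, and the padding rule (zeros on the left, keep the last `maxlen` ids) is written to mirror the default behaviour of Keras `pad_sequences`.

```python
from collections import Counter

def build_vocab(tokenized_sentences):
    """Map each term to an integer id, most frequent term first (ids start at 1)."""
    counts = Counter(term for sent in tokenized_sentences for term in sent)
    # most_common() sorts by descending frequency; id 0 is reserved for padding
    return {term: i for i, (term, _) in enumerate(counts.most_common(), start=1)}

def to_ids(sentence, vocab, maxlen):
    """Convert terms to ids, then left-pad with 0 or keep only the last maxlen ids."""
    ids = [vocab[term] for term in sentence]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

# Toy, already-segmented sentences (hypothetical data)
sentences = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab = build_vocab(sentences)
train = [to_ids(s, vocab, maxlen=4) for s in sentences]
print(vocab)
print(train)
```

Giving the most frequent terms the smallest ids is the same ordering that `value_counts()` produces in the full script further down.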

Problem:

The training data must be shuffled before the model is trained.

After training the model:

Save the dict and the model. The dict is saved with pickle: https://www.saltycrane.com/blog/2008/01/saving-python-dict-to-file-using-pickle/

# the with-statement closes the file once the dict has been written
with open('w2id.dict', 'wb') as output:
    pickle.dump(dict, output)
model.save('lstm_santiment.model')
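Saving is only half of the workflow, since a later prediction script has to read the dict back. Below is a minimal round-trip sketch using only the standard library; the toy `word_to_id` mapping and the temporary directory are assumptions made for illustration, and the real dict in this post is a pandas DataFrame rather than a plain Python dict.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real word-to-id table
word_to_id = {"good": 1, "movie": 2, "bad": 3}

path = os.path.join(tempfile.mkdtemp(), "w2id.dict")

# Write: the with-statement closes the file even if pickling fails
with open(path, "wb") as f:
    pickle.dump(word_to_id, f)

# Read it back, e.g. in a separate prediction script
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == word_to_id)  # True
```

The saved model can presumably be restored in a similar way with `keras.models.load_model`, assuming a Keras version compatible with the one used for saving.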

# Shuffle the samples and labels with the same permutation,
# e.g. before splitting the data into a training set and a validation set
# data.shape[0] is the number of rows
indices = np.arange(data.shape[0])
# shuffle the row indices in place
np.random.shuffle(indices)
# reorder data according to the shuffled row indices
data = data[indices]
# reorder labels the same way so each label stays with its sample
labels = labels[indices]

import pandas as pd  # pandas
import numpy as np  # NumPy
import jieba  # jieba Chinese word segmentation
import pickle

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, LSTM, GRU

neg = pd.read_excel('comments/neg.xls', header=None)
pos = pd.read_excel('comments/pos.xls', header=None)  # training corpus loaded
pos['mark'] = 1
neg['mark'] = 0  # label the training corpus
pn = pd.concat([pos, neg], ignore_index=True)  # merge the corpora
neglen = len(neg)
poslen = len(pos)  # count the samples in each corpus

cw = lambda x: list(jieba.cut(x))  # word segmentation function
pn['words'] = pn[0].apply(cw)

comment = pd.read_excel('comments/1.xls')  # read the review contents
comment = comment[comment[u'短评'].notnull()]  # keep only non-empty reviews ('短评' is the short-review column)
comment['words'] = comment[u'短评'].apply(cw)  # segment the reviews

d2v_train = pd.concat([pn['words'], comment['words']], ignore_index=True)

w = []  # collect all terms into one list
for i in d2v_train:
    w.extend(i)

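# count the occurrences of each term (note: the name dict shadows the Python built-in)
dict = pd.DataFrame(pd.Series(w).value_counts())
del w, d2v_train
dict['id'] = list(range(1, len(dict) + 1))  # ids start at 1; 0 is left for padding

get_sent = lambda x: list(dict['id'][x])
pn['sent'] = pn['words'].apply(get_sent)  # this step is very slow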

maxlen = 50

print("Pad sequences (samples x time)")
pn['sent'] = list(sequence.pad_sequences(pn['sent'], maxlen=maxlen))

x = np.array(list(pn['sent']))[::2]  # training set (even rows)
y = np.array(list(pn['mark']))[::2]
xt = np.array(list(pn['sent']))[1::2]  # test set (odd rows)
yt = np.array(list(pn['mark']))[1::2]
xa = np.array(list(pn['sent']))  # full set
ya = np.array(list(pn['mark']))

print('Build model...')
model = Sequential()
model.add(Embedding(len(dict) + 1, 256))
# Keras 2 argument names: units (formerly output_dim), recurrent_activation (formerly inner_activation)
model.add(LSTM(32, activation='sigmoid', recurrent_activation='hard_sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
print('Model build complete...')

# class_mode is no longer a compile() argument; nb_epoch was renamed to epochs in Keras 2
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Model compile complete ...")
for iteration in range(1, 3):
    print('Iteration=', iteration)
    model.fit(x, y, batch_size=16, epochs=1, validation_data=(xt, yt))

# predict_classes was removed from recent Keras releases; threshold the sigmoid output instead
classes = (model.predict(xa) > 0.5).astype('int32')
#for c in classes:
#    print(c)

score = model.evaluate(xt, yt, verbose=1)
print("Test accuracy:", score[1])
#acc = np_utils.accuracy(classes, yt)
#print('Test accuracy:', acc)
with open('w2id.dict', 'wb') as output:
    pickle.dump(dict, output)
model.save('lstm_santiment.model')
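To score a new review with the saved model, the text has to go through the same id lookup and padding as the training data. The sketch below covers only that encoding step, under stated assumptions: the helper `encode`, the toy vocabulary, and the policy of skipping words that never appeared in training are all hypothetical and not part of the original script. The skip matters because a direct lookup of an unseen word, as in `get_sent`, may raise a KeyError.

```python
def encode(tokens, word_to_id, maxlen=50):
    """Turn a segmented review into a fixed-length list of ids."""
    # Skip words never seen in training (assumed policy; id 0 stays reserved for padding)
    ids = [word_to_id[t] for t in tokens if t in word_to_id]
    # Keep the last maxlen ids and left-pad with zeros, like the training data
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

word_to_id = {"good": 1, "movie": 2, "bad": 3}  # toy vocabulary
tokens = ["really", "good", "movie"]            # "really" is out of vocabulary
encoded = encode(tokens, word_to_id, maxlen=5)
print(encoded)  # [0, 0, 0, 1, 2]
```

The encoded list could then be wrapped as np.array([encoded]) and passed to model.predict, where an output above 0.5 would correspond to the positive label (mark = 1).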
