Classifying a Chinese Hotel Review Corpus: An LSTM with TensorFlow
Text classification of Chinese hotel reviews
The dataset is Tan Songbo's hotel review corpus. Each line starts with a sentiment label (-1 = negative, 1 = positive) followed by the review text, for example:

```
-1 标准间太差房间还不如3星的而且设施非常陈旧.建议酒店把老的标准间从新改善.
1 距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较为简单.
```
Setting up the tools
```python
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import numpy as np
import paddlehub as hub
import re

lac = hub.Module(name='lac')
```
PaddleHub is used here only because its LAC module segments Chinese text well; jieba, jiagu, or any other word segmenter works just as well.

Install paddlepaddle-gpu:

```shell
python -m pip install paddlepaddle-gpu==2.0.0rc0 -i https://mirror.baidu.com/pypi/simple
```

Install paddlehub:

```shell
pip install paddlehub --upgrade -i https://mirror.baidu.com/pypi/simple
```
Functions to strip punctuation and stopwords
```python
# A Chinese stopword list can be downloaded from the web
def remove_punctuation(line):
    line = str(line)
    if line.strip() == '':
        return ''
    # keep only ASCII letters, digits, and CJK characters
    rule = re.compile(u"[^a-zA-Z0-9\u4E00-\u9FA5]")
    return rule.sub('', line)

def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

stopwords = stopwordslist('ml-python-master/chineseStopWords.txt')
```
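As a standalone sanity check (using an illustrative string, not a line from the corpus), the character class above keeps only letters, digits, and CJK ideographs:

```python
import re

# Same character class as remove_punctuation: anything that is not an
# ASCII letter, a digit, or a CJK ideograph (U+4E00..U+9FA5) is stripped.
rule = re.compile(u"[^a-zA-Z0-9\u4E00-\u9FA5]")

sample = '房间不错, price: $99!'
cleaned = rule.sub('', sample)
print(cleaned)  # 房间不错price99
```

Spaces, punctuation, and currency symbols all disappear, which is why segmentation happens after this step, not before.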
Preprocessing
Reading the data
```python
neg_original_data = open('谭松波--酒店评论语料/utf-8/汇总/neg.txt', 'r', encoding='utf-8').read().strip().split('\n')
pos_original_data = open('谭松波--酒店评论语料/utf-8/汇总/pos.txt', 'r', encoding='utf-8').read().strip().split('\n')

all_original_label = []
all_original_data = []
for line in neg_original_data + pos_original_data:
    line = line.strip().split()
    try:
        all_original_data.append(' '.join(line[1:]))
        all_original_label.append(int(line[0]))
    except (ValueError, IndexError):  # skip malformed lines
        continue

len(all_original_data)  # 10000
```
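The split-on-whitespace parsing can be checked on the two sample lines quoted in the dataset description (a minimal standalone sketch of the loop above):

```python
samples = [
    '-1 标准间太差房间还不如3星的而且设施非常陈旧.建议酒店把老的标准间从新改善.',
    '1 距离川沙公路较近,但是公交指示不对',
]

labels, texts = [], []
for raw in samples:
    parts = raw.strip().split()       # first token is the label, the rest is text
    try:
        texts.append(' '.join(parts[1:]))
        labels.append(int(parts[0]))
    except (ValueError, IndexError):
        continue

print(labels)  # [-1, 1]
```

Because the review text contains no internal whitespace, rejoining `parts[1:]` reproduces it unchanged.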
Stripping punctuation and stopwords
```python
clean_x = [' '.join([w for w in lac.cut(remove_punctuation(line)) if w not in stopwords])
           for line in all_original_data]
```
Word embeddings
Method 1: the built-in TensorFlow tokenizer
```python
# Tokenizer with num_words=None keeps the full vocabulary
# Convert texts to integer sequences and pad them all to length 100 (150 also works)
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(clean_x)   # vocabulary size: 30178 -- needed when sizing the Embedding layer
x_train_sequences = tokenizer.texts_to_sequences(clean_x)
all_train_text = pad_sequences(x_train_sequences, maxlen=100, padding='post')
```
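What `Tokenizer` plus `pad_sequences` does can be sketched in plain Python with a toy four-word vocabulary (not the real 30178-word one): each word becomes an integer id, and every sequence is truncated or right-padded with zeros to a fixed length.

```python
# toy vocabulary: id 0 is reserved for padding, as in Keras
vocab = {'房间': 1, '干净': 2, '设施': 3, '陈旧': 4}

def to_sequence(text):
    return [vocab[w] for w in text.split() if w in vocab]

def pad_post(seq, maxlen):
    seq = seq[:maxlen]                      # truncate long sequences
    return seq + [0] * (maxlen - len(seq))  # right-pad short ones ('post')

padded = [pad_post(to_sequence(t), 5) for t in ['房间 干净', '设施 陈旧 房间']]
print(padded)  # [[1, 2, 0, 0, 0], [3, 4, 1, 0, 0]]
```

With `padding='post'` the zeros go at the end, so the real tokens keep their original positions.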
Method 2: pretrained Chinese word vectors
Source: the ACL 2018 paper "Analogical Reasoning on Chinese Morphological and Semantic Relations" (the Chinese Word Vectors project).
```python
from gensim.models import KeyedVectors

cn_model = KeyedVectors.load_word2vec_format('sgns.sogou.word', binary=False)

# 10000 reviews, at most 100 tokens each, 300-dimensional vectors
all_train_text = np.zeros((10000, 100, 300))
for i in range(len(clean_x)):
    line = clean_x[i].strip().split()
    for j, word in enumerate(line):
        if j == 100:
            break
        if word in cn_model:
            all_train_text[i][j] = cn_model[word]
        else:
            # out-of-vocabulary word: small random vector
            all_train_text[i][j] = np.random.uniform(-0.1, 0.1, 300)
```
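The lookup-with-random-fallback can be illustrated with a toy 3-dimensional "model" (a plain dict standing in for the gensim `KeyedVectors`; the word and dimensions are illustrative):

```python
import numpy as np

toy_vectors = {'房间': np.array([0.1, 0.2, 0.3])}  # stand-in for cn_model
maxlen, dim = 4, 3

def embed(tokens):
    out = np.zeros((maxlen, dim))
    for j, word in enumerate(tokens[:maxlen]):
        if word in toy_vectors:
            out[j] = toy_vectors[word]                   # known word: its vector
        else:
            out[j] = np.random.uniform(-0.1, 0.1, dim)   # unknown word: random
    return out

mat = embed(['房间', '未知词'])
print(mat.shape)  # (4, 3)
```

Positions past the last token stay at zero, which plays the role of padding in this fixed-shape tensor.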
Train/test split

```python
# Map the -1/1 labels to one-hot rows, then hold out 20% for testing
y_binary = to_categorical([1 if label == 1 else 0 for label in all_original_label])
x_train, x_test, y_train, y_test = train_test_split(all_train_text, y_binary,
                                                    test_size=0.2, random_state=0)
```
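The corpus labels are -1/1, while a 2-unit softmax expects one-hot targets, so the mapping is -1 → [1, 0] and 1 → [0, 1]. A minimal sketch of what `to_categorical` produces, without Keras:

```python
labels = [-1, 1, 1, -1]

# -1 -> class 0 -> [1, 0];  1 -> class 1 -> [0, 1]
y_binary = [[1.0, 0.0] if label == -1 else [0.0, 1.0] for label in labels]
print(y_binary)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```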
Models and results
Method 1: TensorFlow embedding
Building the model
```python
model = keras.Sequential()
model.add(keras.layers.Embedding(31000, 100))   # 31000 > vocabulary size of 30178
model.add(keras.layers.LSTM(60, return_sequences=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(30, return_sequences=True)))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.LSTM(30, return_sequences=False))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(2, activation='softmax'))
model.compile(optimizer=keras.optimizers.Adam(),
              # MAE on one-hot targets; categorical_crossentropy is the more conventional choice
              loss=keras.losses.mae,
              metrics=['acc'])
model.summary()
```
```
Layer (type)                 Output Shape              Param #
=================================================================
embedding_3 (Embedding)      (None, None, 100)         3100000
_________________________________________________________________
lstm_9 (LSTM)                (None, None, 60)          38640
_________________________________________________________________
bidirectional_3 (Bidirection (None, None, 60)          21840
_________________________________________________________________
dropout_6 (Dropout)          (None, None, 60)          0
_________________________________________________________________
lstm_11 (LSTM)               (None, 30)                10920
_________________________________________________________________
dropout_7 (Dropout)          (None, 30)                0
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 62
=================================================================
Total params: 3,171,462
Trainable params: 3,171,462
Non-trainable params: 0
```
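The parameter counts in the summary can be verified by hand: an LSTM with input dimension d and u units has 4(du + u² + u) weights (four gates, each with input weights, recurrent weights, and a bias), a Bidirectional wrapper doubles that, an Embedding has vocab × dim, and a Dense layer has in × out + out.

```python
def lstm_params(d, u):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (d * u + u * u + u)

embedding = 31000 * 100                 # 3,100,000
lstm1 = lstm_params(100, 60)            # 38,640
bilstm = 2 * lstm_params(60, 30)        # 21,840
lstm2 = lstm_params(60, 30)             # 10,920
dense = 30 * 2 + 2                      # 62

total = embedding + lstm1 + bilstm + lstm2 + dense
print(total)  # 3171462
```

Every line matches the summary above, including the 3,171,462 total.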
Training and test results

```python
# three epochs are enough to fit the training set
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_data=(x_test, y_test))
```
```
Train on 8000 samples, validate on 2000 samples
Epoch 1/3
8000/8000 [==============================] - 41s 5ms/step - loss: 0.0406 - acc: 0.9939 - val_loss: 1.6199e-05 - val_acc: 1.0000
Epoch 2/3
8000/8000 [==============================] - 39s 5ms/step - loss: 2.1606e-04 - acc: 1.0000 - val_loss: 8.0200e-06 - val_acc: 1.0000
Epoch 3/3
8000/8000 [==============================] - 39s 5ms/step - loss: 1.5019e-04 - acc: 1.0000 - val_loss: 5.6040e-06 - val_acc: 1.0000
```
Method 2: pretrained Chinese Word Vectors
Building the model
```python
model = keras.Sequential()
model.add(keras.layers.LSTM(60, input_shape=(100, 300), return_sequences=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(30, return_sequences=True)))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.LSTM(30, return_sequences=False))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(2, activation='softmax'))
model.compile(optimizer=keras.optimizers.Adam(),
              # MAE on one-hot targets, as in method 1
              loss=keras.losses.mae,
              metrics=['acc'])
model.summary()
```
```
Layer (type)                 Output Shape              Param #
=================================================================
lstm_10 (LSTM)               (None, 100, 60)           86640
_________________________________________________________________
bidirectional_3 (Bidirection (None, 100, 60)           21840
_________________________________________________________________
dropout_6 (Dropout)          (None, 100, 60)           0
_________________________________________________________________
lstm_12 (LSTM)               (None, 30)                10920
_________________________________________________________________
dropout_7 (Dropout)          (None, 30)                0
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 62
=================================================================
Total params: 119,462
Trainable params: 119,462
Non-trainable params: 0
```
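Without the Embedding layer, the 4(du + u² + u) formula accounts for this model's much smaller parameter count; the first LSTM now consumes 300-dimensional pretrained vectors directly.

```python
def lstm_params(d, u):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (d * u + u * u + u)

lstm1 = lstm_params(300, 60)       # 86,640
bilstm = 2 * lstm_params(60, 30)   # 21,840
lstm2 = lstm_params(60, 30)        # 10,920
dense = 30 * 2 + 2                 # 62
print(lstm1 + bilstm + lstm2 + dense)  # 119462
```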
Training and test results

Trained with the same `model.fit` call as in method 1; three epochs are again enough to fit.
```
Train on 8000 samples, validate on 2000 samples
Epoch 1/3
8000/8000 [==============================] - 45s 6ms/step - loss: 0.0464 - acc: 0.9938 - val_loss: 1.9094e-05 - val_acc: 1.0000
Epoch 2/3
8000/8000 [==============================] - 43s 5ms/step - loss: 1.7878e-04 - acc: 1.0000 - val_loss: 1.0102e-05 - val_acc: 1.0000
Epoch 3/3
8000/8000 [==============================] - 43s 5ms/step - loss: 1.3861e-04 - acc: 1.0000 - val_loss: 6.4585e-06 - val_acc: 1.0000
```