Chinese Hotel Review Classification: An LSTM in TensorFlow

The dataset is Prof. Tan Songbo's (谭松波) hotel review corpus. Each record starts with a label (-1 = negative, 1 = positive) followed by the review text, for example:

-1    标准间太差房间还不如3星的而且设施非常陈旧.建议酒店把老的标准间从新改善.
 1    距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较为简单.

Preparing the tools

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import numpy as np
import paddlehub as hub
import re

lac = hub.Module(name='lac')  # LAC lexical-analysis module, used for Chinese word segmentation

The paddlehub package is used here only because it segments Chinese text well; jieba, jiagu, or other tokenizers would also work. Install paddlepaddle-gpu with:

python -m pip install paddlepaddle-gpu==2.0.0rc0 -i https://mirror.baidu.com/pypi/simple 

Install paddlehub with:

pip install paddlehub --upgrade -i https://mirror.baidu.com/pypi/simple
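
With the tools installed, a quick sanity check that LAC loads and segments text (the exact segmentation shown is illustrative and may vary by LAC version):

import paddlehub as hub

lac = hub.Module(name='lac')
print(lac.cut('标准间太差'))  # e.g. ['标准间', '太', '差']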

Functions for removing punctuation and stop words

# A Chinese stop word list can be downloaded online
def remove_punctuation(line):
    # Keep only letters, digits, and CJK characters
    line = str(line)
    if line.strip() == '':
        return ''
    rule = re.compile(u"[^a-zA-Z0-9\u4E00-\u9FA5]")
    return rule.sub('', line)

def stopwordslist(filepath):
    # One stop word per line
    with open(filepath, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

stopwords = stopwordslist('ml-python-master/chineseStopWords.txt')
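
For example (a made-up input, run interactively), remove_punctuation strips everything except letters, digits, and Chinese characters:

remove_punctuation('标准间太差!房间还不如3星的…')  # -> '标准间太差房间还不如3星的'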

Preprocessing

Loading the data

neg_original_data = open('谭松波--酒店评论语料/utf-8/汇总/neg.txt', 'r', encoding='utf-8').read().strip().split('\n')
pos_original_data = open('谭松波--酒店评论语料/utf-8/汇总/pos.txt', 'r', encoding='utf-8').read().strip().split('\n')

all_original_label = []
all_original_data = []
for line in neg_original_data + pos_original_data:
    line = line.strip().split()
    try:
        # Parse the label first so a malformed line leaves both lists untouched
        label = int(line[0])
        all_original_data.append(' '.join(line[1:]))
        all_original_label.append(label)
    except (IndexError, ValueError):
        continue

len(all_original_data)  # 10000
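
Before cleaning, it is worth glancing at the class balance (a small diagnostic, not part of the original pipeline):

from collections import Counter
print(Counter(all_original_label))  # distribution of -1 / 1 labels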

Removing punctuation and stop words

# Segment each review with LAC, then drop stop words
clean_x = [' '.join(w for w in lac.cut(remove_punctuation(line)) if w not in stopwords)
           for line in all_original_data]

Word embeddings

Method 1: TensorFlow's Tokenizer

# Build a Tokenizer with num_words=None (keep the full vocabulary),
# convert each text to a sequence, and pad all to length 100 (150 also works)
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(clean_x)  # vocabulary size: 30178; used to size the Embedding layer

x_train_sequences = tokenizer.texts_to_sequences(clean_x)
all_train_text = pad_sequences(x_train_sequences,maxlen=100,padding='post')
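
The vocabulary size determines the Embedding layer's input_dim below; a quick check (added here for clarity, not in the original post):

vocab_size = len(tokenizer.word_index)  # 30178 for this corpus
print(all_train_text.shape)             # (10000, 100) after padding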

Method 2: Pretrained Chinese word vectors

Source: the ACL 2018 paper "Analogical Reasoning on Chinese Morphological and Semantic Relations" (the Chinese Word Vectors project).

from gensim.models import KeyedVectors

# Load the pretrained 300-dimensional vectors (plain-text word2vec format)
cn_model = KeyedVectors.load_word2vec_format('sgns.sogou.word', binary=False)

# Embed each review as a 100x300 matrix: look up pretrained vectors, and fall
# back to small random vectors for out-of-vocabulary words
all_train_text = np.zeros((10000, 100, 300))
for i in range(len(clean_x)):
    line = clean_x[i].strip().split()
    j = 0
    for word in line:
        if j == 100:  # truncate reviews longer than 100 tokens
            break
        if word in cn_model:
            all_train_text[i][j] = cn_model[word]
        else:
            all_train_text[i][j] = np.random.uniform(-0.1, 0.1, 300)
        j += 1
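
A rough diagnostic, not part of the original post: measuring how many tokens miss the pretrained vocabulary shows how often the random fallback above fires.

# Fraction of tokens not covered by the pretrained vectors
total = oov = 0
for line in clean_x:
    for word in line.split():
        total += 1
        oov += word not in cn_model
print(f'OOV rate: {oov / total:.2%}')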

Train/test split
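
y_binary is never defined in the original post; a minimal reconstruction, assuming the -1/1 labels are mapped to one-hot vectors for the 2-unit softmax output used below:

# Reconstruction (not in the original): map -1 -> 0, 1 -> 1, then one-hot encode
y_binary = to_categorical([0 if lab == -1 else 1 for lab in all_original_label], num_classes=2)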

# Hold out 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(all_train_text,y_binary,test_size=0.2, random_state=0)

Model building and results

Method 1: using the TensorFlow Embedding layer

Building the model

model = keras.Sequential()

# input_dim=31000 comfortably covers the 30178-word vocabulary
model.add(keras.layers.Embedding(31000, 100))
model.add(keras.layers.LSTM(60, return_sequences=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(30, return_sequences=True)))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.LSTM(30, return_sequences=False))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(2, activation='softmax'))

# Note: MAE is an unusual loss for a softmax classifier; categorical_crossentropy
# is the conventional choice (MAE is kept here to match the results below)
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.mae,
              metrics=['acc'])
model.summary()
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 100)         3100000   
_________________________________________________________________
lstm_9 (LSTM)                (None, None, 60)          38640     
_________________________________________________________________
bidirectional_3 (Bidirection (None, None, 60)          21840     
_________________________________________________________________
dropout_6 (Dropout)          (None, None, 60)          0         
_________________________________________________________________
lstm_11 (LSTM)               (None, 30)                10920     
_________________________________________________________________
dropout_7 (Dropout)          (None, 30)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 62        
=================================================================
Total params: 3,171,462
Trainable params: 3,171,462
Non-trainable params: 0

Training and test results

# 3 epochs are enough to fit the training data
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_data=(x_test, y_test))

Train on 8000 samples, validate on 2000 samples
Epoch 1/3
8000/8000 [==============================] - 41s 5ms/step - loss: 0.0406 - acc: 0.9939 - val_loss: 1.6199e-05 - val_acc: 1.0000
Epoch 2/3
8000/8000 [==============================] - 39s 5ms/step - loss: 2.1606e-04 - acc: 1.0000 - val_loss: 8.0200e-06 - val_acc: 1.0000
Epoch 3/3
8000/8000 [==============================] - 39s 5ms/step - loss: 1.5019e-04 - acc: 1.0000 - val_loss: 5.6040e-06 - val_acc: 1.0000
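
To score a new review with Method 1's pipeline (the review text here is made up; class index 1 corresponds to positive under the label mapping reconstructed above):

# Hypothetical example of classifying a new review with Method 1
review = '房间很干净,服务也不错'
tokens = ' '.join(w for w in lac.cut(remove_punctuation(review)) if w not in stopwords)
seq = pad_sequences(tokenizer.texts_to_sequences([tokens]), maxlen=100, padding='post')
print(model.predict(seq).argmax(axis=1))  # 1 = positive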

Method 2: using the pretrained Chinese Word Vectors

Building the model

model = keras.Sequential()

# No Embedding layer here: inputs are already 100x300 matrices of pretrained vectors
model.add(keras.layers.LSTM(60, input_shape=(100, 300), return_sequences=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(30, return_sequences=True)))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.LSTM(30, return_sequences=False))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(2, activation='softmax'))

# As above, categorical_crossentropy would be the more conventional loss
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.mae,
              metrics=['acc'])

model.summary()
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_10 (LSTM)               (None, 100, 60)           86640     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 100, 60)           21840     
_________________________________________________________________
dropout_6 (Dropout)          (None, 100, 60)           0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 30)                10920     
_________________________________________________________________
dropout_7 (Dropout)          (None, 30)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 62        
=================================================================
Total params: 119,462
Trainable params: 119,462
Non-trainable params: 0

Training and test results

# 3 epochs again; the fit call mirrors Method 1
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_data=(x_test, y_test))
Train on 8000 samples, validate on 2000 samples
Epoch 1/3
8000/8000 [==============================] - 45s 6ms/step - loss: 0.0464 - acc: 0.9938 - val_loss: 1.9094e-05 - val_acc: 1.0000
Epoch 2/3
8000/8000 [==============================] - 43s 5ms/step - loss: 1.7878e-04 - acc: 1.0000 - val_loss: 1.0102e-05 - val_acc: 1.0000
Epoch 3/3
8000/8000 [==============================] - 43s 5ms/step - loss: 1.3861e-04 - acc: 1.0000 - val_loss: 6.4585e-06 - val_acc: 1.0000
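
Scoring a new review with Method 2 requires embedding it the same way as the training data (again a made-up review; index 1 = positive under the label mapping above):

# Hypothetical example: embed a new review with the pretrained vectors, then predict
review = '设施陈旧,服务态度一般'
tokens = [w for w in lac.cut(remove_punctuation(review)) if w not in stopwords]
vec = np.zeros((1, 100, 300))
for j, word in enumerate(tokens[:100]):
    vec[0][j] = cn_model[word] if word in cn_model else np.random.uniform(-0.1, 0.1, 300)
print(model.predict(vec).argmax(axis=1))  # 1 = positive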