Single-line text classification
Text classification is a common task in natural language processing. With the wide adoption of deep learning for NLP, CNN- and RNN-based classifiers have been developed on top of the traditional methods. Here are my notes on a few of them:
RNN-based text classification
# inside a model-building method of a classifier class; `keras` is imported
# at module level and `self` carries the hyperparameters
input0 = keras.layers.Input(shape=[self.MAX_SEQUENCE_LENGTH])
embedding = keras.layers.Embedding(input_dim=self.vocabulary_size,
                                   output_dim=self.embedding_dim)(input0)
# unidirectional LSTM
# lstm = keras.layers.LSTM(units=self.lstm_unit)(embedding)
# bidirectional LSTM
# bilstm = keras.layers.Bidirectional(keras.layers.LSTM(units=self.lstm_unit))(embedding)
# BiLSTM + attention
l_lstm = keras.layers.Bidirectional(keras.layers.LSTM(units=self.lstm_unit,
                                                      return_sequences=True))(embedding)
l_dense = keras.layers.TimeDistributed(keras.layers.Dense(200))(l_lstm)  # applied to every word in the sentence
l_att = AttentionLayer()(l_dense)  # custom attention layer; pools the sequence into one vector
out1 = keras.layers.Dense(units=128, activation="tanh")(l_att)
output = keras.layers.Dense(units=self.num_tags, activation='softmax')(out1)
model = keras.models.Model(input0, output)
model.compile(loss='categorical_crossentropy',
              optimizer='Adadelta',
              metrics=['accuracy'])
print(model.summary())
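The snippet above uses a custom `AttentionLayer` that is not defined anywhere in these notes. A common choice for this slot is additive word attention in the style of Yang et al.'s hierarchical attention network; the NumPy sketch below shows the math such a layer would compute for one sentence. The parameter names `W`, `b`, `u` and the exact formulation are assumptions, not the original layer:

```python
import numpy as np

def attention_pool(h, W, b, u):
    """Additive attention over timesteps (Yang et al.-style word attention).

    h: (time, dim) hidden states, e.g. the TimeDistributed(Dense(200)) output
    for one sentence; W (dim, dim), b (dim,), u (dim,) are learned parameters.
    Returns a (dim,) attention-weighted sum of the timesteps.
    """
    scores = np.tanh(h @ W + b) @ u      # one relevance score per timestep
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()          # softmax over the time axis
    return alpha @ h                     # weighted sum of hidden states
```

Because the weights are a softmax, the output is a convex combination of the timestep vectors, which is what lets the model focus on the most informative words.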
CNN-based text classification
# assumes: from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, \
#     Flatten, Dropout, Dense, concatenate
# and: from keras.models import Model
comment_seq = Input(shape=[MAX_SEQUENCE_LENGTH], name='x_seq')
embedding_layer = Embedding(nb_words + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix])(comment_seq)
# branch 1: kernel size 2
model2_in = Conv1D(filters=128, kernel_size=2, activation='tanh', padding="same")(embedding_layer)
model3_in = MaxPooling1D(pool_size=2, padding="same")(model2_in)
model4_in = Conv1D(128, 2, activation='tanh', padding="same")(model3_in)
model5_in = MaxPooling1D(2, padding="same")(model4_in)
model6_in = Conv1D(128, 2, activation='tanh', padding="same")(model5_in)
model8_in = MaxPooling1D(2, padding="same")(model6_in)
model9_in = Flatten()(model8_in)
# branch 2: kernel size 4
model2_in1 = Conv1D(128, 4, activation='tanh', padding="same")(embedding_layer)
model3_in1 = MaxPooling1D(4, padding="same")(model2_in1)
model4_in1 = Conv1D(128, 4, activation='tanh', padding="same")(model3_in1)
model5_in1 = MaxPooling1D(4, padding="same")(model4_in1)
model6_in1 = Conv1D(128, 4, activation='tanh', padding="same")(model5_in1)
model8_in1 = MaxPooling1D(2, padding="same")(model6_in1)
model9_in1 = Flatten()(model8_in1)
# branch 3: kernel size 6
model2_in2 = Conv1D(128, 6, activation='tanh', padding="same")(embedding_layer)
model3_in2 = MaxPooling1D(6, padding="same")(model2_in2)
model4_in2 = Conv1D(128, 6, activation='tanh', padding="same")(model3_in2)
model5_in2 = MaxPooling1D(6, padding="same")(model4_in2)
model6_in2 = Conv1D(128, 6, activation='tanh', padding="same")(model5_in2)
model8_in2 = MaxPooling1D(2, padding="same")(model6_in2)
model9_in2 = Flatten()(model8_in2)
# merge the three branches
merged = concatenate([model9_in, model9_in1, model9_in2], axis=-1)
out = Dropout(0.5)(merged)
out1 = Dense(units=128, activation="tanh")(out)
output = Dense(len(labels_index), activation='softmax')(out1)
model = Model([comment_seq], output)
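The CNN model loads pretrained vectors through `weights=[embedding_matrix]`, but the matrix itself is never built in the snippet. A minimal sketch of the usual construction follows; the toy `word_index` and `pretrained_vectors` are stand-ins for a real Tokenizer vocabulary and a word2vec/GloVe file:

```python
import numpy as np

# Toy stand-ins (assumptions): word_index would come from a Keras Tokenizer,
# pretrained_vectors from a word2vec/GloVe file.
word_index = {"good": 1, "movie": 2}
pretrained_vectors = {"good": np.array([0.1, 0.2]),
                      "movie": np.array([0.3, 0.4])}
EMBEDDING_DIM = 2
nb_words = len(word_index)

# Row i holds the vector for word id i; row 0 stays zero for the padding id.
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vec = pretrained_vectors.get(word)
    if vec is not None:
        embedding_matrix[i] = vec  # words without a pretrained vector stay zero
```

The `nb_words + 1` in the Embedding layer matches this layout: one extra row for the padding id 0.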
Text classification with RNN + CNN
The CNN part here is kept simple: a single convolution plus pooling layer.
input0 = keras.layers.Input(shape=[self.MAX_SEQUENCE_LENGTH])
embedding = keras.layers.Embedding(input_dim=self.vocabulary_size,
                                   output_dim=self.embedding_dim)(input0)
l_lstm = keras.layers.Bidirectional(keras.layers.LSTM(units=self.lstm_unit,
                                                      return_sequences=True))(embedding)
l_dense = keras.layers.TimeDistributed(keras.layers.Dense(200))(l_lstm)
conv = keras.layers.Conv1D(filters=128, kernel_size=5, activation='tanh')(l_dense)
pool = keras.layers.MaxPooling1D(pool_size=5)(conv)
flat = keras.layers.Flatten()(pool)
out1 = keras.layers.Dense(units=128, activation="tanh")(flat)
output = keras.layers.Dense(units=self.num_tags, activation='softmax')(out1)
model = keras.models.Model(input0, output)
Loss function, training, and evaluation
model.compile(loss='categorical_crossentropy',
              optimizer='Adadelta',
              metrics=['accuracy'])
print(model.summary())
# nb_epoch was renamed to epochs in Keras 2
history = model.fit(x_train, y_train, epochs=20, batch_size=16,
                    validation_data=(x_val, y_val))
model.save('model.h5')
score = model.evaluate(x_train, y_train, verbose=0)
print('Train score:', score[0])
print('Train accuracy:', score[1])
score = model.evaluate(x_val, y_val, verbose=0)
print('Validation score:', score[0])
print('Validation accuracy:', score[1])
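`x_train` and `y_train` above are assumed to be padded integer id sequences and one-hot label vectors. A dependency-free sketch of that preprocessing, mirroring the defaults of Keras's `pad_sequences` (which pads and truncates at the front):

```python
def pad_sequence(ids, max_len, pad_id=0):
    # Mirror the Keras pad_sequences defaults: keep the LAST max_len ids
    # (truncating='pre') and pad shorter sequences at the front (padding='pre').
    ids = ids[-max_len:]
    return [pad_id] * (max_len - len(ids)) + list(ids)

def one_hot(label, num_tags):
    # a y_train row for categorical_crossentropy: 1.0 at the label index
    vec = [0.0] * num_tags
    vec[label] = 1.0
    return vec
```

In practice `keras.preprocessing.sequence.pad_sequences` and `keras.utils.to_categorical` do the same job in bulk.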
Experimental results:
These numbers reflect only my own training data and say nothing about general applicability.
In practice, the results of these models differ very little from one another.
Compared with the xgboost results, they are slightly worse.
Some questions and reflections:
- Feature representation
Deep learning trains end to end and extracts features automatically. When deep learning is applied to images, the visualized features are mostly edge-like lines, which explains how images are composed quite well. In NLP, the features intuitively ought to be words, since words are the basic units of sentences. Word order plays a role analogous to the spatial arrangement of those lines; but stripped of positional information, weights alone seem to capture only the surface characteristics of the content, not its internal details.
- Data quality
Classification problems have properties of their own. For example, if 30% of the instances of "你真牛逼" ("you're awesome") are labeled sarcasm and 70% praise, no classifier can do well on that phrase. Extra information is needed to capture the context, but the context faces the same labeling problem.
- Ongoing questions
I am still following how deep learning represents information, and why it is able to.
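The label-conflict point can be made concrete: when identical inputs carry conflicting labels, the best possible classifier simply predicts the majority label, so its accuracy on those inputs is capped at the majority share. A small sketch, using the hypothetical 30%/70% split from above:

```python
from collections import Counter

def accuracy_ceiling(labels):
    # With identical inputs carrying conflicting labels, the best a classifier
    # can do is always predict the majority label, so accuracy is capped at
    # the majority label's share of the data.
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# hypothetical split from the discussion: 30% sarcasm, 70% praise
labels = ["sarcasm"] * 3 + ["praise"] * 7
```

No model choice, deep or otherwise, can push accuracy on such a phrase past this ceiling; only extra context can.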