This chapter uses the SMS Spam Collection dataset as the running example for SMS spam detection. The previous four posts, (1) through (4), covered feature-vector extraction for this dataset, including the bag-of-words and TF-IDF models, the vocabulary model, and the Word2Vec and Doc2Vec models:
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(1)_mooyuan的博客-CSDN博客
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(2)_mooyuan的博客-CSDN博客
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(3)_mooyuan的博客-CSDN博客
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(4)_mooyuan的博客-CSDN博客
This post covers the machine learning and deep learning models and their validation results, including naive Bayes, support vector machines, XGBoost, and MLP. The setup mirrors the spam-email task in Chapter 6 and the negative-review task in Chapter 7; only the content being classified changes to SMS messages, and all are binary classification problems.
3. Building the Models
(1) NB Model
1. NB with the bag-of-words and word-set models
def do_nb_wordbag(x_train, x_test, y_train, y_test):
    print("NB and wordbag")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The output is as follows:
NB and wordbag
              precision    recall  f1-score   support

           0       0.99      0.66      0.79      1918
           1       0.31      0.96      0.47       312

    accuracy                           0.70      2230
   macro avg       0.65      0.81      0.63      2230
weighted avg       0.90      0.70      0.74      2230

[[1258  660]
 [  12  300]]
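Every number in that report can be recomputed from the confusion matrix; a quick numpy sanity check (not part of the book's code):

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[1258, 660],
               [  12, 300]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / total
recall_spam = cm[1, 1] / cm[1].sum()        # row 1: actual spam messages
precision_spam = cm[1, 1] / cm[:, 1].sum()  # column 1: predicted spam

print(round(accuracy, 2), round(recall_spam, 2), round(precision_spam, 2))
```

The values match the report: 0.70 accuracy, 0.96 recall and 0.31 precision for the spam class, confirming that NB over bag-of-words catches nearly all spam but at the cost of many false alarms.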
2. NB with the word2vec model
def do_nb_word2vec(x_train, x_test, y_train, y_test):
    print("NB and word2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
The output is as follows:
NB and word2vec
0.973542600896861
[[1884   35]
 [  24  287]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1919
           1       0.89      0.92      0.91       311

    accuracy                           0.97      2230
   macro avg       0.94      0.95      0.95      2230
weighted avg       0.97      0.97      0.97      2230
3. NB with the doc2vec model
def do_nb_doc2vec(x_train, x_test, y_train, y_test):
    print("NB and doc2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The output is as follows:
NB and doc2vec
0.647085201793722
[[1372  574]
 [ 213   71]]
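GaussianNB fits a per-feature Gaussian likelihood, so dense doc2vec dimensions with very different scales can degrade it, which may contribute to the weak 0.647 accuracy here. A hedged sketch of one possible remedy, standardizing features first, using synthetic data standing in for the real doc2vec vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for doc2vec vectors; the real features come from the book's pipeline
X, y = make_classification(n_samples=1000, n_features=50, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Zero-mean / unit-variance scaling before the Gaussian likelihood model
model = make_pipeline(StandardScaler(), GaussianNB())
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)
```

Whether scaling helps on the actual doc2vec features would need to be verified empirically; this only illustrates the mechanics.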
(2) SVM Model
This part follows the same logic as the NB section; only the model changes to an SVM. The source code is as follows:
def do_svm_wordbag(x_train, x_test, y_train, y_test):
    print("SVM and wordbag")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_svm_word2vec(x_train, x_test, y_train, y_test):
    print("SVM and word2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_svm_doc2vec(x_train, x_test, y_train, y_test):
    print("SVM and doc2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Sample output:
SVM and doc2vec
0.8726457399103139
[[1946    0]
 [ 284    0]]
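The confusion matrix shows the SVM predicting class 0 for every sample; the 0.8726 accuracy is just the majority-class share (1946/2230). A hedged sketch of one common remedy, `class_weight='balanced'`, on synthetic imbalanced data rather than the book's features:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data mimicking the roughly 87% ham / 13% spam split
X, y = make_classification(n_samples=2000, weights=[0.87, 0.13], random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# 'balanced' reweights errors inversely to class frequency,
# discouraging the all-majority-class solution seen above
clf = svm.SVC(class_weight='balanced')
clf.fit(x_train, y_train)
cm = confusion_matrix(y_test, clf.predict(x_test))
print(cm)
```

On the real doc2vec features the result may differ; the point is only that class weighting is the standard first lever to try when an SVM collapses to the majority class.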
(3) XGBoost
XGBoost is a classification algorithm that has become popular in recent years. Originally developed by Tianqi Chen as a scalable, portable, distributed gradient boosting library, it can be installed and used from C++, Python, R, and other languages, and is now maintained by many contributors. The algorithm behind it is the gradient boosted decision tree, applicable to both classification and regression. Its standout feature is that it automatically exploits CPU multithreading for parallelism while also refining the algorithm to improve accuracy. Its debut was Kaggle's Higgs boson signal detection competition, where its efficiency and predictive accuracy drew wide attention on the competition forum and earned it a strong placement among more than 1,700 teams; as its reputation in the Kaggle community grew, teams have since used XGBoost to take first place outright. Kaggle itself was founded in Melbourne in 2010 by co-founder and CEO Anthony Goldbloom; it gives developers and data scientists a platform for hosting machine learning competitions, hosting datasets, and writing and sharing code, and has attracted some 800,000 data scientists.
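The gradient boosting idea behind XGBoost, an ensemble of shallow trees where each new tree is fit to the errors of those before it, can be sketched with scikit-learn's GradientBoostingClassifier standing in for `xgb.XGBClassifier` (synthetic data; not the book's code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of boosting rounds; learning_rate shrinks each
# tree's contribution; max_depth keeps the individual trees shallow
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(x_train, y_train)
acc = clf.score(x_test, y_test)
print(acc)
```

XGBoost adds regularization terms, sparsity handling, and parallel tree construction on top of this same boosting scheme, which is why it is usually faster and often slightly more accurate than the plain implementation shown here.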
def do_xgboost_wordbag(x_train, x_test, y_train, y_test):
    print("xgboost and wordbag")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_xgboost_word2vec(x_train, x_test, y_train, y_test):
    print("xgboost and word2vec")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Sample output (word2vec):
xgboost and word2vec
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1919
           1       0.98      0.92      0.95       311

    accuracy                           0.99      2230
   macro avg       0.98      0.96      0.97      2230
weighted avg       0.99      0.99      0.99      2230

[[1912    7]
 [  26  285]]
(4) Random Forest
def do_rf_doc2vec(x_train, x_test, y_train, y_test):
    print("rf and doc2vec")
    clf = RandomForestClassifier(n_estimators=10)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The doc2vec output is as follows:
rf and doc2vec
0.862780269058296
[[1919   27]
 [ 279    5]]
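`n_estimators=10` is small for a random forest; more trees generally reduce variance at the cost of training time. A hedged sketch on synthetic data (not the book's pipeline) comparing tree counts via cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=7)

# Mean 3-fold CV accuracy for a small vs. a moderate forest
results = {}
for n in (10, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=7)
    results[n] = cross_val_score(clf, X, y, cv=3).mean()
    print(n, results[n])
```

Whether more trees would rescue the poor spam recall seen above is a separate question; the doc2vec features themselves may simply carry too little signal here.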
(5) MLP
def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print("MLP and wordbag")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_dnn_word2vec(x_train, x_test, y_train, y_test):
    print("MLP and word2vec")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_dnn_doc2vec(x_train, x_test, y_train, y_test):
    print("MLP and doc2vec")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Here is the MLP wordbag output:
              precision    recall  f1-score   support

           0       0.86      1.00      0.92      1918
           1       0.00      0.00      0.00       312

    accuracy                           0.86      2230
   macro avg       0.43      0.50      0.46      2230
weighted avg       0.74      0.86      0.80      2230

0.8600896860986547
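The report shows the (5, 2) network collapsing to the majority class: class 1 precision and recall are both 0.00, and the 0.86 accuracy is just the ham share. One plausible remedy, sketched here on synthetic data rather than the book's features, is to standardize the inputs and widen the hidden layer:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data mimicking the imbalance of the SMS dataset
X, y = make_classification(n_samples=1000, weights=[0.86, 0.14], random_state=3)
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=3)

# Standardized inputs and a single wider hidden layer than the (5, 2) used above
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(solver='lbfgs', alpha=1e-5,
                                  hidden_layer_sizes=(32,), max_iter=500,
                                  random_state=1))
clf.fit(x_train, y_train)
acc = clf.score(x_test, y_test)
print(acc)
```

A two-neuron bottleneck in the second hidden layer leaves very little capacity, so a degenerate solution is unsurprising; architecture and scaling are the first things to revisit.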
The doc2vec run, by contrast, prints the classifier parameters and then the results:
MLP and doc2vec
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(5, 2), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
0.8582959641255605
[[1906 40]
[ 276 8]]
(6) CNN
Note that only the wordbag model is padded with pad_sequences; the word2vec variants skip this step (the calls are commented out).
1. wordbag model
def do_cnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("CNN and tf")
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
2. word2vec model
def do_cnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and word2vec")
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
3. Word2Vec 2D model 1
def do_cnn_word2vec_2d(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec2d")
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = conv_2d(network, 32, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = conv_2d(network, 64, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = fully_connected(network, 128, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.01,
                         loss='categorical_crossentropy', name='target')
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
4. word2vec_2d model 2
def do_cnn_word2vec_2d_345(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec_2d_345")
    y_test = testY
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = tflearn.embedding(network, input_dim=1, output_dim=128, validate_indices=False)
    branch1 = conv_2d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_2d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_2d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool_2d(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
5. doc2vec
def do_cnn_doc2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and doc2vec")
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
(7) RNN
The wordbag source code is as follows:
def do_rnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("RNN and wordbag")
    y_test = testY
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_document_length])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
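The threshold loop above hand-rolls what numpy's argmax does for two-class softmax output: since the two probabilities sum to one, `i[0] > 0.5` is exactly "class 0 has the larger score". A minimal equivalent (illustrative values, not real model output):

```python
import numpy as np

# Example softmax outputs in the shape model.predict returns: [P(class 0), P(class 1)]
y_predict_list = np.array([[0.9, 0.1],
                           [0.2, 0.8],
                           [0.55, 0.45]])

# Index of the larger probability per row = predicted class label
y_predict = np.argmax(y_predict_list, axis=1).tolist()
print(y_predict)  # [0, 1, 0]
```

The argmax form also generalizes unchanged to more than two classes, which the 0.5 threshold does not.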
The Word2Vec source differs little from wordbag, except it does not print a detailed report:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
Adding code to print the test results (which also requires saving the original labels as y_test before they are one-hot encoded), the full function looks like this:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    y_test = testY  # keep the original labels for the report below
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
Throughout the source, many of the functions call pad_sequences, while others leave the call commented out. The purpose of pad_sequences is to guarantee that every feature vector has the same length (max_features or max_document_length), zero-padding anything shorter. In this codebase, get_features already guarantees fixed-length vectors, which is why some functions can safely skip the call; it is worth tracing that earlier logic to confirm this. Understanding what pad_sequences actually does here is the key point.
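tflearn is not needed to see what padding does; a minimal numpy stand-in for pad_sequences (pad-at-the-end behavior assumed here for illustration):

```python
import numpy as np

def pad_to(seqs, maxlen, value=0.0):
    # Truncate sequences longer than maxlen; pad shorter ones with `value` at the end
    out = np.full((len(seqs), maxlen), value)
    for i, s in enumerate(seqs):
        trunc = s[:maxlen]
        out[i, :len(trunc)] = trunc
    return out

padded = pad_to([[1, 2, 3, 4], [5]], maxlen=3)
print(padded)  # [[1. 2. 3.], [5. 0. 0.]]
```

Either way, the model always receives a rectangular matrix; whether padding happens in pad_sequences or earlier in get_features, the result must be the same fixed shape.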
Overall, the author's code carries few comments, and the key implementation details go unexplained; reviews of the book raise the same complaint. It is largely unsuitable for beginners, who will struggle without background knowledge and have to look up much of the material on their own, while readers who want real depth will find the treatment too shallow; it reads more like a survey of application techniques. Anyone aiming to do serious research in this direction should turn to the papers.