This chapter uses the SMS Spam Collection dataset as the running example for SMS spam detection. The previous four posts, (1) through (4), covered feature-vector extraction for this dataset, including the bag-of-words and TF-IDF models, the vocabulary model, and the Word2Vec and Doc2Vec models:
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(1)_mooyuan的博客-CSDN博客
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(2)_mooyuan的博客-CSDN博客
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(3)_mooyuan的博客-CSDN博客
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(4)_mooyuan的博客-CSDN博客
This post covers the machine learning and deep learning models and their validation results, including naive Bayes, support vector machines, XGBoost, and MLP. The setup mirrors the spam-email task in Chapter 6 and the negative-review task in Chapter 7; only the content being classified changes to SMS messages, and all are binary classification problems.
3. Building the Models
(1) NB Model
1. NB with the bag-of-words and word-set models
def do_nb_wordbag(x_train, x_test, y_train, y_test):
    print("NB and wordbag")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The output is as follows:
NB and wordbag
              precision    recall  f1-score   support

           0       0.99      0.66      0.79      1918
           1       0.31      0.96      0.47       312

    accuracy                           0.70      2230
   macro avg       0.65      0.81      0.63      2230
weighted avg       0.90      0.70      0.74      2230

[[1258  660]
 [  12  300]]
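Every number in that report can be recomputed from the confusion matrix; a quick numpy sanity check (not part of the book's code):

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[1258, 660],
               [  12, 300]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / total
recall_spam = cm[1, 1] / cm[1].sum()        # row 1: actual spam messages
precision_spam = cm[1, 1] / cm[:, 1].sum()  # column 1: predicted spam

print(round(accuracy, 2), round(recall_spam, 2), round(precision_spam, 2))
```

The values match the report: 0.70 accuracy, 0.96 recall and 0.31 precision for the spam class, confirming that NB over bag-of-words catches nearly all spam but at the cost of many false alarms.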
2. NB with the word2vec model
def do_nb_word2vec(x_train, x_test, y_train, y_test):
    print("NB and word2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
The output is as follows:
NB and word2vec
0.973542600896861
[[1884   35]
 [  24  287]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1919
           1       0.89      0.92      0.91       311

    accuracy                           0.97      2230
   macro avg       0.94      0.95      0.95      2230
weighted avg       0.97      0.97      0.97      2230
3. NB with the doc2vec model
def do_nb_doc2vec(x_train, x_test, y_train, y_test):
    print("NB and doc2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The output is as follows:
NB and doc2vec
0.647085201793722
[[1372  574]
 [ 213   71]]
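GaussianNB fits a per-feature Gaussian likelihood, so dense doc2vec dimensions with very different scales can degrade it, which may contribute to the weak 0.647 accuracy here. A hedged sketch of one possible remedy, standardizing features first, using synthetic data standing in for the real doc2vec vectors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for doc2vec vectors; the real features come from the book's pipeline
X, y = make_classification(n_samples=1000, n_features=50, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Zero-mean / unit-variance scaling before the Gaussian likelihood model
model = make_pipeline(StandardScaler(), GaussianNB())
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)
```

Whether scaling helps on the actual doc2vec features would need to be verified empirically; this only illustrates the mechanics.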
(2) SVM Model
This part follows the same logic as the NB section; only the model changes to an SVM. The source code is as follows:
def do_svm_wordbag(x_train, x_test, y_train, y_test):
    print("SVM and wordbag")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_svm_word2vec(x_train, x_test, y_train, y_test):
    print("SVM and word2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_svm_doc2vec(x_train, x_test, y_train, y_test):
    print("SVM and doc2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Sample output:
SVM and doc2vec
0.8726457399103139
[[1946    0]
 [ 284    0]]
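The confusion matrix shows the SVM predicting class 0 for every sample; the 0.8726 accuracy is just the majority-class share (1946/2230). A hedged sketch of one common remedy, `class_weight='balanced'`, on synthetic imbalanced data rather than the book's features:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data mimicking the roughly 87% ham / 13% spam split
X, y = make_classification(n_samples=2000, weights=[0.87, 0.13], random_state=1)
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# 'balanced' reweights errors inversely to class frequency,
# discouraging the all-majority-class solution seen above
clf = svm.SVC(class_weight='balanced')
clf.fit(x_train, y_train)
cm = confusion_matrix(y_test, clf.predict(x_test))
print(cm)
```

On the real doc2vec features the result may differ; the point is only that class weighting is the standard first lever to try when an SVM collapses to the majority class.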
(3) XGBoost
XGBoost is a classification algorithm that has become popular in recent years. Originally developed by Tianqi Chen as a scalable, portable, distributed gradient boosting library, it can be installed and used from C++, Python, R, and other languages, and is now maintained by many contributors. The algorithm behind it is the gradient boosted decision tree, applicable to both classification and regression. Its standout feature is that it automatically exploits CPU multithreading for parallelism while also refining the algorithm to improve accuracy. Its debut was Kaggle's Higgs boson signal detection competition, where its efficiency and predictive accuracy drew wide attention on the competition forum and earned it a strong placement among more than 1,700 teams; as its reputation in the Kaggle community grew, teams have since used XGBoost to take first place outright. Kaggle itself was founded in Melbourne in 2010 by co-founder and CEO Anthony Goldbloom; it gives developers and data scientists a platform for hosting machine learning competitions, hosting datasets, and writing and sharing code, and has attracted some 800,000 data scientists.
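The gradient boosting idea behind XGBoost, an ensemble of shallow trees where each new tree is fit to the errors of those before it, can be sketched with scikit-learn's GradientBoostingClassifier standing in for `xgb.XGBClassifier` (synthetic data; not the book's code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of boosting rounds; learning_rate shrinks each
# tree's contribution; max_depth keeps the individual trees shallow
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(x_train, y_train)
acc = clf.score(x_test, y_test)
print(acc)
```

XGBoost adds regularization terms, sparsity handling, and parallel tree construction on top of this same boosting scheme, which is why it is usually faster and often slightly more accurate than the plain implementation shown here.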
def do_xgboost_wordbag(x_train, x_test, y_train, y_test):
    print("xgboost and wordbag")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_xgboost_word2vec(x_train, x_test, y_train, y_test):
    print("xgboost and word2vec")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Sample output (word2vec):
xgboost and word2vec
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1919
           1       0.98      0.92      0.95       311

    accuracy                           0.99      2230
   macro avg       0.98      0.96      0.97      2230
weighted avg       0.99      0.99      0.99      2230

[[1912    7]
 [  26  285]]
(4) Random Forest
def do_rf_doc2vec(x_train, x_test, y_train, y_test):
    print("rf and doc2vec")
    clf = RandomForestClassifier(n_estimators=10)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The doc2vec output is as follows:
rf and doc2vec
0.862780269058296
[[1919   27]
 [ 279    5]]
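`n_estimators=10` is small for a random forest; more trees generally reduce variance at the cost of training time. A hedged sketch on synthetic data (not the book's pipeline) comparing tree counts via cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=7)

# Mean 3-fold CV accuracy for a small vs. a moderate forest
results = {}
for n in (10, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=7)
    results[n] = cross_val_score(clf, X, y, cv=3).mean()
    print(n, results[n])
```

Whether more trees would rescue the poor spam recall seen above is a separate question; the doc2vec features themselves may simply carry too little signal here.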
(5) MLP
def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print("MLP and wordbag")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_dnn_word2vec(x_train, x_test, y_train, y_test):
    print("MLP and word2vec")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_dnn_doc2vec(x_train, x_test, y_train, y_test):
    print("MLP and doc2vec")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Here is the MLP wordbag output:
              precision    recall  f1-score   support

           0       0.86      1.00      0.92      1918
           1       0.00      0.00      0.00       312

    accuracy                           0.86      2230
   macro avg       0.43      0.50      0.46      2230
weighted avg       0.74      0.86      0.80      2230

0.8600896860986547
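The report shows the (5, 2) network collapsing to the majority class: class 1 precision and recall are both 0.00, and the 0.86 accuracy is just the ham share. One plausible remedy, sketched here on synthetic data rather than the book's features, is to standardize the inputs and widen the hidden layer:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data mimicking the imbalance of the SMS dataset
X, y = make_classification(n_samples=1000, weights=[0.86, 0.14], random_state=3)
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=3)

# Standardized inputs and a single wider hidden layer than the (5, 2) used above
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(solver='lbfgs', alpha=1e-5,
                                  hidden_layer_sizes=(32,), max_iter=500,
                                  random_state=1))
clf.fit(x_train, y_train)
acc = clf.score(x_test, y_test)
print(acc)
```

A two-neuron bottleneck in the second hidden layer leaves very little capacity, so a degenerate solution is unsurprising; architecture and scaling are the first things to revisit.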
The doc2vec run, by contrast, prints the classifier parameters and then the results:
MLP and doc2vec
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(5, 2), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
0.8582959641255605
[[1906 40]
[ 276 8]]
(6) CNN
Note that only the wordbag model is padded with pad_sequences; the word2vec variants skip this step (the calls are commented out).
1. wordbag model
def do_cnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("CNN and tf")
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
2. word2vec model
def do_cnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and word2vec")
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
3. Word2Vec 2D model 1
def do_cnn_word2vec_2d(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec2d")
    y_test = testY
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = conv_2d(network, 32, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = conv_2d(network, 64, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = fully_connected(network, 128, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.01,
                         loss='categorical_crossentropy', name='target')
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
4. word2vec_2d model 2
def do_cnn_word2vec_2d_345(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec_2d_345")
    y_test = testY
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = tflearn.embedding(network, input_dim=1, output_dim=128, validate_indices=False)
    branch1 = conv_2d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_2d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_2d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool_2d(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
5. doc2vec
def do_cnn_doc2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and doc2vec")
    #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    #testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
(7) RNN
The wordbag source code is as follows:
def do_rnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("RNN and wordbag")
    y_test = testY
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_document_length])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
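The threshold loop above hand-rolls what numpy's argmax does for two-class softmax output: since the two probabilities sum to one, `i[0] > 0.5` is exactly "class 0 has the larger score". A minimal equivalent (illustrative values, not real model output):

```python
import numpy as np

# Example softmax outputs in the shape model.predict returns: [P(class 0), P(class 1)]
y_predict_list = np.array([[0.9, 0.1],
                           [0.2, 0.8],
                           [0.55, 0.45]])

# Index of the larger probability per row = predicted class label
y_predict = np.argmax(y_predict_list, axis=1).tolist()
print(y_predict)  # [0, 1, 0]
```

The argmax form also generalizes unchanged to more than two classes, which the 0.5 threshold does not.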
The Word2Vec source differs little from wordbag, except it does not print a detailed report:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
Adding code to print the test results (which also requires saving the original labels as y_test before they are one-hot encoded), the full function looks like this:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    y_test = testY  # keep the original labels for the report below
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
Throughout the source, many of the functions call pad_sequences, while others leave the call commented out. The purpose of pad_sequences is to guarantee that every feature vector has the same length (max_features or max_document_length), zero-padding anything shorter. In this codebase, get_features already guarantees fixed-length vectors, which is why some functions can safely skip the call; it is worth tracing that earlier logic to confirm this. Understanding what pad_sequences actually does here is the key point.
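tflearn is not needed to see what padding does; a minimal numpy stand-in for pad_sequences (pad-at-the-end behavior assumed here for illustration):

```python
import numpy as np

def pad_to(seqs, maxlen, value=0.0):
    # Truncate sequences longer than maxlen; pad shorter ones with `value` at the end
    out = np.full((len(seqs), maxlen), value)
    for i, s in enumerate(seqs):
        trunc = s[:maxlen]
        out[i, :len(trunc)] = trunc
    return out

padded = pad_to([[1, 2, 3, 4], [5]], maxlen=3)
print(padded)  # [[1. 2. 3.], [5. 0. 0.]]
```

Either way, the model always receives a rectangular matrix; whether padding happens in pad_sequences or earlier in get_features, the result must be the same fixed shape.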
Overall, the author's code carries few comments, and the key implementation details go unexplained; reviews of the book raise the same complaint. It is largely unsuitable for beginners, who will struggle without background knowledge and have to look up much of the material on their own, while readers who want real depth will find the treatment too shallow; it reads more like a survey of application techniques. Anyone aiming to do serious research in this direction should turn to the papers.