This chapter uses the SMS Spam Collection dataset as an example to introduce techniques for identifying spam (nuisance) SMS messages. It covers the feature-extraction methods used — the bag-of-words and TF-IDF models, the word-set (vocabulary) model, and the Word2Vec and Doc2Vec models — as well as the classifiers and their validation results, including naive Bayes, support vector machines, XGBoost and the MLP. This section is similar to the spam e-mail chapter (Chapter 6) and the negative-review chapter (Chapter 7); only the content being identified changes to spam SMS, and all three are binary classification problems.
I. Dataset
The test data comes from the SMS Spam Collection dataset, a classic dataset for spam SMS identification built entirely from real text messages; it contains 4,831 legitimate (ham) messages and 747 spam messages. Download the archive from the official site and unpack it; the ham and spam messages are stored together in a single text file.
Read the data file SMSSpamCollection.txt line by line. Each line consists of a label and the message text separated by a tab character, so the label and the text can be obtained directly with split:
from sklearn.model_selection import train_test_split

def load_all_files():
    x = []
    y = []
    datafile = "../data/sms/smsspamcollection/SMSSpamCollection.txt"
    with open(datafile, encoding='utf-8') as f:
        for line in f:
            line = line.strip('\n')
            # each line is "<label>\t<message text>"
            label, text = line.split('\t')
            x.append(text)
            if label == 'ham':
                y.append(0)   # 0 = legitimate message
            else:
                y.append(1)   # 1 = spam
    # hold out 40% of the data as the test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
Calling the function:
x_train, x_test, y_train, y_test=load_all_files()
II. Feature Extraction
(1) Word-set (vocabulary) model
def get_features_by_tf():
    global max_document_length
    x_train, x_test, y_train, y_test = load_all_files()
    vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                                min_frequency=0,
                                                vocabulary=None,
                                                tokenizer_fn=None)
    x_train = vp.fit_transform(x_train, unused_y=None)
    x_train = np.array(list(x_train))
    x_test = vp.transform(x_test)
    x_test = np.array(list(x_test))
    return x_train, x_test, y_train, y_test
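VocabularyProcessor maps every message to a fixed-length sequence of word ids, padding with 0. As a rough illustration of what these word-set features look like, here is a minimal sketch (not part of the original code, toy corpus only) that mimics that behaviour without tflearn:

import numpy as np

def build_vocab_features(docs, max_document_length):
    # assign each distinct word an id starting at 1 (0 is reserved for padding),
    # then map every document to a fixed-length sequence of word ids
    vocab = {}
    features = np.zeros((len(docs), max_document_length), dtype=np.int64)
    for row, doc in enumerate(docs):
        for col, word in enumerate(doc.split()[:max_document_length]):
            vocab.setdefault(word, len(vocab) + 1)
            features[row, col] = vocab[word]
    return features

print(build_vocab_features(["free entry win prize", "call me later"], 6))
# expected: [[1 2 3 4 0 0]
#            [5 6 7 0 0 0]]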
(2) Bag-of-words model
def get_features_by_wordbag():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    # reuse the vocabulary learned on the training set so the test features align
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 vocabulary=vocabulary,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    print(vectorizer)
    x_test = vectorizer.fit_transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
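As a quick illustration of the bag-of-words representation (a toy example, not taken from the SMS data), CountVectorizer builds a vocabulary over the corpus and turns each document into a vector of term counts. Note that get_feature_names_out() assumes scikit-learn 1.0 or later; older releases use get_feature_names().

from sklearn.feature_extraction.text import CountVectorizer

toy = ["win a free prize now", "call me now please call"]
cv = CountVectorizer()
m = cv.fit_transform(toy)
print(cv.get_feature_names_out())   # ['call' 'free' 'me' 'now' 'please' 'prize' 'win']
print(m.toarray())
# [[0 1 0 1 0 1 1]
#  [2 0 1 1 1 0 0]]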
(3) TF-IDF model
def get_features_by_wordbag_tfidf():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 binary=True)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 vocabulary=vocabulary,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 binary=True)
    print(vectorizer)
    x_test = vectorizer.fit_transform(x_test)
    x_test = x_test.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    x_train = transformer.fit_transform(x_train)
    x_train = x_train.toarray()
    x_test = transformer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
Here x_train[0] is used as an example to show how its vector representation changes at each step.
The raw content of x_train[0] is:
Sorry,in meeting I'll call later
Next, scikit-learn's CountVectorizer class is used to compute term counts for the text:
vectorizer = CountVectorizer(decode_error='ignore',
                             strip_accents='ascii',
                             max_features=max_features,
                             stop_words='english',
                             max_df=1.0,
                             min_df=1,
                             binary=True)
x_train = vectorizer.fit_transform(x_train)
Printing x_train[0] at this point gives the sparse representation below; note that because max_features limits the vocabulary size, only the words kept in the vocabulary contribute non-zero entries:
(0, 308) 1
(0, 211) 1
(0, 194) 1
(0, 180) 1
Convert it to a dense vector:
x_train=x_train.toarray()
The result is as follows:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Apply the TF-IDF transformation:
transformer = TfidfTransformer(smooth_idf=False)
x_train=transformer.fit_transform(x_train)
After the transformation, x_train[0] is:
(0, 366) 0.6954286575055049
(0, 238) 0.7185951449321735
Convert it to a dense vector:
x_train=x_train.toarray()
After converting to a dense array, the result is:
(a vector of length max_features that is zero everywhere except for the values 0.71859514 and 0.69542866 at the two positions shown above)
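The two non-zero weights above come from TF-IDF weighting followed by L2 normalization of the row. A small sketch (toy counts, not taken from the SMS data) of what TfidfTransformer(smooth_idf=False) does:

from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

# 3 documents x 3 terms; term 0 appears in every document, term 2 in only one
counts = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])
# with smooth_idf=False, idf(t) = ln(n_docs / df(t)) + 1, and each row is L2-normalized,
# so rarer terms receive larger weights than terms that occur in every document
tfidf = TfidfTransformer(smooth_idf=False).fit_transform(counts)
print(np.round(tfidf.toarray(), 3))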
(4) N-gram model
Compared with the TF-IDF feature extraction above, the main differences are the ngram_range=(3, 3) and token_pattern parameters, so each feature is a three-word sequence (trigram) rather than a single word; everything else stays the same.
The code is as follows:
def get_features_by_ngram():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    vectorizer = CountVectorizer(decode_error='ignore',
                                 ngram_range=(3, 3),
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 token_pattern=r'\b\w+\b',
                                 binary=True)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    vectorizer = CountVectorizer(decode_error='ignore',
                                 ngram_range=(3, 3),
                                 strip_accents='ascii',
                                 vocabulary=vocabulary,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 token_pattern=r'\b\w+\b',
                                 binary=True)
    print(vectorizer)
    x_test = vectorizer.fit_transform(x_test)
    x_test = x_test.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    x_train = transformer.fit_transform(x_train)
    x_train = x_train.toarray()
    x_test = transformer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
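To make the trigram features concrete, here is a small sketch on a toy sentence (not from the original text; stop-word filtering is omitted for brevity, and get_feature_names_out() assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(3, 3), token_pattern=r'\b\w+\b')
vec.fit(["free entry in a weekly competition to win tickets"])
print(vec.get_feature_names_out())
# ['a weekly competition' 'competition to win' 'entry in a' 'free entry in'
#  'in a weekly' 'to win tickets' 'weekly competition to']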
(5) Word2Vec
1. Data preprocessing
Compared with the bag-of-words and word-set models, the Word2Vec pipeline adds the following preprocessing step. SMS messages may contain special punctuation characters, and these characters can also help identify spam messages. The characters to handle are:
punctuation = """.,?!:;(){}[]"""
A common approach is to surround each special character with spaces, so that it is preserved as a separate token when the text is later split with split():
def cleanText(corpus):
    punctuation = """.,?!:;(){}[]"""
    corpus = [z.lower().replace('\n', '') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    # treat punctuation as individual words
    for c in punctuation:
        corpus = [z.replace(c, ' %s ' % c) for z in corpus]
    corpus = [z.split() for z in corpus]
    return corpus
Process the training and test data with cleanText and merge them into a single corpus x for training Word2Vec:
x_train=cleanText(x_train)
x_test=cleanText(x_test)
x=x_train+x_test
2. Building the model
Initialize the Word2Vec object. size is the number of hidden-layer nodes in the Word2Vec network and therefore also the dimensionality of the resulting word vectors; window is the training window length; words occurring fewer than min_count times are ignored; iter is the number of training passes, and the gensim documentation strongly recommends increasing iter (default 5) to improve vector quality:
if os.path.exists(word2ver_bin):
    print("Find cache file %s" % word2ver_bin)
    model = gensim.models.Word2Vec.load(word2ver_bin)
else:
    # gensim < 4.0 API: in gensim 4.x, size becomes vector_size and iter becomes epochs
    model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(word2ver_bin)
3. Vectorizing with Word2Vec
After training, the word vectors are stored in the model variable and can be accessed like a dictionary. For example, the vector for the word love is obtained with:
model['love']
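Note that this indexing syntax belongs to older gensim releases; in gensim 4.x (an assumption about the installed version) word vectors are accessed through the wv attribute, which also works in gensim 3.x:

vec = model.wv['love']   # 1-D numpy array whose length equals size / vector_size
print(vec.shape)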
A useful property of Word2Vec is that the meaning of a sentence or short phrase can be approximated by summing the vectors of all its words and taking the average, for example:
vector("good boy") ≈ (model['good'] + model['boy']) / 2
Using this property, the Word2Vec vectors of the words and characters making up each message are summed and averaged:
def buildWordVector(imdb_w2v, text, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += imdb_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec
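A hypothetical usage sketch (assuming model and max_features are defined as above): each cleaned message is reduced to one row vector.

sample = ['sorry', 'in', 'meeting', 'call', 'later']
vec = buildWordVector(model, sample, max_features)
print(vec.shape)   # (1, max_features)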
Because words occurring fewer than min_count times are not trained, and the test samples may contain unhandled special characters, the KeyError is caught to keep the program from exiting abnormally. The training and test sets are processed in turn to obtain their Word2Vec features, which are then standardized with the scale function:
x_train= np.concatenate([buildWordVector(model,z, max_features) for z in x_train])
x_train = scale(x_train)
x_test= np.concatenate([buildWordVector(model,z, max_features) for z in x_test])
x_test = scale(x_test)
4. Standardization with scale
The scale function standardizes each feature dimension to zero mean and unit variance, so that dimensions with unusually large or small values do not dominate the classifier. For example:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
X_scaled = preprocessing.scale(X)
print(X_scaled)
The output is:
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
5. Complete source
Putting it together, the full feature-extraction flow is:
def get_features_by_word2vec():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    print(len(x_train), len(y_train))
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    print('before', len(y_train))
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = scale(x_train)
    print('after', len(x_train))
    print(x_train.shape)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = scale(x_test)
    return x_train, x_test, y_train, y_test
6. Example
Take x_train[0] as an example to walk through vectorization. First load the data from the file:
x_train, x_test, y_train, y_test=load_all_files()
After this call, x_train[0] is:
If you don't, your prize will go to another customer. T&C at www.t-c.biz 18+ 150p/min Polo Ltd Suite 373 London W1J 6HL Please call back if busy
After running cleanText, the tokenized x_train[0] becomes:
['if', 'you', "don't", ',', 'your', 'prize', 'will', 'go', 'to', 'another', 'customer', '.', 't&c', 'at', 'www', '.', 't-c', '.', 'biz', '18+', '150p/min', 'polo', 'ltd', 'suite', '373', 'london', 'w1j', '6hl', 'please', 'call', 'back', 'if', 'busy']
Next, build the Word2Vec model and vectorize the messages:
if os.path.exists(word2ver_bin):
    print("Find cache file %s" % word2ver_bin)
    model = gensim.models.Word2Vec.load(word2ver_bin)
else:
    model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(word2ver_bin)
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
Averaging the Word2Vec vectors of all the words in the message, as described above, turns x_train[0] into a single dense vector of length max_features, shown (abridged) below:
[ 0.5688461  -0.7963458  -0.53969711  0.42368432  1.7073138   1.17516173
  0.32935769  0.1749727  -1.10261336 -1.14618023 -0.64693019  0.03879264
  ...
  1.35748601  1.41187301  0.82758802  1.23182959]
(6) Word2Vec_1d
The only difference from the Word2Vec pipeline above is the standardization step: MinMaxScaler is used instead of scale. The processing flow is as follows.
1. Standardization with MinMaxScaler
To map every dimension into the range between 0 and 1, use the MinMaxScaler. For example, given the following input:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X)
print(X_train_minmax)
The output is:
[[0.5 0. 1. ]
[1. 0.5 0.33333333]
[0. 1. 0. ]]
Compared with the scale-based processing used for Word2Vec, the Word2Vec_1d standardization flow is:
min_max_scaler = preprocessing.MinMaxScaler()
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
x_train = min_max_scaler.fit_transform(x_train)
x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
x_test = min_max_scaler.transform(x_test)
2. Complete source
def get_features_by_word2vec_cnn_1d():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = min_max_scaler.fit_transform(x_train)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = min_max_scaler.transform(x_test)
    return x_train, x_test, y_train, y_test
(7) Word2Vec_2d
def get_features_by_word2vec_cnn_2d():
    global max_features
    global max_document_length
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train_vecs = []
    x_test_vecs = []
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    # instead of averaging, keep one Word2Vec vector per word: each SMS becomes a
    # (max_document_length, max_features) matrix, truncated or zero-padded as needed
    for sms in x_train:
        sms = sms[:max_document_length]
        x_train_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            x_train_vec[i] = model[w].reshape((1, max_features))
        x_train_vecs.append(x_train_vec)
    for sms in x_test:
        sms = sms[:max_document_length]
        x_test_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            x_test_vec[i] = model[w].reshape((1, max_features))
        x_test_vecs.append(x_test_vec)
    # fit the MinMaxScaler on the training matrices, then scale train and test alike
    min_max_scaler = preprocessing.MinMaxScaler()
    print("fit min_max_scaler")
    x_train_2d = np.concatenate([z for z in x_train_vecs])
    min_max_scaler.fit(x_train_2d)
    x_train = np.concatenate([min_max_scaler.transform(i) for i in x_train_vecs])
    x_test = np.concatenate([min_max_scaler.transform(i) for i in x_test_vecs])
    # reshape into 4-D tensors for the 2-D CNN: (samples, height, width, channels)
    x_train = x_train.reshape([-1, max_document_length, max_features, 1])
    x_test = x_test.reshape([-1, max_document_length, max_features, 1])
    return x_train, x_test, y_train, y_test
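Unlike the averaged Word2Vec features, this variant keeps one vector per word, so each message becomes a 2-D matrix that is later fed to the 2-D CNN. A hypothetical shape check (assuming the globals max_document_length and max_features are set as elsewhere in this chapter):

x_train, x_test, y_train, y_test = get_features_by_word2vec_cnn_2d()
print(x_train.shape)   # (n_train, max_document_length, max_features, 1)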
III. Building the Models
(1) Naive Bayes (NB)
1. NB with bag-of-words / word-set features
def do_nb_wordbag(x_train, x_test, y_train, y_test):
    print("NB and wordbag")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The results are as follows:
NB and wordbag
              precision    recall  f1-score   support

           0       0.99      0.66      0.79      1918
           1       0.31      0.96      0.47       312

    accuracy                           0.70      2230
   macro avg       0.65      0.81      0.63      2230
weighted avg       0.90      0.70      0.74      2230

[[1258  660]
 [  12  300]]
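Reading the confusion matrix above (rows are true labels, columns are predictions, 0 = ham, 1 = spam), the reported metrics can be reproduced by hand:

tn, fp, fn, tp = 1258, 660, 12, 300
accuracy       = (tn + tp) / (tn + fp + fn + tp)   # 1558 / 2230 ≈ 0.70
recall_spam    = tp / (tp + fn)                    # 300 / 312   ≈ 0.96
precision_spam = tp / (tp + fp)                    # 300 / 960   ≈ 0.31
print(accuracy, recall_spam, precision_spam)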
2. NB with Word2Vec features
def do_nb_word2vec(x_train, x_test, y_train, y_test):
    print("NB and word2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
3. NB with Doc2Vec features
def do_nb_doc2vec(x_train, x_test, y_train, y_test):
    print("NB and doc2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
(2) SVM
The logic here mirrors the NB section; only the classifier is replaced with an SVM. The source is as follows (the Word2Vec and Doc2Vec variants differ from the bag-of-words one only in the features they receive):
def do_svm_wordbag(x_train, x_test, y_train, y_test):
    print("SVM and wordbag")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
def do_svm_word2vec(x_train, x_test, y_train, y_test):
    print("SVM and word2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

def do_svm_doc2vec(x_train, x_test, y_train, y_test):
    print("SVM and doc2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
(3) XGBoost
XGBoost is a classification algorithm that has become popular in recent years. Originally developed by Tianqi Chen, it is a scalable, portable, distributed implementation of gradient boosting, available as a library for C++, Python, R and other languages, and now maintained by many contributors. The underlying algorithm is the gradient boosted decision tree, which can be used for both classification and regression. XGBoost's main strength is that it automatically exploits CPU multi-threading for parallelism while adding algorithmic improvements that raise accuracy. Its debut was Kaggle's Higgs boson signal-identification competition, where its efficiency and high prediction accuracy attracted wide attention on the competition forum and earned it a place among more than 1,700 competing teams; as its reputation in the Kaggle community grew, teams have since used XGBoost to win competitions outright. Kaggle itself, founded in Melbourne in 2010 by co-founder and CEO Anthony Goldbloom, is a platform where developers and data scientists can run machine-learning competitions, host datasets, and write and share code; it has attracted some 800,000 data scientists.
def do_xgboost_wordbag(x_train, x_test, y_train, y_test):
    print("xgboost and wordbag")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_xgboost_word2vec(x_train, x_test, y_train, y_test):
    print("xgboost and word2vec")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
(4) MLP
def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print("MLP and wordbag")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_dnn_word2vec(x_train, x_test, y_train, y_test):
    print("MLP and word2vec")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Sample results for the MLP with bag-of-words features are shown below. Note that recall for the spam class is 0: with these settings the network predicts every message as ham, so the 0.86 accuracy simply reflects the class imbalance.
              precision    recall  f1-score   support

           0       0.86      1.00      0.92      1918
           1       0.00      0.00      0.00       312

    accuracy                           0.86      2230
   macro avg       0.43      0.50      0.46      2230
weighted avg       0.74      0.86      0.80      2230

0.8600896860986547
(5) CNN
Note that padding with pad_sequences is applied only to the word-set (wordbag) features; the Word2Vec features are not padded.
1. Word-set (wordbag) features
def do_cnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("CNN and tf")
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
2. Word2Vec features
def do_cnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and word2vec")
    y_test = testY
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
3. Word2Vec 2d, model 1
def do_cnn_word2vec_2d(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec2d")
    y_test = testY
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = conv_2d(network, 32, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = conv_2d(network, 64, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = fully_connected(network, 128, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.01,
                         loss='categorical_crossentropy', name='target')
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
4. Word2Vec 2d, model 2
def do_cnn_word2vec_2d_345(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec_2d_345")
    y_test = testY
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = tflearn.embedding(network, input_dim=1, output_dim=128, validate_indices=False)
    branch1 = conv_2d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_2d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_2d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool_2d(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
5. Doc2Vec
def do_cnn_doc2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and doc2vec")
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
(6) RNN
The bag-of-words (wordbag) version is as follows:
def do_rnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("RNN and wordbag")
    y_test = testY
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_document_length])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
The Word2Vec version differs little from the wordbag one, except that it does not print a detailed report:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
To also print a test report, add the prediction and reporting code as follows:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    y_test = testY   # keep the original labels for the report below
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
On my machine the RNN required more memory than was available, so I was not able to run it.
Overall, the author's code carries few comments and does not explain some of the key details, a problem also raised in reviews of the book. For beginners it is hard to follow without background knowledge, and much has to be looked up independently; for readers who really need the details, the treatment is not deep enough either, amounting mostly to a demonstration of how to apply the methods. Anyone who wants to do serious work in this direction should still read the papers.