Notes on 《Web安全之深度学习实战》 (Deep Learning in Action for Web Security): Chapter 8, Spam SMS Detection (4)

This chapter uses the SMS Spam Collection dataset to introduce techniques for detecting spam SMS messages. This installment covers two feature-extraction approaches in detail: Word2Vec_1d and Word2Vec_2d.

(6) Word2Vec_1d

The difference from the plain word2vec approach lies in the normalization step, which works as follows:

1. MinMaxScaler normalization

To rescale every feature dimension into the range [0, 1], use the MinMaxScaler class. For example, with the following input:

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X)
print(X_train_minmax)

The output is:

[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]

Compared with the scale-based processing used for word2vec earlier, the normalization flow here is:

    min_max_scaler = preprocessing.MinMaxScaler()
    x_train= np.concatenate([buildWordVector(model,z, max_features) for z in x_train])
    x_train = min_max_scaler.fit_transform(x_train)
    x_test= np.concatenate([buildWordVector(model,z, max_features) for z in x_test])
    x_test = min_max_scaler.transform(x_test)
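One detail worth noting in the flow above: the scaler is fitted on the training data only, and the test data is transformed with the same learned per-column min/max, so both sets share one scale. A minimal standalone check (toy arrays, not the SMS features):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on training data only, then reuse the learned min/max for test data.
x_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
x_test = np.array([[5.0, 15.0]])

scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns min/max per column
x_test_scaled = scaler.transform(x_test)        # applies the SAME min/max

print(x_train_scaled)  # each column mapped into [0, 1]
print(x_test_scaled)   # roughly [[0.5, 0.25]]
```

Using fit_transform on x_test instead would let test-set statistics leak into the scaling and put the two sets on different scales.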

2. Complete source code

def get_features_by_word2vec_cnn_1d():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test=load_all_files()

    x_train=cleanText(x_train)
    x_test=cleanText(x_test)

    x=x_train+x_test
    cores=multiprocessing.cpu_count()

    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model=gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model=gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)

        model.build_vocab(x)

        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)

    min_max_scaler = preprocessing.MinMaxScaler()

    x_train= np.concatenate([buildWordVector(model,z, max_features) for z in x_train])
    x_train = min_max_scaler.fit_transform(x_train)
    x_test= np.concatenate([buildWordVector(model,z, max_features) for z in x_test])
    x_test = min_max_scaler.transform(x_test)

    return x_train, x_test, y_train, y_test
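The helper buildWordVector is called above but not listed in this installment. A minimal sketch of a plausible implementation, which, judging by its use, averages the Word2Vec vectors of a message's in-vocabulary words (the dict-based toy_model is a stand-in for a real gensim model):

```python
import numpy as np

# Sketch of buildWordVector: sum each in-vocabulary word's vector,
# then divide by the number of words found, yielding one (1, size)
# row per message.
def buildWordVector(model, text, size):
    vec = np.zeros((1, size))
    count = 0
    for word in text:
        try:
            vec += model[word].reshape((1, size))
            count += 1
        except KeyError:  # skip out-of-vocabulary words
            continue
    if count != 0:
        vec /= count
    return vec

# Toy stand-in for a trained model: a plain dict of 2-dim vectors.
toy_model = {"good": np.array([1.0, 3.0]), "boy": np.array([3.0, 1.0])}
print(buildWordVector(toy_model, ["good", "boy", "unknown"], 2))  # [[2. 2.]]
```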

(7) Word2Vec_2d

As noted in section five, Word2Vec has the property that the meaning of a sentence or short phrase can be obtained by averaging the Word2Vec vectors of all its words, for example:

model['good boy'] = (model['good'] + model['boy']) / 2

Note, however, that because the CNN here applies 2D convolutions to the word-vector matrix directly, this averaging is not needed. Taking the training set as an example, the code is:

    for sms in x_train:
        sms=sms[:max_document_length]
        x_train_vec = np.zeros((max_document_length, max_features))
        for i,w in enumerate(sms):
            try:
                vec=model[w].reshape((1, max_features))
                x_train_vec[i-1]=vec.copy()
            except KeyError:
                continue
        x_train_vecs.append(x_train_vec)

Note that the KeyError must be caught: any word missing from the model's vocabulary raises one when looked up. Putting everything together, the full source is:

def get_features_by_word2vec_cnn_2d():
    global max_features
    global max_document_length
    global word2ver_bin

    x_train, x_test, y_train, y_test=load_all_files()

    x_train_vecs=[]
    x_test_vecs=[]

    x_train=cleanText(x_train)
    x_test=cleanText(x_test)

    x=x_train+x_test
    cores=multiprocessing.cpu_count()

    if os.path.exists(word2ver_bin):
        print ("Find cache file %s" % word2ver_bin)
        model=gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model=gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)


    for sms in x_train:
        sms = sms[:max_document_length]
        x_train_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            try:
                vec = model[w].reshape((1, max_features))
                x_train_vec[i-1] = vec.copy()
            except KeyError:
                continue
        x_train_vecs.append(x_train_vec)
    for sms in x_test:
        sms = sms[:max_document_length]
        x_test_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            try:
                vec = model[w].reshape((1, max_features))
                x_test_vec[i-1] = vec.copy()
            except KeyError:
                continue
        x_test_vecs.append(x_test_vec)

    min_max_scaler = preprocessing.MinMaxScaler()
    print("fit min_max_scaler")
    x_train_2d=np.concatenate([z for z in x_train_vecs])
    min_max_scaler.fit(x_train_2d)

    x_train=np.concatenate([min_max_scaler.transform(i) for i in x_train_vecs])
    x_test=np.concatenate([min_max_scaler.transform(i) for i in x_test_vecs])

    x_train=x_train.reshape([-1, max_document_length, max_features, 1])
    x_test = x_test.reshape([-1, max_document_length, max_features, 1])

    return x_train, x_test, y_train, y_test
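The final reshape can be sanity-checked with toy sizes (4 and 3 in place of 160 and 400):

```python
import numpy as np

# The scaled matrices are stacked as (n_samples * max_document_length,
# max_features) and then reshaped into the 4D tensor a 2D CNN expects:
# (samples, height, width, channels).
max_document_length, max_features = 4, 3  # toy sizes instead of 160 / 400
flat = np.zeros((2 * max_document_length, max_features))  # two stacked messages
x = flat.reshape([-1, max_document_length, max_features, 1])
print(x.shape)  # (2, 4, 3, 1)
```

The -1 lets NumPy infer the sample count, and the trailing 1 is the single "channel" axis the convolution layer requires.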

Testing on the train set, the code below prints 400 and 160, i.e. each generated matrix has shape (160, 400):

    print('feature', max_features)
    print('sms length', max_document_length)

During initialization, these values produce a 160×400 zero matrix:

x_train_vec = np.zeros((max_document_length, max_features))

Take the following message as an example; its raw tokens are:

sms ['we', 'tried', 'to', 'contact', 'you', 're', 'our', 'offer', 'of', 'new', 'video', 'phone', '750', 'anytime', 'any', 'network', 'mins', 'half', 'price', 'rental', 'camcorder', 'call', '08000930705', 'or', 'reply', 'for', 'delivery', 'wed']

During computation, each word is mapped to a vector of shape (1, 400). For 'we', the feature vector is:

we [[ 5.13324559e-01  6.92280710e-01 -6.06139421e-01 -7.33043671e-01
  -1.05264068e+00  1.08887844e-01 -4.79352057e-01 -4.47796851e-01
   ...
   2.87916631e-01 -8.61096904e-02  1.61331439e+00 -1.03910947e+00]]

The word2vec vector for 'tried' is:

tried [[-6.12377405e-01  4.16364968e-01 -2.78998375e-01  8.34699646e-02
  -2.28010252e-01  3.08151543e-01  2.52343982e-01 -3.66741210e-01
   ...
   4.83773470e-01  4.12972480e-01 -3.56931061e-01 -5.20759337e-02]]

The final matrix for this message has shape (160, 400). Since it was initialized to all zeros, only rows for words that actually occur in the sentence need to be filled in, via:

x_train_vec[i-1]=vec.copy()

Finally, the matrix is appended to the list:

x_train_vecs.append(x_train_vec)

Note in the run below that rows -1 through (actual sentence length - 2) hold vectors, where -1 means the last row, i.e. row 159. This is a consequence of the assignment x_train_vec[i-1]=vec.copy(). The book leaves details like this uncommented; my best guess is that the author assumed the loop index i counts from 1, hence the i-1.

['at', 'home', 'by', 'the', 'way']
(160, 400) [[-1.58791387  0.90998811  1.02353275 ... -0.49089381  0.45235381
   0.196127  ]
 [-0.56197596  0.48324966 -0.02229143 ... -1.04505527 -0.41888034
  -0.89763647]
 [-0.18195434  0.69938725  0.01750618 ... -0.00903855  0.69490665
   0.02577416]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.7947135   0.95048201  0.8174293  ...  0.59431046  1.00592685
   0.34427088]]

If the assignment is changed to

x_train_vec[i] = vec.copy()

then the generated matrix has its non-zero vectors exactly in rows 0 through (length - 1), as in this test run:

sms ['oh', 'thanks', 'a', 'lot', '.', '.', 'i', 'already', 'bought', '2', 'eggs', '.', '.']
0 oh
1 thanks
2 a
3 lot
4 .
5 .
6 i
7 already
8 bought
9 2
10 eggs
11 .
12 .
(160, 400) [[ 0.51650995  0.87204415  0.0464043  ...  0.01095798  0.15478413
  -0.119578  ]
 [-0.10295536  0.97929972 -0.06606753 ... -0.0157218   0.04467017
   0.30258381]
 [-1.07609522  0.46579501  0.35544962 ...  0.36150065 -0.43277314
   0.32538563]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]
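The off-by-one is easy to reproduce in isolation: Python's enumerate counts from 0, so i-1 sends the first word's vector to row -1, which NumPy interprets as the last row. A toy check, with constant stand-in values instead of real word vectors:

```python
import numpy as np

# enumerate starts at 0, so i-1 writes the first word into row -1,
# which NumPy treats as the LAST row of the matrix.
words = ["at", "home", "by"]
mat = np.zeros((5, 2))
for i, w in enumerate(words):
    mat[i - 1] = i + 1.0  # stand-in for the word's vector

print(mat[-1])  # row for "at" landed at the end: [1. 1.]
print(mat[0])   # actually holds the vector for "home": [2. 2.]
```

With mat[i] instead, rows 0..len(words)-1 would be filled in order, matching the corrected output above.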

(8) Doc2Vec

Building on Word2Vec, Quoc Le and Tomas Mikolov proposed the Doc2Vec training method. Its principle is similar to Word2Vec's, and it comes in two variants: Distributed Memory (DM) and Distributed Bag of Words (DBOW).

The full source is:

def get_features_by_doc2vec():
    global max_features
    x_train, x_test, y_train, y_test=load_all_files()
    print('y:', len(y_train), len(y_test))
    print('x:', len(x_train), len(x_test))
    x_train=cleanText(x_train)
    x_test=cleanText(x_test)

    x_train = labelizeReviews(x_train, 'TRAIN')
    x_test = labelizeReviews(x_test, 'TEST')

    x=x_train+x_test
    cores=multiprocessing.cpu_count()

    if os.path.exists(doc2ver_bin):
        print ("Find cache file %s" % doc2ver_bin)
        model=Doc2Vec.load(doc2ver_bin)
    else:
        model=Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1,iter=60)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(doc2ver_bin)

    x_test=getVecs(model,x_test,max_features)
    x_train=getVecs(model,x_train,max_features)

    return x_train, x_test, y_train, y_test

Take x_train[0] as an example. The initialization code is:

x_train, x_test, y_train, y_test=load_all_files()

which returns:

That's my honeymoon outfit. :)

Next comes cleanText processing:

    x_train=cleanText(x_train)
    x_test=cleanText(x_test)

The result:

["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')']

Unlike Word2Vec, Doc2Vec requires each document (here, each English text) to carry a unique tag, and the documents must be stored in a purpose-built data format, defined as follows:

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

SentimentDocument is the name of this format (equivalently, of this kind of object). Its words field holds the document as a list of words and symbols, and its tags field holds the unique identifier. The simplest scheme numbers the documents sequentially: training documents are tagged 'TRAIN_<n>' and test documents 'TEST_<n>':

def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(SentimentDocument(v, [label]))
    return labelized

This post calls it as follows:

    x_train = labelizeReviews(x_train, 'TRAIN')
    x_test = labelizeReviews(x_test, 'TEST')

At this point, x_train[0] becomes:

SentimentDocument(words=["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')'], tags=['TRAIN_0'])

The code that trains the model, or loads it from cache, is:

    if os.path.exists(doc2ver_bin):
        print ("Find cache file %s" % doc2ver_bin)
        model=Doc2Vec.load(doc2ver_bin)
    else:
        model=Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1,iter=60)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(doc2ver_bin)

Once the model is trained, documents can be vectorized with:

def getVecs(model, corpus, size):
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.array(np.concatenate(vecs),dtype='float')
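getVecs can be exercised without a trained model by faking the docvecs lookup; here FakeModel and the size of 2 are stand-ins for the real Doc2Vec model and max_features:

```python
import numpy as np
from collections import namedtuple

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

# Same body as the post's getVecs: one (1, size) row per tagged document,
# stacked into a single (n_documents, size) array.
def getVecs(model, corpus, size):
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.array(np.concatenate(vecs), dtype='float')

# Stand-in for the trained Doc2Vec model: docvecs is just a dict here.
class FakeModel:
    docvecs = {'TRAIN_0': np.array([1.0, 2.0]),
               'TRAIN_1': np.array([3.0, 4.0])}

corpus = [SentimentDocument(['hi'], ['TRAIN_0']),
          SentimentDocument(['yo'], ['TRAIN_1'])]
result = getVecs(FakeModel(), corpus, size=2)
print(result.shape)  # (2, 2): one row per tagged document
```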

In this example, x_train[0] is vectorized as:

[ 1.07623920e-01  2.03667670e-01  4.39605879e-04  2.15817243e-02
 -1.33999720e-01 -1.04999244e-01  6.59373477e-02  1.42834812e-01
  ...
 -9.36061218e-02  1.43005680e-02  1.36256188e-01 -5.99846505e-02]

Note: this chapter's notes are not finished; because the notes for Section 8, spam SMS detection, are extensive, they are split into a series. The next installment is 《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(5).
