This chapter uses the SMS Spam Collection dataset as a running example for spam SMS detection. This installment explains two feature-extraction approaches in detail: Word2Vec_1d and Word2Vec_2d.
(6) Word2Vec_1d
The difference from the plain word2vec pipeline lies in the normalization step, which works as follows:
1. MinMaxScaler normalization
To rescale every dimension into the range [0, 1], use the MinMaxScaler class. For example, with this data before transformation:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X)
print(X_train_minmax)
The output is as follows:
[[0.5 0. 1. ]
[1. 0.5 0.33333333]
[0. 1. 0. ]]
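MinMaxScaler rescales each column independently as x' = (x - min) / (max - min); the printed matrix can be reproduced by hand with NumPy:

```python
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# Per-column min-max scaling: x' = (x - min) / (max - min)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
# [[0.5        0.         1.        ]
#  [1.         0.5        0.33333333]
#  [0.         1.         0.        ]]
```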
Compared with the scale() standardization used in the earlier word2vec pipeline, the normalization here fits MinMaxScaler on the training set and applies the same transform to the test set:
min_max_scaler = preprocessing.MinMaxScaler()
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
x_train = min_max_scaler.fit_transform(x_train)
x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
x_test = min_max_scaler.transform(x_test)
2. Complete source code
def get_features_by_word2vec_cnn_1d():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = min_max_scaler.fit_transform(x_train)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = min_max_scaler.transform(x_test)
    return x_train, x_test, y_train, y_test
(7) Word2Vec_2d
As noted in point (5), Word2Vec has a useful property: the meaning of a sentence, or a phrase of a few words, can be obtained by summing the Word2Vec vectors of all its words and averaging, for example:
model['good boy'] = (model['good'] + model['boy']) / 2
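This averaging is what the buildWordVector helper used in the 1d pipeline does. The book's own implementation is not shown in this excerpt; a minimal sketch consistent with how it is called, buildWordVector(model, z, max_features), would be the following (a plain dict stands in for the gensim model so the demo is self-contained; with gensim the lookup model[word] is the same):

```python
import numpy as np

def buildWordVector(model, text, size):
    """Average the word vectors of all in-vocabulary words in `text`.

    `model` is anything supporting model[word] -> vector of length `size`
    (a gensim Word2Vec model, or a plain dict as in the demo below).
    """
    vec = np.zeros((1, size))
    count = 0
    for word in text:
        try:
            vec += np.asarray(model[word]).reshape((1, size))
            count += 1
        except KeyError:
            continue  # out-of-vocabulary words are skipped
    if count != 0:
        vec /= count
    return vec

# Demo with a toy 3-dimensional "model"
toy = {'good': [1., 0., 1.], 'boy': [0., 2., 1.]}
print(buildWordVector(toy, ['good', 'boy', 'unknown'], 3))  # [[0.5 1.  1. ]]
```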
Note, however, that because the CNN applies 2D convolutions it can consume the per-word vectors directly, so no such averaging is needed in this part. Taking the training set as an example, the code is as follows:
for sms in x_train:
    sms = sms[:max_document_length]
    x_train_vec = np.zeros((max_document_length, max_features))
    for i, w in enumerate(sms):
        try:
            vec = model[w].reshape((1, max_features))
            x_train_vec[i-1] = vec.copy()
        except KeyError:
            continue
    x_train_vecs.append(x_train_vec)
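The truncation and zero-padding behaviour of this loop can be reproduced with a toy embedding table (a dict standing in for the trained gensim model; all sizes and names here are illustrative). The demo assigns with [i] so that word i lands in row i; the book's [i-1] variant is discussed further below:

```python
import numpy as np

max_document_length = 5   # illustrative values, far smaller than the
max_features = 3          # book's 160 / 400

# Toy embedding table standing in for the trained gensim model
toy_model = {'we': [1., 1., 1.], 'tried': [2., 2., 2.]}

sms = ['we', 'tried', 'to']          # 'to' is "out of vocabulary" here
sms = sms[:max_document_length]      # truncate over-long messages
sms_vec = np.zeros((max_document_length, max_features))
for i, w in enumerate(sms):
    try:
        sms_vec[i] = np.asarray(toy_model[w]).reshape((1, max_features))
    except KeyError:
        continue                     # unknown words leave their row at zero
print(sms_vec)
```

Rows beyond the message length, and rows for out-of-vocabulary words, stay all-zero.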
Note in particular that the exception must be caught: looking up a word that is absent from the model's vocabulary raises KeyError. Putting it all together, the complete source code is as follows:
def get_features_by_word2vec_cnn_2d():
    global max_features
    global max_document_length
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train_vecs = []
    x_test_vecs = []
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    for sms in x_train:
        sms = sms[:max_document_length]
        x_train_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            try:
                vec = model[w].reshape((1, max_features))
                x_train_vec[i-1] = vec.copy()
            except KeyError:  # word missing from the vocabulary
                continue
        x_train_vecs.append(x_train_vec)
    for sms in x_test:
        sms = sms[:max_document_length]
        x_test_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            try:
                vec = model[w].reshape((1, max_features))
                x_test_vec[i-1] = vec.copy()
            except KeyError:
                continue
        x_test_vecs.append(x_test_vec)
    min_max_scaler = preprocessing.MinMaxScaler()
    print("fit min_max_scaler")
    x_train_2d = np.concatenate([z for z in x_train_vecs])
    min_max_scaler.fit(x_train_2d)
    x_train = np.concatenate([min_max_scaler.transform(i) for i in x_train_vecs])
    x_test = np.concatenate([min_max_scaler.transform(i) for i in x_test_vecs])
    x_train = x_train.reshape([-1, max_document_length, max_features, 1])
    x_test = x_test.reshape([-1, max_document_length, max_features, 1])
    return x_train, x_test, y_train, y_test
Testing against the train set, the following statements print 400 and 160 respectively, i.e. each message becomes a matrix of shape (160, 400):
print('feature', max_features)
print('sms length', max_document_length)
During initialization, these values are used to allocate a 160×400 zero matrix:
x_train_vec = np.zeros((max_document_length, max_features))
Take the following message as an example; its raw tokens are:
sms ['we', 'tried', 'to', 'contact', 'you', 're', 'our', 'offer', 'of', 'new', 'video', 'phone', '750', 'anytime', 'any', 'network', 'mins', 'half', 'price', 'rental', 'camcorder', 'call', '08000930705', 'or', 'reply', 'for', 'delivery', 'wed']
During the computation, each word is mapped to a vector of shape (1, 400). Taking 'we' as an example, its feature vector is:
we [[ 5.13324559e-01 6.92280710e-01 -6.06139421e-01 -7.33043671e-01
-1.05264068e+00 1.08887844e-01 -4.79352057e-01 -4.47796851e-01
2.79668242e-01 8.62461388e-01 -6.26937568e-01 -1.11396635e+00
-5.17404914e-01 -7.41610050e-01 -5.43049991e-01 1.26212585e+00
-2.09969897e-02 -7.99523532e-01 7.05081582e-01 -2.12729737e-01
-3.72616291e-01 -1.35459530e+00 4.50749367e-01 1.99423480e+00
-6.92014813e-01 -5.42739570e-01 9.73733068e-01 -9.93132412e-01
-1.98732018e+00 5.50295152e-02 -6.52212918e-01 2.61625558e-01
-6.60811782e-01 5.22207975e-01 3.28823067e-02 1.19926560e+00
-1.11021876e-01 7.38386214e-02 3.81358445e-01 2.09052667e-01
-8.29950094e-01 -6.39967322e-01 -1.78698123e+00 9.00085032e-01
-9.13119793e-01 -1.32008731e+00 1.05953765e+00 7.21830130e-01
2.56760389e-01 8.63298297e-01 1.83388978e-01 -2.71840870e-01
-1.10518491e+00 -1.32568777e+00 -1.76396556e-02 -3.53596121e-01
-3.26753765e-01 8.18026602e-01 3.69717002e-01 -1.05936432e+00
-5.11582315e-01 -7.87444296e-04 8.23280692e-01 5.23390949e-01
3.56160045e-01 3.36051375e-01 -8.46250296e-01 -2.13131160e-02
1.16455960e+00 5.76237142e-01 -2.83024520e-01 2.89030671e-01
-2.98734903e-01 -8.48811343e-02 -1.35541141e+00 9.47913602e-02
-1.93751216e-01 1.39996544e-01 2.48220786e-01 -3.07027787e-01
1.46764421e+00 6.78055763e-01 -2.19629765e-01 -1.26420140e+00
-9.32358503e-02 -1.06198204e+00 -1.05375014e-01 -8.62080008e-02
-6.28659487e-01 -4.20507900e-02 -9.40151274e-01 1.91700347e-02
1.57449976e-01 -8.73531494e-03 1.55828786e+00 -5.89314818e-01
3.09453845e-01 -1.52258849e+00 2.46635512e-01 1.06939852e+00
-1.97222233e-01 5.57442427e-01 9.26564932e-01 -4.49043661e-01
-8.07128549e-01 -8.00708532e-01 -9.97351348e-01 1.27024937e+00
-5.16803145e-01 -1.25073361e+00 -1.68358684e-02 -4.12435591e-01
1.12003529e+00 -4.83542472e-01 9.42728221e-01 1.04354310e+00
-3.37721795e-01 3.47444683e-01 -1.32276759e-01 -1.12430441e+00
5.19255280e-01 -4.70025271e-01 -2.95793682e-01 1.15930760e+00
-1.46419370e+00 -3.28355014e-01 1.20763075e+00 7.36234725e-01
1.92663229e+00 -7.88556874e-01 2.79897422e-01 -7.99781442e-01
-8.94552469e-01 -1.58961520e-01 -6.48326576e-01 1.52472287e-01
-4.62858267e-02 2.21728235e-02 -4.91406024e-01 5.98726034e-01
8.25376391e-01 5.93617022e-01 3.23204279e-01 1.87094241e-01
8.95736217e-02 1.57505047e+00 -4.96031970e-01 2.10392982e-01
2.35750794e-01 6.23321176e-01 1.00233054e+00 -7.98230767e-01
1.71729431e-01 3.89039546e-01 3.40540446e-02 1.49258062e-01
5.23613870e-01 -5.20109951e-01 1.11207891e+00 5.08346438e-01
-4.21277046e-01 -3.53204191e-01 -6.67744637e-01 -1.07647288e+00
-1.30223131e+00 3.72959256e-01 8.32573354e-01 1.23022115e+00
-3.76891047e-01 -1.44855738e-01 -1.22572112e+00 -5.34232073e-02
-2.84706771e-01 6.37961626e-01 5.68755329e-01 1.75359726e+00
4.44663256e-01 -7.93476462e-01 -2.56309092e-01 -9.81180012e-01
3.86805981e-02 -6.01904452e-01 3.38268459e-01 4.95142251e-01
-4.19250995e-01 -6.14156127e-01 -1.74631262e+00 2.86130071e-01
-7.84139872e-01 8.27276945e-01 4.14437145e-01 8.87963101e-02
-1.96212530e-01 8.42595875e-01 -8.42688754e-02 -1.06598842e+00
-5.22110045e-01 -3.77817780e-01 1.33652896e-01 -2.49611631e-01
5.73737025e-01 -9.50190306e-01 -1.00966275e-01 -1.47440713e-02
-1.23199463e+00 -1.03451073e+00 -6.90591514e-01 -1.17653474e-01
4.69776809e-01 8.45959485e-01 -6.69992864e-01 9.39928591e-01
2.43703172e-01 -5.22347033e-01 1.50591660e+00 -1.25705695e+00
-1.07466733e+00 2.64614135e-01 1.29109251e+00 -5.54164290e-01
-3.75026703e-01 1.21394324e+00 3.73631209e-01 -3.39792162e-01
7.77602315e-01 6.85810968e-02 2.59181112e-01 -5.74017346e-01
1.80175275e-01 -5.17311059e-02 7.63257027e-01 -6.27086818e-01
-1.75175023e+00 -1.33330524e+00 -2.09044114e-01 -3.61217469e-01
9.18019235e-01 -6.96359158e-01 -8.12097341e-02 -1.99478948e+00
5.33144951e-01 -8.93714070e-01 -1.15886045e+00 1.04454172e+00
-5.89572251e-01 5.85066915e-01 -7.20543563e-01 7.95059741e-01
-1.21909738e+00 -6.99866176e-01 -2.16999158e-01 1.20310523e-01
-1.82821289e-01 -4.16274257e-02 6.58949614e-01 -6.93615258e-01
1.74605417e+00 1.38246298e-01 -1.33604541e-01 -5.39953291e-01
6.67522177e-02 7.55294517e-04 -5.03420174e-01 -7.11347222e-01
-4.60816860e-01 6.16341352e-01 4.63734388e-01 1.24169385e+00
2.37992201e-02 1.45490086e+00 5.40668070e-01 -6.08588755e-01
8.23473692e-01 3.63709390e-01 4.70518112e-01 5.07119834e-01
1.00773372e-01 2.39598155e-01 -8.08379352e-01 6.01775765e-01
-7.75957584e-01 5.12413621e-01 -8.06147754e-01 9.48663130e-02
-1.41660139e-01 1.29196596e+00 5.58595240e-01 -5.60340136e-02
1.15606940e+00 -3.82585824e-01 2.60255396e-01 1.53539240e+00
-4.92683649e-01 -7.33903587e-01 -1.37848473e+00 -9.12505984e-01
-1.74521387e-01 -1.90168723e-01 3.82246822e-01 -3.94751132e-02
1.11151767e+00 1.11433804e+00 1.20742284e-01 1.30665922e+00
-2.81528473e-01 -9.17194877e-03 5.35691738e-01 -5.08557260e-01
-3.51130486e-01 -8.15440595e-01 9.59339619e-01 -1.22608221e+00
1.50517665e-03 -4.85110991e-02 -1.02146101e+00 -6.09341323e-01
-4.56822574e-01 -5.58608174e-02 5.85547611e-02 -9.65700805e-01
-1.89950958e-01 2.49769300e-01 -1.83345869e-01 -1.61435679e-01
-8.93191770e-02 -8.67426872e-01 -3.20736989e-02 9.62894142e-01
-1.56400785e-01 -1.01036263e+00 4.74414289e-01 -9.77616847e-01
-1.23069656e+00 9.33022261e-01 -1.11188591e-01 -8.74921307e-02
1.23017088e-01 1.27999806e+00 -9.75147247e-01 2.15137690e-01
-5.74331656e-02 8.87444019e-01 3.78122181e-01 -5.19478023e-01
1.60879225e-01 4.34285700e-02 1.13737062e-01 6.59871325e-02
-1.24426618e-01 -5.20280719e-01 -3.63858700e-01 -3.51046622e-01
-3.04933637e-01 -5.91847226e-02 7.01255381e-01 -1.06392646e+00
1.08689629e-01 1.07840514e+00 -9.45310116e-01 -1.78924516e-01
2.00637262e-02 4.02919173e-01 -9.98029232e-01 -1.11832178e+00
-6.08632922e-01 1.32893309e-01 3.19699764e-01 -2.58571565e-01
2.51511425e-01 -1.25716972e+00 2.17012092e-01 -1.17317878e-01
1.07084620e+00 -9.22225490e-02 8.86658728e-01 -9.41256881e-01
1.72717959e-01 -3.56864363e-01 -4.89861757e-01 3.81545901e-01
1.74080342e-01 -5.66165209e-01 1.95712000e-01 1.61951578e+00
2.77753413e-01 -1.70714423e-01 1.27170652e-01 -1.92686200e-01
2.13473767e-01 -9.17410314e-01 -1.04300666e+00 -5.42495966e-01
-2.21326977e-01 -4.96313930e-01 1.79395282e+00 1.68110895e+00
2.87916631e-01 -8.61096904e-02 1.61331439e+00 -1.03910947e+00]]
The Word2Vec vector for 'tried' is:
tried [[-6.12377405e-01 4.16364968e-01 -2.78998375e-01 8.34699646e-02
-2.28010252e-01 3.08151543e-01 2.52343982e-01 -3.66741210e-01
-8.41982186e-01 -1.80800125e-01 -5.81489503e-01 -1.90963492e-01
3.02755803e-01 -1.06433824e-01 5.65255463e-01 1.59041151e-01
4.47588146e-01 3.80713344e-01 -1.49773791e-01 -3.71468604e-01
-4.12444770e-01 -2.07973748e-01 7.92427480e-01 -5.68225645e-02
5.91229439e-01 2.05591187e-01 -2.96749562e-01 -3.81954312e-02
-3.47267598e-01 -1.54623434e-01 -3.16062421e-02 -2.79324621e-01
9.93312802e-04 -4.35144268e-02 3.68950248e-01 3.36872280e-01
1.28488570e-01 4.10110801e-02 3.97733957e-01 -7.92261064e-02
-2.56396919e-01 1.78995103e-01 -2.94328153e-01 3.26810986e-01
6.19691372e-01 -6.44081712e-01 1.66229442e-01 -1.15246415e-01
-1.96021140e-01 5.71881771e-01 2.39205420e-01 -8.21989954e-01
5.77406943e-01 5.78376591e-01 6.62666559e-02 1.79710805e-01
4.50686693e-01 -8.30448046e-02 5.49292117e-02 -6.13321841e-01
-2.56487906e-01 4.41231430e-02 -7.95108438e-01 -6.97279871e-01
-4.26836580e-01 6.01450074e-03 -5.60432561e-02 -1.08723938e-02
5.39940178e-01 -6.16185009e-01 9.34244394e-02 -2.29318902e-01
1.88086748e-01 4.97419924e-01 -9.02395964e-01 -7.60009706e-01
9.49736834e-02 5.91007233e-01 -1.15908660e-01 -6.41334772e-01
-4.10496473e-01 7.33896941e-02 -2.04081267e-01 -2.59717613e-01
-3.02662104e-01 1.74025334e-02 -9.20143545e-01 5.68379939e-01
4.52155858e-01 -3.03475648e-01 -8.60066772e-01 -3.32691103e-01
-1.39265224e-01 -2.94290334e-01 2.71589816e-01 -8.56407359e-02
2.16451183e-01 -3.93221825e-01 -3.92221771e-02 1.90483168e-01
6.09004319e-01 5.67121565e-01 5.19057095e-01 -2.78411023e-02
-3.63553882e-01 2.24665418e-01 -4.93543930e-02 -4.85199660e-01
-1.28926590e-01 -2.72287935e-01 -5.09980321e-01 7.59351552e-02
9.42224190e-02 -2.37373099e-01 -1.03565240e+00 4.93904233e-01
-2.54788518e-01 -7.90989026e-02 1.92527086e-01 -3.01439375e-01
-5.00758141e-02 3.35612983e-01 1.67256281e-01 7.30704248e-01
-3.93237889e-01 -2.79548228e-01 -1.58486655e-03 2.05960274e-01
-6.19939148e-01 9.11417976e-02 3.42352986e-01 2.19285220e-01
-1.13095188e+00 5.46321392e-01 3.22455764e-02 2.40575343e-01
-4.96992290e-01 -4.24431205e-01 2.97653377e-02 -2.34046698e-01
3.77572142e-02 2.45649025e-01 1.18621111e-01 -2.35571980e-01
1.21258840e-01 -7.71017745e-02 -5.61713159e-01 5.59415407e-02
-3.39809328e-01 -1.32750332e-01 -3.00569478e-02 -3.62776071e-01
-9.88005847e-03 -6.34131610e-01 -1.14288427e-01 1.50959522e-01
4.09990132e-01 8.65844339e-02 -8.01804900e-01 -1.09013222e-01
-8.83603334e-01 1.00924993e+00 1.72543451e-02 1.00044213e-01
1.14075527e-01 -1.52706569e-02 5.38514793e-01 6.58321857e-01
1.03414440e+00 -2.54678696e-01 -2.80912459e-01 -3.42982501e-01
-6.03122354e-01 -5.64105988e-01 -4.83809561e-01 -7.63532937e-01
3.94593596e-01 2.63797462e-01 1.89387262e-01 1.97888896e-01
-2.22714692e-01 -1.63537845e-01 5.44133842e-01 -4.73020166e-01
-1.58943608e-01 1.47972321e-02 -2.47510076e-01 -2.52945155e-01
-1.42151117e+00 -7.73319006e-02 -2.18766674e-01 2.90358305e-01
-8.11922923e-02 3.80099297e-01 4.07175541e-01 -6.06728375e-01
4.35942084e-01 -5.07750392e-01 -3.74243110e-01 -2.28505861e-02
3.85590881e-01 -9.61492583e-02 6.28207564e-01 -2.27478340e-01
-3.53291005e-01 1.27462888e+00 1.22006260e-01 6.96230888e-01
-4.91642160e-03 9.73952487e-02 5.97821176e-01 2.64382482e-01
-1.07160784e-01 2.76032716e-01 -1.00402012e-02 1.52446823e-02
-1.02653086e+00 -3.72105278e-02 9.84817185e-03 -9.78173837e-02
2.51751870e-01 7.96171010e-01 1.74022168e-01 4.11238670e-01
6.62876844e-01 2.44819392e-02 2.32926726e-01 -1.71004206e-01
6.59075752e-02 -3.79822493e-01 -2.80748993e-01 3.84911120e-01
-3.39996159e-01 -6.03033841e-01 -3.07716448e-02 9.59311575e-02
1.63629785e-01 3.89313966e-01 4.56324965e-01 -1.47402048e-01
2.24111572e-01 -1.47401705e-01 3.15533757e-01 2.16347668e-02
-1.41451418e-01 4.05424476e-01 -4.94289964e-01 2.80132085e-01
-2.61499792e-01 5.22326589e-01 -3.14167172e-01 -2.90078729e-01
1.29095510e-01 -2.91293949e-01 -2.05306053e-01 -7.58467257e-01
2.23533452e-01 -7.64667317e-02 4.83570069e-01 6.99249208e-02
1.30418539e-01 -1.41056001e-01 -7.23488152e-01 -1.53211460e-01
-1.52336270e-01 -4.12477590e-02 1.60394862e-01 -7.33491685e-03
2.02331066e-01 -6.27143621e-01 -4.36783656e-02 6.27906978e-01
-3.39776695e-01 -2.54275233e-01 -3.11400682e-01 -5.71068414e-02
-9.78075266e-02 -6.21155202e-01 1.29985273e-01 1.04443580e-02
2.43578345e-01 -1.07544787e-01 1.15666062e-01 1.05102813e+00
-2.67948598e-01 -1.73818737e-01 7.98049942e-02 6.87964559e-02
-4.02220339e-01 1.27988279e-01 -7.05842525e-02 8.49173516e-02
-3.85340363e-01 4.22646612e-01 -1.09368823e-01 -9.09663737e-02
5.47704101e-02 -4.70033109e-01 1.41687408e-01 2.02474698e-01
7.75358200e-01 3.81336033e-01 8.62001162e-03 -1.83762997e-01
-1.69773564e-01 -6.28888071e-01 -6.15247846e-01 4.59231079e-01
4.09249932e-01 -2.44395018e-01 -1.12164527e-01 -3.31565857e-01
1.67021617e-01 8.46286044e-02 1.54180750e-01 -1.56229764e-01
3.98001611e-01 -1.01308875e-01 6.64622247e-01 3.89112197e-02
8.25130641e-02 -4.84833509e-01 -3.67288351e-01 4.05334085e-01
3.12755287e-01 -2.55807668e-01 -3.96498263e-01 5.65700412e-01
2.73670524e-01 -9.04162899e-02 -2.60679964e-02 2.46381899e-03
3.49983983e-02 3.14775467e-01 -2.59854734e-01 -2.87085712e-01
7.13063776e-01 1.51837751e-01 -8.10065806e-01 -6.50787354e-01
8.68012831e-02 1.63752958e-01 2.44253531e-01 2.92172730e-01
-1.07378745e+00 1.08177908e-01 1.96341053e-02 -9.42850485e-02
2.40874976e-01 7.31463015e-01 -1.52041927e-01 6.45682588e-02
5.57659388e-01 -5.77583313e-01 -6.75140381e-01 -6.40653772e-03
2.81538606e-01 -1.68723643e-01 -3.24986398e-01 1.59401417e-01
-1.32915914e-01 9.62117612e-02 6.53585255e-01 -4.89544272e-01
2.66982883e-01 1.92768946e-01 2.40562856e-02 -1.17575504e-01
-1.35681018e-01 -1.33427590e-01 -4.41130966e-01 1.55471131e-01
4.75975633e-01 -3.60106885e-01 5.60806274e-01 -4.64459300e-01
-5.51069260e-01 -7.95026064e-01 -2.00052291e-01 1.86785296e-01
5.44229865e-01 -1.40094772e-01 7.62127638e-02 2.72861958e-01
6.26554191e-01 -1.92846179e-01 5.23298919e-01 -2.57884264e-02
5.74357688e-01 -4.20777857e-01 2.34971404e-01 -1.05997846e-01
-2.79536992e-01 -8.36799264e-01 -1.56205758e-01 5.80618501e-01
4.83773470e-01 4.12972480e-01 -3.56931061e-01 -5.20759337e-02]]
The matrix finally computed for this sms has shape (160, 400). Because it was zero-initialized, only the rows for words actually present in the message need to be filled in, via the assignment:
x_train_vec[i-1]=vec.copy()
Finally, the matrix is appended to the list:
x_train_vecs.append(x_train_vec)
Take the run below as an example: rows -1 through (actual sentence length - 2) contain vectors, where row -1 wraps around to the last row, i.e. row 159. This comes from the assignment x_train_vec[i-1]=vec.copy(). The book does not comment on details like this, but since enumerate counts from 0, the i-1 looks like an off-by-one, perhaps written as though i started counting at 1.
['at', 'home', 'by', 'the', 'way']
(160, 400) [[-1.58791387 0.90998811 1.02353275 ... -0.49089381 0.45235381
0.196127 ]
[-0.56197596 0.48324966 -0.02229143 ... -1.04505527 -0.41888034
-0.89763647]
[-0.18195434 0.69938725 0.01750618 ... -0.00903855 0.69490665
0.02577416]
...
[ 0. 0. 0. ... 0. 0.
0. ]
[ 0. 0. 0. ... 0. 0.
0. ]
[ 0.7947135 0.95048201 0.8174293 ... 0.59431046 1.00592685
0.34427088]]
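The wrap-around can be reproduced in isolation: with zero-based enumerate, i - 1 is -1 for the first word, and NumPy interprets index -1 as the last row:

```python
import numpy as np

x = np.zeros((4, 2))          # 4 rows standing in for the 160-row matrix
words = ['at', 'home']        # two-word message
for i, w in enumerate(words):
    x[i - 1] = 1.0            # i = 0 writes row -1, i.e. the LAST row
print(x)
# rows 0 and 3 are filled; rows 1 and 2 stay zero
```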
If the assignment is changed to
x_train_vec[i] = vec.copy()
then the generated matrix, as the test output below shows, has its non-zero vectors exactly in rows 0 through (length - 1):
sms ['oh', 'thanks', 'a', 'lot', '.', '.', 'i', 'already', 'bought', '2', 'eggs', '.', '.']
0 oh
1 thanks
2 a
3 lot
4 .
5 .
6 i
7 already
8 bought
9 2
10 eggs
11 .
12 .
(160, 400) [[ 0.51650995 0.87204415 0.0464043 ... 0.01095798 0.15478413
-0.119578 ]
[-0.10295536 0.97929972 -0.06606753 ... -0.0157218 0.04467017
0.30258381]
[-1.07609522 0.46579501 0.35544962 ... 0.36150065 -0.43277314
0.32538563]
...
[ 0. 0. 0. ... 0. 0.
0. ]
[ 0. 0. 0. ... 0. 0.
0. ]
[ 0. 0. 0. ... 0. 0.
0. ]]
(8) Doc2Vec
Building on Word2Vec, Quoc Le and Tomas Mikolov proposed the Doc2Vec training method. As the book's figure illustrates, its principle parallels Word2Vec's, with two variants: Distributed Memory (DM) and Distributed Bag of Words (DBOW).
The full source code is as follows:
def get_features_by_doc2vec():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    print('y:', len(y_train), len(y_test))
    print('x:', len(x_train), len(x_test))
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x_train = labelizeReviews(x_train, 'TRAIN')
    x_test = labelizeReviews(x_test, 'TEST')
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(doc2ver_bin):
        print("Find cache file %s" % doc2ver_bin)
        model = Doc2Vec.load(doc2ver_bin)
    else:
        model = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(doc2ver_bin)
    x_test = getVecs(model, x_test, max_features)
    x_train = getVecs(model, x_train, max_features)
    return x_train, x_test, y_train, y_test
Taking x_train[0] as an example, the loading code is:
x_train, x_test, y_train, y_test=load_all_files()
which at this point returns:
That's my honeymoon outfit. :)
Next comes the cleanText processing:
x_train=cleanText(x_train)
x_test=cleanText(x_test)
The result is:
["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')']
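cleanText itself is not listed in this excerpt. Judging from the output, it lowercases the text and tokenizes it, splitting punctuation into separate tokens; a plausible sketch for a single message (a guess at the behaviour, not the book's code) is:

```python
import re

def clean_one(text):
    # Lowercase; keep apostrophes inside words, split everything else off
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

print(clean_one("That's my honeymoon outfit. :)"))
# ["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')']
```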
Unlike Word2Vec, Doc2Vec requires each English paragraph it processes to carry a unique identifier, and the paragraphs are stored in a specially defined data format:
SentimentDocument = namedtuple('SentimentDocument', 'words tags')
Here SentimentDocument is the name of the format (equivalently, of the object type); words holds the paragraph as a list of words and punctuation tokens, and tags holds the unique identifier. The simplest scheme numbers the paragraphs sequentially: training samples are tagged "TRAIN_<number>" and test samples "TEST_<number>":
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(SentimentDocument(v, [label]))
    return labelized
The calling code in this chapter is:
x_train = labelizeReviews(x_train, 'TRAIN')
x_test = labelizeReviews(x_test, 'TEST')
After this, x_train[0] becomes:
SentimentDocument(words=["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')'], tags=['TRAIN_0'])
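The tagging step is easy to reproduce in isolation; only the namedtuple definition and labelizeReviews are needed:

```python
from collections import namedtuple

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(SentimentDocument(v, [label]))
    return labelized

docs = labelizeReviews([["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')']],
                       'TRAIN')
print(docs[0])  # matches the x_train[0] shown above
```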
The model is trained, or loaded from cache, as follows:
if os.path.exists(doc2ver_bin):
    print("Find cache file %s" % doc2ver_bin)
    model = Doc2Vec.load(doc2ver_bin)
else:
    model = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(doc2ver_bin)
Once the model is trained, the corpus can be vectorized with:
def getVecs(model, corpus, size):
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.array(np.concatenate(vecs), dtype='float')
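getVecs only needs model.docvecs to support lookup by tag, so it can be exercised without training a model by substituting a mock (illustrative only; types.SimpleNamespace stands in for a real Doc2Vec model here):

```python
import numpy as np
from types import SimpleNamespace
from collections import namedtuple

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

def getVecs(model, corpus, size):
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.array(np.concatenate(vecs), dtype='float')

# Mock model: two 4-dimensional document vectors keyed by tag
model = SimpleNamespace(docvecs={'TRAIN_0': [1, 2, 3, 4],
                                 'TRAIN_1': [5, 6, 7, 8]})
corpus = [SentimentDocument(['hi'], ['TRAIN_0']),
          SentimentDocument(['bye'], ['TRAIN_1'])]
print(getVecs(model, corpus, 4).shape)  # (2, 4)
```

The result is one row per document, in corpus order.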
In this section, x_train[0] vectorizes to:
[ 1.07623920e-01 2.03667670e-01 4.39605879e-04 2.15817243e-02
-1.33999720e-01 -1.04999244e-01 6.59373477e-02 1.42834812e-01
1.93634644e-01 3.39346938e-02 -3.85309849e-03 -1.64910153e-01
-1.58900153e-02 -2.52948906e-02 -3.99647981e-01 -9.12046656e-02
5.02838679e-02 -2.62314398e-02 -2.06790075e-01 1.31346032e-01
-4.74643335e-02 -8.93631503e-02 3.17618906e-01 1.38969775e-02
-2.91506946e-01 -1.80232242e-01 1.07947908e-01 1.12745747e-01
-8.35281014e-02 -1.05828121e-01 -1.48421243e-01 -6.38468117e-02
4.14518379e-02 3.76846604e-02 1.51488677e-01 -8.19508582e-02
-2.27401778e-01 4.36623842e-02 1.62245184e-01 1.52473062e-01
4.37728465e-02 -1.23079844e-01 -3.73415574e-02 -1.03757225e-01
-1.39077678e-01 1.03434525e-01 -1.43260106e-01 -5.22182137e-02
-7.31687844e-02 2.00760439e-01 2.19642222e-02 8.81896541e-02
2.05241382e-01 -1.33090258e-01 1.55719062e-02 2.02201935e-03
1.01647124e-01 -3.05845201e-01 -8.83292854e-02 -1.23624884e-01
7.08331615e-02 4.64668535e-02 -9.01960209e-02 9.96094272e-02
-1.34212002e-01 2.00427771e-01 1.07755877e-01 2.25574061e-01
-1.82992116e-01 1.53573334e-01 -9.32435393e-02 1.56051908e-02
1.82265509e-02 7.87798688e-02 1.61014810e-01 9.62717682e-02
-1.00199223e-01 5.18805161e-02 -1.82170309e-02 -1.01618238e-01
-2.62078028e-02 -8.01115185e-02 7.61420429e-02 1.60436168e-01
2.32044682e-01 1.22496150e-01 3.62544470e-02 -6.78069219e-02
-2.98173614e-02 -8.31498671e-03 -1.16020748e-02 3.31646539e-02
2.66764522e-01 -1.76852301e-01 -2.00362876e-01 7.83127621e-02
4.40042838e-02 1.37215763e-01 -3.20810378e-02 4.33634184e-02
7.93892331e-03 8.94398764e-02 7.15106502e-02 7.87770897e-02
-2.57502228e-01 -1.56798676e-01 1.16182871e-01 4.86833565e-02
4.79952479e-03 9.11491066e-02 -1.40987933e-02 2.22539883e-02
-1.59795076e-01 9.47566405e-02 1.88340500e-01 2.16186821e-01
1.01996697e-01 -5.08516617e-02 -6.35962486e-02 -1.20826885e-01
-2.66686548e-02 1.30622894e-01 -5.55477142e-02 1.86781764e-01
1.94867045e-01 -5.55491969e-02 -1.73223794e-01 -2.23940029e-03
-8.42484012e-02 -9.78929922e-03 1.58954084e-01 2.10609302e-01
-9.30389911e-02 -1.19832724e-01 -2.57326901e-01 -9.47180465e-02
-6.85976595e-02 5.72628602e-02 -7.99224451e-02 1.05779201e-01
3.46482173e-02 -6.65628025e-03 -4.14694101e-03 -1.77285731e-01
1.71367064e-01 -1.04618266e-01 1.80818886e-02 1.00839488e-01
4.69732359e-02 5.89858182e-02 -1.17412247e-01 2.59877682e-01
2.43245989e-01 -1.44933388e-01 2.10463241e-01 2.31560692e-01
-2.60037601e-01 4.47640829e-02 -2.82789797e-01 7.55265653e-02
-2.29167998e-01 2.02734083e-01 -4.64072858e-04 1.37224300e-02
1.67293772e-01 2.41031274e-01 -2.25378182e-02 3.51362005e-02
-1.32802263e-01 1.59649570e-02 4.22540531e-02 1.46264017e-01
7.94619396e-02 -1.92659259e-01 5.14810346e-02 3.52863334e-02
1.29052609e-01 -1.65790990e-01 6.53655455e-02 8.40774626e-02
6.88608587e-02 -1.29266530e-01 -4.67737317e-01 1.19465608e-02
-4.01823148e-02 -6.38013333e-02 2.58218110e-01 -3.32820341e-02
-3.28772753e-01 1.57434553e-01 -3.27924967e-01 -4.27057222e-02
-1.31331399e-01 -1.63260803e-01 5.75413043e-03 -1.01014659e-01
1.76209942e-01 4.11938801e-02 -2.52132833e-01 1.40920551e-02
-1.06663443e-01 -2.49055997e-01 8.67770463e-02 1.78776234e-01
1.24725297e-01 1.30073011e-01 4.45161164e-02 1.60344705e-01
1.23212010e-01 -1.01057090e-01 -9.87102762e-02 -2.37999424e-01
-2.54926801e-01 1.51191801e-01 -1.31282523e-01 -2.35551950e-02
-4.09922339e-02 2.10600153e-01 -5.79826534e-02 3.74017060e-02
-2.94750094e-01 -7.88278356e-02 2.41033360e-01 -1.52092993e-01
2.95348503e-02 3.29553857e-02 -1.94013342e-02 -3.87551337e-01
-2.59788465e-02 -5.78223402e-03 -2.31364612e-02 -3.76425721e-02
-1.05162010e-01 -7.57280588e-02 -2.94224806e-02 9.29094255e-02
5.02580917e-03 1.27323985e-01 1.81948487e-02 2.53445506e-02
5.22816628e-02 1.21363856e-01 6.15054667e-02 1.54508455e-02
-8.20488727e-04 7.56623894e-02 -1.50576439e-02 1.67132884e-01
-2.82725275e-01 1.27364337e-01 1.80122808e-01 -1.31743163e-01
-9.61113051e-02 -6.37162179e-02 7.30723217e-02 1.46646490e-02
-1.33017078e-01 -1.19762242e-01 2.35711131e-02 -2.81341195e-01
-5.68757765e-03 1.85813874e-01 1.10502146e-01 -7.27362931e-02
-1.60796911e-01 5.11522032e-02 -9.55855697e-02 -7.15414509e-02
2.83251163e-02 1.16654523e-01 3.66797764e-03 1.54608116e-01
-2.03318566e-01 6.67307079e-02 8.06226656e-02 -1.19694658e-02
8.96890089e-02 2.01400295e-01 8.02449882e-02 -2.21296884e-02
5.21850772e-02 -1.38125028e-02 8.87114927e-02 1.21549807e-01
-5.28845526e-02 3.75475399e-02 6.20372519e-02 1.29727647e-01
-6.90457448e-02 2.34339647e-02 -4.55942191e-02 4.64116298e-02
-1.33431286e-01 2.55507827e-01 2.16026157e-01 -9.25980732e-02
-4.85622920e-02 -1.45295113e-01 -2.58427318e-02 1.93505753e-02
1.17950164e-01 -1.23775247e-02 1.33392587e-01 -8.28817710e-02
1.36878446e-01 -6.80091828e-02 -1.98093444e-01 5.15850522e-02
1.92994636e-03 2.26874277e-01 1.26609832e-01 -5.96865974e-02
-1.37154952e-01 1.35652214e-01 1.13142699e-01 3.95694794e-03
-2.06833467e-01 -7.06818774e-02 -2.37924159e-02 7.30280858e-03
-2.31933817e-01 -2.13069454e-01 -1.42960489e-01 -7.01301452e-03
8.29812512e-03 6.81945086e-02 1.51407436e-01 7.65770003e-02
1.34944141e-01 1.27678066e-01 -2.07772329e-01 1.72014341e-01
-5.36921099e-02 -7.11955428e-02 4.27525938e-02 -7.01407567e-02
-7.80161470e-03 2.37379566e-01 -9.58834738e-02 -7.06278309e-02
-2.44213790e-02 -6.29713684e-02 4.97949077e-03 -1.93031520e-01
1.07220590e-01 -6.00046758e-03 -1.33072376e-01 -1.13887295e-01
1.19876094e-01 8.42822641e-02 -1.89513087e-01 9.65327695e-02
5.26705459e-02 1.42758876e-01 1.56315282e-01 2.01160442e-02
4.45949882e-02 -2.87032127e-02 8.93688649e-02 2.41289198e-01
1.31013185e-01 4.65613231e-03 -4.40470129e-02 2.39219904e-01
-1.30711459e-02 4.95963730e-02 2.94293970e-01 -5.10329269e-02
-1.80931166e-01 1.07536905e-01 9.01441202e-02 -1.17586985e-01
1.99178141e-02 1.22322226e-02 -1.71743870e-01 -1.57537639e-01
1.04444884e-01 7.31004849e-02 9.70724411e-03 1.06952451e-01
1.65776461e-01 1.47664443e-01 8.90543163e-02 7.31813684e-02
1.05123490e-01 1.22088723e-01 -1.21460930e-02 -1.45071194e-01
-8.42208490e-02 -1.08709313e-01 2.45642308e-02 -6.45151436e-02
4.05842774e-02 -2.05672416e-03 6.51900023e-02 1.91479787e-01
-9.36061218e-02 1.43005680e-02 1.36256188e-01 -5.99846505e-02]
Note: this chapter's notes are not finished; because the Section 8 notes on spam SMS detection are long, they form a series, and the next installment is titled 《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(5).