This chapter uses the SMS Spam Collection dataset as an example to introduce spam SMS detection techniques. This section explains in detail how to extract spam SMS features with Word2Vec.
Note: the notes for Chapter 8 form a series; see the list below.
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(1) (mooyuan's blog on CSDN)
《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(2) (mooyuan's blog on CSDN)
(5) The Word2Vec Model
Word2Vec is an efficient tool, open-sourced by Google in 2013, for representing words as real-valued vectors. It offers two model architectures, Continuous Bag-Of-Words (CBOW) and Skip-Gram, whose diagrams are shown in the figure below.
Through training, Word2Vec reduces the processing of text content to vector operations in a K-dimensional vector space, where similarity in the vector space can represent semantic similarity of the text. The word vectors Word2Vec produces can therefore be used for many NLP tasks, such as clustering, finding synonyms, and part-of-speech analysis.
1. Data Preprocessing
Compared with the bag-of-words and set-of-words models, the word2vec pipeline adds the following logic: SMS messages may contain special symbols, and these symbols are also helpful for identifying spam messages. The symbols to handle are:
punctuation = """.,?!:;(){}[]"""
A common approach is to add a space before and after each special symbol, so that splitting with split() keeps these symbols intact as separate tokens:
def cleanText(corpus):
    punctuation = """.,?!:;(){}[]"""
    corpus = [z.lower().replace('\n', '') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    # treat punctuation as individual words
    for c in punctuation:
        corpus = [z.replace(c, ' %s ' % c) for z in corpus]
    corpus = [z.split() for z in corpus]
    return corpus
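As a quick sanity check of the tokenizer, here is a run on a made-up message (the function body is repeated from above so the snippet is self-contained; the sample text is invented):

```python
def cleanText(corpus):
    """Lowercase, strip newlines and <br />, and isolate punctuation as tokens."""
    punctuation = """.,?!:;(){}[]"""
    corpus = [z.lower().replace('\n', '') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    for c in punctuation:
        corpus = [z.replace(c, ' %s ' % c) for z in corpus]
    corpus = [z.split() for z in corpus]
    return corpus

tokens = cleanText(["Win a prize now!!! Call 08000123456."])
print(tokens[0])
# each '!' and '.' survives as its own token
```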
We run the training and test data through cleanText, then merge them into the full data set x:
x_train = cleanText(x_train)
x_test = cleanText(x_test)
x = x_train + x_test
2. Building the Model
We initialize a Word2Vec object. size is the number of hidden-layer nodes in the Word2Vec neural network, which is also the dimensionality of the resulting Word2Vec vectors; window is the window length used during training; words occurring fewer than min_count times are ignored; iter is the number of training iterations (default 5), and the official gensim documentation strongly recommends increasing it to improve the quality of the resulting vectors:
if os.path.exists(word2ver_bin):
    print("Find cache file %s" % word2ver_bin)
    model = gensim.models.Word2Vec.load(word2ver_bin)
else:
    model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(word2ver_bin)
3. Word2Vec Feature Vectorization
After training, the Word2Vec vector for each word is stored in the model variable and can be accessed directly, dictionary-style. For example, the Word2Vec vector for the word love is obtained with:
model['love']
A useful property of Word2Vec is that the meaning of a sentence, or of a phrase made up of a few words, can be approximated by summing the Word2Vec vectors of all its words and taking the average, for example:
model['good boy'] = (model['good'] + model['boy']) / 2
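The averaging can be illustrated with toy vectors; the 4-dimensional values below are invented for the example, not real Word2Vec output:

```python
import numpy as np

# made-up 4-dimensional "word vectors", for illustration only
vectors = {
    'good': np.array([0.2, -0.4, 0.6, 0.0]),
    'boy':  np.array([0.4,  0.0, 0.2, 0.8]),
}

# the phrase vector is the element-wise mean of its word vectors
phrase = (vectors['good'] + vectors['boy']) / 2  # [0.3, -0.2, 0.4, 0.4]
print(phrase)
```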
Using this property, we sum and average the Word2Vec vectors of the words and symbols that make up each message:
def buildWordVector(imdb_w2v, text, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += imdb_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec
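As a quick check, a plain dict can stand in for the model here, since a dict also raises KeyError on a missing key, just like the (pre-4.0 gensim) model's item access; the two-dimensional vectors below are made up:

```python
import numpy as np

def buildWordVector(imdb_w2v, text, size):
    """Average the vectors of all in-vocabulary words in `text`."""
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += imdb_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # out-of-vocabulary word: skip it
            continue
    if count != 0:
        vec /= count
    return vec

# a dict raising KeyError mimics lookups of out-of-vocabulary words
fake_model = {'call': np.array([1.0, 3.0]), 'now': np.array([3.0, 1.0])}
vec = buildWordVector(fake_model, ['call', 'now', 'unseen-token'], 2)
print(vec)  # [[2. 2.]] -- 'unseen-token' is skipped, the rest are averaged
```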
Also, since words occurring fewer than min_count times are not trained, and the test samples may contain unhandled special characters, we catch KeyError to keep the program from exiting abnormally. We process the training and test sets in turn to obtain their Word2Vec values, and standardize them with the scale function:
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
x_train = scale(x_train)
x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
x_test = scale(x_test)
4. Standardization with scale
The scale function keeps individual dimensions with unusually large or small values from skewing the classification results. It transforms each dimension so that the data are distributed more "evenly":
from sklearn import preprocessing
import numpy as np

X = np.array([[1., -1., 2.],
              [2.,  0., 0.],
              [0.,  1., -1.]])
X_scaled = preprocessing.scale(X)
print(X_scaled)
The output is:
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
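preprocessing.scale standardizes each column by subtracting its mean and dividing by its population standard deviation, so every column ends up with zero mean and unit variance; the same numbers can be reproduced with plain NumPy:

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2.,  0., 0.],
              [0.,  1., -1.]])

# column-wise standardization: (x - mean) / std, using the population std (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```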
5. Complete Source
Overall, the processing flow looks like this:
def get_features_by_word2vec():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    print(len(x_train), len(y_train))
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    print('before', len(y_train))
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = scale(x_train)
    print('after', len(x_train))
    print(x_train.shape)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = scale(x_test)
    return x_train, x_test, y_train, y_test
6. A Worked Example
Here we take x_train[0] as an example to walk through the vectorization process. First, the data are loaded from file:
x_train, x_test, y_train, y_test=load_all_files()
After running the code above, x_train[0] is:
If you don't, your prize will go to another customer. T&C at www.t-c.biz 18+ 150p/min Polo Ltd Suite 373 London W1J 6HL Please call back if busy
After running cleanText and tokenizing, x_train[0] becomes:
['if', 'you', "don't", ',', 'your', 'prize', 'will', 'go', 'to', 'another', 'customer', '.', 't&c', 'at', 'www', '.', 't-c', '.', 'biz', '18+', '150p/min', 'polo', 'ltd', 'suite', '373', 'london', 'w1j', '6hl', 'please', 'call', 'back', 'if', 'busy']
Next, the data are processed with the word2vec model:
if os.path.exists(word2ver_bin):
    print("Find cache file %s" % word2ver_bin)
    model = gensim.models.Word2Vec.load(word2ver_bin)
else:
    model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(word2ver_bin)
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
Using the averaging property described earlier, summing the Word2Vec vectors of all words in the message and taking the mean yields the following processed x_train[0]:
[ 0.5688461 -0.7963458 -0.53969711 0.42368432 1.7073138 1.17516173
0.32935769 0.1749727 -1.10261336 -1.14618023 -0.64693019 0.03879264
-0.28986312 -0.15053948 0.86447008 1.03759495 -0.22362847 0.54810378
-0.09579477 0.06696273 0.53213082 1.13446066 0.70176198 -0.09194162
-1.00245396 -1.01783227 -0.72731505 0.43077651 -0.00673702 0.54794111
0.28392318 1.21258038 0.6954477 1.35741696 0.52566294 -0.11437557
-0.0698448 -0.06264644 0.00359846 0.19755338 0.02252081 -0.45468214
0.03074975 -0.97560132 -1.3320358 -0.191184 -0.99694834 -0.05791205
0.38126789 1.41985205 0.06165056 0.21995296 -0.25111755 -0.61057136
0.30779555 1.45024929 -1.25652236 0.77137314 0.14340256 -0.48314989
0.6579341 -1.64457267 -0.33124644 0.4243934 -1.32630979 0.37559585
-0.01618847 -0.72842787 0.75744382 0.22936961 0.38842295 0.70630939
-0.5755018 2.28154287 0.1041452 0.35924263 1.8132245 -0.10724146
-1.49230761 -0.32379927 -0.89156985 0.37247643 0.34482669 -0.10076832
-0.53934116 -0.38991501 -0.14401814 1.64303595 -0.50050573 0.32035356
-0.51832154 0.45338105 -1.35904802 -0.74532751 -0.31660083 0.15160747
0.76809469 -0.34191613 0.07772422 0.16559841 0.08473047 -0.10939166
0.1857267 0.02878834 0.64387584 0.45749407 0.69939248 -0.85222505
-1.57294277 -1.62788899 0.35674762 -0.24114483 0.29261773 0.18306259
-1.18492453 -0.52101244 1.15009746 0.97466267 -0.33838688 -1.17274655
0.57668485 1.56703609 1.27791816 -1.14988041 0.28182096 -0.09135877
-0.03609932 0.66226854 -0.35863005 -0.36398623 0.26722192 0.98188737
-0.33385907 0.445007 0.75214935 -0.81884158 1.0510767 0.63771857
0.19482218 -1.80268676 -0.34549945 -0.35621238 0.46528964 -0.55987857
-0.87382452 0.75147679 -0.66485836 -0.15657116 0.18748415 1.10138361
-0.0078085 0.50333768 1.3417442 1.10197353 -0.05941141 0.07282477
-0.19017513 -0.83439085 -0.00832164 0.06593468 -0.53035842 0.95551142
0.35307575 -0.31915962 0.20121204 -0.81100877 -0.91266333 0.03278571
0.26023407 -0.54093813 0.02997342 1.41653465 -0.12427418 -0.82120634
-1.17340946 -1.75454109 -0.76027333 1.2884738 0.17876992 0.26112962
-0.88782072 0.03205944 -0.16476012 -0.14802052 -1.12993536 0.4738586
0.72952233 1.57389264 -0.77677785 -0.6256085 -0.22538952 0.34228583
-0.56924201 0.7434089 1.40698768 0.52310801 -0.87181962 0.32473917
-1.27615191 1.0771901 1.12765643 1.1128303 0.28027994 0.23365211
-1.32999254 1.16263406 -0.24584286 1.32610628 -1.07430174 0.04978786
0.84560452 0.51568605 0.29324713 1.01046356 0.89309483 -0.68883869
-0.10943733 -1.14162474 0.43906249 -1.64726855 0.62657474 0.89747922
0.25619183 0.88133258 0.53152881 0.800173 1.07257533 -0.91345605
1.511324 -0.37129249 -1.21065258 1.41421037 0.63753296 0.77966061
0.34219329 -1.62505142 -0.50154156 -0.84119517 -0.10794676 0.14238391
-0.18933125 0.96618836 -0.09447222 -0.01457627 0.25379729 -0.00239968
-0.01879948 0.24551755 -0.19717246 1.49390844 0.41681463 -1.16918163
-0.7748389 0.6664235 -0.03348684 -0.13785069 -1.38920251 -0.65347069
-0.30330183 0.84497368 1.01966753 0.62513464 -0.61398801 0.17254219
0.47432455 -0.4636558 -0.2835449 -0.38155617 -0.47108328 -1.27081993
-0.09585531 0.49909651 -0.99359911 -0.07502736 -1.39910104 -0.34759668
0.21337289 -1.10769739 0.15850016 0.64950728 0.96845903 -0.71599037
-0.35235105 -0.64243931 -0.31335287 -1.04057976 -0.75755116 0.2656545
-0.91747103 0.51288032 1.12705359 -0.3520552 0.82732871 2.18699482
0.17190736 0.01063382 0.60349575 0.18857134 0.63457147 -1.40435699
-0.24523577 1.07861446 -1.93594107 -0.35640277 0.56313198 0.92719632
-1.19504538 -0.40542724 -0.16996568 -0.03463793 -0.97696206 -0.12556016
0.21483948 0.15585242 0.76265303 -0.65085202 0.65287212 -0.85362963
0.33149502 0.5701018 0.40867361 0.21806052 -1.14224126 1.42919841
-0.22902177 0.5451145 -0.1141489 0.25853344 1.02713966 -0.16200372
-0.23339327 0.87608441 0.75910643 0.18785408 1.23609536 -0.72335459
0.53511046 0.08358996 -0.5598393 0.5004547 -0.11572333 -0.47238923
1.20602503 -0.27158068 -0.65528766 0.25551535 0.32559297 -1.09997926
0.20791183 0.12843725 0.09087898 -0.22888646 -0.71270039 0.78723415
0.4676776 -0.3136612 0.4368007 0.56427676 -0.95792959 -0.12123958
0.25772387 0.27141381 1.62133518 1.0806067 -0.21620892 0.72400364
0.23908486 1.32545448 1.37374568 0.80119068 -1.11050208 0.61139748
0.19350813 -0.42820846 -0.09775806 0.37327079 -1.30432311 0.20804753
0.81459754 -0.36544708 0.00990999 -1.75476784 -1.18515867 -0.15301021
-0.02726374 0.63801717 0.70284723 0.69907364 -0.54179232 -1.13846505
-0.00501522 -0.95063539 -0.3019417 -0.72958836 -0.65496463 -0.22132243
1.35748601 1.41187301 0.82758802 1.23182959]
Note: this chapter's notes continue. Because the Chapter 8 notes on spam SMS detection are extensive, they are split into a series; the next installment is 《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(4) (mooyuan's blog on CSDN).
For subsequent content, see the series of notes on《Web安全之深度学习实战》.