This chapter uses the SMS Spam Collection dataset as an example to introduce techniques for identifying spam (nuisance) SMS messages. It covers the feature-extraction methods used — the bag-of-words and TF-IDF models, the word-set (vocabulary) model, and the Word2Vec and Doc2Vec models — as well as the classifiers and their validation results, including naive Bayes, support vector machines, XGBoost and the MLP. This section is similar to the spam e-mail chapter (Chapter 6) and the negative-review chapter (Chapter 7); only the content being identified changes to spam SMS, and all three are binary classification problems.
I. Dataset
The test data comes from the SMS Spam Collection dataset, a classic dataset for spam SMS identification built entirely from real text messages; it contains 4,831 legitimate (ham) messages and 747 spam messages. Download the archive from the official site and unpack it; the ham and spam messages are stored together in a single text file.
Read the data file SMSSpamCollection.txt line by line. Each line consists of a label and the message text separated by a tab character, so the label and the text can be obtained directly with split:
from sklearn.model_selection import train_test_split

def load_all_files():
    x = []
    y = []
    datafile = "../data/sms/smsspamcollection/SMSSpamCollection.txt"
    with open(datafile, encoding='utf-8') as f:
        for line in f:
            line = line.strip('\n')
            # each line is "<label>\t<message text>"
            label, text = line.split('\t')
            x.append(text)
            if label == 'ham':
                y.append(0)   # 0 = legitimate message
            else:
                y.append(1)   # 1 = spam
    # hold out 40% of the data as the test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test
Calling the function:
x_train, x_test, y_train, y_test=load_all_files()
II. Feature Extraction
(1) Word-set (vocabulary) model
def get_features_by_tf():
    global max_document_length
    x_train, x_test, y_train, y_test = load_all_files()
    vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                                min_frequency=0,
                                                vocabulary=None,
                                                tokenizer_fn=None)
    x_train = vp.fit_transform(x_train, unused_y=None)
    x_train = np.array(list(x_train))
    x_test = vp.transform(x_test)
    x_test = np.array(list(x_test))
    return x_train, x_test, y_train, y_test
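VocabularyProcessor maps every message to a fixed-length sequence of word ids, padding with 0. As a rough illustration of what these word-set features look like, here is a minimal sketch (not part of the original code, toy corpus only) that mimics that behaviour without tflearn:

import numpy as np

def build_vocab_features(docs, max_document_length):
    # assign each distinct word an id starting at 1 (0 is reserved for padding),
    # then map every document to a fixed-length sequence of word ids
    vocab = {}
    features = np.zeros((len(docs), max_document_length), dtype=np.int64)
    for row, doc in enumerate(docs):
        for col, word in enumerate(doc.split()[:max_document_length]):
            vocab.setdefault(word, len(vocab) + 1)
            features[row, col] = vocab[word]
    return features

print(build_vocab_features(["free entry win prize", "call me later"], 6))
# expected: [[1 2 3 4 0 0]
#            [5 6 7 0 0 0]]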
(2) Bag-of-words model
def get_features_by_wordbag():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    # reuse the vocabulary learned on the training set so the test features align
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 vocabulary=vocabulary,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    print(vectorizer)
    x_test = vectorizer.fit_transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
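As a quick illustration of the bag-of-words representation (a toy example, not taken from the SMS data), CountVectorizer builds a vocabulary over the corpus and turns each document into a vector of term counts. Note that get_feature_names_out() assumes scikit-learn 1.0 or later; older releases use get_feature_names().

from sklearn.feature_extraction.text import CountVectorizer

toy = ["win a free prize now", "call me now please call"]
cv = CountVectorizer()
m = cv.fit_transform(toy)
print(cv.get_feature_names_out())   # ['call' 'free' 'me' 'now' 'please' 'prize' 'win']
print(m.toarray())
# [[0 1 0 1 0 1 1]
#  [2 0 1 1 1 0 0]]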
(3) TF-IDF model
def get_features_by_wordbag_tfidf():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 binary=True)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 vocabulary=vocabulary,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 binary=True)
    print(vectorizer)
    x_test = vectorizer.fit_transform(x_test)
    x_test = x_test.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    x_train = transformer.fit_transform(x_train)
    x_train = x_train.toarray()
    x_test = transformer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
Here x_train[0] is used as an example to show how its vector representation changes at each step.
The raw content of x_train[0] is:
Sorry,in meeting I'll call later
Next, scikit-learn's CountVectorizer class is used to compute term counts for the text:
vectorizer = CountVectorizer(decode_error='ignore',
                             strip_accents='ascii',
                             max_features=max_features,
                             stop_words='english',
                             max_df=1.0,
                             min_df=1,
                             binary=True)
x_train = vectorizer.fit_transform(x_train)
Printing x_train[0] at this point gives the sparse representation below; note that because max_features limits the vocabulary size, only the words kept in the vocabulary contribute non-zero entries:
(0, 308) 1
(0, 211) 1
(0, 194) 1
(0, 180) 1
Convert it to a dense vector:
x_train=x_train.toarray()
The result is as follows:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Apply the TF-IDF transformation:
transformer = TfidfTransformer(smooth_idf=False)
x_train=transformer.fit_transform(x_train)
After the transformation, x_train[0] is:
(0, 366) 0.6954286575055049
(0, 238) 0.7185951449321735
Convert it to a dense vector:
x_train=x_train.toarray()
After converting to a dense array, the result is:
(a vector of length max_features that is zero everywhere except for the values 0.71859514 and 0.69542866 at the two positions shown above)
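The two non-zero weights above come from TF-IDF weighting followed by L2 normalization of the row. A small sketch (toy counts, not taken from the SMS data) of what TfidfTransformer(smooth_idf=False) does:

from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

# 3 documents x 3 terms; term 0 appears in every document, term 2 in only one
counts = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])
# with smooth_idf=False, idf(t) = ln(n_docs / df(t)) + 1, and each row is L2-normalized,
# so rarer terms receive larger weights than terms that occur in every document
tfidf = TfidfTransformer(smooth_idf=False).fit_transform(counts)
print(np.round(tfidf.toarray(), 3))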
(4) N-gram model
Compared with the TF-IDF feature extraction above, the main differences are the ngram_range=(3, 3) and token_pattern parameters, so each feature is a three-word sequence (trigram) rather than a single word; everything else stays the same.
The code is as follows:
def get_features_by_ngram():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    vectorizer = CountVectorizer(decode_error='ignore',
                                 ngram_range=(3, 3),
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 token_pattern=r'\b\w+\b',
                                 binary=True)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    vectorizer = CountVectorizer(decode_error='ignore',
                                 ngram_range=(3, 3),
                                 strip_accents='ascii',
                                 vocabulary=vocabulary,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1,
                                 token_pattern=r'\b\w+\b',
                                 binary=True)
    print(vectorizer)
    x_test = vectorizer.fit_transform(x_test)
    x_test = x_test.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    x_train = transformer.fit_transform(x_train)
    x_train = x_train.toarray()
    x_test = transformer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
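To make the trigram features concrete, here is a small sketch on a toy sentence (not from the original text; stop-word filtering is omitted for brevity, and get_feature_names_out() assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(3, 3), token_pattern=r'\b\w+\b')
vec.fit(["free entry in a weekly competition to win tickets"])
print(vec.get_feature_names_out())
# ['a weekly competition' 'competition to win' 'entry in a' 'free entry in'
#  'in a weekly' 'to win tickets' 'weekly competition to']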
(5) Word2Vec
1. Data preprocessing
Compared with the bag-of-words and word-set models, the Word2Vec pipeline adds the following preprocessing step. SMS messages may contain special punctuation characters, and these characters can also help identify spam messages. The characters to handle are:
punctuation = """.,?!:;(){}[]"""
A common approach is to surround each special character with spaces, so that it is preserved as a separate token when the text is later split with split():
def cleanText(corpus):
    punctuation = """.,?!:;(){}[]"""
    corpus = [z.lower().replace('\n', '') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    # treat punctuation as individual words
    for c in punctuation:
        corpus = [z.replace(c, ' %s ' % c) for z in corpus]
    corpus = [z.split() for z in corpus]
    return corpus
Process the training and test data with cleanText and merge them into a single corpus x for training Word2Vec:
x_train=cleanText(x_train)
x_test=cleanText(x_test)
x=x_train+x_test
2. Building the model
Initialize the Word2Vec object. size is the number of hidden-layer nodes in the Word2Vec network and therefore also the dimensionality of the resulting word vectors; window is the training window length; words occurring fewer than min_count times are ignored; iter is the number of training passes, and the gensim documentation strongly recommends increasing iter (default 5) to improve vector quality:
if os.path.exists(word2ver_bin):
    print("Find cache file %s" % word2ver_bin)
    model = gensim.models.Word2Vec.load(word2ver_bin)
else:
    # gensim < 4.0 API: in gensim 4.x, size becomes vector_size and iter becomes epochs
    model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(word2ver_bin)
3. Vectorizing with Word2Vec
After training, the word vectors are stored in the model variable and can be accessed like a dictionary. For example, the vector for the word love is obtained with:
model['love']
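Note that this indexing syntax belongs to older gensim releases; in gensim 4.x (an assumption about the installed version) word vectors are accessed through the wv attribute, which also works in gensim 3.x:

vec = model.wv['love']   # 1-D numpy array whose length equals size / vector_size
print(vec.shape)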
A useful property of Word2Vec is that the meaning of a sentence or short phrase can be approximated by summing the vectors of all its words and taking the average, for example:
vector("good boy") ≈ (model['good'] + model['boy']) / 2
Using this property, the Word2Vec vectors of the words and characters making up each message are summed and averaged:
def buildWordVector(imdb_w2v, text, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += imdb_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec
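A hypothetical usage sketch (assuming model and max_features are defined as above): each cleaned message is reduced to one row vector.

sample = ['sorry', 'in', 'meeting', 'call', 'later']
vec = buildWordVector(model, sample, max_features)
print(vec.shape)   # (1, max_features)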
Because words occurring fewer than min_count times are not trained, and the test samples may contain unhandled special characters, the KeyError is caught to keep the program from exiting abnormally. The training and test sets are processed in turn to obtain their Word2Vec features, which are then standardized with the scale function:
x_train= np.concatenate([buildWordVector(model,z, max_features) for z in x_train])
x_train = scale(x_train)
x_test= np.concatenate([buildWordVector(model,z, max_features) for z in x_test])
x_test = scale(x_test)
4. Standardization with scale
The scale function standardizes each feature dimension to zero mean and unit variance, so that dimensions with unusually large or small values do not dominate the classifier. For example:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
X_scaled = preprocessing.scale(X)
print(X_scaled)
The output is:
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
5. Complete source
Putting it together, the full feature-extraction flow is:
def get_features_by_word2vec():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    print(len(x_train), len(y_train))
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    print('before', len(y_train))
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = scale(x_train)
    print('after', len(x_train))
    print(x_train.shape)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = scale(x_test)
    return x_train, x_test, y_train, y_test
6. Example
Take x_train[0] as an example to walk through vectorization. First load the data from the file:
x_train, x_test, y_train, y_test=load_all_files()
After this call, x_train[0] is:
If you don't, your prize will go to another customer. T&C at www.t-c.biz 18+ 150p/min Polo Ltd Suite 373 London W1J 6HL Please call back if busy
After running cleanText, the tokenized x_train[0] becomes:
['if', 'you', "don't", ',', 'your', 'prize', 'will', 'go', 'to', 'another', 'customer', '.', 't&c', 'at', 'www', '.', 't-c', '.', 'biz', '18+', '150p/min', 'polo', 'ltd', 'suite', '373', 'london', 'w1j', '6hl', 'please', 'call', 'back', 'if', 'busy']
Next, build the Word2Vec model and vectorize the messages:
if os.path.exists(word2ver_bin):
    print("Find cache file %s" % word2ver_bin)
    model = gensim.models.Word2Vec.load(word2ver_bin)
else:
    model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(word2ver_bin)
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
Averaging the Word2Vec vectors of all the words in the message, as described above, turns x_train[0] into a single dense vector of length max_features, shown (abridged) below:
[ 0.5688461  -0.7963458  -0.53969711  0.42368432  1.7073138   1.17516173
  0.32935769  0.1749727  -1.10261336 -1.14618023 -0.64693019  0.03879264
  ...
  1.35748601  1.41187301  0.82758802  1.23182959]
(6) Word2Vec_1d
The only difference from the Word2Vec pipeline above is the standardization step: MinMaxScaler is used instead of scale. The processing flow is as follows.
1. Standardization with MinMaxScaler
To map every dimension into the range between 0 and 1, use the MinMaxScaler. For example, given the following input:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X)
print(X_train_minmax)
The output is:
[[0.5 0. 1. ]
[1. 0.5 0.33333333]
[0. 1. 0. ]]
Compared with the scale-based processing used for Word2Vec, the Word2Vec_1d standardization flow is:
min_max_scaler = preprocessing.MinMaxScaler()
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
x_train = min_max_scaler.fit_transform(x_train)
x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
x_test = min_max_scaler.transform(x_test)
2. Complete source
def get_features_by_word2vec_cnn_1d():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = min_max_scaler.fit_transform(x_train)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = min_max_scaler.transform(x_test)
    return x_train, x_test, y_train, y_test
(7) Word2Vec_2d
def get_features_by_word2vec_cnn_2d():
    global max_features
    global max_document_length
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train_vecs = []
    x_test_vecs = []
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    # instead of averaging, keep one Word2Vec vector per word: each SMS becomes a
    # (max_document_length, max_features) matrix, truncated or zero-padded as needed
    for sms in x_train:
        sms = sms[:max_document_length]
        x_train_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            x_train_vec[i] = model[w].reshape((1, max_features))
        x_train_vecs.append(x_train_vec)
    for sms in x_test:
        sms = sms[:max_document_length]
        x_test_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            x_test_vec[i] = model[w].reshape((1, max_features))
        x_test_vecs.append(x_test_vec)
    # fit the MinMaxScaler on the training matrices, then scale train and test alike
    min_max_scaler = preprocessing.MinMaxScaler()
    print("fit min_max_scaler")
    x_train_2d = np.concatenate([z for z in x_train_vecs])
    min_max_scaler.fit(x_train_2d)
    x_train = np.concatenate([min_max_scaler.transform(i) for i in x_train_vecs])
    x_test = np.concatenate([min_max_scaler.transform(i) for i in x_test_vecs])
    # reshape into 4-D tensors for the 2-D CNN: (samples, height, width, channels)
    x_train = x_train.reshape([-1, max_document_length, max_features, 1])
    x_test = x_test.reshape([-1, max_document_length, max_features, 1])
    return x_train, x_test, y_train, y_test
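Unlike the averaged Word2Vec features, this variant keeps one vector per word, so each message becomes a 2-D matrix that is later fed to the 2-D CNN. A hypothetical shape check (assuming the globals max_document_length and max_features are set as elsewhere in this chapter):

x_train, x_test, y_train, y_test = get_features_by_word2vec_cnn_2d()
print(x_train.shape)   # (n_train, max_document_length, max_features, 1)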
III. Building the Models
(1) Naive Bayes (NB)
1. NB with bag-of-words / word-set features
def do_nb_wordbag(x_train, x_test, y_train, y_test):
    print("NB and wordbag")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
The results are as follows:
NB and wordbag
              precision    recall  f1-score   support

           0       0.99      0.66      0.79      1918
           1       0.31      0.96      0.47       312

    accuracy                           0.70      2230
   macro avg       0.65      0.81      0.63      2230
weighted avg       0.90      0.70      0.74      2230

[[1258  660]
 [  12  300]]
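Reading the confusion matrix above (rows are true labels, columns are predictions, 0 = ham, 1 = spam), the reported metrics can be reproduced by hand:

tn, fp, fn, tp = 1258, 660, 12, 300
accuracy       = (tn + tp) / (tn + fp + fn + tp)   # 1558 / 2230 ≈ 0.70
recall_spam    = tp / (tp + fn)                    # 300 / 312   ≈ 0.96
precision_spam = tp / (tp + fp)                    # 300 / 960   ≈ 0.31
print(accuracy, recall_spam, precision_spam)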
2. NB with Word2Vec features
def do_nb_word2vec(x_train, x_test, y_train, y_test):
    print("NB and word2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
3. NB with Doc2Vec features
def do_nb_doc2vec(x_train, x_test, y_train, y_test):
    print("NB and doc2vec")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
(2) SVM
The logic here mirrors the NB section; only the classifier is replaced with an SVM. The source is as follows (the Word2Vec and Doc2Vec variants differ from the bag-of-words one only in the features they receive):
def do_svm_wordbag(x_train, x_test, y_train, y_test):
    print("SVM and wordbag")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
def do_svm_word2vec(x_train, x_test, y_train, y_test):
    print("SVM and word2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

def do_svm_doc2vec(x_train, x_test, y_train, y_test):
    print("SVM and doc2vec")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
(3) XGBoost
XGBoost is a classification algorithm that has become popular in recent years. Originally developed by Tianqi Chen, it is a scalable, portable, distributed implementation of gradient boosting, available as a library for C++, Python, R and other languages, and now maintained by many contributors. The underlying algorithm is the gradient boosted decision tree, which can be used for both classification and regression. XGBoost's main strength is that it automatically exploits CPU multi-threading for parallelism while adding algorithmic improvements that raise accuracy. Its debut was Kaggle's Higgs boson signal-identification competition, where its efficiency and high prediction accuracy attracted wide attention on the competition forum and earned it a place among more than 1,700 competing teams; as its reputation in the Kaggle community grew, teams have since used XGBoost to win competitions outright. Kaggle itself, founded in Melbourne in 2010 by co-founder and CEO Anthony Goldbloom, is a platform where developers and data scientists can run machine-learning competitions, host datasets, and write and share code; it has attracted some 800,000 data scientists.
def do_xgboost_wordbag(x_train, x_test, y_train, y_test):
    print("xgboost and wordbag")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_xgboost_word2vec(x_train, x_test, y_train, y_test):
    print("xgboost and word2vec")
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
(4) MLP
def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print("MLP and wordbag")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_dnn_word2vec(x_train, x_test, y_train, y_test):
    print("MLP and word2vec")
    global max_features
    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
Sample results for the MLP with bag-of-words features are shown below. Note that recall for the spam class is 0: with these settings the network predicts every message as ham, so the 0.86 accuracy simply reflects the class imbalance.
              precision    recall  f1-score   support

           0       0.86      1.00      0.92      1918
           1       0.00      0.00      0.00       312

    accuracy                           0.86      2230
   macro avg       0.43      0.50      0.46      2230
weighted avg       0.74      0.86      0.80      2230

0.8600896860986547
(5) CNN
Note that padding with pad_sequences is applied only to the word-set (wordbag) features; the Word2Vec features are not padded.
1. Word-set (wordbag) features
def do_cnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("CNN and tf")
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
2. Word2Vec features
def do_cnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and word2vec")
    y_test = testY
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
3. Word2Vec 2d, model 1
def do_cnn_word2vec_2d(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec2d")
    y_test = testY
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = conv_2d(network, 32, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = conv_2d(network, 64, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = fully_connected(network, 128, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='tanh')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.01,
                         loss='categorical_crossentropy', name='target')
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
4. Word2Vec 2d, model 2
def do_cnn_word2vec_2d_345(trainX, testX, trainY, testY):
    global max_features
    global max_document_length
    print("CNN and word2vec_2d_345")
    y_test = testY
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_document_length, max_features, 1], name='input')
    network = tflearn.embedding(network, input_dim=1, output_dim=128, validate_indices=False)
    branch1 = conv_2d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_2d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_2d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool_2d(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="sms")
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
5. Doc2Vec
def do_cnn_doc2vec(trainX, testX, trainY, testY):
    global max_features
    print("CNN and doc2vec")
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Building convolutional network
    network = input_data(shape=[None, max_features], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="review")
(6) RNN
The bag-of-words (wordbag) version is as follows:
def do_rnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print("RNN and wordbag")
    y_test = testY
    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_document_length])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
The Word2Vec version differs little from the wordbag one, except that it does not print a detailed report:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
To also print a test report, add the prediction and reporting code as follows:
def do_rnn_word2vec(trainX, testX, trainY, testY):
    global max_features
    print("RNN and word2vec")
    y_test = testY   # keep the original labels for the report below
    trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
    testX = pad_sequences(testX, maxlen=max_features, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)
    # Network building
    net = tflearn.input_data([None, max_features])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')
    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10, run_id="sms", n_epoch=5)
    y_predict_list = model.predict(testX)
    print(y_predict_list)
    y_predict = []
    for i in y_predict_list:
        print(i[0])
        if i[0] > 0.5:
            y_predict.append(0)
        else:
            y_predict.append(1)
    print(classification_report(y_test, y_predict))
    print(metrics.confusion_matrix(y_test, y_predict))
On my machine the RNN required more memory than was available, so I was not able to run it.
Overall, the author's code carries few comments and does not explain some of the key details, a problem also raised in reviews of the book. For beginners it is hard to follow without background knowledge, and much has to be looked up independently; for readers who really need the details, the treatment is not deep enough either, amounting mostly to a demonstration of how to apply the methods. Anyone who wants to do serious work in this direction should still read the papers.