7. [Machine Learning Assignment] Support Vector Machines (SVM) (Python version of ex6)

Liaojiajia-2020

Article link: https://blog.csdn.net/mary_0830/article/details/99589528

Support Vector Machines (SVM)

(I) Support Vector Machines
- (1) Linearly separable dataset: Example Dataset 1
- (2) Non-linearly separable dataset: SVM with Gaussian Kernels (two examples)
(II) Spam Classification

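(I) Support Vector Machines

The dataset used in the following experiments comes from Andrew Ng's machine learning course: the sixth assignment (ex6), on support vector machines.
The experiment directory is laid out as shown in the figure:
[Figure: experiment directory]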

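(1) Linearly separable dataset: Example Dataset 1

The linearly separable experiment uses the dataset ex6data1.mat. The code for this part is stored in SVM_ex6data1.py.

First, import the required libraries: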

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt
from scipy.io import loadmat
from sklearn.metrics import classification_report # for evaluation reports

Load the dataset, print it, and visualize the data:

raw_data = loadmat('data/ex6data1.mat') # forward slash works on all platforms
print('data:',raw_data)

X, y = raw_data['X'], raw_data['y'] # feature matrix and labels, used by plot_data and the SVM below
data = pd.DataFrame(X,columns = ['X1','X2'])
data['y'] = y
positive = data[data['y'].isin([1])] # isin() filters the DataFrame rows: keep only samples with y == 1
negative = data[data['y'].isin([0])] # keep only samples with y == 0

def plot_data(X,y):
    '''Scatter plot of the dataset (drawn from the global positive/negative frames)'''
    fig, ax = plt.subplots()
    ax.scatter(positive['X1'],positive['X2'],s = 30,marker = 'x',label = 'Positive',c = 'black')
    ax.scatter(negative['X1'],negative['X2'],s = 30,marker = 'o',label = 'Negative',c = 'y')
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.title('Example Dataset 1')
    plt.legend()
    plt.show()

plot_data(X,y)

Output, the contents of the dataset:

data: {'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Nov 13 14:28:43 2011', '__version__': '1.0', '__globals__': [], 'X': array([[1.9643 , 4.5957 ],
[2.2753 , 3.8589 ],
[2.9781 , 4.5651 ],
[2.932 , 3.5519 ],
[3.5772 , 2.856 ],
[4.015 , 3.1937 ],
[3.3814 , 3.4291 ],
[3.9113 , 4.1761 ],
[2.7822 , 4.0431 ],
[2.5518 , 4.6162 ],
[3.3698 , 3.9101 ],
[3.1048 , 3.0709 ],
[1.9182 , 4.0534 ],
[2.2638 , 4.3706 ],
[2.6555 , 3.5008 ],
[3.1855 , 4.2888 ],
[3.6579 , 3.8692 ],
[3.9113 , 3.4291 ],
[3.6002 , 3.1221 ],
[3.0357 , 3.3165 ],
[1.5841 , 3.3575 ],
[2.0103 , 3.2039 ],
[1.9527 , 2.7843 ],
[2.2753 , 2.7127 ],
[2.3099 , 2.9584 ],
[2.8283 , 2.6309 ],
[3.0473 , 2.2931 ],
[2.4827 , 2.0373 ],
[2.5057 , 2.3853 ],
[1.8721 , 2.0577 ],
[2.0103 , 2.3546 ],
[1.2269 , 2.3239 ],
[1.8951 , 2.9174 ],
[1.561 , 3.0709 ],
[1.5495 , 2.6923 ],
[1.6878 , 2.4057 ],
[1.4919 , 2.0271 ],
[0.962 , 2.682 ],
[1.1693 , 2.9276 ],
[0.8122 , 2.9992 ],
[0.9735 , 3.3881 ],
[1.25 , 3.1937 ],
[1.3191 , 3.5109 ],
[2.2292 , 2.201 ],
[2.4482 , 2.6411 ],
[2.7938 , 1.9656 ],
[2.091 , 1.6177 ],
[2.5403 , 2.8867 ],
[0.9044 , 3.0198 ],
[0.76615 , 2.5899 ],
[0.086405, 4.1045 ]]), 'y': array([[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[1],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[1]], dtype=uint8)}

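Scatter plot of the dataset:
[Figure: scatter plot of Example Dataset 1]
Note: the plot shows one outlying positive sample lying far from the rest of the positive class, yet the data are still linearly separable. Next, a linear SVM is trained to learn the class boundary. Implementing an SVM from scratch is fairly laborious, so scikit-learn is used instead.

In this part of the exercise, different values of the C parameter are tried with the SVM. C is a positive value that controls the penalty for misclassified training examples. The larger C is, the harder the SVM tries to classify every training example correctly.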

from sklearn import svm

def plot_boundary(clf,X):
    '''Plot the decision boundary (separating hyperplane)'''
    x_min, x_max = X[:,0].min()*1.2, X[:,0].max()*1.1
    y_min, y_max = X[:,1].min()*1.1, X[:,1].max()*1.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),np.linspace(y_min, y_max, 500)) # build the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # ravel flattens each grid to 1-D; np.c_ pairs them up as (x, y) rows
    Z = Z.reshape(xx.shape) # back to the grid shape
    plt.contour(xx, yy, Z) # draw the decision boundary as a contour line

C_values = [1, 50, 100]
models = [svm.SVC(C = C, kernel = 'linear') for C in C_values] # C must be passed by keyword in current scikit-learn
clfs = [model.fit(X, y.ravel()) for model in models]
titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in C_values]

for model,title in zip(clfs,titles): # 'titles' is no longer shadowed by the loop variable
    #plt.figure()
    plot_data(X,y)
    plot_boundary(model,X)
    plt.title(title)

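Code notes:
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
In this line, ravel() flattens a multi-dimensional array into a one-dimensional array. np.c_[] then places xx.ravel() and yy.ravel() side by side as two columns, so each row becomes one coordinate pair:
[[x1, y1],
 [x2, y2],
 ...]
Here xx and yy are the coordinate grids produced by np.meshgrid, so the line above predicts the class of every [x, y] position on the grid.
svm.SVC with kernel = 'linear' uses the SVM implementation in scikit-learn; its parameters are described in the scikit-learn documentation.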
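The grid construction described above can be checked on a tiny grid. This is a minimal sketch: the threshold rule standing in for clf.predict is a hypothetical placeholder, not the trained model.

```python
import numpy as np

# Two x positions and three y positions give a 3 x 2 grid.
xx, yy = np.meshgrid(np.array([0.0, 1.0]), np.array([10.0, 20.0, 30.0]))

# ravel() flattens each grid row by row; np.c_ pairs the two 1-D arrays as columns.
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)  # one (x, y) row per grid position

# Hypothetical stand-in for clf.predict: label a point 1 when x + y / 10 > 2.5.
labels = (grid_points[:, 0] + grid_points[:, 1] / 10 > 2.5).astype(int)

# Reshape the flat predictions back to the grid shape so plt.contour can use them.
Z = labels.reshape(xx.shape)
print(Z)
```

Because the predictions are reshaped to xx.shape, each entry of Z lines up with the grid position it was computed for.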

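Results:
With C = 1:
[Figure: decision boundary with C = 1]
With C = 50:

[Figure: decision boundary with C = 50]
With C = 100:

[Figure: decision boundary with C = 100]
Note: when C is small, the penalty for misclassification is small, the model is more relaxed and the margin is wider, which may lead to underfitting. When C is large, the penalty is large, the margin between the two classes is narrower, and overfitting becomes more likely. The larger C is, the less willing the model is to give up on outliers; the smaller C is, the less weight the outliers carry. In the figures above, with C = 100 the SVM classifies every example correctly, but the decision boundary it draws does not appear to be a natural fit for the data.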
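The relationship between C and the margin can also be checked numerically. The sketch below uses a small hypothetical dataset (two clusters plus one positive outlier, not ex6data1.mat) and reports the margin width 2/||w|| of a linear SVM for several values of C.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: two clusters plus one positive outlier near the negatives.
X_toy = np.array([[3.0, 3.0], [3.5, 3.0], [3.0, 3.5], [4.0, 4.0],   # positive cluster
                  [1.0, 1.0], [1.5, 1.0], [1.0, 1.5], [0.5, 0.5],   # negative cluster
                  [1.8, 2.2]])                                      # positive outlier
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1])

margins = {}
for C in [0.1, 1, 100]:
    clf = svm.SVC(C=C, kernel='linear').fit(X_toy, y_toy)
    w = clf.coef_[0]
    margins[C] = 2.0 / np.linalg.norm(w)   # width of the margin band
    print('C = {:>5}: margin width = {:.3f}, training accuracy = {:.2f}'.format(
        C, margins[C], clf.score(X_toy, y_toy)))
```

For a soft-margin linear SVM the norm of w does not decrease as C grows, so the margin width should not increase from one row to the next.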

Supplementary material (reference: plotting the separating hyperplane)
The supplementary code below is stored in SVM_test.py.

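# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# randomly generate two groups of data
np.random.seed(0) # fix the seed so the random numbers are the same on every run

X = np.r_[np.random.randn(20,2)-[2,2],np.random.randn(20,2)+[2,2]]
# np.r_ stacks the two matrices vertically (one on top of the other), so their column counts must match;
# np.c_ places two matrices side by side, so their row counts must match
Y = [0] * 20+[1] * 20

# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X,Y)

# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1] # slope of the line
xx = np.linspace(-5,5) # 50 evenly spaced values from -5 to 5 (the np.linspace default)
yy = a * xx - (clf.intercept_[0]) / w[1]  # -(clf.intercept_[0])/w[1] is the intercept of the line
# clf.intercept_[0] is w3 in the formula a1*x1 + a2*x2 + w3

# lines parallel to the hyperplane that pass through the support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0]) # (b[1]-a*b[0]) is the intercept of the line through this support vector
b = clf.support_vectors_[-1]
yy_up = a * xx +(b[1] - a * b[0])

print("w:",w) # weight coefficients
print("a:",a) # slope
print("suport_vectors_:",clf.support_vectors_) # support vectors
print("clf.coef_:",clf.coef_)  # weight coefficients as a 2-D array

# plot
plt.figure()
plt.plot(xx,yy,'k-')
plt.plot(xx,yy_down,'k--') # lower margin line
plt.plot(xx,yy_up,'k--') # upper margin line
plt.scatter(clf.support_vectors_[:,0],clf.support_vectors_[:,1],s = 80,facecolors='none',edgecolors='k') # circle the support vectors (x from column 0, y from column 1)
plt.scatter(X[:,0],X[:,1],c=Y,cmap=plt.cm.Spectral)
plt.axis('tight')
plt.show()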

Results:

w: [0.90230696 0.64821811]
a: -1.391980476255765
suport_vectors_: [[-1.02126202 0.2408932 ]
[-0.46722079 -0.53064123]
[ 0.95144703 0.57998206]]
clf.coef_: [[0.90230696 0.64821811]]

[Figure: separating hyperplane, margin lines, and support vectors]
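The margin lines can be cross-checked against the decision function: on a separable dataset with a large C, every support vector should satisfy |w·x + b| ≈ 1, and the two margin lines are 2/||w|| apart. The sketch below assumes a hypothetical six-point dataset, not the random data generated above.

```python
import numpy as np
from sklearn import svm

# Hypothetical, clearly separable toy data (not the random data used above).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1000).fit(X_demo, y_demo)
w, b = clf.coef_[0], clf.intercept_[0]

# Value of w.x + b at each support vector; on the margin it is +1 or -1.
sv_values = clf.decision_function(clf.support_vectors_)
margin_width = 2.0 / np.linalg.norm(w)

print('w =', w, 'b =', b)
print('decision values at support vectors:', np.round(sv_values, 3))
print('margin width:', round(margin_width, 3))
```

For this data the closest points across the classes are (1, 0), (0, 1) and (3, 3), which gives w = (0.4, 0.4), b = -1.4 and a margin width of about 3.54 when worked out by hand.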

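(2) Non-linearly separable dataset: SVM with Gaussian Kernels (two examples)

In this part of the exercise, SVMs are used for non-linear classification; specifically, an SVM with a Gaussian kernel is applied to a dataset that is not linearly separable. The dataset used here is ex6data2.mat. The code for this part is stored in SVM_ex6data2.py.

The Gaussian kernel is defined as:
$K_{gaussian}(x^{(i)},x^{(j)})=\exp\left ( -\frac{\left \| x^{(i)}-x^{(j)} \right \|^{2}}{2\sigma ^{2}} \right )=\exp\left ( -\frac{\sum_{k=1}^{n}(x_{k}^{(i)}-x_{k}^{(j)})^{2}}{2\sigma ^{2}} \right )$

Code implementing the Gaussian kernel: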

def gaussianKernel(x1, x2, sigma):
    '''Gaussian kernel formula'''
    G = np.exp(-(np.sum((x1 - x2)**2) / (2 * (sigma **2))))
    return G

x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2
G = gaussianKernel(x1, x2, sigma)
print('gaussianKernel:',G)

Result:

gaussianKernel: 0.32465246735834974
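When the built-in RBF kernel of scikit-learn is used instead of a hand-written one, sigma has to be converted into gamma. scikit-learn defines the RBF kernel as exp(-gamma * ||x - x'||^2), so gamma = 1 / (2 * sigma^2) reproduces the formula above. A minimal check, using sklearn.metrics.pairwise.rbf_kernel and a local helper named gaussian_kernel (an illustrative name):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_kernel(x1, x2, sigma):
    # Same formula as gaussianKernel above: exp(-||x1 - x2||^2 / (2 * sigma^2)).
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2

# scikit-learn's RBF kernel is exp(-gamma * ||x1 - x2||^2),
# so gamma = 1 / (2 * sigma^2) reproduces the Gaussian kernel.
gamma = 1.0 / (2 * sigma ** 2)
k_manual = gaussian_kernel(x1, x2, sigma)
k_sklearn = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```

Note that gamma = np.power(sigma, -2), as used in the training code of this section, equals 1/sigma^2, which corresponds to a Gaussian kernel of width sigma/sqrt(2) rather than sigma.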

Next, the non-linearly separable data is handled (ex6data2.mat).
As before, first import the required libraries:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt
from scipy.io import loadmat
from sklearn.metrics import classification_report # for evaluation reports
from sklearn import svm # needed for svm.SVC below

Load and visualize the dataset:

raw_data = loadmat('data/ex6data2.mat') # forward slash works on all platforms
print('data:',raw_data)
X, y = raw_data['X'],raw_data['y']
data = pd.DataFrame(raw_data['X'],columns = ['X1','X2'])
data['y'] = raw_data['y']
positive = data[data['y'].isin([1])] # isin() filters the DataFrame rows: keep only samples with y == 1
negative = data[data['y'].isin([0])] # keep only samples with y == 0

def plot_data(X,y):
    '''Scatter plot of the dataset (drawn from the global positive/negative frames)'''
    fig, ax = plt.subplots()
    ax.scatter(positive['X1'],positive['X2'],s = 20,marker = 'x',label = 'Positive',c = 'black')
    ax.scatter(negative['X1'],negative['X2'],s = 20,marker = 'o',label = 'Negative',c = 'y')
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.title('Example Dataset 2')
    plt.legend()
    plt.show()

plot_data(X,y)

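Result:
[Figure: scatter plot of Example Dataset 2]
For this dataset, an SVM classifier is built with the built-in RBF kernel and its accuracy on the training data is examined.
Code for plotting the non-linear decision boundary:

def plot_boundary(clf,X):
    '''Plot the decision boundary'''
    x_min, x_max = X[:,0].min()*1.2, X[:,0].max()*1.1
    y_min, y_max = X[:,1].min()*1.1, X[:,1].max()*1.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),np.linspace(y_min, y_max, 500)) # build the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # ravel flattens each grid to 1-D; np.c_ pairs them up as (x, y) rows
    Z = Z.reshape(xx.shape) # back to the grid shape
    plt.contour(xx, yy, Z) # draw the decision boundary as a contour line

sigma = 0.1
# gamma = 1/sigma^2 here; scikit-learn's RBF kernel is exp(-gamma*||x-x'||^2),
# so an exact match to the Gaussian kernel above would be gamma = 1/(2*sigma^2)
clf = svm.SVC(C = 1,kernel = 'rbf',gamma = np.power(sigma,-2))
model = clf.fit(X, y.ravel()) # ravel() turns the (m, 1) label column into the 1-D array scikit-learn expects
plot_data(X, y)
plot_boundary(model, X)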

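Results:
First set: varying sigma gives the following plots.
With sigma = 0.1:
[Figure: decision boundary with sigma = 0.1]
With sigma = 0.2:

With sigma = 0.5:

Note: comparing the three results, the larger sigma is, the smoother the decision boundary becomes, but it no longer separates the two classes clearly. It is only a rough dividing line and tends to underfit (high bias). Conversely, when sigma is very small, the boundary separates the two classes well apart from a few points that lie very close together, and it tends to overfit (high variance).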
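The effect of sigma can also be seen numerically, without plots. The sketch below assumes a synthetic "moons" dataset from scikit-learn as a stand-in for ex6data2.mat, and converts sigma to gamma as 1 / (2 * sigma^2); the exact numbers will differ from the assignment data.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_moons

# Hypothetical stand-in for ex6data2.mat: a noisy two-class "moons" dataset.
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)

train_acc = {}
for sigma in [0.1, 0.5, 5.0]:
    gamma = 1.0 / (2 * sigma ** 2)          # Gaussian kernel width expressed as gamma
    clf = svm.SVC(C=1, kernel='rbf', gamma=gamma).fit(X_demo, y_demo)
    train_acc[sigma] = clf.score(X_demo, y_demo)
    print('sigma = {}: training accuracy = {:.3f}, support vectors = {}'.format(
        sigma, train_acc[sigma], len(clf.support_vectors_)))
```

A small sigma lets the boundary follow individual training points, so training accuracy is high; a large sigma gives a nearly linear boundary that cannot follow the curved class shapes.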

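Second set: varying the C parameter gives the following plots.
With C = 1:
[Figure: decision boundary with C = 1]
With C = 100:

With C = 1000:

Note: comparing the three results, the smaller C is, the looser the decision boundary separating the two classes and the slightly wider the margin, so some positive samples end up on the negative side. As C increases, the boundary becomes finer and nearly all points are classified correctly. As in the linearly separable example, a larger C makes overfitting more likely, and a smaller C makes underfitting more likely.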

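For the third dataset, the course provides a training set and a validation set; the task is to find the best hyperparameters for the SVM model based on validation-set performance.

The code below searches for the best hyperparameters. This part uses the dataset ex6data3.mat, and the code is stored in SVM_ex6data3.py.
First, import the required libraries, then define the Gaussian kernel and the helpers that plot the dataset and the decision boundary: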

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt
from scipy.io import loadmat
from sklearn.metrics import classification_report # for evaluation reports
from sklearn import svm

def gaussianKernel(x1, x2, sigma):
    '''Gaussian kernel formula'''
    G = np.exp(-(np.sum((x1 - x2)**2) / (2 * (sigma **2))))
    return G

x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2
G = gaussianKernel(x1, x2, sigma)
print('gaussianKernel:',G)

def plot_data(X,y):
    '''Scatter plot of the dataset (drawn from the global positive/negative frames)'''
    fig, ax = plt.subplots()
    ax.scatter(positive['X1'],positive['X2'],s = 20,marker = 'x',label = 'Positive',c = 'black')
    ax.scatter(negative['X1'],negative['X2'],s = 20,marker = 'o',label = 'Negative',c = 'y')
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.title('Example Dataset 3')
    plt.legend()
    plt.show()

def plot_boundary(clf,X):
    '''Plot the decision boundary'''
    x_min, x_max = X[:,0].min()*1.2, X[:,0].max()*1.1
    y_min, y_max = X[:,1].min()*1.1, X[:,1].max()*1.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),np.linspace(y_min, y_max, 500)) # build the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # ravel flattens each grid to 1-D; np.c_ pairs them up as (x, y) rows
    Z = Z.reshape(xx.shape) # back to the grid shape
    plt.contour(xx, yy, Z) # draw the decision boundary as a contour line

Load and visualize the dataset (as in the previous two experiments):

raw_data3 = loadmat('data/ex6data3.mat') # forward slash works on all platforms
print('data:',raw_data3)
X3, y3 = raw_data3['X'],raw_data3['y']
Xval, yval = raw_data3['Xval'], raw_data3['yval']
data3 = pd.DataFrame(raw_data3['X'],columns = ['X1','X2'])
data3['y'] = raw_data3['y']
positive = data3[data3['y'].isin([1])] # isin() filters the DataFrame rows: keep only samples with y == 1
negative = data3[data3['y'].isin([0])] # keep only samples with y == 0

plot_data(X3,y3)

Result:
[Figure: scatter plot of Example Dataset 3]
Next, write the code that searches for the best hyperparameters of the SVM model:

'''---------------------------<ex6data3.mat>--------------------------------'''
raw_data3 = loadmat('data/ex6data3.mat') # forward slash works on all platforms
print('data:',raw_data3)
X3, y3 = raw_data3['X'],raw_data3['y'].ravel()
Xval, yval = raw_data3['Xval'], raw_data3['yval'].ravel()
data3 = pd.DataFrame(raw_data3['X'],columns = ['X1','X2'])
data3['y'] = raw_data3['y']
positive = data3[data3['y'].isin([1])] # isin() filters the DataFrame rows: keep only samples with y == 1
negative = data3[data3['y'].isin([0])] # keep only samples with y == 0

plot_data(X3,y3)

# candidate hyperparameter values
C_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
gamma_values = C_values

# variables holding the best hyperparameters found so far
best_score = 0
best_params = {'C': None, 'gamma': None}

for C in C_values:
    for gamma in gamma_values: # try every (C, gamma) combination
        svc = svm.SVC(C = C, gamma = gamma)
        svc.fit(X3, y3)
        score = svc.score(Xval, yval) # train on the training set, score on the validation set
        if score > best_score:
            best_score = score
            best_params['C'] = C
            best_params['gamma'] = gamma # keep the combination with the highest validation score

print('best_params={}, best_score={}'.format(best_params, best_score))

sigma = 0.1
clf = svm.SVC(C = 1,kernel = 'rbf',gamma = np.power(sigma,-2)) # gamma = sigma**-2 = 100; C is fixed at 1 for this plot
model = clf.fit(X3, y3)
plot_data(X3, y3)
plot_boundary(model, X3)

Result:
[Figure: Example Dataset 3 with the decision boundary]

best_params={'C': 0.3, 'gamma': 100}, best_score=0.965
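The same search can be written compactly as a dictionary of validation scores. This is a sketch on synthetic data (a hypothetical stand-in for ex6data3.mat, split with train_test_split), so the best values it reports are not those of the assignment.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for ex6data3.mat: synthetic data split into training and validation sets.
X_all, y_all = make_moons(n_samples=300, noise=0.25, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_all, y_all, test_size=0.3, random_state=1)

candidates = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]

# Validation accuracy for every (C, gamma) pair: fit on the training set, score on the validation set.
scores = {(C, gamma): svm.SVC(C=C, kernel='rbf', gamma=gamma).fit(X_tr, y_tr).score(X_val, y_val)
          for C in candidates for gamma in candidates}

best_C, best_gamma = max(scores, key=scores.get)
best_score = scores[(best_C, best_gamma)]
print('best C = {}, best gamma = {}, validation accuracy = {:.3f}'.format(best_C, best_gamma, best_score))
```

When several pairs tie for the best score, max returns the first one in iteration order, which matches the strict `score > best_score` comparison in the loop above.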

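(II) Spam Classification

In this part of the exercise, SVMs are used to build a spam filter. Each email $x$ first has to be converted into an $n$-dimensional feature vector, and a classifier is then trained to decide whether a given email $x$ is spam $(y = 1)$ or non-spam $(y = 0)$.

This part uses the files emailSample1.txt, vocab.txt, spamTrain.mat, and spamTest.mat; all of the code below is stored in SVM_spam.py.

(1) Preprocessing Emails

The first step of the experiment is to preprocess the emails.

Read the contents of an email with the following code: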

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt
from scipy.io import loadmat

with open('data/emailSample1.txt') as f: # the with block closes the file after reading
    email = f.read()
print(email)

The contents read are:

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com

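The email contains a URL, an email address, numbers, and a dollar amount. Many other emails contain the same kinds of elements, but the specific values differ from email to email. The usual approach is therefore to normalize these values, so that all URLs are treated the same, all numbers are treated the same, and so on. For example, every URL is replaced with the unique string "httpaddr" to indicate that a URL was present, without keeping the specific URL. This typically improves the performance of a spam classifier: spammers often randomize their URLs, so the chance of seeing any particular URL again in a new spam message is very small.

The preprocessing and normalization steps are:

1. Lower-casing: convert the entire email to lower case.
2. Stripping HTML: remove all HTML tags, keeping only the content.
3. Normalizing URLs: replace every URL with the string "httpaddr".
4. Normalizing Email Addresses: replace every email address with "emailaddr".
5. Normalizing Dollars: replace every dollar sign ($) with "dollar".
6. Normalizing Numbers: replace every number with "number".
7. Word Stemming: reduce every word to its stem. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount".
8. Removal of non-words: remove non-words and punctuation, and collapse all whitespace (tabs, newlines, spaces) into a single space.

Apply normalization steps 1 to 6 to the email above. The code is as follows: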

import re
def process_email_1(email):
    '''Preprocess the email (steps 1-6)'''
    email = email.lower()  # convert all upper-case letters to lower case
    email = re.sub(r'<[^<>]+>',' ',email)  # strip HTML tags: match '<', then anything that is not '<' or '>', up to the closing '>'
    email = re.sub(r'(http|https)://[^\s]+','httpaddr',email) # normalize URLs
    email = re.sub(r'[^\s]+@[^\s]+','emailaddr',email) # normalize email addresses
    email = re.sub(r'[$]+','dollar',email) # normalize dollar signs
    email = re.sub(r'[\d]+','number',email) # normalize numbers
    return email

print(process_email_1(email))

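Notes:

re.sub(pattern, repl, string, count=0, flags=0)
pattern: the regular-expression pattern to match;
repl: the replacement (either a string or a function called on each match);
string: the input string to be processed;
count: the maximum number of replacements; the default 0 replaces all matches;
flags: regex flags that modify matching, for example re.IGNORECASE or re.MULTILINE.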
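The count and flags arguments, and a function as repl, can be tried on a short string. The sample text below is made up for illustration.

```python
import re

text = 'Call 555 or 777, pay $5 or $7.'

# Default count=0 replaces every match.
all_numbers = re.sub(r'[0-9]+', 'number', text)

# count=1 replaces only the first match.
first_only = re.sub(r'[0-9]+', 'number', text, count=1)

# flags=re.IGNORECASE makes the pattern case-insensitive.
no_case = re.sub(r'call', 'dial', text, flags=re.IGNORECASE)

# repl may be a function that receives the match object and returns the replacement.
doubled = re.sub(r'[0-9]+', lambda m: str(int(m.group()) * 2), text)

print(all_numbers)
print(first_only)
print(no_case)
print(doubled)
```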

Result:
For easier comparison, the original email is printed first, followed by the processed version.

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com

------------------for easier comparison------------------------------------------------
> anyone knows how much it costs to host a web portal ?
>
well, it depends on how many visitors you're expecting.
this can be anywhere from less than number bucks a month to a couple of dollarnumber. 
you should checkout httpaddr or perhaps amazon ecnumber 
if youre running something big..

to unsubscribe yourself from this mailing list, send an email to:
emailaddr

Next, the code for steps 7 and 8:

import nltk

def process_email_2_tokenlist(email):
    '''Preprocess the email (steps 7 and 8): stemming and removal of non-words'''
    stemmer = nltk.stem.porter.PorterStemmer() # Porter stemmer from NLTK
    email = process_email_1(email) # steps 1-6
    tokens = re.split(r'[\@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]',email) # split the email on these delimiters; note that whitespace is not among them
    tokenList = [] # list that collects the stemmed tokens
    for token in tokens: # go through each piece produced by the split
        token = re.sub(r'[^a-zA-Z0-9]+','',token) # remove non-alphanumeric characters
        stemmed = stemmer.stem(token) # stem the token
        if not len(token): # skip empty strings
            continue
        tokenList.append(stemmed)
    return tokenList # return the token list, which is what the output below shows

print(process_email_2_tokenlist(email))

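Note: for the usage of re.split(), see the re.split() documentation.
Result:

['anyoneknowshowmuchitcoststohostawebport', 'well', 'itdependsonhowmanyvisitorsy', 'reexpect', 'thiscanbeanywherefromlessthannumberbucksamonthtoacoupleofdollarnumb', 'youshouldcheckouthttpaddrorperhapsamazonecnumberifyourerunningsomethingbig', 'tounsubscribeyourselffromthismailinglist', 'sendanemailto', 'emailaddr']

Although the output above looks as if preprocessing was completed, the individual words were never separated, so stemming could not work on them: the split pattern contains no whitespace, so everything between two punctuation marks was glued into one long token once the non-alphanumeric characters were removed. After consulting further references and modifying the code, preprocessing works and every word is extracted. The revised code is as follows: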

import nltk

def get_vocab_list():
    '''Read vocab.txt into a dict {index: word}'''
    vocab_dict = {}
    with open('data/vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocab_dict[int(val)] = key
    return vocab_dict

def process_email_1(email):
    '''Preprocess the email (steps 1-8)'''
    vocab_list = get_vocab_list()
    word_indices = np.array([], dtype=np.int64)
    email = email.lower()  # convert all upper-case letters to lower case
    email = re.sub(r'<[^<>]+>',' ',email)  # strip HTML tags: match '<', then anything that is not '<' or '>', up to the closing '>'
    email = re.sub(r'(http|https)://[^\s]*','httpaddr',email)
    email = re.sub(r'[^\s]+@[^\s]+','emailaddr',email)
    email = re.sub(r'[$]+','dollar',email)
    email = re.sub(r'[0-9]+','number',email)
    print('==== Processed Email ====')
    stemmer = nltk.stem.porter.PorterStemmer() # Porter stemmer from NLTK
    # '-' is escaped so that '.-:' is not read as a character range, and \s makes any whitespace a delimiter
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){\},\'\">_<;%\s]',email) # split the email into single words
    tokenList = [] # list that collects the stemmed tokens
    for token in tokens: # go through each piece produced by the split
        token = re.sub('[^a-zA-Z0-9]+','',token) # remove non-alphanumeric characters
        stemmed = stemmer.stem(token) # stem the token
        if not len(token): # skip empty strings
            continue
        tokenList.append(stemmed)
        for i in range(1, len(vocab_list) + 1):
            if vocab_list[i] == stemmed: # compare the stem, since the vocabulary stores stemmed words
                word_indices = np.append(word_indices, i)

        print(token)

    print('==================')

    return word_indices

The result is as follows:

==== Processed Email ====
anyone
knows
how
much
it
costs
to
host
a
web
portal
well
it
depends
on
how
many
visitors
you
re
expecting
this
can
be
anywhere
from
less
than
number
bucks
a
month
to
a
couple
of
dollarnumber
you
should
checkout
httpaddr
or
perhaps
amazon
ecnumber
if
youre
running
something
big
to
unsubscribe
yourself
from
this
mailing
list
send
an
email
to
emailaddr
==================
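The difference between the two split patterns can be reproduced on a single line of text. This sketch uses a helper named clean (an illustrative name) and leaves out stemming so that it runs without NLTK.

```python
import re

line = 'well, it depends on how many visitors'

# Delimiters without whitespace: the words between punctuation marks stay together.
no_space = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%]', line)
# Same delimiters plus whitespace: every word becomes its own piece.
with_space = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]', line)

def clean(pieces):
    # Drop non-alphanumeric characters and discard empty pieces.
    cleaned = [re.sub(r'[^a-zA-Z0-9]+', '', p) for p in pieces]
    return [p for p in cleaned if p]

print(clean(no_space))
print(clean(with_space))
```

Without whitespace among the delimiters, the spaces survive the split and are only removed afterwards, which joins the words into one token.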

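After preprocessing, each email becomes a list of words. The second step is to choose which words to use in the classifier and which to leave out. The course provides a vocabulary list, vocab.txt, containing 1899 frequently used words. In practice, a vocabulary of about 10,000 to 50,000 words is often used.

The task now is to map each word of the preprocessed email to its ID in the provided vocabulary: if the word is in the vocabulary, its index is added to the word-indices variable; if it is not, the word is skipped.

Code for the vocabulary index lookup: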

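def email2vocab_indices(email, vocab):
    '''Return the indices of the vocabulary words that occur in the email'''
    tokens = process_email_2_tokenlist(email) # assumed to return the list of stemmed words, one word per token
    index = [i for i in range(len(vocab)) if vocab[i] in tokens] # collect every matching position, not just the last one
    return index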

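(2) Extracting Features from Emails

Code for extracting features from an email:

def email2feature_vector(email):
    '''Convert an email into a feature vector: 1 where the vocabulary word occurs in the email, 0 otherwise.'''
    df = pd.read_csv('data/vocab.txt', sep=r'\s+', names=['index', 'words']) # one "index word" pair per line, separated by whitespace
    vocab = df['words'].values # 1-D array of vocabulary words
    vector = np.zeros(len(vocab)) # initialize the vector
    vocab_indices = email2vocab_indices(email, vocab) # indices of the vocabulary words present in the email
    for i in vocab_indices: # set the positions of the words that are present to 1
        vector[i] = 1
    return vector

vector = email2feature_vector(email)
print('the feature vector had length {} and {} non-zero entries.'.format(len(vector), int(np.sum(vector))))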

Result:

the feature vector had length 1899 and 45 non-zero entries.
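The mapping from tokens to indices to a binary vector can be followed on a miniature example. The vocabulary and tokens below are hypothetical, and a dictionary is used for the lookup, which is one possible design rather than the approach of the code above.

```python
import numpy as np

# Hypothetical miniature vocabulary; the real vocab.txt has 1899 entries.
vocab = ['anyon', 'cost', 'emailaddr', 'httpaddr', 'number', 'web']
word_to_index = {word: i for i, word in enumerate(vocab)}  # word -> 0-based position

# Hypothetical stemmed tokens from a preprocessed email.
tokens = ['anyon', 'know', 'cost', 'web', 'number', 'number', 'httpaddr']

# Keep the index of every token found in the vocabulary; skip the rest.
indices = [word_to_index[t] for t in tokens if t in word_to_index]

# Binary feature vector: 1 if the word occurs at least once, else 0.
vector = np.zeros(len(vocab))
vector[indices] = 1

print(indices)
print(vector)
```

A repeated word such as "number" yields a repeated index but still a single 1 in the vector, so the count of non-zero entries is the number of distinct vocabulary words in the email.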

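(3) Training SVM for Spam Classification

With feature extraction in place, the SVM can be trained on the 4000 training examples in spamTrain.mat and evaluated on the 1000 test examples in spamTest.mat. Each original email has been processed with the functions from (1) and (2) and converted into a vector $x^{(i)}\in \mathbb{R}^{1899}$. After loading the datasets, $y = 1$ denotes spam and $y = 0$ denotes non-spam, and the SVM can be trained.
The training code is as follows: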

from sklearn import svm

mat_train = loadmat('data/spamTrain.mat')
X_train = mat_train['X']
y_train = mat_train['y'].flatten()
mat_test = loadmat('data/spamTest.mat')
X_test = mat_test['Xtest']
y_test = mat_test['ytest'].flatten()

C = 0.1
clf = svm.SVC(C = C, kernel='linear') # C must be passed by keyword in current scikit-learn
clf.fit(X_train, y_train)
p_train = clf.score(X_train, y_train)
p_test = clf.score(X_test, y_test)
print('Training Accuracy: {:.1%}'.format(p_train))
print('Test Accuracy: {:.1%}'.format(p_test))

Result: the values shown below are the training-set accuracy and the test-set accuracy of the classifier built with the SVM.