Machine Learning in Practice, Part 5: Support Vector Machines (SVM)

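Earlier posts covered a number of supervised learning algorithms. Today's algorithm is the support vector machine (SVM). Compared with logistic regression and neural networks, it can offer a cleaner and more powerful way to learn complex nonlinear functions.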

Support Vector Machines

SVM hypothesis
[Figure: SVM hypothesis (image not included)]
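The original figure is not available here. As an assumption about what it showed, the usual SVM formulation, written with the same symbols used later in this post (C, θ, n features, m training examples), is:

```latex
\min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)}\,\mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right)
  + \left(1 - y^{(i)}\right)\mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right]
  + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}

h_{\theta}(x) =
\begin{cases}
1 & \text{if } \theta^{T} x \ge 0 \\
0 & \text{otherwise}
\end{cases}
```

Here cost_1 and cost_0 denote the hinge-style losses for positive and negative examples. C weights the data-fit term against the regularization term, which is why it behaves like 1/λ.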

Example Dataset 1
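import re  # regular expressions, used in the spam section

import nltk  # provides the Porter stemmer used in the spam section
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy.io import loadmat
from sklearn import svm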

mat = loadmat("ex6data1.mat")
print(mat.keys())
X = mat['X']
y = mat['y']

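def plot_data(X, y):
    # Scatter plot of the two features, colored by class label.
    # No plt.legend() call: the scatter has no labeled artists to show.
    plt.figure(figsize=(6, 4))
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten(), cmap='rainbow')
    plt.xlabel('X1')
    plt.ylabel('X2')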

plot_data(X, y)
plt.show()
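
def plot_boundary(clf, X):
    # Pad the plotting range by 10% of the data range on each side
    x_pad = 0.1 * (X[:, 0].max() - X[:, 0].min())
    y_pad = 0.1 * (X[:, 1].max() - X[:, 1].min())
    x_min, x_max = X[:, 0].min() - x_pad, X[:, 0].max() + x_pad
    y_min, y_max = X[:, 1].min() - y_pad, X[:, 1].max() + y_pad
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    # Predict a label for every grid point and draw the class contour
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)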

C_values = [1, 100]
# C must be passed by keyword in current scikit-learn versions
models = [svm.SVC(C=C, kernel='linear') for C in C_values]
clfs = [model.fit(X, y.ravel()) for model in models]
titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in C_values]
for model, title in zip(clfs, titles):
    plot_data(X, y)  # plot_data opens its own figure
    plot_boundary(model, X)
    plt.title(title)
    plt.show()

SVM with Gaussian Kernels
Gaussian Kernel
def gauss_kernel(x1, x2, sigma):
    return np.exp(- ((x1 - x2) ** 2).sum() / (2 * sigma ** 2))
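
As a quick sanity check of the kernel, the sketch below evaluates it on two small vectors and compares the result with scikit-learn's RBF kernel. The test vectors and sigma are illustrative values chosen here, not taken from the datasets. It also shows why gamma = 1 / (2 * sigma^2) is the right conversion when passing sigma to `svm.SVC`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def gauss_kernel(x1, x2, sigma):
    # Same definition as in the post: exp(-||x1 - x2||^2 / (2 * sigma^2))
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))


# Illustrative inputs (assumed values, not from the exercise data files)
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2.0

# Squared distance is 1 + 4 + 4 = 9, so the kernel value is exp(-9/8)
k_manual = gauss_kernel(x1, x2, sigma)

# scikit-learn's RBF kernel is exp(-gamma * ||x1 - x2||^2)
gamma = 1.0 / (2 * sigma ** 2)
k_sklearn = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```

Both values should agree up to floating-point error, at roughly 0.3247.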
Example Dataset 2
mat = loadmat('ex6data2.mat')
X2 = mat['X']
y2 = mat['y']
plot_data(X2, y2)
plt.show()

# sklearn's RBF kernel is exp(-gamma * ||x1 - x2||^2), so gamma = 1 / (2 * sigma^2)
sigma = 0.1
gamma = np.power(sigma, -2.) / 2
clf = svm.SVC(C=1, kernel='rbf', gamma=gamma)
model = clf.fit(X2, y2.flatten())
plot_data(X2, y2)
plot_boundary(model, X2)
plt.show()

Example Dataset 3
mat3 = loadmat('ex6data3.mat')
X3, y3 = mat3['X'], mat3['y']
Xval, yval = mat3['Xval'], mat3['yval']
plot_data(X3, y3)
plt.show()
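
Dataset 3 comes with a validation set (Xval, yval), which suggests choosing C and sigma by validation accuracy. The sketch below is one way to do that, not necessarily the author's. The helper name `search_best_params`, the candidate grid, and the synthetic stand-in data are all assumptions made so the example runs without `ex6data3.mat`.

```python
import numpy as np
from sklearn import svm


def search_best_params(X, y, Xval, yval, candidates):
    """Try every (C, sigma) pair and keep the one with the best validation accuracy."""
    best_score, best_C, best_sigma = -1.0, None, None
    for C in candidates:
        for sigma in candidates:
            gamma = 1.0 / (2 * sigma ** 2)  # convert sigma to sklearn's gamma
            clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            clf.fit(X, y.ravel())
            score = clf.score(Xval, yval.ravel())
            if score > best_score:
                best_score, best_C, best_sigma = score, C, sigma
    return best_score, best_C, best_sigma


# Synthetic stand-in data (assumption): points inside the unit circle are class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (np.hypot(X_demo[:, 0], X_demo[:, 1]) < 1.0).astype(int)
Xval_demo = rng.normal(size=(100, 2))
yval_demo = (np.hypot(Xval_demo[:, 0], Xval_demo[:, 1]) < 1.0).astype(int)

# Hypothetical candidate grid, roughly log-spaced
candidates = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
best_score, best_C, best_sigma = search_best_params(
    X_demo, y_demo, Xval_demo, yval_demo, candidates)
print(best_score, best_C, best_sigma)
```

With the real data, the same call would be `search_best_params(X3, y3, Xval, yval, candidates)`, followed by refitting `svm.SVC` with the selected values and drawing the result with `plot_boundary`.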

Spam Classification

Preprocessing Emails
with open('emailSample1.txt', 'r') as f:
    email = f.read()
    print(email)
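# Apply every preprocessing step except word stemming and removal of non-words
def process_email(email):
    email = email.lower()
    # Strip HTML tags: '<', then one or more characters that are neither '<' nor '>', then '>'
    email = re.sub(r'<[^<>]+>', ' ', email)
    # Replace URLs: match everything after '//' up to the next whitespace character
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)
    # Replace email addresses
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)
    # Replace dollar signs
    email = re.sub(r'[\$]+', 'dollar', email)
    # Replace runs of digits
    email = re.sub(r'[\d]+', 'number', email)
    return email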
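
To see what the normalization does, here is a small self-contained sketch that applies the same substitutions to a made-up sentence. The sample text is an assumption for illustration, and the function is repeated so the block runs on its own.

```python
import re


def process_email(email):
    # Same normalization steps as above, repeated so this block is self-contained
    email = email.lower()
    email = re.sub(r'<[^<>]+>', ' ', email)                      # strip HTML tags
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)  # URLs
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)         # email addresses
    email = re.sub(r'[\$]+', 'dollar', email)                    # dollar signs
    email = re.sub(r'[\d]+', 'number', email)                    # digit runs
    return email


# Made-up sample text (assumption, not from emailSample1.txt)
sample = 'Visit <b>us</b> at http://example.com or mail me@example.com, only $10!'
words = process_email(sample).split()
print(words)
```

Note that the address pattern consumes the trailing comma, because it matches any run of non-whitespace characters around the `@`.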
# Preprocess an email and return a clean list of stemmed words
def email2TokenList(email):
    # I'll use the NLTK stemmer because it more accurately duplicates the
    # performance of the OCTAVE implementation in the assignment
    stemmer = nltk.stem.porter.PorterStemmer()

    email = process_email(email)

    # Split the email into individual words; re.split() accepts several delimiters.
    # Whitespace (\s) is included so that words on adjacent lines are not merged.
    tokens = re.split(r'[\s\@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)

    # Walk through each piece produced by the split
    tokenlist = []
    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        # Skip empty strings
        if not len(token): continue
        # Use the Porter stemmer to extract the word stem
        stemmed = stemmer.stem(token)
        tokenlist.append(stemmed)

    return tokenlist
Vocabulary List
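# Return the indices of the vocabulary words that occur in the email
def email2VocabIndices(email, vocab):
    tokens = set(email2TokenList(email))  # a set gives fast membership tests
    # np.ravel accepts both a flat vocabulary and an (n, 1) column
    index = [i for i, word in enumerate(np.ravel(vocab)) if word in tokens]
    return index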
Extracting Features from Emails
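# Convert an email into a feature vector of length n, where n is the vocabulary size.
# Positions of words that occur in the email are set to 1; all others stay 0.
def email2FeatureVector(email):
    df = pd.read_table('data/vocab.txt', names=['words'])
    vocab = df['words'].to_numpy()  # as_matrix() was removed in pandas 1.0
    vector = np.zeros(len(vocab))  # initialize the vector
    vocab_indices = email2VocabIndices(email, vocab)  # indices of words found in the email
    # Set the entries of the words that are present to 1
    for i in vocab_indices:
        vector[i] = 1
    return vector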
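
The mapping from tokens to a binary vector can be illustrated without the vocabulary file or the stemmer. In the sketch below, the tiny vocabulary, the token list, and the helper name `tokens_to_feature_vector` are all hypothetical.

```python
import numpy as np


def tokens_to_feature_vector(tokens, vocab):
    # Map each vocabulary word to its position for constant-time lookup
    index_of = {word: i for i, word in enumerate(vocab)}
    vector = np.zeros(len(vocab))
    for token in tokens:
        i = index_of.get(token)
        if i is not None:
            vector[i] = 1  # presence only; repeated words do not add up
    return vector


# Hypothetical vocabulary and already-stemmed tokens
vocab_demo = ['anyon', 'know', 'number', 'httpaddr', 'emailaddr', 'dollar']
tokens_demo = ['anyon', 'know', 'how', 'much', 'it', 'cost', 'number', 'number']

vec = tokens_to_feature_vector(tokens_demo, vocab_demo)
print(vec)  # [1. 1. 1. 0. 0. 0.]
```

Words outside the vocabulary ('how', 'much', 'it', 'cost') are ignored, and 'number' appearing twice still yields a single 1.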
Training SVM for Spam Classification
vector = email2FeatureVector(email)
print('length of vector = {}\nnum of non-zero = {}'.format(len(vector), int(vector.sum())))

# 2.3 Training SVM for Spam Classification
# Training set
mat1 = loadmat('spamTrain.mat')
X, y = mat1['X'], mat1['y']

# Test set
mat2 = loadmat('spamTest.mat')
Xtest, ytest = mat2['Xtest'], mat2['ytest']

clf = svm.SVC(C=0.1, kernel='linear')
clf.fit(X, y.ravel())  # ravel() turns the (m, 1) label column into a 1-D array

Top Predictors for Spam
# Accuracy on the training set and on the test set
predTrain = clf.score(X, y.ravel())
predTest = clf.score(Xtest, ytest.ravel())
print(predTrain, predTest)
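
For a linear kernel, the learned weights indicate which words push an email most strongly toward the spam class. Here is a minimal sketch of reading them from `clf.coef_`. The helper name `top_predictors` and the synthetic data are assumptions made so the block runs without the `.mat` files.

```python
import numpy as np
from sklearn import svm


def top_predictors(clf, vocab, k=5):
    # coef_ is available for linear kernels; larger weights favor the positive class
    weights = clf.coef_.ravel()
    order = np.argsort(weights)[::-1][:k]
    return [(vocab[i], float(weights[i])) for i in order]


# Synthetic stand-in (assumption): the label simply copies feature 3
rng = np.random.default_rng(0)
vocab_demo = ['w{}'.format(i) for i in range(10)]
X_demo = rng.integers(0, 2, size=(200, 10))
y_demo = X_demo[:, 3]

clf_demo = svm.SVC(C=0.1, kernel='linear').fit(X_demo, y_demo)
top = top_predictors(clf_demo, vocab_demo, k=3)
print(top)
```

With the spam classifier trained above, the equivalent call would pass `clf` together with the word list read from the vocabulary file.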

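How the parameters affect the algorithm:
C = 1/λ
Large C: low bias, high variance (corresponds to small λ)
Small C: high bias, low variance (corresponds to large λ)
Large σ^2: the kernel is wider and smoother, high bias, low variance
Small σ^2: the kernel is narrower and more concentrated, low bias, high variance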
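
The effect of σ can be observed directly by comparing training and validation accuracy for a few values. The sketch below uses scikit-learn's `make_moons` toy data as a stand-in. The dataset, split sizes, and sigma values are assumptions chosen for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_moons

# Toy two-class data (assumption): noisy interleaving half-moons
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, y_train = X_demo[:200], y_demo[:200]
X_val, y_val = X_demo[200:], y_demo[200:]

results = {}
for sigma in [0.05, 0.5, 5.0]:
    gamma = 1.0 / (2 * sigma ** 2)  # small sigma means large gamma
    clf = svm.SVC(C=1, kernel='rbf', gamma=gamma).fit(X_train, y_train)
    results[sigma] = (clf.score(X_train, y_train), clf.score(X_val, y_val))
    print('sigma = {}: train = {:.3f}, val = {:.3f}'.format(sigma, *results[sigma]))
```

A very small sigma tends to fit the training set almost perfectly (high variance), while a very large sigma gives a much smoother boundary that fits the training set less closely (high bias).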

Steps for using an SVM:

Use an SVM software library to solve for the parameters θ.

Need to specify:

  1. choice of parameter C
  2. choice of kernel (similarity function):
    e.g. no kernel ("linear kernel")
    Gaussian kernel
    need to choose σ^2

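Logistic regression vs. SVM
n is the number of features and m is the number of training examples.
(1) If n is much larger than m, i.e. the training set is too small to support a complex nonlinear model, use logistic regression or an SVM without a kernel.
(2) If n is small and m is intermediate, for example n between 1 and 1000 and m between 10 and 10000, use an SVM with a Gaussian kernel.
(3) If n is small and m is large, for example n between 1 and 1000 and m greater than 50000, an SVM with a Gaussian kernel will be very slow. The remedy is to create or add more features, then use logistic regression or an SVM without a kernel.

It is worth noting that a neural network is likely to perform well in all three cases, but it may be very slow to train. The main reason to choose an SVM is that its cost function is convex, so every local minimum is a global minimum.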
