Support Vector Machines

Support Vector Machines (Review)

Linearly Separable Case

Training data with labels: $(X, y)$, $y \in \{+1, -1\}$.
Goal: find a hyperplane $W^TX+b=0$ that separates the $+1$ and $-1$ classes.

Training set: $\lbrace (x_i,y_i)\rbrace_{i=1,...,N}$.
$\exists (W,b)$ such that $\forall i=1,...,N$:

  1. $y_i=+1 \Rightarrow W^Tx_i+b \ge 0$
  2. $y_i=-1 \Rightarrow W^Tx_i+b < 0$

Both cases combine into $y_i(W^Tx_i+b)\ge0$.

Optimization problem: maximize the margin, which amounts to minimizing $\Vert W \Vert$.
Subject to: $y_i(W^Tx_i+b)\ge0, \quad i=1,...,N$

Fact 1: $W^TX+b=0$ and $aW^TX+ab=0$ describe the same hyperplane, for any nonzero real $a$.
Fact 2: the distance from a point $(x_0,y_0)$ to the line $w_1x+w_2y+b=0$ is
$$d=\frac {\vert w_1x_0+w_2y_0+b\vert}{\sqrt{w_1^2+w_2^2}}$$
Generalization: the distance from a vector $X_0$ to the hyperplane $W^TX+b=0$ is
$$d=\frac{\vert W^TX_0+b \vert}{\Vert W \Vert}$$
where $\Vert W \Vert=\sqrt{w_1^2+w_2^2+...+w_n^2}$ in $n$ dimensions.
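A quick numerical check of the distance formula (a minimal sketch; the hyperplane and point below are made up for illustration):

import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and a query point x0
w = np.array([3.0, 4.0])   # normal vector of the hyperplane
b = -5.0
x0 = np.array([1.0, 2.0])

# d = |w^T x0 + b| / ||w||
d = abs(w @ x0 + b) / np.linalg.norm(w)
print(d)  # |3*1 + 4*2 - 5| / 5 = 6/5 = 1.2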

We can rescale with such an $a$:
$$(W,b)\rightarrow (aW=W',ab=b')$$
so that on the support vectors $X_0$ (the training points closest to the hyperplane)
$$\vert W'^TX_0+b' \vert=1$$
The distance from a support vector to the hyperplane is then
$$d=\frac 1{\Vert W \Vert}$$

That is, under the linear model:

$$X\in \mathbb{R}^D,\quad y\in\{-1,+1\}, \qquad f(x;W,b)=W^Tx+b, \quad \Vert W \Vert \ \text{minimized}$$

Linearly Non-separable Case

Direct approach: the soft margin

Modified optimization problem:
Minimize: $\frac 1 2 \Vert W \Vert^2+C \sum_{i=1}^{N}{\xi_i}$

Subject to:
$$y_i(W^Tx_i+b)\ge1-\xi_i, \qquad \xi_i\ge0 \ (\text{slack variables}), \quad i=1,...,N$$
The $N$ slack variables $\xi_i$ relax each constraint, widening "$\ge 0$" to "$\ge a$" for an $a$ that can be made small enough; the $C$-weighted penalty keeps the $\xi_i$ from growing arbitrarily large. $C$ is set in advance and acts as a regularization term.
Parameters to solve for: $W, b, \xi_i$, where each $\xi_i$ corresponds to one training sample.

We are still looking for a single separating line or plane.

Intuition: the algorithm finds a line that classifies most points correctly; the $\xi_i$ exist only so that such a line can be found at all.
At test time the rule is still $y_j(W^Tx_j+b)\ge0$ (a test sample has no associated $\xi_j$).
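A minimal sketch of the role of $C$, using sklearn on synthetic data (the dataset and values are made up for illustration):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: not linearly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1}

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Small C tolerates larger slacks (wide margin, many violations);
    # large C penalizes slack heavily (narrow margin, few violations).
    print(C, clf.n_support_, clf.score(X, y))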

Reducing to the linear case

Idea: transform the linearly non-separable problem into a linearly separable one.

  1. Raise the dimensionality of the data X: the higher the dimension, the more likely linear separability becomes (that is, with enough features the data can always be separated).

$$X_i \rightarrow \Phi(X_i) \rightarrow y_i(W^T \Phi(X_i)+b) \ge 0$$

Example: the XOR problem (verified in code below)
$$\begin{aligned} \text{data:} \quad &x_1=(0,0),\ y_1=-1 \qquad x_4=(1,1),\ y_4=-1\\ &x_2=(0,1),\ y_2=+1 \qquad x_3=(1,0),\ y_3=+1\\ &\text{not separable in the 2-D plane: no } (W,b) \text{ satisfies } y_i(W^Tx_i+b)\ge0\\ \text{lifting to a higher dimension:} \quad &\text{for } x=(a,b), \text{ take } \Phi(x)=(a^2,b^2,a,b,ab)\\ &\text{one workable choice: } w=(1,1,1,1,-6),\ b=-1 \text{ makes } y_i(w^T\Phi(x_i)+b)\ge0 \text{ hold} \end{aligned}$$
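Checking the lifted XOR solution numerically (a short sketch; phi and the weights follow the text above):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, +1, +1, -1])

def phi(x):
    a, b = x
    return np.array([a**2, b**2, a, b, a*b])

w = np.array([1.0, 1.0, 1.0, 1.0, -6.0])
b = -1.0

for xi, yi in zip(X, y):
    print(xi, yi, yi * (w @ phi(xi) + b) >= 0)  # True for all four points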

Applying the direct (soft-margin) approach from above:

$$\begin{aligned} \text{Minimize:} \quad &\frac 1 2 \Vert W \Vert^2+C \sum_{i=1}^{N}{\xi_i}\\ \text{Subject to:} \quad &y_i(W^T\Phi(x_i)+b)\ge1-\xi_i ,\quad \xi_i\ge0 ,\quad i=1,...,N \end{aligned}$$
where $\Phi(x)$ may be infinite-dimensional.

An infinite-dimensional $\Phi$ makes the problem unsolvable as written.
Introduce a kernel function:
as long as we know $K(x_1,x_2)=\Phi(x_1)^T\Phi(x_2)$, the problem above remains solvable.

$K(X_1,X_2)$ can be written as $\Phi(X_1)^T\Phi(X_2)$ for some $\Phi$ if and only if:
$$\begin{aligned} &1.\ K(X_1,X_2)=K(X_2,X_1) \quad \text{(symmetry)}\\ &2.\ \forall C_i,X_i\ (i=1,...,N): \ \sum_{i=1}^{N}{\sum_{j=1}^{N}{C_iC_jK(X_i,X_j)}}\ge0 \quad \text{(positive semi-definiteness)} \end{aligned}$$

Common kernels: the Gaussian (RBF) kernel and the polynomial kernel.
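A minimal numerical check of the two conditions above for the Gaussian kernel $K(x_1,x_2)=e^{-\gamma\Vert x_1-x_2\Vert^2}$ (random data; gamma chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
gamma = 0.5

# Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

print(np.allclose(K, K.T))                   # condition 1: symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # condition 2: positive semi-definite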

Primal and Dual Problems
Goal: solve the optimization problem using only the kernel function, never $\Phi(X)$ itself¹.

Primal Problem
Minimize: $f(w)$
Subject to:
$$g_i(w)\le0\quad(i=1,..,k), \qquad h_i(w)=0\quad(i=1,..,m)$$

Dual Problem
Define:
$$\begin{aligned} L(w,\alpha,\beta)&=f(w)+\sum_{i=1}^{k}{\alpha_ig_i(w)}+\sum_{i=1}^{m}{\beta_ih_i(w)}\\ &=f(w)+\alpha^Tg(w)+\beta^Th(w) \\ \text{Maximize:} \quad &\theta(\alpha,\beta)=\inf_{\text{all } w}{L(w,\alpha,\beta)}\\ \text{Subject to:} \quad & \alpha_i \ge 0, \quad i=1,...,k \end{aligned}$$

Relation between the primal and the dual
If $w^*$ solves the primal problem and $\alpha^*,\beta^*$ solves the dual problem, then
$$f(w^*)\ge \theta(\alpha^*,\beta^*)$$

Proof:
$$\begin{aligned} \theta(\alpha^*,\beta^*)&=\inf_{\text{all } w}{L(w,\alpha^*,\beta^*)}\\ &\le L(w^*,\alpha^*,\beta^*)\\ &=f(w^*)+\sum_{i=1}^{k}{\alpha_i^*g_i(w^*)} + \sum_{i=1}^{m}{\beta_i^*h_i(w^*)} && \big(g_i(w^*)\le0,\ h_i(w^*)=0,\ \alpha^*\ge0\big)\\ &\le f(w^*) \end{aligned}$$
Equality holds throughout only if $w^*$ minimizes $L(w,\alpha^*,\beta^*)$ over $w$, and if for every $i=1,..,k$ either $\alpha_i^*=0$ or $g_i(w^*)=0$; the latter is known as the KKT (complementary slackness) condition.

Definition:
Let $w^*$ solve the primal problem and $\alpha^*,\beta^*$ the dual problem. Then
$$G=f(w^*)-\theta(\alpha^*,\beta^*)\ge0$$
$G$ is called the duality gap between the primal and the dual problem.
For certain problems (strong duality), $G=0$.

Strong duality (the case of practical interest):
if $f(w)$ is convex and $g(w)=aw+b$ and $h(w)=cw+d$ are linear (affine) functions,
then the duality gap is $G=0$, i.e. $f(w^*)=\theta(\alpha^*,\beta^*)$.
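A one-variable worked example of the gap closing (not from the original; chosen for simplicity):
$$\begin{aligned} &\text{Minimize } f(w)=w^2 \text{ subject to } g(w)=1-w\le 0. \quad \text{Primal optimum: } w^*=1,\ f(w^*)=1.\\ &L(w,\alpha)=w^2+\alpha(1-w); \quad \frac{\partial L}{\partial w}=2w-\alpha=0 \implies w=\tfrac{\alpha}{2}\\ &\theta(\alpha)=\inf_w L(w,\alpha)=\alpha-\tfrac{\alpha^2}{4}, \quad \text{maximized at } \alpha^*=2: \ \theta(\alpha^*)=1=f(w^*)\\ &f \text{ is convex and } g \text{ is affine, so } G=0; \text{ note } \alpha^*>0 \text{ and } g(w^*)=0 \text{ (KKT).} \end{aligned}$$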

Converting the non-separable (kernelized) convex SVM problem into its dual

$$\begin{aligned} \text{Minimize:} \quad &f(W,\xi)=\frac 1 2 \Vert W \Vert^2+C \sum_{i=1}^{N}{\xi_i} \qquad (C \text{ is a preset regularization constant})\\ \text{Subject to:} \quad &y_i(W^T\Phi(x_i)+b)\ge1-\xi_i ,\\ &\xi_i\ge0 ,\quad i=1,...,N \end{aligned}$$
where $\Phi(x_i)$ may be infinite-dimensional.

Conversion

Rewritten primal problem (in the standard form $g_i(w)\le0$):
$$\begin{aligned} \text{Minimize:} \quad &f(w)=\frac 1 2 \Vert W \Vert^2-C \sum_{i=1}^{N}{\xi'_i}\\ \text{Subject to:} \quad &\xi'_i\le0\\ &1+\xi'_i-y_i(W^T\Phi(x_i)+b)\le0 \end{aligned}$$
Here $\xi'_i=-\xi_i$; below we simply write $\xi_i$ again.
Dual problem:
$$\begin{aligned} \text{Maximize:} \quad \theta(\alpha,\beta)&=\inf_{\text{all } W,\xi,b}{L(W,\xi,b)}\\ L(W,\xi,b)&=\frac 1 2\Vert W\Vert^2-C\sum_{i=1}^{N}{\xi_i}\\ &\quad+ \sum_{i=1}^{N}{\alpha_i\big(1+\xi_i-y_i(W^T\Phi(x_i)+b) \big)}\\ &\quad+ \sum_{i=1}^{N}{\beta_i\xi_i}\\ \text{Subject to:} \quad &\forall i=1,...,N: \ \alpha_i\ge0, \ \beta_i\ge0 \end{aligned}$$

Conditions from strong duality:
$L(W^*,\xi^*,b^*)$ attains the minimum over $(W,\xi,b)$, i.e. the partial derivatives vanish.

KKT conditions:
$$\begin{aligned} & \alpha_i=0 \ \text{ or } \ 1+\xi_i-y_i(W^T\Phi(x_i)+b)=0,\\ & \text{and} \quad \beta_i=0 \ \text{ or } \ \xi_i=0 \end{aligned}$$

Computation:
$$\begin{aligned} \frac {\partial L}{\partial W} =0 &\implies W=\sum_{i=1}^{N}{\alpha_iy_i\Phi(x_i)}\\ \frac {\partial L}{\partial \xi_i}=0 &\implies \alpha_i+\beta_i=C\\ \frac {\partial L}{\partial b}=0 &\implies \sum_{i=1}^{N}{\alpha_iy_i}=0 \end{aligned}$$

Substituting these back into $L$:
$$\begin{aligned} L(W,\xi,b)_{\min}&=\frac 1 2\Vert W\Vert^2-C\sum_{i=1}^{N}{\xi_i} + \sum_{i=1}^{N}{\alpha_i\big(1+\xi_i-y_i(W^T\Phi(x_i)+b) \big)} + \sum_{i=1}^{N}{\beta_i\xi_i}\\ &=\frac 1 2 \Vert W\Vert^2-C\sum_{i=1}^{N}{\xi_i}+\sum_{i=1}^{N}{\xi_i(\alpha_i+\beta_i)}+\sum_{i=1}^{N}{\alpha_i}-\sum_{i=1}^{N}{\alpha_iy_iW^T\Phi(x_i)}-\sum_{i=1}^{N}{\alpha_iy_i b}\\ &=\frac 1 2 \Vert W\Vert^2+\sum_{i=1}^{N}{\alpha_i}-\sum_{i=1}^{N}{\alpha_iy_iW^T\Phi(x_i)} \qquad \big(\text{using } \alpha_i+\beta_i=C \text{ and } \textstyle\sum_i\alpha_iy_i=0\big) \end{aligned}$$

Evaluating the two remaining quadratic pieces separately:
$$\begin{aligned} \frac 1 2 \Vert W\Vert^2&=\frac 1 2 W^TW =\frac 1 2\Big(\sum_{i=1}^{N}{\alpha_iy_i\Phi(x_i)}\Big)^T \Big(\sum_{j=1}^{N}{\alpha_jy_j\Phi(x_j)}\Big)\\ &=\frac 1 2\sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,\Phi(x_i)^T\Phi(x_j)}} \qquad \big(\Phi(x_i)^T\Phi(x_j)=K(x_i,x_j)\big)\\ -\sum_{i=1}^{N}{\alpha_iy_iW^T\Phi(x_i)}&=-\sum_{i=1}^{N}{\alpha_iy_i\Big(\sum_{j=1}^{N}{\alpha_jy_j \Phi(x_j)^T}\Big)\Phi(x_i)}\\ &=-\sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}} \end{aligned}$$

Hence:
$$L(W,\xi,b)_{\min}=\sum_{i=1}^{N}{\alpha_i}-\frac 12 \sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}}$$
in which only the $\alpha_i, \alpha_j$ are unknown.

Final objective ($\beta$ has been eliminated). Maximize:
$$\theta(\alpha)=\sum_{i=1}^{N}{\alpha_i}-\frac 12 \sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}}$$
Subject to (original constraints plus the derivative results):
$$0\le \alpha_i\le C \quad \forall i=1,...,N, \qquad \sum_{i=1}^{N}{\alpha_iy_i}=0$$

This is a convex optimization problem; the SMO algorithm solves it for $\alpha$.
Solving the dual yields $\alpha$; back in the primal we still need $W$ and $b$.
$$W=\sum_{i=1}^{N}{\alpha_iy_i\Phi(x_i)} \quad \text{cannot be evaluated, since } \Phi(x_i) \text{ is not available.}$$

But for classification we never need $W$ explicitly: at test time, for a sample $(x,y)$, we only compute $\hat y=W^T\Phi(x)+b$ and compare its sign against $y$.

$$W^T\Phi(x)=\sum_{i=1}^{N}{\alpha_iy_i\,\Phi(x_i)^T\Phi(x)}=\sum_{i=1}^{N}{\alpha_iy_iK(x_i,x)}$$

What about $b$? Use the KKT conditions:
$$\alpha_i=0 \ \text{ or } \ 1+\xi_i-y_i(W^T\Phi(x_i)+b)=0, \qquad \text{and} \quad \beta_i=0 \ \text{ or } \ \xi_i=0$$

Solving for $b$:
$$\begin{aligned} &\text{From the } \alpha \text{ found by SMO, pick one } \alpha_i \text{ with } 0 < \alpha_i < C.\\ &\text{Then } \beta_i=C-\alpha_i \not =0, \text{ so } \xi_i=0;\\ &\text{and since } \alpha_i \not =0, \text{ also } 1+\xi_i-y_i(W^T\Phi(x_i)+b)=0, \text{ i.e. } 1-y_i(W^T\Phi(x_i)+b)=0;\\ & b=\frac {1-y_iW^T\Phi(x_i)}{y_i}=\frac{1-y_i\sum_{j=1}^{N}{\alpha_jy_jK(x_j,x_i)}}{y_i} \end{aligned}$$
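These formulas can be checked against sklearn, which exposes $\alpha_i y_i$ (support vectors only) as dual_coef_ and $b$ as intercept_; a minimal sketch on synthetic data:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = 2 * y - 1
clf = SVC(kernel='rbf', gamma=0.1, C=1.0).fit(X, y)

def rbf(A, B, gamma=0.1):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

# W^T Phi(x) + b = sum_i alpha_i y_i K(x_i, x) + b, summed over support vectors
K = rbf(clf.support_vectors_, X)
manual = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(X)))  # True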

the end!

Summary

Training set: $\lbrace x_i,y_i \rbrace, \ i=1,...,N$

The optimization problem:

Choose a feature map $\Phi(x)$ and solve for $W,\xi_i,b$:
$$\begin{aligned} \text{Minimize:} \quad &f(w)=\frac 1 2 \Vert W \Vert^2-C \sum_{i=1}^{N}{\xi_i}\\ \text{Subject to:} \quad &\xi_i\le0\\ &1+\xi_i-y_i(W^T\Phi(x_i)+b)\le0 \end{aligned}$$
After conversion:
solve for $\alpha_i$ (a convex optimization problem; use the SMO algorithm).
$$\begin{aligned} \text{Maximize:} \quad &\theta(\alpha)=\sum_{i=1}^{N}{\alpha_i}-\frac 12 \sum_{i=1}^{N}{\sum_{j=1}^{N}{\alpha_i\alpha_jy_iy_j \,K(x_i,x_j)}}\\ \text{Subject to:} \quad &0 \le \alpha_i \le C, \quad i=1,...,N; \qquad \sum_{i=1}^{N}{\alpha_iy_i}=0 \end{aligned}$$

At test time, for a test sample $(x,y)$,
compute $W^T\Phi(x)$ and $b$:
$$W^T\Phi(x)=\sum_{i=1}^{N}{\alpha_iy_iK(x_i,x)}$$
From the training set, take one $\alpha_i$ with $0<\alpha_i<C$ and compute
$$b=\frac{1-y_i\sum_{j=1}^{N}{\alpha_jy_jK(x_j,x_i)}}{y_i}$$

If $(W^T\Phi(x)+b)\,y\ge0$ the prediction is correct; otherwise it is an error.
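The whole recipe fits in a few lines for a toy problem. A self-contained sketch solving the dual above with scipy's SLSQP on four points and a linear kernel (illustrative only; real implementations use SMO):

import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [1., 1.], [2., 0.], [3., 1.]])
y = np.array([-1., -1., 1., 1.])
C = 10.0
K = X @ X.T  # linear kernel: K(x_i, x_j) = x_i^T x_j

def neg_theta(a):
    # minimize -theta(alpha) = 1/2 sum_ij a_i a_j y_i y_j K_ij - sum_i a_i
    return 0.5 * (a * y) @ K @ (a * y) - a.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)  # sum_i alpha_i y_i = 0
bounds = [(0, C)] * len(y)                        # 0 <= alpha_i <= C
res = minimize(neg_theta, np.zeros(len(y)), bounds=bounds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                                  # W = sum_i alpha_i y_i x_i (explicit here, Phi = identity)
sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))  # a support vector with 0 < alpha < C
b = (1 - y[sv] * (w @ X[sv])) / y[sv]                # b from the KKT condition
print(w, b, np.sign(X @ w + b) == y)                 # predictions match all labels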

Code: the chess-endgame (krkopt) example

import pandas as pd
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,roc_curve,auc
import seaborn as sns
import matplotlib.pyplot as plt
"""
前期数据整理
"""

# Load the dataset
data = pd.read_csv('krkopt.data', header=None)
data.dropna(inplace=True)  # drop rows with missing values
# Encode the samples numerically
for i in [0, 2, 4]:
    """
    Replace the board-file letters a,b,c,d,e,f,g,h with the numbers 1-8
    """
    data.loc[data[i] == 'a', i] = 1
    data.loc[data[i] == 'b', i] = 2
    data.loc[data[i] == 'c', i] = 3
    data.loc[data[i] == 'd', i] = 4
    data.loc[data[i] == 'e', i] = 5
    data.loc[data[i] == 'f', i] = 6
    data.loc[data[i] == 'g', i] = 7
    data.loc[data[i] == 'h', i] = 8

# Binarize the label: 'draw' -> +1, every other outcome -> -1
data.loc[data[6] != 'draw', 6] = -1
data.loc[data[6] == 'draw', 6] = 1

for i in range(6):
    data[i] = (data[i] - data[i].mean()) / data[i].std()  # standardize each feature to zero mean, unit variance

"""
模型建立
"""

# Split into training and test sets.
# The first six columns are features, the seventh is the label.
# Note: test_size ≈ 0.82, so only ~18% of the data (about 5000 samples) is used for training.
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :6], data[6].astype("int").values,
                                                    test_size=0.82178500142572)


# In SVC, larger C drives the training error down but overfits easily;
# smaller C allows more misclassified training points (a softer margin).
# Larger gamma shrinks each sample's radius of influence and makes the model more complex,
# so increase gamma when underfitting; smaller gamma is more tolerant and suits the overfitting case.

# Coarse search ranges for C and gamma
CScale = [i for i in range(100, 201, 10)]
gammaScale = [i / 10 for i in range(1, 11)]
cv_scores = 0
savei = 0
savej = 0


# Coarse grid search over the parameters
for i in CScale:
    for j in gammaScale:
        model = SVC(kernel='rbf', C=i, gamma=j)  # RBF-kernel SVM: regularization C, kernel width gamma
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')  # 5-fold cross-validation
        # `scores` holds the estimator's score for each run of the cross validation.
        if scores.mean() > cv_scores:
            cv_scores = scores.mean()
            savei = i
            savej = j * 100  # store gamma*100 so the fine grid below can use integer offsets

# Refine C and gamma around the best coarse values
CScale = [i for i in range(savei - 5, savei + 5)]
gammaScale = [i / 100 + 0.01 for i in range(int(savej) - 5, int(savej) + 5)]
cv_scores = 0
for i in CScale:
    for j in gammaScale:
        model = SVC(kernel='rbf', C=i, gamma=j)
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
        if scores.mean() > cv_scores:
            cv_scores = scores.mean()
            savei = i
            savej = j

# Rebuild the SVM with the selected parameters and evaluate it
model = SVC(kernel='rbf', C=savei, gamma=savej)
model.fit(X_train, y_train)
pre = model.predict(X_test)
print(model.score(X_test, y_test))  # test-set accuracy



"""
绘图:简单评估
"""
# 绘制AUC和EER图形
cm = confusion_matrix(y_test, pre, labels=[-1, 1], sample_weight=None)
sns.set()
f, ax = plt.subplots()
sns.heatmap(cm, annot=True, ax=ax)  # confusion-matrix heat map
ax.set_title('confusion matrix')
ax.set_xlabel('predict')
ax.set_ylabel('true')
# Use the continuous decision scores rather than the hard labels,
# so the ROC curve has more than a single interior point.
fpr, tpr, threshold = roc_curve(y_test, model.decision_function(X_test))  # false/true positive rates
roc_auc = auc(fpr, tpr)  # AUC: area under the ROC curve; larger is better
lw = 2
plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on x, TPR on y
plt.plot([0, 1], [1, 0], color='navy', lw=lw, linestyle='--')  # anti-diagonal, where FPR = 1 - TPR (the EER line)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
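The two hand-rolled parameter loops above do exactly what sklearn's GridSearchCV does; a sketch of the equivalent call (same coarse grid, parallelized with n_jobs):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': list(range(100, 201, 10)), 'gamma': [i / 10 for i in range(1, 11)]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
model = search.best_estimator_  # automatically refit on the whole training set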

Brief analysis

The support vector machine classifiers

class SVC(BaseSVC):
    """C-Support Vector Classification.  The implementation is based on libsvm.
    The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples.

    For large datasets
    consider using :class:`~sklearn.svm.LinearSVC` or
    :class:`~sklearn.linear_model.SGDClassifier` instead, possibly after a
    :class:`~sklearn.kernel_approximation.Nystroem` transformer.

    Parameters
    ----------
    C : float, default=1.0.  Regularization parameter; must be positive.
        Larger C fits the training set more tightly (watch for overfitting).
    kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
        Linear, polynomial, Gaussian radial-basis-function, or sigmoid kernel.

    degree : int, default=3.  Degree of the polynomial kernel function ('poly').

    gamma : Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
    """
    pass

class LinearSVC(LinearClassifierMixin,SparseCoefMixin,BaseEstimator):
    """Linear Support Vector Classification.

    Similar to SVC with parameter kernel='linear', ... , so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

    Parameters
    ----------
    penalty : {'l1', 'l2'}, default='l2' Specifies the norm used in the penalization.

    loss : {'hinge', 'squared_hinge'}, default='squared_hinge'.  The (hinge) loss function.
    The combination of ``penalty='l1'`` and ``loss='hinge'`` is not supported.   
    tol : float, default=1e-4  Tolerance for stopping criteria.
    ...

    Examples
    --------
    >>> from sklearn.svm import LinearSVC
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.datasets import make_classification

    # Generate a random n-class classification problem.
      # n_samples : int, default=100  The number of samples.
      # n_features : int, default=20  The total number of features.
    >>> X, y = make_classification(n_features=4, random_state=0)

    # Construct a Pipeline from the given estimators.
    # A pipeline chains several transform steps with a final estimator (here: scaler -> classifier).
      # This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
      # return Pipeline(_name_estimators(steps), memory=memory, verbose=verbose)
    >>> clf = make_pipeline(StandardScaler(),
    ...                     LinearSVC(random_state=0, tol=1e-5))
    >>> clf.fit(X, y)
    Pipeline(steps=[('standardscaler', StandardScaler()),
                    ('linearsvc', LinearSVC(random_state=0, tol=1e-05))])

    >>> print(clf.named_steps['linearsvc'].coef_)
    [[0.141...   0.526... 0.679... 0.493...]]

    >>> print(clf.named_steps['linearsvc'].intercept_)
    [0.1693...]
    >>> print(clf.predict([[0, 0, 0, 0]]))
    [1]
    """
    pass

class Nystroem(TransformerMixin, BaseEstimator):
    """Approximate a kernel map using a subset of the training data.

    Examples
    --------
    >>> from sklearn import datasets, svm
    >>> from sklearn.kernel_approximation import Nystroem
    >>> X, y = datasets.load_digits(n_class=9, return_X_y=True)
    >>> data = X / 16.
    >>> clf = svm.LinearSVC()
    >>> feature_map_nystroem = Nystroem(gamma=.2,
    ...                                 random_state=1,
    ...                                 n_components=300)
    >>> data_transformed = feature_map_nystroem.fit_transform(data)
    >>> clf.fit(data_transformed, y)
    LinearSVC()
    >>> clf.score(data_transformed, y)
    0.9987...
    """

Details of specific parameters

def train_test_split(*arrays, test_size=None, train_size=None, random_state=None,shuffle=True,stratify=None):
    """Split arrays or matrices into random train and test subsets

    random_state: Controls the shuffling applied to the data before applying the split.
    shuffle : Whether or not to shuffle the data before splitting.
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    >>> y_test
    [1, 4]
    # features X, labels y; the test set takes 33% of the data
    """
    # .....
    pass

def cross_val_score(estimator, X, y=None, *, groups=None, scoring=None,cv=None, n_jobs=None, verbose=0, fit_params=None,pre_dispatch='2*n_jobs', error_score=np.nan):
    """Evaluate a score by cross-validation  交叉验证
    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    # Linear Model trained with L1 prior as regularizer
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y, cv=3))
    [0.33150734 0.08022311 0.03531764]
    """

Practical advice for using SVM

We propose that beginners try the following procedure first:

  1. Transform data to the format of an SVM package
  2. Conduct simple scaling on the data
  3. Consider the RBF kernel $K(x,y)=e^{-\gamma\Vert x-y \Vert^2}$
  4. Use cross-validation to find the best parameters C and γ
  5. Use the best C and γ to train the whole training set
  6. Test

Data handling:

  • Multi-class problems: use a one-hot encoding of the labels.
  • Scaling (to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges, and to reduce numerical difficulty) → sample normalization; a sketch follows below.
    We recommend linearly scaling each attribute to the range [−1, +1] or [0, 1].
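A minimal sketch of that linear scaling with sklearn's MinMaxScaler (toy data; fit the ranges on training data only, then reuse them on the test set):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., 200.], [3., 400.], [2., 300.]])  # toy features with very different ranges
X_test = np.array([[2., 250.]])

scaler = MinMaxScaler(feature_range=(-1, 1))  # the [-1, +1] range recommended above
X_train_s = scaler.fit_transform(X_train)     # fit the per-attribute ranges on training data
X_test_s = scaler.transform(X_test)           # apply the same ranges to the test data
print(X_train_s, X_test_s)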

Model selection (kernels)

  • The RBF kernel is a reasonable first choice:
    • it is a nonlinear mapping, so it can handle nonlinear relations between attributes and labels;
    • the linear kernel is a special case of the RBF kernel (a linear kernel with penalty C has the same capability as the RBF kernel for some (C, γ));
    • the sigmoid kernel behaves like the RBF kernel for certain parameters;
    • it has fewer hyperparameters than the polynomial kernel;
    • the RBF kernel has fewer numerical difficulties.

There are some situations where the RBF kernel is not suitable.
In particular, when the number of features is very large, one may just use the linear kernel.

  • Cross-validation and grid search:
    find a good (C, γ) — the goal is to identify a (C, γ) with which the
    classifier can accurately predict unknown data — while being careful about overfitting, hence cross-validation (splitting the dataset).

We recommend a "grid-search" on C and γ using cross-validation (exhaustive enumeration).

The grid-search is straightforward but seems naive.
In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate.

But grid search is still recommended, because:

  • we may not feel safe using methods that avoid an exhaustive parameter search by approximations or heuristics;
  • there are only two parameters (C, γ), so exhaustive search does not add much computational cost;
  • it is easy to parallelize, whereas many of the advanced methods are iterative processes, e.g. walking along a path, which can be hard to parallelize.

Since doing a complete grid-search may still be time-consuming, we recommend using a coarse grid first (as in the krkopt code: a coarse step of 0.1, then a fine step of 0.01).

When the linear kernel is recommended

When the number of features is large:
a. #samples ≪ #features (e.g. bioinformatics data; a code sketch follows after this list)

b. both #samples and #features are large (e.g. text data)
Such data often occur in document classification.
LIBSVM is not particularly good for this type of problem.
Fortunately, we have another software package, LIBLINEAR, which is very suitable for such data.

c. #samples ≫ #features: a nonlinear kernel is recommended here.
As the number of features is small, one often maps data to higher dimensional spaces.
However, if you really would like to use the linear kernel, you may use LIBLINEAR with the option -s 2.
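A small sketch of regime (a): many features, few samples, where a linear model usually suffices (synthetic data; sklearn's LinearSVC is backed by LIBLINEAR):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# 200 samples, 2000 features: #samples << #features
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000)  # linear kernel via LIBLINEAR
print(cross_val_score(clf, X, y, cv=3).mean())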


  1. When $\Phi(X)$ is infinite-dimensional it cannot be computed explicitly, so the optimization problem is not solvable in that form. ↩︎
