SVM
For the theory behind SVM, see the SVM series of blog posts, or refer to the lecture notes.
The main advantages of SVM are as follows:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples (SVM remains usable when there are more features than observations, which gives it an edge over many econometric models).
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The main disadvantages of SVM are as follows:
- If the number of features is much greater than the number of samples, choosing the kernel function and the regularization term carefully is crucial to avoid over-fitting.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Classification
SVC, NuSVC and LinearSVC can all perform both binary and multi-class classification.
For multi-class tasks, SVC and NuSVC (the two models are similar, differing only in their parameters and mathematical formulation) use the one-versus-one approach: $\frac{N(N-1)}{2}$ classifiers are built, where $N$ is the number of classes, and each classifier performs a binary classification task.
To provide a consistent interface with other classifiers, the decision_function_shape option allows monotonically transforming the results of the ovo classifiers into an ovr decision function of shape (n_samples, n_classes).
from sklearn.svm import SVC

def svc_multi_class_demo():
    X = [[0], [1], [2], [3]]
    Y = [0, 1, 2, 3]
    clf = SVC(decision_function_shape='ovo')  # one-versus-one
    clf.fit(X, Y)
    dec = clf.decision_function([[1]])
    print(dec.shape[1])  # 6 = 4*3/2 pairwise classifiers
    clf = SVC(decision_function_shape='ovr')  # one-versus-rest shaped decision function
    clf.fit(X, Y)
    dec = clf.decision_function([[0]])
    print(dec.shape[1])  # 4 = number of classes
Scores and probabilities
The decision_function method gives a per-class score for each sample. When the probability option is set to True, class probability estimates are also available (via the predict_proba and predict_log_proba methods). The probabilities are calibrated using Platt scaling [1]: logistic regression on the SVM's scores, fit by an additional cross-validation on the training data. In the multiclass case, this is extended as per [2].
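As a minimal sketch of this interface (the iris data is used here purely for illustration): with probability=True the fitted model exposes predict_proba and predict_log_proba alongside decision_function.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

def svc_probability_demo():
    X, y = load_iris(return_X_y=True)
    # probability=True triggers the extra cross-validation used for Platt scaling
    clf = SVC(probability=True, random_state=0)
    clf.fit(X, y)
    print(clf.decision_function(X[:1]))   # per-class scores
    print(clf.predict_proba(X[:1]))       # calibrated class probabilities
    print(clf.predict_log_proba(X[:1]))   # log of the probabilities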
Issues about Platt Scaling
The cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimate may be inconsistent with the scores:
- the argmax of the scores may not be the argmax of the probabilities
- in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5.
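As a rough check of this potential inconsistency, one can compare predict with the argmax of predict_proba on the same samples; they agree most of the time but are not guaranteed to (a sketch, again on the iris data chosen only for illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def predict_vs_proba_check():
    X, y = load_iris(return_X_y=True)
    clf = SVC(probability=True, random_state=0).fit(X, y)
    pred = clf.predict(X)                                               # based on decision scores
    proba_pred = clf.classes_[np.argmax(clf.predict_proba(X), axis=1)]  # based on probabilities
    print(np.mean(pred == proba_pred))  # fraction of samples where both agree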
Unbalanced problems
To deal with an unbalanced dataset, higher weights can be given to the rare classes; the parameters class_weight and sample_weight implement this.
In SVC, class_weight is passed as a dictionary {class_label : value}, which sets the penalty parameter of class class_label to C*value.
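A minimal sketch of class_weight (the toy data and the weight value are made up for illustration): class 1 below is penalized with C*10.

from sklearn.svm import SVC

def svc_class_weight_demo():
    # imbalanced toy data: five samples of class 0, two of class 1
    X = [[0], [1], [2], [3], [4], [10], [11]]
    y = [0, 0, 0, 0, 0, 1, 1]
    # the penalty for misclassifying class 1 becomes C * 10
    clf = SVC(kernel='linear', class_weight={1: 10})
    clf.fit(X, y)
    print(clf.predict([[6]]))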
Complexity
The computational cost of SVM grows quickly with the number of training vectors. At its core, SVM solves a QP problem that separates the support vectors from the rest of the training data. The QP solver used by the libsvm-based implementation scales between $\mathcal{O}(n_{features}\times n_{samples}^2)$ and $\mathcal{O}(n_{features}\times n_{samples}^3)$.
By comparison, LinearSVC is much more efficient.
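A small sketch contrasting the two estimators on the same synthetic data (the dataset size is arbitrary; this only shows the interchangeable interface, not a benchmark):

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

def compare_svc_linearsvc():
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    svc = SVC(kernel='linear').fit(X, y)  # libsvm-based solver
    lin = LinearSVC().fit(X, y)           # liblinear-based solver, scales better with n_samples
    print(svc.score(X, y), lin.score(X, y))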
Tips
Kernel cache size
The kernel cache size has a noticeable impact on run time. If enough RAM is available, cache_size can be set to a value higher than the default of 200 MB.
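For example (the 500 MB figure is only illustrative):

from sklearn.svm import SVC

# enlarge the kernel cache from the default 200 MB to 500 MB when RAM allows
clf = SVC(cache_size=500)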
Setting C
C defaults to 1. If the observations contain a lot of noise, C should be decreased; decreasing C corresponds to more regularization.
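For example (the value below is only illustrative):

from sklearn.svm import SVC

# C smaller than the default 1.0, i.e. stronger regularization for noisy data
clf = SVC(C=0.1)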
Highly recommended to scale data
The standardization/normalization step can be chained with SVC in a Pipeline, e.g. built with make_pipeline():
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def svc_pipeline():
    X = [[0, 2, 10, 15], [1, -3, 2, 7], [2, 18, -100, 1], [3, 0.25, 100, 7]]
    y = [0, 1, 1, 0]
    clf = make_pipeline(StandardScaler(), SVC())  # scale features before fitting the SVM
    clf.fit(X, y)
    ans = clf.predict([[2, 3, 11, 3]])
    print(ans)
Shrinking parameters
We found that if the number of iterations is large, then shrinking can shorten the training time. However, if we loosely solve the optimization problem (e.g. using a large stopping tolerance), the code without using shrinking may be faster.
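Shrinking is controlled by the shrinking parameter of SVC (on by default); a sketch of turning it off together with a loose tolerance, as described above:

from sklearn.svm import SVC

# with a loose stopping tolerance, disabling shrinking may be faster
clf = SVC(shrinking=False, tol=1e-1)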
L1 penalization
Using an L1 penalty yields sparse solutions in which only a subset of the features is used. Increasing C tends to produce a more complex model (more features are used). The function l1_min_c gives the lower bound of C below which the model is null (null model: all weights equal to zero).
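A minimal sketch with LinearSVC and an L1 penalty on the iris data (the dataset and the factor 10 are only illustrative); l1_min_c returns the smallest C for which the model is not null.

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC, l1_min_c

def l1_svc_demo():
    X, y = load_iris(return_X_y=True)
    c_min = l1_min_c(X, y, loss='squared_hinge')  # below this C all weights stay at zero
    clf = LinearSVC(penalty='l1', dual=False, C=10 * c_min, max_iter=10000)
    clf.fit(X, y)
    print(c_min)
    print(clf.coef_)  # sparse: many coefficients are exactly zero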
Kernel functions
Commonly used kernel functions are listed below.
kernel function | expression |
---|---|
linear | $\langle x, x'\rangle$ |
polynomial | $(\gamma\langle x, x'\rangle+r)^d$, where $d$ is set by the parameter degree and $r$ by coef0 |
rbf | $\exp(-\gamma\lVert x-x' \rVert^2)$, where $\gamma$ is set by the parameter gamma and must be greater than 0 |
sigmoid | $\tanh(\gamma\langle x, x' \rangle+r)$ |
from sklearn.svm import SVC

def svc_kernel():
    linear_svc = SVC(kernel='linear')
    print(linear_svc.kernel)  # 'linear'
    rbf_svc = SVC(kernel='rbf')
    print(rbf_svc.kernel)  # 'rbf'
Parameters of the RBF kernel
The RBF (Radial Basis Function) kernel has two parameters to tune, C and gamma.
- C: shared by all SVM kernels; it trades off misclassification of training samples against the simplicity of the decision surface. A low C gives a coarse, smooth decision surface, while a high C tries to classify every training sample correctly.
- gamma: defines how far the influence of a single training sample reaches; the larger gamma, the closer other samples have to be in order to be affected.
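In practice, C and gamma for the RBF kernel are usually chosen by a cross-validated grid search; a minimal sketch (the grid values and the iris data are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def rbf_grid_search():
    X, y = load_iris(return_X_y=True)
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)  # the combination with the best cross-validated accuracy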
Interface
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
demos: multilabel classification
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def demo_multilabel():
    # plot the separating hyperplane of a fitted linear classifier
    def hyperplane_plot(clf, x_min, x_max, ls, label):
        w = clf.coef_[0]
        k = -w[0] / w[1]
        xx = np.linspace(x_min - 3, x_max + 3)
        yy = k * xx - clf.intercept_[0] / w[1]  # w[0]*x + w[1]*y + b = 0
        plt.plot(xx, yy, ls, alpha=0.7, label=label)

    def subfigure_plot(X, y, subp, title, transform):
        # project the features onto two dimensions with PCA or CCA
        if transform == 'pca':
            X = PCA(n_components=2).fit_transform(X)
        elif transform == 'cca':
            X = CCA(n_components=2).fit(X, y).transform(X)
        else:
            raise ValueError
        x_min, x_max = np.min(X[:, 0]), np.max(X[:, 0])
        y_min, y_max = np.min(X[:, 1]), np.max(X[:, 1])
        clf = OneVsRestClassifier(SVC(kernel='linear'))  # one-vs-rest multilabel classifier
        clf.fit(X, y)
        plt.subplot(2, 2, subp)
        plt.title(title)
        class_0 = np.where(y[:, 0])
        class_1 = np.where(y[:, 1])
        plt.scatter(X[:, 0], X[:, 1], s=40, c='gray', alpha=0.6, edgecolors=(0, 0, 0))
        plt.scatter(X[class_0, 0], X[class_0, 1], s=120, edgecolors='b', facecolors='none', lw=1, label='Class 0')
        plt.scatter(X[class_1, 0], X[class_1, 1], s=60, edgecolors='orange', facecolors='none', lw=1, label='Class 1')
        hyperplane_plot(clf.estimators_[0], x_min, x_max, 'k--', 'Boundary\nfor Class 0')
        hyperplane_plot(clf.estimators_[1], x_min, x_max, 'k-.', 'Boundary\nfor Class 1')
        plt.xticks([])
        plt.yticks([])
        plt.xlim(x_min - 0.5 * x_max, x_max + 0.5 * x_max)
        plt.ylim(y_min - 0.5 * y_max, y_max + 0.5 * y_max)

    plt.figure(figsize=(8, 6))
    # samples that carry no label are allowed
    X, y = make_multilabel_classification(n_classes=2, n_labels=1, allow_unlabeled=True, random_state=729)
    subfigure_plot(X, y, 1, 'With Unlabeled obs PCA', 'pca')
    subfigure_plot(X, y, 2, 'With Unlabeled obs CCA', 'cca')
    # every sample carries at least one label
    X, y = make_multilabel_classification(n_classes=2, n_labels=1, allow_unlabeled=False, random_state=729)
    subfigure_plot(X, y, 3, 'Without Unlabeled obs PCA', 'pca')
    subfigure_plot(X, y, 4, 'Without Unlabeled obs CCA', 'cca')
    plt.subplots_adjust(.04, .02, .97, .94, .09, .2)
    plt.show()
References
SVM User Guide
sklearn.svm.SVC