Calling Various Classifiers

静心净气

Tags: Python sklearn XGBoost machine learning

Article link: https://blog.csdn.net/qimiejia5584/article/details/78622442

Preface: this post mainly covers how to use the various classifiers in sklearn.

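0. Details of the parameters classifiers support

Kernel functions:
1. Linear kernel: mainly for linearly separable data. It has few parameters and trains quickly, and for typical data its classification accuracy is often already good enough.
2. RBF kernel: mainly for data that is not linearly separable. It has more parameters, and the classification result depends heavily on them.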
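To make the difference between the two kernels concrete, here is a minimal sketch on synthetic data. The dataset (two concentric rings from make_circles), the split, and the values C=1.0 and gamma=1.0 are illustrative assumptions, not settings from the original post.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same data, two kernels. C and gamma are assumed values for illustration.
linear_acc = SVC(kernel="linear", C=1.0).fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:   ", rbf_acc)
```

On data like this the RBF kernel should clearly win. On data that is close to linearly separable, the linear kernel is usually the cheaper first choice.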

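I. K-Nearest Neighbors

Reference

from sklearn.neighbors import NearestNeighbors

Its results can differ slightly from a hand-written implementation, for example when several training points are equally distant from a query and the tie is broken differently. The ball_tree, kd_tree and brute algorithms themselves perform exact searches.
1. Practice on a shop-location project showed that the kneighbors function and the predict_proba probability function (both available on KNeighborsClassifier; NearestNeighbors itself provides only kneighbors) are closely related but not fully dependent on each other, so they are best treated as independent.
As a result, two different probability matrices were built, one from each.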
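The relationship between the two functions can be checked on toy data. The sketch below assumes KNeighborsClassifier with random, hypothetical data and k=5. With weights="uniform", the matrix from predict_proba is just the fraction of the k neighbors returned by kneighbors that fall in each class. With weights="distance", the two generally differ, which is one possible reason to treat them as separate sources.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(60, 2)                  # hypothetical training points
y = rng.randint(0, 3, size=60)       # hypothetical class labels
queries = rng.rand(5, 2)

k = 5
knn = KNeighborsClassifier(n_neighbors=k, weights="uniform").fit(X, y)

# kneighbors returns the distances and training-row indices of the k nearest points.
dist, idx = knn.kneighbors(queries)

# Probability matrix built by hand: share of the k neighbors belonging to each class.
manual = np.array([[np.mean(y[row] == c) for c in knn.classes_] for row in idx])

# Probability matrix reported by the classifier itself.
proba = knn.predict_proba(queries)
print("uniform weights, matrices equal:", np.allclose(manual, proba))

# With distance weighting, closer neighbors count more, so the
# simple neighbor-share matrix is no longer what predict_proba returns in general.
knn_w = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X, y)
proba_w = knn_w.predict_proba(queries)
```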

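II. Support Vector Machines

Input data types: for dense data use numpy.ndarray (or convert with numpy.asarray); for sparse data use scipy.sparse. Speed-up: a C-ordered numpy.ndarray or a scipy.sparse.csr_matrix, in both cases with dtype=float64, can be somewhat faster.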
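A minimal sketch of preparing input in the recommended layouts follows. The array X_raw is a hypothetical stand-in for real features, deliberately created Fortran-ordered and as float32 so the conversion is visible.

```python
import numpy as np
from scipy import sparse

# Hypothetical feature matrix in an unfavorable layout: Fortran order, float32.
X_raw = np.asfortranarray(np.arange(12, dtype=np.float32).reshape(4, 3))

# Dense input: C-ordered (row-major) float64 array.
X_dense = np.ascontiguousarray(X_raw, dtype=np.float64)

# Sparse input: CSR matrix holding float64 values.
X_sparse = sparse.csr_matrix(X_raw, dtype=np.float64)

print(X_dense.flags["C_CONTIGUOUS"], X_dense.dtype)
print(X_sparse.format, X_sparse.dtype)
```

Converting once up front saves the estimator from making its own copy on every fit call.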

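SVC vs. LinearSVC: LinearSVC can sometimes perform considerably better than SVC. LinearSVC is built on liblinear, supports a soft margin, and is fast, but it cannot predict probabilities. SVC is built on libsvm, whose solver is SMO-based; it supports kernel functions and probability prediction, but it is slow on large datasets.

LinearSVC is faster than SVC because the underlying libraries differ. The documentation also notes that gradient descent (SGDClassifier) with the hinge loss optimizes a linear-SVM objective similar to LinearSVC's, so it can serve as a substitute.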
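The three linear options can be placed side by side. This is a sketch on assumed toy data (two well-separated blobs); the hyperparameter values are illustrative only, and no timing claims are made here.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC

# Two well-separated clusters, so every linear model should fit them easily.
X, y = make_blobs(
    n_samples=300, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)

models = {
    "SVC(kernel='linear')": SVC(kernel="linear", C=1.0),            # libsvm
    "LinearSVC": LinearSVC(C=1.0, max_iter=5000),                   # liblinear
    "SGDClassifier(hinge)": SGDClassifier(                          # SGD on hinge loss
        loss="hinge", penalty="l2", max_iter=1000, tol=1e-3, random_state=0
    ),
}

scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
for name, acc in scores.items():
    print(name, acc)

# LinearSVC exposes decision_function but no predict_proba.
has_proba = hasattr(models["LinearSVC"], "predict_proba")
print("LinearSVC has predict_proba:", has_proba)
```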

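Advantages of SVM (reference):
(1) Effective in high-dimensional spaces, including when the number of dimensions exceeds the number of samples.
(2) Memory efficient: the decision function uses only a subset of the training points (the support vectors).
(3) Various kernel functions can be specified, and the kernel shapes the decision function.
Disadvantages of SVM:
(1) When the number of dimensions is much larger than the number of samples, training remains efficient, but predictions can be poor because of overfitting unless the kernel and regularization are chosen carefully.
(2) It does not provide probability estimates directly; they are computed with an expensive five-fold cross-validation.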
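Disadvantage (2) in practice: probability output has to be requested when the model is constructed. The sketch below uses assumed toy data (three blobs) and shows the extra option. Fitting takes longer because of the internal cross-validation.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# probability=True must be set before fit; it triggers the internal
# cross-validated calibration and therefore slows training down.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])   # one row per sample, one column per class
labels = clf.predict(X[:5])        # plain class predictions

print(proba.shape)
print(labels)
```

Because the probabilities come from a separate calibration step, the class with the highest probability is not guaranteed to match predict on every sample.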

from sklearn.svm import SVC

svc = SVC(decision_function_shape='ovo', class_weight='balanced')
svc.fit(all_train_vecs_nor, y_train_data)
predict = svc.predict(unlabeled_vecs_nor)  # unlabeled_vecs_nor: the normalized test-set vectors

class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

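1. Parameters explained (translated).

C : float, optional (default=1.0)
Penalty coefficient of the error term; a larger C penalizes misclassified training points more heavily.
kernel : string, optional (default='rbf')
Kernel function: 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'. If a callable is passed instead, it is used to compute the kernel matrix from the input data.
degree : int, optional (default=3)
Degree of the polynomial kernel ('poly'); ignored by the other kernels.
gamma : float, optional (default='auto')
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. With 'auto', 1/n_features is used.
coef0 : float, optional (default=0.0)
Independent (bias) term of the 'poly' and 'sigmoid' kernels.
probability : boolean, optional (default=False)
Whether to enable probability estimates. This slows down the fit method. It controls whether probabilities can be output in addition to a plain class label.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic. (Variables whose final values can be anticipated are held fixed and only the remaining variables are optimized, which can greatly reduce training time.)
tol : float, optional (default=1e-3)
Tolerance for the stopping criterion.
cache_size : float, optional
Size of the kernel cache (in MB). For large problems a larger cache can noticeably speed up training.
class_weight : {dict, 'balanced'}, optional
By default (None) every class has weight 1. With 'balanced', weights are inversely proportional to class frequencies: the more samples a class has, the smaller its weight, so the classes end up balanced.
verbose : bool, default: False
Whether to print progress for each round. It may not work properly in a multithreaded context.
max_iter : int, optional (default=-1)
Maximum number of solver iterations; -1 means no limit.
decision_function_shape : 'ovo', 'ovr' or None, default=None
Shape of the decision function for multi-class problems. 'ovo' (one-vs-one) pits each class against each other class and decides by voting, for n_classes * (n_classes - 1) / 2 classifiers in total; 'ovr' (one-vs-rest) compares each class with all remaining classes, for n_classes classifiers in total.
random_state : int seed, RandomState instance, or None (default)
Seed of the pseudo-random number generator used to shuffle the data when probability estimation is enabled.
epsilon : float, optional (default=0.1)
Not an SVC parameter; used only in regression classes such as SVR. It sets how large an error is still treated as no error.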
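Two of the parameters above are easy to verify numerically. The sketch below uses assumed toy data: it checks the column counts that decision_function_shape produces for 4 classes, and reproduces the 'balanced' weights with compute_class_weight for a hypothetical 90/10 label split.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

X, y = make_blobs(n_samples=200, centers=4, random_state=0)
n_classes = len(np.unique(y))  # 4

ovo = SVC(decision_function_shape="ovo").fit(X, y)
ovr = SVC(decision_function_shape="ovr").fit(X, y)

# One column per pairwise classifier vs. one column per class.
ovo_shape = ovo.decision_function(X[:3]).shape
ovr_shape = ovr.decision_function(X[:3]).shape
print("ovo:", ovo_shape, "expected columns:", n_classes * (n_classes - 1) // 2)
print("ovr:", ovr_shape, "expected columns:", n_classes)

# 'balanced' weights: n_samples / (n_classes * count_of_each_class).
y_imb = np.array([0] * 90 + [1] * 10)      # hypothetical imbalanced labels
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_imb
)
print("balanced weights:", weights)
```

In scikit-learn, SVC always trains one-vs-one internally; the 'ovr' setting only changes the shape of the returned decision values.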

2. Attributes (read directly from the fitted estimator):

support_ : array-like, shape = [n_SV]
Indices of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
n_support_ : array-like, dtype=int32, shape = [n_class]
Number of support vectors for each class.
dual_coef_ : array, shape = [n_class-1, n_SV]
Coefficients of the support vector in the decision function. For multiclass, coefficient for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the section about multi-class classification in the SVM section of the User Guide for details.
coef_ : array, shape = [n_class-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ : array, shape = [n_class * (n_class-1) / 2]
Constants in decision function.
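To see these attributes on a fitted model, here is a small sketch on assumed toy data (three blobs, two features). A linear kernel is used so that coef_ is available.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=90, centers=3, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

n_sv = len(clf.support_)  # total number of support vectors

print("support_         ", clf.support_.shape)          # indices into X
print("support_vectors_ ", clf.support_vectors_.shape)  # the rows of X themselves
print("n_support_       ", clf.n_support_)              # count per class
print("dual_coef_       ", clf.dual_coef_.shape)
print("coef_            ", clf.coef_.shape)             # linear kernel only
print("intercept_       ", clf.intercept_.shape)
```

With 3 classes there are 3 pairwise classifiers, so intercept_ has 3 entries and dual_coef_ has n_class - 1 = 2 rows.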

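III. Stochastic Gradient Descent Classifier

Reference
1. Advantages and disadvantages
Advantages: commonly used for text classification and natural language processing problems. It handles large numbers of samples and features, and it is efficient and easy to tune.
Disadvantages: it requires several hyperparameters (for example the regularization parameter and the number of iterations), and it is sensitive to feature scaling.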
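Because of the sensitivity to feature scaling, a common pattern is to standardize features in the same pipeline as the classifier. The sketch below assumes toy data whose two features are put on wildly different scales on purpose; the hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(
    n_samples=300, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)
# Put the two features on very different scales (hypothetical units).
X = X * np.array([1000.0, 0.001])

# StandardScaler rescales each feature to zero mean and unit variance
# before SGDClassifier sees it.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)
scaled_acc = model.score(X, y)
print("accuracy with scaling:", scaled_acc)
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically at prediction time.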

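2. Stochastic gradient descent for classification: linear models
Notes:
(1) Before fitting, either shuffle the training data yourself or set shuffle=True so that it is reshuffled after each iteration.
(2) loss="log" and loss="modified_huber" are better suited to multi-class (one-vs-all) problems, because they yield a probability model.
(3) Setting average=True turns the model into averaged SGD (ASGD): coef_ becomes the average of the plain-SGD coefficients over all updates. The learning rate can therefore be set larger, which shortens training time.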
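The notes above can be combined in one short sketch. The data is an assumed toy set (three blobs). loss="modified_huber" is used for the probabilistic model because that name is the same in old and new scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# shuffle=True reshuffles the data after each epoch;
# average=True switches to averaged SGD (ASGD).
clf = SGDClassifier(
    loss="modified_huber",
    shuffle=True,
    average=True,
    max_iter=1000,
    tol=1e-3,
    random_state=0,
)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])   # available because of the chosen loss
print(proba.shape)

# With the hinge loss there is no probability model to query.
hinge_clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
hinge_has_proba = hasattr(hinge_clf, "predict_proba")
print("hinge has predict_proba:", hinge_has_proba)
```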

from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2")  # loss: how the loss is computed; penalty: the regularization term
clf.fit(X, y)
print(clf.predict([[2., 2.]]))  # predicted class label for the new point
print(clf.coef_)  # the coefficients (weights)
print(clf.intercept_)  # the intercept (bias / offset term)
print(clf.decision_function([[2., 2.]]))  # signed score of each point relative to the hyperplane
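How coef_, intercept_ and decision_function fit together can be checked directly. This is a sketch on the same two training points, with random_state, max_iter and tol added as assumptions for reproducibility. Note that decision_function returns the raw score w·x + b; dividing by the norm of w gives the geometric distance to the hyperplane.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1])

clf = SGDClassifier(
    loss="hinge", penalty="l2", max_iter=1000, tol=1e-3, random_state=0
).fit(X, y)

query = np.array([[2.0, 2.0]])
w = clf.coef_.ravel()      # weight vector of the separating hyperplane
b = clf.intercept_[0]      # offset of the hyperplane

raw_score = clf.decision_function(query)[0]
manual_score = query[0] @ w + b                     # the same value, computed by hand
geometric_distance = raw_score / np.linalg.norm(w)  # signed distance in feature space

print(raw_score, manual_score, geometric_distance)
```

The sign of the score decides the class: positive scores map to clf.classes_[1], the rest to clf.classes_[0].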

3. Parameter tuning for classification

class sklearn.linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, n_iter=5, shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, class_weight=None, warm_start=False, average=False)

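loss: the loss function. str: 'hinge' [linear SVM], 'log' [logistic regression, a probabilistic classifier], 'modified_huber' [tolerant to outliers, a probabilistic classifier], 'squared_hinge' [hinge, squared], 'perceptron' [the linear perceptron algorithm]; regression losses [see the SGDRegressor section for the corresponding algorithms]: 'squared_loss', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'
penalty: the penalty, also called the regularization term. str: 'none', 'l2', 'l1', or 'elasticnet' # l2 is the standard regularizer for linear SVM models; l1 and elasticnet are used for feature selection because they make the model sparse, which l2 cannot do.
alpha : float # Constant that multiplies the regularization term. When learning_rate is set to 'optimal', it is also used as a variable in that schedule's formula.
l1_ratio : float # Mixing ratio of the combined penalty. 0 is equivalent to using only l2; 1 is equivalent to using only l1.
fit_intercept : bool # Whether to fit the intercept (offset). Fitted by default. If your data is already centered, it is not needed.
n_iter : int, optional # Number of passes over the training set, i.e. the number of iterations. With partial_fit it is set to 1; otherwise the default is 5.
shuffle : bool, optional # Whether to reshuffle the data after every iteration. For classification, keep the default True.
random_state : int seed, RandomState instance, or None (default) # Random seed; setting a fixed value yourself is best.
verbose : integer, optional # Verbosity level.
epsilon : float # Acts as a threshold only when loss is one of these three: 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'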
n_jobs : integer, o
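As a hedged sketch of how such parameters might be tuned in practice, the example below runs a small grid search over alpha and penalty. The dataset, the grid values, and the step names "scale" and "sgd" are all assumptions chosen for illustration, not recommendations from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset standing in for real features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)),
])

# Grid over the regularization strength and the penalty type.
param_grid = {
    "sgd__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "sgd__penalty": ["l2", "l1", "elasticnet"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Fixing random_state keeps the comparison between grid points reproducible.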
