Overview of Machine Learning Algorithms

Reposted from: http://www.cnblogs.com/chenyaling/p/7826229.html

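1. Supervised learning
1.1. Generalized linear models
1.1.1. Ordinary least squares
class sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
1.1.1.1. Ordinary least squares complexity: O(n p^2)
1.1.2. Ridge regression: linear_model.Ridge
1.1.2.1. Ridge complexity: O(n p^2)
1.1.2.2. Setting the regularization parameter: generalized cross-validation with linear_model.RidgeCV
1.1.3. Lasso: linear_model.Lasso
1.1.3.1. Setting the regularization parameter
1.1.3.1.1. Using cross-validation
1.1.3.1.2. Information-criteria based model selection
1.1.3.1.3. Comparison with the regularization parameter of SVM
1.1.4. Multi-task Lasso
1.1.5. Elastic Net
1.1.6. Multi-task Elastic Net
1.1.7. Least Angle Regression
1.1.8. LARS Lasso
1.1.8.1. Mathematical formulation
1.1.9. Orthogonal Matching Pursuit (OMP)
1.1.10. Bayesian regression
1.1.10.1. Bayesian ridge regression
1.1.10.2. Automatic Relevance Determination (ARD)
1.1.11. Logistic regression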
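class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
penalty: if the main goal of tuning is simply to reduce overfitting, L2 regularization is usually enough.
If the model still overfits with L2, that is, it predicts poorly on unseen data, consider L1 regularization.
L1 is also useful when the model has a great many features and you want the coefficients of unimportant features driven to zero so that the model becomes sparse.
The choice of penalty constrains the choice of optimization algorithm, i.e. the solver parameter. With L2, all four solvers {'newton-cg', 'lbfgs', 'liblinear', 'sag'} can be used. With L1, only 'liblinear' can be used among these four. The reason is that the L1-regularized loss is not continuously differentiable, while 'newton-cg', 'lbfgs' and 'sag' all need continuous first or second derivatives of the loss; 'liblinear' has no such requirement. (The 'saga' solver added in scikit-learn 0.19 also supports L1.)
solver: determines how the logistic regression loss is optimized. The four algorithms discussed here are:
a) liblinear: uses the open-source liblinear library, which optimizes the loss iteratively by coordinate descent.
b) lbfgs: a quasi-Newton method that optimizes the loss iteratively using an approximation of the Hessian (the matrix of second derivatives) built from gradients.
c) newton-cg: also a member of the Newton family; it optimizes the loss iteratively using the Hessian of the loss.
d) sag: stochastic average gradient descent.
liblinear supports L1 and L2 but only one-vs-rest (OvR) for multiclass problems; 'lbfgs', 'sag' and 'newton-cg' support only L2, and support both OvR and the multinomial option.
multi_class: determines the multiclass scheme. The two values are ovr and multinomial; the default is ovr.
ovr is the one-vs-rest (OvR) scheme mentioned above. multinomial fits a single model over all classes jointly by minimizing the multinomial (softmax) cross-entropy loss. For binary logistic regression there is essentially no difference between the two; the difference shows up in multiclass logistic regression.
The idea of OvR is simple: however many classes there are, each decision can be treated as binary logistic regression. For the decision about class K, all samples of class K are taken as positives and all other samples as negatives, and a binary logistic regression is fit on them to obtain the model for class K. The models for the other classes are obtained in the same way.
Many-vs-many (MvM) decomposition is more involved; for comparison with OvR it is illustrated here by its special case one-vs-one (OvO). If there are T classes, each time two classes are picked from the T, say T1 and T2; all samples labelled T1 or T2 are put together, T1 is taken as positive and T2 as negative, and a binary logistic regression is fit to obtain the model parameters. In total T(T-1)/2 binary classifiers are needed.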
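The difference in the number of binary problems can be made concrete with a small sketch. This is plain Python rather than scikit-learn code, and the class labels and the helper names ovr_tasks and ovo_tasks are made up for illustration.

```python
from itertools import combinations

def ovr_tasks(classes):
    """One binary problem per class: class k (positive) vs. all others."""
    return [(k, [c for c in classes if c != k]) for k in classes]

def ovo_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["a", "b", "c", "d"]          # T = 4 hypothetical class labels
T = len(classes)

print(len(ovr_tasks(classes)))          # T binary models
print(len(ovo_tasks(classes)))          # T * (T - 1) / 2 binary models
```

OvO trains more models, but each one only sees the samples of two classes.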
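From the description above, OvR is relatively simple but usually classifies slightly worse (for most sample distributions; under some distributions OvR may be better). MvM-style schemes are usually more accurate but slower than OvR.
1.1.12. Stochastic gradient descent - SGD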
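class_weight (LogisticRegression): either 'balanced', which lets the library compute the class weights, or a dictionary giving the weight of each class.
With class_weight='balanced' the weights are computed as n_samples / (n_classes * np.bincount(y)), where n_samples is the number of samples, n_classes the number of classes, and np.bincount(y) the number of samples in each class. For example, y = [1, 0, 0, 1, 1] gives np.bincount(y) = [2, 3].

The class weights can also be obtained with sklearn.utils.class_weight.compute_class_weight.

sample_weight: class_weight assigns one weight per class, which is the usual remedy for imbalanced classes. To weight individual samples instead, pass sample_weight when fitting, fit(X, y[, sample_weight]). If both are used in scikit-learn logistic regression, the effective weight of a sample is class_weight * sample_weight.

sample_weight can be obtained with sklearn.utils.class_weight.compute_sample_weight. It applies the same formula, n_samples / (n_classes * np.bincount(y)), and returns an array holding one weight per sample.

sample_weight = compute_sample_weight(class_weight='balanced', y=labels); then simply pass sample_weight to fit.

C: regularization parameter (the inverse of the regularization strength, so smaller values regularize more).
max_iter: maximum number of iterations.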
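The 'balanced' formula can be checked by hand on the y = [1, 0, 0, 1, 1] example. The sketch below re-implements the formula in plain Python; it is not scikit-learn's code, and the helper names and the user_sample_w values are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weight(y):
    """n_samples / (n_classes * count_of_class) for every class in y."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

def balanced_sample_weight(y):
    """Give every sample the balanced weight of its own class."""
    cw = balanced_class_weight(y)
    return [cw[label] for label in y]

y = [1, 0, 0, 1, 1]                    # the example used in the text
class_w = balanced_class_weight(y)     # class 0 -> 5 / (2 * 2), class 1 -> 5 / (2 * 3)
sample_w = balanced_sample_weight(y)

# If both kinds of weight are supplied, the effective weight is their product.
user_sample_w = [1.0, 2.0, 1.0, 1.0, 0.5]   # hypothetical per-sample weights
effective = [class_w[c] * w for c, w in zip(y, user_sample_w)]
```

The rarer class 0 receives the larger weight, and the balanced sample weights sum to n_samples.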
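1.1.13. Perceptron
class sklearn.linear_model.Perceptron(penalty=None, alpha=0.0001, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, eta0=1.0, n_jobs=1, random_state=0, class_weight=None, warm_start=False, n_iter=None)
1.1.14. Passive Aggressive Algorithms
1.1.15. Robustness regression: outliers and modeling errors
1.1.15.1. Different scenarios and useful concepts
1.1.15.2. RANSAC: RANdom SAmple Consensus
1.1.15.2.1. Details of the algorithm
1.1.15.3. Theil-Sen estimator: generalized-median-based estimator
1.1.15.3.1. Theoretical considerations
1.1.15.4. Huber regression
1.1.15.5. Notes
1.1.16. Polynomial regression: extending linear models with basis functions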

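1.2. Linear and Quadratic Discriminant Analysis
1.2.1. Dimensionality reduction using Linear Discriminant Analysis
1.2.2. Mathematical formulation of the LDA and QDA classifiers
1.2.3. Mathematical formulation of LDA dimensionality reduction
class sklearn.discriminant_analysis.LinearDiscriminantAnalysis(solver='svd', shrinkage=None, priors=None, n_components=None, store_covariance=False, tol=0.0001)
1.2.4. Shrinkage
1.2.5. Estimation algorithms

1.3. Kernel ridge regression

1.4. Support Vector Machines
1.4.1. Classification
1.4.1.1. Multi-class classification
class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)
LinearSVC uses one-vs-rest by default. Only LinearSVC also offers multi_class="crammer_singer", which optimizes a joint objective over all classes (the Crammer-Singer multi-class SVM) rather than combining one-vs-one classifiers.
1.4.1.2. Scores and probabilities
1.4.1.3. Unbalanced problems
1.4.2. Regression
1.4.3. Density estimation, novelty detection (covered in detail in 2.7)
1.4.4. Complexity
1.4.5. Tips on practical use
1.4.6. Kernel functions
1.4.6.1. Custom kernels
1.4.6.1.1. Using Python functions as kernels
1.4.6.1.2. Using the Gram matrix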
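Set kernel='precomputed' and pass the Gram matrix instead of X to the fit method. For prediction, the kernel values between all training vectors and the test vectors must then be provided.
import numpy as np
from sklearn import svm
X, y = np.array([[0, 0], [1, 1]]), [0, 1]
clf = svm.SVC(kernel='precomputed')
gram = np.dot(X, X.T)  # linear kernel computed by hand
clf.fit(gram, y)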
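The shapes of the two matrices involved are easy to get wrong, so here is a minimal plain-Python sketch of them for a linear kernel. The data and the helper linear_gram are hypothetical; only the shapes matter.

```python
def linear_gram(A, B):
    """Matrix of dot products: entry [i][j] = <A[i], B[j]>."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B]
            for row_a in A]

X_train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # hypothetical data
X_test = [[1.0, 0.0], [0.0, 2.0]]

gram_train = linear_gram(X_train, X_train)   # shape (n_train, n_train), used for fit
gram_test = linear_gram(X_test, X_train)     # shape (n_test, n_train), used for predict
```

The matrix used at prediction time has one row per test vector and one column per training vector.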
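1.4.6.1.3. Parameters of the RBF kernel
Use GridSearchCV from model selection to choose the C and gamma parameters.
1.4.7. Mathematical formulation
1.4.7.1. SVC
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
SVC parameters
(1) C: penalty coefficient of the objective function; it balances the margin width against misclassified samples. Default C = 1.0.
(2) kernel: options include 'rbf', 'linear', 'poly' and 'sigmoid'; the default is 'rbf'.
(3) degree: only effective when kernel='poly'; the degree of the polynomial.
(4) gamma: kernel coefficient for 'poly', 'rbf' and 'sigmoid'; the default 'auto' means gamma = 1 / n_features.
(5) coef0: independent term in the kernel function; only significant for 'poly' and 'sigmoid'.
(6) probability: whether to enable probability estimates (True or False).
(7) shrinking: whether to use the shrinking heuristic.
(8) tol (default 1e-3): tolerance of the stopping criterion.
(9) cache_size: size of the kernel cache used in training (in MB).
(10) class_weight: weight of each class; the penalty parameter C of class i is set to class_weight[i] * C. If not given, all classes have weight one.
(11) verbose: enables verbose output. It relies on a per-process setting in libsvm that may not work properly in a multithreaded context.
(12) max_iter: maximum number of iterations; the default -1 means no limit.
(13) decision_function_shape: 'ovo' (one-vs-one) or 'ovr' (one-vs-rest); the default in the signature above is 'ovr'.
(14) random_state: seed of the pseudo-random number generator used when shuffling the data for probability estimates.
(15) decision_function: the scores of the samples for each class.
PS: (7), (8) and (9) rarely need attention.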
decision_function(X) Distance of the samples X to the separating hyperplane.
fit(X, y[, sample_weight]) Fit the SVM model according to the given training data.
get_params([deep]) Get parameters for this estimator.
predict(X) Perform classification on samples in X.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.
1.4.7.2. NuSVC
class sklearn.svm.NuSVC(nu=0.5, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
1.4.7.3. SVR
class sklearn.svm.SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
SVR has no class_weight parameter; its fit method accepts sample_weight, which rescales C per sample.
1.4.8. Implementation details

1.5. Stochastic Gradient Descent
1.5.1. Classification
class sklearn.linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, class_weight=None, warm_start=False, average=False, n_iter=None)
1.5.2. Regression
class sklearn.linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, warm_start=False, average=False, n_iter=None)
1.5.3. Stochastic gradient descent for sparse data
1.5.4. Complexity
1.5.5. Tips on practical use
1.5.6. Mathematical formulation
1.5.6.1. SGD
1.5.7. Implementation details

1.6. Nearest Neighbors
1.6.1. Unsupervised nearest neighbors
1.6.1.1. Finding the nearest neighbors
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
1.6.1.2. KDTree and BallTree classes
1.6.2. Nearest neighbors classification
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
KNeighborsClassifier.fit(X, y)
1.6.3. Nearest neighbors regression
class sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
Nearest neighbors regression is used when the labels are continuous rather than discrete. The prediction for a query point is the mean of the labels of its nearest neighbors.
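The averaging rule can be sketched in a few lines of plain Python. This uses brute-force search with uniform weights and made-up data; it is an illustration of the idea, not the KNeighborsRegressor implementation, and knn_regress is a hypothetical helper.

```python
import math

def knn_regress(X_train, y_train, x_query, k=2):
    """Predict the mean target of the k training points closest to x_query."""
    by_distance = sorted(
        zip(X_train, y_train),
        key=lambda pair: math.dist(pair[0], x_query),
    )
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / k

X_train = [[0.0], [1.0], [2.0], [3.0]]   # hypothetical 1-D inputs
y_train = [0.0, 1.0, 4.0, 9.0]           # hypothetical continuous targets

prediction = knn_regress(X_train, y_train, [1.1], k=2)   # neighbours: x=1 and x=2
```

With k=1 the prediction is simply the target of the single closest training point.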
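1.6.4. Nearest neighbor algorithms
1.6.4.1. Brute force
1.6.4.2. K-D tree
algorithm = 'kd_tree'
1.6.4.3. Ball tree
1.6.4.4. Choice of nearest neighbors algorithm
1.6.4.5. Effect of leaf_size
1.6.5. Nearest centroid classifier
1.6.5.1. Nearest shrunken centroid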

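1.7. Gaussian processes
1.7.1. Gaussian Process Regression (GPR)
1.7.2. GPR examples
1.7.2.1. GPR with noise-level estimation
1.7.2.2. Comparison of GPR and Kernel Ridge Regression
1.7.2.3. GPR on Mauna Loa CO2 data
1.7.3. Gaussian Process Classification (GPC)
1.7.4. GPC examples
1.7.4.1. Probabilistic predictions with GPC
1.7.4.2. Illustration of GPC on the XOR dataset
1.7.4.3. Gaussian process classification on the iris dataset
1.7.5. Kernels for Gaussian processes
1.7.5.1. Gaussian processes kernel API
1.7.5.2. Basic kernels
1.7.5.3. Kernel operators
1.7.5.4. Radial-basis function (RBF) kernel
1.7.5.5. Matérn kernel
1.7.5.6. Rational quadratic kernel
1.7.5.7. Exp-Sine-Squared kernel
1.7.5.8. Dot-product kernel
1.7.5.9. References
1.7.6. Legacy Gaussian processes
1.7.6.1. An introductory regression example
1.7.6.2. Fitting noisy data
1.7.6.3. Mathematical formulation
1.7.6.3.1. The initial assumption
1.7.6.3.2. The best linear unbiased prediction (BLUP)
1.7.6.3.3. The empirical best linear unbiased predictor (EBLUP)
1.7.6.4. Correlation models
1.7.6.5. Regression models
1.7.6.6. Implementation details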

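1.8. Cross decomposition

1.9. Naive Bayes
1.9.1. Gaussian Naive Bayes
class sklearn.naive_bayes.GaussianNB(priors=None)
1.9.2. Multinomial Naive Bayes
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
1.9.3. Bernoulli Naive Bayes
class sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
binarize=0.0: threshold used to binarize the input features; if None, the input is presumed to consist of binary vectors already.
fit_prior=True: whether to learn the class prior probabilities from the data; if False, a uniform prior is used (I am not sure when this is useful).
class_prior=None: explicitly specified prior probabilities of the classes.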
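In the multinomial model:
Let a document be d = (t1, t2, …, tk), where the tk are the words that occur in the document, repetitions allowed. Then
prior probability P(c) = total number of words in class c / total number of words in the whole training set
class-conditional probability P(tk|c) = (total occurrences of word tk in the documents of class c + 1) / (total number of words in class c + |V|)
V is the vocabulary of the training set (each word is counted once however often it occurs), and |V| is the number of distinct words in the training set. P(tk|c) can be read as how much evidence the word tk provides that d belongs to class c, and P(c) as the overall proportion (likelihood) of class c.
In the Bernoulli model:
P(c) = number of documents in class c / total number of documents in the training set
P(tk|c) = (number of documents in class c that contain word tk + 1) / (number of documents in class c + 2)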
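The two smoothed estimates can be compared on a toy corpus. The sketch below is plain Python with exact fractions; the four documents and the function names are invented for illustration and do not come from scikit-learn.

```python
from fractions import Fraction

# Hypothetical toy corpus: (tokens, class label)
docs = [
    (["Chinese", "Beijing", "Chinese"], "yes"),
    (["Chinese", "Chinese", "Shanghai"], "yes"),
    (["Chinese", "Macao"], "yes"),
    (["Tokyo", "Japan", "Chinese"], "no"),
]
vocab = {t for tokens, _ in docs for t in tokens}          # V, so |V| = len(vocab)

def multinomial_cond(term, c):
    """(occurrences of term in class c + 1) / (words in class c + |V|)."""
    words = [t for tokens, label in docs if label == c for t in tokens]
    return Fraction(words.count(term) + 1, len(words) + len(vocab))

def bernoulli_cond(term, c):
    """(documents of class c containing term + 1) / (documents of class c + 2)."""
    class_docs = [tokens for tokens, label in docs if label == c]
    containing = sum(term in tokens for tokens in class_docs)
    return Fraction(containing + 1, len(class_docs) + 2)

def bernoulli_prior(c):
    """Documents of class c / all documents."""
    return Fraction(sum(label == c for _, label in docs), len(docs))
```

The multinomial estimate counts every occurrence of a word, while the Bernoulli estimate only counts whether a document contains it.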
1.9.4. Out-of-core naive Bayes model fitting

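1.10. Decision trees
Some advantages of decision trees are:
Simple to understand and to interpret. Trees can be visualised.
Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables and the removal of blank values. Note, however, that this module does not support missing values.
The cost of using the tree (i.e. predicting data) is logarithmic in the number of data points used to train the tree.
Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See the algorithms section for more information.
Able to handle multi-output problems.
Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in boolean logic. By contrast, in a black box model (for example an artificial neural network), results may be more difficult to interpret.
Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model.
Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
The disadvantages of decision trees include:
Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
Decision trees can be unstable, because small variations in the data may result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Practical decision-tree learning algorithms are therefore based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
Decision-tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting the decision tree.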
1.10.1. Classification
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
>>> import graphviz
>>> from sklearn import tree
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                                 feature_names=iris.feature_names,
...                                 class_names=iris.target_names,
...                                 filled=True, rounded=True,
...                                 special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph.render("iris")
For multi-output problems, predict outputs n_output values.
For multi-output problems, predict_proba outputs a list of n_output arrays of class probabilities.
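*Maximum number of features considered at a split, max_features: many kinds of value are accepted. The default None means all features are considered; "log2" means at most log2(N) features are considered; "sqrt" or "auto" means at most sqrt(N) features are considered. An integer is the absolute number of features to consider. A float is a fraction, i.e. int(fraction x N) features are considered. N is the total number of features. In general, if there are not many features, say fewer than 50, the default None is fine; if there are very many features, the other values can be used to limit the number of features considered at a split and so control the time needed to build the tree.
*Maximum depth of the tree, max_depth: can be left unset by default, in which case the depth of subtrees is not limited while the tree is built. In general this value can be ignored when there is little data or there are few features. With many samples and many features it is advisable to limit the depth; the right value depends on the distribution of the data. Values between 10 and 100 are common.
*Minimum number of samples required to split an internal node, min_samples_split: this limits further splitting of a subtree. If a node has fewer samples than min_samples_split, no further attempt is made to find the best feature to split on. The default is 2. With a small sample this value can be left alone; with a very large sample it is advisable to increase it. As a reference, in an earlier project of mine with about 100,000 samples I chose min_samples_split=10.
*Minimum number of samples at a leaf, min_samples_leaf: this limits the smallest number of samples a leaf may hold. If a leaf would hold fewer samples than this value, it is pruned together with its sibling. The default is 1. Either an integer number of samples or a fraction of the total number of samples can be given. With a small sample this value can be left alone; with a very large sample it is advisable to increase it. The 100,000-sample project mentioned above used min_samples_leaf=5, for reference only.
*Split quality criterion, criterion: "gini" or "entropy"; the former is the Gini impurity, the latter the information gain. The default "gini", i.e. the CART algorithm, is generally fine, unless you prefer the feature selection used by ID3 or C4.5.
For regression trees, criterion can be "mse" or "mae"; the former is the mean squared error, the latter the mean absolute error. The default "mse" is recommended, and is generally the more precise of the two, unless you want to compare the effect of the two settings.
Split point selection strategy, splitter: "best" or "random". The former finds the best split among all candidate split points of the features; the latter finds a locally best split among a random subset of candidate split points. The default "best" suits samples that are not too large; for very large samples "random" is recommended when building the tree.
Minimum weighted fraction at a leaf, min_weight_fraction_leaf: this limits the minimum sum of the sample weights in a leaf; a leaf below it is pruned together with its sibling. The default 0 means weights are not considered. In general, sample weights come into play when many samples have missing values or when the class distribution of a classification tree is strongly skewed, and that is when this value deserves attention.
Maximum number of leaf nodes, max_leaf_nodes: limiting the number of leaves helps prevent overfitting. The default None means no limit. With a limit, the algorithm builds the best tree within that number of leaves. With few features this value can be ignored; with very many features a limit can be set, and the value can be found by cross-validation.
Class weights, class_weight: the weight of each class, mainly to keep the tree from leaning towards classes that have too many training samples. The weights can be given explicitly, or "balanced" can be used, in which case the algorithm computes them itself and classes with fewer samples get higher weights. If the class distribution shows no obvious skew, this parameter can be left at the default None. It does not apply to regression trees.
Minimum impurity for a split, min_impurity_split: this limits the growth of the tree. If the impurity of a node (Gini impurity, information gain, mean squared error or absolute error) is below this threshold, the node produces no children and becomes a leaf. Changing the default of 1e-7 is generally not recommended. (The signature above shows None because scikit-learn 0.19 deprecates this parameter in favour of min_impurity_decrease.)
Whether to presort the data, presort: a boolean, False by default. In general, with a small sample or a tree restricted to a small depth, setting it to True can speed up the search for split points and the construction of the tree. With a very large sample it brings no benefit. The catch is that with a small sample things are not slow anyway, so this value can usually be ignored.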
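To make the two classification criteria concrete, here is a minimal plain-Python sketch of the Gini impurity, the entropy, and the size-weighted impurity of a candidate split. The node contents are hypothetical and the functions are illustrations, not scikit-learn's tree code.

```python
import math
from collections import Counter

def class_proportions(labels):
    counts = Counter(labels)
    return [n / len(labels) for n in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def weighted_impurity(left, right, impurity):
    """Impurity of a candidate split, weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

parent = ["a", "a", "b", "b"]                       # hypothetical node
split = (["a", "a"], ["b", "b"])                    # a perfect split
```

A split is attractive when the weighted impurity of the children is much lower than the impurity of the parent; the difference for entropy is the information gain.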
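Besides these parameters, other points to note when tuning are:
1) With few samples but very many features, decision trees overfit easily. In general it is easier to build a robust model when there are more samples than features.
2) With few samples but very many features, it is advisable to reduce the dimensionality before fitting the tree, for example with principal component analysis (PCA), feature selection (Lasso) or independent component analysis (ICA). The dimensionality drops sharply and the tree fitted afterwards works better.
3) Make frequent use of tree visualisation, and limit the depth of the tree at first (say at most 3 levels). This shows how the tree initially fits the data before deciding whether to increase the depth.
4) Before training, look at the class distribution of the samples (mainly for classification trees). If the classes are very unevenly distributed, consider class_weight to keep the model from leaning towards the classes with many samples.
5) Decision trees use numpy float32 arrays internally. If the training data is not in that format, the algorithm makes a copy before running.
6) If the input matrix is sparse, it is advisable to convert it to csc_matrix before fitting and to csr_matrix before predicting.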
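1.10.2. Regression
class sklearn.tree.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)
1.10.3. Multi-output problems
Multi-output means there is more than one target Y, for example when a single input X must yield both cos and sin.
That example has continuous targets and therefore uses a regressor; DecisionTreeClassifier supports multi-output targets as well.
1.10.4. Complexity
1.10.5. Tips on practical use
1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART
Topics: pseudocode, differences and pruning of ID3, C4.5 and CART. scikit-learn uses (an optimised version of) CART by default, because CART can also be applied to regression, which the other two cannot.
1.10.7. Mathematical formulation
1.10.7.1. Classification criteria
1.10.7.2. Regression criteria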

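1.11. Ensemble methods
1.11.1. Bagging
Ensemble methods fall into two families. One is boosting, where the weak learners depend on each other. The other is bagging, where the weak learners are independent of each other and can be fit in parallel. Bagging draws its subsamples with replacement, and random forests use bagging sampling; the subsampling in GBDT, by contrast, is done without replacement.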
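The two sampling schemes can be contrasted with the standard library's random module. This is only a sketch of the sampling step, with a made-up sample size and hypothetical helper names; it does not reproduce how scikit-learn draws its samples internally.

```python
import random

def bootstrap_indices(n, rng):
    """Bagging-style draw: n indices sampled WITH replacement."""
    return [rng.randrange(n) for _ in range(n)]

def subsample_indices(n, fraction, rng):
    """GBDT-style draw: a fraction of the indices WITHOUT replacement."""
    return rng.sample(range(n), int(n * fraction))

rng = random.Random(0)            # fixed seed so the run is repeatable
n = 1000
boot = bootstrap_indices(n, rng)
sub = subsample_indices(n, 0.8, rng)

# Samples never drawn by the bootstrap are "out of bag" and can serve as a
# built-in validation set; on average about 1/e (roughly 37%) of them are.
out_of_bag = set(range(n)) - set(boot)
```

The out-of-bag samples are what the oob_score option of random forests evaluates on.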
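1.11.2. Forests of randomized trees
The main advantages of random forests (RF) are:
1) Training can be highly parallelised, which is a speed advantage on the large samples of the big-data era. Personally I consider this the most important advantage.
2) Because the features used to split tree nodes can be chosen at random, the model can still be trained efficiently when the feature dimension is very high.
3) After training, the importance of each feature for the output can be reported.
4) Because of the random sampling, the trained model has low variance and generalises well.
5) Compared with AdaBoost and GBDT from the boosting family, RF is fairly simple to implement.
6) It is insensitive to some features being missing.
The main disadvantages of RF are:
1) On some noisy sample sets, RF models tend to overfit.
2) Features with many distinct split values tend to have a larger influence on the decisions of the RF, which can affect the quality of the fitted model.
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)
1.11.2.1. Random forests
1.11.2.2. Extremely randomized trees
1.11.2.3. Parameters
Main parameter differences between RF and a single decision tree (DT)
*n_estimators: the number of weak learners, i.e. the number of trees in the forest. In general, too small a value tends to underfit and too large a value costs too much computation; beyond a certain number, further increases improve the model very little, so a moderate value is usually chosen. The default is 10 in the signature above (newer scikit-learn versions use 100). Unlike boosting, a random forest has no learning_rate to tune together with it.
oob_score: whether to use out-of-bag samples to evaluate the model. The default is False. I recommend setting it to True, because the out-of-bag score reflects how well the fitted model generalises.
criterion: the measure used to evaluate features when a CART tree splits. Classification and regression models use different loss functions. The CART classification trees of a classification RF default to the Gini impurity "gini"; the alternative is information gain. The CART regression trees of a regression RF default to the mean squared error "mse"; the alternative is the absolute error "mae". The default is generally good enough.
1.11.2.4. Parallelization
1.11.2.5. Feature importance evaluation
1.11.2.6. Totally random trees embedding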

1.11.3. AdaBoost
class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
class sklearn.ensemble.AdaBoostRegressor(base_estimator=None, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)
base_estimator: the base (weak) estimator from which the boosted ensemble is built.
n_estimators: number of iterations, i.e. the number of weak learners.
learning_rate: step size; an advanced parameter, scaled within (0.0, 1.0], that limits overfitting by shrinking the contribution of each weak learner.
algorithm: the boosting algorithm to use ('SAMME' or 'SAMME.R' for the classifier).
estimators_ : list of classifiers: The collection of fitted sub-estimators.
estimator_weights_ : array of floats: Weights for each estimator in the boosted ensemble.
estimator_errors_ : array of floats: Error for each estimator in the boosted ensemble.
feature_importances_ : array of shape = [n_features]: The feature importances if supported by the base_estimator.
1.11.3.1. Usage
1.11.4. Gradient Tree Boosting
The advantages of GBRT are:
Natural handling of data of mixed type (= heterogeneous features)
Predictive power
Robustness to outliers in output space (via robust loss functions)
The disadvantages of GBRT are:
Scalability: due to the sequential nature of boosting it can hardly be parallelized.
1.11.4.1. Classification
class sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
1.11.4.2. Regression
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
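1) n_estimators: the maximum number of boosting iterations, i.e. the maximum number of weak learners. In general, too small a value tends to underfit and too large a value tends to overfit, so a moderate value is chosen. The default is 100. In practice n_estimators is usually tuned together with learning_rate, described next.
2) learning_rate: the weight shrinkage coefficient $\nu$ of each weak learner, also called the step size. With this regularization term the update of the strong learner is $f_k(x) = f_{k-1}(x) + \nu h_k(x)$, with $0 < \nu \le 1$. For the same fit on the training set, a smaller $\nu$ means more boosting iterations are needed. The step size and the maximum number of iterations together determine how well the algorithm fits, so n_estimators and learning_rate should be tuned together. In general, tuning can start from a fairly small $\nu$; the default is 0.1.
3) subsample: the subsampling used for regularization, with values in (0, 1]. Note that this subsampling differs from that of random forests: random forests sample with replacement, whereas here sampling is without replacement. A value of 1 uses all samples, i.e. no subsampling. A value below 1 means only part of the samples is used to fit each GBDT tree. A ratio below 1 reduces variance, i.e. guards against overfitting, but increases the bias of the fit, so the value should not be too low. Values in [0.5, 0.8] are recommended; the default is 1.0, i.e. no subsampling.
4) init: the initial weak learner, corresponding to $f_0(x)$. If not given, the training samples themselves are used to produce the initial classification or regression prediction; otherwise the learner supplied through init is used. It is typically used when there is prior knowledge about the data or some fitting has been done before; otherwise this parameter can be ignored.
5) loss: the loss function of the GBDT algorithm. Classification and regression models use different loss functions.
For classification there are the log-likelihood loss "deviance" and the exponential loss "exponential". The default is "deviance", which is generally recommended; it is well optimised for both binary and multiclass classification. The exponential loss effectively takes us back to the AdaBoost algorithm.
Binomial deviance ('deviance'): the negative binomial log-likelihood loss function for binary classification (provides probability estimates). The initial model is given by the log odds-ratio.
Multinomial deviance ('deviance'): the negative multinomial log-likelihood loss function for multi-class classification with n_classes mutually exclusive classes. It provides probability estimates. The initial model is given by the prior probability of each class. At each iteration n_classes regression trees have to be constructed, which makes GBRT rather inefficient for data sets with a large number of classes.
Exponential loss ('exponential'): the same loss function as AdaBoostClassifier. Less robust to mislabeled examples than 'deviance'; can only be used for binary classification.
For regression there are the squared error "ls", the absolute loss "lad", the Huber loss "huber" and the quantile loss "quantile". The default is "ls". In general, if the data has few noisy points, the default "ls" works well. With many noisy points the noise-resistant "huber" is recommended. "quantile" is used when quantile predictions of the target are needed.
Least squares ('ls'): the natural choice for regression due to its superior computational properties. The initial model is given by the mean of the target values.
Least absolute deviation ('lad'): a robust loss function for regression. The initial model is given by the median of the target values.
Huber ('huber'): another robust loss function that combines least squares and least absolute deviation; use alpha to control the sensitivity with regard to outliers (see [F2001] for details).
Quantile ('quantile'): a loss function for quantile regression. Use 0 < alpha < 1 to specify the quantile. This loss function can be used to create prediction intervals (see Prediction Intervals for Gradient Boosting Regression).
6) alpha: only GradientBoostingRegressor has this parameter. With the Huber loss "huber" or the quantile loss "quantile" the quantile must be specified. The default is 0.9; with many noisy points this value can be lowered somewhat.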
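The interplay between learning_rate and n_estimators can be seen in a deliberately tiny model. In the sketch below every weak learner is just a constant (the mean of the current residuals) and the loss is the squared error; this is an assumption made to keep the example short, not how GradientBoostingRegressor fits trees, and both function names are hypothetical.

```python
def boost_constant_learners(y, learning_rate, n_estimators):
    """Toy boosting for squared loss where each weak learner h_k is a constant
    (the mean of the current residuals): f_k = f_{k-1} + learning_rate * h_k."""
    prediction = 0.0                      # f_0
    for _ in range(n_estimators):
        residual_mean = sum(t - prediction for t in y) / len(y)   # h_k
        prediction += learning_rate * residual_mean
    return prediction

def rounds_needed(y, learning_rate, tol=1e-3):
    """Number of rounds until the prediction is within tol of the target mean."""
    target = sum(y) / len(y)
    k = 0
    while abs(boost_constant_learners(y, learning_rate, k) - target) > tol:
        k += 1
    return k

y = [1.0, 2.0, 3.0]                       # hypothetical targets, mean 2.0
```

Comparing rounds_needed(y, 1.0), rounds_needed(y, 0.5) and rounds_needed(y, 0.1) shows that a smaller step size needs more rounds to reach the same fit.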
1.11.4.3. Fitting additional weak learners
warm_start=True allows you to add more estimators to an already fitted model.
1.11.4.4. Controlling the tree size
1.11.4.5. Mathematical formulation
1.11.4.5.1. Loss functions
1.11.4.6. Regularization
1.11.4.6.1. Shrinkage
learning_rate
1.11.4.6.2. Subsampling
subsample
1.11.4.7. Interpretation
1.11.4.7.1. Feature importance
feature_importances_ : array of shape = [n_features]: The feature importances if supported by the base_estimator.
1.11.4.7.2. Partial dependence
from sklearn.ensemble.partial_dependence import plot_partial_dependence (????)

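1.11.5. Voting classifier
1.11.5.1. Majority class labels (majority/hard voting)
1.11.5.1.1. Usage
1.11.5.2. Weighted average probabilities (soft voting)
1.11.5.3. Using the VotingClassifier with grid search
1.11.5.3.1. Usage

1.12. Multiclass and multilabel algorithms
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC
1.12.1. Multilabel classification format
1.12.2. One-vs-the-rest
OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)
1.12.2.1. Multiclass learning
1.12.2.2. Multilabel learning
1.12.3. One-vs-one
OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)
1.12.3.1. Multiclass learning
1.12.4. Error-correcting output codes
1.12.4.1. Multiclass learning
1.12.5. Multioutput regression
1.12.6. Multioutput classification
1.12.7. Classifier chain

1.13. Feature selection
1.13.1. Removing features with low variance
1.13.2. Univariate feature selection
1.13.3. Recursive feature elimination
1.13.4. Feature selection using SelectFromModel
1.13.4.1. L1-based feature selection
1.13.4.2. Tree-based feature selection
1.13.5. Feature selection as part of a pipeline

1.14. Semi-supervised
1.14.1. Label propagation
1.15. Isotonic regression
1.16. Probability calibration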

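1.17. Neural network models (supervised)
Training (or learning) a neural network means using a learning algorithm to obtain the parameters the network needs to solve a given problem. Here the parameters include the connection weights between the neurons of adjacent layers as well as the biases.

As designers of the algorithm, we usually construct the network structure according to the practical problem, while the parameters are determined by the network iteratively searching for the best parameter set using training samples and a learning algorithm.

Among the learning algorithms for neural networks, the most outstanding and successful is the error backpropagation (BP) algorithm. BP is usually applied to the most widely used kind of network, the multi-layer feedforward neural network.

Deep learning refers to deep neural network models, generally network structures with three or more layers.
1.17.1. Multi-layer Perceptron
The advantages of the multi-layer perceptron are:
Capability to learn non-linear models.
Capability to learn models in real time (on-line learning) using partial_fit.
The disadvantages of the multi-layer perceptron (MLP) include:
MLP with hidden layers has a non-convex loss function with more than one local minimum. Different random weight initializations can therefore lead to different validation accuracy.
MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers and iterations.
MLP is sensitive to feature scaling.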
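1.17.2. Classification
class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
1.17.3. Regression
class sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
1.17.4. Regularization
1.17.5. Algorithms
1.17.6. Complexity
1.17.7. Mathematical formulation
1.17.8. Tips on practical use
The multi-layer perceptron is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1.
Note that the same scaling must be applied to the test set for meaningful results. You can use StandardScaler for standardization.
An alternative and recommended approach is to use StandardScaler in a Pipeline.
Finding a reasonable regularization parameter is best done using GridSearchCV, usually in the range 10.0 ** -np.arange(1, 7).
Empirically, L-BFGS has been observed to converge faster and with better solutions on small datasets.
For relatively large datasets, however, Adam is very robust. It usually converges quickly and gives pretty good performance.
SGD with momentum or Nesterov's momentum, on the other hand, can perform better than those two algorithms if the learning rate is correctly tuned.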
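The point about applying the same scaling to the test set is the one most often missed, so here is a minimal sketch of it in plain Python. The statistics are learned from the training column only and reused unchanged on the test column; the data and the helper names fit_scaler and transform are hypothetical, and the code only mimics what a standardizer does for a single feature.

```python
import statistics

def fit_scaler(column):
    """Learn mean and (population) standard deviation from training data only."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply a previously learned scaling; never refit it on test data."""
    return [(v - mean) / std for v in column]

train = [10.0, 20.0, 30.0, 40.0]     # hypothetical single feature
test = [25.0, 50.0]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)   # same mean/std as the training set

# Candidate regularisation strengths, mirroring 10.0 ** -np.arange(1, 7)
alphas = [10.0 ** -k for k in range(1, 7)]
```

Putting the scaler and the model into one Pipeline achieves the same thing automatically inside cross-validation.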
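1.17.9. More control with warm_start

2. Unsupervised learning
2.1. Gaussian mixture models
2.1.1. Gaussian Mixture
2.1.1.1. Pros and cons of class GaussianMixture
2.1.1.1.1. Pros
2.1.1.1.2. Cons
2.1.1.2. Selecting the number of components in a classical Gaussian mixture model
2.1.1.3. Estimation algorithm: expectation-maximization
2.1.2. Variational Bayesian Gaussian Mixture
2.1.2.1. Estimation algorithm: variational inference
2.1.2.2. Pros and cons of variational inference with BayesianGaussianMixture
2.1.2.2.1. Pros
2.1.2.2.2. Cons
2.1.2.3. The Dirichlet process
2.2. Manifold learning
2.2.1. Introduction
2.2.2. Isomap
2.2.2.1. Complexity
2.2.3. Locally Linear Embedding
2.2.3.1. Complexity
2.2.4. Modified Locally Linear Embedding
2.2.4.1. Complexity
2.2.5. Hessian Eigenmapping
2.2.5.1. Complexity
2.2.6. Spectral Embedding
2.2.6.1. Complexity
2.2.7. Local Tangent Space Alignment
2.2.7.1. Complexity
2.2.8. Multi-dimensional Scaling (MDS)
2.2.8.1. Metric MDS
2.2.8.2. Nonmetric MDS
2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)
2.2.9.1. Optimizing t-SNE
2.2.9.2. Barnes-Hut t-SNE
2.2.10. Tips on practical use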

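2.3. Clustering
2.3.1. Overview of clustering methods
2.3.2. K-means
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
Parameters:
n_clusters: number of centroids.
init: 'k-means++' initializes the centroids to be (generally) distant from each other, which leads to better results than random initialization. 'random' chooses the initial centroids at random. An ndarray of shape (n_clusters, n_features) can be passed to give the initial centroids explicitly.
n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
max_iter: maximum number of iterations of the k-means algorithm for a single run.
precompute_distances: one of 'auto', True or False. Precomputing distances is faster but takes more memory. 'auto': do not precompute distances if the number of samples times the number of clusters exceeds 12 million. True: always precompute distances. False: never precompute distances.
tol: float, default 1e-4; used together with inertia to declare convergence.
n_jobs: int. The number of processes used for the computation; internally the n_init runs are computed in parallel. If -1, all CPUs are used. If 1, no parallel computation is done at all, which is convenient for debugging. If below -1, (n_cpus + 1 + n_jobs) CPUs are used, so n_jobs=-2 uses all CPUs but one.
random_state: int or numpy.RandomState, optional; the generator used to initialize the centroids. An integer fixes the seed. By default the global numpy random number generator is used.
verbose: int, default 0.
copy_x: bool, default True. When precomputing distances it is more accurate to center the data first. If copy_x is True, the original data is not modified. If False, the original data is modified in place and restored before the function returns; because the data mean is subtracted and then added back during the computation, the returned data may differ slightly from the original.
Attributes:
cluster_centers_: array of shape [n_clusters, n_features], the centroids.
labels_: the cluster label of each point.
inertia_: float, the sum of squared distances of the points to the centroid of their cluster.
Methods:
fit_transform(X[, y]): compute the clustering and transform X to cluster-distance space.
transform(X[, y]): transform X to cluster-distance space.
get_params([deep]): get the parameters of this estimator.
set_params(**params): set the parameters of this estimator.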

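Drawbacks:
Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters or to manifolds with irregular shapes.
Inertia is not a normalized metric: we only know that lower values are better and that zero is optimal. In very high-dimensional spaces, however, Euclidean distances tend to become inflated (an instance of the so-called "curse of dimensionality"). Running a dimensionality reduction algorithm such as PCA before k-means clustering can alleviate this problem and speed up the computation.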
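Inertia is simple enough to compute by hand. The following plain-Python sketch assigns each point to its closest centre and sums the squared distances; the points and centres are made up, and this is only an illustration of the quantity, not the KMeans implementation.

```python
def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def assign_labels(X, centers):
    """Index of the closest centre for every sample."""
    return [min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
            for x in X]

def inertia(X, centers):
    """Sum of squared distances from each sample to its closest centre."""
    return sum(min(squared_distance(x, c) for c in centers) for x in X)

X = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
     [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]]   # hypothetical data
centers = [[1.0, 2.0], [10.0, 2.0]]           # hypothetical centroids
```

Because the value is an unnormalized sum, it grows with the number of samples and with the scale of the features, which is why it is only comparable between runs on the same data.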
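2.3.2.1. Mini Batch K-Means
class sklearn.cluster.MiniBatchKMeans(n_clusters=8, init='k-means++', max_iter=100, batch_size=100, verbose=0, compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10, init_size=None, n_init=3, reassignment_ratio=0.01)
Mini-batches are subsets of the input data, randomly sampled in each training iteration.
2.3.3. Affinity Propagation
2.3.4. Mean Shift
2.3.5. Spectral clustering
2.3.5.1. Different label assignment strategies
2.3.6. Hierarchical clustering
2.3.6.1. Different linkage types: Ward, complete and average linkage
class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', pooling_func=<function mean>)
AgglomerativeClustering: uses a bottom-up clustering approach.
linkage : {"ward", "complete", "average"}, three merging criteria. complete (maximum) linkage: the distance between two clusters is the distance between their farthest points. average linkage: the average distance. Ward's method: aims at the smallest within-cluster sum of squares and the largest between-cluster sum of squares.
affinity: string or callable, default "euclidean". The metric used to compute the linkage. Can be "euclidean", "l1", "l2", "manhattan", "cosine" or "precomputed". If linkage is "ward", only "euclidean" is accepted.
When affinity is not the Euclidean metric, average linkage is recommended. The l1 distance is often good for sparse features or sparse noise, i.e. when many of the features are zero, as in text mining using occurrences of rare words.
The cosine distance is interesting because it is invariant to global scalings of the signal.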
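The difference between complete and average linkage comes down to how the pairwise distances between two clusters are summarized. A minimal plain-Python sketch with two hypothetical clusters and the Euclidean distance (the function names are made up):

```python
import math
from itertools import product

def cross_distances(cluster_a, cluster_b):
    """All Euclidean distances between a point of one cluster and a point of the other."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def complete_linkage(cluster_a, cluster_b):
    """Distance between the two farthest points of the clusters."""
    return max(cross_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances between the clusters."""
    d = cross_distances(cluster_a, cluster_b)
    return sum(d) / len(d)

a = [[0.0, 0.0], [0.0, 1.0]]     # hypothetical clusters
b = [[3.0, 0.0], [4.0, 0.0]]
```

At each step agglomerative clustering merges the pair of clusters for which the chosen linkage value is smallest.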
2.3.6.2. Adding connectivity constraints
2.3.6.3. Varying the metric
2.3.7. DBSCAN
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=1)
eps : float, optional. The maximum distance between two samples for them to be considered as in the same neighborhood.
min_samples : int, optional. The minimum number of samples in a neighborhood for a point to be considered a core point.
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.
New in version 0.17: metric precomputed to accept precomputed sparse matrix.
metric_params : dict, optional
Additional keyword arguments for the metric function.
New in version 0.19.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
leaf_size : int, optional (default = 30). Leaf size passed to BallTree or cKDTree.
p : float, optional. The power of the Minkowski metric to be used to calculate distance between points.
n_jobs : int, optional (default = 1). The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.
Attributes:
core_sample_indices_ : array, shape = [n_core_samples]. Indices of core samples.
components_ : array, shape = [n_core_samples, n_features]. Copy of each core sample found by training.
labels_ : array, shape = [n_samples]. Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
2.3.8. Birch

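2.3.9. Clustering performance evaluation
2.3.9.1. Adjusted Rand index
2.3.9.1.1. Advantages
2.3.9.1.2. Drawbacks
2.3.9.1.3. Mathematical formulation
2.3.9.2. Mutual Information based scores
2.3.9.2.1. Advantages
2.3.9.2.2. Drawbacks
2.3.9.2.3. Mathematical formulation
2.3.9.3. Homogeneity, completeness and V-measure
2.3.9.3.1. Advantages
2.3.9.3.2. Drawbacks
2.3.9.3.3. Mathematical formulation
2.3.9.4. Fowlkes-Mallows scores
2.3.9.4.1. Advantages
2.3.9.4.2. Drawbacks
2.3.9.5. Silhouette Coefficient
2.3.9.5.1. Advantages
2.3.9.5.2. Drawbacks
2.3.9.6. Calinski-Harabaz Index
2.3.9.6.1. Advantages
2.3.9.6.2. Drawbacks
2.4. Biclustering
2.4.1. Spectral Co-Clustering
2.4.1.1. Mathematical formulation
2.4.2. Spectral Biclustering
2.4.2.1. Mathematical formulation
2.4.3. Biclustering evaluation

2.5. Decomposing signals in components (matrix factorization problems)
2.5.1. Principal component analysis (PCA)
2.5.1.1. Exact PCA and probabilistic interpretation
2.5.1.2. Incremental PCA
2.5.1.3. PCA using randomized SVD
2.5.1.4. Kernel PCA
2.5.1.5. Sparse principal components analysis (SparsePCA and MiniBatchSparsePCA)
2.5.2. Truncated singular value decomposition and latent semantic analysis
2.5.3. Dictionary learning
2.5.3.1. Sparse coding with a precomputed dictionary
2.5.3.2. Generic dictionary learning
2.5.3.3. Mini-batch dictionary learning
2.5.4. Factor analysis
2.5.5. Independent component analysis (ICA)
2.5.6. Non-negative matrix factorization (NMF or NNMF)
2.5.6.1. NMF with the Frobenius norm
2.5.6.2. NMF with a beta-divergence
2.5.7. Latent Dirichlet Allocation (LDA)

2.6. Covariance estimation
2.6.1. Empirical covariance
2.6.2. Shrunk covariance
2.6.2.1. Basic shrinkage
2.6.2.2. Ledoit-Wolf shrinkage
2.6.2.3. Oracle Approximating Shrinkage
2.6.3. Sparse inverse covariance
2.6.4. Robust covariance estimation
2.6.4.1. Minimum Covariance Determinant

2.7. Novelty and outlier detection
2.7.1. Novelty detection
2.7.2. Outlier detection
2.7.2.1. Fitting an elliptic envelope
2.7.2.2. Isolation Forest
2.7.2.4. One-class SVM versus Elliptic Envelope versus Isolation Forest versus LOF

2.8. Density estimation
2.8.1. Density estimation: histograms
2.8.2. Kernel density estimation

2.9. Neural network models (unsupervised)
2.9.1. Restricted Boltzmann machines
2.9.1.1. Graphical model and parametrization
2.9.1.2. Bernoulli Restricted Boltzmann machines
2.9.1.3. Stochastic Maximum Likelihood learning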
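3. Model selection and evaluation
3.1. Cross-validation: evaluating estimator performance
sklearn.model_selection.train_test_split(*arrays, **options)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
*arrays : sequence of indexables with same length / shape[0]. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size : if float, should be between 0.0 and 1.0 and is the proportion of the data used for the test split. If int, it is the absolute number of test samples. The default is 0.25.
train_size : analogous to test_size, for the training split.
random_state : if given, fixes the random seed, so the same value always produces the same split.
shuffle : whether to shuffle the samples before splitting. If shuffle=False, stratify must be None.
stratify : array-like or None (default is None). If not None, data is split in a stratified fashion, using this as the class labels.
3.1.1. Computing cross-validated metrics
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
estimator : the estimator to fit.
X : the data to fit.
y : the target values.
scoring : string, callable or None, default None. A callable must have the signature scorer(estimator, X, y). By default the estimator's own score method is used; to choose another metric, pass for example scoring='f1_macro'.
cv : cross-validation generator or iterable. Possible values: None, to use the default 3-fold cross-validation; an integer, to specify the number of folds k; an object to be used as a cross-validation generator; an iterable yielding train/test splits.
For integer/None inputs, if the estimator is a classifier and y is the corresponding class labels, StratifiedKFold is used. In all other cases KFold is used.
from sklearn import datasets, svm
from sklearn.model_selection import ShuffleSplit, cross_val_score
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
cross_val_score(clf, iris.data, iris.target, cv=cv)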
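3.1.1.1. The cross_validate function and multiple metric evaluation
sklearn.model_selection.cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score=True)
The returned result contains the keys ['test_score', 'fit_time', 'score_time'].
test_score can cover several scoring methods, for example recall and precision, passed to the scoring parameter as a dictionary or a list.
With return_train_score=True the scores on the training set are returned as well; with False they are not. The default is True.
3.1.1.2. Obtaining predictions by cross-validation
sklearn.model_selection.cross_val_predict(estimator, X, y=None, groups=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', method='predict')
It returns the predictions without any score; call a scoring function yourself to evaluate them, for example:
from sklearn import datasets, metrics, svm
from sklearn.model_selection import cross_val_predict
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted)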
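3.1.2. Cross validation iterators
3.1.3. Cross-validation iterators for i.i.d. data
3.1.3.1. K-fold
class sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
KFold splits the data into n_splits consecutive folds (without shuffling by default).
split(X[, y, groups]) Generate indices to split data into training and test set. Iterate over it in a loop to obtain the indices of each fold.
class sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=False, random_state=None)
StratifiedKFold splits the data so that each fold preserves the proportion of samples of every class.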
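How the folds are laid out is easiest to see in code. The sketch below generates unshuffled k-fold indices in plain Python, giving the first n_samples % n_splits folds one extra sample; it is an illustration under that assumption, not the KFold class itself, and kfold_indices is a hypothetical name.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds.
    The first n_samples % n_splits folds receive one extra sample."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

splits = list(kfold_indices(n_samples=7, n_splits=3))
```

Every sample lands in exactly one test fold, so each sample is used for validation once and for training n_splits - 1 times.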
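3.1.3.2. Repeated K-Fold
class sklearn.model_selection.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)
3.1.3.3. Leave One Out (LOO)
class sklearn.model_selection.LeaveOneOut
LeaveOneOut() = KFold(n_splits=n) = LeavePOut(p=1)
Useful when data is scarce (small datasets).
3.1.3.4. Leave P Out (LPO)
class sklearn.model_selection.LeavePOut(p)
LeavePOut is very similar to LeaveOneOut: it creates all possible training/test sets by removing p samples from the complete set.
3.1.3.5. Random permutations cross-validation a.k.a. Shuffle & Split
class sklearn.model_selection.ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)
The randomness can be controlled, for reproducible results, by explicitly seeding the random_state pseudo-random number generator.
3.1.4. Cross-validation iterators with stratification based on class labels
3.1.4.1. Stratified k-fold
3.1.4.2. Stratified Shuffle Split
3.1.5. Cross-validation iterators for grouped data
3.1.5.1. Group k-fold
model_selection.GroupKFold([n_splits]) K-fold iterator variant with non-overlapping groups.
3.1.5.2. Leave One Group Out
class sklearn.model_selection.LeaveOneGroupOut() Leave One Group Out cross-validator
3.1.5.3. Leave P Groups Out
class sklearn.model_selection.LeavePGroupsOut(n_groups) Leave P Group(s) Out cross-validator
3.1.5.4. Group Shuffle Split
model_selection.GroupShuffleSplit([…]) Shuffle-Group(s)-Out cross-validation iterator
3.1.6. Predefined fold-splits / validation-sets
model_selection.PredefinedSplit(test_fold) Predefined split cross-validator
3.1.7. Cross validation of time series data
KFold assumes that the samples are independent of each other, which does not hold for time-ordered samples, so a dedicated splitter is needed.
3.1.7.1. Time Series Split
class sklearn.model_selection.TimeSeriesSplit(n_splits=3, max_train_size=None) Time Series cross-validator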
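The idea of a time-ordered split is that the training window only ever contains samples that precede the test window. A minimal plain-Python sketch with an expanding training window follows; the sizing rule is an assumption chosen for illustration, it requires n_samples > n_splits, and it is not the TimeSeriesSplit source.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train, test) index lists where the training window only ever
    contains samples that come before the test window."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))

splits = list(time_series_splits(n_samples=6, n_splits=3))
```

Unlike k-fold, later splits are supersets of earlier training sets, and no future sample ever leaks into training.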
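3.1.8. A note on shuffling
3.1.9. Cross validation and model selection

3.2. Tuning the hyper-parameters of an estimator
A common manual strategy is in fact a greedy algorithm: tune the parameter that currently has the largest effect on the model until it is optimal, then tune the next most influential parameter, and so on until all parameters have been tuned. The drawback is that it may end at a local rather than the global optimum, but it saves so much time and effort that it is worth trying; the result can be improved further with bagging later on.
3.2.1. Exhaustive grid search
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)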
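It helps to know how many candidates a grid such as tuned_parameters expands to before launching a search. The sketch below enumerates them in plain Python with itertools.product; expand_grid is a hypothetical helper written for this note, and the count of fits assumes one fit per candidate per fold.

```python
from itertools import product

def expand_grid(param_grid):
    """Expand a list of {name: [values]} dicts into individual settings."""
    for grid in param_grid:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))

tuned_parameters = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]
candidates = list(expand_grid(tuned_parameters))

cv = 5
n_fits = len(candidates) * cv      # one fit per candidate per fold (before any refit)
```

Each dict in the list is expanded separately, so gamma is only combined with the rbf kernel and never with the linear one.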
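Parameters:
verbose: verbosity of the log, an int. 0: no output during training; 1: occasional output; >1: output for every sub-model.
n_jobs: number of parallel jobs, an int. -1: as many as there are CPU cores; 1: the default.
pre_dispatch: the total number of parallel jobs that are dispatched. When n_jobs is greater than 1, the data is copied for every job, which can cause out-of-memory errors; setting pre_dispatch limits the number of dispatched jobs, so the data is copied at most pre_dispatch times.
Attributes: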
cv_results_：
{
'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],mask = [False False False False]...)
'param_gamma': masked_array(data = [-- -- 0.1 0.2],mask = [ True True False False]...),
'param_degree': masked_array(data = [2.0 3.0 -- --], mask = [False False True True]...),
'split0_test_score' : [0.8, 0.7, 0.8, 0.9],
'split1_test_score' : [0.82, 0.5, 0.7, 0.78],
'mean_test_score' : [0.81, 0.60, 0.75, 0.82],
'std_test_score' : [0.02, 0.01, 0.03, 0.03],
'rank_test_score' : [2, 4, 3, 1],
'split0_train_score' : [0.8, 0.9, 0.7],
'split1_train_score' : [0.82, 0.5, 0.7],
'mean_train_score' : [0.81, 0.7, 0.7],
'std_train_score' : [0.03, 0.03, 0.04],
'mean_fit_time' : [0.73, 0.63, 0.43, 0.49],
'std_fit_time' : [0.01, 0.02, 0.01, 0.01],
'mean_score_time' : [0.007, 0.06, 0.04, 0.04],
'std_score_time' : [0.001, 0.002, 0.003, 0.005],
'params' : [{'kernel': 'poly', 'degree': 2}, ...],
}
best_estimator_ :
best_score_ :
best_params_ :
best_index_ :
scorer_ :
n_splits_ :
3.2.2. Randomized parameter optimization
class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch=‘2*n_jobs’, random_state=None, error_score=’raise’, return_train_score=True)
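3.2.3. Tips for parameter search
3.2.3.1. Specifying an objective metric
scoring
3.2.3.2. Specifying multiple metrics for evaluation
refit: defaults to True. Once the search has finished, the estimator is fit one more time on the entire dataset passed to fit (the training and validation folds together) using the best parameters found by cross-validation. This refit estimator is the final model used for prediction and performance evaluation.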
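# scoring was not defined in the original snippet; this dict is an assumed definition.
# refit must name one of its keys.
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid={'min_samples_split': range(2, 403, 10)},
                  scoring=scoring, cv=5, refit='AUC')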
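When multiple metrics are specified, the refit parameter must be set to the name (string) of the metric for which best_params_ will be found and used to build best_estimator_ on the whole dataset. If the search should not be refit, set refit=False. Leaving refit at its default value raises an error when multiple metrics are used.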
3.2.3.3. Composite estimators and parameter spaces
Using Pipeline together with GridSearchCV:
pipe = Pipeline([('reduce_dim', PCA()),('classify', LinearSVC())])
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [{'reduce_dim': [PCA(iterated_power=7), NMF()],'reduce_dim__n_components': N_FEATURES_OPTIONS,'classify__C': C_OPTIONS},
{'reduce_dim': [SelectKBest(chi2)],'reduce_dim__k': N_FEATURES_OPTIONS,'classify__C': C_OPTIONS}]
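# 'reduce_dim' refers to the PCA step, so reduce_dim__n_components sets PCA's n_components; 'classify' works the same way.
# Parameter names must match the pipeline's step names: <step name>__<parameter name>, with a double underscore.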
grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
3.2.3.4. Model selection: development and evaluation
3.2.3.5. Parallelism
n_jobs=-1
n_jobs : int, default=1 Number of jobs to run in parallel.
3.2.3.6. Robustness to failure
error_score=0 (or =np.NaN)
error_score : ‘raise’ (default) or numeric
Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
3.2.4. Alternatives to brute force parameter search
3.2.4.1. Model specific cross-validation
With this pipeline and RidgeCV there is no need to wrap linear_model.Ridge() in GridSearchCV:
Pipeline([
    ('poly', PolynomialFeatures()),
    ('linear', RidgeCV(alphas=np.logspace(-3, 2, 50), fit_intercept=False))])
3.2.4.1.1. sklearn.linear_model.ElasticNetCV: Elastic Net model with iterative fitting along a regularization path
class sklearn.linear_model.ElasticNetCV(l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute=’auto’, max_iter=1000, tol=0.0001, cv=None, copy_X=True, verbose=0, n_jobs=1, positive=False, random_state=None, selection=’cyclic’)
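Regularization term: to keep the model from overfitting, a regularization term is usually added to the loss function, which improves the model's ability to generalize.
Loss function: $J(\theta) = \frac{1}{2m}(X\theta - Y)^T(X\theta - Y) + \alpha\rho\|\theta\|_1 + \frac{\alpha(1-\rho)}{2}\|\theta\|_2^2$, where α is the regularization hyper-parameter and ρ is the hyper-parameter that weights the two norms.
alphas=np.logspace(-3, 2, 50), l1_ratio=[.1, .5, .7, .9, .95, .99, 1]: ElasticNetCV picks the best α and ρ from these candidates.
ElasticNetCV cross-validates over the hyper-parameters α (alpha) and ρ (l1_ratio) to help choose suitable values.
When to use: ElasticNetCV is worth considering when Lasso regularizes too aggressively (too many coefficients are shrunk to 0) while Ridge does not regularize enough (coefficients decay too slowly).
ElasticNet is a linear regression model trained with both L1 and L2 priors as regularizers. This combination allows learning a sparse model in which only a few weights are non-zero, like Lasso, while still keeping the regularization properties of Ridge. The l1_ratio parameter controls the convex combination (a special kind of linear combination) of L1 and L2.
Elastic net is very useful when several features are correlated with one another. Lasso tends to pick one of them at random, whereas elastic net tends to pick both.
In practice, one advantage of trading off between Lasso and Ridge is that elastic net inherits some of Ridge's stability under rotation.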
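A minimal sketch of how ElasticNetCV selects α and ρ from the candidate lists quoted above. The synthetic regression data from make_regression and its settings are assumptions for illustration; on such data the solver may print convergence warnings for the smallest alphas.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic data (an assumption): 20 features, 5 of them informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

l1_ratios = [.1, .5, .7, .9, .95, .99, 1]
alphas = np.logspace(-3, 2, 50)

# Cross-validate every (l1_ratio, alpha) pair and keep the best one.
model = ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5)
model.fit(X, y)

print("best alpha:", model.alpha_)
print("best l1_ratio:", model.l1_ratio_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```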

3.2.4.1.2. sklearn.linear_model.LarsCV: cross-validated Least Angle Regression model
class sklearn.linear_model.LarsCV(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute=’auto’, cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True, positive=False)

3.2.4.1.3. sklearn.linear_model.LassoCV: Lasso linear model with iterative fitting along a regularization path (coordinate descent)
class sklearn.linear_model.LassoCV(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute=’auto’, max_iter=1000, tol=0.0001, copy_X=True, cv=None, verbose=False, n_jobs=1, positive=False, random_state=None, selection=’cyclic’)
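Loss function: $J(\theta) = \frac{1}{2m}(X\theta - Y)^T(X\theta - Y) + \alpha\|\theta\|_1$ (the LinearRegression loss plus an L1 regularization term, i.e. the 1-norm)
Lasso regression shrinks some feature coefficients and even sets coefficients with a small absolute value exactly to 0, which improves the model's ability to generalize.
When to use: for high-dimensional feature data, especially when the linear relationship is sparse, use Lasso regression. When the goal is to pick out the main features from a large set, Lasso is the first choice.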
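To illustrate the sparsity effect described above, the sketch below fits LassoCV on synthetic data in which only a few features matter (the data and its settings are assumptions) and counts how many coefficients end up exactly at zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data (an assumption): 30 features, only 4 of which influence y.
X, y = make_regression(n_samples=120, n_features=30, n_informative=4,
                       noise=1.0, random_state=0)

# LassoCV chooses alpha by cross-validation along its regularization path.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

n_zero = int(np.sum(lasso.coef_ == 0))
n_nonzero = int(np.sum(lasso.coef_ != 0))
print("chosen alpha:", lasso.alpha_)
print("coefficients set exactly to zero:", n_zero, "of", lasso.coef_.size)
```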
3.2.4.1.3.1. Examples using sklearn.linear_model.LassoCV

3.2.4.1.4. sklearn.linear_model.LassoLarsCV: cross-validated Lasso using the LARS algorithm
class sklearn.linear_model.LassoLarsCV(fit_intercept=True, verbose=False, max_iter=500, normalize=True, precompute=’auto’, cv=None, max_n_alphas=1000, n_jobs=1, eps=2.2204460492503131e-16, copy_X=True, positive=False)
3.2.4.1.4.1. Examples using sklearn.linear_model.LassoLarsCV

3.2.4.1.5. sklearn.linear_model.LogisticRegressionCV: Logistic Regression CV (a.k.a. logit, MaxEnt) classifier
class sklearn.linear_model.LogisticRegressionCV(Cs=10, fit_intercept=True, cv=None, dual=False, penalty=’l2’, scoring=None, solver=’lbfgs’, tol=0.0001, max_iter=100, class_weight=None, n_jobs=1, verbose=0, refit=True, intercept_scaling=1.0, multi_class=’ovr’, random_state=None)
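Cs: the candidate values of C, the inverse of regularization strength (an int means that many values are chosen on a logarithmic scale between 1e-4 and 1e4); the remaining parameters follow LogisticRegression.
3.2.4.1.6. sklearn.linear_model.MultiTaskElasticNetCV: multi-task L1/L2 ElasticNet with built-in cross-validation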

3.2.4.1.7. sklearn.linear_model.MultiTaskLassoCV: multi-task L1/L2 Lasso with built-in cross-validation

3.2.4.1.8. sklearn.linear_model.OrthogonalMatchingPursuitCV: cross-validated Orthogonal Matching Pursuit (OMP) model
3.2.4.1.8.1. Examples using sklearn.linear_model.OrthogonalMatchingPursuitCV

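3.2.4.1.9. sklearn.linear_model.RidgeCV: Ridge regression with built-in cross-validation
Loss function of Ridge regression: $J(\theta) = \frac{1}{2}(X\theta - Y)^T(X\theta - Y) + \frac{1}{2}\alpha\|\theta\|_2^2$ (the LinearRegression loss plus an L2 regularization term, i.e. the squared 2-norm)
α is a hyper-parameter. With alphas=np.logspace(-3, 2, 50), RidgeCV picks the best value from the given candidates. logspace creates a geometric sequence; here it starts at 10^-3, ends at 10^2 and has 50 elements, and the best hyper-parameter is chosen among these 50 values.
linspace, by contrast, creates an arithmetic sequence.
Relationship between α and the coefficients θ in Ridge regression: the larger α is, the more heavily the regularization term penalizes and the smaller the coefficients θ become, eventually approaching 0.
The smaller α is, i.e. the weaker the regularization term, the closer θ gets to the ordinary linear regression coefficients.
When to use: if the relationship in the data is roughly linear but LinearRegression does not fit well and regularization is needed, consider RidgeCV.
If the input features are high-dimensional and the linear relationship is sparse, RidgeCV is less suitable; consider the Lasso family instead.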
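The points above about logspace, linspace and the shrinking effect of α can be checked with a short sketch. The synthetic data and the four α values used for the shrinkage check are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV

alphas = np.logspace(-3, 2, 50)   # geometric: 1e-3 ... 1e2, constant ratio
steps = np.linspace(-3, 2, 50)    # arithmetic: -3 ... 2, constant difference

# Synthetic regression data (an assumption).
X, y = make_regression(n_samples=80, n_features=10, noise=10.0, random_state=0)

# The norm of the coefficient vector shrinks as alpha grows.
test_alphas = (0.01, 1, 100, 10000)
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in test_alphas]
print("||theta|| for increasing alpha:", np.round(norms, 2))

# RidgeCV selects one of the 50 candidates by cross-validation.
best = RidgeCV(alphas=alphas).fit(X, y)
print("alpha chosen by RidgeCV:", best.alpha_)
```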
3.2.4.1.9.1. Examples using sklearn.linear_model.RidgeCV

3.2.4.1.10. sklearn.linear_model.RidgeClassifierCV: Ridge classifier with built-in cross-validation

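3.2.4.2. Information criterion
3.2.4.2.1. sklearn.linear_model.LassoLarsIC: Lasso model fit with Lars, using the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) for model selection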
class sklearn.linear_model.LassoLarsIC(criterion=’aic’/'bic', fit_intercept=True, verbose=False, normalize=True, precompute=’auto’, max_iter=500, eps=2.2204460492503131e-16, copy_X=True, positive=False)
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
3.2.4.2.1.1. Examples using sklearn.linear_model.LassoLarsIC

3.2.4.3. Out-of-bag estimates
3.2.4.3.1. sklearn.ensemble.RandomForestClassifier: a random forest classifier
3.2.4.3.1.1. Examples using sklearn.ensemble.RandomForestClassifier

3.2.4.3.2. sklearn.ensemble.RandomForestRegressor: a random forest regressor
3.2.4.3.2.1. Examples using sklearn.ensemble.RandomForestRegressor

3.2.4.3.3. sklearn.ensemble.ExtraTreesClassifier: an extra-trees classifier
class sklearn.ensemble.ExtraTreesClassifier(n_estimators=10, criterion=’gini’, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=False, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
3.2.4.3.3.1. Examples using sklearn.ensemble.ExtraTreesClassifier

3.2.4.3.4. sklearn.ensemble.ExtraTreesRegressor: an extra-trees regressor
3.2.4.3.4.1. Examples using sklearn.ensemble.ExtraTreesRegressor

3.2.4.3.5. sklearn.ensemble.GradientBoostingClassifier: Gradient Boosting for classification
3.2.4.3.5.1. Examples using sklearn.ensemble.GradientBoostingClassifier

3.2.4.3.6. sklearn.ensemble.GradientBoostingRegressor: Gradient Boosting for regression
3.2.4.3.6.1. Examples using sklearn.ensemble.GradientBoostingRegressor

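3D plot
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Evaluate z = sin(sqrt(x^2 + y^2)) on a regular grid.
x = np.arange(-4, 4, 0.1)
y = np.arange(-4, 4, 0.1)
x, y = np.meshgrid(x, y)
R = np.sqrt(x**2 + y**2)
z = np.sin(R)
ax.plot_surface(x, y, z, rstride=1, cstride=1, cmap='hot')
plt.show()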

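3.3. Model evaluation: quantifying the quality of predictions
3.3.1. The scoring parameter: defining model evaluation rules
Specified through the scoring parameter.
3.3.1.1. Common cases: predefined values
3.3.1.2. Defining your scoring strategy from metric functions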
sklearn.metrics.make_scorer(score_func, greater_is_better=True, needs_proba=False, needs_threshold=False, **kwargs)
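from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
# Wrap fbeta_score (with beta=2) so it can be passed as a scoring object.
ftwo_scorer = make_scorer(fbeta_score, beta=2)
grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)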
3.3.1.3. Implementing your own scoring object
3.3.1.4. Using multiple metric evaluation
scoring = ['accuracy', 'precision']
from sklearn.metrics import accuracy_score
scoring = {'accuracy': make_scorer(accuracy_score),
           'prec': 'precision'}
3.3.2. Classification metrics
3.3.2.1. From binary to multiclass and multilabel
3.3.2.2. Accuracy score
3.3.2.3. Cohen's kappa
3.3.2.4. Confusion matrix
sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
3.3.2.5. Classification report
sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)
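3.3.2.6. Hamming loss
3.3.2.7. Jaccard similarity coefficient score
3.3.2.8. Precision, recall and F-measures
The classification accuracy score is the fraction of all samples that are classified correctly. Accuracy is an easy-to-understand measure of a classifier, but it says nothing about the underlying distribution of the response values, nor about the kinds of errors the classifier makes.
Signature: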
sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
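normalize: defaults to True, which returns the fraction of correctly classified samples; if False, returns the number of correctly classified samples.
sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
Parameter average: string, one of [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']
When a binary metric is extended to multiclass or multilabel problems, the data is treated as a collection of binary problems, one per class. The binary metric can then be averaged across the classes in several ways, each of which is useful in some scenario. The average parameter selects the method.
macro: calculates the plain mean of the binary metrics, giving equal weight to each class. Where infrequent classes are nonetheless important, macro-averaging can be a way to highlight their performance. On the other hand, the assumption that all classes are equally important is often untrue, so macro-averaging over-emphasizes the typically low performance on an infrequent class.
weighted: accounts for class imbalance by averaging the binary metrics with each class's score weighted by its presence in the true data.
micro: gives each sample-class pair an equal contribution to the overall metric (except as a result of sample_weight). Rather than summing the metric per class, it sums the dividends and divisors that make up the per-class metrics and computes one overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
samples: applies only to multilabel problems. It does not compute a per-class measure; instead it computes the metric over the true and predicted classes of each sample in the evaluation data and returns their (sample_weight-weighted) average.
average=None: returns an array containing the score for each class.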
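The averaging modes above can be compared on a tiny three-class example. The label vectors are made up for illustration; per-class recall here is 3/4, 1/2 and 1/1.

```python
from sklearn.metrics import recall_score

# Toy labels (an assumption): class 0 has 4 samples, class 1 has 2, class 2 has 1.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

per_class = recall_score(y_true, y_pred, average=None)        # one recall per class
macro = recall_score(y_true, y_pred, average="macro")         # unweighted mean of per_class
weighted = recall_score(y_true, y_pred, average="weighted")   # mean weighted by class support
micro = recall_score(y_true, y_pred, average="micro")         # pooled TP / (TP + FN)

print("per class:", per_class)
print("macro: %.3f  weighted: %.3f  micro: %.3f" % (macro, weighted, micro))
```

For recall specifically, the weighted and micro averages coincide (both equal 5/7 here), because weighting each class's recall by its support pools the true positives over all samples. The two modes differ for precision and F-measures.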
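3.3.2.8.1. Binary classification
Metrics restricted to binary, single-label classification:
matthews_corrcoef(y_true, y_pred[, ...]): compute the Matthews correlation coefficient (MCC) for binary classes
precision_recall_curve(y_true, probas_pred): compute precision-recall pairs at different probability thresholds, forming a curve
roc_curve(y_true, y_score[, pos_label, ...]): compute the ROC curve
Metrics that work for binary and multilabel problems:
average_precision_score(y_true, y_score[, ...]): compute average precision (AP) from prediction scores
roc_auc_score(y_true, y_score[, average, ...]): compute the area under the ROC curve (AUC) from prediction scores
3.3.2.8.2. Multiclass and multilabel classification
Metrics for multiclass problems only:
cohen_kappa_score(y1, y2[, labels, weights])
confusion_matrix(y_true, y_pred[, labels, ...])
hinge_loss(y_true, pred_decision[, labels, ...])
Metrics for multiclass problems that also work for multilabel problems:
accuracy_score(y_true, y_pred[, normalize, ...])
classification_report(y_true, y_pred[, ...])
f1_score(y_true, y_pred[, labels, ...])
fbeta_score(y_true, y_pred, beta[, labels, ...])
hamming_loss(y_true, y_pred[, labels, ...])
jaccard_similarity_score(y_true, y_pred[, ...])
log_loss(y_true, y_pred[, eps, normalize, ...])
zero_one_loss(y_true, y_pred[, normalize, ...])
precision_recall_fscore_support(y_true, y_pred)
3.3.2.9. Hinge loss
3.3.2.10. Log loss
3.3.2.11. Matthews correlation coefficient
3.3.2.12. Receiver operating characteristic (ROC)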
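The ROC curve (receiver operating characteristic curve) is a composite indicator of sensitivity and specificity over a continuous variable. It shows the relationship between sensitivity and specificity graphically: a series of different cut-off values is applied to the continuous variable, and sensitivity and specificity are computed for each. The curve is drawn from this series of binary decisions (cut-off values or decision thresholds), with the true positive rate (TPR, i.e. sensitivity) on the vertical axis and the false positive rate (FPR, i.e. 1 - specificity) on the horizontal axis.
ROC shows the trade-off between the proportion of positives the model identifies correctly and the proportion of negatives it wrongly identifies as positive. An increase in TPR comes at the cost of an increase in FPR. The area under the ROC curve, AUC (Area Under the ROC Curve), summarizes in a single number how well the model separates the two classes.
Vertical axis: true positive rate (TPR), or sensitivity
TPR = TP / (TP + FN) (positive samples predicted as positive / actual number of positive samples)
Horizontal axis: false positive rate (FPR)
FPR = FP / (FP + TN) (negative samples predicted as positive / actual number of negative samples)
Signature: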
sklearn.metrics.roc_curve(y_true,y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
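The function returns three arrays: fpr, tpr and thresholds.
How to read thresholds:
An important capability of a classifier is its probability output, i.e. how likely the classifier considers a sample to be positive (or negative).
The score is each test sample's probability of belonging to the positive class.
Going from the highest score to the lowest, each score value is used in turn as the threshold: a test sample is treated as positive when its probability of being positive is greater than or equal to the threshold, and as negative otherwise. Every threshold yields one (FPR, TPR) pair, i.e. one point on the ROC curve. Setting the threshold to 1 and to 0 gives the points (0,0) and (1,1) respectively. Connecting all the (FPR, TPR) pairs produces the ROC curve; the more threshold values are used, the smoother the curve. A true probability for each test sample is not strictly required: any score the classifier assigns to the sample will do (it need not lie in the interval (0,1)). The higher the score, the more confident the classifier is that the sample is positive, and each score value is likewise used as a threshold. In my view, converting the scores to probabilities makes this easier to understand.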
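The threshold sweep described above can be reproduced by hand and compared with roc_curve. The four toy labels and scores and the helper rates_at are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and classifier scores (an assumption).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)


def rates_at(threshold):
    """Return (FPR, TPR) when samples with score >= threshold are called positive."""
    pred = y_score >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)


# Recompute every ROC point manually from the thresholds roc_curve returned.
manual = [rates_at(t) for t in thresholds]
auc = roc_auc_score(y_true, y_score)

for t, (f, r) in zip(thresholds, manual):
    print("threshold=%s  FPR=%.2f  TPR=%.2f" % (t, f, r))
print("AUC:", auc)
```

The first threshold returned by roc_curve lies above every score, so no sample is predicted positive and the curve starts at (0,0).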
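3.3.2.13. Zero-one loss
3.3.2.14. Brier score loss
3.3.3. Multilabel ranking metrics
3.3.3.1. Coverage error
3.3.3.2. Label ranking average precision
3.3.3.3. Ranking loss
3.3.4. Regression metrics
3.3.4.1. Explained variance score
3.3.4.2. Mean absolute error
3.3.4.3. Mean squared error
3.3.4.4. Mean squared logarithmic error
3.3.4.5. Median absolute error
3.3.4.6. R² score, the coefficient of determination
3.3.5. Clustering metrics
3.3.6. Dummy estimators
There are too many of these and they are too varied; look them up when they are needed.
Loss functions:
hinge_loss, hamming_loss, log_loss, zero_one_loss, brier_score_loss

3.4. Model persistence
3.4.1. Persistence example
3.4.2. Security and maintainability limitations

3.5. Validation curves: plotting scores to evaluate models
3.5.1. Validation curve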
sklearn.model_selection.validation_curve(estimator, X, y, param_name, param_range, groups=None, cv=None, scoring=None, n_jobs=1, pre_dispatch=’all’, verbose=0)
3.5.2. Learning curve
sklearn.model_selection.learning_curve(estimator, X, y, groups=None, train_sizes=array([ 0.1, 0.33, 0.55, 0.78, 1. ]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=1, pre_dispatch=’all’, verbose=0, shuffle=False, random_state=None)
See notes.

4. Dataset transformations
4.1. Pipeline and FeatureUnion: combining estimators
class sklearn.pipeline.Pipeline(steps, memory=None)
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
sklearn.pipeline.make_pipeline(*steps, **kwargs)
make_pipeline(Binarizer(), MultinomialNB()): the difference from Pipeline is that make_pipeline fills in the step names automatically.
from sklearn.pipeline import make_pipeline
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cross_val_score(clf, iris.data, iris.target, cv=cv)
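A short sketch of the naming difference between Pipeline and make_pipeline. The explicit step names binarize and nb are arbitrary choices for illustration.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Binarizer

# Pipeline: step names are supplied explicitly.
named = Pipeline([("binarize", Binarizer()), ("nb", MultinomialNB())])
# make_pipeline: step names are derived from the lowercased class names.
auto = make_pipeline(Binarizer(), MultinomialNB())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto.steps]
print(named_step_names)
print(auto_step_names)
```

The generated names matter when writing a parameter grid, since keys such as multinomialnb__alpha must use them.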
4.1.1. Pipeline: chaining estimators
4.1.1.1. Usage
from sklearn.linear_model import LogisticRegression
param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
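4.1.1.2. Notes
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step.
The pipeline has all the methods of its last estimator, i.e. if the last estimator is a classifier, the pipeline can be used as a classifier.
If the last estimator is a transformer, the pipeline is a transformer as well.
4.1.1.3. Caching transformers: avoid repeated computation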
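from tempfile import mkdtemp
from shutil import rmtree
pca1 = PCA()
svm1 = SVC()
cachedir = mkdtemp()
# Pass memory=cachedir; without it nothing is cached.
pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)], memory=cachedir)
pipe.fit(digits.data, digits.target)
# With caching enabled the transformers are cloned before fitting, so pca1 itself
# stays unfitted; inspect the fitted copy through named_steps instead.
print(pipe.named_steps['reduce_dim'].components_)
# Clear the cache directory when it is no longer needed.
rmtree(cachedir)
Note: without caching, the fitted instance can be inspected directly through pca1. With caching enabled, use pipe.named_steps['reduce_dim'].components_ instead.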
4.1.2. FeatureUnion: composite feature spaces
4.1.2.1. Usage
Works much like Pipeline, and the two can be combined to build more elaborate pipelines.

4.2. Feature extraction
4.2.1. Loading features from dicts: converts a list of dicts into a NumPy array
class sklearn.feature_extraction.DictVectorizer(dtype=<class ‘numpy.float64’>, separator=’=’, sparse=True, sort=True)
fit(X[, y]) Learn a list of feature name -> indices mappings.
fit_transform(X[, y]) Learn a list of feature name -> indices mappings and transform X.
fit_transform(measurements).toarray()
get_feature_names() Returns a list of feature names, ordered by their indices.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X[, dict_type]) Transform array or sparse matrix X back to feature mappings.
restrict(support[, indices]) Restrict the features to those in support using feature selection.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform feature->value dicts to array or sparse matrix.
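A minimal sketch of the DictVectorizer methods listed above. The measurements list is a made-up example; string values are one-hot encoded as name=value features, while numeric values are passed through.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical input: one dict per sample.
measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

vec = DictVectorizer()
X = vec.fit_transform(measurements).toarray()  # sparse by default, hence toarray()

print(vec.vocabulary_)  # feature name -> column index
print(X)

# inverse_transform maps each row back to a dict of its non-zero features.
restored = vec.inverse_transform(X)
print(restored[0])
```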
4.2.2. Feature hashing: effectively a dimensionality-reduction trick
class sklearn.feature_extraction.FeatureHasher(n_features=1048576, input_type=’dict’, dtype=<class ‘numpy.float64’>, alternate_sign=True, non_negative=False)
4.2.2.1. Implementation details

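4.2.3. Text feature extraction
4.2.3.1. The Bag of Words representation
4.2.3.2. Sparsity
4.2.3.3. Common Vectorizer usage: converts text into vectors of per-word occurrence counts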
class sklearn.feature_extraction.text.CountVectorizer(input=’content’, encoding=’utf-8’, decode_error=’strict’, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=’word’, max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class ‘numpy.int64’>)
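ngram_range: tuple (min_n, max_n), the lower and upper bound on the number of consecutive words that form an n-gram
token_pattern: regular expression used for tokenization
min_df: minimum document frequency; terms that appear in fewer documents than this threshold are dropped, which filters out rare words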
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
In[53]:vectorizer.vocabulary_
Out[53]:
{'and': 0,
'document': 1,
'first': 2,
'is': 3,
'one': 4,
'second': 5,
'the': 6,
'third': 7,
'this': 8}

In[54]:X.toarray()
Out[54]:
array([[0, 1, 1, ..., 1, 0, 1],
[0, 1, 0, ..., 1, 0, 1],
[1, 0, 0, ..., 1, 1, 0],
[0, 1, 1, ..., 1, 0, 1]], dtype=int64)

build_analyzer() Return a callable that handles preprocessing and tokenization
build_preprocessor() Return a function to preprocess the text before tokenization
build_tokenizer() Return a function that splits a string into a sequence of tokens
decode(doc) Decode the input into a string of unicode symbols
fit(raw_documents[, y]) Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y]) Learn the vocabulary dictionary and return term-document matrix.
get_feature_names() Array mapping from feature integer indices to feature name
get_params([deep]) Get parameters for this estimator.
get_stop_words() Build or fetch the effective stop words list
inverse_transform(X) Return terms per document with nonzero entries in X.
set_params(**params) Set the parameters of this estimator.
transform(raw_documents) Transform documents to document-term matrix.
4.2.3.4. Tf-idf term weighting: converts text into vectors of tf-idf values
class sklearn.feature_extraction.text.TfidfTransformer(norm=’l2’, use_idf=True, smooth_idf=True, sublinear_tf=False)
TfidfTransformer().fit_transform(CountVectorizer().fit_transform(corpus))
class sklearn.feature_extraction.text.TfidfVectorizer(input=’content’, encoding=’utf-8’, decode_error=’strict’, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer=’word’, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class ‘numpy.int64’>, norm=’l2’, use_idf=True, smooth_idf=True, sublinear_tf=False)
TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single step.
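The equivalence stated above can be checked directly. The sketch reuses the four-document corpus from the CountVectorizer example and relies on default settings for all three classes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Two steps: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both.
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("identical results:", same)
print("matrix shape:", one_step.shape)
```

With the default norm='l2', every row of the result has unit Euclidean length.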
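4.2.3.5. Decoding text files
chardet
4.2.3.6. Applications and examples
4.2.3.7. Limitations of the Bag of Words representation
4.2.3.8. Vectorizing a large text corpus with the hashing trick
4.2.3.9. Performing out-of-core scaling with HashingVectorizer (feature hashing for text)
4.2.3.10. Customizing the vectorizer classes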

4.2.4. Image feature extraction
4.2.4.1. Patch extraction
4.2.4.2. Connectivity graph of an image

4.3. Preprocessing data
4.3.1. Standardization, or mean removal and variance scaling
sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
class sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
4.3.1.1. Scaling features to a range
class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
class sklearn.preprocessing.MaxAbsScaler(copy=True)
4.3.1.2. Scaling sparse data
sklearn.preprocessing.maxabs_scale(X, axis=0, copy=True)
4.3.1.3. Scaling data with outliers
sklearn.preprocessing.robust_scale(X, axis=0, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
class sklearn.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
4.3.1.4. Centering kernel matrices
class sklearn.preprocessing.KernelCenterer
4.3.2. Non-linear transformation
class sklearn.preprocessing.QuantileTransformer(n_quantiles=1000, output_distribution=’uniform’, ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)
sklearn.preprocessing.quantile_transform(X, axis=0, n_quantiles=1000, output_distribution=’uniform’, ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=False)
4.3.3. Normalization
class sklearn.preprocessing.Normalizer(norm=’l2’, copy=True)
sklearn.preprocessing.normalize(X, norm=’l2’, axis=1, copy=True, return_norm=False)
4.3.4. Binarization
4.3.4.1. Feature binarization
class sklearn.preprocessing.Binarizer(threshold=0.0, copy=True)
4.3.5. Encoding categorical features
class sklearn.preprocessing.OneHotEncoder(n_values=’auto’, categorical_features=’all’, dtype=<class ‘numpy.float64’>, sparse=True, handle_unknown=’error’)
4.3.6. Imputation of missing values
class sklearn.preprocessing.Imputer(missing_values=’NaN’, strategy=’mean’, axis=0, verbose=0, copy=True)
strategy : string, optional (default=”mean”)
The imputation strategy.
If “mean”, then replace missing values using the mean along the axis.
If “median”, then replace missing values using the median along the axis.
If “most_frequent”, then replace missing using the most frequent value along the axis.
copy : boolean, optional (default=True)
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
If X is not an array of floating values;
If X is sparse and missing_values=0;
If axis=0 and X is encoded as a CSR matrix;
If axis=1 and X is encoded as a CSC matrix.
4.3.7. Generating polynomial features
class sklearn.preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
degree=2: for input features [a, b] the output is [1, a, b, a^2, ab, b^2].
interaction_only=True: a^2 and b^2 are dropped, because a feature is never multiplied by itself.
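A quick numerical check of the two settings above, using a = 2 and b = 3 as assumed sample values.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])  # one sample with a = 2, b = 3

# All terms up to degree 2: [1, a, b, a^2, ab, b^2]
full = PolynomialFeatures(degree=2).fit_transform(X)

# Interaction terms only: [1, a, b, ab]
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)

print(full)
print(inter)
```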
4.3.8. Custom transformers
class sklearn.preprocessing.FunctionTransformer(func=None, inverse_func=None, validate=True, accept_sparse=False, pass_y=’deprecated’, kw_args=None, inv_kw_args=None)

4.4. Unsupervised dimensionality reduction
4.4.1. PCA: principal component analysis
class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver=’auto’, tol=0.0, iterated_power=’auto’, random_state=None)
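4.4.2. Random projections
4.4.3. Feature agglomeration
4.5. Random Projection
4.5.1. The Johnson-Lindenstrauss lemma
4.5.2. Gaussian random projection
4.5.3. Sparse random projection
4.6. Kernel Approximation
4.6.1. Nystroem method for kernel approximation
4.6.2. Radial basis function kernel
4.6.3. Additive chi squared kernel
4.6.4. Skewed chi squared kernel
4.6.5. Mathematical details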

4.7. Pairwise metrics, affinities and kernels
4.7.1. Cosine similarity
4.7.2. Linear kernel
sklearn.metrics.pairwise.linear_kernel(X, Y=None)
svm.SVC(kernel='linear', C=C)
svm.LinearSVC(C=C)
# linear_kernel computes the linear kernel, i.e. the special case of polynomial_kernel with degree=1 and coef0=0 (homogeneous)
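The special-case relationship noted above can be verified numerically. One detail worth showing: polynomial_kernel scales the dot product by gamma, which defaults to 1 / n_features, so gamma=1 has to be passed for the two kernels to match exactly. The random input matrix is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

rng = np.random.RandomState(0)
X = rng.rand(5, 3)  # 5 samples, 3 features

lin = linear_kernel(X)  # plain dot products, X @ X.T

# degree=1 and coef0=0 with gamma=1 reproduce the linear kernel.
poly = polynomial_kernel(X, degree=1, gamma=1, coef0=0)

# With gamma left at its default (1 / n_features) the result is scaled down.
default_gamma = polynomial_kernel(X, degree=1, coef0=0)

print("matches linear kernel:", np.allclose(lin, poly))
print("default gamma scales by 1/3:", np.allclose(default_gamma, lin / 3))
```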
4.7.3. Polynomial kernel
sklearn.metrics.pairwise.polynomial_kernel(X, Y=None, degree=3, gamma=None, coef0=1)
svm.SVC(kernel='poly', degree=3, C=C)
4.7.4. Sigmoid kernel
sklearn.metrics.pairwise.sigmoid_kernel(X, Y=None, gamma=None, coef0=1)
svm.SVC(kernel='sigmoid', gamma=0.7, C=C)
4.7.5. RBF kernel
sklearn.metrics.pairwise.rbf_kernel(X, Y=None, gamma=None)
svm.SVC(kernel='rbf', gamma=0.7)
4.7.6. Laplacian kernel
sklearn.metrics.pairwise.laplacian_kernel(X, Y=None, gamma=None)
4.7.7. Chi-squared kernel
sklearn.metrics.pairwise.chi2_kernel(X, Y=None, gamma=1.0)
clf = svm.SVC(kernel='precomputed')
# linear kernel computation
gram = np.dot(X, X.T)
clf.fit(gram, y)
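4.8. Transforming the prediction target (y)
4.8.1. Label binarization
4.8.2. Label encoding
5. Dataset loading utilities
5.1. General dataset API
5.2. Toy datasets
5.3. Sample images
5.4. Sample generators
5.4.1. Generators for classification and clustering
5.4.1.1. Single label
5.4.1.2. Multilabel
5.4.1.3. Biclustering
5.4.2. Generators for regression
5.4.3. Generators for manifold learning
5.4.4. Generators for decomposition
5.5. Datasets in svmlight / libsvm format
5.6. Loading from external datasets
5.7. The Olivetti faces dataset
5.8. The 20 newsgroups text dataset
5.8.1. Usage
5.8.2. Converting text to vectors
5.8.3. Filtering text for more realistic training
5.9. Downloading datasets from the mldata.org repository
5.10. The Labeled Faces in the Wild face recognition dataset
5.10.1. Usage
5.10.2. Examples
5.11. Forest covertypes
5.12. RCV1 dataset
5.13. Boston house prices dataset
5.13.1. Notes
5.14. Breast Cancer Wisconsin (Diagnostic) Database
5.14.1. Notes
5.14.2. References
5.15. Diabetes dataset
5.15.1. Notes
5.16. Optical Recognition of Handwritten Digits Data Set
5.16.1. Notes
5.16.2. References
5.17. Iris Plants Database
5.17.1. Notes
5.17.2. References
5.18. Linnerrud dataset
5.18.1. Notes
5.18.2. References
6. Strategies to scale computationally: bigger data
6.1. Scaling with instances using out-of-core learning
6.1.1. Streaming instances
6.1.2. Extracting features
6.1.3. Incremental learning
6.1.4. Examples
6.1.5. Notes
7. Computational performance
7.1. Prediction latency
7.1.1. Bulk versus atomic mode
7.1.2. Influence of the number of features
7.1.3. Influence of the input data representation
7.1.4. Influence of the model complexity
7.1.5. Feature extraction latency
7.2. Prediction throughput
7.3. Tips and tricks
7.3.1. Linear algebra libraries
7.3.2. Model compression
7.3.3. Model reshaping
7.3.4. Links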

========= =======================================================
Colormap Description
========= =======================================================
autumn sequential linearly-increasing shades of red-orange-yellow
bone sequential increasing black-white color map with
a tinge of blue, to emulate X-ray film
cool linearly-decreasing shades of cyan-magenta
copper sequential increasing shades of black-copper
flag repetitive red-white-blue-black pattern (not cyclic at
endpoints)
gray sequential linearly-increasing black-to-white
grayscale
hot sequential black-red-yellow-white, to emulate blackbody
radiation from an object at increasing temperatures
hsv cyclic red-yellow-green-cyan-blue-magenta-red, formed
by changing the hue component in the HSV color space
inferno perceptually uniform shades of black-red-yellow
jet a spectral map with dark endpoints, blue-cyan-yellow-red;
based on a fluid-jet simulation by NCSA [#]_
magma perceptually uniform shades of black-red-white
pink sequential increasing pastel black-pink-white, meant
for sepia tone colorization of photographs
plasma perceptually uniform shades of blue-red-yellow
prism repetitive red-yellow-green-blue-purple-...-green pattern
(not cyclic at endpoints)
spring linearly-increasing shades of magenta-yellow
summer sequential linearly-increasing shades of green-yellow
viridis perceptually uniform shades of blue-green-yellow
winter linearly-increasing shades of blue-green
========= =======================================================

For the above list only, you can also set the colormap using the
corresponding pylab shortcut interface function, similar to Matlab::

imshow(X)
hot()
jet()

The next set of palettes are from the `Yorick scientific visualisation
package <http://dhmunro.github.io/yorick-doc/>`_, an evolution of
the GIST package, both by David H. Munro:

============ =======================================================
Colormap Description
============ =======================================================
gist_earth mapmaker's colors from dark blue deep ocean to green
lowlands to brown highlands to white mountains
gist_heat sequential increasing black-red-orange-white, to emulate
blackbody radiation from an iron bar as it grows hotter
gist_ncar pseudo-spectral black-blue-green-yellow-red-purple-white
colormap from National Center for Atmospheric
Research [#]_
gist_rainbow runs through the colors in spectral order from red to
violet at full saturation (like *hsv* but not cyclic)
gist_stern "Stern special" color table from Interactive Data
Language software
============ =======================================================

The following colormaps are based on the `ColorBrewer
<http://colorbrewer2.org>`_ color specifications and designs developed by
Cynthia Brewer:

ColorBrewer Diverging (luminance is highest at the midpoint, and
decreases towards differently-colored endpoints):

======== ===================================
Colormap Description
======== ===================================
BrBG brown, white, blue-green
PiYG pink, white, yellow-green
PRGn purple, white, green
PuOr orange, white, purple
RdBu red, white, blue
RdGy red, white, gray
RdYlBu red, yellow, blue
RdYlGn red, yellow, green
Spectral red, orange, yellow, green, blue
======== ===================================

ColorBrewer Sequential (luminance decreases monotonically):

======== ====================================
Colormap Description
======== ====================================
Blues white to dark blue
BuGn white, light blue, dark green
BuPu white, light blue, dark purple
GnBu white, light green, dark blue
Greens white to dark green
Greys white to black (not linear)
Oranges white, orange, dark brown
OrRd white, orange, dark red
PuBu white, light purple, dark blue
PuBuGn white, light purple, dark green
PuRd white, light purple, dark red
Purples white to dark purple
RdPu white, pink, dark purple
Reds white to dark red
YlGn light yellow, dark green
YlGnBu light yellow, light green, dark blue
YlOrBr light yellow, orange, dark brown
YlOrRd light yellow, orange, dark red
======== ====================================

ColorBrewer Qualitative:

(For plotting nominal data, :class:`ListedColormap` is used,
not :class:`LinearSegmentedColormap`. Different sets of colors are
recommended for different numbers of categories.)

* Accent
* Dark2
* Paired
* Pastel1
* Pastel2
* Set1
* Set2
* Set3

Other miscellaneous schemes:

============= =======================================================
Colormap Description
============= =======================================================
afmhot sequential black-orange-yellow-white blackbody
spectrum, commonly used in atomic force microscopy
brg blue-red-green
bwr diverging blue-white-red
coolwarm diverging blue-gray-red, meant to avoid issues with 3D
shading, color blindness, and ordering of colors [#]_
CMRmap "Default colormaps on color images often reproduce to
confusing grayscale images. The proposed colormap
maintains an aesthetically pleasing color image that
automatically reproduces to a monotonic grayscale with
discrete, quantifiable saturation levels." [#]_
cubehelix Unlike most other color schemes cubehelix was designed
by D.A. Green to be monotonically increasing in terms
of perceived brightness. Also, when printed on a black
and white postscript printer, the scheme results in a
greyscale with monotonically increasing brightness.
This color scheme is named cubehelix because the r,g,b
values produced can be visualised as a squashed helix
around the diagonal in the r,g,b color cube.
gnuplot gnuplot's traditional pm3d scheme
(black-blue-red-yellow)
gnuplot2 sequential color printable as gray
(black-blue-violet-yellow-white)
ocean green-blue-white
rainbow spectral purple-blue-green-yellow-orange-red colormap
with diverging luminance
seismic diverging blue-white-red
nipy_spectral black-purple-blue-green-yellow-red-white spectrum,
originally from the Neuroimaging in Python project
terrain mapmaker's colors, blue-green-yellow-brown-white,
originally from IGOR Pro
============= =======================================================

The following colormaps are redundant and may be removed in future
versions. It's recommended to use the names in the descriptions
instead, which produce identical output:

========= =======================================================
Colormap Description
========= =======================================================
gist_gray identical to *gray*
gist_yarg identical to *gray_r*
binary identical to *gray_r*
spectral identical to *nipy_spectral* [#]_
========= =======================================================

Some parameters in sklearn