Model Debugging in Practice | Applying Ng's Algorithm Diagnostics

Latest recommended article published on 2022-11-11 22:45:48

文文学霸

Tags: algorithms, visualization, python, machine learning, artificial intelligence

Original link: http://xtf615.com/2017/04/03/practice-ml-advice/

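In an earlier post, "Half a Year on the Job | A Few Suggestions for New-Graduate Algorithm Engineers", I mentioned that debugging algorithms is key to doing high-quality work efficiently and delivering business gains. This post therefore starts from Andrew Ng's lecture notes on diagnosing machine learning algorithms [1], draws mainly on [2], and uses demos to share practical experience that should help you debug data, features and models effectively in day-to-day study and work.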

The post covers the following parts:

Data visualization
Model selection: choosing a machine learning method suitable for the problem at hand
Identifying and dealing with overfitting and underfitting
Visualizing high-dimensional data: dealing with high-dimensional datasets
Pros and cons of different loss functions

1. Data visualization

1.1 Getting the dataset

We use sklearn's built-in make_classification function to generate the demo data.

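import numpy as np
from sklearn.datasets import make_classification
from pandas import DataFrame

# 1000 samples, 20 features: 2 informative, 2 redundant, the rest noise
X, y = make_classification(1000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)
# Python 3: build the column names as a list (map() no longer returns one)
columns = ["col_" + str(i) for i in range(20)] + ["class"]
df = DataFrame(np.hstack((X, y[:, None])), columns=columns)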

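This post focuses on binary classification, with 1000 samples and 20 features. The table below shows part of the data:

Even with so few dimensions, it is hard to learn anything useful about the problem by looking at the raw numbers. Visualizing the data helps reveal patterns.

1.2 Visualization

We use the open-source Seaborn library. As a first step, pairplot draws the relationship between every pair of dimensions and the class. We start with the first 100 samples and 5 of the features.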

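import seaborn as sns
import matplotlib.pyplot as plt

# `height` replaces the `size` argument in seaborn >= 0.9
_ = sns.pairplot(df[:100], vars=["col_8", "col_11", "col_12", "col_14", "col_19"], hue="class", height=1.5)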

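In the 25 panels above, the diagonal shows the histogram of each single feature: the x-axis is the feature value and the y-axis its frequency, so each panel reflects how the distribution of that feature differs between classes. Features 11 and 14 clearly take different values in the two classes.

The off-diagonal panels are scatter plots of every pairwise combination of the 5 features. They show how a pair of features relates to the class, and we can judge whether the pair is useful for classification by whether the classes are linearly separable or whether the features are clearly correlated. The scatter plot of features 11 and 14, for example, is almost linearly separable, while features 12 and 19 are clearly anti-correlated. Of two strongly correlated features, one can usually be dropped; features strongly correlated with the class should be kept.

Next we look at the correlations among features and between features and the class: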

plt.figure(figsize=(12, 10))
plt.xticks(rotation=90)
_ = sns.heatmap(df.corr())  # df.corr() computes the pairwise correlation coefficients

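The heat map above shows the correlations among the features and between each feature and the class. Start with the last row, which relates the class to each feature: feature 11 is most closely related to the class, so it matters most for classification, followed by features 14 and 12. Features 12 and 19 are clearly anti-correlated, and features 11 and 14 are strongly positively correlated, so some features are redundant. Many models, naive Bayes for example, assume that features are independent given the class, so strongly correlated features should be filtered. Most of the remaining features are noise, correlated neither with other features nor with the class.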
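The redundancy check read off the heat map can also be done numerically. The following is a minimal sketch on toy data, assuming a plain NumPy feature matrix; the helper name `correlated_pairs` and the 0.8 threshold are illustrative choices, not part of the original analysis.

```python
import numpy as np

def correlated_pairs(X, threshold=0.8):
    """Return (i, j, r) for every feature pair whose |Pearson r| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    n = corr.shape[0]
    return [(i, j, float(corr[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold]

# Toy data: feature 2 is (almost) the negative of feature 0, feature 1 is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X_demo = np.column_stack([a, rng.normal(size=500), -a + 0.1 * rng.normal(size=500)])

pairs = correlated_pairs(X_demo)
print(pairs)  # only the (0, 2) pair should be reported, with r close to -1
```

Applied to the 20-feature demo data, the same idea would flag pairs such as features 12 and 19 as candidates for dropping one of the two.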

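2. Initial model selection

Once the data has been visualized, we can quickly fit a model for a rough first pass. With so many machine learning models it can be hard to decide which to try first; the following map, taken from the web, summarizes some rules of thumb: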

from IPython.display import Image
Image(filename='machine-learning-method.png', width=800, height=600)

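With 1000 samples and a supervised classification problem, the map recommends LinearSVC, so we first try an SVM with a linear kernel. Recall the SVM objective:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr)$$

This objective is L2-regularized with the L1-loss (the hinge loss $\max(0, 1 - y_i (w^\top x_i + b))$), so we set penalty='l2' and loss='hinge':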

#http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
from sklearn.svm import LinearSVC
# A large gap between the two curves suggests overfitting: the training score is high while the validation score is low
plot_learning_curve(LinearSVC(C=10.0,penalty='l2',loss='hinge'), "LinearSVC(C=10.0,penalty='l2',loss='hinge')",
                    X, y, ylim=(0.8, 1.01),
                    train_sizes=np.linspace(.05, 0.2, 5),baseline=0.9)

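The figure above is a learning curve, the counterpart of the bias/variance diagnostic plot mentioned earlier; the next subsection explains how it is drawn. For now, note that only up to 20% of the data (the second argument of np.linspace), that is, at most about 200 of the 1000 samples, is used for training. The training score and the generalization (cross-validation) score are far apart while the training score is very high, which, by the bias/variance analysis, indicates overfitting. Note that this learning curve differs from the bias/variance plot in Ng's notes, shown below:

The difference is that Ng's plot shows error while this one shows score. The relative position of the two curves is therefore swapped: the training score curve is on top and the validation score curve below. As samples are added, the validation error curve falls, whereas the validation score curve rises. What the two plots share is the signature of overfitting: a large gap between the training and generalization curves, with a high training score (equivalently, a low training error).

2.1 Learning curves

We first describe how the learning curve is drawn.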

# http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve  # was sklearn.learning_curve in old releases

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5),baseline=None):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds. If None,
        scikit-learn's default 5-fold cross-validation is used.
        Specific cross-validation objects can be passed, see the
        sklearn.model_selection module for the list of possible objects.

    baseline : float, optional
        Desired score, drawn as a horizontal reference line.
    """

    plt.figure()
    # Honor the cv argument instead of hard-coding 5 folds
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    print(train_sizes)
    print('-------------')
    print(train_scores)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    if baseline is not None:
        plt.axhline(y=baseline,color='red',linewidth=5,label='Desired Performance')  # desired-score baseline
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid(True)
    if ylim:
        plt.ylim(ylim)
    plt.title(title)

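A few key points. First the parameters: estimator is the model, title the plot title, X the samples, y the labels, ylim the (min, max) range of the y-axis, and cv the number of cross-validation folds. train_sizes=np.linspace(.1, 1.0, 5) controls how the training set is subsampled: np.linspace(.1, 1.0, 5) returns [0.1, 0.325, 0.55, 0.775, 1.], i.e. evenly spaced values, where the first argument is the start, the second the end, and the third the number of points. The x-axis of a learning curve is the number of training samples, so the curve shows how the metric on the training and validation sets changes with sample count. We cannot afford to train at every sample count from 1 to 1000 and draw a smooth curve, so we train only at a few key points; the train_sizes array above gives the fraction of the available training samples used at each point.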
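To make the fractions concrete, here is a small sketch of how they translate into sample counts. It assumes 5-fold cross-validation, in which scikit-learn documents that fractional train_sizes are taken relative to the largest training fold rather than the full dataset; the helper `absolute_train_sizes` is hypothetical and its rounding only approximates the library's own.

```python
import numpy as np

def absolute_train_sizes(fractions, n_samples, n_folds):
    """Approximate training-set sizes for fractional train_sizes under k-fold CV."""
    # Each split trains on the samples outside the held-out fold.
    n_max_train = n_samples - n_samples // n_folds
    return np.round(np.asarray(fractions) * n_max_train).astype(int)

fractions = np.linspace(0.1, 1.0, 5)   # start, stop, number of points
sizes = absolute_train_sizes(fractions, n_samples=1000, n_folds=5)
print(fractions)
print(sizes)
```

Under these assumptions, the largest point on the x-axis is 800 samples rather than 1000, because one fold is always held out for validation.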

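Next, the important code. The call to learning_curve(estimator, X, y, ...) returns three arrays. The returned train_sizes holds the actual numbers of training samples computed from the fractions passed in. train_scores is a 2-D array of training scores: its first dimension equals the length of train_sizes (one row per training size) and its second dimension equals the number of cross-validation folds (one score per fold). test_scores holds the corresponding validation scores. We therefore average over the folds to draw the curves, and plt.fill_between draws the shaded bands around them.

3. Dealing with overfitting

There are many ways to address overfitting.

3.1 Adding more samples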

plot_learning_curve(LinearSVC(C=10.0,penalty='l2',loss='hinge'), "LinearSVC(C=10.0,penalty='l2',loss='hinge')",
                    X, y, ylim=(0.8, 1.01),
                    train_sizes=np.linspace(.1, 1.0, 5), baseline=0.9)

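Here the second argument of linspace is changed to 1, so all samples are used. The generalization score keeps rising as samples are added, and the gap between it and the training score keeps shrinking. A small gap also occurs with high bias, so we check further: both scores are at a high level, above the desired score, whereas with high bias both would be low, below the desired score. So at this point there is neither overfitting nor underfitting.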
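The reading of a learning curve described so far can be condensed into a rough rule. The sketch below is only a heuristic: the function name `diagnose` and the 0.05 gap tolerance are assumptions made for illustration, and real curves should still be inspected by eye.

```python
def diagnose(train_score, val_score, desired_score, max_gap=0.05):
    """Rough reading of the last point of a learning curve (scores, not errors)."""
    gap = train_score - val_score
    if val_score >= desired_score and gap <= max_gap:
        return "ok"
    if train_score < desired_score:
        # Both curves sit below the target: the model is too simple.
        return "high bias (underfitting)"
    # Training score is fine but validation lags behind.
    return "high variance (overfitting)"

print(diagnose(0.99, 0.85, desired_score=0.9))  # large gap, high training score
print(diagnose(0.75, 0.73, desired_score=0.9))  # small gap, both scores low
print(diagnose(0.95, 0.93, desired_score=0.9))  # small gap, both scores high
```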

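3.2 Reducing the number of features

From the visual analysis above, features 11 and 14 are closely related to the class, so we first hand-pick these two features for training, again on only 20% of the samples: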

plot_learning_curve(LinearSVC(C=10.0,penalty='l2',loss='hinge'), "LinearSVC(C=10.0,penalty='l2',loss='hinge') Features: 11&14",
                    df[["col_11", "col_14"]], y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5),baseline=0.9)

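Compared with the first, overfitting plot, the result is much better and overfitting is essentially resolved. This feature selection is a bit of a cheat, though: the features were picked by hand, and they were picked using all 1000 samples although only about 200 are used for training and plotting. Automatic feature selection follows: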

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# SelectKBest(f_classif, k=2) will select the k=2 best features according to their Anova F-value
plot_learning_curve(Pipeline([("fs", SelectKBest(f_classif, k=2)), # select two features
                               ("svc", LinearSVC(C=10.0,penalty='l2',loss='hinge'))]),
                    "SelectKBest(f_classif, k=2) + LinearSVC(C=10.0,penalty='l2',loss='hinge')",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5),baseline=0.9)

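The pipeline above uses SelectKBest to pick 2 features, and feature selection works well on this dataset. Note that feature selection is just one way to reduce model complexity. Others include lowering the polynomial degree in linear regression, reducing the number of hidden layers and units in a neural network, and increasing the bandwidth ($\sigma$) of the Gaussian (RBF) kernel, or equivalently decreasing $\gamma$.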
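To see what ranking features by ANOVA F-value means, the statistic can be written out by hand. This is an illustrative re-implementation on toy data, not scikit-learn's code; the helper names `anova_f_scores` and `top_k_features` are made up for this sketch.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic per feature (larger = class means differ more)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_samples - n_classes)
    return ms_between / ms_within

def top_k_features(X, y, k=2):
    """Indices of the k features with the highest F score, best first."""
    return np.argsort(anova_f_scores(X, y))[::-1][:k]

# Toy data: 5 noise features, two of which get a class-dependent shift.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 3] += 3.0 * y_demo   # strongly informative
X_demo[:, 1] += 1.0 * y_demo   # weakly informative

selected = top_k_features(X_demo, y_demo, k=2)
print(selected)
```

Wrapping the selection in a Pipeline, as the article does, matters because the scores are then recomputed on each training fold instead of on the full dataset.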

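3.3 Changing the regularization term of the objective

# C expresses how much weight outliers (margin violations) receive: the larger C, the more weight and the easier it is to overfit.
# Decreasing C alleviates overfitting to some extent.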
plot_learning_curve(LinearSVC(C=0.1,penalty='l2',loss='hinge'), "LinearSVC(C=0.1,penalty='l2',loss='hinge')", 
                    X, y, ylim=(0.8, 1.01),
                    train_sizes=np.linspace(.05, 0.2, 5),baseline=0.9)

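The penalty factor C determines how much you care about the loss caused by outliers. For a fixed total of the slack variables ($\xi_i$) of all outliers, a larger C means a larger contribution to the objective, signalling that you are very unwilling to give up on these points. In the extreme, with C set to infinity, a single point violating the margin makes the objective infinite and the problem infeasible, which is the hard-margin problem. In other words, the larger C, the fewer mistakes you want on the training data, which in practice is impossible or pointless, and overfitting results. Decreasing C therefore reduces overfitting to some extent. We can use grid search to find the best C.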
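The effect of C can be checked on a tiny hand-made example. The sketch below assumes the standard soft-margin objective, 0.5 * ||w||^2 + C * sum of hinge losses, with a fixed and arbitrarily chosen w and b; the data points are invented for illustration.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (the slack of each sample)."""
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * slack.sum()

X_demo = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y_demo = np.array([1.0, -1.0, -1.0])   # the third point sits on the wrong side
w, b = np.array([1.0, 0.0]), 0.0

for C in (0.1, 1.0, 10.0):
    # The regularization term stays at 0.5; only the slack term scales with C.
    print(C, svm_objective(w, b, X_demo, y_demo, C))
```

With a large C, the single misplaced point dominates the objective, so the optimizer would bend the boundary to accommodate it, which is the overfitting behaviour described above.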

# Grid search over C
from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search in old releases
est = GridSearchCV(LinearSVC(penalty='l2',loss='hinge'), 
                   param_grid={"C": [0.0001,0.001, 0.01, 0.1, 1.0, 10.0]})
plot_learning_curve(est, "LinearSVC(C=AUTO)", 
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5),baseline=0.9)
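print("Chosen parameter on 100 datapoints: %s" % est.fit(X[:100], y[:100]).best_params_)

Output: Chosen parameter on 100 datapoints: {'C': 0.01}

Feature selection looks better than tuning the regularization coefficient. Another form of regularization is to set the penalty of LinearSVC to L1; the official documentation says "The 'l1' leads to coef_ vectors that are sparse", i.e. L1 yields a sparse coefficient vector, and features with a zero coefficient play no role, which amounts to implicit feature selection. Note, however, that LinearSVC does not support combining L1 regularization (penalty='l1') with the L1-loss (loss='hinge'), see [3], so loss must be changed to 'squared_hinge'. In addition, this combination cannot be solved in the dual, hence dual=False.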

plot_learning_curve(LinearSVC(C=0.1, penalty='l1', loss='squared_hinge',dual=False), 
                    "LinearSVC(C=0.1, penalty='l1')", 
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5),baseline=0.9)

The result looks good. The learned coefficients are:

est = LinearSVC(C=0.1, penalty='l1', loss='squared_hinge',dual=False)
est.fit(X[:150], y[:150])  # fit on 150 datapoints
print("Coefficients learned: %s" % est.coef_)
print("Non-zero coefficients: %s" % np.nonzero(est.coef_)[1])

Feature 11 has the largest weight, i.e. it is the most important.

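4. Dealing with underfitting

The classification results on the dataset used so far were all fairly good, so we now try a different binary classification dataset.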

from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, random_state=2)  # only 2 features
plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25)", 
                    X, y, ylim=(0.4, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))  # performs very poorly

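The plot shows that the training score and the generalization score are close, and that the training score is clearly below the desired score. By the bias/variance analysis, this is a clear case of high bias, i.e. underfitting. We first visualize the data:

# Ring-shaped data: the outer circle is one class, the inner circle the other
columns = ["col_" + str(i) for i in range(2)] + ["class"]
df = DataFrame(np.hstack((X, y[:, None])), 
               columns = columns)
_ = sns.pairplot(df, vars=["col_0", "col_1"], hue="class", height=3.5)

The data is ring-shaped: points on the outer circle belong to one class and points on the inner circle to the other. It is clearly not linearly separable, so neither more data nor fewer features will help. The model itself is wrong for this data, and we need to address underfitting.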

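4.1 Adding or using better features

We try adding a feature. From the scatter plot, the classes clearly differ in their distance to the origin, so we add the (squared) distance to the origin as a feature.

# Fix for underfitting, option 1: add a feature
# X[:, [0]]**2 + X[:, [1]]**2 is the squared distance to the origin
X_origin_distance = X[:, [0]]**2 + X[:, [1]]**2  # X[:, [0]] keeps the column two-dimensional: [[8.93841424e-01], [-7.63891636e-01], ...]
df['col_3'] = X_origin_distance.ravel()
# The classes are now perfectly linearly separable
_ = sns.pairplot(df, vars=["col_0", "col_1","col_3"], hue="class", height=3.5)

From the last panel, the new feature col_3 alone separates the classes perfectly, so it is decisive for classification. Let us also look at the heat map:

The heat map shows a very strong negative correlation between col_3 and the class. We now predict with the extended feature set: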

X_extra = np.hstack((X,X[:,[0]]**2+X[:,[1]]**2))
plot_learning_curve(LinearSVC(C=10,penalty='l2',loss='hinge'), "LinearSVC(C=10,penalty='l2',loss='hinge')", 
                    X_extra, y, ylim=(0, 1.01),
                    train_sizes=np.linspace(.1, 1.0, 5),baseline=0.9)

The samples are separated perfectly. This raises a further question: can the model generate such new features automatically?
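Why the added feature works so well can be verified numerically on a stand-in for the ring data. The sketch below builds two noise-free concentric circles by hand; the radii 1.0 and 0.8 and the class assignment are assumptions chosen to mimic the shape of the make_circles output.

```python
import numpy as np

# Two concentric circles: outer radius 1.0 -> class 0, inner radius 0.8 -> class 1.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
outer = np.column_stack([np.cos(angles), np.sin(angles)])
inner = 0.8 * outer
X_demo = np.vstack([outer, inner])
y_demo = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]

# Squared distance to the origin, the same quantity as the added feature.
r2 = (X_demo ** 2).sum(axis=1)

# A single threshold on the new feature separates the two classes.
threshold = (1.0 ** 2 + 0.8 ** 2) / 2.0
pred = (r2 < threshold).astype(int)
accuracy = (pred == y_demo).mean()
print(accuracy)
print(np.corrcoef(r2, y_demo)[0, 1])  # strongly negative: class 1 has the smaller radius
```

Neither raw coordinate carries this information on its own, which is why a linear model on the original two features cannot do better than chance.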

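4.2 Using a more complex model

Using a more complex model amounts to changing the objective function. Since the dataset is not linearly separable, we try a non-linear classifier: an SVM with an RBF kernel.

from sklearn.svm import SVC
# note: we use the original X without the extra feature
# RBF kernel with the kernel coefficient gamma set to 1.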
plot_learning_curve(SVC(C=10, kernel="rbf", gamma=1.0),
                    "SVC(C=10, kernel='rbf', gamma=1.0)",
                    X, y, ylim=(0.5, 1.1), 
                    train_sizes=np.linspace(.1, 1.0, 5),baseline=0.9)

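Note that the model above uses the original dataset X, without the new feature. The result is very good: the RBF kernel implicitly maps the features into a high-dimensional space, so the resulting non-linear model performs well.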
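For reference, the RBF kernel itself is a simple function of the squared distance between two points, K(x, z) = exp(-gamma * ||x - z||^2). A minimal NumPy sketch, with made-up points and the helper name `rbf_kernel_matrix` chosen for illustration:

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(points, points, gamma=1.0)
print(K)
```

Each entry depends only on the distance between two points, so similarity decays smoothly with distance; a larger gamma makes the decay faster and the model more local.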

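5. Handling and visualizing high-dimensional features

5.1 Incremental learning with SGDClassifier

As the dataset grows and the features multiply, the SVM above becomes much slower. Following the map, we can switch to SGDClassifier, which is also a linear model but is trained with stochastic gradient descent. SGDClassifier is sensitive to feature scaling, so consider standardizing the dataset so that each feature has zero mean and unit variance.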
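Standardization is straightforward to write down. The following sketch uses invented data and a hypothetical helper `standardize`; it estimates the mean and standard deviation on the training part only and reuses them for other data, a common precaution against leaking information from held-out samples.

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale both arrays with the mean/std estimated on X_train only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant features unscaled
    return (X_train - mean) / std, (X_other - mean) / std

# Toy data with two features on very different scales.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(1000, 2))
X_test_demo = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(200, 2))

X_train_std, X_test_std = standardize(X_train_demo, X_test_demo)
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```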

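SGDClassifier supports incremental (online) learning, which is very useful with large amounts of data. Cross-validation is not a good fit here, so we use progressive validation: split the data into equal blocks; each block is first used to validate the model trained on the preceding blocks and is then used for training, so learning on later blocks continues from what was learned on earlier ones.

First generate the data: 200,000 samples, 200 features and 10 classes.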

X, y = make_classification(200000, n_features=200, n_informative=25, 
                           n_redundant=0, n_classes=10, class_sep=2,
                           random_state=0)

Modeling and validation:

from sklearn.linear_model import SGDClassifier
def sgd_score(X,y):
    est = SGDClassifier(penalty="l2", alpha=0.001)
    progressive_validation_score = []
    train_score = []
    for datapoint in range(0, 199000, 1000):
        X_batch = X[datapoint:datapoint+1000]
        y_batch = y[datapoint:datapoint+1000]
        if datapoint > 0:
            progressive_validation_score.append(est.score(X_batch, y_batch))
        est.partial_fit(X_batch, y_batch, classes=np.arange(10))  # incremental (online) learning
        if datapoint > 0:
            train_score.append(est.score(X_batch, y_batch))
            
    plt.plot(train_score, label="train score",color='blue')
    plt.plot(progressive_validation_score, label="progressive validation score",color='red')
    plt.xlabel("Mini-batch")
    plt.ylabel("Score")
    plt.axhline(y=0.8,color='red',linewidth=5,label='Desired Performance') #baseline
    plt.legend(loc='best')
sgd_score(X,y)

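The plot shows that after about 50 mini-batches the score improves very little, so training could be stopped early. Since the training score and the generalization score are close and the training score is low, the model is probably underfitting. SGDClassifier does not support the kernel trick, however; following the map, we can use kernel approximation.

Whereas a kernel function maps implicitly, kernel approximation uses an explicit feature map, which matters for online learning because it reduces the cost of learning on very large datasets. Combining SGDClassifier with kernel approximation enables non-linear learning on large datasets.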
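One well-known explicit map is random Fourier features, which approximates the RBF kernel. The article does not say which approximation it has in mind, so the sketch below is an assumption made for illustration; the helper name and all parameter values are made up.

```python
import numpy as np

def random_fourier_features(X, n_components, gamma, rng):
    """Explicit map z(x) with z(x) . z(x') ~= exp(-gamma * ||x - x'||^2)."""
    n_features = X.shape[1]
    # Frequencies drawn from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, n_components))
    offset = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + offset)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
gamma = 0.5

Z = random_fourier_features(X_demo, n_components=5000, gamma=gamma, rng=rng)
approx = Z @ Z.T   # inner products of the explicit features

sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=2)
exact = np.exp(-gamma * sq_dists)   # the kernel they approximate

max_error = np.abs(approx - exact).max()
print(max_error)
```

Because Z is an ordinary feature matrix, it can be fed to a linear online learner batch by batch, which is what makes the explicit map attractive for large datasets.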

5.2 Handwritten digit recognition

Now we model the handwritten digits problem.

5.2.1 Visualization

# http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
from sklearn.datasets import load_digits
digits = load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
print("Dataset consist of %d samples with %d features each" % (n_samples, n_features))

# Plot images of the digits
n_img_per_row = 20  # at most 32, i.e. 1024 samples shown
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))  # 200x200 pixel matrix

for i in range(n_img_per_row):
    ix = 10 * i + 1  # leave a 1-pixel gap
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))  # each row of 64 features is a flattened 8x8 image; store it in its tile

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
_ = plt.title('A selection from the 8*8=64-dimensional digits dataset')

The 64 features of a handwritten digit are simply the pixels of an 8*8 image laid out flat, so the code above can rebuild the images.
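The flattening and its inverse are plain reshapes. A minimal sketch with a stand-in array (the values are arbitrary, not real digit pixels):

```python
import numpy as np

# A stand-in 8x8 "image".
image = np.arange(64).reshape(8, 8)

flat = image.ravel()            # 64-dimensional feature vector, row by row
restored = flat.reshape(8, 8)   # back to the 8x8 pixel grid

print(flat.shape, restored.shape)
print(np.array_equal(image, restored))
```

Pixel (row r, column c) of the image ends up at position 8 * r + c of the feature vector.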

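print(digits.images.shape)  # 3-D array (1083, 8, 8): 1083 samples
print(img.shape)  # 2-D array (200, 200): each sample occupies an 8x8 tile, 20 samples per row of tiles, 400 samples in total
# The array could be enlarged, e.g. to (320, 320), with 32 samples per row of tiles, for at most 32*32 = 1024 samples
# digits.images[0] == img[1:9,1:9]
# digits.images[1] == img[1:9,11:19]
plt.matshow(digits.images[1],cmap=plt.cm.gray)  # the second sample is the digit 1
plt.matshow(img[1:9,11:19],cmap=plt.cm.gray)  # the second sample again, the digit 1

The code above displays a single digit: digits.images[1] and img[1:9,11:19] both represent the second sample, which the figure shows to be the digit 1. We visualize further: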

# Helper function based on 
# http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html#example-manifold-plot-lle-digits-py
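# As discussed earlier, each handwritten digit is represented as an 8x8 pixel matrix.
# Classifiers such as logistic regression and kNN can use all 64 pixel features directly to predict the digits in a test set,
# which means working with data that has not been dimension-reduced. We can also train classifiers on reduced data.
# Let us look at the effect of several dimensionality reduction methods on this dataset.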
from matplotlib import offsetbox
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111)
    
    # Draw each sample at its two embedded coordinates, labelled with its true digit
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 12})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # start with a reference point
        for i in range(digits.data.shape[0]):  # loop over samples
            dist = np.sum((X[i] - shown_images) ** 2,axis=1)  # squared distance from this point to every thumbnail already shown
            # axis=1 sums across each row (dx^2 + dy^2 per shown point); axis=0 would instead sum all dx^2 together and all dy^2 together
            if np.min(dist) < 4e-3:  # distance to the nearest thumbnail
                continue # don't show points that are too close
            shown_images = np.r_[shown_images, [X[i]]]  # append as a new row
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])  # X[i] is the sample's 2-D position, where its grayscale thumbnail is drawn
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

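5.2.2 Dimensionality reduction

Random projection

# Dimensionality reduction by random projection:
# randomly project the 64-dimensional data onto two dimensions
import time
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)  # project onto two random dimensions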
stime = time.time()
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, "Random Projection of the digits (time: %.3fs)" % (time.time() - stime))

PCA

# PCA-style linear dimensionality reduction
# TruncatedSVD is closely related to PCA; it does not require constructing the covariance matrix and also works on sparse matrices
# (PCA for dense data, TruncatedSVD for sparse data)
# In text analysis, TruncatedSVD on a term-document matrix is known as LSA: it maps documents from the sparse,
# high-dimensional vocabulary space to a low-dimensional latent semantic space.
stime = time.time()  # start the timer before fitting so the reported time covers the projection
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,"Principal Components projection of the digits (time: %.3fs)" % (time.time() - stime))
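For comparison, textbook PCA can be sketched in a few lines of NumPy. Unlike TruncatedSVD, it centers the data before the SVD. The helper name `pca_project` and the toy data are assumptions for illustration only.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]   # rows are orthonormal directions
    return X_centered @ components.T, components

# Toy data: most variance along the first axis, little along the second, almost none along the third.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3)) * np.array([5.0, 1.0, 0.1])

X_2d, components = pca_project(X_demo, n_components=2)
print(X_2d.shape)
print(X_2d.var(axis=0))   # variance captured by each component, largest first
```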

LDA (linear)

print("Computing Linear Discriminant Analysis projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # Make X invertible
stime = time.time()
X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2).fit_transform(X2, y)
plot_embedding(X_lda,
               "Linear Discriminant projection of the digits (time %.2fs)" %
               (time.time() - stime))

t-SNE (non-linear)

#http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE
# Non-linear embedding
# that minimizes the Kullback-Leibler divergence
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
stime = time.time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time: %.3fs)" % (time.time() - stime))

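On this dataset the non-linear embedding gives a better result than the linear ones.

6. Choosing a loss function

Some common classification losses are listed below, where $y$ takes the value 1 or -1. In the figure, note that the x-axis is $y \cdot f(x)$ and the y-axis is the loss value: classification losses are monotonic functions of $y \cdot f(x)$ (not of $f(x)$ alone).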

# adapted from http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_loss_functions.html
def modified_huber_loss(y_true, y_pred):
    z = y_pred * y_true
    loss = -4 * z
    loss[z >= -1] = (1 - z[z >= -1]) ** 2
    loss[z >= 1.] = 0
    return loss
xmin, xmax = -4, 4
xx = np.linspace(xmin, xmax, 100)
lw = 2
plt.plot([xmin, 0, 0, xmax], [1, 1, 0, 0], color='gold', lw=lw,
         label="Zero-one loss")
plt.plot(xx, np.where(xx < 1, 1 - xx, 0), color='teal', lw=lw,
         label="Hinge loss")
plt.plot(xx, -np.minimum(xx, 0), color='yellowgreen', lw=lw,
         label="Perceptron loss")
plt.plot(xx, np.log2(1 + np.exp(-xx)), color='cornflowerblue', lw=lw,
         label="Log loss")
plt.plot(xx, np.where(xx < 1, 1 - xx, 0) ** 2, color='orange', lw=lw,
         label="Squared hinge loss")
plt.plot(xx, np.exp(-xx), color='red',lw=lw,linestyle='--',
         label="Exponential loss")
plt.plot(xx, modified_huber_loss(xx, 1), color='darkorchid', lw=lw,
         linestyle='--', label="Modified Huber loss")
plt.ylim((0, 8))
plt.legend(loc="upper right")
plt.xlabel(r"Decision function (multiplied by y) $y \cdot f(x)$")
plt.ylabel(r"$L(y \cdot f(x))$")
plt.show()

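Different loss functions have different strengths:

0-1 loss: $L = \mathbb{1}[y f(x) < 0]$, used in classification. It is the loss of empirical risk minimization (ERM), but it is non-convex, so other loss functions must be used as surrogates.
hinge loss: $L = \max(0, 1 - y f(x))$, used in SVMs. It embodies the maximum-margin idea and is little affected by outliers, hence quite robust, but it does not provide a good probabilistic interpretation. Also known as the L1-loss.
log loss: $L = \log(1 + e^{-y f(x)})$, which is the negative log of the sigmoid of $y f(x)$. Used in logistic regression (equivalent to binary cross-entropy, except that y is 1 or 0 there and 1 or -1 here). It provides a good probabilistic interpretation but is sensitive to outliers.
exponential loss: $L = e^{-y f(x)}$, used in boosting. It is sensitive to outliers; in AdaBoost it leads to a simple and effective algorithm.
perceptron loss: $L = \max(0, -y f(x))$, used in the perceptron algorithm. It resembles the hinge loss shifted to the left. Unlike the hinge loss, it does not penalize correctly classified points that lie close to the hyperplane.
squared hinge loss: $L = \max(0, 1 - y f(x))^2$, an improvement on the hinge loss, also known as the L2-loss. It is differentiable everywhere (at the point (1, 0) the left and right derivatives both exist and equal 0).
modified Huber loss: $L = \max(0, 1 - y f(x))^2$ for $y f(x) \ge -1$ and $L = -4\, y f(x)$ otherwise, a further refinement of the squared hinge loss. It is a smooth loss that tolerates outliers better (their loss grows linearly instead of quadratically).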
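The robustness claims can be made concrete by evaluating the losses at a badly misclassified point. The sketch below restates the formulas from the plotting code as small functions; it uses the natural logarithm for the log loss, whereas the plot uses base 2, and the chosen margin values are arbitrary.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):
    return np.maximum(0.0, 1.0 - z) ** 2

def modified_huber(z):
    # Quadratic for z >= -1, linear (-4z) below that.
    return np.where(z >= -1.0, np.maximum(0.0, 1.0 - z) ** 2, -4.0 * z)

def exponential(z):
    return np.exp(-z)

def log_loss(z):
    return np.log(1.0 + np.exp(-z))

# z = y * f(x): -4 is a badly misclassified outlier, 0 lies on the boundary,
# 2 is a confident correct prediction.
z = np.array([-4.0, 0.0, 2.0])
for name, fn in [("hinge", hinge), ("squared hinge", squared_hinge),
                 ("modified huber", modified_huber),
                 ("exponential", exponential), ("log", log_loss)]:
    print(name, fn(z))
```

At z = -4 the exponential loss is the largest, followed by the squared hinge, the modified Huber and the hinge loss, which matches the ordering of their sensitivity to outliers described in the list.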

References

[1] Andrew Ng, Advice for applying Machine Learning: https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf

[2] Advice for applying Machine Learning: https://jmetzen.github.io/2015-01-29/ml_advice.html

[3] Liblinear does not support L1-regularized L1-loss ( hinge loss ) support vector classification. Why?: https://www.quora.com/Support-Vector-Machines/Liblinear-does-not-support-L1-regularized-L1-loss-hinge-loss-support-vector-classification-Why
