机器学习系列(4)_机器学习算法一览，应用建议与解决思路

2.机器学习算法简述

• 监督学习算法

• 无监督学习

• 半监督学习

2.2 从算法的功能角度分类

2.2.1 回归算法(Regression Algorithms)

• Ordinary Least Squares Regression (OLSR)
• Linear Regression
• Logistic Regression
• Stepwise Regression
• Locally Estimated Scatterplot Smoothing (LOESS)
• Multivariate Adaptive Regression Splines (MARS)

2.2.2 基于实例的算法(Instance-based Algorithms)

• k-Nearest Neighbour (kNN)
• Learning Vector Quantization (LVQ)
• Self-Organizing Map (SOM)
• Locally Weighted Learning (LWL)

2.2.3 决策树类算法(Decision Tree Algorithms)

• Classification and Regression Tree (CART)
• Iterative Dichotomiser 3 (ID3)
• C4.5 and C5.0 (different versions of a powerful approach)
• Chi-squared Automatic Interaction Detection (CHAID)
• M5
• Conditional Decision Trees

2.2.4 贝叶斯类算法(Bayesian Algorithms)

• Naive Bayes
• Gaussian Naive Bayes
• Multinomial Naive Bayes
• Averaged One-Dependence Estimators (AODE)
• Bayesian Belief Network (BBN)
• Bayesian Network (BN)

2.2.5 聚类算法(Clustering Algorithms)

• k-Means
• Hierarchical Clustering
• Expectation Maximisation (EM)

2.2.6 关联规则算法(Association Rule Learning Algorithms)

• Apriori algorithm
• Eclat algorithm

2.2.7 人工神经网络类算法(Artificial Neural Network Algorithms)

• Perceptron
• Back-Propagation
• Radial Basis Function Network (RBFN)

2.2.8 深度学习(Deep Learning Algorithms)

• Deep Boltzmann Machine (DBM)
• Deep Belief Networks (DBN)
• Convolutional Neural Network (CNN)
• Stacked Auto-Encoders

2.2.9 降维算法(Dimensionality Reduction Algorithms)

• Principal Component Analysis (PCA)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• Sammon Mapping
• Multidimensional Scaling (MDS)
• Linear Discriminant Analysis (LDA)
• Mixture Discriminant Analysis (MDA)
• Flexible Discriminant Analysis (FDA)

2.2.10 模型融合算法(Ensemble Algorithms)

• Random Forest
• Boosting
• Bootstrapped Aggregation (Bagging)
• Stacked Generalization (blending)
• Gradient Boosted Regression Trees (GBRT)

2.3 机器学习算法使用图谱

scikit-learn作为一个丰富的python机器学习库，实现了绝大多数机器学习的算法，有相当多的人在使用，于是我这里很无耻地把machine learning cheat sheet for sklearn搬过来了，原文可以看这里。哈哈，既然讲机器学习，我们就用机器学习的语言来解释一下，这是针对实际应用场景的各种条件限制，对scikit-learn里完成的算法构建的一颗决策树，每一组条件都是对应一条路径，能找到相对较为合适的一些解决方法，具体如下：

3. 机器学习问题解决思路

• 拿到数据后怎么了解数据(可视化)
• 选择最贴切的机器学习算法
• 定位模型状态(过/欠拟合)以及解决方法
• 大量极的数据的特征分析与可视化
• 各种损失函数(loss function)的优缺点及如何选择

3.1 数据与可视化

#numpy科学计算工具箱
import numpy as np
#使用make_classification构造1000个样本，每个样本有20个feature
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
n_redundant=2, n_classes=2, random_state=0)
#存为dataframe格式
from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])),columns = range(20) + ["class"])


df[:6]


import matplotlib.pyplot as plt
import seaborn as sns
#使用pairplot去看不同特征维度pair下数据的空间分布状况
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)
plt.show()


import matplotlib.pyplot as plt
plt.figure(figsize=(12, 10))
_ = sns.corrplot(df, annot=False)
plt.show()


3.2 机器学习算法选择

from sklearn.svm import LinearSVC
from sklearn.learning_curve import learning_curve
#绘制学习曲线，以确定模型的状况
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
train_sizes=np.linspace(.1, 1.0, 5)):
"""
画出data在某模型上的learning curve.
参数解释
----------
estimator : 你用的分类器。
title : 表格的标题。
X : 输入的feature，numpy类型
y : 输入的target vector
ylim : tuple格式的(ymin, ymax), 设定图像中纵坐标的最低点和最高点
cv : 做cross-validation的时候，数据分成的份数，其中一份作为cv集，其余n-1份作为training(默认为3份)
"""

plt.figure()
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=5, n_jobs=1, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")

plt.xlabel("Training examples")
plt.ylabel("Score")
plt.legend(loc="best")
plt.grid("on")
if ylim:
plt.ylim(ylim)
plt.title(title)
plt.show()

#少样本的情况情况下绘出学习曲线
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
X, y, ylim=(0.8, 1.01),
train_sizes=np.linspace(.05, 0.2, 5))


3.2.1 过拟合的定位与解决

• 增大样本量

#增大一些样本量
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
X, y, ylim=(0.8, 1.1),
train_sizes=np.linspace(.1, 1.0, 5))


• 减少特征的量(只用我们觉得有效的特征)

plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0) Features: 11&14", X[:, [11, 14]], y, ylim=(0.8, 1.0), train_sizes=np.linspace(.05, 0.2, 5))


from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# SelectKBest(f_classif, k=2) 会根据Anova F-value选出 最好的k=2个特征

plot_learning_curve(Pipeline([("fs", SelectKBest(f_classif, k=2)), # select two features
("svc", LinearSVC(C=10.0))]), "SelectKBest(f_classif, k=2) + LinearSVC(C=10.0)", X, y, ylim=(0.8, 1.0), train_sizes=np.linspace(.05, 0.2, 5))


• 增强正则化作用(比如说这里是减小LinearSVC中的C参数)
正则化是我认为在不损失信息的情况下，最有效的缓解过拟合现象的方法。
plot_learning_curve(LinearSVC(C=0.1), "LinearSVC(C=0.1)", X, y, ylim=(0.8, 1.0), train_sizes=np.linspace(.05, 0.2, 5))


from sklearn.grid_search import GridSearchCV
estm = GridSearchCV(LinearSVC(),
param_grid={"C": [0.001, 0.01, 0.1, 1.0, 10.0]})
plot_learning_curve(estm, "LinearSVC(C=AUTO)",
X, y, ylim=(0.8, 1.0),
train_sizes=np.linspace(.05, 0.2, 5))
print "Chosen parameter on 100 datapoints: %s" % estm.fit(X[:500], y[:500]).best_params_


• l2正则化，它对于最后的特征权重的影响是，尽量打散权重到每个特征维度上，不让权重集中在某些维度上，出现权重特别高的特征。
• 而l1正则化，它对于最后的特征权重的影响是，让特征获得的权重稀疏化，也就是对结果影响不那么大的特征，干脆就拿不着权重。

plot_learning_curve(LinearSVC(C=0.1, penalty='l1', dual=False), "LinearSVC(C=0.1, penalty='l1')", X, y, ylim=(0.8, 1.0), train_sizes=np.linspace(.05, 0.2, 5))


estm = LinearSVC(C=0.1, penalty='l1', dual=False)
estm.fit(X[:450], y[:450])  # 用450个点来训练
print "Coefficients learned: %s" % est.coef_
print "Non-zero coefficients: %s" % np.nonzero(estm.coef_)[1]


Coefficients learned: [[ 0.          0.          0.          0.          0.          0.01857999
0.          0.          0.          0.004135    0.          1.05241369
0.01971419  0.          0.          0.          0.         -0.05665314
0.14106505  0.        ]]
Non-zero coefficients: [5 9 11 12 17 18]


3.2.2 欠拟合定位与解决

#构造一份环形数据
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, random_state=2)
#绘出学习曲线
plot_learning_curve(LinearSVC(C=0.25),"LinearSVC(C=0.25)",X, y, ylim=(0.5, 1.0),train_sizes=np.linspace(.1, 1.0, 5))


f = DataFrame(np.hstack((X, y[:, None])), columns = range(2) + ["class"])
_ = sns.pairplot(df, vars=[0, 1], hue="class", size=3.5)


• 调整你的特征(找更有效的特征！！)
比如说我们观察完现在的数据分布，然后我们先对数据做个映射：
# 加入原始特征的平方项作为新特征
X_extra = np.hstack((X, X[:, [0]]**2 + X[:, [1]]**2))

plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25) + distance feature", X_extra, y, ylim=(0.5, 1.0), train_sizes=np.linspace(.1, 1.0, 5))


• 使用更复杂一点的模型(比如说用非线性的核函数)
我们对模型稍微调整了一下，用了一个复杂一些的非线性rbf kernel：
from sklearn.svm import SVC
# note: we use the original X without the extra feature
plot_learning_curve(SVC(C=2.5, kernel="rbf", gamma=1.0), "SVC(C=2.5, kernel='rbf', gamma=1.0)",X, y, ylim=(0.5, 1.0), train_sizes=np.linspace(.1, 1.0, 5))


3.3 关于大数据样本集和高维特征空间

3.3.1 大数据情形下的模型选择与学习曲线

SGDClassifier每次只使用一部分(mini-batch)做训练，在这种情况下，我们使用交叉验证(cross-validation)并不是很合适，我们会使用相对应的progressive validation：简单解释一下，estimator每次只会拿下一个待训练batch在本次做评估，然后训练完之后，再在这个batch上做一次评估，看看是否有优化。

#生成大样本，高纬度特征数据
X, y = make_classification(200000, n_features=200, n_informative=25, n_redundant=0, n_classes=10, class_sep=2, random_state=0)

#用SGDClassifier做训练，并画出batch在训练前后的得分差
from sklearn.linear_model import SGDClassifier
est = SGDClassifier(penalty="l2", alpha=0.001)
progressive_validation_score = []
train_score = []
for datapoint in range(0, 199000, 1000):
X_batch = X[datapoint:datapoint+1000]
y_batch = y[datapoint:datapoint+1000]
if datapoint > 0:
progressive_validation_score.append(est.score(X_batch, y_batch))
est.partial_fit(X_batch, y_batch, classes=range(10))
if datapoint > 0:
train_score.append(est.score(X_batch, y_batch))

plt.plot(train_score, label="train score")
plt.plot(progressive_validation_score, label="progressive validation score")
plt.xlabel("Mini-batch")
plt.ylabel("Score")
plt.legend(loc='best')
plt.show()


3.3.2 大数据量下的可视化

#直接从sklearn中load数据集
X = digits.data
y = digits.target
n_samples, n_features = X.shape
print "Dataset consist of %d samples with %d features each" % (n_samples, n_features)

# 绘制数字示意图
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
ix = 10 * i + 1
for j in range(n_img_per_row):
iy = 10 * j + 1
img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
_ = plt.title('A selection from the 8*8=64-dimensional digits dataset')
plt.show()


#import所需的package
from sklearn import (manifold, decomposition, random_projection)
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)

#定义绘图函数
from matplotlib import offsetbox
def plot_embedding(X, title=None):
x_min, x_max = np.min(X, 0), np.max(X, 0)
X = (X - x_min) / (x_max - x_min)

plt.figure(figsize=(10, 10))
ax = plt.subplot(111)
for i in range(X.shape[0]):
plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
color=plt.cm.Set1(y[i] / 10.),
fontdict={'weight': 'bold', 'size': 12})

if hasattr(offsetbox, 'AnnotationBbox'):
# only print thumbnails with matplotlib > 1.0
shown_images = np.array([[1., 1.]])  # just something big
for i in range(digits.data.shape[0]):
dist = np.sum((X[i] - shown_images) ** 2, 1)
if np.min(dist) < 4e-3:
# don't show points that are too close
continue
shown_images = np.r_[shown_images, [X[i]]]
imagebox = offsetbox.AnnotationBbox(
offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
X[i])
plt.xticks([]), plt.yticks([])
if title is not None:
plt.title(title)

#记录开始时间
start_time = time.time()
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, "Random Projection of the digits (time: %.3fs)" % (time.time() - start_time))


PCA降维

from sklearn import (manifold, decomposition, random_projection)
#TruncatedSVD 是 PCA的一种实现
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
#记录时间
start_time = time.time()
plot_embedding(X_pca,"Principal Components projection of the digits (time: %.3fs)" % (time.time() - start_time))


from sklearn import (manifold, decomposition, random_projection)
#降维
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
start_time = time.time()
X_tsne = tsne.fit_transform(X)
#绘图
plot_embedding(X_tsne,
"t-SNE embedding of the digits (time: %.3fs)" % (time.time() - start_time))


3.4 损失函数的选择

import numpy as np
import matplotlib.plot as plt
# 改自http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_loss_functions.html
xmin, xmax = -4, 4
xx = np.linspace(xmin, xmax, 100)
plt.plot([xmin, 0, 0, xmax], [1, 1, 0, 0], 'k-',
label="Zero-one loss")
plt.plot(xx, np.where(xx < 1, 1 - xx, 0), 'g-',
label="Hinge loss")
plt.plot(xx, np.log2(1 + np.exp(-xx)), 'r-',
label="Log loss")
plt.plot(xx, np.exp(-xx), 'c-',
label="Exponential loss")
plt.plot(xx, -np.minimum(xx, 0), 'm-',
label="Perceptron loss")

plt.ylim((0, 8))
plt.legend(loc="upper right")
plt.xlabel(r"Decision function $f(x)$")
plt.ylabel("$L(y, f(x))$")
plt.show()


• **0-1损失函数(zero-one loss)**非常好理解，直接对应分类问题中判断错的个数。但是比较尴尬的是它是一个非凸函数，这意味着其实不是那么实用。
• hinge loss(SVM中使用到的)的健壮性相对较高(对于异常点/噪声不敏感)。但是它没有那么好的概率解释。
• **log损失函数(log-loss)**的结果能非常好地表征概率分布。因此在很多场景，尤其是多分类场景下，如果我们需要知道结果属于每个类别的置信度，那这个损失函数很适合。缺点是它的健壮性没有那么强，相对hinge loss会对噪声敏感一些。