Model Metrics
Loss measures how good a model's predictions are in supervised learning.
Some metrics for classification:
Accuracy: the fraction of correct predictions
sum(y==y_hat)/y.size
Precision: among the examples predicted as a given class, the fraction that truly belong to that class
sum((y_hat==1)&(y==1))/sum(y_hat==1)
Recall: among the examples that truly belong to a given class, the fraction predicted as that class
sum((y_hat==1)&(y==1))/sum(y==1)
F1: the harmonic mean of precision and recall: 2pr/(p + r)
For binary classification:
ROC curve (receiver operating characteristic): each point on the curve reflects the sensitivity to the same signal stimulus, i.e. the trade-off at one decision threshold.
x-axis: false positive rate (FPR), the fraction of actual negatives that are predicted positive; specificity TNR = 1 - FPR
y-axis: true positive rate (TPR) = Recall, also called sensitivity (coverage of the positive class)
AUC (Area Under the Curve): the area under the ROC curve, between 0 and 1 (0.5 corresponds to random guessing); the larger, the better
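A minimal sketch of computing these metrics with sklearn; the labels y and scores y_score below are made up purely for illustration:

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])                          # true labels
y_score = np.array([0.9, 0.3, 0.6, 0.8, 0.4, 0.2, 0.35, 0.7])   # predicted scores
y_hat = (y_score >= 0.5).astype(int)                             # thresholded predictions

print(accuracy_score(y, y_hat))    # sum(y == y_hat) / y.size
print(precision_score(y, y_hat))   # TP / (TP + FP)
print(recall_score(y, y_hat))      # TP / (TP + FN)
print(f1_score(y, y_hat))          # 2pr / (p + r)
print(roc_auc_score(y, y_score))   # area under the ROC curve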
Underfitting & Overfitting
- Training error: model error on the training data
- Generalization error: model error on new data
Model Complexity
The capacity of a set of functions to fit data points
- The number of learnable parameters
- The value range for those parameters
Data Complexity
- number of examples
- number of features in each example
- the separability of the classes
Generalization error
Model Validation
The test dataset is never seen by the model and may be used only once.
The validation dataset is usually a held-out subset of the data; it is not used for training and can be reused for hyperparameter tuning.
Beware of datasets that are not randomly distributed when splitting.
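A minimal sketch of carving out validation and test sets with sklearn's train_test_split; the synthetic data and the 60/20/20 proportions are assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# hold out 20% as the test set (looked at only once, at the very end)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# carve a validation set out of the remaining 80% for hyperparameter tuning
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)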
K-fold cross validation
Split the original dataset into K equal parts. Hold out one part as the test set and train on the rest, then compute the model's accuracy on the held-out part. Repeat with a different part held out each time, and report the average accuracy as the final model accuracy.
from sklearn.model_selection import KFold
# n_splits: how many folds; shuffle: whether to shuffle the data; random_state: whether to fix the random seed
KFold(n_splits=2, shuffle=False, random_state=None)
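A minimal sketch of the K-fold loop described above; the synthetic dataset and the decision-tree model are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(np.mean(scores))   # average accuracy over the 5 folds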
- Common mistakes
Bias & Variance
Bias: the gap between the learned model and the true model.
Variance: the variability among the learned model's own results (e.g. across different training samples).
The cross terms in the decomposition are all zero.
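A sketch of the decomposition this refers to, assuming squared loss and data y = f(x) + ε with E[ε] = 0 and Var(ε) = σ² (this notation is an assumption, not from the notes):

E[(y - ŷ)²] = (f - E[ŷ])² + E[(ŷ - E[ŷ])²] + σ²
            = Bias² + Variance + irreducible noise

The cross terms, such as E[(f - E[ŷ])(E[ŷ] - ŷ)], vanish because E[ŷ - E[ŷ]] = 0 and the noise ε is independent of ŷ.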
Bagging (Bootstrap AGGregatING)
Parallel: each base learner is trained on a bootstrap sample (random sampling with replacement).
Use majority voting for classification and simple averaging for regression.
Combines several unstable models into one that is relatively stable, with lower variance.
This relies on an inequality that uses the nonzero variance together with the Cauchy-Schwarz inequality,
i.e. relatively unstable learners should be used as the base learners.
The main effect is to reduce model variance.
import numpy as np
from sklearn.base import clone

class Bagging:
    def __init__(self, base_learner, n_learners):
        # make n_learners independent copies of the base learner
        self.learners = [clone(base_learner) for _ in range(n_learners)]

    def fit(self, X, y):
        for learner in self.learners:
            # bootstrap: sample len(X) indices with replacement
            examples = np.random.choice(
                np.arange(len(X)), int(len(X)), replace=True)
            learner.fit(X.iloc[examples, :], y.iloc[examples])

    def predict(self, X):
        preds = [learner.predict(X) for learner in self.learners]
        # regression: average the base learners' predictions
        return np.array(preds).mean(axis=0)
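A possible usage sketch for the class above; the synthetic dataset and the decision-tree base learner are assumptions (X and y are wrapped in pandas objects because fit indexes with .iloc):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_arr, y_arr = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
X, y = pd.DataFrame(X_arr), pd.Series(y_arr)
bag = Bagging(DecisionTreeRegressor(max_depth=5), n_learners=10)
bag.fit(X, y)
print(bag.predict(X)[:5])   # averaged predictions of the 10 trees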
# Random forest: an application of bagging with decision trees as base learners
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
'''
n_estimators: the number of trees
bootstrap: True means train each tree on a bootstrap sample
warm_start: False means fit a completely new forest
class_weight: 'balanced', 'balanced_subsample'
max_samples: the number of samples used to train each base estimator
'''
model_RF = RandomForestClassifier(n_estimators=100, criterion='gini',max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None,
random_state=None, verbose=0, warm_start=False,class_weight=None, ccp_alpha=0.0, max_samples=None)
model_RF = RandomForestRegressor(criterion='mse') # criterion:'mse','mae'
Boosting
Combines several weak models with relatively large bias into a stronger model with smaller bias.
The models are learned sequentially.
The main effect is to reduce model bias.
- AdaBoost
Each sub-model tries to boost the overall performance; the sample weights are updated over repeated iterations:
if a sample has already been classified correctly, its weight is lowered when constructing the next training set;
if a sample has been misclassified, its weight is raised.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)
- Gradient Boosting
Decision trees trained by gradient descent: each new tree fits the residual left by the previous ones.
import numpy as np
from sklearn.base import clone

class GradientBoosting:
    def __init__(self, base_learner, n_learners, learning_rate):
        self.learners = [clone(base_learner) for _ in range(n_learners)]
        self.lr = learning_rate

    def fit(self, X, y):
        # the residual starts as the labels themselves
        residual = y.copy()
        for learner in self.learners:
            # fit the original features against the current residual
            learner.fit(X, residual)
            # subtract this learner's training predictions, scaled by the learning rate
            residual -= self.lr * learner.predict(X)

    def predict(self, X):
        preds = [learner.predict(X) for learner in self.learners]
        # sum the base learners' predictions, scaled by the learning rate
        return np.array(preds).sum(axis=0) * self.lr
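A possible usage sketch for this class, again on a toy regression problem; the base learner and hyperparameters are assumptions (y is a float Series, so the in-place residual update works):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_arr, y_arr = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
X, y = pd.DataFrame(X_arr), pd.Series(y_arr)
gb = GradientBoosting(DecisionTreeRegressor(max_depth=3), n_learners=50, learning_rate=0.1)
gb.fit(X, y)
print(gb.predict(X)[:5])   # scaled sum of the 50 trees' predictions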
# gbdt
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
# loss: 'deviance' turns the regression-tree outputs into class probabilities via logistic regression; 'exponential' is equivalent to AdaBoost
# subsample: <1 is the fraction of the samples used; training on a subsample lowers variance but raises bias
# criterion: 'friedman_mse', 'mse', 'mae' - the measure of split quality
# n_iter_no_change: early stopping - the number of iterations with no improvement in the validation score
# Classifier
model_gb = GradientBoostingClassifier( loss='deviance', learning_rate=0.1, n_estimators=100,
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
min_impurity_decrease=0.0, init=None,
random_state=None, max_features=None, verbose=0,
max_leaf_nodes=None, warm_start=False, validation_fraction=0.1,
n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
# Regressor
model_gb = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100,
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
min_impurity_decrease=0.0, min_impurity_split=None, init=None,
random_state=None, max_features=None, verbose=0,
max_leaf_nodes=None, warm_start=False, validation_fraction=0.1,
n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
Stacking
Multi-layer stacking overfits easily,
so each level is trained on different data: every level repeatedly uses K-fold bagging to produce the dataset for training the next level, which is computationally expensive.
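A minimal single-level sketch with sklearn's StackingClassifier, which feeds cross-validated (out-of-fold) predictions of the base estimators to the final estimator; the particular estimators chosen here are assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50)),
                ('svc', SVC())],
    final_estimator=LogisticRegression(),
    cv=5)   # 5-fold out-of-fold predictions train the final estimator
stack.fit(X, y)
print(stack.score(X, y))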
Course reference: Mu Li (李沐), Stanford Fall 2021 "Practical Machine Learning" (a Chinese version is available on Bilibili)
English course homepage: https://c.d2l.ai/stanford-cs329p/