Modeling and Hyperparameter Tuning

yifanBond

Category: Algorithms. Tags: python, machine learning

Article link: https://blog.csdn.net/GG___Bond/article/details/109273245

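Preparation
1 Building a model and its attributes and methods
2 Evaluating the model
3 Hyperparameter tuning
- 3.1 Greedy tuning
- 3.2 Bayesian optimization

Preparation

Reducing the memory footprint of the data

The reduce_mem_usage function downcasts each column to a smaller data type, which reduces the memory the data occupies.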

import numpy as np


def reduce_mem_usage(df):
    """Iterate through all the columns of a DataFrame and downcast each one
    to the smallest dtype that can hold its values, to reduce memory usage.
    """
    # memory_usage() reports bytes, so convert to MB before printing.
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range covers the column.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Note: float16 keeps only about 3 significant decimal digits,
                # so this downcast trades precision for memory.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns are stored as categoricals.
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

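Key concepts

N-fold cross-validation

The full dataset can be divided into three parts: a training set (train), a validation set (valid), and a test set (test). N-fold cross-validation splits the training data into N groups. Each group in turn serves as the validation set while the remaining N-1 groups form the training set. This yields N validation scores, and their mean is the model's performance metric under N-fold cross-validation.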
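The procedure above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names `k_fold_indices` and `cross_validate_mean_model` are hypothetical, the "model" simply predicts the mean of the training targets, and MAE stands in for the validation score.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate_mean_model(y, n_folds=5):
    """Return one validation MAE per fold for a mean-value predictor."""
    folds = k_fold_indices(len(y), n_folds)
    scores = []
    for k, valid_idx in enumerate(folds):
        # Every fold except fold k goes into the training set.
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        prediction = sum(y[i] for i in train_idx) / len(train_idx)  # "fit"
        mae = sum(abs(y[i] - prediction) for i in valid_idx) / len(valid_idx)
        scores.append(mae)
    return scores


y = [float(v) for v in range(20)]          # toy targets, assumed for the demo
fold_scores = cross_validate_mean_model(y, n_folds=5)
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores)
print(mean_score)
```

Each sample is used for validation exactly once, and the final metric is the average of the five per-fold scores.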

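1 Building a model and its attributes and methods

Steps
(1) Instantiate the model through its API.
(2) Fit: call the fit method on the training set to obtain a fitted model.
(3) Attributes of the fitted model: the intercept, regression coefficients, and so on can be read out.
(4) Use the fitted model to predict the test samples and visualize the results.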
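The four steps can be sketched with scikit-learn's `LinearRegression`. The synthetic data below is an assumption made for illustration (a noise-free linear target), and the visualization step is left out to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed for illustration: y = 3*x0 - 2*x1 + 1, no noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = 3 * X_train[:, 0] - 2 * X_train[:, 1] + 1

model = LinearRegression()            # (1) instantiate the model
model.fit(X_train, y_train)           # (2) fit on the training set
print(model.intercept_, model.coef_)  # (3) inspect intercept and coefficients

X_test = rng.normal(size=(5, 2))
y_pred = model.predict(X_test)        # (4) predict on unseen samples
print(y_pred)
```

Because the target is an exact linear function of the features, the recovered coefficients should match the generating ones up to floating-point error.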

For details, see: Linear-Regression (linear regression).

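2 Evaluating the model

2.1 Common evaluation functions

Mean absolute error (MAE)

mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
Returns: loss: non-negative float or ndarray of non-negative floats

multioutput:
(1) 'uniform_average': the errors of all outputs are averaged with equal weights; returns a non-negative float.
(2) 'raw_values': the error of each output is returned separately, as an ndarray.
(3) array-like of shape (n_outputs,): the given weights are used to compute a weighted average of the per-output errors; returns a non-negative float.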
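The three `multioutput` options can be compared on a small two-output example. The numbers are toy values chosen so the results are easy to check by hand: the per-output errors are 0.5 and 1.0.

```python
from sklearn.metrics import mean_absolute_error

# Two outputs per sample (toy values, assumed for illustration).
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

# Equal-weight mean over both outputs: (0.5 + 1.0) / 2
mae_uniform = mean_absolute_error(y_true, y_pred)

# One error per output, returned as an ndarray.
mae_raw = mean_absolute_error(y_true, y_pred, multioutput='raw_values')

# Weighted mean over outputs: 0.3 * 0.5 + 0.7 * 1.0
mae_weighted = mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])

print(mae_uniform)   # 0.75
print(mae_raw)       # [0.5 1. ]
print(mae_weighted)  # 0.85 (up to floating-point rounding)
```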

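2.2 Choosing a model evaluation tool

For example: sklearn.model_selection.GridSearchCV and sklearn.model_selection.cross_val_score (the older grid_search and cross_validation modules were removed in scikit-learn 0.20).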

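2.3 Designing a model evaluation function

To supply a custom scoring parameter, use make_scorer() to build a scorer object.

API

sklearn.metrics.make_scorer()
Two ways to use it:
(1) Convert an existing metrics function directly into a callable scorer object.
(2) Wrap a custom function with the signature score_func(y, y_pred, **kwargs) that returns a float.

make_scorer(score_func, *, greater_is_better=True, needs_proba=False, needs_threshold=False, **kwargs)
Returns: scorer: a callable object

Main parameters

Parameter	Type	Meaning
score_func	score function or loss function	For functions ending in _score, larger is better; for functions ending in _error, smaller is better; a custom function can also be passed in
greater_is_better	bool	Whether a larger value means a better result; when set to False (for a loss function), the scorer negates the value returned by score_func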
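Both usages of `make_scorer` can be sketched as follows. The custom metric `max_abs_error` and the tiny dataset are hypothetical, introduced only for illustration. Since both metrics are losses, `greater_is_better=False` is passed, so the scorers return negated values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error

# (1) Wrap an existing metric. MAE is a loss, so smaller is better.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)


# (2) Wrap a custom function (hypothetical metric: largest absolute error).
def max_abs_error(y, y_pred):
    return float(np.max(np.abs(np.asarray(y) - np.asarray(y_pred))))


max_err_scorer = make_scorer(max_abs_error, greater_is_better=False)

# Tiny toy dataset, assumed for illustration.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])
model = LinearRegression().fit(X, y)

# A scorer is called as scorer(estimator, X, y), not scorer(y_true, y_pred).
print(mae_scorer(model, X, y))      # negative MAE
print(max_err_scorer(model, X, y))  # negative maximum absolute error
```

The sign flip lets tools such as GridSearchCV always maximize the scorer, whether the underlying function is a score or a loss.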

2.4 N-fold cross-validation

API

sklearn.model_selection.cross_val_score()

cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
Returns: array of shape (len(list(cv)),), one score per fold

Parameters

Parameter	Type	Meaning
estimator	estimator	The model instance to evaluate
X	array-like of shape (n_samples, n_features)	The features (list or array)
y	array-like of shape (n_samples,) or (n_samples, n_outputs)	The labels
scoring	str or callable	The function used for evaluation
cv	int, cross-validation generator or iterable	The number of folds; defaults to 5-fold
n_jobs	int	The number of CPUs working in parallel (-1 means all)
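A minimal call might look like the sketch below, which runs 5-fold cross-validation of a linear model with MAE as the metric. The synthetic regression data is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data, assumed for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    scoring=make_scorer(mean_absolute_error),  # reports the raw (positive) MAE
    cv=5,                                      # 5-fold cross-validation
)
print(scores.shape)   # one score per fold
print(scores.mean())  # the cross-validated MAE
```

Here the scorer keeps the default `greater_is_better=True`, so the returned values are plain MAEs, which is convenient for reporting. When the scorer drives model selection instead, a loss should be wrapped with `greater_is_better=False`.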

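2.5 Visualization

Learning curve

Independent variable: the size of the training set.
Dependent variables: the cross-validated training score and validation score.
Variance: how much the fitted estimator changes when the training set changes.
A learning curve shows the trade-off between bias and variance. With a very small training set, the model fits the training samples almost perfectly (high training score) but generalizes poorly (low validation score), which indicates high variance. As the training set grows, the training score typically falls and the validation score rises, so the gap between them narrows. A large gap that persists points to high variance (overfitting), while two curves that converge at a poor score point to high bias (underfitting). The learning curve helps locate the point where bias and variance are balanced.

API

sklearn.model_selection.learning_curve(estimator, X, y, *, groups=None, train_sizes=array([0.1, 0.33, 0.55, 0.78, 1. ]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=None, pre_dispatch='all', verbose=0, shuffle=False, random_state=None, error_score=nan, return_times=False)
Returns:
train_sizes_abs: ndarray of shape (n_unique_ticks,)
train_scores: ndarray of shape (n_ticks, n_cv_folds)
test_scores: ndarray of shape (n_ticks, n_cv_folds)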
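The following sketch computes a learning curve and averages the scores over the folds. The synthetic data and the choice of `'neg_mean_absolute_error'` as the metric are assumptions for illustration. Plotting is omitted; the two mean curves can be passed to any plotting library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression data, assumed for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

train_sizes_abs, train_scores, test_scores = learning_curve(
    LinearRegression(),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the training folds
    cv=5,
    scoring='neg_mean_absolute_error',     # higher (closer to 0) is better
)

# Average over the folds (axis 1) to get one point per training size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes_abs, train_mean, test_mean):
    print(n, round(tr, 3), round(te, 3))
```

With 300 samples and 5 folds, each training split holds 240 samples, so the largest training size on the curve is 240.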

3 Hyperparameter tuning

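3.1 Greedy tuning

Greedy tuning optimizes one hyperparameter at a time while holding the others fixed, so it may settle on a local optimum rather than the global optimum.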
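The local-optimum risk can be shown with a toy example. The score table below is entirely hypothetical (two hyperparameters `a` and `b`, higher score is better), constructed so that tuning one parameter at a time misses the best combination that an exhaustive grid search finds.

```python
from itertools import product

# Hypothetical validation scores (higher is better) for each (a, b) setting.
SCORES = {
    (0, 0): 0.50, (1, 0): 0.60, (2, 0): 0.40,
    (0, 1): 0.55, (1, 1): 0.65, (2, 1): 0.50,
    (0, 2): 0.50, (1, 2): 0.60, (2, 2): 0.90,
}
A_VALUES = [0, 1, 2]
B_VALUES = [0, 1, 2]


def greedy_search(b_start=0):
    """Tune a with b fixed at b_start, then tune b with the chosen a fixed."""
    best_a = max(A_VALUES, key=lambda a: SCORES[(a, b_start)])
    best_b = max(B_VALUES, key=lambda b: SCORES[(best_a, b)])
    return (best_a, best_b), SCORES[(best_a, best_b)]


def grid_search():
    """Evaluate every combination and keep the best one."""
    best = max(product(A_VALUES, B_VALUES), key=lambda p: SCORES[p])
    return best, SCORES[best]


print(greedy_search())  # ((1, 1), 0.65)
print(grid_search())    # ((2, 2), 0.9)
```

The greedy search needs only 6 evaluations against 9 for the grid, and the saving grows quickly with more parameters, but in this table it stops at (1, 1) because no single-parameter change from there improves the score.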

3.2 Bayesian optimization

API
Package to install: bayesian-optimization
Entry point: bayes_opt.BayesianOptimization()
