2021-04-22

DK.Leng

Article tags: algorithms, python

Article link: https://blog.csdn.net/xuheng_____1/article/details/116005530

Datawhale Introduction to Data Mining: TASK4 - Modeling and Parameter Tuning

4.2 Content overview

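1. Linear regression model:
What linear regression requires of its features;
Handling long-tailed distributions;
Understanding the linear regression model;
2. Model performance validation:
Evaluation functions and objective functions;
Cross-validation;
Leave-one-out validation;
Validation for time-series problems;
Plotting learning curves;
Plotting validation curves;
3. Embedded feature selection:
Lasso regression;
Ridge regression;
Decision trees;
4. Model comparison:
Common linear models;
Common nonlinear models;
5. Model parameter tuning:
Greedy tuning;
Grid search tuning;
Bayesian tuning;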

4.3 Code examples

4.3.1 Reading the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
The reduce_mem_usage function reduces the memory a DataFrame occupies by downcasting each column to a smaller data type.


def reduce_mem_usage(df):
    """Iterate through all columns of a DataFrame and downcast each data type
    to reduce memory usage.
    """
    # memory_usage() returns bytes; convert to MB to match the printed unit
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Pick the narrowest float type whose range covers the column
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns are stored as categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df


sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
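The effect of downcasting can be checked on a small example. The following is a minimal sketch on hypothetical toy data (the column names `small_int` and `ratio` are made up for illustration and are not part of data_for_tree.csv). It compares the bytes used by two 64-bit columns before and after casting them to narrower types by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: small integers and floats stored in 64-bit columns
df = pd.DataFrame({
    'small_int': np.arange(1000, dtype=np.int64) % 100,  # values 0..99 fit in int8
    'ratio': np.linspace(0.0, 1.0, 1000),                # float64 by default
})

# Bytes used by the column data only (index excluded)
before = df.memory_usage(index=False).sum()

# Cast each column to a narrower dtype that still covers its value range
small = df.copy()
small['small_int'] = small['small_int'].astype(np.int8)
small['ratio'] = small['ratio'].astype(np.float32)

after = small.memory_usage(index=False).sum()

print('before: {} bytes, after: {} bytes'.format(before, after))
print('decreased by {:.1f}%'.format(100 * (before - after) / before))
```

Integer downcasting is lossless as long as the values fit the target range. Float downcasting trades precision for space, so it is worth checking that the reduced precision is acceptable for the features involved.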



continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]

4.3.2 Linear regression, five-fold cross-validation, and simulating a real business scenario
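Before preparing the real data, the workflow this section names (fit a linear regression and score it with five-fold cross-validation) can be sketched on synthetic data. This is only an illustration under stated assumptions: the arrays `X` and `y` are randomly generated stand-ins for the training features and price, and mean absolute error is an assumed choice of metric, not one confirmed by the text above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption: not the competition data)
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 3.0 + rng.normal(scale=0.1, size=500)

model = LinearRegression()

# cv=5 splits the data into five folds; each fold serves once as the validation set
scores = cross_val_score(model, X, y, cv=5,
                         scoring=make_scorer(mean_absolute_error))

print('MAE per fold:', np.round(scores, 3))
print('mean MAE: {:.3f}'.format(scores.mean()))
```

Reporting the per-fold scores alongside the mean makes it easier to see whether the model's error is stable across different subsets of the data.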

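```python
# Drop rows with missing values, map the '-' placeholder to 0, and rebuild the index
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
# Ensure notRepairedDamage is numeric (float32) before modelling
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

# Split into the feature matrix and the target
train_X = train[continuous_feature_names]
train_y = train['price']
```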
