Tianchi Data Mining -- Modeling and Hyperparameter Tuning

浮汐

Article link: https://blog.csdn.net/xfxlesson/article/details/105250366

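This article walks through modeling and hyperparameter tuning in data mining, covering linear regression models, model performance validation, embedded feature selection, model comparison, and tuning methods. Using worked examples, it shows how to handle long-tailed distributions, apply cross-validation, and perform feature selection, and it compares different models and tuning strategies in order to improve prediction accuracy.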

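Overview

1. Linear regression model:

Feature requirements for linear regression;
Handling long-tailed distributions;
Understanding the linear regression model;

2. Model performance validation:

Evaluation functions and objective functions;
Cross-validation;
Leave-one-out validation;
Validation for time-series problems;
Plotting learning curves;
Plotting validation curves;

3. Embedded feature selection:

Lasso regression;
Ridge regression;
Decision trees;

4. Model comparison:

Common linear models;
Common nonlinear models;

5. Hyperparameter tuning:

Greedy tuning;
Grid search;
Bayesian optimization;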
Modeling

Reducing the memory used by the data

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # memory_usage() reports bytes; convert to MB to match the printed unit
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floats; narrower floats also lose precision
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns are stored as categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
    
# Load the stored data and shrink its memory footprint
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))

# Select the feature columns (everything except the target and the categorical columns)
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model','name', 'bodyType', 'fuelType', 'notRepairedDamage']]
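
To make the effect of the downcasting step concrete, here is a minimal sketch on a toy DataFrame. The column names and the helper `smallest_int_dtype` are hypothetical and not part of the original code; the helper only mirrors the range-check idea, using inclusive bounds. The last lines illustrate one caveat worth keeping in mind: float16 cannot represent every integer above 2048, so casting float columns that narrowly can cost precision.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype whose range covers [c_min, c_max].

    Hypothetical helper for illustration only.
    """
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= c_min and c_max <= info.max:
            return dtype
    return np.int64

# Toy data (made up for this example, not the competition file)
toy = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),         # values 0..99
    "medium_int": np.arange(100, dtype=np.int64) * 300,  # values 0..29700
})

before = toy.memory_usage().sum()  # bytes

for col in toy.columns:
    target = smallest_int_dtype(toy[col].min(), toy[col].max())
    toy[col] = toy[col].astype(target)

after = toy.memory_usage().sum()  # bytes

print(toy.dtypes)
print("bytes before: {}, bytes after: {}".format(before, after))

# Precision caveat: 2049 is not representable in float16
rounded = float(np.float16(2049.0))
print("float16(2049.0) ->", rounded)
```

For columns where exact values matter, float32 may be the safer lower bound than float16.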

2. Linear regression & five-fold cross-validation & simulating a real business scenario

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature = sample_feature.replace('MISSING', 0)
print(sample_feature.head())
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
tr
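
As a rough illustration of the linear regression and five-fold cross-validation named in this section's heading, the sketch below uses synthetic data rather than the competition file. Everything in it is an assumption made for the example: the three features, the coefficients, the use of `log1p` to tame a long-tailed price, and mean absolute error as the score. It is not the author's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic features and a long-tailed, strictly positive "price" (assumed data)
n_samples = 500
X = rng.normal(size=(n_samples, 3))
true_coef = np.array([0.5, -0.3, 0.2])
price = np.exp(8.0 + X @ true_coef + rng.normal(scale=0.1, size=n_samples))

# Log-transform the target so it is closer to a symmetric distribution,
# which suits a linear model better than the raw long-tailed price
y = np.log1p(price)

# Five folds, shuffled once with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One MAE value per fold, measured on the log-transformed target
scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring=make_scorer(mean_absolute_error),
)

print("MAE per fold:", np.round(scores, 4))
print("Mean MAE: {:.4f}".format(scores.mean()))
```

Because the error is computed on `log1p(price)`, predictions would need `np.expm1` applied before being compared with prices on the original scale.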
