Check-in 3

Task04: Model Building and Parameter Tuning (in progress)
https://tianchi.aliyun.com/notebook-ai/detail?postId=95460
4.1 Objective
Master the modeling and parameter-tuning workflow for common machine learning models.
4.2 Content Preparation
4.2.1 Linear regression model:
https://zhuanlan.zhihu.com/p/49480391
Feature requirements of linear regression;
Handling long-tailed distributions;
Understanding the linear regression model;
4.2.2 Model performance validation:
Evaluation functions vs. objective functions;
Cross-validation;
Leave-one-out validation;
Validation for time-series problems;
Plotting learning curves;
Plotting validation curves;
4.2.3 Embedded feature selection:
Lasso regression;
Ridge regression;
Decision trees;
https://zhuanlan.zhihu.com/p/65304798
GBDT model;
https://zhuanlan.zhihu.com/p/45145899
XGBoost model;
https://zhuanlan.zhihu.com/p/86816771
LightGBM model;
https://zhuanlan.zhihu.com/p/89360721

4.2.4 Model comparison:
Common linear models;
Common non-linear models;
4.2.5 Model tuning:
Greedy tuning;
https://www.jianshu.com/p/ab89df9759c8
Always check carefully whether the adopted greedy strategy satisfies the no-aftereffect property (i.e., what happens after a given state does not change earlier states; the outcome depends only on the current state).
Grid-search tuning;
https://blog.csdn.net/weixin_43172660/article/details/83032029
from sklearn.model_selection import GridSearchCV

Iterates over every candidate parameter combination and returns the combination with the best score.
Bayesian tuning;
https://blog.csdn.net/linxid/article/details/81189154
conda install -c conda-forge bayesian-optimization

Bayesian optimization builds a surrogate function (a probability model) of the objective from its past evaluations and uses it to find hyperparameters that minimize the objective.
Unlike random or grid search, it picks the next set of hyperparameters by consulting previous evaluation results, which avoids a lot of wasted work.
Evaluating a set of hyperparameters is expensive, because it requires training the model with those hyperparameters and then evaluating it; many deep learning models take hours or days to train, so the cost is huge.
Bayesian tuning keeps updating its probability model, using past results to "focus" on promising hyperparameters.
4.3 Further Reading
《机器学习》 https://book.douban.com/subject/26708119/
《统计学习方法》 https://book.douban.com/subject/10590856/
《Python大战机器学习》 https://book.douban.com/subject/26987890/
《面向机器学习的特征工程》 https://book.douban.com/subject/26826639/
《数据科学家访谈录》 https://book.douban.com/subject/30129410/
4.4 Experiments
4.4.1 Data preprocessing
Choose appropriate data types to reduce memory usage.
import numpy as np

def reduce_mem_usage(df):
    """Downcast column dtypes to reduce memory usage."""
    start_mem = df.memory_usage().sum()
    print('Memory usage of dataframe is: {:.2f} KB'.format(start_mem / 1024))
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum()
    print('Memory usage after optimization is: {:.2f} KB'.format(end_mem / 1024))
    print('Memory usage reduced: {:.2f} %'.format(100 * (start_mem - end_mem) / start_mem))
    return df

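A minimal usage sketch (the file name is a placeholder; in the notebook, sample_feature is the feature table built in the previous task):

import pandas as pd

# placeholder path: load the feature table from the previous task,
# then downcast its dtypes
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
print(sample_feature.dtypes.value_counts())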

4.4.2 Linear regression example
4.4.2.1 Build the training and test data
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)

train = sample_feature[continuous_feature_names + ['price']]
train_X = sample_feature[continuous_feature_names]
train_y = sample_feature['price']

4.4.2.2 Linear regression
from sklearn.linear_model import LinearRegression

model = LinearRegression(normalize=True)  # note: the normalize argument was removed in scikit-learn 1.2
model = model.fit(train_X, train_y)

Training result:
[figure: predicted vs. true price]
Supposedly you can tell from this that the distribution is long-tailed? All I can see is that the true (T) and predicted (P) values differ quite a bit.
To handle the long tail, apply a log(1+x) transform to the target and look at the new result:
train_y_ln = np.log(train_y + 1)
[figure: fit after the log(1+x) transform]
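One thing to keep in mind: a model trained on the transformed target predicts log(1+price), so predictions must be mapped back to the price scale. A minimal sketch (np.log1p / np.expm1 are the numerically stable equivalents of the transform and its inverse):

# equivalent, numerically stable form of the transform
train_y_ln = np.log1p(train_y)

# assumption: re-fit the linear model on the transformed target,
# then invert the transform to recover prices
model_ln = LinearRegression().fit(train_X, train_y_ln)
pred_price = np.expm1(model_ln.predict(train_X))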
4.4.2.3 5-fold cross-validation
A dataset is commonly split into three parts: a training set, a validation (evaluation) set, and a test set. Holding out part of the training data for intermediate evaluation is the idea behind cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
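The first scorer below wraps MAE in a `log_transfer` helper that is not defined in this excerpt. A plausible definition, consistent with how it is used (an assumption: it moves both the true and predicted values onto the log scale so the two runs are comparable), would be:

from functools import wraps

def log_transfer(func):
    # evaluate the metric on log(y) vs. log(y_pred); nan_to_num guards against log(0)
    @wraps(func)
    def wrapper(y, y_pred):
        return func(np.log(y), np.nan_to_num(np.log(y_pred)))
    return wrapper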

Five-fold cross-validation with the linear regression model on the untransformed target (error: 1.36):
scores = cross_val_score(model,
                         X=train_X, y=train_y,
                         verbose=1, cv=5,
                         scoring=make_scorer(log_transfer(mean_absolute_error)))
print(np.mean(scores))

Five-fold cross-validation with the linear regression model on the log-transformed target (error: 0.19):
scores = cross_val_score(model,
                         X=train_X, y=train_y_ln,
                         verbose=1, cv=5,
                         scoring=make_scorer(mean_absolute_error))
print(np.mean(scores))

Prior knowledge: k-fold splits cannot always be made arbitrarily. Here, for example, the split must respect time order, using earlier data to predict later data.
import datetime

sample_feature = sample_feature.reset_index(drop=True)
split_point = 4 * len(sample_feature) // 5
train = sample_feature.loc[:split_point].dropna()
val = sample_feature.loc[split_point:].dropna()

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
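The MAE reported below presumably comes from fitting on the earlier portion and scoring on the later one; a minimal sketch of that step (assuming the same linear model as above):

model = LinearRegression().fit(train_X, train_y_ln)
print(mean_absolute_error(val_y_ln, model.predict(val_X)))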

MAE=0.19577094341870593
4.4.2.4 Plotting learning and validation curves
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size,
        scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # background grid
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model',
                    train_X[:1000], train_y_ln[:1000],
                    ylim=(0.0, 0.5), cv=5, n_jobs=1)

[figure: learning curve for the linear model]
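validation_curve is imported above but never used. A minimal sketch of a validation curve (an assumption here: sweeping Ridge's alpha, since plain LinearRegression has no regularization strength to vary):

import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

param_range = np.logspace(-3, 3, 7)
train_scores, test_scores = validation_curve(
    Ridge(), train_X[:1000], train_y_ln[:1000],
    param_name='alpha', param_range=param_range,
    cv=5, scoring=make_scorer(mean_absolute_error))

# plot mean cross-validated MAE against the regularization strength
plt.plot(param_range, test_scores.mean(axis=1), 'o-')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('MAE')
plt.show()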
4.4.3 Fitting other models
train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

4.4.3.1 Linear models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(), Ridge(), Lasso()]

res = {}
for model in models:
    print(str(model))
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln,
                             verbose=0,
                             scoring=make_scorer(mean_absolute_error))
    res[model_name] = scores
    print(model_name, 'is finished.')


L1 regularization (Lasso regression) drives the coefficients toward sparsity; in the plot below, for example, power and used_time stand out as important.
[figure: Lasso coefficients]
L2 regularization (Ridge regression) shrinks the coefficients toward small values.
[figure: Ridge coefficients]
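The coefficient plots above can be reproduced roughly as follows (a sketch; the notebook's exact plotting style may differ):

import pandas as pd
import matplotlib.pyplot as plt

model = Lasso().fit(train_X, train_y_ln)   # swap in Ridge() for the L2 plot
coefs = pd.Series(model.coef_, index=continuous_feature_names).sort_values()
coefs.plot.barh(figsize=(8, 6))
plt.xlabel('coefficient')
plt.show()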
4.4.3.2 Non-linear models
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [
    LinearRegression(),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    MLPRegressor(solver='lbfgs', max_iter=100),
    XGBRegressor(n_estimators=100, objective='reg:squarederror'),
    LGBMRegressor(n_estimators=100)
]

import time

result = dict()
for model in models:
    st = time.time()
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                             scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name, 'is finished after', time.time() - st, 's.')
Copy

5-fold results:

         LR        DT        RF        GB        MLP       XGB      LGBM
cv1    0.1908    0.2002    0.1329    0.1689    760.564   0.1700    0.1415
cv2    0.1938    0.1921    0.1346    0.1718    599.426   0.1718    0.1455
cv3    0.1941    0.1903    0.1336    0.1709    2673.33   0.1721    0.1439
cv4    0.1918    0.1908    0.1324    0.1691    267.398   0.1696    0.1425
cv5    0.1958    0.2059    0.1374    0.1741    293.916   0.1728    0.1449
t(s)   0.6520   14.1303  864.9356  246.4334   136.5915  47.8760   12.2726
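The cv rows of this table can be assembled directly from the `result` dict, roughly like this (a sketch; the t(s) row comes from the per-model timings printed in the loop above):

import pandas as pd

result_df = pd.DataFrame(result)
result_df.index = ['cv' + str(i) for i in range(1, 6)]
print(result_df)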


As the table shows, RF has the lowest error but is very slow.
LR is extremely fast; its error is worse by comparison, but note that this target can probably be fit reasonably well by a linear model anyway, so the result is still respectable.
LGBM offers the best overall trade-off: both fast and accurate.

4.4.4 Model tuning
This example tunes the LGBM hyperparameters with three common methods.
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

4.4.4.1 Greedy search
best_obj = {}
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln,
                                    verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score

best_leaves = {}
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(),
                                        key=lambda x: x[1])[0],
                          num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln,
                                    verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score

best_depth = {}
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x: x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln,
                                    verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score

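After the three passes, the greedily chosen settings (and the score of the last stage) can be read out like this (a sketch):

best_params = {
    'objective': min(best_obj.items(), key=lambda x: x[1])[0],
    'num_leaves': min(best_leaves.items(), key=lambda x: x[1])[0],
    'max_depth': min(best_depth.items(), key=lambda x: x[1])[0],
}
print(best_params, min(best_depth.values()))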

4.4.4.2 Grid search
from sklearn.model_selection import GridSearchCV

%%time
params = {'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, params, cv=5)
clf = clf.fit(train_X, train_y)  # note: fitted on the raw target here, while the score below uses the log-transformed one
clf.best_params_

model = LGBMRegressor(objective=clf.best_params_['objective'],
                      num_leaves=clf.best_params_['num_leaves'],
                      max_depth=clf.best_params_['max_depth'])

np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                        scoring=make_scorer(mean_absolute_error)))

Very time-consuming:
CPU times: user 39min 58s, sys: 1min 1s, total: 40min 59s Wall time: 41min 5s

4.4.4.3 Bayesian optimization
from bayes_opt import BayesianOptimization

def LGBM_cv(num_leaves, max_depth, subsample, min_child_samples):
    # bayes_opt maximizes the objective, so return 1 - MAE (larger is better)
    val = cross_val_score(
        LGBMRegressor(objective='regression_l1',
                      num_leaves=int(num_leaves),
                      max_depth=int(max_depth),
                      subsample=subsample,
                      min_child_samples=int(min_child_samples)),
        X=train_X, y=train_y_ln, verbose=0, cv=5,
        scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val

LGBM_bo = BayesianOptimization(
    LGBM_cv,
    {
        'num_leaves': (2, 100),
        'max_depth': (2, 100),
        'subsample': (0.1, 1),
        'min_child_samples': (2, 100)
    }
)

Training:

LGBM_bo.maximize()

Result:

1 - LGBM_bo.max['target']
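The winning hyperparameters live in LGBM_bo.max['params']; a sketch of refitting with them (the integer-valued ones need casting back to int, as in LGBM_cv):

best_params = LGBM_bo.max['params']
model = LGBMRegressor(objective='regression_l1',
                      num_leaves=int(best_params['num_leaves']),
                      max_depth=int(best_params['max_depth']),
                      subsample=best_params['subsample'],
                      min_child_samples=int(best_params['min_child_samples']))
print(np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                              scoring=make_scorer(mean_absolute_error))))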
