Check-in 3

Task04: Model Building and Parameter Tuning (in progress)
https://tianchi.aliyun.com/notebook-ai/detail?postId=95460
4.1 Objective
Master the modeling and parameter-tuning workflow for common machine learning models.
4.2 Content Preparation
4.2.1 Linear regression model:
https://zhuanlan.zhihu.com/p/49480391
Feature requirements of linear regression;
Handling long-tailed distributions;
Understanding the linear regression model;
4.2.2 Model performance validation:
Evaluation functions vs. objective functions;
Cross-validation;
Leave-one-out validation;
Validation for time-series problems;
Plotting learning curves;
Plotting validation curves;
4.2.3 Embedded feature selection:
Lasso regression;
Ridge regression;
Decision trees;
https://zhuanlan.zhihu.com/p/65304798
GBDT model;
https://zhuanlan.zhihu.com/p/45145899
XGBoost model;
https://zhuanlan.zhihu.com/p/86816771
LightGBM model;
https://zhuanlan.zhihu.com/p/89360721

4.2.4 Model comparison:
Common linear models;
Common non-linear models;
4.2.5 Model tuning:
Greedy tuning;
https://www.jianshu.com/p/ab89df9759c8
Always check carefully whether the adopted greedy strategy satisfies the no-aftereffect property (i.e., what happens after a given state does not change earlier states; the outcome depends only on the current state).
Grid-search tuning;
https://blog.csdn.net/weixin_43172660/article/details/83032029
from sklearn.model_selection import GridSearchCV

Iterates over every candidate parameter combination and returns the combination with the best score.
Bayesian tuning;
https://blog.csdn.net/linxid/article/details/81189154
conda install -c conda-forge bayesian-optimization

Bayesian optimization builds a surrogate function (a probability model) of the objective from its past evaluations and uses it to find hyperparameters that minimize the objective.
Unlike random or grid search, it picks the next set of hyperparameters by consulting previous evaluation results, which avoids a lot of wasted work.
Evaluating a set of hyperparameters is expensive, because it requires training the model with those hyperparameters and then evaluating it; many deep learning models take hours or days to train, so the cost is huge.
Bayesian tuning keeps updating its probability model, using past results to "focus" on promising hyperparameters.
4.3 Further Reading
《机器学习》 https://book.douban.com/subject/26708119/
《统计学习方法》 https://book.douban.com/subject/10590856/
《Python大战机器学习》 https://book.douban.com/subject/26987890/
《面向机器学习的特征工程》 https://book.douban.com/subject/26826639/
《数据科学家访谈录》 https://book.douban.com/subject/30129410/
4.4 Experiments
4.4.1 Data preprocessing
Choose appropriate data types to reduce memory usage.
import numpy as np

def reduce_mem_usage(df):
    """Downcast column dtypes to reduce memory usage."""
    start_mem = df.memory_usage().sum()
    print('Memory usage of dataframe is: {:.2f} KB'.format(start_mem / 1024))
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum()
    print('Memory usage after optimization is: {:.2f} KB'.format(end_mem / 1024))
    print('Memory usage reduced: {:.2f} %'.format(100 * (start_mem - end_mem) / start_mem))
    return df

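A minimal usage sketch (the file name is a placeholder; in the notebook, sample_feature is the feature table built in the previous task):

import pandas as pd

# placeholder path: load the feature table from the previous task,
# then downcast its dtypes
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
print(sample_feature.dtypes.value_counts())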

4.4.2 Linear regression example
4.4.2.1 Build the training and test data
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)

train = sample_feature[continuous_feature_names + ['price']]
train_X = sample_feature[continuous_feature_names]
train_y = sample_feature['price']

4.4.2.2 Linear regression
from sklearn.linear_model import LinearRegression

model = LinearRegression(normalize=True)  # note: the normalize argument was removed in scikit-learn 1.2
model = model.fit(train_X, train_y)

Training result:
[figure: predicted vs. true price]
Supposedly you can tell from this that the distribution is long-tailed? All I can see is that the true (T) and predicted (P) values differ quite a bit.
To handle the long tail, apply a log(1+x) transform to the target and look at the new result:
train_y_ln = np.log(train_y + 1)
[figure: fit after the log(1+x) transform]
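One thing to keep in mind: a model trained on the transformed target predicts log(1+price), so predictions must be mapped back to the price scale. A minimal sketch (np.log1p / np.expm1 are the numerically stable equivalents of the transform and its inverse):

# equivalent, numerically stable form of the transform
train_y_ln = np.log1p(train_y)

# assumption: re-fit the linear model on the transformed target,
# then invert the transform to recover prices
model_ln = LinearRegression().fit(train_X, train_y_ln)
pred_price = np.expm1(model_ln.predict(train_X))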
4.4.2.3 5-fold cross-validation
A dataset is commonly split into three parts: a training set, a validation (evaluation) set, and a test set. Holding out part of the training data for intermediate evaluation is the idea behind cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
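The first scorer below wraps MAE in a `log_transfer` helper that is not defined in this excerpt. A plausible definition, consistent with how it is used (an assumption: it moves both the true and predicted values onto the log scale so the two runs are comparable), would be:

from functools import wraps

def log_transfer(func):
    # evaluate the metric on log(y) vs. log(y_pred); nan_to_num guards against log(0)
    @wraps(func)
    def wrapper(y, y_pred):
        return func(np.log(y), np.nan_to_num(np.log(y_pred)))
    return wrapper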

Five-fold cross-validation with the linear regression model on the untransformed target (error: 1.36):
scores = cross_val_score(model,
                         X=train_X, y=train_y,
                         verbose=1, cv=5,
                         scoring=make_scorer(log_transfer(mean_absolute_error)))
print(np.mean(scores))

Five-fold cross-validation with the linear regression model on the log-transformed target (error: 0.19):
scores = cross_val_score(model,
                         X=train_X, y=train_y_ln,
                         verbose=1, cv=5,
                         scoring=make_scorer(mean_absolute_error))
print(np.mean(scores))

Prior knowledge: k-fold splits cannot always be made arbitrarily. Here, for example, the split must respect time order, using earlier data to predict later data.
import datetime

sample_feature = sample_feature.reset_index(drop=True)
split_point = 4 * len(sample_feature) // 5
train = sample_feature.loc[:split_point].dropna()
val = sample_feature.loc[split_point:].dropna()

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
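The MAE reported below presumably comes from fitting on the earlier portion and scoring on the later one; a minimal sketch of that step (assuming the same linear model as above):

model = LinearRegression().fit(train_X, train_y_ln)
print(mean_absolute_error(val_y_ln, model.predict(val_X)))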

MAE=0.19577094341870593
4.4.2.4 Plotting learning and validation curves
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size,
        scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # background grid
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model',
                    train_X[:1000], train_y_ln[:1000],
                    ylim=(0.0, 0.5), cv=5, n_jobs=1)

[figure: learning curve for the linear model]
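validation_curve is imported above but never used. A minimal sketch of a validation curve (an assumption here: sweeping Ridge's alpha, since plain LinearRegression has no regularization strength to vary):

import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

param_range = np.logspace(-3, 3, 7)
train_scores, test_scores = validation_curve(
    Ridge(), train_X[:1000], train_y_ln[:1000],
    param_name='alpha', param_range=param_range,
    cv=5, scoring=make_scorer(mean_absolute_error))

# plot mean cross-validated MAE against the regularization strength
plt.plot(param_range, test_scores.mean(axis=1), 'o-')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('MAE')
plt.show()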
4.4.3 Fitting other models
train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

4.4.3.1 Linear models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(), Ridge(), Lasso()]

res = {}
for model in models:
    print(str(model))
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln,
                             verbose=0,
                             scoring=make_scorer(mean_absolute_error))
    res[model_name] = scores
    print(model_name, 'is finished.')


L1 regularization (Lasso regression) drives the coefficients toward sparsity; in the plot below, for example, power and used_time stand out as important.
[figure: Lasso coefficients]
L2 regularization (Ridge regression) shrinks the coefficients toward small values.
[figure: Ridge coefficients]
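The coefficient plots above can be reproduced roughly as follows (a sketch; the notebook's exact plotting style may differ):

import pandas as pd
import matplotlib.pyplot as plt

model = Lasso().fit(train_X, train_y_ln)   # swap in Ridge() for the L2 plot
coefs = pd.Series(model.coef_, index=continuous_feature_names).sort_values()
coefs.plot.barh(figsize=(8, 6))
plt.xlabel('coefficient')
plt.show()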
4.4.3.2 Non-linear models
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [
    LinearRegression(),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    MLPRegressor(solver='lbfgs', max_iter=100),
    XGBRegressor(n_estimators=100, objective='reg:squarederror'),
    LGBMRegressor(n_estimators=100)
]

import time

result = dict()
for model in models:
    st = time.time()
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                             scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name, 'is finished after', time.time() - st, 's.')
Copy

5-fold results:

         LR        DT        RF        GB        MLP       XGB      LGBM
cv1    0.1908    0.2002    0.1329    0.1689    760.564   0.1700    0.1415
cv2    0.1938    0.1921    0.1346    0.1718    599.426   0.1718    0.1455
cv3    0.1941    0.1903    0.1336    0.1709    2673.33   0.1721    0.1439
cv4    0.1918    0.1908    0.1324    0.1691    267.398   0.1696    0.1425
cv5    0.1958    0.2059    0.1374    0.1741    293.916   0.1728    0.1449
t(s)   0.6520   14.1303  864.9356  246.4334   136.5915  47.8760   12.2726
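The cv rows of this table can be assembled directly from the `result` dict, roughly like this (a sketch; the t(s) row comes from the per-model timings printed in the loop above):

import pandas as pd

result_df = pd.DataFrame(result)
result_df.index = ['cv' + str(i) for i in range(1, 6)]
print(result_df)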


As the table shows, RF has the lowest error but is very slow.
LR is extremely fast; its error is worse by comparison, but note that this target can probably be fit reasonably well by a linear model anyway, so the result is still respectable.
LGBM offers the best overall trade-off: both fast and accurate.

4.4.4 Model tuning
This example tunes the LGBM hyperparameters with three common methods.
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

4.4.4.1 Greedy search
best_obj = {}
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln,
                                    verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score

best_leaves = {}
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(),
                                        key=lambda x: x[1])[0],
                          num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln,
                                    verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score

best_depth = {}
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x: x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln,
                                    verbose=0, cv=5,
                                    scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score

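After the three passes, the greedily chosen settings (and the score of the last stage) can be read out like this (a sketch):

best_params = {
    'objective': min(best_obj.items(), key=lambda x: x[1])[0],
    'num_leaves': min(best_leaves.items(), key=lambda x: x[1])[0],
    'max_depth': min(best_depth.items(), key=lambda x: x[1])[0],
}
print(best_params, min(best_depth.values()))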

4.4.4.2 Grid search
from sklearn.model_selection import GridSearchCV

%%time
params = {'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, params, cv=5)
clf = clf.fit(train_X, train_y)  # note: fitted on the raw target here, while the score below uses the log-transformed one
clf.best_params_

model = LGBMRegressor(objective=clf.best_params_['objective'],
                      num_leaves=clf.best_params_['num_leaves'],
                      max_depth=clf.best_params_['max_depth'])

np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                        scoring=make_scorer(mean_absolute_error)))

Very time-consuming:
CPU times: user 39min 58s, sys: 1min 1s, total: 40min 59s Wall time: 41min 5s

4.4.4.3 Bayesian optimization
from bayes_opt import BayesianOptimization

def LGBM_cv(num_leaves, max_depth, subsample, min_child_samples):
    # bayes_opt maximizes the objective, so return 1 - MAE (larger is better)
    val = cross_val_score(
        LGBMRegressor(objective='regression_l1',
                      num_leaves=int(num_leaves),
                      max_depth=int(max_depth),
                      subsample=subsample,
                      min_child_samples=int(min_child_samples)),
        X=train_X, y=train_y_ln, verbose=0, cv=5,
        scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val

LGBM_bo = BayesianOptimization(
    LGBM_cv,
    {
        'num_leaves': (2, 100),
        'max_depth': (2, 100),
        'subsample': (0.1, 1),
        'min_child_samples': (2, 100)
    }
)

Training:

LGBM_bo.maximize()

Result:

1 - LGBM_bo.max['target']
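The winning hyperparameters live in LGBM_bo.max['params']; a sketch of refitting with them (the integer-valued ones need casting back to int, as in LGBM_cv):

best_params = LGBM_bo.max['params']
model = LGBMRegressor(objective='regression_l1',
                      num_leaves=int(best_params['num_leaves']),
                      max_depth=int(best_params['max_depth']),
                      subsample=best_params['subsample'],
                      min_child_samples=int(best_params['min_child_samples']))
print(np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5,
                              scoring=make_scorer(mean_absolute_error))))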
