Tianchi Used-Car Price Prediction: Modeling and Parameter Tuning

Latest recommended article published on 2021-09-29 08:23:47

weixin_43520514

Category: Data Mining. Tags: Machine Learning

Article link: https://blog.csdn.net/weixin_43520514/article/details/105248998

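0. Background knowledge
(1) Linear regression: linear fitting, fitting parameters by gradient descent, the normal distribution
(2) Decision trees:
Chapter 4 (decision trees) of the "Watermelon Book" (Zhou Zhihua, Machine Learning)
(3) Gradient boosted decision trees (GBDT)
CART: a binary tree that partitions the input space by searching for the best feature and its best split point, followed by pruning
GBDT is an ensemble model: an additive combination of many CART trees
(4) XGBoost
(5) LightGBM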
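
To make "an additive combination of many CART trees" concrete, here is a minimal sketch of gradient boosting with squared loss. It is not the GBDT implementation used by any library: each "tree" is a depth-1 regression stump fitted by least squares, and the data, the `learning_rate`, and the helper names `fit_stump` and `predict_stump` are all assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree (a single split) on one feature by least squares."""
    best = None
    for threshold in np.unique(x)[:-1]:            # candidate split points
        left = x <= threshold
        left_value = residual[left].mean()
        right_value = residual[~left].mean()
        sse = (((residual[left] - left_value) ** 2).sum()
               + ((residual[~left] - right_value) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)                    # synthetic 1-D feature
y = np.sin(x) + rng.normal(scale=0.1, size=200)    # synthetic target

learning_rate = 0.5                                # assumed shrinkage factor
prediction = np.full_like(y, y.mean())             # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]
stumps = []
for _ in range(20):
    residual = y - prediction                      # negative gradient of 0.5 * (y - f)^2
    stump = fit_stump(x, residual)
    stumps.append(stump)
    prediction += learning_rate * predict_stump(stump, x)   # add the new tree to the sum
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

The final model is the initial constant plus the scaled sum of all stumps, which is the "linear addition" the outline refers to. Each round fits the next tree to what the current sum still gets wrong.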

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Downcast column dtypes to reduce the memory the data occupies
def reduce_mem_usage(df):
    """
    Iterate over every column of the DataFrame and downcast its dtype to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2  # bytes -> MB
    print('Memory usage of dataFrame is {:.2f}MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floats; note that float16 keeps only about 3 significant digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

        else:
            # String columns are stored as categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2  # bytes -> MB
    print('Memory usage after optimization is: {:.2f}MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))

Memory usage of dataFrame is 57.70MB
Memory usage after optimization is: 15.00MB
Decreased by 74.0%
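
One side effect of downcasting to float16 is worth seeing directly: float16 has an 11-bit significand, so integers above 2048 are rounded to a coarser grid. The sketch below shows that rounding and a more conservative variant that never goes below float32, built on `pd.to_numeric(..., downcast=...)`. The helper name `downcast_numeric` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Between 4096 and 8192 float16 can only represent multiples of 4
print(float(np.float16(4385.0)))   # 4384.0

def downcast_numeric(df):
    """Hypothetical variant: shrink numeric columns, but keep floats at float32 or wider."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

demo = pd.DataFrame({'power': [60, 0, 163],
                     'used_time': [4385.0, 4757.0, 7123.0]})
small = downcast_numeric(demo)
print(small.dtypes)
print(small['used_time'].tolist())
```

The trade-off is memory against precision: float32 uses twice the space of float16 but keeps values such as day counts in the thousands exact.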

sample_feature.columns

Index(['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power',
       'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'price', 'v_0',
       'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
       'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city',
       'brand_amout', 'brand_price_average', 'brand_price_max',
       'brand_price_median', 'brand_price_min', 'brand_price_std',
       'brand_price_sum', 'power_bin'],
      dtype='object')

sample_feature.head(10)

	name	model	brand	bodyType	fuelType	gearbox	power	kilometer	notRepairedDamage	...	used_time	city	brand_amout	brand_price_average	brand_price_max	brand_price_median	brand_price_min	brand_price_std	brand_price_sum	power_bin
0	736	30.0	6	1.0	0.0	0.0	60	12.5	0.0	...	4384.0	1.0	10192.0	3576.0	35990.0	1800.0	13.0	4564.0	36457520.0	5.0
1	2262	40.0	1	2.0	0.0	0.0	0	15.0	-	...	4756.0	4.0	13656.0	9080.0	84000.0	6400.0	15.0	8992.0	124044600.0	NaN
2	14874	115.0	15	1.0	0.0	0.0	163	12.5	0.0	...	4384.0	2.0	1458.0	9848.0	45000.0	8496.0	100.0	5424.0	14373814.0	16.0
3	71865	109.0	10	0.0	0.0	1.0	193	15.0	0.0	...	7124.0	NaN	13992.0	8076.0	92900.0	5200.0	15.0	8248.0	113034208.0	19.0
4	111080	110.0	5	1.0	0.0	0.0	68	5.0	0.0	...	1531.0	6.0	4664.0	3306.0	31500.0	2300.0	20.0	3344.0	15414322.0	6.0
5	137642	24.0	10	0.0	1.0	0.0	109	10.0	0.0	...	2482.0	3.0	13992.0	8076.0	92900.0	5200.0	15.0	8248.0	113034208.0	10.0
6	2402	13.0	4	0.0	0.0	1.0	150	15.0	0.0	...	6184.0	3.0	16576.0	8344.0	99999.0	6000.0	12.0	8088.0	138279072.0	14.0
7	165346	26.0	14	1.0	0.0	0.0	101	15.0	0.0	...	6108.0	4.0	16072.0	3054.0	38990.0	1700.0	12.0	3606.0	49076652.0	10.0
8	2974	19.0	1	2.0	1.0	1.0	179	15.0	0.0	...	4800.0	4.0	13656.0	9080.0	84000.0	6400.0	15.0	8992.0	124044600.0	17.0
9	82021	7.0	7	5.0	0.0	0.0	88	15.0	0.0	...	6664.0	NaN	2360.0	4196.0	38900.0	2600.0	60.0	4752.0	9905909.0	8.0

10 rows × 38 columns

# Extract the feature columns: every column except the label 'price' and the 'brand' and 'model' codes
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]

# Handle missing values by dropping rows, replace '-' with 0, and cast 'notRepairedDamage' to float32
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

1.1 Simple modeling

from sklearn.linear_model import LinearRegression

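# The original call was LinearRegression(normalize=True); that parameter was removed in scikit-learn 1.2.
# For plain least squares, rescaling the features does not change the predictions mathematically.
model = LinearRegression()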

model = model.fit(train_X, train_y)

# Inspect the intercept and weights of the fitted linear regression model
'intercept:' + str(model.intercept_)

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

[('v_6', 3367077.4406224387),
 ('v_8', 700656.3543472802),
 ('v_9', 170626.24091385124),
 ('v_7', 32318.173103516365),
 ('v_12', 20480.56267793342),
 ('v_3', 17871.47564683771),
 ('v_11', 11482.123138800021),
 ('v_13', 11263.399851895827),
 ('v_10', 2681.353980785434),
 ('gearbox', 881.832820328575),
 ('fuelType', 363.90247374080616),
 ('bodyType', 189.58552716184977),
 ('city', 44.953622398283215),
 ('power', 28.557627369989373),
 ('brand_price_median', 0.5103099160112271),
 ('brand_price_std', 0.4503275546866113),
 ('brand_amout', 0.14881138893606274),
 ('brand_price_max', 0.0031902053613149382),
 ('seller', 2.9080547392368317e-07),
 ('train', 6.891787052154541e-08),
 ('offerType', -4.839152097702026e-06),
 ('brand_price_sum', -2.175000814125744e-05),
 ('name', -0.0002981582332371014),
 ('used_time', -0.0025261487756046805),
 ('brand_price_average', -0.4048195975444779),
 ('brand_price_min', -2.2467183600593943),
 ('power_bin', -34.45676039711011),
 ('v_14', -274.91399236959336),
 ('kilometer', -372.89762118323057),
 ('notRepairedDamage', -495.2282384086448),
 ('v_0', -2044.689562386807),
 ('v_5', -11046.342844467923),
 ('v_4', -15123.010532417948),
 ('v_2', -26106.90644371924),
 ('v_1', -45560.92511426905)]

from matplotlib import pyplot as plt

subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

# Scatter plot of feature v_9 against the label
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

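The plot shows that the distribution of the model's predictions (blue points) differs considerably from the true values (black points), and some predictions are even below 0,
which indicates that something is wrong with the model.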

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])

It is clear to see the price shows a typical exponential distribution

# Apply a log(x + 1) transform to the label to bring it closer to a normal distribution
train_y_ln = np.log(train_y + 1)
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])

The transformed price seems like normal distribution
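
As a small self-contained check of what the transform does, the sketch below applies it to synthetic right-skewed values (not the competition data) and measures the skewness before and after. `np.log1p` and `np.expm1` are NumPy's built-in equivalents of `np.log(x + 1)` and `np.exp(x) - 1`; the `skewness` helper is defined here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
price = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)   # synthetic, right-skewed "prices"

def skewness(values):
    """Sample skewness: third central moment divided by variance to the power 1.5."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

price_ln = np.log1p(price)      # same as np.log(price + 1)
restored = np.expm1(price_ln)   # inverse transform: np.exp(x) - 1

print(skewness(price), skewness(price_ln))
```

Predictions made in log space have to go through the inverse transform before they can be compared with actual prices.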

model = model.fit(train_X, train_y_ln)

print('intercept:' + str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:18.74880341534744

[('v_9', 8.050812690369153),
 ('v_5', 5.754994162620334),
 ('v_12', 1.62093530763775),
 ('v_1', 1.4779570497262025),
 ('v_11', 1.169744490862118),
 ('v_13', 0.9411182304172357),
 ('v_7', 0.7119510355048398),
 ('v_3', 0.6851314488405563),
 ('v_0', 0.008645108332593266),
 ('power_bin', 0.008483677741707147),
 ('gearbox', 0.007926459579133759),
 ('fuelType', 0.006684066538356764),
 ('bodyType', 0.004516720629709567),
 ('power', 0.0007176637371205766),
 ('brand_price_min', 3.3366062098929906e-05),
 ('brand_amout', 2.8979529047958453e-06),
 ('brand_price_median', 1.2322281792545508e-06),
 ('brand_price_std', 6.517052329781775e-07),
 ('brand_price_average', 6.336678789883344e-07),
 ('brand_price_max', 6.191737965170084e-07),
 ('seller', 2.319779923709575e-10),
 ('offerType', 1.141984284913633e-10),
 ('train', -1.5916157281026244e-12),
 ('brand_price_sum', -1.512410856518073e-10),
 ('name', -7.021723724238846e-08),
 ('used_time', -4.126537154502869e-06),
 ('city', -0.0022172506125169183),
 ('v_14', -0.004285615931568405),
 ('kilometer', -0.01383590363034406),
 ('notRepairedDamage', -0.2702944026708757),
 ('v_4', -0.8320763999815677),
 ('v_2', -0.950489908625956),
 ('v_10', -1.6271621034357788),
 ('v_8', -40.35060772204042),
 ('v_6', -238.78518046195867)]

# Plot again to check the effect
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.exp(model.predict(train_X.loc[subsample_index]))-1, color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
plt.show()

1.2 Five-fold cross-validation

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer

def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv=5, scoring=make_scorer(log_transfer(mean_absolute_error)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    2.0s finished

# Five-fold cross-validation of the linear regression model on the untransformed label (error 1.36)
print('AVG:', np.mean(scores))

AVG: 1.365429593439596
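
Part of that large error comes from how the `log_transfer` wrapper treats invalid predictions. A linear model fitted on raw prices can predict negative values, `np.log` of a negative number is `nan`, and `np.nan_to_num` turns `nan` into 0. The toy numbers below are made up to show the mechanism; the clipping alternative and its `floor` value are assumptions, not something the original code does.

```python
import numpy as np

y_true = np.array([1000.0, 2000.0, 3000.0])
y_pred = np.array([1100.0, -50.0, 2900.0])      # one negative prediction

with np.errstate(invalid='ignore', divide='ignore'):
    log_pred = np.log(y_pred)                   # log of a negative number is nan
print(log_pred)
print(np.nan_to_num(log_pred))                  # the nan entry becomes 0.0

# MAE in log space as the wrapper computes it
mae_wrapper = np.mean(np.abs(np.log(y_true) - np.nan_to_num(log_pred)))

# Assumed alternative: clip predictions to a positive floor before taking the log
floor = 1.0
mae_clipped = np.mean(np.abs(np.log(y_true) - np.log(np.clip(y_pred, floor, None))))
print(mae_wrapper, mae_clipped)

# A prediction of exactly 0 is worse: log(0) is -inf, which nan_to_num maps
# to the most negative finite float
with np.errstate(divide='ignore'):
    print(np.nan_to_num(np.log(0.0)))
```

Scoring a `nan` as 0 in log space is the same as pretending the model predicted a price of 1, so every negative prediction contributes an error of roughly `log(y)`.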

# Five-fold cross-validation of the linear regression model on the log-transformed label (error 0.19)
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv=5, scoring=make_scorer(mean_absolute_error))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.4s finished

print('AVG:', np.mean(scores))

AVG: 0.19323301794380213

scores = pd.DataFrame(scores.reshape(1, -1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores

	cv1	cv2	cv3	cv4	cv5
0	0.1908	0.193762	0.194131	0.191823	0.19565
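
For readers who want to see what `cv=5` does under the hood, here is a hand-rolled sketch on synthetic data: the rows are cut into five contiguous folds, the model is fitted on four of them and scored on the fifth, and this is repeated for each fold. The least-squares helpers `fit_linear` and `predict_linear` are stand-ins written for this example, not scikit-learn functions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=500)

def fit_linear(X, y):
    """Ordinary least squares with an intercept column."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

folds = np.array_split(np.arange(len(X)), 5)    # five contiguous folds, no shuffling
scores = []
for k, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    coef = fit_linear(X[train_idx], y[train_idx])
    mae = np.mean(np.abs(y[val_idx] - predict_linear(coef, X[val_idx])))
    scores.append(mae)

print(np.round(scores, 4), np.mean(scores))
```

Every row is used for validation exactly once, which is why the five scores together give a steadier estimate than a single hold-out split.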

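2.3 Simulating a realistic business setting
In practice, however, we cannot see the future, so five-fold cross-validation can give an unrealistic picture on datasets that depend on time.
Using 2018 used-car prices to predict 2017 used-car prices is clearly unreasonable, so we can also split the dataset in chronological order.
In this example the earliest 4/5 of the samples serve as the training set and the latest 1/5 as the validation set. The final result differs little from five-fold cross-validation.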

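sample_feature = sample_feature.reset_index(drop=True)

split_point = len(sample_feature) // 5 * 4

# .iloc slices by position with an exclusive end, so no row lands in both sets
# (label-based .loc[:split_point] would also include row split_point)
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()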

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_ln)

mean_absolute_error(val_y_ln, model.predict(val_X))

0.19566623218534546
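
A split by row position only respects time if the rows are already in chronological order. The sketch below makes that ordering explicit on a toy frame; `listing_date` is a hypothetical column name chosen for the example and is not a column of `data_for_tree.csv`.

```python
import pandas as pd

# Toy frame; 'listing_date' is a hypothetical column name
frame = pd.DataFrame({
    'listing_date': pd.to_datetime(['2016-03-01', '2016-01-15', '2016-04-20',
                                    '2016-02-10', '2016-05-05']),
    'price': [3000, 1500, 5200, 2100, 800],
})

ordered = frame.sort_values('listing_date').reset_index(drop=True)
split_point = len(ordered) * 4 // 5

train_part = ordered.iloc[:split_point]   # earliest 4/5 of the rows
val_part = ordered.iloc[split_point:]     # latest 1/5, never seen during training

print(train_part['listing_date'].max(), val_part['listing_date'].min())
```

With this layout every validation row is later than every training row, which mirrors how the model would be used on future listings.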

2.4 Plotting learning curves and validation curves

from sklearn.model_selection import learning_curve, validation_curve

help(learning_curve)

def plot_learning_curve(estimator , title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()  # draw grid lines
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)
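
`validation_curve` is imported above but not used. As a rough sketch of how it could be applied, the example below varies the `alpha` of a `Ridge` model on synthetic data and prints the mean training and validation MAE for each value. The choice of `Ridge`, the `alpha` grid, and the data are assumptions for illustration; the scores are errors, so lower is better.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import validation_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=300)

alphas = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    Ridge(), X, y,
    param_name='alpha', param_range=alphas,
    cv=5, scoring=make_scorer(mean_absolute_error),
)

# Both arrays have one row per alpha and one column per fold
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for alpha, tr, va in zip(alphas, train_mean, val_mean):
    print(f'alpha={alpha:g}  train MAE={tr:.3f}  validation MAE={va:.3f}')
```

Where a learning curve varies the amount of training data, a validation curve varies one hyperparameter, which helps locate the range where the model starts to underfit or overfit.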

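Comparing multiple models
3.1 Linear models & embedded feature selection
In filter and wrapper feature selection, the selection step is clearly separated from the training of the learner. Embedded feature selection, by contrast, performs selection automatically while the learner is being trained. The most common embedded approaches are L1 and L2 regularization. Adding them to a linear regression model gives Lasso regression (L1) and ridge regression (L2), respectively.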
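
The difference between the two penalties can be seen on a small synthetic problem where three of five features are pure noise. The sketch below is a minimal NumPy illustration, not scikit-learn's solver: the Lasso is fitted with a simple coordinate-descent loop, ridge with its closed-form solution, and the helper names, `alpha` values, and data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])    # features 1, 2 and 4 are pure noise
y = X @ true_w + rng.normal(scale=0.1, size=n)

def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            partial_residual = y - X @ w + X[:, j] * w[j]   # residual ignoring feature j
            rho = X[:, j] @ partial_residual / n_samples
            w[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n_samples)
    return w

def ridge_closed_form(X, y, alpha):
    """Minimise ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w_lasso = lasso_coordinate_descent(X, y, alpha=0.1)
w_ridge = ridge_closed_form(X, y, alpha=1.0)
print(np.round(w_lasso, 3))
print(np.round(w_ridge, 3))
```

The soft-threshold step sets small coefficients to exactly zero, which is what makes L1 act as a feature selector. The L2 penalty only shrinks coefficients and leaves them nonzero.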

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(), 
         Ridge(), 
         Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv=5, scoring=make_scorer((mean_absolute_error)))
    result[model_name] = scores
    print(model_name + ' is finished')

LinearRegression is finished
Ridge is finished
Lasso is finished

# Compare the results of the three methods
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

	LinearRegression	Ridge	Lasso
cv1	0.190303	0.194213	0.383467
cv2	0.193318	0.197295	0.383593
cv3	0.192839	0.196655	0.383413
cv4	0.193262	0.197033	0.382736
cv5	0.191895	0.195746	0.379170

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # seaborn 0.12+ requires x/y as keywords

intercept:17.249525951718784

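L2 regularization generally pushes the weights toward small values during fitting, so the resulting model has parameters that are all fairly small. Models with small parameters are usually considered simpler, adapt better to different datasets, and are to some degree less prone to overfitting. Consider a linear regression equation: if a parameter is large, even a tiny shift in the data changes the result a lot, whereas if the parameters are small enough, a somewhat larger shift has little effect on the result. The technical way to say this is that the model is robust to perturbation.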
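
The shrinking effect can be checked numerically. The sketch below uses the closed-form ridge solution on synthetic data with two nearly collinear features, a setting where unregularized weights tend to be large and unstable, and prints the weight norm for increasing `alpha` together with how much a small input shift `delta` moves the prediction. All names and values here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 2] + rng.normal(scale=0.01, size=100)   # two nearly collinear features
y = X[:, 0] + X[:, 2] + rng.normal(scale=0.1, size=100)

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw||^2 + alpha * ||w||^2 (no intercept); alpha=0 is least squares."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

alphas = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_closed_form(X, y, a) for a in alphas]
norms = [np.linalg.norm(w) for w in weights]

delta = np.array([0.0, 0.0, 0.1, -0.1])   # small shift in the two collinear inputs
for a, w, norm in zip(alphas, weights, norms):
    # For a linear model the prediction changes by exactly w @ delta
    print(f'alpha={a:<6} ||w||={norm:.3f}  change in prediction={abs(w @ delta):.4f}')
```

Because the prediction of a linear model changes by `w @ delta` when the input shifts by `delta`, a smaller weight vector bounds how far any perturbation of a given size can move the output.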

model = Ridge().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # seaborn 0.12+ requires x/y as keywords

intercept:3.118190236616388

model = Lasso().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # seaborn 0.12+ requires x/y as keywords

intercept:8.669914057713994

3.2 Nonlinear models

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR  # SVR is the regression counterpart of SVC, which is a classifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
