Second-Hand House Price Prediction - Datawhale & Tianchi Data Mining Study Notes 4

qq_26887833

Tags: python, machine learning, artificial intelligence

Article link: https://blog.csdn.net/qq_26887833/article/details/105259153

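Background

Types of machine learning:

1. Supervised learning

Given a set of labelled data, learn a mapping from inputs to outputs, then apply that mapping to new data to obtain predictions, for either classification or regression.
Main methods: linear regression, decision trees, SVM, etc.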

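2. Unsupervised learning

The input data is unlabelled and there is no predetermined result.
Methods: K-means clustering, hierarchical clustering, etc.

3. Semi-supervised learning

In practice most of the data we collect is unlabelled. The idea is to add a small number of manually labelled samples so that the unlabelled data can acquire labels automatically through training, which amounts to an improvement on unsupervised learning.
Methods: generative-model algorithms, etc.

4. Reinforcement learning

Reinforcement learning (RL), also translated as evaluative learning or enhancement learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent that, while interacting with an environment, learns a policy in order to maximise its return or achieve a specific goal.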
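To make the first two categories concrete, here is a minimal sketch using scikit-learn on made-up toy arrays (the data and variable names are assumptions for illustration, not part of the competition data). The supervised model is given labels `y`; the clustering model only sees the inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Supervised: labelled pairs (X, y), here generated from y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[4.0]]))  # the learned mapping applied to new data

# Unsupervised: inputs only, two well-separated groups of points
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_  # cluster ids discovered without any labels

print(pred, labels)
```

The cluster ids themselves are arbitrary; only which points share an id is meaningful.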

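Basic concepts

1. Model, strategy, and algorithm
2. Evaluation function
3. Objective function
4. Overfitting and underfitting
5. Regularization
6. Cross-validation
7. Generalization ability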
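
Two of the concepts listed above can be illustrated in a few lines. This is a sketch on synthetic data (the arrays and the choice of `alpha=10.0` are assumptions): the evaluation function is mean absolute error computed by hand and via scikit-learn, and regularization is shown through Ridge shrinking the coefficient vector relative to ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=40)

# Evaluation function: MAE is the mean of the absolute residuals
ols = LinearRegression().fit(X, y)
manual_mae = np.mean(np.abs(y - ols.predict(X)))
sk_mae = mean_absolute_error(y, ols.predict(X))

# Regularization: the L2 penalty in Ridge pulls the coefficients towards zero
ridge = Ridge(alpha=10.0).fit(X, y)
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print(manual_mae, sk_mae, ols_norm, ridge_norm)
```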

Recommended study materials

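《机器学习》 (Machine Learning) by Zhou Zhihua, nicknamed the "watermelon book": a must-read

The watermelon book is a real classic and its explanations are very clear. It helps beginners a great deal in understanding the basic concepts, so I strongly recommend it.
For videos, the obvious choice is Andrew Ng's machine learning course.

《统计学习方法》 (Statistical Learning Methods) by Li Hang: a must-read
《深度学习》 (Deep Learning), nicknamed the "flower book"
《Python大战机器人》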

Loading the data

First, the reduce_mem_usage function reduces the amount of memory the data occupies by adjusting the column data types.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# reduce_mem_usage shrinks the memory footprint of a DataFrame
# by downcasting each column to a smaller data type
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # memory_usage() reports bytes, so convert to MB before printing
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # float16 keeps only about 3 significant decimal digits,
                # so this branch trades precision for memory
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns become categoricals
            df[col] = df[col].astype('category')
    # measure once, after every column has been processed
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df


sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))

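Output (the printed figures are raw byte counts, because memory_usage().sum() returns bytes; they correspond to roughly 59.2 MB before and 48.2 MB after):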
Memory usage of dataframe is 62099672.00 MB
Memory usage after optimization is: 50555630.00 MB
Decreased by 18.6%
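
The effect of downcasting is easy to see on a small frame. The sketch below uses a made-up two-column DataFrame (the column names are hypothetical) and converts the columns by hand, which is the same idea reduce_mem_usage applies column by column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),  # values 0..99 fit in int8
    "ratio": np.linspace(0.0, 1.0, 100),          # float64 by default
})
before = df.memory_usage(deep=True).sum()  # bytes

# int64 -> int8 saves 7 bytes per value, float64 -> float32 saves 4
df["small_int"] = df["small_int"].astype(np.int8)
df["ratio"] = df["ratio"].astype(np.float32)
after = df.memory_usage(deep=True).sum()  # bytes

print("before: {} bytes, after: {} bytes".format(before, after))
```

Integer downcasting is lossless as long as the values fit the target range; float downcasting is not, so it is worth checking that the lost precision does not matter for the model.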

Inspecting the variable names:

# Names of the continuous features
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]
print(continuous_feature_names)

Result:
['SaleID', 'name', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city', 'brand_amount', 'brand_price_max', 'brand_price_median', 'brand_price_min', 'brand_price_sum', 'brand_price_std', 'brand_price_average', 'power_bin']

Linear regression, five-fold cross-validation, and simulating a real business scenario

Preparing the dataset:

# Prepare the training set
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]
train_X = train[continuous_feature_names]  # independent variables (features)
train_y = train['price']  # dependent variable (label)
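
The cleaning chain above can be tried on a toy frame. This is a sketch with made-up values; one deliberate difference, labelled here as an assumption, is that the placeholder '-' is replaced with the string "0" rather than the integer 0, so the column stays uniformly string-typed until the cast.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "power": [75.0, np.nan, 110.0, 90.0],
    "notRepairedDamage": ["0.0", "1.0", "-", "0.0"],
    "price": [1850, 3600, 6222, 2400],
})

# Drop rows with missing values, turn the '-' placeholder into a number-like
# string, and renumber the rows from 0
clean = toy.dropna().replace("-", "0").reset_index(drop=True)
clean["notRepairedDamage"] = clean["notRepairedDamage"].astype(np.float32)

print(clean)
```

Note that dropna() discards the second row entirely, so the "1.0" value never reaches the cleaned frame.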

1 Simple modelling


# Simple modelling
from sklearn.linear_model import LinearRegression
# normalize=True was used with older scikit-learn; the argument was removed in version 1.2
model = LinearRegression()
model = model.fit(train_X, train_y)

# Inspect the intercept and the weights (coef) of the fitted linear regression
print('intercept:' + str(model.intercept_))
print(sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x: x[1], reverse=True))

from matplotlib import pyplot as plt

# Subsample: draw 50 random row positions
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

# Scatter plot of feature v_9 against the label.
# The predictions (pink points) are distributed quite differently
# from the true labels (yellow points), and some predictions fall below 0,
# which indicates that the model has problems.

# True values
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='yellow')
# Predicted values
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='pink')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

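In the scatter plot of feature v_9 against the label, the model's predictions (pink points) are distributed quite differently from the true labels (yellow points), and some predictions are below 0, which indicates that the model has problems.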

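Next, check the distribution of the label:

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(train_y, kde=True, stat='density')
plt.subplot(1,2,2)
# only the prices below the 0.9 quantile
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True, stat='density')
plt.show()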

Clearly, the price shows a typical exponential-like distribution with a long right tail.
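
The long right tail can also be checked numerically instead of by eye. A minimal sketch, assuming a synthetic log-normal sample as a stand-in for the real price column (the parameters are made up): sample skewness is large for the raw values and close to zero after a log(x + 1) transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for a right-skewed price column
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000))

raw_skew = price.skew()            # strongly positive: long right tail
log_skew = np.log1p(price).skew()  # near zero: roughly symmetric

print("skew before: {:.2f}, after log1p: {:.2f}".format(raw_skew, log_skew))
```

On the real data the same two calls would be `train_y.skew()` and `np.log1p(train_y).skew()`.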

Applying a log(x + 1) transform to the label brings it closer to a normal distribution.

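# Log-transform the target variable
train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(train_y_ln, kde=True, stat='density')
plt.subplot(1,2,2)
sns.histplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)], kde=True, stat='density')
plt.show()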
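
One detail worth spelling out: a label transformed with log(x + 1) has to be mapped back with exp(x) - 1, not plain exp(x). The sketch below uses made-up prices and NumPy's `log1p`/`expm1` pair, which compute exactly these two functions.

```python
import numpy as np

price = np.array([0.0, 500.0, 3250.0, 99999.0])

price_ln = np.log1p(price)     # same as np.log(price + 1)
restored = np.expm1(price_ln)  # same as np.exp(price_ln) - 1
naive = np.exp(price_ln)       # forgets the "- 1", so every value is off by one

print(restored)
print(naive)
```

The off-by-one is small next to typical prices, but it is a systematic bias and it matters for cheap items.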

# Refit the model on the log-transformed label
model = model.fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
print(sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x: x[1], reverse=True))
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='yellow')
# Invert log(x + 1) with expm1 so the predictions are back on the price scale
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='pink')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

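2 Five-fold cross-validation

# Five-fold cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer
def log_transfer(func):
    # Wrap a metric so that it is evaluated on log-transformed values
    def wrapper(y, yhat):
        # The log of a non-positive prediction is NaN or -inf;
        # nan_to_num replaces those with finite numbers
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
# y=train_y: each fold fits on the raw price and is scored in log space
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
print('AVG:', np.mean(scores))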

Output:

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.4s finished
AVG: 1.36580240424085
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
print(scores)

Result:

          cv1      cv2       cv3       cv4       cv5
MAE  1.348304  1.36349  1.380712  1.378401  1.358105
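
To see what `cross_val_score` does with `cv=5`, the sketch below repeats the five folds by hand on synthetic data (the arrays are assumptions for illustration). For a regressor, an integer `cv` means an unshuffled `KFold`, so the manual loop reproduces the same five scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# One call: fit on 4/5 of the rows, score on the remaining 1/5, five times
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring=make_scorer(mean_absolute_error))

# The same procedure written out fold by fold
manual = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    fold_model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual.append(mean_absolute_error(y[test_idx], fold_model.predict(X[test_idx])))
manual = np.array(manual)

print(scores)
print(manual)
```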

3 Simulating a real business scenario

sample_feature = sample_feature.reset_index(drop=True)
# Use the first 4/5 of the rows for training and the last 1/5 for validation
split_point = len(sample_feature) // 5 * 4

# Position-based slicing keeps the two parts disjoint
# (label-based .loc[:split_point] would put row split_point in both)
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)


model = model.fit(train_X, train_y_ln)
print(mean_absolute_error(val_y_ln, model.predict(val_X)))

This section raised an error when I ran it, and I have not finished checking it yet; to be continued...
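
One thing to watch in a split like this is the difference between label-based and position-based slicing. A small sketch on a made-up ten-row frame: `.loc` slices include the end label, so the row at `split_point` lands in both parts, while `.iloc` slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})
split_point = len(df) // 5 * 4  # 8

# Label-based: both slices contain the row labelled 8
train_loc = df.loc[:split_point]
val_loc = df.loc[split_point:]
overlap_loc = set(train_loc.index) & set(val_loc.index)

# Position-based: rows 0..7 for training, rows 8..9 for validation
train = df.iloc[:split_point]
val = df.iloc[split_point:]
overlap_iloc = set(train.index) & set(val.index)

print(overlap_loc, overlap_iloc)
```

A single shared row hardly changes the score, but it is still a leak from the training set into the validation set.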

4 Plotting learning curves and validation curves
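
This part was left empty in my notes. As a placeholder, here is a minimal sketch of how the numbers behind both curves could be computed with scikit-learn's `learning_curve` and `validation_curve`. The synthetic data, the Ridge model, and the `alpha` grid are all assumptions for illustration; on the real data `train_X` and `train_y_ln` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve, validation_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

# Learning curve: score as a function of the number of training samples
train_sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y, train_sizes=[40, 80, 160], cv=5,
    scoring='neg_mean_absolute_error')

# Validation curve: score as a function of one hyperparameter (here alpha)
alphas = [0.01, 1.0, 100.0]
vc_train, vc_val = validation_curve(
    Ridge(), X, y, param_name='alpha', param_range=alphas, cv=5,
    scoring='neg_mean_absolute_error')

# Scores are negated MAE, so flip the sign and average over the 5 folds
lc_val_mae = -val_scores.mean(axis=1)
vc_val_mae = -vc_val.mean(axis=1)
print(train_sizes, lc_val_mae)
print(alphas, vc_val_mae)
```

Plotting `lc_val_mae` against `train_sizes` and `vc_val_mae` against `alphas` with `plt.plot` gives the two curves; a validation error far above the training error suggests overfitting.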

Comparing multiple models

1 Linear models & embedded feature selection

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
print(result)

     LinearRegression     Ridge     Lasso
cv1          0.190792  0.194832  0.383899
cv2          0.193758  0.197632  0.381893
cv3          0.194132  0.198123  0.384090
cv4          0.191825  0.195670  0.380526
cv5          0.195758  0.199676  0.383611

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
# Bar chart of the absolute coefficient of each feature
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
plt.show()


model = Ridge().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
plt.show()

model = Lasso().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
plt.show()

Result:
intercept:18.750745460114032
intercept:4.671710857050353
intercept:8.672182455497687

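The plots did not render for me and this still needs checking (possibly because plt.show() must be called when running as a script)...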
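
"Embedded feature selection" refers to the way the L1 penalty in Lasso drives some coefficients exactly to zero during fitting. A minimal sketch on synthetic data, where only the first two of five features influence the target (the data, feature names, and `alpha=0.5` are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only f0 and f1 matter; f2..f4 are pure noise features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Features whose Lasso coefficient survived the penalty
selected = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0]

print('OLS coefficients:  ', ols.coef_)
print('Lasso coefficients:', lasso.coef_)
print('selected:', selected)
```

Ordinary least squares assigns every feature a small non-zero weight, whereas Lasso keeps only the informative ones, at the cost of shrinking their coefficients as well.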

2 Non-linear models

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators = 100)]

result = dict()

# Score every model with the same 5-fold MAE on the log-transformed price
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')


result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
print(result)

Output:

     LinearRegression  DecisionTreeRegressor  ...  XGBRegressor  LGBMRegressor
cv1          0.190792               0.197922  ...      0.142378       0.141544
cv2          0.193758               0.193300  ...      0.140922       0.145501
cv3          0.194132               0.189068  ...      0.139393       0.143887
cv4          0.191825               0.190922  ...      0.137492       0.142497
cv5          0.195758               0.201879  ...      0.143733       0.144852
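
The same comparison loop can be tried without the competition CSV. The sketch below is an assumption-laden miniature: a synthetic one-feature dataset with a sine-shaped target, and only scikit-learn models. It shows why tree-based models can beat a linear model when the relationship is non-linear.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(2.0 * X[:, 0]) + rng.normal(scale=0.05, size=300)  # non-linear target

models = [LinearRegression(),
          DecisionTreeRegressor(random_state=0),
          RandomForestRegressor(n_estimators=50, random_state=0)]

result = {}
for model in models:
    model_name = type(model).__name__
    result[model_name] = cross_val_score(
        model, X, y, cv=5, scoring=make_scorer(mean_absolute_error))

result = pd.DataFrame(result, index=['cv' + str(i) for i in range(1, 6)])
print(result)
print(result.mean())
```

Lower MAE is better here, as in the tables above.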