Kaggle for Beginners: House Prices, Top 2% to Top 1%

This is my second beginner competition, and reaching a top 1% score is a little exciting. Chasing the public leaderboard may well have overfit the model to it, but in an introductory competition the public score is about all there is to optimize.

With that said, let's get to work.

I. Imports

# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Models
from xgboost.sklearn import XGBRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
# Training and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

II. Loading the Data

train = pd.read_csv("all/train.csv")
test = pd.read_csv("all/test.csv")
sample_submission = pd.read_csv("all/sample_submission.csv")

III. Data Analysis

1. The distribution of SalePrice

sns.distplot(train.SalePrice)

SalePrice clearly deviates from a normal distribution (it is right-skewed), so it needs adjusting: log-transform it.

sns.distplot(np.log(train.SalePrice + 1))
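
To quantify the improvement (this check is my addition, not in the original post), pandas can report the skewness directly:

print(train.SalePrice.skew())              # strongly positive before the transform
print(np.log(train.SalePrice + 1).skew())  # much closer to 0 afterwards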

2. Missing values

(The missing-value handling here draws on https://www.kaggle.com/laurenstc/top-2-of-leaderboard-advanced-fe.)

Concatenate train and test and visualize the proportion of missing values per feature:

all_data = pd.concat((train.drop(["SalePrice"], axis=1), test))
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
plt.figure(figsize=(12, 6))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)

PoolQC: PoolQC has by far the most missing values; per the data description, NA simply means there is no pool, so fill with None. In theory every row with missing PoolQC should have PoolArea equal to 0, but inspecting the handful of rows with a nonzero PoolArea shows that rows 960, 1043 and 1139 in the test set are special: these three cannot be filled with None. Among the 13 rows with a pool, Ex appears 4 times, Gd 4 times and Fa 2 times, so to keep the distribution even, two of the three missing values should be filled with Fa (and, as in the code below, the third with Gd).

all_data[all_data.PoolArea != 0][["PoolArea", "PoolQC"]]
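
The counts quoted above can be confirmed directly (an added check):

all_data.PoolQC.value_counts()  # expected: Ex 4, Gd 4, Fa 2 among the 13 pooled rows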

MiscFeature: dropped outright. Per the data description it could be filled with None. I also explored the relationship between MiscFeature, its companion MiscVal, and GarageType, tried several fills, and even concluded that row 1089 in the test set should be filled with Gar2, but in the end deleting the feature worked best for my model.

all_data[all_data.MiscVal > 10000][["MiscFeature", "MiscVal"]]

Alley: per the data description, fill with None.

Fence: same as MiscFeature, dropped outright.

FireplaceQu: per the data description, fill with None.

LotFrontage: this is the feature many kernels put the most effort into. Some predict it with a model, some fill by Neighborhood group, and some use R's MICE package. From my own inspection, the most reasonable fill seemed to be grouping by LotConfig and Neighborhood, but unfortunately none of these methods helped my model at all. What actually worked here was filling with 0; a sketch of the grouped fill I tried follows.
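
A minimal sketch of that grouped fill, assuming the group median is used and it is applied to the concatenated all_data frame (an illustration of the variant I tried, not the fill the final model uses):

# Group median over LotConfig x Neighborhood
all_data["LotFrontage"] = all_data.groupby(["LotConfig", "Neighborhood"])["LotFrontage"].transform(lambda s: s.fillna(s.median()))
# Groups that are entirely missing stay NaN; fall back to the overall median
all_data["LotFrontage"] = all_data["LotFrontage"].fillna(all_data["LotFrontage"].median())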

Garage features: per the data description, GarageType, GarageFinish, GarageQual and GarageCond should be filled with None, and GarageYrBlt, GarageCars and GarageArea with 0. Inspection shows, however, that the test set contains a few unusual rows that should not be filled this way; those get median and mode values instead (see section IV-2). I also tried fills I considered more reasonable, such as grouping by Neighborhood and GarageType, but the results were disappointing.

all_data[(all_data.GarageType.notnull()) & (all_data.GarageYrBlt.isnull())][["Neighborhood", "YearBuilt", "YearRemodAdd", "GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond"]]

Bsmt features: per the data description, fill with None or 0 as appropriate. Here too there are a few special rows, shown below, that arguably should not be filled this way; I tried other fills for them, but in the end None and 0 won out.

train.loc[[332, 948]][["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "BsmtFullBath", "BsmtHalfBath"]]

test.loc[[27, 580, 725, 757, 758, 888, 1064]][["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "BsmtFullBath", "BsmtHalfBath"]]

MSZoning: fill with the mode, RL. I also tried filling by MSSubClass group, but it made no difference.

MasVnrType: fill with None.

MasVnrArea: fill with 0.

Utilities: the test set contains no NoSeWa, and the feature has essentially no effect on price, so it is dropped.

plt.scatter(train.Utilities, train.SalePrice)
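
A quick check backs this up (my addition; in this dataset the train set has a single NoSeWa row and the test set has none):

train.Utilities.value_counts()  # AllPub dominates, NoSeWa appears once
test.Utilities.value_counts()   # AllPub only (its NaN rows are not counted)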

The missing values of all remaining features are filled with the mode (a generic sketch follows; section IV-3 writes the values out explicitly).
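
A generic way to apply that, instead of hard-coding each mode (an illustrative alternative, not what the original code does; it assumes the None/0 fills above have already been applied, so only leftover columns remain):

for df in (train, test):
    for col in df.columns[df.isnull().any()]:
        df[col] = df[col].fillna(df[col].mode()[0])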

There are many ways to fill missing values. Experiment and explore until you find the fills that work best for your own model.

IV. Feature Engineering

1. Log-transforming the price

y = train["SalePrice"]
y = np.log(y+1)
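
np.log1p is the equivalent one-call (and numerically safer) form, with np.expm1 as its inverse at submission time:

y = np.log1p(train["SalePrice"])  # same as np.log(y + 1)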

2. Filling the special cases

# PoolQC
test.loc[960, "PoolQC"] = "Fa"
test.loc[1043, "PoolQC"] = "Gd"
test.loc[1139, "PoolQC"] = "Fa"

# Garage
test.loc[666, "GarageYrBlt"] = 1979
test.loc[1116, "GarageYrBlt"] = 1979

test.loc[666, "GarageFinish"] = "Unf"
test.loc[1116, "GarageFinish"] = "Unf"

test.loc[1116, "GarageCars"] = 2
test.loc[1116, "GarageArea"] = 480

test.loc[666, "GarageQual"] = "TA"
test.loc[1116, "GarageQual"] = "TA"

test.loc[666, "GarageCond"] = "TA"
test.loc[1116, "GarageCond"] = "TA"

3. Filling the remaining missing values

# Categorical features where NA means the feature is absent
none_cols = ["PoolQC", "Alley", "FireplaceQu",
             "GarageType", "GarageFinish", "GarageQual", "GarageCond",
             "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
             "BsmtFinType2", "MasVnrType"]
# Numeric companions where NA means 0
zero_cols = ["LotFrontage", "GarageYrBlt", "GarageCars", "GarageArea",
             "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF",
             "BsmtFullBath", "BsmtHalfBath", "MasVnrArea"]
train = train.fillna({c: "None" for c in none_cols})
test = test.fillna({c: "None" for c in none_cols})
train = train.fillna({c: 0 for c in zero_cols})
test = test.fillna({c: 0 for c in zero_cols})

# MiscFeature, Fence, Utilities: dropped entirely
train = train.drop(["Fence", "MiscFeature", "Utilities"], axis=1)
test = test.drop(["Fence", "MiscFeature", "Utilities"], axis=1)

# Remaining features: fill with the mode
train = train.fillna({"Electrical": "SBrkr"})
test = test.fillna({"MSZoning": "RL", "Exterior1st": "VinylSd",
                    "Exterior2nd": "VinylSd", "KitchenQual": "TA",
                    "Functional": "Typ", "SaleType": "WD"})
4. Finding and removing outliers

(The outlier-detection approach borrows from https://www.kaggle.com/jack89roberts/top-7-using-elasticnet-with-interactions: I fit Ridge and ElasticNet on the training set, predicted the training set back, and treated the samples that both algorithms predicted poorly as outliers.)

One-hot encoding. Concatenating train and test before get_dummies guarantees both frames end up with the same dummy columns:

dummies = pd.get_dummies(pd.concat((train.drop(["SalePrice", "Id"], axis=1), test.drop(["Id"], axis=1)), axis=0))
train_dummies = dummies.iloc[: train.shape[0]]
test_dummies = dummies.iloc[train.shape[0]:]

Finding outliers with Ridge:

rr = Ridge(alpha=10)
rr.fit(train_dummies, y)
np.sqrt(-cross_val_score(rr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.1388301732996231

y_pred = rr.predict(train_dummies)
resid = y - y_pred                    # residuals on the training set
mean_resid = resid.mean()
std_resid = resid.std()
z = (resid - mean_resid) / std_resid  # standardized residuals
z = np.array(z)
outliers1 = np.where(abs(z) > abs(z).std() * 3)[0]  # flag points whose |z| exceeds 3x the std of |z|
outliers1

Output: array([ 30, 88, 142, 277, 308, 328, 365, 410, 438, 462, 495, 523, 533, 581, 588, 628, 632, 681, 688, 710, 714, 728, 774, 812, 874, 898, 916, 935, 968, 970, 1062, 1168, 1170, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453], dtype=int64)

plt.figure(figsize=(6, 6))
plt.scatter(y, y_pred)
plt.scatter(y.iloc[outliers1], y_pred[outliers1])
plt.plot(range(10, 15), range(10, 15), color="red")

Finding outliers with ElasticNet:

er = ElasticNet(alpha=0.001, l1_ratio=0.58)
er.fit(train_dummies, y)
np.sqrt(-cross_val_score(er, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.1388301732996231 (this value came from accidentally scoring rr again, which is why it exactly matches the Ridge score; the corrected call on er above gives a slightly different number)

y_pred = er.predict(train_dummies)
resid = y - y_pred
mean_resid = resid.mean()
std_resid = resid.std()
z = (resid - mean_resid) / std_resid
z = np.array(z)
outliers2 = np.where(abs(z) > abs(z).std() * 3)[0]
outliers2

Output: array([ 30, 88, 142, 277, 328, 410, 457, 462, 495, 523, 533, 581, 588, 628, 632, 666, 681, 688, 710, 711, 714, 728, 738, 774, 812, 874, 898, 916, 968, 970, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453], dtype=int64)

plt.figure(figsize=(6, 6))
plt.scatter(y, y_pred)
plt.scatter(y.iloc[outliers2], y_pred[outliers2])
plt.plot(range(10, 15), range(10, 15), color="red")

Take the points that both models predict poorly as the final outliers:

outliers = []
for i in outliers1:
    for j in outliers2:
        if i == j:
            outliers.append(i)
outliers

Output (the formatting of this output is a little off, which is why the outliers are deleted by hand below): [30, 88, 142, 277, 328, 410, 462, 495, 523, 533, 581, 588, 628, 632, 681, 688, 710, 714, 728, 774, 812, 874, 898, 916, 968, 970, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453]
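
The same intersection can be computed in one call (an equivalent, more idiomatic alternative to the loop above):

outliers = np.intersect1d(outliers1, outliers2)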

Delete the outliers:

train = train.drop([30, 88, 142, 277, 328, 410, 462, 495, 523, 533, 581, 588, 628, 632, 681, 688, 710, 714, 728, 774, 812, 874, 898, 916, 968, 970, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453])
y = train["SalePrice"]
y = np.log(y+1)

V. Building the Models

(I used GBDT, XGBoost, Lasso and Ridge, and blended them.)

One-hot encode again, now that the outliers are gone:

dummies = pd.get_dummies(pd.concat((train.drop(["SalePrice", "Id"], axis=1), test.drop(["Id"], axis=1)), axis=0))
train_dummies = dummies.iloc[: train.shape[0]]
test_dummies = dummies.iloc[train.shape[0]:]
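
The same 5-fold RMSE computation is repeated for every model below; a small helper (my addition, not in the original code) shows the pattern once:

def rmse_cv(model):
    # Cross-validated RMSE on the log-transformed price; lower is better
    return np.sqrt(-cross_val_score(model, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()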

GBDT

gbr = GradientBoostingRegressor(max_depth=4, n_estimators=150)
gbr.fit(train_dummies, y)
np.sqrt(-cross_val_score(gbr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.10041800215081471

XGB

xgbr = XGBRegressor(max_depth=5, n_estimators=400)
xgbr.fit(train_dummies, y)
np.sqrt(-cross_val_score(xgbr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.10051704266055339

Lasso

lsr = Lasso(alpha=0.00047)
lsr.fit(train_dummies, y)
np.sqrt(-cross_val_score(lsr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.09072389427316205

Ridge

rr = Ridge(alpha=13)
rr.fit(train_dummies, y)
np.sqrt(-cross_val_score(rr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.09161386485828467

The cross-validation score is something of a dark art here: sometimes it improves and the leaderboard score drops after submission, and sometimes it drops and the submission improves.

Blending the models (I also tried stacking, which raised the score, but not as much as this simple blend):

train_predict = 0.1 * gbr.predict(train_dummies) + 0.3 * xgbr.predict(train_dummies) + 0.3 * lsr.predict(train_dummies) + 0.3 * rr.predict(train_dummies)

This is the blend of predictions on the training set; I build it first because I am about to adjust the predicted values by hand.
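
Since the same weighted average is applied to the test set later, it could be wrapped in a helper (an optional refactor, not in the original code):

def blend(X):
    # Weights as chosen above: 0.1 GBDT, 0.3 each for XGB, Lasso and Ridge
    return (0.1 * gbr.predict(X) + 0.3 * xgbr.predict(X)
            + 0.3 * lsr.predict(X) + 0.3 * rr.predict(X))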

Manually adjusting the predictions

(This trick comes from https://www.kaggle.com/agehsbarg/top-10-0-10943-stacking-mice-and-brutal-force. It is very effective, though there is little principle behind it; it probably would not transfer to a larger dataset, but on the House Prices public leaderboard it genuinely works.)

Look at how the blend fits the training set:

plt.figure(figsize=(6, 6))
plt.scatter(y, train_predict)
plt.plot(range(10, 15), range(10, 15), color="red")

Notice that the points at the bottom do not sit on the red line the way the points at the top do, which suggests the low-end predictions may be poor. They can be adjusted by hand: use a quantile to select the predictions you want to shift, then scale them. Both the quantile argument and the scaling factor are tunable; search for the values that give the best score.

# Scale the lowest 0.42% of predictions down by 1%
pre_df = pd.DataFrame({"SalePrice": train_predict})
q1 = pre_df.SalePrice.quantile(0.0042)
pre_df.loc[pre_df.SalePrice <= q1, "SalePrice"] *= 0.99
train_predict = np.array(pre_df.SalePrice)
plt.figure(figsize=(6, 6))
plt.scatter(y, train_predict)
plt.plot(range(10, 15), range(10, 15), color="red")

Predict and submit

test_predict = 0.1 * gbr.predict(test_dummies) + 0.3 * xgbr.predict(test_dummies) + 0.3 * lsr.predict(test_dummies) + 0.3 * rr.predict(test_dummies)
# Same low-end adjustment, with a stronger factor of 0.96
pre_df = pd.DataFrame({"SalePrice": test_predict})
q1 = pre_df.SalePrice.quantile(0.0042)
pre_df.loc[pre_df.SalePrice <= q1, "SalePrice"] *= 0.96
test_predict = np.array(pre_df.SalePrice)
sample_submission["SalePrice"] = np.exp(test_predict) - 1  # invert the log transform
sample_submission.to_csv("all/1.csv", index=False)

I ended up with a factor of 0.96, which produced a nice result: this submission scores 0.11052, within the top 2%.

VI. Summary

1. Filling missing values matters a great deal and can raise the score noticeably; try different fills yourself. The three special PoolQC rows, for instance, admit several filling combinations.

2. Outlier selection: the outliers listed above are not necessarily optimal for this model. Try adding new ones or dropping some of the old ones; some combination will raise your score again.

3. Hunting for redundant features: I stumbled on this while building my first, simplest model, manually testing which features helped and which hurt. The two harmful features I found then have stayed dropped all the way to this final model, and they still matter: removing them raises the score again.

4. Titanic teaches that engineering new features can be very useful, and many House Prices kernels create them, but they did nothing for this model.

5. The feature-skew corrections and PCA that many kernels discuss also did nothing for this model.

In the end, deleting the outliers [30, 88, 410, 462, 495, 523, 588, 628, 632, 874, 898, 968, 970, 1182, 1298, 1324, 1432] and dropping the features LandSlope and Exterior2nd takes this model to 0.10955, top 1%.
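
For completeness, those final changes look roughly like this (a sketch: the shorter outlier list replaces the one in section IV-4, and the feature drop must happen before the dummies are rebuilt):

train = train.drop([30, 88, 410, 462, 495, 523, 588, 628, 632, 874, 898, 968, 970, 1182, 1298, 1324, 1432])
train = train.drop(["LandSlope", "Exterior2nd"], axis=1)
test = test.drop(["LandSlope", "Exterior2nd"], axis=1)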
