DataCastle Rent Prediction Competition: A Personal Summary

buttogo

Article link: https://blog.csdn.net/buttogo/article/details/106127080


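Competition link
Task:
Given data on the factors that affect rental prices, build a model that predicts the monthly rent of housing in a city in China.
Data fields:
(1) ID: record number;
(2) 时间 (time): when the listing information was collected;
(3) 小区名 (community name): the community the listing belongs to, desensitized;
(4) 小区房屋出租数量 (listings in the community): number of homes for rent in the community, desensitized;
(5) 楼层 (floor): 0, 1, and 2 denote low, middle, and high floors;
(6) 总层数 (total floors): total number of floors in the building, desensitized; the column is named 总楼层 in the data files;
(7) 房屋面积 (floor area): area of the home, desensitized;
(8) 房屋朝向 (orientation): the direction or directions the home faces;
(9) 居住状态 (occupancy status): whether the home is already rented or occupied, desensitized;
(10) 卧室数量 (bedrooms): number of bedrooms;
(11) 卫的数量 (bathrooms): number of bathrooms;
(12) 厅的数量 (living rooms): number of living rooms;
(13) 出租方式 (rental type): 1 for whole-unit rental, 0 for shared rental;
(14) 区 (district): district-level administrative unit, encoded as a number;
(15) 位置 (location): business district in which the community is located, desensitized;
(16) 地铁线路 (metro line): number identifying the metro line, desensitized;
(17) 地铁站点 (metro station): the metro station nearest the home, desensitized;
(18) 距离 (distance): distance from the home to the metro station, desensitized;
(19) 装修情况 (decoration level): a higher value means a higher grade of decoration, desensitized;
(20) Label: monthly rent, the target value, desensitized.
Scoring:
Submissions are scored by mean squared error (MSE); the lower the MSE, the better the regression model.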
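For reference, the metric is MSE = (1/n) * sum((y_i - yhat_i)^2) over the n scored rows. The sketch below writes it out in plain Python; the function name `mse` and the toy numbers are illustrative assumptions, not part of the competition kit.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, not competition data
rents = [3.0, 5.0, 4.0]
preds = [2.5, 5.0, 6.0]
print(mse(rents, preds))
```

Because the errors are squared, a single badly predicted listing can dominate the score, which is one reason the outliers found during EDA matter.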

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn import linear_model
import lightgbm as lgb
import xgboost as xgb
import catboost as cb
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
from sklearn.metrics import mean_squared_error,make_scorer

Loading the data

train = pd.read_csv("train.csv")
test = pd.read_csv("test_noLabel.csv")

EDA

# Inspect the training set
train.head()

# Inspect the test set
test.head()

# Check the size, dtypes, and missing values of the training and test sets
train.info()
print('-------------------')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196539 entries, 0 to 196538
Data columns (total 20 columns):
ID          196539 non-null int64
位置          196508 non-null float64
出租方式        24230 non-null float64
区           196508 non-null float64
卧室数量        196539 non-null int64
卫的数量        196539 non-null int64
厅的数量        196539 non-null int64
地铁站点        91778 non-null float64
地铁线路        91778 non-null float64
小区名         196539 non-null int64
小区房屋出租数量    195538 non-null float64
居住状态        20138 non-null float64
总楼层         196539 non-null float64
房屋朝向        196539 non-null object
房屋面积        196539 non-null float64
时间          196539 non-null int64
楼层          196539 non-null int64
装修情况        18492 non-null float64
距离          91778 non-null float64
Label       196539 non-null float64
dtypes: float64(12), int64(7), object(1)
memory usage: 30.0+ MB
-------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56279 entries, 0 to 56278
Data columns (total 19 columns):
ID          56279 non-null int64
位置          56269 non-null float64
出租方式        4971 non-null float64
区           56269 non-null float64
卧室数量        56279 non-null int64
卫的数量        56279 non-null int64
厅的数量        56279 non-null int64
地铁站点        26494 non-null float64
地铁线路        26494 non-null float64
小区名         56279 non-null int64
小区房屋出租数量    56257 non-null float64
居住状态        4483 non-null float64
总楼层         56279 non-null float64
房屋朝向        56279 non-null object
房屋面积        56279 non-null float64
时间          56279 non-null int64
楼层          56279 non-null int64
装修情况        4207 non-null float64
距离          26494 non-null float64
dtypes: float64(11), int64(7), object(1)
memory usage: 8.2+ MB

# Percentage of missing values per column in the training set
train_missing = (train.isnull().sum()/len(train))*100
train_missing = train_missing.drop(train_missing[train_missing==0].index).sort_values(ascending=False)
train_missing

装修情况        90.591180
居住状态        89.753688
出租方式        87.671658
距离          53.302907
地铁线路        53.302907
地铁站点        53.302907
小区房屋出租数量     0.509314
区            0.015773
位置           0.015773
dtype: float64

# Percentage of missing values per column in the test set
test_missing = (test.isnull().sum()/len(test))*100
test_missing = test_missing.drop(test_missing[test_missing==0].index).sort_values(ascending=False)
test_missing

装修情况        92.524743
居住状态        92.034329
出租方式        91.167220
距离          52.923826
地铁线路        52.923826
地铁站点        52.923826
小区房屋出租数量     0.039091
区            0.017769
位置           0.017769
dtype: float64

In both the training and test sets, 装修情况, 居住状态, and 出租方式 are missing in roughly 88% to 93% of rows, so these three features are simply dropped.

train.drop(['装修情况','居住状态','出租方式'],axis=1,inplace=True)
test.drop(['装修情况','居住状态','出租方式'],axis=1,inplace=True)
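Instead of listing the columns by hand, the drop can be driven by a missing-rate threshold. The following is a minimal library-free sketch of that idea, with a list of dictionaries standing in for the DataFrame; the 0.8 cut-off, the helper names, and the toy rows are assumptions for illustration.

```python
def missing_fraction(rows, column):
    """Share of rows whose value for `column` is None."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def columns_to_drop(rows, columns, threshold=0.8):
    """Columns whose missing fraction exceeds `threshold` (assumed cut-off)."""
    return [c for c in columns if missing_fraction(rows, c) > threshold]


# Toy rows standing in for the training set; None marks a missing value
rows = [
    {"房屋面积": 0.01, "装修情况": None, "距离": 0.5},
    {"房屋面积": 0.02, "装修情况": None, "距离": None},
    {"房屋面积": 0.03, "装修情况": None, "距离": 0.2},
    {"房屋面积": 0.04, "装修情况": None, "距离": None},
    {"房屋面积": 0.05, "装修情况": None, "距离": None},
    {"房屋面积": 0.06, "装修情况": 2.0, "距离": 0.1},
]
print(columns_to_drop(rows, ["房屋面积", "装修情况", "距离"]))
```

With pandas, the same fractions are available as `train.isnull().mean()`. The decision should be made on the training set and then applied to both sets so that they keep identical columns.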

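# Inspect correlations (numeric columns only; 房屋朝向 is still a string column at this point)
columns = train.columns.drop('ID')
correlation = train[columns].select_dtypes(include='number').corr()
plt.figure(figsize=(20, 10))
sns.heatmap(correlation,square = True, annot=True, fmt='0.2f',vmax=0.8)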

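[Figure: correlation heatmap of the training features]
The correlation analysis shows that 房屋面积, 卫的数量, 卧室数量, and 厅的数量 have the highest correlation with rent and deserve the closest look, followed by 地铁线路, 区, and 总楼层; the remaining features correlate only weakly with rent.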
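`DataFrame.corr()` computes the Pearson coefficient by default, which only measures linear association. A small plain-Python sketch of that coefficient follows; the function name and the sample values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: area and rent rising together
area = [40.0, 60.0, 80.0, 100.0]
rent = [2.0, 3.1, 3.9, 5.2]
print(round(pearson(area, rent), 3))
```

A low coefficient therefore does not rule out a useful feature: categorical codes such as 区 or 小区名 can carry a lot of signal for tree models while showing little linear correlation with the label.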

# Floor area (房屋面积)
sns.regplot(x=train['房屋面积'],y=train['Label'])

[Figure: regression plot of 房屋面积 against Label]
The plot shows outliers in 房屋面积, which are removed below:

train = train.drop(train[train['房屋面积']>1400].index)
sns.regplot(x=train['房屋面积'],y=train['Label'])

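[Figure: regression plot of 房屋面积 against Label after removing the outliers]

# Number of bathrooms (卫的数量)
sns.boxplot(x=train['卫的数量'],y=train['Label'])

[Figure: box plot of Label by 卫的数量]

# Number of bedrooms (卧室数量)
sns.boxplot(x=train['卧室数量'],y=train['Label'])

[Figure: box plot of Label by 卧室数量]

# Number of living rooms (厅的数量)
sns.boxplot(x=train['厅的数量'],y=train['Label'])

[Figure: box plot of Label by 厅的数量]
These plots show that 房屋面积, 卫的数量, 卧室数量, and 厅的数量 are all roughly positively correlated with rent.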

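In the missing-value analysis above, 地铁线路, 地铁站点, and 距离 have identical missing rates, and a closer look shows that the three are missing in exactly the same rows. A plausible explanation is that these listings have no metro line nearby, so missing 地铁线路 values are filled with 0. The same fill is applied to the test set so that both sets are encoded consistently.

train['地铁线路'] = train['地铁线路'].fillna(0)
test['地铁线路'] = test['地铁线路'].fillna(0)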

Feature engineering

The data contains one object-typed column, 房屋朝向, which needs further processing.

train['房屋朝向'].value_counts()

南              54767
东南             54353
东              31952
西南             17470
北              10428
西               9798
西北              5179
南 北             4003
东北              3287
东南 南             848
东 东南             823
东 西              741
南 西南             423
东 南              401
东南 西南            240
南 西              215
东南 西北            152
西南 西             122
东 北              103
西 西北              87
南 西南 北            86
西 北               84
西南 西北             74
东南 南 西南           70
东南 东北             69
东 东北              67
西北 北              59
南 西北              59
东南 西              57
北 东北              57
               ...  
东 南 西 北           35
东 西北              35
南 东               24
西北 东北             23
东 东南 南            18
南 东北              16
东南 南 北            13
南 西 北             11
东 东南 西南           11
东 南 西             11
东南 西南 西北          11
东 南 北              8
东 西 北              7
南 西南 西             7
东 西北 北             6
北 南                5
西 西北 北             5
东 东南 南 西南 西        4
东南 西南 西            4
东 南 西北 北           4
东 西 东北             2
东 东南 北             2
西南 西 东北            1
东 西南 北             1
北 西                1
东 南 西南             1
东南 西 北             1
南 北 东北             1
南 西南 西 西北          1
东南 南 西南 西          1
Name: 房屋朝向, Length: 64, dtype: int64

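The value counts show that a listing can face several directions at once. The 房屋朝向 feature is therefore split into eight binary features, 东 (east), 南 (south), 西 (west), 北 (north), 东南 (southeast), 东北 (northeast), 西南 (southwest), and 西北 (northwest), and the original 房屋朝向 column is dropped: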

# A cardinal direction such as '东' is also a substring of '东南' and '东北'.
# Each helper therefore compares how often the character occurs with how often
# the compound directions alone would account for it.
def east(x):
    if ('东' in x and '东南' not in x and '东北' not in x)\
    or ('东' in x and '东南' in x and '东北' not in x and x.count('东')==2)\
    or('东' in x and '东南' not in x and '东北' in x and x.count('东')==2)\
    or ('东' in x and '东南' in x and '东北' in x and x.count('东')==3):
        y = 1
    else:
        y = 0
    return y

def west(x):
    if ('西' in x and '西南' not in x and '西北' not in x)\
    or ('西' in x and '西南' in x and '西北' not in x and x.count('西')==2)\
    or('西' in x and '西南' not in x and '西北' in x and x.count('西')==2)\
    or ('西' in x and '西南' in x and '西北' in x and x.count('西')==3):
        y = 1
    else:
        y = 0
    return y

def south(x):
    if ('南' in x and '东南' not in x and '西南' not in x)\
    or ('南' in x and '东南' in x and '西南' not in x and x.count('南')==2)\
    or('南' in x and '东南' not in x and '西南' in x and x.count('南')==2)\
    or ('南' in x and '东南' in x and '西南' in x and x.count('南')==3):
        y = 1
    else:
        y = 0
    return y

def north(x):
    if ('北' in x and '西北' not in x and '东北' not in x)\
    or ('北' in x and '西北' in x and '东北' not in x and x.count('北')==2)\
    or('北' in x and '西北' not in x and '东北' in x and x.count('北')==2)\
    or ('北' in x and '西北' in x and '东北' in x and x.count('北')==3):
        y = 1
    else:
        y = 0
    return y

train['东']=train['房屋朝向'].apply(lambda x: east(x))
train['西']=train['房屋朝向'].apply(lambda x: west(x))
train['南']=train['房屋朝向'].apply(lambda x: south(x))
train['北']=train['房屋朝向'].apply(lambda x: north(x))
train['东南'] = train['房屋朝向'].apply(lambda x : 1 if '东南' in x else 0)
train['西南'] = train['房屋朝向'].apply(lambda x : 1 if '西南' in x else 0)
train['东北'] = train['房屋朝向'].apply(lambda x : 1 if '东北' in x else 0)
train['西北'] = train['房屋朝向'].apply(lambda x : 1 if '西北' in x else 0)
train.drop('房屋朝向',axis=1,inplace=True)

test['东'] = test['房屋朝向'].apply(lambda x: east(x))
test['西'] = test['房屋朝向'].apply(lambda x: west(x))
test['南'] = test['房屋朝向'].apply(lambda x: south(x))
test['北'] = test['房屋朝向'].apply(lambda x: north(x))
test['东南'] = test['房屋朝向'].apply(lambda x : 1 if '东南' in x else 0)
test['西南'] = test['房屋朝向'].apply(lambda x : 1 if '西南' in x else 0)
test['东北'] = test['房屋朝向'].apply(lambda x : 1 if '东北' in x else 0)
test['西北'] = test['房屋朝向'].apply(lambda x : 1 if '西北' in x else 0)
test.drop('房屋朝向',axis=1,inplace=True)
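The substring checks above can be avoided by comparing whole tokens. The sketch below is one possible simplification, not the code used for the submitted results; it assumes the directions inside a 房屋朝向 value are separated by whitespace, as they appear in the value counts, and the names `DIRECTIONS` and `encode_orientation` are hypothetical.

```python
# The eight target features, in the order used in this post
DIRECTIONS = ['东', '南', '西', '北', '东南', '东北', '西南', '西北']


def encode_orientation(value):
    """Return a {direction: 0 or 1} dict for one 房屋朝向 string."""
    # Splitting first means '东南' is one token and never counts as '东' or '南'
    tokens = set(value.split())
    return {direction: int(direction in tokens) for direction in DIRECTIONS}


print(encode_orientation('东南 南 西南'))
print(encode_orientation('西南 西'))
```

Applied to a DataFrame, each feature then reduces to a one-line membership test such as `train['房屋朝向'].apply(lambda x: int('东' in x.split()))`.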

Next, some new features are constructed from the existing columns:

# Total number of rooms
train['房间总数'] = train['卫的数量'] + train['卧室数量'] + train['厅的数量']
test['房间总数'] = test['卧室数量'] + test['厅的数量'] + test['卫的数量']

# Average area per room
train['平均面积'] = train['房屋面积'] / train['房间总数']
test['平均面积'] = test['房屋面积'] / test['房间总数']

# Bathroom area (floor area split in proportion to the room counts)
train['卫的面积'] = train['房屋面积']*(train['卫的数量']/train['房间总数'])
test['卫的面积'] = test['房屋面积']*(test['卫的数量']/test['房间总数'])

# Bedroom area
train['卧室面积'] = train['房屋面积']*(train['卧室数量']/train['房间总数'])
test['卧室面积'] = test['房屋面积']*(test['卧室数量']/test['房间总数'])

# Living-room area
train['厅的面积'] = train['房屋面积']*(train['厅的数量']/train['房间总数'])
test['厅的面积'] = test['房屋面积']*(test['厅的数量']/test['房间总数'])

# Floor ratio
train['楼层比'] = (train['楼层'] + 1) / train['总楼层']
test['楼层比'] = (test['楼层'] + 1) / test['总楼层']

# Per community: number of training listings with a recorded metro station
# (count() counts non-null rows; it is not the number of distinct stations)
temp = train.groupby('小区名')['地铁站点'].count().reset_index()
temp.columns = ['小区名','地铁站点数量']
train = train.merge(temp, how = 'left',on = '小区名')
test = test.merge(temp, how = 'left',on = '小区名')

# Per community: mean floor area of the listings
area_mean = train.groupby('小区名')['房屋面积'].mean().reset_index()
area_mean.columns = ['小区名','小区房屋平均面积']
train = train.merge(area_mean, how = 'left',on = '小区名')
test = test.merge(area_mean, how = 'left',on = '小区名')

# Per community: mean total floors of the buildings
height_mean = train.groupby('小区名')['总楼层'].mean().reset_index()
height_mean.columns = ['小区名','小区楼房平均高度']
train = train.merge(height_mean, how = 'left',on = '小区名')
test = test.merge(height_mean, how = 'left',on = '小区名')
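The 地铁站点数量 feature is built with `count()`, which counts the rows in a community that have a station recorded. If the number of distinct stations is wanted instead, pandas offers `nunique()` on the same group-by. The plain-Python sketch below shows the difference on toy data; the community and station codes and the helper name are invented.

```python
def station_stats(listings):
    """Per community: (rows with a station recorded, distinct stations)."""
    stats = {}
    for community, station in listings:
        # Create the entry even when the station is missing
        _, stations = stats.setdefault(community, [0, set()])
        if station is not None:
            stats[community][0] += 1
            stations.add(station)
    return {c: (rows, len(stations)) for c, (rows, stations) in stats.items()}


# (小区名, 地铁站点) pairs; None marks a listing with no station recorded
listings = [
    (3269, 40), (3269, 40), (3269, 40),
    (5512, None), (5512, 7), (5512, 9),
    (6011, None),
]
print(station_stats(listings))
```

Community 3269 has three listings near the same station, so the row count is 3 while the distinct count is 1. Which of the two is the better feature is an empirical question.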

Modeling

train.columns

Index(['ID', '位置', '区', '卧室数量', '卫的数量', '厅的数量', '地铁站点', '地铁线路', '小区名',
       '小区房屋出租数量', '总楼层', '房屋面积', '时间', '楼层', '距离', 'Label', '东', '西', '南',
       '北', '东南', '西南', '东北', '西北', '房间总数', '平均面积', '卫的面积', '卧室面积', '厅的面积',
       '楼层比', '地铁站点数量', '小区房屋平均面积', '小区楼房平均高度'],
      dtype='object')

feature_cols = ['位置', '区', '卧室数量', '卫的数量', '厅的数量', '地铁站点', '地铁线路', '小区名',
       '小区房屋出租数量', '总楼层', '房屋面积', '时间', '楼层', '距离', '东', '西', '南',
       '北', '东南', '西南', '东北', '西北', '房间总数', '平均面积', '卫的面积', '卧室面积', '厅的面积',
       '楼层比', '地铁站点数量', '小区房屋平均面积', '小区楼房平均高度']

# Build the training and test samples from the feature columns and the label column
X_data = train[feature_cols]
Y_data = train['Label']
X_test  = test[feature_cols]
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)

X train shape: (196519, 31)
X test shape: (56279, 31)

Two models, xgb and lgb, are used:

def build_model_xgb(x_train,y_train):
    estimator = xgb.XGBRegressor(max_depth=10,subsample=0.7,colsample_bytree=0.75,reg_lambda=0.1,n_estimators=300)
    param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}
    model = GridSearchCV(estimator, param_grid)
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 300)
    param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm

# Split into training and validation sets
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)

print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MSE_lgb = mean_squared_error(y_val,val_lgb)
print('MSE of val with lgb:',MSE_lgb)

print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)

Train lgb...
MSE of val with lgb: 2.378892214877016
Predict lgb...

print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MSE_xgb = mean_squared_error(y_val,val_xgb)
print('MSE of val with xgb:',MSE_xgb)

print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)

Train xgb...
MSE of val with xgb: 2.2006889137976047
Predict xgb...

Export the predictions to CSV files:

sub_lgb = pd.DataFrame()
sub_lgb['ID'] = test.ID
sub_lgb['Label'] = subA_lgb
sub_lgb.to_csv("sub_lgb.csv",index=False)

sub_xgb = pd.DataFrame()
sub_xgb['ID'] = test.ID
sub_xgb['Label'] = subA_xgb
sub_xgb.to_csv("sub_xgb.csv",index=False)

Model ensembling

A simple stacking ensemble:

# First layer
train_lgb_pred = model_lgb.predict(x_train)
train_xgb_pred = model_xgb.predict(x_train)

Stack_X_train = pd.DataFrame()
Stack_X_train['Method_1'] = train_lgb_pred
Stack_X_train['Method_2'] = train_xgb_pred

Stack_X_val = pd.DataFrame()
Stack_X_val['Method_1'] = val_lgb
Stack_X_val['Method_2'] = val_xgb

Stack_X_test = pd.DataFrame()
Stack_X_test['Method_1'] = subA_lgb
Stack_X_test['Method_2'] = subA_xgb

# Second layer
def build_model_lr(x_train,y_train):
    reg_model = linear_model.LinearRegression()
    reg_model.fit(x_train,y_train)
    return reg_model

model_lr_Stacking = build_model_lr(Stack_X_train,y_train)
# Training set
train_pre_Stacking = model_lr_Stacking.predict(Stack_X_train)
print('MSE of Stacking-LR:',mean_squared_error(y_train,train_pre_Stacking))

# Validation set
val_pre_Stacking = model_lr_Stacking.predict(Stack_X_val)
print('MSE of Stacking-LR:',mean_squared_error(y_val,val_pre_Stacking))

# Test set
print('Predict Stacking-LR...')
subA_Stacking = model_lr_Stacking.predict(Stack_X_test)

MSE of Stacking-LR: 0.8490747301439259
MSE of Stacking-LR: 2.3832075373636856
Predict Stacking-LR...

Export the stacked predictions to a CSV file:

sub_stack = pd.DataFrame()
sub_stack['ID'] = test.ID
sub_stack['Label'] = subA_Stacking
sub_stack.to_csv("C:sub_stacking.csv",index=False)

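The stacked model's training MSE (0.849) is far below its validation MSE (2.383), and that validation MSE is not lower than the validation MSE of either base model (2.379 for lgb, 2.201 for xgb). The local numbers therefore do not show a gain from stacking, most likely because the second-layer model was fit on the base models' predictions for their own training rows, which are over-optimistic. The result on the online leaderboard nevertheless improved somewhat after submitting the stacked predictions.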
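In the stacking code above, the second-layer model is fit on `model_lgb.predict(x_train)` and `model_xgb.predict(x_train)`, that is, on predictions for rows the base models have already seen. A common alternative is to build the second-layer features from out-of-fold predictions, where every row is predicted by a model trained without it. The sketch below illustrates only that mechanism in plain Python; the memorizing toy model, the modulo fold assignment, and all names are assumptions for illustration and are not the models used in this post.

```python
def fit_memorizer(xs, ys):
    """Toy base model that overfits: recalls seen targets, else the mean."""
    seen = dict(zip(xs, ys))
    fallback = sum(ys) / len(ys)
    return lambda x: seen.get(x, fallback)


def out_of_fold_predictions(fit, xs, ys, n_folds=3):
    """Predict every row with a model that was trained without that row."""
    oof = [None] * len(xs)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(xs[i])
    return oof


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

in_sample = [fit_memorizer(xs, ys)(x) for x in xs]
oof = out_of_fold_predictions(fit_memorizer, xs, ys)

print('in-sample MSE:', mse(ys, in_sample))
print('out-of-fold MSE:', mse(ys, oof))
```

The memorizer looks perfect in-sample, so a second layer trained on those predictions would trust it completely; the out-of-fold error gives a more honest picture. In scikit-learn, `cross_val_predict` from `sklearn.model_selection` produces predictions of this kind for a real estimator.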

Summary

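This competition was relatively simple. My main goal was to become familiar with the full data-analysis workflow and to practice my analysis skills. Along the way I drew on experience shared by other participants, which helped a lot as a newcomer to competitions, so I hope this write-up is of some help to others too. The main lessons from this competition are:

Feature engineering really matters and deserves time and careful thought. I had heard this many times without feeling it deeply, but repeatedly adjusting the features raised the score substantially, more noticeably than switching models did. My own feature engineering is still incomplete; more features could be built from the factors people actually consider when renting a home.
I chose lgb and xgb, two high-performing algorithms that are widely used in competitions. Both handle missing values natively, so the remaining missing values were not fully imputed.
A difficulty I ran into: when processing 房屋朝向, each listing can have several orientations, and when they are split into eight directions, intermediate directions such as 东南 are easily matched as the two basic directions 东 and 南, so the matching conditions have to be restricted. The matching functions I wrote are rather convoluted and can be optimized further.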