Kaggle House Price Prediction with Random Forest

秋天要到了

Article link: https://blog.csdn.net/qq_43680030/article/details/84033793

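This post describes my experience entering the Kaggle competition 'House Prices: Advanced Regression Techniques' and shares a random forest approach to predicting house prices. It draws on several tutorials, provides a Python implementation, and reached a final Kaggle score of 0.15105.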

House Prices: Advanced Regression Techniques

Competition link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
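For context on the 0.15105 score quoted in the summary: this competition ranks submissions by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the actual sale price, so lower is better. The sketch below computes that metric in plain Python. The function name `rmsle` and the sample prices are made up for illustration and are not part of the original post.

```python
import math


def rmsle(y_true, y_pred):
    """Root-mean-squared error between log(actual) and log(predicted) prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [
        (math.log(pred) - math.log(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sale prices and predictions, for illustration only
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]
print(round(rmsle(actual, predicted), 4))
```

Because the error is taken on log prices, being 10% off on a cheap house is penalised about as much as being 10% off on an expensive one.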

Python code

# -*- coding: utf-8 -*-
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from subprocess import check_output
# print(check_output(["dir", "./"]).decode("utf8"))

# Load the data
train_data = pd.read_csv("./data/train.csv")
test_data = pd.read_csv("./data/test.csv")

# Helper functions for cleaning the data; they all operate on pandas DataFrames

# List the columns that contain missing values
def show_missing(houseprice):
    missing = houseprice.columns[houseprice.isnull().any()].tolist()
    return missing

# Print the value counts of a categorical feature
def cat_exploration(houseprice, column):
    print(houseprice[column].value_counts())

# Fill the missing values of one column with a given value
def cat_imputation(houseprice, column, value):
    houseprice.loc[houseprice[column].isnull(), column] = value

# LotFrontage
# check correlation with LotArea
print(test_data['LotFrontage'].corr(test_data['LotArea']))  # 0.64
print(train_data['LotFrontage'].corr(train_data['LotArea']))  # 0.42
test_data['SqrtLotArea'] = np.sqrt(test_data['LotArea'])
train_data['SqrtLotArea'] = np.sqrt(train_data['LotArea'])
# print(test_data['LotFrontage'].corr(test_data['SqrtLotArea']))
# print(train_data['LotFrontage'].corr(train_data['SqrtLotArea']))

# Fill missing LotFrontage with sqrt(LotArea), a rough side length of the lot.
# .loc is used instead of chained indexing (test_data.LotFrontage[cond] = ...),
# which may silently assign to a temporary copy.
cond = test_data['LotFrontage'].isnull()
test_data.loc[cond, 'LotFrontage'] = test_data.loc[cond, 'SqrtLotArea']
cond = train_data['LotFrontage'].isnull()
train_data.loc[cond, 'LotFrontage'] = train_data.loc[cond, 'SqrtLotArea']
del test_data['SqrtLotArea']
del train_data['SqrtLotArea']

# MSZoning
# Has missing values in the test set but not in the training set
cat_exploration(test_data, 'MSZoning')
print(test_data[test_data['MSZoning'].isnull()])
# MSSubClass vs. MSZoning
print(pd.crosstab(test_data.MSSubClass, test_data.MSZoning))
# Fill missing MSZoning in test_data by MSSubClass: 30 -> RM, 20 -> RL, 70 -> RM.
# The mask restricts the assignment to rows where MSZoning is actually missing;
# without it, every row of these subclasses would be overwritten.
zoning_missing = test_data['MSZoning'].isnull()
test_data.loc[zoning_missing & (test_data['MSSubClass'] == 20), 'MSZoning'] = 'RL'
test_data.loc[zoning_missing & (test_data['MSSubClass'] == 30), 'MSZoning'] = 'RM'
test_data.loc[zoning_missing & (test_data['MSSubClass'] == 70), 'MSZoning'] = 'RM'

# Alley
# cat_exploration prints its result and returns None, so it is not wrapped in print()
cat_exploration(test_data, 'Alley')
cat_exploration(train_data, 'Alley')
# Alley has too many NaNs. Here they are filled with 'None'; the column could also
# simply be dropped, or discarded later when selecting features by importance.
cat_imputation(test_data, 'Alley', 'None')
cat_imputation(train_data,