Exploring the Factors That Influence Property Value with Statistical Tests and Random Forests

Article link: https://blog.csdn.net/m0_53814833/article/details/138862326

Project link: one-click run
Data source: Heywhale (和鲸) or Kaggle

1. Project Background

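This dataset records each house's basic specifications, such as total area, number of bedrooms and bathrooms, and number of stories. It also covers details that matter for everyday convenience: whether the house adjoins a main road, whether it has a guest room, a basement, hot-water heating, or air conditioning, and how much parking it offers. In addition, the dataset flags whether the property lies in a preferred area and records its furnishing status, both of which are important when assessing property value.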

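In today's real-estate market, accurately valuing residential property matters to buyers, sellers, and investors alike: it helps buyers and sellers make informed decisions and gives investors a reliable reference for spotting opportunities. To that end, this project analyzes and models a dataset containing many of the core factors behind house prices. The goal is to identify the factors that influence price most and to build a predictive model that can support property valuation.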

2. Data Description

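Field	Description
price	Price of the property.
area	Total area of the property, in square feet.
bedrooms	Number of bedrooms.
bathrooms	Number of bathrooms.
stories	Number of stories.
mainroad	Whether the property adjoins a main road (yes/no).
guestroom	Whether the property has a guest room (yes/no).
basement	Whether the property has a basement (yes/no).
hotwaterheating	Whether the property has a hot-water heating system (yes/no).
airconditioning	Whether the property has air conditioning (yes/no).
parking	Number of parking spaces.
prefarea	Whether the property is in a preferred area (yes/no).
furnishingstatus	Furnishing status of the property (furnished, semi-furnished, unfurnished).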
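The yes/no fields and `furnishingstatus` are stored as strings, so they generally need a numeric encoding before correlation tests or a random forest can use them. The following is a minimal sketch of one possible encoding, run on a small made-up DataFrame rather than the real dataset; the helper name `encode_housing` and the choice of one-hot encoding for `furnishingstatus` are assumptions for illustration, not the author's code.

```python
import pandas as pd

# Binary yes/no columns as listed in the field table
BINARY_COLS = ["mainroad", "guestroom", "basement",
               "hotwaterheating", "airconditioning", "prefarea"]


def encode_housing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: map yes/no to 1/0 and one-hot encode furnishingstatus."""
    out = df.copy()
    for col in BINARY_COLS:
        if col in out.columns:
            out[col] = out[col].map({"yes": 1, "no": 0})
    if "furnishingstatus" in out.columns:
        dummies = pd.get_dummies(out["furnishingstatus"],
                                 prefix="furnishing", dtype=int)
        out = pd.concat([out.drop(columns="furnishingstatus"), dummies], axis=1)
    return out


# Made-up rows, only to show the shape of the result
sample = pd.DataFrame({
    "price": [1, 2, 3],
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})
encoded = encode_housing(sample)
print(encoded)
```

One-hot encoding avoids imposing an order on the three furnishing levels; an ordinal mapping would be an alternative if that order is considered meaningful.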

3. Importing Python Libraries and Loading the Data

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import spearmanr, pointbiserialr, ttest_ind
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

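# Not needed on Heywhale; required in a local environment so that Chinese text renders in plots
import matplotlib as mpl
mpl.rcParams["font.sans-serif"] = ["SimHei"]  # Use SimHei (黑体), a font that includes Chinese glyphs
mpl.rcParams["axes.unicode_minus"] = False  # After changing the font, the minus sign on axes may not render; this restores it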

# Raw string, so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r'D:\Desktop\商业数据分析案例\房产行情评估数据集\Housing_Price_Data.csv')

4. Data Preview and Processing

# Check the dimensions of the data
data.shape

(545, 13)

# Inspect column types and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
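The `info()` output shows six `int64` columns and seven `object` columns. A short sketch of how the two groups could be separated programmatically, using a made-up four-column frame in place of `data`; the variable names `numeric_cols` and `categorical_cols` are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in for `data`, with two numeric and two string columns
sample = pd.DataFrame({
    "price": [100, 200, 300],
    "area": [10, 20, 30],
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "furnished"],
})

# Numeric columns, and everything else (the string-valued columns here)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of distinct levels per categorical column
levels = {col: sample[col].nunique() for col in categorical_cols}
print(numeric_cols, categorical_cols, levels)
```

Keeping the two lists around makes it easier to apply different tests and plots to numeric and categorical fields later on.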

# Count missing values in each column
data.isna().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

# Count duplicate rows
data.duplicated().sum()
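The output of this cell is not shown in the source. If the count were non-zero, one common way to handle it would be to drop the repeated rows; the sketch below demonstrates this on a made-up frame, not on the housing data.

```python
import pandas as pd

# Made-up frame in which the second row repeats the first
sample = pd.DataFrame({"price": [100, 100, 200], "area": [10, 10, 20]})

# duplicated() marks every repeat of an earlier row as True
n_dup = int(sample.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_dup, len(deduped))
```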

5. Descriptive Analysis

data.describe(include='all')

	price	area	bedrooms	bathrooms	stories	mainroad	guestroom	basement	hotwaterheating	airconditioning	parking	prefarea	furnishingstatus
count	5.450000e+02	545.000000	545.000000	545.000000	545.000000	545	545	545	545	545	545.000000	545	545
unique	NaN	NaN	NaN	NaN	NaN	2	2	2	2	2	NaN	2	3
top	NaN	NaN	NaN	NaN	NaN	yes	no	no	no	no	NaN	no	semi-furnished
freq	NaN	NaN	NaN	NaN	NaN	468	448	354	520	373	NaN	417	227
mean	4.766729e+06	5150.541284	2.965138	1.286239	1.805505	NaN	NaN	NaN	NaN	NaN	0.693578	NaN	NaN
std	1.870440e+06	2170.141023	0.738064	0.502470	0.867492	NaN	NaN	NaN	NaN	NaN	0.861586	NaN	NaN
min	1.750000e+06	1650.000000	1.000000	1.000000	1.000000	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN
25%	3.430000e+06	3600.000000	2.000000	1.000000	1.000000	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN
50%	4.340000e+06	4600.000000	3.000000	1.000000	2.000000	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN
75%	5.740000e+06	6360.000000	3.000000	2.000000	2.000000	NaN	NaN	NaN	NaN	NaN	1.000000	NaN	NaN
max	1.330000e+07	16200.000000	6.000000	4.000000	4.000000	NaN	NaN	NaN	NaN	NaN	3.000000	NaN	NaN
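For both `price` and `area`, the mean in the table above exceeds the median (the 50% row), which hints at a right-skewed distribution. The sketch below quantifies that gap using only the values copied from the table; the helper `mean_median_gap` is an illustrative assumption and is a rough indicator, not a formal skewness statistic.

```python
# Values copied from the describe() output above
price_mean, price_median = 4.766729e6, 4.34e6
area_mean, area_median = 5150.541284, 4600.0


def mean_median_gap(mean: float, median: float) -> float:
    """Hypothetical helper: relative amount by which the mean exceeds the median."""
    return (mean - median) / median


price_gap = mean_median_gap(price_mean, price_median)
area_gap = mean_median_gap(area_mean, area_median)
print(f"price: {price_gap:.1%}, area: {area_gap:.1%}")
```

On the full dataset, `data['price'].skew()` would give a proper sample skewness to confirm or refute this impression.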

# Helper that writes the count above each bar of a countplot
def add_count_labels(ax):
    for p in ax.patches:
        height = int(p.get_height())
        ax.annotate(f'{height}', (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='bottom')
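To show what the helper does in isolation, here is a small self-contained sketch that draws a plain matplotlib bar chart with made-up counts and then labels each bar. The non-interactive `Agg` backend and the sample numbers are assumptions for illustration; the function body repeats the helper so the block runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def add_count_labels(ax):
    # Write each bar's height just above it, centered horizontally
    for p in ax.patches:
        height = int(p.get_height())
        ax.annotate(f"{height}", (p.get_x() + p.get_width() / 2.0, height),
                    ha="center", va="bottom")


fig, ax = plt.subplots()
ax.bar(["2", "3", "4"], [5, 12, 7])  # made-up counts per bedroom number
add_count_labels(ax)

# The annotations are stored on the axes and can be inspected
labels = [t.get_text() for t in ax.texts]
print(labels)
```

The same call works on the axes returned by `sns.countplot`, since seaborn also draws its bars as rectangle patches.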


plt.figure(figsize=(15,15))
plt.subplot(3,3,1)
sns.histplot(data['price'], kde=True)
plt.title('Distribution of Property Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')

plt.subplot(3,3,2)
sns.histplot(data['area'], kde=True)
plt.title('Distribution of Property Area')
plt.xlabel('Area')
plt.ylabel('Frequency')

plt.subplot(3,3,3)
ax = sns.countplot(x=data['bedrooms'])
plt.title('Distribution of Number of Bedrooms')
plt.xlabel('Number of Bedrooms')
plt.ylabel('数量&