Problem Overview
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Loading the Data
import pandas as pd
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
Exploring the Data
The shape of the data: 1460 rows and 81 columns.
print(train_data.shape)
(1460, 81)
The first 5 rows. SalePrice is the sale price (the prediction target), Id is just a row identifier, and the remaining columns are features.
print(train_data.head())
Id MSSubClass MSZoning ... SaleType SaleCondition SalePrice
0 1 60 RL ... WD Normal 208500
1 2 20 RL ... WD Normal 181500
2 3 60 RL ... WD Normal 223500
3 4 70 RL ... WD Abnorml 140000
4 5 60 RL ... WD Normal 250000
info() gives a quick description of the dataset: the total number of rows, each column's dtype, and the count of non-null values.
print(train_data.info())
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
Missing-value check: the number of missing entries per column. For example, the LotFrontage column has 1460 rows in total, 259 of them missing.
print(train_data.isna().sum().where(lambda x: x > 0).dropna())
LotFrontage 259.0
Alley 1369.0
MasVnrType 8.0
MasVnrArea 8.0
BsmtQual 37.0
BsmtCond 37.0
BsmtExposure 38.0
BsmtFinType1 37.0
BsmtFinType2 38.0
Electrical 1.0
FireplaceQu 690.0
GarageType 81.0
GarageYrBlt 81.0
GarageFinish 81.0
GarageQual 81.0
GarageCond 81.0
PoolQC 1453.0
Fence 1179.0
MiscFeature 1406.0
describe() shows summary statistics for the numeric columns.
mean is the average; std is the standard deviation, which measures dispersion: a large standard deviation means most values lie far from the mean, while a small one means the values cluster close to it.
print(train_data.describe())
Id MSSubClass LotFrontage LotArea OverallQual
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315
std 421.610009 42.300571 24.284752 9981.264932 1.382997
min 1.000000 20.000000 21.000000 1300.000000 1.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000
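As a small illustration of what the standard deviation captures (toy numbers, unrelated to this dataset), two samples can share the same mean yet differ greatly in spread:

```python
import numpy as np

# Two samples with the same mean (50) but very different spread
tight = np.array([48, 49, 50, 51, 52])
wide = np.array([10, 30, 50, 70, 90])

print(tight.mean(), wide.mean())            # both 50.0
print(tight.std(ddof=1), wide.std(ddof=1))  # ~1.58 vs ~31.62
```

Note that pandas describe() reports the sample standard deviation (ddof=1), which is why ddof=1 is used here.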
Distribution of the data. bins is the number of bars in each histogram; each bar's height is the count of samples whose value falls within that bar's interval.
import matplotlib.pyplot as plt
train_data_x = train_data.iloc[:, 1:-1]
train_data_x.hist(bins=50, figsize=(20, 15))
plt.show()
Correlation analysis: the smaller a coefficient's absolute value, the weaker the linear relationship with SalePrice. (numeric_only=True restricts the computation to numeric columns; recent pandas versions require it when text columns are present.)
corr_matrix = train_data.corr(numeric_only=True)
print(corr_matrix["SalePrice"].sort_values(ascending=False))
SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
YearRemodAdd 0.507101
GarageYrBlt 0.486362
MasVnrArea 0.477493
Fireplaces 0.466929
BsmtFinSF1 0.386420
Data Preprocessing
Handling missing values
# Fill missing values in text columns with the placeholder "nan" and encode the
# text with LabelEncoder; fill missing values in numeric columns with the column mean.
from sklearn.preprocessing import LabelEncoder

def preprocessing(data):
    for i in data.columns:
        if data[i].dtype == "object":
            # Missing text values become their own category before encoding
            data[i] = LabelEncoder().fit_transform(data[i].fillna("nan"))
        else:
            data[i] = data[i].fillna(data[i].mean())
    return data
train_data_encoded = preprocessing(train_data)
print(train_data_encoded)
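A quick sanity check of this preprocessing logic on a tiny made-up DataFrame (hypothetical column names, same transformation) confirms that no missing values remain and every column ends up numeric:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocessing(data):
    for i in data.columns:
        if data[i].dtype == "object":
            data[i] = LabelEncoder().fit_transform(data[i].fillna("nan"))
        else:
            data[i] = data[i].fillna(data[i].mean())
    return data

toy = pd.DataFrame({
    "MSZoning": ["RL", None, "RM"],       # text column with a missing value
    "LotFrontage": [60.0, np.nan, 80.0],  # numeric column with a missing value
})
encoded = preprocessing(toy)
print(encoded.isna().sum().sum())   # 0 -- no missing values remain
print(encoded["LotFrontage"].tolist())  # [60.0, 70.0, 80.0] -- NaN replaced by the mean
```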
Train/Test Split
Split features and target, excluding the first column (the Id attribute):
from sklearn.model_selection import train_test_split

x = train_data_encoded.iloc[:, 1:-1]
y = train_data_encoded.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)
Model Training
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(x_train, y_train)
# Compute the root mean squared error on the training set
housing_predictions = forest_reg.predict(x_train)
forest_mse = mean_squared_error(y_train, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)
11052.088424558873
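The RMSE above is measured on the same data the model was fit on, so it is an optimistic estimate. A cross-validated RMSE gives a more honest figure; the sketch below runs on synthetic data from make_regression (an assumption so the snippet is self-contained) — swap in x_train/y_train for the real run:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
# sklearn scorers follow "higher is better", so MSE is negated; flip the sign back
scores = cross_val_score(forest_reg, X, y, scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())
```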
Validation on the held-out test set; the test score is on the low side.
score = forest_reg.score(x_test, y_test)
print(score)
0.8753273576332103
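For a regressor, score() returns the coefficient of determination R²: 1 means perfect prediction, 0 means no better than always predicting the mean. A small self-contained check of the formula against sklearn (toy numbers):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
manual = 1 - ss_res / ss_tot
print(manual, r2_score(y_true, y_pred))  # both 0.975
```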
Summary:
The test score is on the low side; next time, study model optimization and hyperparameter tuning.