Kaggle for Beginners: House Price Prediction (unfinished, rough draft)

References: https://zhuanlan.zhihu.com/p/335673241
https://blog.csdn.net/wangc1994/article/details/100760804

Loading the data

import pandas as pd
import torch
from sklearn import preprocessing

# Read the data
train_data = pd.read_csv("./data/train.csv")
test_data = pd.read_csv("./data/test.csv")

Data processing

Outliers

GrLivArea: Above grade (ground) living area square feet

# Outliers in above-ground living area: very large houses that sold for a low price
train_data = train_data.drop(train_data[(train_data['SalePrice'] < 300000) & (train_data['GrLivArea'] > 4000)].index)
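The filter above can be tried on a small table first. The following is a minimal sketch with made-up numbers (the `toy` frame is hypothetical and not taken from train.csv); it applies the same rule and shows which rows would be dropped.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv).
toy = pd.DataFrame({
    "GrLivArea": [1500, 2600, 4700, 5600, 4300],
    "SalePrice": [180000, 320000, 185000, 160000, 750000],
})

# Same rule as above: very large living area but a low sale price.
is_outlier = (toy["SalePrice"] < 300000) & (toy["GrLivArea"] > 4000)
cleaned = toy[~is_outlier]

print(f"dropped {int(is_outlier.sum())} of {len(toy)} rows")
print(cleaned)
```

The large house that sold for a high price (last row) is kept, because only the combination of large area and low price is treated as an outlier.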


Filling missing values

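# Assumption: all_features is never defined in the original post. Stacking the train and
# test features (without Id and SalePrice) is a guess that is consistent with the counts below.
all_features = pd.concat((train_data.drop(columns=["Id", "SalePrice"]), test_data.drop(columns=["Id"])), ignore_index=True)
# List every column that has missing values, sorted by count
miss_data = all_features.isnull().sum()
miss_data_top = miss_data[miss_data > 0].sort_values(ascending=False)
print(miss_data_top)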

PoolQC          2908
MiscFeature     2812
Alley           2719
Fence           2346
FireplaceQu     1420
LotFrontage      486
GarageFinish     159
GarageQual       159
GarageCond       159
GarageYrBlt      159
GarageType       157
BsmtExposure      82
BsmtCond          82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
BsmtFullBath       2
BsmtHalfBath       2
Functional         2
Utilities          2
GarageArea         1
GarageCars         1
Electrical         1
KitchenQual        1
TotalBsmtSF        1
BsmtUnfSF          1
BsmtFinSF2         1
BsmtFinSF1         1
Exterior2nd        1
Exterior1st        1
SaleType           1
# Drop the columns with too many missing values
all_features = all_features.drop(columns=['PoolQC', 'MiscFeature', 'Alley', 'Fence'])
# Fill with the string "None" (a missing value here means the house lacks the feature)
# Note: GarageYrBlt is numeric, so this turns it into a mixed-type (object) column
cols1 = ["FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType",
         "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    all_features[col] = all_features[col].fillna("None")
# Fill with 0
cols2 = ["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols2:
    all_features[col] = all_features[col].fillna(0)
# Fill with the mode (most frequent value)
cols3 = ["MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional", "Electrical", "KitchenQual", "SaleType", "Exterior1st", "Exterior2nd"]
for col in cols3:
    all_features[col] = all_features[col].fillna(all_features[col].mode()[0])
# Fill with the mean
all_features["LotFrontage"] = all_features["LotFrontage"].fillna(all_features["LotFrontage"].mean())
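To see the four fill strategies side by side, here is a minimal sketch on a four-row table. The `toy` values are invented for illustration, and the choice of which column gets which strategy simply mirrors the code above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one representative column per strategy.
toy = pd.DataFrame({
    "FireplaceQu": ["Gd", np.nan, "TA", np.nan],  # NaN means "no fireplace"
    "GarageArea": [480.0, np.nan, 0.0, 576.0],    # NaN treated as "no garage"
    "MSZoning": ["RL", "RL", np.nan, "RM"],       # value is simply unknown
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],    # value is simply unknown
})

filled = toy.copy()
filled["FireplaceQu"] = filled["FireplaceQu"].fillna("None")
filled["GarageArea"] = filled["GarageArea"].fillna(0)
filled["MSZoning"] = filled["MSZoning"].fillna(filled["MSZoning"].mode()[0])
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].mean())

# After filling, no missing values should remain.
remaining = int(filled.isnull().sum().sum())
print(filled)
print("missing values left:", remaining)
```

Running the same `isnull().sum().sum()` check on `all_features` after the fills is a cheap way to confirm that no column was forgotten.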

Total basement area

TotalBsmtSF: Total square feet of basement area

Lot area
LotArea: Lot size in square feet

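Converting numeric features to nominal

These columns are stored as integers or floats, but the values only act as labels that abbreviate a category; they do not measure a quantity.

Dwelling type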
MSSubClass: Identifies the type of dwelling involved in the sale.

20	1-STORY 1946 & NEWER ALL STYLES
30	1-STORY 1945 & OLDER
40	1-STORY W/FINISHED ATTIC ALL AGES
...
# Convert numeric codes to nominal (string) values
num2std = ['MSSubClass', 'YrSold', 'MoSold']
for col in num2std:
    all_features[col] = all_features[col].astype(str)
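Why the cast to `str` matters becomes visible once the column reaches `pd.get_dummies`. The sketch below uses a hypothetical four-row `toy` frame: an integer column is passed through as a single number, while the same column cast to strings is expanded into one indicator column per code.

```python
import pandas as pd

# Toy data (hypothetical): MSSubClass codes stored as integers.
toy = pd.DataFrame({"MSSubClass": [20, 30, 20, 40]})

# Integer column: get_dummies leaves it as one numeric column.
as_number = pd.get_dummies(toy)

# String column: get_dummies creates one indicator column per distinct code.
as_label = pd.get_dummies(toy.astype(str))

print(list(as_number.columns))
print(list(as_label.columns))
```

Left as a number, code 40 would be treated as "twice" code 20, which has no meaning for dwelling types.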


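Converting ordinal features to numeric

These columns are stored as strings but carry order information, for example good versus bad, more versus less, or a time sequence. Converting them to numbers lets the model make use of that order.

Fireplace quality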
FireplaceQu: Fireplace quality

   Ex	Excellent - Exceptional Masonry Fireplace
   Gd	Good - Masonry Fireplace in main level
   TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   Fa	Fair - Prefabricated Fireplace in basement
   Po	Poor - Ben Franklin Stove
   NA	No Fireplace

Basement height
BsmtQual: Evaluates the height of the basement

   Ex	Excellent (100+ inches)	
   Gd	Good (90-99 inches)
   TA	Typical (80-89 inches)
   Fa	Fair (70-79 inches)
   Po	Poor (<70 inches)
   NA	No Basement

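Other features of this kind:
overall basement condition BsmtCond, rating of the finished basement area BsmtFinType1 and BsmtFinType2
garage quality GarageQual, garage condition GarageCond, garage interior finish GarageFinish
exterior quality ExterQual, exterior condition ExterCond, central air conditioning CentralAir
heating quality HeatingQC, pool quality PoolQC, kitchen quality KitchenQual
fence quality Fence, home functionality Functional, basement exposure (walkout or garden-level walls) BsmtExposure
slope of the property LandSlope, general shape of the property LotShape, paved driveway PavedDrive
type of alley access to the property Alley, type of road access to the property Street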

# Convert ordinal features to numeric codes
# Note: LabelEncoder numbers the classes in sorted (alphabetical) order, not in quality order
sort2num = ['GarageYrBlt', 'GarageType', 'MasVnrType', 'FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 'ExterQual',
        'ExterCond', 'HeatingQC', 'KitchenQual', 'BsmtFinType1', 'MSZoning', 'Electrical',
        'BsmtFinType2', 'Functional', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'CentralAir']
for col in sort2num:
    # astype(str) keeps mixed-type columns such as GarageYrBlt (floats plus "None") encodable
    encoder = preprocessing.LabelEncoder()
    all_features[col] = encoder.fit_transform(all_features[col].astype(str))
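Because LabelEncoder sorts its classes, the codes it assigns follow alphabetical order, not the quality scale. If the order itself should reach the model, an explicit mapping is one option. The following is a sketch only, not the method used in this post: `quality_order` and `quality_to_rank` are hypothetical names, and the ranking (with "None" placed lowest) is an assumption based on the scale listed above.

```python
import pandas as pd

# Quality scale from worst to best; "None" is the placeholder used for a
# missing fireplace or basement (its position at rank 0 is an assumption).
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
quality_to_rank = {label: rank for rank, label in enumerate(quality_order)}

# Toy column (hypothetical values).
toy = pd.Series(["Gd", "None", "TA", "Ex", "Po", "Fa"], name="FireplaceQu")

# Order-preserving codes: a higher number means better quality.
ranked = toy.map(quality_to_rank)

# Codes that follow sorted label order, which is how LabelEncoder numbers classes.
alphabetical = {label: code for code, label in enumerate(sorted(toy.unique()))}

print(ranked.tolist())
print(alphabetical)
```

In the sorted numbering, "Ex" receives the lowest code and "TA" the highest, so the codes no longer reflect quality.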


# Standardize the numeric columns to zero mean and unit standard deviation
numeric_ = all_features.select_dtypes(include='number').columns
all_features[numeric_] = all_features[numeric_].apply(lambda x: (x - x.mean()) / x.std())
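As a quick check of what this step does, here is a small sketch on a hypothetical `toy` frame. Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default, so the result can differ slightly from scalers that divide by n.

```python
import pandas as pd

# Toy data (hypothetical values): one numeric and one categorical column.
toy = pd.DataFrame({
    "LotArea": [8000.0, 9600.0, 11200.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Only numeric columns are standardized; the categorical column is untouched.
numeric_cols = toy.select_dtypes(include="number").columns
standardized = toy.copy()
standardized[numeric_cols] = standardized[numeric_cols].apply(
    lambda x: (x - x.mean()) / x.std()
)

print(standardized)
```

After standardization every numeric column has mean 0, so any missing value that is still present could be filled with 0 without shifting the mean.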

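# One-hot encode the remaining categorical columns
# (dtype=float keeps the whole frame numeric so it can be turned into a tensor)
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)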
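The effect of `dummy_na=True` is easiest to see on a tiny example. This sketch uses a hypothetical `toy` frame with one missing category; the missing value gets its own indicator column and is not dropped.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one categorical column with a missing entry.
toy = pd.DataFrame({
    "SaleType": ["WD", "New", np.nan],
    "LotArea": [0.5, -1.2, 0.7],
})

# Numeric columns pass through; each category (and NaN) becomes a 0/1 column.
encoded = pd.get_dummies(toy, dummy_na=True, dtype=float)

print(encoded)
```

With `dummy_na=True` the NaN indicator is added for every encoded column, even those without missing values, so the feature count grows by one per categorical column.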

# Split back into training and test sets
num_train = train_data.shape[0]
train_features = torch.tensor(all_features[:num_train].values, dtype=torch.float)
test_features = torch.tensor(all_features[num_train:].values, dtype=torch.float)
train_labels = torch.tensor(train_data.SalePrice.values, dtype=torch.float).view(-1, 1)