Learn: Build a Machine Learning Model, with Melbourne House Price Prediction as the Example

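Predicting Melbourne house prices

This is a supervised learning problem.
The model used is a decision tree regressor.

Import the data and take a first look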

import pandas as pd

melbourne_file_path =r'G:\kaggle\melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns
Index([u'Suburb', u'Address', u'Rooms', u'Type', u'Price', u'Method',
       u'SellerG', u'Date', u'Distance', u'Postcode', u'Bedroom2', u'Bathroom',
       u'Car', u'Landsize', u'BuildingArea', u'YearBuilt', u'CouncilArea',
       u'Lattitude', u'Longtitude', u'Regionname', u'Propertycount'],
      dtype='object')
melbourne_data.head()
       Suburb           Address  Rooms  Type      Price  Method  SellerG       Date  Distance  Postcode  ...  Bathroom  Car  Landsize  BuildingArea  YearBuilt  CouncilArea  Lattitude  Longtitude             Regionname  Propertycount
0  Abbotsford      85 Turner St      2     h  1480000.0       S   Biggin  3/12/2016       2.5    3067.0  ...       1.0  1.0     202.0           NaN        NaN        Yarra   -37.7996    144.9984  Northern Metropolitan         4019.0
1  Abbotsford   25 Bloomburg St      2     h  1035000.0       S   Biggin  4/02/2016       2.5    3067.0  ...       1.0  0.0     156.0          79.0     1900.0        Yarra   -37.8079    144.9934  Northern Metropolitan         4019.0
2  Abbotsford      5 Charles St      3     h  1465000.0      SP   Biggin  4/03/2017       2.5    3067.0  ...       2.0  0.0     134.0         150.0     1900.0        Yarra   -37.8093    144.9944  Northern Metropolitan         4019.0
3  Abbotsford  40 Federation La      3     h   850000.0      PI   Biggin  4/03/2017       2.5    3067.0  ...       2.0  1.0      94.0           NaN        NaN        Yarra   -37.7969    144.9969  Northern Metropolitan         4019.0
4  Abbotsford       55a Park St      4     h  1600000.0      VB   Nelson  4/06/2016       2.5    3067.0  ...       1.0  2.0     120.0         142.0     2014.0        Yarra   -37.8072    144.9941  Northern Metropolitan         4019.0

5 rows × 21 columns

# number of records
len(melbourne_data)
13580
melbourne_data.describe()
# The count row shows many missing values; e.g. BuildingArea has 6450 NaN (13580 - 7130)
               Rooms          Price       Distance       Postcode       Bedroom2       Bathroom            Car       Landsize   BuildingArea      YearBuilt      Lattitude     Longtitude  Propertycount
count   13580.000000   1.358000e+04   13580.000000   13580.000000   13580.000000   13580.000000   13518.000000   13580.000000    7130.000000    8205.000000   13580.000000   13580.000000   13580.000000
mean        2.937997   1.075684e+06      10.137776    3105.301915       2.914728       1.534242       1.610075     558.416127     151.967650    1964.684217     -37.809203     144.995216    7454.417378
std         0.955748   6.393107e+05       5.868725      90.676964       0.965921       0.691712       0.962634    3990.669241     541.014538      37.273762       0.079260       0.103916    4378.581772
min         1.000000   8.500000e+04       0.000000    3000.000000       0.000000       0.000000       0.000000       0.000000       0.000000    1196.000000     -38.182550     144.431810     249.000000
25%         2.000000   6.500000e+05       6.100000    3044.000000       2.000000       1.000000       1.000000     177.000000      93.000000    1940.000000     -37.856822     144.929600    4380.000000
50%         3.000000   9.030000e+05       9.200000    3084.000000       3.000000       1.000000       2.000000     440.000000     126.000000    1970.000000     -37.802355     145.000100    6555.000000
75%         3.000000   1.330000e+06      13.000000    3148.000000       3.000000       2.000000       2.000000     651.000000     174.000000    1999.000000     -37.756400     145.058305   10331.000000
max        10.000000   9.000000e+06      48.100000    3977.000000      20.000000       8.000000      10.000000  433014.000000   44515.000000    2018.000000     -37.408530     145.526350   21650.000000

Handling missing values

Column     BuildingArea  YearBuilt  CouncilArea
Missing    48%           40%        10%
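
The missing-value shares listed above can be computed directly with pandas. The following is a minimal sketch, not the author's code: `toy` is a hypothetical stand-in for `melbourne_data` that copies only the Rooms, BuildingArea and YearBuilt values of the five rows shown by `head()`, so the percentages it prints apply to those five rows only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for melbourne_data: three columns of the five head() rows.
toy = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan, 142.0],
    "YearBuilt": [np.nan, 1900.0, 1900.0, np.nan, 2014.0],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_count = toy.isnull().sum()

# The mean of the True/False marks is the missing fraction; scale it to percent.
missing_pct = (toy.isnull().mean() * 100).round(1)

print(missing_count)
print(missing_pct)
```

Running the same two expressions on the full `melbourne_data` frame should give the per-column counts and percentages for the whole dataset.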

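# Choose deletion: drop the records (rows) that contain missing values
# dropna()
melbourne_data = melbourne_data.dropna(axis=0)  # axis=0 (the default) drops rows; how='any' is the default, while how='all' drops a row only when all of its values are NaN
melbourne_data.head()
# The records that contained NaN values (index 0, 3, 5, ...) have been dropped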
       Suburb           Address  Rooms  Type      Price  Method  SellerG       Date  Distance  Postcode  ...  Bathroom  Car  Landsize  BuildingArea  YearBuilt  CouncilArea  Lattitude  Longtitude             Regionname  Propertycount
1  Abbotsford   25 Bloomburg St      2     h  1035000.0       S   Biggin  4/02/2016       2.5    3067.0  ...       1.0  0.0     156.0          79.0     1900.0        Yarra   -37.8079    144.9934  Northern Metropolitan         4019.0
2  Abbotsford      5 Charles St      3     h  1465000.0      SP   Biggin  4/03/2017       2.5    3067.0  ...       2.0  0.0     134.0         150.0     1900.0        Yarra   -37.8093    144.9944  Northern Metropolitan         4019.0
4  Abbotsford       55a Park St      4     h  1600000.0      VB   Nelson  4/06/2016       2.5    3067.0  ...       1.0  2.0     120.0         142.0     2014.0        Yarra   -37.8072    144.9941  Northern Metropolitan         4019.0
6  Abbotsford      124 Yarra St      3     h  1876000.0       S   Nelson  7/05/2016       2.5    3067.0  ...       2.0  0.0     245.0         210.0     1910.0        Yarra   -37.8024    144.9993  Northern Metropolitan         4019.0
7  Abbotsford     98 Charles St      2     h  1636000.0       S   Nelson  8/10/2016       2.5    3067.0  ...       1.0  2.0     256.0         107.0     1890.0        Yarra   -37.8060    144.9954  Northern Metropolitan         4019.0

5 rows × 21 columns
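
To make the difference between the two `how` settings of `dropna` concrete, here is a small sketch. The frame `toy` is hypothetical: its first five rows reuse values from the `head()` output before the drop, and the last row (index 5) is a made-up record in which every value is missing, added only so that `how='all'` has something to remove.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; index 5 is an invented all-NaN row for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4, np.nan],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan, 142.0, np.nan],
    "YearBuilt": [np.nan, 1900.0, 1900.0, np.nan, 2014.0, np.nan],
})

# how='any' (the default): drop a row if at least one of its values is NaN.
dropped_any = toy.dropna(axis=0, how="any")

# how='all': drop a row only if every one of its values is NaN.
dropped_all = toy.dropna(axis=0, how="all")

print(list(dropped_any.index))  # [1, 2, 4]
print(list(dropped_all.index))  # [0, 1, 2, 3, 4]
```

The surviving rows keep their original index labels, which is why the frame above starts at index 1 rather than 0 after the drop.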

melbourne_data.describe()
               Rooms          Price       Distance       Postcode       Bedroom2       Bathroom            Car       Landsize   BuildingArea      YearBuilt      Lattitude     Longtitude  Propertycount
count    6196.000000   6.196000e+03    6196.000000    6196.000000    6196.000000    6196.000000    6196.000000    6196.000000    6196.000000    6196.000000    6196.000000    6196.000000    6196.000000
mean        2.931407   1.068828e+06       9.751097    3101.947708       2.902034       1.576340       1.573596     471.006940     141.568645    1964.081988     -37.807904     144.990201    7435.489509
std         0.971079   6.751564e+05       5.612065      86.421604       0.970055       0.711362       0.929947     897.449881      90.834824      38.105673       0.075850       0.099165    4337.698917
min         1.000000   1.310000e+05       0.000000    3000.000000       0.000000       1.000000       0.000000       0.000000       0.000000    1196.000000     -38.164920     144.542370     389.000000
25%         2.000000   6.200000e+05       5.900000    3044.000000       2.000000       1.000000       1.000000     152.000000      91.000000    1940.000000     -37.855438     144.926198    4383.750000
50%         3.000000   8.800000e+05       9.000000    3081.000000       3.000000       1.000000       1.000000     373.000000     124.000000    1970.000000     -37.802250     144.995800    6567.000000
75%         4.000000   1.325000e+06      12.400000    3147.000000       3.000000       2.000000       2.000000     628.000000     170.000000    2000.000000     -37.758200     145.052700   10175.000000
max         8.000000   9.000000e+06      47.400000    3977.000000       9.000000       8.000000      10.000000   37000.000000    3112.000000    2018.000000     -37.457090     145.526350   21650.000000

Selecting the Prediction Target

The house price is the prediction target.

# Use the Price column as the true target y
y = melbourne_data.Price

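Choosing “Features”

The columns fed into the model as inputs are called “features”.
Which columns influence the house price?
Sometimes every column except the target is used as a feature.
Sometimes a smaller subset of columns works better.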
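
The two options just described (all columns except the target, or a smaller subset) can be sketched as follows. This is an illustration under assumptions: `toy` is a hypothetical four-column frame built from values in the rows printed earlier, and the numeric-only filter is an extra step added here because scikit-learn's tree models expect numeric input, so text columns such as Suburb would need encoding before they could be used.

```python
import pandas as pd

# Hypothetical frame with values copied from rows 1, 2 and 4 of the data.
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Abbotsford"],
    "Rooms": [2, 3, 4],
    "Landsize": [156.0, 134.0, 120.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y_toy = toy["Price"]

# Option 1: every column except the target.
X_all = toy.drop(columns=["Price"])

# Option 2: a hand-picked, smaller set of columns.
X_small = toy[["Rooms", "Landsize"]]

# Keep only the numeric columns of option 1 (assumption: no encoding of text columns).
X_numeric = X_all.select_dtypes(include="number")

print(list(X_all.columns))      # ['Suburb', 'Rooms', 'Landsize']
print(list(X_numeric.columns))  # ['Rooms', 'Landsize']
```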

melbourne_data.columns
Index([u'Suburb', u'Address', u'Rooms', u'Type', u'Price', u'Method',
       u'SellerG', u'Date', u'Distance', u'Postcode', u'Bedroom2', u'Bathroom',
       u'Car', u'Landsize', u'BuildingArea', u'YearBuilt', u'CouncilArea',
       u'Lattitude', u'Longtitude', u'Regionname', u'Propertycount'],
      dtype='object')
# An example: select these columns as features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
# By convention, this data is called X
X = melbourne_data[melbourne_features]
print(X.head())  # X has 6196 rows and 5 columns
print(y.head())  # y has 6196 values
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
1    1035000.0
2    1465000.0
4    1600000.0
6    1876000.0
7    1636000.0
Name: Price, dtype: float64
X.describe()
              Rooms      Bathroom      Landsize     Lattitude    Longtitude
count   6196.000000   6196.000000   6196.000000   6196.000000   6196.000000
mean       2.931407      1.576340    471.006940    -37.807904    144.990201
std        0.971079      0.711362    897.449881      0.075850      0.099165
min        1.000000      1.000000      0.000000    -38.164920    144.542370
25%        2.000000      1.000000    152.000000    -37.855438    144.926198
50%        3.000000      1.000000    373.000000    -37.802250    144.995800
75%        4.000000      2.000000    628.000000    -37.758200    145.052700
max        8.000000      8.000000  37000.000000    -37.457090    145.526350

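Building the Model: define, fit, predict, evaluate

We use the machine learning library scikit-learn, abbreviated sklearn.
Steps:

  • define: choose the type of model: a decision tree, or something else?
  • fit: capture patterns from the data, i.e. train the model on the data; this is the core of modeling
  • predict: predict values for new samples
  • evaluate: measure how accurate the model's predictions are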

# Decision tree regression: the DecisionTreeRegressor class
# from sklearn import tree
# clf= tree.DecisionTreeRegressor(random_state=1)
# clf.fit(X, y)
# clf.predict([[3,1.0,122.0, -37.8072,144.9941]])

from sklearn.tree import DecisionTreeRegressor

# Define the model
melbourne_model = DecisionTreeRegressor(random_state=1)  # random_state=1 makes every run produce the same result
# Fit the model
melbourne_model.fit(X, y)
# Use the model to predict
melbourne_model.predict([[3, 1.0, 122.0, -37.8072, 144.9941]])
array([1200000.])
melbourne_model.predict(X.head())
array([1035000., 1465000., 1600000., 1876000., 1636000.])
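
The predictions for `X.head()` equal the true prices printed earlier because those rows were part of the training data, so they say little about accuracy on new houses. The evaluate step is listed above but not shown. One common way to do it, sketched here as an assumption rather than as the author's method, is to hold out part of the data and compare mean absolute error (MAE) on the training and validation parts. The data below is synthetic: `X_demo`, `y_demo` and the price formula are invented for illustration and are not the Melbourne dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y (made-up relationship plus noise).
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    "Rooms": rng.randint(1, 6, size=200),
    "Landsize": rng.uniform(100, 800, size=200),
})
y_demo = (200000 * X_demo["Rooms"]
          + 500 * X_demo["Landsize"]
          + rng.normal(0, 50000, size=200))

# Hold out a validation set that the model never sees during fitting.
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=1)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# MAE on rows the tree was trained on versus rows it has not seen.
train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))

print("training MAE:  ", train_mae)
print("validation MAE:", val_mae)
```

In this sketch the training error is essentially zero, because an unrestricted tree can give each training row its own leaf, while the validation error is clearly larger. The validation figure is the more honest estimate of how the model would do on unseen samples.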