Study Notes for the Kaggle "Intro to Machine Learning" Tutorial

[Jump to] Table of contents for the "Kaggle Tutorial: Intro to Machine Learning" series

>> Decision Trees

  • Overview: a decision tree is a decision-analysis method that, given the probabilities of the possible outcomes, builds a tree of decisions to estimate the probability that the expected net present value is at least zero, and uses that to evaluate project risk and judge feasibility. It is an intuitive, graphical way of applying probability analysis, and it gets its name because the drawn branches resemble the branches of a tree. In machine learning, a decision tree splits the data on feature values and makes a prediction at each leaf; a tiny fitted example is sketched below.
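As a minimal sketch (the toy numbers are invented for illustration, not taken from the Melbourne data), here is what a small fitted regression tree looks like when printed with scikit-learn's export_text:

from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: predict price from the number of rooms
rooms = [[1], [2], [3], [4]]
price = [300000, 450000, 600000, 800000]

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
toy_tree.fit(rooms, price)

# Print the learned splits and the value predicted at each leaf
print(export_text(toy_tree, feature_names=["Rooms"]))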

>> As an example, let's look at housing-price data from Melbourne, Australia.

import numpy as np
import pandas as pd

# Path to the data file
melbourne_file_path = 'data/melb_data.csv'

# Read the data into a DataFrame named melbourne_data, using the Date column as the index
melbourne_data = pd.read_csv(melbourne_file_path, index_col="Date") 
print(melbourne_data.shape)

# Print summary statistics for the numeric columns
melbourne_data.describe()
              Rooms         Price      Distance      Postcode      Bedroom2      Bathroom           Car
count  13580.000000  1.358000e+04  13580.000000  13580.000000  13580.000000  13580.000000  13518.000000
mean       2.937997  1.075684e+06     10.137776   3105.301915      2.914728      1.534242      1.610075
std        0.955748  6.393107e+05      5.868725     90.676964      0.965921      0.691712      0.962634
min        1.000000  8.500000e+04      0.000000   3000.000000      0.000000      0.000000      0.000000
25%        2.000000  6.500000e+05      6.100000   3044.000000      2.000000      1.000000      1.000000
50%        3.000000  9.030000e+05      9.200000   3084.000000      3.000000      1.000000      2.000000
75%        3.000000  1.330000e+06     13.000000   3148.000000      3.000000      2.000000      2.000000
max       10.000000  9.000000e+06     48.100000   3977.000000     20.000000      8.000000     10.000000

            Landsize  BuildingArea    YearBuilt     Lattitude    Longtitude  Propertycount
count   13580.000000   7130.000000  8205.000000  13580.000000  13580.000000   13580.000000
mean      558.416127    151.967650  1964.684217    -37.809203    144.995216    7454.417378
std      3990.669241    541.014538    37.273762      0.079260      0.103916    4378.581772
min         0.000000      0.000000  1196.000000    -38.182550    144.431810     249.000000
25%       177.000000     93.000000  1940.000000    -37.856822    144.929600    4380.000000
50%       440.000000    126.000000  1970.000000    -37.802355    145.000100    6555.000000
75%       651.000000    174.000000  1999.000000    -37.756400    145.058305   10331.000000
max    433014.000000  44515.000000  2018.000000    -37.408530    145.526350   21650.000000
# First five rows of the data
melbourne_data.head()
# Suburb: suburb               Address: street address     Rooms: number of rooms
# Type: property type          Price: sale price            Method: sale method
# SellerG: selling agent       Distance: distance from the city centre
# Postcode: postcode           Bedroom2: bedrooms           Bathroom: bathrooms
# Car: car spaces              Landsize: land size          BuildingArea: building area
# YearBuilt: year built        CouncilArea: governing council area
# Lattitude: latitude          Longtitude: longitude        Regionname: region name
# Propertycount: number of properties in the suburb
               Suburb           Address  Rooms Type      Price Method SellerG
Date
3/12/2016  Abbotsford      85 Turner St      2    h  1480000.0      S  Biggin
4/02/2016  Abbotsford   25 Bloomburg St      2    h  1035000.0      S  Biggin
4/03/2017  Abbotsford      5 Charles St      3    h  1465000.0     SP  Biggin
4/03/2017  Abbotsford  40 Federation La      3    h   850000.0     PI  Biggin
4/06/2016  Abbotsford       55a Park St      4    h  1600000.0     VB  Nelson

           Distance  Postcode  Bedroom2  Bathroom  Car  Landsize  BuildingArea
Date
3/12/2016       2.5    3067.0       2.0       1.0  1.0     202.0           NaN
4/02/2016       2.5    3067.0       2.0       1.0  0.0     156.0          79.0
4/03/2017       2.5    3067.0       3.0       2.0  0.0     134.0         150.0
4/03/2017       2.5    3067.0       3.0       2.0  1.0      94.0           NaN
4/06/2016       2.5    3067.0       3.0       1.0  2.0     120.0         142.0

           YearBuilt CouncilArea  Lattitude  Longtitude             Regionname  Propertycount
Date
3/12/2016        NaN       Yarra   -37.7996    144.9984  Northern Metropolitan         4019.0
4/02/2016     1900.0       Yarra   -37.8079    144.9934  Northern Metropolitan         4019.0
4/03/2017     1900.0       Yarra   -37.8093    144.9944  Northern Metropolitan         4019.0
4/03/2017        NaN       Yarra   -37.7969    144.9969  Northern Metropolitan         4019.0
4/06/2016     2014.0       Yarra   -37.8072    144.9941  Northern Metropolitan         4019.0

1. Select the data for modelling

# dropna drops rows with missing values (na stands for "not available")
melbourne_data = melbourne_data.dropna(axis=0)
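If you want to see how much data each column is missing, a minimal sketch you could run before the dropna line above (it re-reads the raw file so the comparison is meaningful):

# Number of missing values in each column, and row counts before/after dropping
raw = pd.read_csv(melbourne_file_path, index_col="Date")
print(raw.isna().sum())
print("rows before/after dropna:", raw.shape[0], raw.dropna(axis=0).shape[0])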

2. Choose the prediction target and the feature columns

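Before picking features, it can help to list the columns available as candidates (a quick sketch):

# Column names available as candidate features
print(melbourne_data.columns)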
# Save the sale price (the prediction target) in y
y = melbourne_data.Price

# rooms, bathrooms, land size, latitude, longitude
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

X.describe()
              Rooms      Bathroom       Landsize     Lattitude    Longtitude
count  13580.000000  13580.000000   13580.000000  13580.000000  13580.000000
mean       2.937997      1.534242     558.416127    -37.809203    144.995216
std        0.955748      0.691712    3990.669241      0.079260      0.103916
min        1.000000      0.000000       0.000000    -38.182550    144.431810
25%        2.000000      1.000000     177.000000    -37.856822    144.929600
50%        3.000000      1.000000     440.000000    -37.802355    145.000100
75%        3.000000      2.000000     651.000000    -37.756400    145.058305
max       10.000000      8.000000  433014.000000    -37.408530    145.526350
X.head()
           Rooms  Bathroom  Landsize  Lattitude  Longtitude
Date
3/12/2016      2       1.0     202.0   -37.7996    144.9984
4/02/2016      2       1.0     156.0   -37.8079    144.9934
4/03/2017      3       2.0     134.0   -37.8093    144.9944
4/03/2017      3       2.0      94.0   -37.7969    144.9969
4/06/2016      4       1.0     120.0   -37.8072    144.9941

3. Build the model

  • Use the scikit-learn library to create your first model
from sklearn.tree import DecisionTreeRegressor # decision tree

# Define the model. Setting random_state ensures the same result on every run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit the model
melbourne_model.fit(X, y)

'''
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
'''

4. Make predictions

print("Making predictions for the following 5 houses:")
#print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
print(y.head())

'''
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]
Date
3/12/2016    1480000.0
4/02/2016    1035000.0
4/03/2017    1465000.0
4/03/2017     850000.0
4/06/2016    1600000.0
Name: Price, dtype: float64
'''

5. Model validation with MAE

  • Start with a metric called the Mean Absolute Error (MAE).
  • The smaller the MAE, the better the model.
from sklearn.metrics import mean_absolute_error # MAE (mean absolute error)

predicted_home_prices = melbourne_model.predict(X)
print("MAE:", mean_absolute_error(y, predicted_home_prices))

'''
MAE: 62509.0528227786
'''
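For intuition, MAE is simply the average of the absolute differences between actual and predicted prices. A minimal sketch computing it directly with NumPy (same variables as above) gives the same number:

import numpy as np

# Mean of |actual price - predicted price| over all rows
manual_mae = np.mean(np.abs(y - predicted_home_prices))
print(manual_mae)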

6. Split the data into training and validation sets with train_test_split

from sklearn.model_selection import train_test_split

# Split both the features and the target into training and validation data
# The split is based on a random number generator; passing a value to random_state guarantees the same split every run
# Run the code below
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define the model
melbourne_model = DecisionTreeRegressor()
# Fit the model
melbourne_model.fit(train_X, train_y)

# Predict prices for the validation data
val_predictions = melbourne_model.predict(val_X)
print("MAE:", mean_absolute_error(val_y, val_predictions))

'''
MAE: 247974.12793323514
'''
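As a quick sanity check on the split (by default scikit-learn holds out 25% of the rows for validation when test_size is not given), you can print the shapes:

# Roughly 75% of rows go to training and 25% to validation by default;
# pass test_size (e.g. test_size=0.2) to change the fraction
print(train_X.shape, val_X.shape)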

7. Try different models

Underfitting and overfitting

  • Overfitting: too many tree nodes; the model matches the training data almost perfectly but generalizes poorly to new data.
  • Underfitting: too few tree nodes; the model fails to capture the important features and patterns in the data.
from sklearn.metrics import mean_absolute_error # mean absolute error
from sklearn.tree import DecisionTreeRegressor # decision tree

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Compare MAE across different values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

'''
Max leaf nodes: 5  		 Mean Absolute Error:  354662
Max leaf nodes: 50  		 Mean Absolute Error:  266447
Max leaf nodes: 500  		 Mean Absolute Error:  231301
Max leaf nodes: 5000  		 Mean Absolute Error:  249163
'''
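Validation MAE is lowest at 500 leaf nodes. A minimal sketch (the names candidate_sizes, best_tree_size and final_model are my own) of picking the best candidate programmatically and then refitting a final tree on all of the data:

# Score every candidate and keep the one with the lowest validation MAE
candidate_sizes = [5, 50, 500, 5000]
scores = {size: get_mae(size, train_X, val_X, train_y, val_y) for size in candidate_sizes}
best_tree_size = min(scores, key=scores.get)
print("best max_leaf_nodes:", best_tree_size)

# Refit on all the data now that the tree size has been chosen
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)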

>> Random Forests

  • Decision trees leave you with a difficult trade-off. A deep tree with many leaves will overfit, because each prediction comes from only the handful of training houses at its leaf. A shallow tree with few leaves will underfit, because it cannot capture enough of the distinctions in the raw data.
  • A random forest uses many trees and predicts by averaging the predictions of its component trees. It generally has much better predictive accuracy than a single decision tree.
from sklearn.ensemble import RandomForestRegressor # random forest
from sklearn.metrics import mean_absolute_error # MAE (mean absolute error)

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print("MAE:", mean_absolute_error(val_y, melb_preds))

'''
MAE: 191525.59192369733
'''

>> Conclusion

  • There is probably room for further improvement, but this is a big gain over the best single decision tree's error of roughly 250,000.
  • You can tune parameters to improve the random forest's performance, just as we varied max_leaf_nodes for the single decision tree (see the sketch below).
  • One of the best features of random forests, however, is that they generally work reasonably well even without this tuning.
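As one possible direction (the parameter values here are arbitrary choices, not from the original notes), the same compare-by-validation-MAE pattern can be applied to forest settings such as n_estimators:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Try a few forest sizes and compare validation MAE, as with max_leaf_nodes above
for n_trees in [50, 100, 200]:
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    rf.fit(train_X, train_y)
    rf_mae = mean_absolute_error(val_y, rf.predict(val_X))
    print("n_estimators:", n_trees, "\t MAE:", round(rf_mae))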