Machine Learning: Introductory Study Notes
[Jump] Table of contents for the "Kaggle Tutorials: Intro to Machine Learning" series
>> Decision Trees
- Overview: a decision tree is a decision-analysis method that, given the probabilities of various outcomes, builds a tree of decisions to estimate the probability that the expected net present value is non-negative, evaluate project risk, and judge feasibility. It is a direct, graphical application of probability analysis. Because the decision branches drawn this way look like the limbs of a tree, the method is called a decision tree.
![在这里插入图片描述](https://i-blog.csdnimg.cn/blog_migrate/c2d4635fafb5dc89622f76b923a8b863.png)
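For intuition: a fitted regression tree predicts the average target value of the training rows that fall into a leaf. A minimal sketch of a one-split tree, where the threshold and both leaf values are hypothetical numbers chosen for illustration:

```python
# Hypothetical one-split tree: the threshold and leaf values are made up.
def predict_price(rooms):
    if rooms > 2:
        return 1_200_000  # leaf value: mean price of "large" training houses
    return 900_000        # leaf value: mean price of "small" training houses

print(predict_price(3))  # 1200000
print(predict_price(2))  # 900000
```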
>> As an example, let's look at housing-price data from Melbourne, Australia.
```python
import numpy as np
import pandas as pd

melbourne_file_path = 'data/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path, index_col="Date")
print(melbourne_data.shape)
melbourne_data.describe()
```
|  | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.000000 | 1.358000e+04 | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 13518.000000 | 13580.000000 | 7130.000000 | 8205.000000 | 13580.000000 | 13580.000000 | 13580.000000 |
| mean | 2.937997 | 1.075684e+06 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.967650 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 6.393107e+05 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.079260 | 0.103916 | 4378.581772 |
| min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1196.000000 | -38.182550 | 144.431810 | 249.000000 |
| 25% | 2.000000 | 6.500000e+05 | 6.100000 | 3044.000000 | 2.000000 | 1.000000 | 1.000000 | 177.000000 | 93.000000 | 1940.000000 | -37.856822 | 144.929600 | 4380.000000 |
| 50% | 3.000000 | 9.030000e+05 | 9.200000 | 3084.000000 | 3.000000 | 1.000000 | 2.000000 | 440.000000 | 126.000000 | 1970.000000 | -37.802355 | 145.000100 | 6555.000000 |
| 75% | 3.000000 | 1.330000e+06 | 13.000000 | 3148.000000 | 3.000000 | 2.000000 | 2.000000 | 651.000000 | 174.000000 | 1999.000000 | -37.756400 | 145.058305 | 10331.000000 |
| max | 10.000000 | 9.000000e+06 | 48.100000 | 3977.000000 | 20.000000 | 8.000000 | 10.000000 | 433014.000000 | 44515.000000 | 2018.000000 | -37.408530 | 145.526350 | 21650.000000 |
```python
melbourne_data.head()
```
| Date | Suburb | Address | Rooms | Type | Price | Method | SellerG | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3/12/2016 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 2.5 | 3067.0 | 2.0 | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 4/02/2016 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 2.5 | 3067.0 | 2.0 | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 4/03/2017 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 2.5 | 3067.0 | 3.0 | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 4/03/2017 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 2.5 | 3067.0 | 3.0 | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4/06/2016 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 2.5 | 3067.0 | 3.0 | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |
1. Select data for modeling

```python
# Drop every row that contains at least one missing value
melbourne_data = melbourne_data.dropna(axis=0)
```
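`dropna(axis=0)` silently discards every row with any missing value, so it is worth checking how much data that removes first. A small sketch with a stand-in frame (in these notes the frame would be `melbourne_data`):

```python
import pandas as pd

# Stand-in DataFrame with a NaN in each column.
df = pd.DataFrame({"Price": [1.0, None, 3.0], "Rooms": [2, 3, None]})
print(df.isna().sum())       # missing values per column: Price 1, Rooms 1
cleaned = df.dropna(axis=0)  # drop any row containing a NaN
print(len(cleaned))          # 1 row survives out of 3
```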
2. Choose the prediction target and the features

```python
# Prediction target: dot notation pulls out the Price column as a Series
y = melbourne_data.Price

# Columns the model will use to predict the price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()
```
|  | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 |
| mean | 2.937997 | 1.534242 | 558.416127 | -37.809203 | 144.995216 |
| std | 0.955748 | 0.691712 | 3990.669241 | 0.079260 | 0.103916 |
| min | 1.000000 | 0.000000 | 0.000000 | -38.182550 | 144.431810 |
| 25% | 2.000000 | 1.000000 | 177.000000 | -37.856822 | 144.929600 |
| 50% | 3.000000 | 1.000000 | 440.000000 | -37.802355 | 145.000100 |
| 75% | 3.000000 | 2.000000 | 651.000000 | -37.756400 | 145.058305 |
| max | 10.000000 | 8.000000 | 433014.000000 | -37.408530 | 145.526350 |
```python
X.head()
```
| Date | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 3/12/2016 | 2 | 1.0 | 202.0 | -37.7996 | 144.9984 |
| 4/02/2016 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 4/03/2017 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4/03/2017 | 3 | 2.0 | 94.0 | -37.7969 | 144.9969 |
| 4/06/2016 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
3. Build the model

```python
from sklearn.tree import DecisionTreeRegressor

# random_state pins the randomness so results are reproducible
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)
```

```
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
```
4. Predict

```python
print("Making predictions for the following 5 houses:")
print("The predictions are")
print(melbourne_model.predict(X.head()))
print(y.head())  # actual prices match exactly, since we predicted on the training data
```

```
Making predictions for the following 5 houses:
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]
Date
3/12/2016    1480000.0
4/02/2016    1035000.0
4/03/2017    1465000.0
4/03/2017     850000.0
4/06/2016    1600000.0
Name: Price, dtype: float64
```
5. Model validation: MAE
- Start with a metric called Mean Absolute Error (MAE).
- The smaller the MAE, the better.
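The metric itself is simple: MAE = mean(|actual − predicted|). A quick sketch with made-up prices:

```python
import numpy as np

# MAE = mean(|actual - predicted|); hypothetical values for illustration.
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])
mae = np.abs(actual - predicted).mean()
print(mae)  # (10 + 10 + 30) / 3 = 16.666...
```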
```python
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
```

```
62509.0528227786
```
6. Split the data into training and validation sets with train_test_split

```python
from sklearn.model_selection import train_test_split

# Hold out part of the data so the model is scored on houses it has never seen
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

val_predictions = melbourne_model.predict(val_X)
print("MAE:", mean_absolute_error(val_y, val_predictions))
```

```
MAE: 247974.12793323514
```
7. Try different models
Underfitting and overfitting
- Overfitting: too many tree nodes; the model matches the training data almost perfectly but generalizes poorly to new data.
- Underfitting: too few tree nodes; the model fails to capture the important features and patterns in the data.
![在这里插入图片描述](https://i-blog.csdnimg.cn/blog_migrate/8891eac5f2e5f1068a950aa026b32ab7.png)
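Overfitting shows up directly as a gap between training error and validation error. A self-contained sketch on synthetic data (synthetic so it runs independently of the Melbourne notebook):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for the Melbourne set here.
X, y = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# An unrestricted tree grows until every training row sits in its own leaf.
deep_tree = DecisionTreeRegressor(random_state=0).fit(train_X, train_y)
train_mae = mean_absolute_error(train_y, deep_tree.predict(train_X))
val_mae = mean_absolute_error(val_y, deep_tree.predict(val_X))
print(train_mae, val_mae)  # near-zero training error, much larger validation error
```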
```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Compare validation MAE for trees of different sizes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
```

```
Max leaf nodes: 5        Mean Absolute Error: 354662
Max leaf nodes: 50       Mean Absolute Error: 266447
Max leaf nodes: 500      Mean Absolute Error: 231301
Max leaf nodes: 5000     Mean Absolute Error: 249163
```
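Once each candidate leaf count has a MAE, the best one can be picked programmatically with `min()`. A self-contained sketch on synthetic data (the notes above do the same sweep on the Melbourne split):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

candidates = [5, 50, 500, 5000]
scores = {n: get_mae(n) for n in candidates}
best = min(scores, key=scores.get)  # the leaf count with the smallest validation MAE
print(best, scores[best])
```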
>> Random Forests
- Decision trees leave you with a dilemma. A deep tree with many leaves will overfit, because each prediction comes from only the handful of training houses that land in its leaf. A shallow tree with few leaves will underfit, because it cannot capture as many distinctions in the raw data.
- A random forest uses many trees, and it makes a prediction by averaging the predictions of its component trees. It generally has better predictive accuracy than a single decision tree.
![在这里插入图片描述](https://i-blog.csdnimg.cn/blog_migrate/59798ab427e61d249d7fa14060fd7d5b.png)
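The averaging idea can be sketched by hand: fit several trees on bootstrap samples and average their predictions. This is a simplified version of what `RandomForestRegressor` does internally (real forests also randomize the features considered at each split):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
rng = np.random.default_rng(0)

tree_preds = []
for _ in range(10):
    # Bootstrap sample: draw rows with replacement, one sample per tree.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X[idx], y[idx])
    tree_preds.append(tree.predict(X))

# The ensemble prediction is the average over the individual trees.
ensemble_pred = np.mean(tree_preds, axis=0)
print(ensemble_pred.shape)  # (300,)
```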
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
```

```
191525.59192369733
```
>> Conclusion
- There is likely room for further improvement, but this is a big gain over the best decision tree's error of roughly 250,000.
- Just as we changed the maximum size of the single decision tree, you can tune parameters to improve the random forest's performance.
- One of the best features of random forest models, however, is that they generally work reasonably well even without this tuning.
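For example, just as `max_leaf_nodes` was swept for the single tree, the forest's `n_estimators` (the number of trees) can be swept the same way. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# More trees usually help up to a point, at the cost of training time.
for n_estimators in [10, 50, 100]:
    forest = RandomForestRegressor(n_estimators=n_estimators, random_state=1)
    forest.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, forest.predict(val_X))
    print(n_estimators, mae)
```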