Kaggle Learning: Learn Machine Learning 7. Random Forests

7. Random Forests

This article is part of the Kaggle self-study series; click here to go back to the table of contents.


This tutorial is part of the series Learn Machine Learning. At the end of this step, you will be able to use your first sophisticated machine learning model, the Random Forest.

 

Introduction

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit, because each prediction comes from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly, because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models have clever ideas that can lead to better performance. We'll look at the random forest as an example.

 

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
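
To make the averaging idea concrete, here is a minimal, self-contained sketch (not from the original tutorial; the synthetic data and parameter values are purely illustrative) that checks that a scikit-learn forest's prediction equals the mean of its component trees' predictions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic regression data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

# The forest's prediction is the average of its component trees' predictions
manual_average = np.mean([tree.predict(X) for tree in forest.estimators_], axis=0)
print(np.allclose(forest.predict(X), manual_average))  # expected: True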

 

Example

You've already seen the code to load the data a few times. At the end of data-loading, we have the following variables (a minimal loading sketch follows the list below):

- train_X
- val_X
- train_y
- val_y
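
For reference, here is a minimal sketch of that data-loading step, assuming the Melbourne housing CSV and the feature columns used earlier in this series (the file path and column list may differ in your own setup):

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file path and columns from earlier steps in the series
melbourne_data = pd.read_csv('melb_data.csv')
melbourne_data = melbourne_data.dropna(axis=0)

y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)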

We build a random forest similarly to how we built a decision tree in scikit-learn.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Build and fit the random forest on the training data
forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)

# Predict on the validation data and report the mean absolute error
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))


As you can see, all we changed was the model, yet the validation error dropped from roughly 3,000+ to roughly 2,300+.


 

Conclusion

There is likely room for further improvement, but this is a big improvement over the best decision tree error of 250,000. There are parameters which allow you to change the performance of the Random Forest, much as we changed the maximum depth of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning. (Tuning will, of course, do even better.)
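
As an illustration (not part of the original tutorial), here is a minimal sketch of trying a few such parameters; the specific values and the choice of n_estimators and max_leaf_nodes are illustrative, mirroring how max_leaf_nodes was tuned for the single decision tree earlier in the series:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Compare validation MAE across a few illustrative settings
for n_estimators in [50, 100, 200]:
    model = RandomForestRegressor(n_estimators=n_estimators, max_leaf_nodes=500, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    print(n_estimators, mean_absolute_error(val_y, preds))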

You'll soon learn the XGBoost model, which provides better performance when tuned well with the right parameters (but which requires some skill to get the right model parameters). XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm that has become increasingly popular in industry and can noticeably improve a predictive model's performance.
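
As a preview, here is a minimal sketch of XGBoost's scikit-learn-style interface (not from this tutorial; it assumes the xgboost package is installed, and the parameter values are illustrative):

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Illustrative hyperparameters; XGBoost usually needs tuning to perform at its best
xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.05)
xgb_model.fit(train_X, train_y)
print(mean_absolute_error(val_y, xgb_model.predict(val_X)))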

 

Continue

You will see more big improvements in your models as soon as you start the Intermediate track in Learn Machine Learning. But you now have a model that's a good starting point to compete in a machine learning competition!

 

Follow these steps to make submissions for your current model. Then you can watch your progress in subsequent steps as you climb up the leaderboard with your continually improving models.

This article is part of the Kaggle self-study series; click here to go back to the table of contents.

