Boosting Machine Learning

Bagging is a technique used in ML to make "weak classifiers" strong enough to make good predictions. The idea is to train many weak classifiers on our *data and then combine their results, either by averaging them or by picking the prediction made by the majority of these weak models. The main catch here is that all the weak classifiers are independent, i.e. they are not influenced by the other classifiers' errors or predictions, and each one predicts on its own.


Boosting is an ensemble technique in which the weak classifier models are not made to work independently, but sequentially. The logic behind this is that each subsequent weak classifier learns from the mistakes of the previous predictors: the samples with the highest error get the most weight in the subsequent model, while those with low error matter less. The weak classifier can be chosen from a range of models like decision trees, regressors, classifiers, etc. The stopping criterion should be chosen carefully or it could lead to overfitting on the training data.


*data: The data we feed to these weak classifiers is the training data. Randomness is introduced into the data before feeding it to these classifiers to prevent the issue of overfitting.

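To make the contrast concrete, here is a minimal sketch using scikit-learn (assumed to be installed); the synthetic dataset and settings are illustrative only and are not from the original article. The bagging model trains decision trees independently on bootstrap samples and combines them by voting, while the boosting model trains its trees sequentially, each one focusing on the examples the previous ones got wrong.

```python
# Minimal bagging vs. boosting sketch with scikit-learn (illustrative data and settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging: 50 independent trees, each fit on a random bootstrap sample; majority vote.
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: 50 trees fit sequentially, each re-weighting the mistakes of the previous ones.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

print("Bagging accuracy: %.3f" % cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting accuracy: %.3f" % cross_val_score(boosting, X, y, cv=5).mean())
```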

What are weak classifiers/strong classifiers?

Fig 1

If our model is making predictions around 0.5, this means that the classifier is "weak". The reason is this: suppose we need to identify from the data whether an image is a "dog" or a "cat", where "0" denotes dog and "1" denotes cat. If our classifier gives a prediction close to 0.5, it means the model is weak at determining whether the image is a cat or a dog.


A "strong classifier", on the other hand, will predict close to either 0 or 1, which means the model is confident enough to identify whether the image is a cat or a dog.

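As a toy illustration (not from the original article), this is how a predicted probability for "cat" would be read under that convention:

```python
# Hypothetical helper: interpret a binary prediction where 0 = dog and 1 = cat.
def describe(prob_cat):
    # Close to 0.5: the classifier is barely better than guessing, i.e. "weak".
    if abs(prob_cat - 0.5) < 0.1:
        return "uncertain (weak classifier behaviour)"
    # Close to 0 or 1: the classifier is confident, i.e. "strong".
    return "confident: cat" if prob_cat > 0.5 else "confident: dog"

print(describe(0.52))  # uncertain (weak classifier behaviour)
print(describe(0.97))  # confident: cat
print(describe(0.03))  # confident: dog
```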

Boosting and Bagging Algorithms:

We select the weak classifier models that we are going to use in the boosting algorithms so that they work effectively. Among the weak classifier models that can be used are "Decision Trees", and there are many others. Decision trees are fast and easy to understand, but they have very high variance, i.e. they tend to overfit the data and completely memorize it rather than generalize from it. That is why they are weak learners.


Some common boosting algorithms are:


  1. AdaBoost
  2. XGBoost
  3. Gradient Tree Boosting

A common bagging algorithm is:


  1. Random Forest

BAGGING ALGORITHM

Random Forest:

Fig 2.1

TASK: The table represents the data we have, with each column representing a feature (Gender, Age, Location, Job, Hobby, App). We are building an app recommendation system, i.e. predicting which app is most likely to be downloaded by which category of people. For example, we can see from the data that someone with [Gender: F, Age: 15, Location: US, and so on…] is more likely to download Pokemon Go.


FEATURES: Here the output feature is "App", while the other features are input features. We feed the input features to our model and it outputs which app that person is likely to download.


WEAK LEARNER CLASSIFIER: For the model, we are using "Decision Trees". We can clearly see in the flowchart in Fig 2.1 that the decision tree has memorized every input data point, i.e. overfitting has occurred.


Fig 2.2

CONCEPT: We select some random input features, such as (Gender, Job, Hobby), and feed them to a "Decision Tree". We repeat this procedure several times, so we end up with many decision trees, each with its own predicted output. We then choose the output that is most common across these decision trees.


Random Forest can be used for both classification and regression. A forest is composed of trees, i.e. "Decision Trees". It is said that the more trees it has, the more robust a forest is. It also provides a pretty good indicator of feature importance.

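Here is a rough sketch of this idea with scikit-learn (assumed available); the tiny, already label-encoded table and the app names are hypothetical stand-ins for the data in Fig 2.1.

```python
# Random forest sketch: each tree sees a bootstrap sample of rows and a random
# subset of features; the forest predicts by majority vote over the trees.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({
    "Gender": [0, 1, 0, 1],     # hypothetical label-encoded input features
    "Age":    [15, 25, 35, 45],
    "Job":    [0, 1, 2, 1],
    "Hobby":  [2, 0, 1, 0],
})
y = ["Pokemon Go", "WhatsApp", "WhatsApp", "Snapchat"]  # the "App" output feature

model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X, y)

print(model.predict(X.iloc[[0]]))   # predicted app for the first person
print(model.feature_importances_)   # the feature-importance indicator mentioned above
```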

APPLICATION: Random forests have a variety of applications, such as recommendation engines, image classification, and feature selection. They can be used to classify loyal loan applicants, identify fraudulent activity, and predict diseases.


PROS OF RANDOM FOREST:


  1. It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
  2. Random forests work well for a larger range of data items than a single decision tree does.
  3. A random forest has lower variance than a single decision tree.
  4. Random forests are very flexible and can achieve very high accuracy.
  5. The random forest algorithm does not require scaling of the data; it maintains good accuracy even when the data is provided without scaling.
  6. Random forest algorithms maintain good accuracy even when a large proportion of the data is missing.

CONS OF RANDOM FOREST:


  1. Complexity is the main disadvantage of random forest algorithms.
  2. Construction of random forests is much harder and more time-consuming than that of decision trees.
  3. More computational resources are required to implement the random forest algorithm.
  4. It is less intuitive when we have a large collection of decision trees.
  5. The prediction process using random forests is very time-consuming in comparison with other algorithms.

BOOSTING ALGORITHM

1. Gradient Boosting Algorithm:

Fig 3.1: Loss Function (Mean Square Error)

The idea is to minimize the loss function using gradient descent, updating our prediction on the basis of the learning rate.

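In symbols (a brief aside, not part of the original figure): for a target y and a current prediction F(x), the squared-error loss and its negative gradient with respect to the prediction are

```latex
L\bigl(y, F(x)\bigr) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^{2},
\qquad
-\frac{\partial L}{\partial F(x)} = y - F(x)
```

so a gradient-descent step on the prediction simply adds a learning-rate-sized fraction of the residual, which is exactly what the steps below do.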

Fig 3.2

CONCEPT:


  1. Use a weak model like "Linear Regression" or "Decision Trees" on the data and get the predicted output (Output-1).

  2. Now calculate the error with any loss function you want to use, like MSE (Mean Square Error). The quantity carried forward is the residual: [Error = Actual_Output - Output-1].

  3. Now fit another weak model to this loss/error and get its output (Output-2).

  4. Now add both of these outputs (Output-1 + Output-2). We can relate this to Fig 3.2 and introduce the learning rate here. **Remember that Output-2 is the outcome of fitting the error to the model.

  5. Now treat (Output-1 + Output-2) as the current prediction, compute the new error from it, fit another weak model to that error, and repeat the above procedure until the predictions stop improving. Also, make sure not to overfit the data. (A minimal code sketch of this procedure follows below.)
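Here is a minimal from-scratch sketch of steps 1-5 for regression with squared error, assuming numpy and scikit-learn are available; the weak learner, learning rate, number of rounds, and toy data are all illustrative choices.

```python
# Hand-rolled gradient boosting (squared-error loss), following steps 1-5 above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    prediction = np.full(len(y), y.mean())             # step 1: initial weak prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction                      # step 2: error of the current prediction
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                          # step 3: fit a weak model to the error
        prediction += learning_rate * tree.predict(X)  # step 4: add its output, scaled by the learning rate
        trees.append(tree)                             # step 5: repeat; stop before overfitting sets in
    return trees, prediction

X = np.arange(20, dtype=float).reshape(-1, 1)          # tiny illustrative dataset
y = np.sin(X).ravel()
trees, fitted = gradient_boost(X, y)
print("training MSE:", np.mean((y - fitted) ** 2))
```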

2. XGBoost Algorithm:

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. Although it is built on top of gradient boosting, it is a major enhancement over it: through system optimization and algorithmic improvements, it outperforms plain gradient boosting.

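A minimal usage sketch, assuming the `xgboost` Python package and scikit-learn are installed; the synthetic dataset and hyperparameter values are placeholders, chosen only to show where the ideas discussed below (max_depth-driven pruning, parallel tree construction, L1/L2 regularization) appear in the API.

```python
# Illustrative XGBoost classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,   # number of boosting rounds
    learning_rate=0.1,
    max_depth=4,        # 'max_depth' controls depth-first growth and backward pruning
    reg_alpha=0.0,      # L1 (LASSO) regularization
    reg_lambda=1.0,     # L2 (Ridge) regularization
    n_jobs=-1,          # parallelized tree construction
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```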

System Optimization:


  1. Parallelization: XGBoost approaches the process of sequential tree building using a parallelized implementation. This is possible due to the interchangeable nature of the loops used for building base learners: the outer loop enumerates the leaf nodes of a tree, and the second, inner loop calculates the features. This nesting of loops limits parallelization, because the outer loop cannot be started without completing the inner loop (the more computationally demanding of the two). Therefore, to improve run time, the order of the loops is interchanged, using initialization through a global scan of all instances and sorting using parallel threads. This switch improves algorithmic performance by offsetting any parallelization overheads in computation.


  2. Tree Pruning: The stopping criterion for tree splitting within the GBM framework is greedy in nature and depends on the negative loss criterion at the point of split. XGBoost instead uses the 'max_depth' parameter as specified and starts pruning trees backward. This 'depth-first' approach improves computational performance significantly.


  3. Hardware Optimization: This algorithm has been designed to make efficient use of hardware resources. This is accomplished through cache awareness, by allocating internal buffers in each thread to store gradient statistics. Further enhancements such as 'out-of-core' computing optimize available disk space while handling big data frames that do not fit into memory.


Algorithmic Enhancements:


  1. Regularization: It penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to prevent overfitting.


  2. Sparsity Awareness: XGBoost naturally admits sparse features as inputs by automatically 'learning' the best missing value depending on the training loss, and it handles different types of sparsity patterns in the data more efficiently.


  3. Weighted Quantile Sketch: XGBoost employs the distributed weighted Quantile Sketch algorithm to effectively find the optimal split points among weighted datasets.


  4. Cross-validation: The algorithm comes with a built-in cross-validation method at each iteration, removing the need to explicitly program this search or to specify the exact number of boosting iterations required in a single run (see the sketch after this list).

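As a sketch of the built-in cross-validation from item 4 (again assuming the `xgboost` package; the data and values are illustrative), `xgb.cv` evaluates every boosting round with n-fold cross-validation, so early stopping can pick the number of rounds automatically.

```python
# Built-in cross-validation: evaluates every boosting round with 5-fold CV and
# stops once the held-out log-loss has not improved for 10 rounds.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=200,
    nfold=5,
    metrics="logloss",
    early_stopping_rounds=10,
    seed=0,
)
print("boosting rounds kept:", len(cv_results))
print(cv_results.tail())
```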

Translated from: https://medium.com/@sauryathome/boosting-machine-learning-b424e84066a3
