From Hours to Seconds: 100x Faster Boosting, Bagging, and Stacking with RAPIDS cuML and Scikit-learn Machine Learning Model Ensembling...

weixin_26726011

Original article: https://medium.com/rapids-ai/100x-faster-machine-learning-model-ensembling-with-rapids-cuml-and-scikit-learn-meta-estimators-d869788ee6b1

By Nick Becker and Dante Gama Dessavre

Introduction

To achieve peak performance, data scientists often turn to a technique called model ensembling, in which multiple algorithms are combined in clever ways to achieve better results. Common examples include Random Forest (Bagging) and Gradient Boosted Decision Trees (Boosting), but we can use ensemble learning with arbitrary models, too. Scikit-learn provides straightforward APIs for common ensembling approaches so data scientists can easily get up and running. Unfortunately, these techniques are computationally expensive. In many cases, they’re so computationally expensive that they aren’t cost or time effective.

What if you could train these complex ensemble models faster than you’re currently training your single models?

In this post, we’ll walk through how you can now use RAPIDS cuML with scikit-learn’s ensemble model APIs to achieve more than 100x faster boosting, bagging, stacking, and more. This is possible because of the well-defined interfaces and use of duck typing in the scikit-learn codebase. Using cuML estimators as drop-in replacements means data scientists can have their cake and eat it, too.

Why Ensemble?

Different kinds of model ensembles can provide many benefits, including reduced variance and higher accuracy (see The Elements of Statistical Learning, sections 8.7 and 8.8). As a result, models like Random Forests and libraries like XGBoost have become very popular. But we can ensemble with non-tree algorithms, too. As a concrete example, consider boosting a Support Vector Regression model.

Standard Support Vector Regression (SVR) achieves an out-of-sample R² of 0.41. Boosted SVR achieves an out-of-sample R² of 0.50, noticeably higher. Though it delivers better results, the boosted scikit-learn SVR is much slower to train and use. Data scientists shouldn’t have to choose between building ensemble models and fast training. Using cuML with scikit-learn gives data scientists the tools they need to do both.
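
The comparison between a single SVR and a boosted SVR can be sketched in a few lines. This is a minimal illustration, not the authors' benchmark: the synthetic `make_friedman1` dataset, the sample size, and `n_estimators=10` are assumptions, so the R² values it produces will differ from the ones quoted. The `try`/`except` fallback to scikit-learn's `SVR` is also an addition, so that the sketch runs on a machine without a GPU.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

try:
    from cuml.svm import SVR  # GPU implementation from RAPIDS cuML
except ImportError:
    from sklearn.svm import SVR  # CPU fallback (assumption, for portability)

# Synthetic nonlinear regression data (illustrative, not the article's dataset).
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=0)
X, y = X.astype(np.float32), y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single SVR model.
single = SVR()
single.fit(X_train, y_train)

# The same estimator wrapped in scikit-learn's AdaBoost meta-estimator.
boosted = AdaBoostRegressor(SVR(), n_estimators=10, random_state=0)
boosted.fit(X_train, y_train)

# Out-of-sample R^2 for both models.
r2_single = r2_score(y_test, np.asarray(single.predict(X_test)))
r2_boosted = r2_score(y_test, np.asarray(boosted.predict(X_test)))
print(f"single SVR R^2:  {r2_single:.3f}")
print(f"boosted SVR R^2: {r2_boosted:.3f}")
```

Whether boosting helps, and by how much, depends on the data and the base model's hyperparameters, so the printed scores should be read as a demonstration of the workflow rather than as a reproduction of the 0.41 and 0.50 figures.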

Ensembling with cuML + Scikit-learn

We’ve recently enhanced cuML’s support of scikit-learn APIs and interoperability standards so that it can be used with scikit-learn’s ensemble APIs. Even when working with NumPy-based CPU inputs and outputs (currently required for these scikit-learn ensemble APIs), there are massive speedups. In the following sections, we’ll walk through several small examples that highlight both the ease of use and the impact of using cuML with datasets of a range of sizes.

Voting Classifier

Scikit-learn’s VotingClassifier lets your final prediction come from a vote between multiple independently trained models. In the following example, we vote between the predictions from Logistic Regression and Support Vector Classifier models, giving more weight to the predictions from the SVC model.

With just 50,000 records in the data, using cuML for the Logistic Regression and SVC estimators in the VotingClassifier provides a 100x speedup. By the time we hit 200,000 records, the speedup factor jumps to almost 300x. cuML’s algorithms scale more effectively than their CPU equivalents because of the GPU’s massive parallelism, high-bandwidth memory, and ability to process more data before saturating the available computational resources.
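
A weighted vote between Logistic Regression and SVC might look like the sketch below. It is an illustration under stated assumptions rather than the benchmark code: the synthetic dataset, its size (much smaller than the 50,000 records discussed), hard voting, and the `weights=[1, 2]` values are all chosen here for demonstration. The fallback import lets the same code run on scikit-learn alone when cuML is not installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split

try:
    from cuml.linear_model import LogisticRegression  # GPU estimators
    from cuml.svm import SVC
except ImportError:
    from sklearn.linear_model import LogisticRegression  # CPU fallback
    from sklearn.svm import SVC

# Small synthetic binary classification problem (illustrative sizes).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hard (majority) voting, with the SVC vote counted twice.
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("svc", SVC())],
    voting="hard",
    weights=[1, 2],
)
voter.fit(X_train, y_train)

accuracy = voter.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

With only two voters and hard voting, the heavier SVC weight decides every disagreement. Soft voting would average predicted probabilities instead, which requires both estimators to expose `predict_proba`.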

Stacking Classifier

Scikit-learn’s StackingClassifier takes the predictions from individual models as inputs to a “second-stage” classifier to make a final prediction. In the following example, we stack predictions from Logistic Regression and Support Vector Classifier models and use a Logistic Regression to make the final predictions.

With a dataset of 100,000 rows and ten features, training the StackingClassifier is 35x faster, and scoring is more than 350x faster with cuML estimators.
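
The stacking setup described here, with separate timings for training and scoring, can be sketched as follows. The dataset, its reduced size, and the explicit `stack_method="predict"` are assumptions made for this illustration. That last setting makes the sketch rely only on each base model's `predict` method, and it may differ from what the original benchmark used.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split

try:
    from cuml.linear_model import LogisticRegression  # GPU estimators
    from cuml.svm import SVC
except ImportError:
    from sklearn.linear_model import LogisticRegression  # CPU fallback
    from sklearn.svm import SVC

# Ten features as in the text, but far fewer rows so the sketch runs quickly.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First-stage models feed their predictions to a second-stage classifier.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("svc", SVC())],
    final_estimator=LogisticRegression(),
    stack_method="predict",  # assumption: use class predictions as features
)

start = time.perf_counter()
stack.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

start = time.perf_counter()
accuracy = stack.score(X_test, y_test)
score_seconds = time.perf_counter() - start

print(f"fit: {fit_seconds:.2f}s  score: {score_seconds:.2f}s  "
      f"accuracy: {accuracy:.3f}")
```

`StackingClassifier` fits each base model several times during internal cross-validation, which is one reason faster base estimators have such a large effect on training time.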

Bagged Regression

Scikit-learn’s BaggingRegressor builds independent models on random samples drawn from the data (bootstrapping), and then aggregates the results to make a final prediction. This is quite similar to Random Forest but can be used with any estimator. In the following example, we bootstrap aggregate K-Nearest Neighbors Regression. KNN can easily overfit, so bagging is a great way to reduce the variance when using this high-capacity model.

We’ve increased our data size to 250,000 records for this example. With 250,000 rows, using cuML for Bagged KNN Regression is 245x faster. Training time drops from 1.3 hours to 19 seconds by swapping one line of code.
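
A bagged KNN regression, where the "one line" that changes is the import of the base estimator, could be sketched like this. The synthetic data, the much smaller row count, `n_neighbors=5`, and `n_estimators=10` are illustrative assumptions and not the settings behind the timings quoted above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

try:
    # The one-line swap: take the base estimator from cuML ...
    from cuml.neighbors import KNeighborsRegressor
except ImportError:
    # ... or from scikit-learn when no GPU stack is installed.
    from sklearn.neighbors import KNeighborsRegressor

# Small synthetic regression problem (the article uses 250,000 rows).
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0,
                       random_state=0)
X, y = X.astype(np.float32), y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 10 KNN models is fit on its own bootstrap sample;
# predictions are averaged across the models.
bagged = BaggingRegressor(
    KNeighborsRegressor(n_neighbors=5),
    n_estimators=10,
    random_state=0,
)
bagged.fit(X_train, y_train)

predictions = bagged.predict(X_test)
r2 = bagged.score(X_test, y_test)
print(f"bagged KNN out-of-sample R^2: {r2:.3f}")
```

The base estimator is passed positionally because the corresponding keyword was renamed between scikit-learn releases.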

Boosted Regression

Scikit-learn’s AdaBoostRegressor builds a model using the AdaBoost algorithm. At a high level, this involves fitting and predicting on the data, increasing the weight of the “difficult” samples in the data, and continuing to train the model with the new sample weights. In the following example, we boost Support Vector Regression.

Even with just 20,000 rows and ten features, dropping cuML’s SVR into scikit-learn’s AdaBoostRegressor API gives a 140x speedup during training and a 400x speedup during scoring.
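
One way to measure training and scoring time for a boosted SVR with either backend is sketched below. The helper `time_boosted_svr` is a hypothetical name introduced for this illustration. The synthetic data, the 1,000-row size, and `n_estimators=10` are assumptions, so the timings will not match the 140x and 400x figures above. The cuML run is attempted only if the library can be imported.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR as SklearnSVR


def time_boosted_svr(svr_cls, X, y, n_estimators=10):
    """Fit and score AdaBoost over `svr_cls`; return (fit_s, score_s, r2)."""
    model = AdaBoostRegressor(svr_cls(), n_estimators=n_estimators,
                              random_state=0)
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    r2 = model.score(X, y)  # scored on the training data, for timing only
    score_seconds = time.perf_counter() - start
    return fit_seconds, score_seconds, r2


# Ten features as in the text, with fewer rows so the CPU run stays short.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                       random_state=0)
X, y = X.astype(np.float32), y.astype(np.float32)

results = {"scikit-learn": time_boosted_svr(SklearnSVR, X, y)}
try:
    from cuml.svm import SVR as CumlSVR
    results["cuML"] = time_boosted_svr(CumlSVR, X, y)
except ImportError:
    pass  # cuML not installed; only the CPU timing is reported

for name, (fit_s, score_s, r2) in results.items():
    print(f"{name}: fit {fit_s:.2f}s, score {score_s:.2f}s, R^2 {r2:.3f}")
```

Passing the estimator class as an argument keeps the meta-estimator code identical for both backends, which is the duck-typing property the article relies on.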

Conclusion

Ensemble modeling can lead to better models but is often too computationally expensive to justify. By integrating and dramatically speeding up scikit-learn’s meta-estimators, cuML now allows data scientists to train ensemble models faster than they could previously train individual models. Ensemble learning and autoML libraries built around scikit-learn APIs can unlock speedups like those shown above by allowing users to swap scikit-learn estimators for cuML estimators explicitly or implicitly (via duck typing).

Today, these ensemble modeling APIs require using CPU-based inputs and outputs (e.g., NumPy arrays). The PyData community has been actively working on efforts to streamline using arbitrary arrays (including GPU arrays) in libraries relying on NumPy. Eventually, we hope to support all of these meta-estimators end-to-end on the GPU for even more considerable speedups.

Want to help drive data science software forward? Check out cuML and scikit-learn on GitHub and file a feature request or contribute a pull request. Want to get started with RAPIDS and access these 100x+ speedups? Check out the Getting Started webpage, with links to help you download pre-built Docker containers or install directly via Conda.