
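This article examines the speed trade-offs of running machine learning projects on a personal computer's CPU or GPU versus cloud services (such as Google Colab and Kaggle). In Python, using XGBoost with a GPU (such as an NVIDIA TITAN RTX) can significantly speed up training. GPU cloud services (such as Google Colab's Tesla T4 GPU) and the RAPIDS libraries (NVIDIA's Python data science libraries) can improve efficiency further, especially when processing large-scale data.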


Introduction

Thanks to recent advances in storage capacity and memory management, it has become much easier to create machine learning and deep learning projects from the comfort of your own home.


In this article, I will introduce several possible approaches to running machine learning projects in Python and give some indication of their trade-offs in execution speed. The approaches are:

  • Using a personal computer/laptop CPU (Central processing unit)/GPU (Graphics processing unit).
  • Using cloud services (Kaggle, Google Colab).

First of all, we need to import all the necessary dependencies:


import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score

For this example, I decided to generate a simple synthetic dataset from Gaussian distributions, consisting of four features and two labels (0/1):


# Creating a nearly linearly separable dataset using Gaussian distributions.
# The first half of the values in Y is 0 and the other half is 1.
# For each of the 4 features, the first half is therefore drawn from a
# different Gaussian than the second half (in most cases with a clearly
# different mean), which makes the classification between the two
# classes quite simple.
dataset_len = 40000000
dlen = int(dataset_len/2)
X_11 = pd.Series(np.random.normal(2,2,dlen))
X_12 = pd.Series(np.random.normal(9,2,dlen))
X_1 = pd.concat([X_11, X_12]).reset_index(drop=True)
X_21 = pd.Series(np.random.normal(1,3,dlen))
X_22 = pd.Series(np.random.normal(7,3,dlen))
X_2 = pd.concat([X_21, X_22]).reset_index(drop=True)
X_31 = pd.Series(np.random.normal(3,1,dlen))
X_32 = pd.Series(np.random.normal(3,4,dlen))
X_3 = pd.concat([X_31, X_32]).reset_index(drop=True)
X_41 = pd.Series(np.random.normal(1,1,dlen))
X_42 = pd.Series(np.random.normal(5,2,dlen))
X_4 = pd.concat([X_41, X_42]).reset_index(drop=True)
Y = pd.Series(np.repeat([0,1],dlen))
df = pd.concat([X_1, X_2, X_3, X_4, Y], axis=1)
df.columns = ['X1', 'X2', 'X3', 'X4', 'Y']

Next, we prepare the dataset to be fed into a machine learning model (splitting it into features and labels, and into training and test sets):


train_size = 0.80
X = df.drop(['Y'], axis = 1).values
y = df['Y']

# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 

# Encode labels
y = label_encoder.fit_transform(y) 

# identify shape and indices
num_rows, num_columns = df.shape
delim_index = int(num_rows * train_size)

# Splitting the dataset in training and test sets
X_train, y_train = X[:delim_index, :], y[:delim_index]
X_test, y_test = X[delim_index:, :], y[delim_index:]

# Checking sets dimensions
print('X_train dimensions: ', X_train.shape, 'y_train: ', y_train.shape)
print('X_test dimensions:', X_test.shape, 'y_validation: ', y_test.shape)

# Checking dimensions in percentages
total = X_train.shape[0] + X_test.shape[0]
print('X_train Percentage:', (X_train.shape[0]/total)*100, '%')
print('X_test Percentage:', (X_test.shape[0]/total)*100, '%')

The resulting train/test split is shown below:


X_train dimensions:  (32000000, 4) y_train:  (32000000,)
X_test dimensions: (8000000, 4) y_validation:  (8000000,)
X_train Percentage: 80.0 %
X_test Percentage: 20.0 %
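
One thing to be aware of: because Y is built as a block of zeros followed by a block of ones, the sequential 80/20 cut above puts only class-1 rows in the test set. Shuffling the rows before splitting avoids that. The sketch below illustrates the difference on a much smaller toy version of the data; the sample size, the seed, and the variable names are assumptions chosen for illustration, not the settings used for the benchmark.

```python
import numpy as np

# Small stand-in for the article's dataset (size and seed are illustrative assumptions).
rng = np.random.default_rng(0)
half = 500
X_small = np.concatenate([rng.normal(2, 2, half), rng.normal(9, 2, half)]).reshape(-1, 1)
y_small = np.repeat([0, 1], half)

# Sequential cut, as in the code above: the last 20% of rows are all labelled 1.
cut = int(len(y_small) * 0.80)
sequential_test_classes = np.unique(y_small[cut:])

# Shuffled cut: permute the row order first, then split at the same index.
order = rng.permutation(len(y_small))
X_shuf, y_shuf = X_small[order], y_small[order]
X_tr, y_tr = X_shuf[:cut], y_shuf[:cut]
X_te, y_te = X_shuf[cut:], y_shuf[cut:]
shuffled_test_classes = np.unique(y_te)

print("classes in sequential test set:", sequential_test_classes)
print("classes in shuffled test set:  ", shuffled_test_classes)
```

The `train_test_split` function imported earlier shuffles by default, and its `stratify` argument keeps the class proportions identical in both sets, so it could be used in place of the manual permutation.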

We are now ready to get started benchmarking the different approaches. In all the following examples, we will be using XGBoost (Gradient Boosted Decision Trees) as our classifier.


1) CPU

Training an XGBClassifier on my personal machine (without using a GPU) led to the following results:



model = XGBClassifier(tree_method='hist')
model.fit(X_train, y_train)
CPU times: user 8min 1s, sys: 5.94 s, total: 8min 7s
Wall time: 8min 6s
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, tree_method='hist', verbosity=1)

Once we've trained our model, we can check its prediction accuracy:


sk_pred = model.predict(X_test)
sk_pred = np.round(sk_pred)
sk_acc = round(accuracy_score(y_test, sk_pred), 2)
print("XGB accuracy using Sklearn:", sk_acc*100, '%')
XGB accuracy using Sklearn: 99.0 %

In summary, using a standard CPU machine, it took about 8 minutes to train our classifier to achieve 99% accuracy.
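
The `CPU times` and `Wall time` lines above have the format printed by a notebook timing magic such as `%%time`. Outside a notebook, a comparable wall-clock measurement can be taken with the standard library. The helper below is a minimal sketch: `time_call` is a hypothetical name, and the commented usage assumes the `model`, `X_train`, and `y_train` objects defined earlier.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs without a trained model.
total, seconds = time_call(sum, range(1_000_000))
print(f"sum finished in {seconds:.4f} s")

# With the article's objects this could be used as:
# _, seconds = time_call(model.fit, X_train, y_train)
```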


2) GPU

I will now instead make use of an NVIDIA TITAN RTX GPU on my personal machine to speed up the training. In this case, in order to activate the GPU mode of XGB, we need to specify the tree_method as gpu_hist instead of hist.



model = XGBClassifier(tree_method='gpu_hist')
model.fit(X_train, y_train)

In this example, using the TITAN RTX cut the execution time to just 8.85 seconds (roughly 55 times faster than using just the CPU!).

sk_pred = model.predict(X_test)
sk_pred = np.round(sk_pred)
sk_acc = round(accuracy_score(y_test, sk_pred), 2)
print("XGB accuracy using Sklearn:", sk_acc*100, '%')
XGB accuracy using Sklearn: 99.0 %

This considerable improvement in speed was possible thanks to the GPU's ability to take load off the CPU, freeing up RAM and parallelizing the execution of many tasks.
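
The `gpu_hist` value used above matches the XGBoost release this benchmark was run on. From XGBoost 2.0 onward, `gpu_hist` is deprecated, and GPU training is requested with `tree_method='hist'` together with `device='cuda'`. The helper below is a hypothetical sketch (the name `gpu_tree_params` is not part of XGBoost) that picks the keyword arguments from a version string.

```python
def gpu_tree_params(xgboost_version: str) -> dict:
    """Return keyword arguments that request GPU histogram training.

    Assumption: versions >= 2.0 use device='cuda'; older versions use 'gpu_hist'.
    """
    major = int(xgboost_version.split(".")[0])
    if major >= 2:
        return {"tree_method": "hist", "device": "cuda"}
    return {"tree_method": "gpu_hist"}

print(gpu_tree_params("1.0.2"))
print(gpu_tree_params("2.1.0"))

# Possible usage, assuming xgboost is installed and a CUDA GPU is present:
# import xgboost as xgb
# model = xgb.XGBClassifier(**gpu_tree_params(xgb.__version__))
```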


3) GPU Cloud Services

I will now go over two examples of free GPU cloud services (Google Colab and Kaggle) and show you what benchmark score they are able to achieve. In both cases, we need to explicitly turn on the GPUs on the respective notebooks and specify the XGBoost tree_method as gpu_hist.


Google Colab

Using one of Google Colab's NVIDIA Tesla T4 GPUs, the following timings were recorded:

CPU times: user 5.43 s, sys: 1.88 s, total: 7.31 s
Wall time: 7.59 s

Kaggle

Using Kaggle instead led to a slightly higher execution time:


CPU times: user 5.37 s, sys: 5.42 s, total: 10.8 s
Wall time: 11.2 s
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, tree_method='gpu_hist', verbosity=1)

Both Google Colab and Kaggle led to a remarkable decrease in execution time.

One downside of using these services is the limited amount of CPU and RAM available. In fact, slightly increasing the size of the example dataset caused Google Colab to run out of RAM (which wasn't an issue when using the TITAN RTX).

One possible way to address this type of problem when working with memory-constrained devices is to optimize the code to consume as little memory as possible (using fixed-point precision and more efficient data structures).
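
One concrete way to shrink the memory footprint is to store the features at lower precision. The sketch below uses 32-bit floats rather than the fixed-point representation mentioned above (a simpler, swapped-in technique), and it estimates the footprint of the 40,000,000 x 4 feature matrix from a small sample. The sample size and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(2, 2, size=(1000, 4))   # NumPy generates float64 by default
sample32 = sample.astype(np.float32)        # same values stored in half the bytes

bytes_per_row_64 = sample.nbytes // sample.shape[0]      # 4 features * 8 bytes
bytes_per_row_32 = sample32.nbytes // sample32.shape[0]  # 4 features * 4 bytes

# Extrapolate to the number of rows used in the article.
dataset_len = 40_000_000
gb64 = dataset_len * bytes_per_row_64 / 1e9
gb32 = dataset_len * bytes_per_row_32 / 1e9
print(f"float64 features: {gb64:.2f} GB, float32 features: {gb32:.2f} GB")
```

Whether the reduced precision affects accuracy depends on the data, so it is worth re-checking the score after the conversion.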


4) Bonus Point: RAPIDS

As an additional point, I will now introduce you to RAPIDS, an open-source collection of Python libraries by NVIDIA. In this example, we will make use of its integration with the XGBoost library to speed up our workflow in Google Colab. The full notebook for this example (with instructions on how to set up RAPIDS in Google Colab) is available here or on my GitHub Account.


RAPIDS is designed to be the next evolutionary step in data processing. Thanks to its Apache Arrow in-memory format, RAPIDS can deliver a speed improvement of up to around 50x compared to Spark in-memory processing. It can also scale from one GPU to multiple GPUs.

All RAPIDS libraries are based on Python and are designed to have Pandas and Sklearn-like interfaces to facilitate adoption.


The structure of RAPIDS is based on different libraries in order to accelerate data science from end to end. Its main components are:


  • cuDF = used to perform data processing tasks (Pandas-like).
  • cuML = used to create machine learning models (Sklearn-like).
  • cuGraph = used to perform graph analytics (NetworkX-like).

In this example, we will make use of its XGBoost integration:


dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)


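params = {}
booster_params = {}
booster_params['tree_method'] = 'gpu_hist'
# Merge the booster settings into the parameters passed to xgb.train;
# without this step the GPU tree method would not be applied.
params.update(booster_params)

clf = xgb.train(params, dtrain)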
CPU times: user 1.42 s, sys: 719 ms, total: 2.14 s
Wall time: 2.51 s

As we can see above, using RAPIDS it took only about 2.5 seconds to train our model (decreasing execution time by a factor of almost 200 compared with the CPU run!).
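
When comparing this timing with the earlier runs, note that `xgb.train` runs 10 boosting rounds by default and, when no `objective` is set, fits a regression objective, whereas the `XGBClassifier` printed earlier used `n_estimators=100` and `objective='binary:logistic'`. For a like-for-like timing, the native API could be given the same settings. The dictionary below is a sketch of such a configuration; the names `matched_params` and `NUM_BOOST_ROUND` are illustrative, and the values are copied from the classifier printout above.

```python
# Settings copied from the XGBClassifier printout earlier in the article.
matched_params = {
    "tree_method": "gpu_hist",       # GPU histogram algorithm, as in the other GPU runs
    "objective": "binary:logistic",  # same objective as the XGBClassifier runs
    "max_depth": 3,
    "eta": 0.1,                      # native-API name for learning_rate
}
NUM_BOOST_ROUND = 100                # counterpart of n_estimators=100

print(matched_params, NUM_BOOST_ROUND)

# Possible usage, assuming `xgb` and `dtrain` are defined as above:
# clf = xgb.train(matched_params, dtrain, num_boost_round=NUM_BOOST_ROUND)
```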


Finally, we can check that RAPIDS gives the same prediction accuracy that we recorded in the other cases:


rapids_pred = clf.predict(dtest)

rapids_pred = np.round(rapids_pred)
rapids_acc = round(accuracy_score(y_test, rapids_pred), 2)
print("XGB accuracy using RAPIDS:", rapids_acc*100, '%')
XGB accuracy using RAPIDS: 99.0 %

If you are interested in finding out more about RAPIDS, more information is available here.


Conclusion

Finally, we can now compare the execution time of the different methods used. As shown in Figure 2, using GPU optimization can substantially decrease execution time, especially if integrated with the use of RAPIDS libraries.


Figure 3 shows how many times faster the GPU models are compared to our baseline CPU results.
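
The factors behind that comparison can be recomputed from the wall times reported in the sections above. The snippet below does that arithmetic; the dictionary labels are illustrative.

```python
# Wall-clock training times reported above, in seconds.
wall_times = {
    "CPU (personal machine)": 8 * 60 + 6,   # 8min 6s
    "TITAN RTX": 8.85,
    "Google Colab (Tesla T4)": 7.59,
    "Kaggle": 11.2,
    "RAPIDS on Google Colab": 2.51,
}

# Speed-up of each run relative to the CPU baseline.
baseline = wall_times["CPU (personal machine)"]
speedups = {name: baseline / t for name, t in wall_times.items()}

for name, factor in speedups.items():
    print(f"{name:26s} {wall_times[name]:7.2f} s  {factor:6.1f}x vs CPU")
```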


Contacts

If you want to keep up to date with my latest articles and projects, follow me on Medium and subscribe to my mailing list. These are some of my contact details:

Translated from: https://www.freecodecamp.org/news/benchmarking-machine-learning-execution-speeds/






