Using R for Parallel Computing to Build Concurrent Models with H2O

Parallel Computing with R


R offers powerful tools for statistical modeling, data processing and visualization, but scaling becomes difficult as data volume grows. By default, R runs on a single CPU thread. To get results faster or to perform more complex tasks, we need packages that can take advantage of the multiple CPU cores on our machine to reduce processing time.


H2O


H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more. H2O also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities. https://www.h2o.ai/products/h2o/


H2O, in addition to being a modeling package, can also be used from R to take advantage of the machine’s resources: its backend is Java-based software whose primary purpose is to be a distributed, parallel, in-memory processing engine using multiple threads and multiple nodes.

With the H2O library in R, we can define the number of threads in the thread pool, which relates very closely to the number of CPUs used. By default, it uses all available CPUs on the host (nthreads = -1); to set it manually, pass a positive integer that specifies the number of CPUs directly (e.g. nthreads = 2).
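As a rough illustration of choosing a thread count by hand, the sketch below derives a value from the number of detected cores and passes it to h2o.init. The helper names choose_nthreads and start_h2o are hypothetical and not part of the h2o package; only h2o.init and its nthreads argument come from the text above.

```r
library(parallel)

# Hypothetical helper (not part of h2o): pick a thread count that leaves
# `reserve` cores free for the R session and the operating system.
choose_nthreads <- function(reserve = 1L) {
  cores <- detectCores()
  if (is.na(cores)) cores <- 1L  # detectCores() may return NA on some platforms
  max(1L, as.integer(cores) - as.integer(reserve))
}

# Start (or connect to) a local H2O instance with an explicit thread count.
# Defined but not called here, because it needs the h2o package and Java.
start_h2o <- function(nthreads = choose_nthreads()) {
  h2o::h2o.init(nthreads = nthreads)
}

choose_nthreads()
```

Leaving one core free is only one possible policy; on a dedicated machine nthreads = -1 may be the simpler choice.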


Concurrent Model Building


We can combine the capabilities of H2O with R’s parallel computing packages in many ways to solve complex problems faster. To exemplify this combination, let’s look at two scenarios that differ in the problem addressed and in the parallel method used.

In the first scenario, the plan is to build several models for the same problem, varying the hyperparameters used in training, in order to choose the model with the best test performance. In the second scenario, the plan is to create several models for different outcomes that all share the same knowledge data; only the response differs.

Note that only basic, necessary preprocessing steps are taken to create the models, without special treatment to reach the best possible solution. The focus is on demonstrating ideas that can serve as a basis for more complex projects.


doParallel

The doParallel package is a “parallel backend” for the foreach package. It provides a mechanism needed to execute foreach loops in parallel. The foreach package must be used in conjunction with a package such as doParallel, in order to execute code in parallel. The user must register a parallel backend to use, otherwise foreach will execute tasks sequentially, even when the %dopar% operator is used. https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf


It’s very simple to run a quick example comparing the elapsed time of a sequential loop with that of a 2-core parallel loop: just replace %do% with %dopar%.
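The listing for the sequential loop does not appear in the text; only its timing does. A minimal sketch of what it presumably looks like, assuming it mirrors the parallel loop shown further down with %do% in place of %dopar%, is given here. The wrapper function run_sequential and its n_iter parameter are illustrative additions, and the call uses a reduced iteration count so it finishes quickly (the article's loops run 1000 iterations).

```r
library(foreach)

# Sequential counterpart of the parallel loop: same body, %do% instead of %dopar%.
# n_iter is an illustrative parameter; the article's loops use 1000 iterations.
run_sequential <- function(n_iter = 1000) {
  foreach(i = seq_len(n_iter)) %do% {
    mean(rnorm(i * 1000))
  }
}

# Reduced size so the sketch runs in a moment; timings will differ from the article's.
timing <- system.time(res <- run_sequential(n_iter = 50))
print(timing)
```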


user   system elapsed
55.838  1.049  52.408

library(doParallel)
cluster <- makeCluster(2)
registerDoParallel(cluster)
system.time({
  res <- foreach(i = 1:1000) %dopar% {
    mean(rnorm(i * 1000))
  }
})
stopCluster(cluster)

user   system elapsed
5.187   0.231  27.896

Now let’s move on to more complex scenarios. In the first scenario, the well-known Titanic dataset (https://www.kaggle.com/c/titanic/data) serves as an example in which we use a parallel loop registered by the doParallel package, together with H2O, to create different models at the same time. It is a supervised learning problem with a binomial classification response (survived: true or false), and we use a gradient boosting machine as the learning algorithm.

All created models vary in one hyperparameter value (max_depth). The final step is to evaluate all 20 models on a test dataset and return the value of a performance metric (AUC, area under the curve) for each one. The seed value is the same for both the data split and the model starting condition, so the results of the different approaches are exactly the same.


tryCatch(
  expr = {
    run_args <- commandArgs(trailingOnly = TRUE)
    stopifnot(length(run_args) > 0)
    
    # H2O Init Port argument
    arg_port <- as.numeric(run_args[1])
    # Max Depth argument
    arg_max_depth <- as.numeric(run_args[2])
    
    # Init H2O
    suppressPackageStartupMessages(library(h2o))
    invisible(suppressWarnings(capture.output(
      h2o.init(
        port = arg_port,
        nthreads = 1,
        max_mem_size = "1G"
      )
    )))
    h2o.no_progress()
    
    # Preparation
    df_path <- "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
    df <- h2o.importFile(path = df_path)
    response <- "survived"
    predictors <- setdiff(names(df), c(response, "name"))
    df[[response]] <- as.factor(df[[response]])
    
    splits <- h2o.splitFrame(
      data = df,
      ratios = c(0.6, 0.2),
      destination_frames = c("TRAIN", "VALID", "TEST"),
      seed = 1234
    )
    
    # Modeling
    gbm <- h2o.gbm(
      x = predictors,
      y = response,
      training_frame = "TRAIN",
      validation_frame = "VALID",
      ntrees = 10000,
      max_depth = arg_max_depth,
      sample_rate = 0.8,
      col_sample_rate = 0.8,
      learn_rate = 0.05,
      learn_rate_annealing = 0.99,
      stopping_rounds = 5,
      stopping_tolerance = 1e-4,
      stopping_metric = "AUC",
      score_tree_interval = 10,
      seed = 1234
    )
    
    # Evaluation
    model_performance <- h2o.performance(gbm, newdata = h2o.getFrame("TEST")) 
    auc <- h2o.auc(model_performance)
    cat(auc)
  },
  warning = function(w) {
    cat(0)
  },
  error = function(e) {
    cat(0)
  },
  finally = {
    h2o.removeAll()
    h2o.shutdown(prompt = FALSE)
  } 
)

The script performs the following steps:

Receives 2 arguments (the H2O port and the max depth);
Starts an H2O instance with 1 thread on the port given as an argument;
Imports the Titanic dataset into the started H2O cluster;
Splits the dataset into train, validation and test sets;
Creates a single GBM model with the defined hyperparameters and the max_depth given as an argument;
Evaluates the model on the test set and collects the calculated AUC metric;
Clears all objects from memory in the H2O instance;
Shuts down the H2O instance;
Prints the AUC value, or zero in case of any error or warning.

Based on this script, we can build several ways of executing it through a system call. Common problems in using and understanding the doParallel package originate in objects, connections or packages that are not present in the environment where the foreach loop runs. The system call method avoids such problems, since the script loads every resource it needs. Although it works, it is not necessarily the best solution for all cases: a shared operation, such as loading a large dataset, is repeated in every execution, which is inefficient.
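The missing-object problem described above can be reproduced with any socket cluster. The sketch below uses the base parallel package rather than foreach, with made-up names (scale_factor, scale_value), and shows the explicit export that makes a master-side object visible to the workers. With foreach, the .export and .packages arguments play a similar role.

```r
library(parallel)

scale_factor <- 10                           # exists only in the master session
scale_value <- function(x) x * scale_factor  # relies on that object

cluster <- makeCluster(2)

# Workers start as fresh R sessions; without this export they may fail with
# "object 'scale_factor' not found".
clusterExport(cluster, varlist = "scale_factor", envir = environment())

res <- parLapply(cluster, 1:4, scale_value)
stopCluster(cluster)

unlist(res)
```

The system call approach sidesteps this bookkeeping entirely, at the price of repeating the setup work in every child process.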


Sequential Loop


library(data.table)
library(foreach)
H2O_INIT_PORT <- 40000


iterations <- foreach(i = 1:20) %do% {
  iteration_port <- H2O_INIT_PORT + i * 3
  iteration_max_depth <- 3 +  i
  model_cmd <- sprintf("Rscript --vanilla titanic_gbm.R %s %s",
                       iteration_port,
                       iteration_max_depth)
  auc <- system(model_cmd, intern = TRUE)
  list(max_depth =  iteration_max_depth,
       auc = auc)
}
results <- rbindlist(iterations)
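Because system(..., intern = TRUE) captures what the script prints, each AUC in the loop above arrives as a character string, and a failed run shows up as 0. A small sketch of the post-processing that may be needed before comparing models is given here in base R; the AUC values are invented purely for illustration.

```r
# Hypothetical raw output, shaped like `iterations` in the loop above:
# each AUC is a character string captured from the script's stdout.
iterations <- list(
  list(max_depth = 4, auc = "0.91"),
  list(max_depth = 5, auc = "0.93"),
  list(max_depth = 6, auc = "0")   # 0 is what the script prints on failure
)

results <- do.call(
  rbind,
  lapply(iterations, as.data.frame, stringsAsFactors = FALSE)
)
results$auc <- as.numeric(results$auc)

failed <- results[results$auc == 0, ]   # runs that hit a warning or an error
ok     <- results[results$auc > 0, ]
best   <- ok[which.max(ok$auc), ]       # model with the highest test AUC

best$max_depth
```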

Parallel Loop with doParallel


run_args <- commandArgs(trailingOnly = TRUE)
stopifnot(length(run_args) > 0)
arg_cores <- as.numeric(run_args[1])


library(data.table)
library(doParallel)
H2O_INIT_PORT <- 40000


cluster <- makeCluster(arg_cores)
registerDoParallel(cluster)
iterations <- foreach(i = 1:20) %dopar% {
  iteration_port <- H2O_INIT_PORT + i * 3
  iteration_max_depth <- 3 +  i
  model_cmd <- sprintf("Rscript --vanilla titanic_gbm.R %s %s",
                       iteration_port,
                       iteration_max_depth)
  auc <- system(model_cmd, intern = TRUE)
  list(max_depth = iteration_max_depth,
       auc = auc)
}
stopCluster(cluster)


results <- rbindlist(iterations)

H2O Grid Search


Those more familiar with the capabilities of H2O may be wondering at this point why we don’t simply use a grid search for this problem. It’s the same concept: building models over a set of hyperparameters and then selecting the best one.

With the previous example, however, we have more control over the combination of parameters for each single model, unlike the “RandomDiscrete” strategy in the grid search, and we can skip combinations that we know may not work, which the “Cartesian” strategy would still build. And remember, this is just a simple example of a different approach, meant to trigger more elaborate ideas and solutions.

Anyway, let’s add grid search as another type of execution and compare it, expecting shorter execution times. In the latest versions of H2O, users can specify a “parallelism” parameter when running a grid search. A value of 1 indicates sequential building (default); a value of 0 is used for adaptive parallelism; and any value higher than 1 sets the exact number of models built in parallel.


suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(h2o))
run_args <- commandArgs(trailingOnly = TRUE)
stopifnot(length(run_args) > 0)
arg_parallelism <- as.numeric(run_args[1])


# Init H2O
invisible(suppressWarnings(capture.output(
  h2o.init(
    port = 40000,
    nthreads = arg_parallelism,
    max_mem_size = "5G"
  )
)))
h2o.no_progress()


# Preparation
df_path <- "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
df <- h2o.importFile(path = df_path)
response <- "survived"
predictors <- setdiff(names(df), c(response, "name"))
df[[response]] <- as.factor(df[[response]])


splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.6, 0.2),
  destination_frames = c("TRAIN", "VALID", "TEST"),
  seed = 1234
)


# Modeling
grid <- h2o.grid(
  hyper_params = list(max_depth = seq(4, 23, 1)),
  search_criteria = list(strategy = "Cartesian"),
  algorithm = "gbm",
  grid_id = "gbm_grid",
  x = predictors,
  y = response,
  training_frame = h2o.getFrame("TRAIN"),
  validation_frame = h2o.getFrame("VALID"),
  parallelism = arg_parallelism,
  ntrees = 10000,
  learn_rate = 0.05,
  learn_rate_annealing = 0.99,
  sample_rate = 0.8,
  col_sample_rate = 0.8,
  stopping_rounds = 5,
  stopping_tolerance = 1e-4,
  stopping_metric = "AUC",
  score_tree_interval = 10,
  seed = 1234
)


eval_auc <- function(model_id) {
  model <- h2o.getModel(model_id)
  model_performance <- h2o.performance(model, newdata = h2o.getFrame("TEST"))
  auc <- h2o.auc(model_performance)
  max_depth <- model@allparameters$max_depth
  list(max_depth = max_depth,
       auc = auc)
}


results <- lapply(unlist(grid@model_ids), eval_auc)
results <- rbindlist(results)
results <- results[order(max_depth)]


h2o.removeAll()
h2o.shutdown(prompt = FALSE)

Parallel Loop with unique H2O instance


Since the grid search does not repeat operations such as loading the dataset in every iteration, we defined a different solution that tries to replicate its processing mode. Only one H2O instance is started, with the same settings used previously in the grid search. The dataset is loaded once, and only afterwards is the (adapted) model creation script called.

This adapted script still calls h2o.init, but with the port of the instance already started and the option startH2O = FALSE, so that it connects to that instance instead of starting a new one.
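The adapted script (titanic_gbm_init_false.R) is not listed in the text. A minimal sketch of its connection step might look like the following, where connect_h2o is a hypothetical helper name and the frame keys are the ones set by the parent script's destination_frames. The function is only defined here, since calling it requires a running H2O instance.

```r
# Sketch of the connection step in the adapted script (hypothetical helper).
# startH2O = FALSE makes h2o.init() attach to an instance that is already
# running on `port` instead of launching a new one.
connect_h2o <- function(port) {
  h2o::h2o.init(port = port, startH2O = FALSE)
  # Frames created by the parent script are looked up by their keys.
  list(
    train = h2o::h2o.getFrame("TRAIN"),
    valid = h2o::h2o.getFrame("VALID"),
    test  = h2o::h2o.getFrame("TEST")
  )
}
```

Presumably the adapted script should also skip the h2o.removeAll() and h2o.shutdown() calls of the original, because they would affect the shared instance that the other iterations still depend on.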


run_args <- commandArgs(trailingOnly = TRUE)
stopifnot(length(run_args) > 0)
arg_cores <- as.numeric(run_args[1])


library(data.table)
library(doParallel)
suppressPackageStartupMessages(library(h2o))


# Init H2O
H2O_INIT_PORT <- 40000
invisible(suppressWarnings(capture.output(
  h2o.init(
    port = H2O_INIT_PORT,
    nthreads = arg_cores,
    max_mem_size = "5G"
  )
)))
h2o.no_progress()


# Preparation
df_path <- "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
df <- h2o.importFile(path = df_path)
response <- "survived"
df[[response]] <- as.factor(df[[response]])


splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.6, 0.2),
  destination_frames = c("TRAIN", "VALID", "TEST"),
  seed = 1234
)


cluster <- makeCluster(arg_cores)
registerDoParallel(cluster)
iterations <- foreach(i = 1:20) %dopar% {
  iteration_max_depth <- 3 +  i
  model_cmd <- sprintf("Rscript --vanilla titanic_gbm_init_false.R %s %s",
                       H2O_INIT_PORT,
                       iteration_max_depth)
  auc <- system(model_cmd, intern = TRUE)
  list(max_depth =  iteration_max_depth,
       auc = auc)
}
results <- rbindlist(iterations)


stopCluster(cluster)
h2o.removeAll()
h2o.shutdown(prompt = FALSE)

Iterations performed:

All steps & different H2O instances (each iteration loads the dataset and starts an H2O instance)

Sequential (1 thread for each iteration)
Parallel (2 cores, 1 thread for each iteration)
Parallel (5 cores, 1 thread for each iteration)

Optimized steps & a single H2O instance (the dataset is loaded once and only one instance is started)

Grid (1 thread for all iterations, parallelism level 1)
Grid (2 threads for all iterations, parallelism level 2)
Grid (5 threads for all iterations, parallelism level 5)
Sequential (1 core, 1 thread for all iterations)
Parallel (2 cores, 2 threads for all iterations)
Parallel (5 cores, 5 threads for all iterations)

As expected, the H2O grid search is already well optimized and is a very good solution. Still, as mentioned, we can implement more specific model creation processes with similar times, or even create a solution that brings together the best of both worlds: a parallel loop using H2O grid search to create several models. There are many possibilities for making better use of our machine’s resources, speeding up analyses or even reducing costs in services such as Azure and AWS.


mclapply

Another, simpler and more direct way to enable parallel processing is to use the mclapply function from the parallel package. The mclapply function is basically a parallelized version of the lapply function.


The first two arguments to mclapply() are exactly the same as they are for lapply(). However, mclapply() has further arguments (that must be named), the most important of which is the mc.cores argument, which you can use to specify the number of processors/cores you want to split the computation across. For example, if your machine has 4 cores, you might specify mc.cores = 4 to parallelize your operation across 4 cores (although this may not be the best idea if you are running other operations in the background besides R). https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
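To make the correspondence with lapply concrete, here is a small self-contained sketch. The function slow_square is a made-up stand-in for real work, and the core count of 2 is an arbitrary choice; because mclapply relies on forking, the sketch falls back to one core on Windows.

```r
library(parallel)

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for real work
  x^2
}

serial <- lapply(1:4, slow_square)
forked <- mclapply(1:4, slow_square, mc.cores = n_cores)

# Same inputs, same function: both calls return the same list.
identical(serial, forked)
```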


As an example to demonstrate mclapply, we will use the digit recognition dataset from the Kaggle competition (https://www.kaggle.com/c/digit-recognizer).

In this demonstration we’ll transform the multinomial classification response (label: [0–9]) into several binomial classifications (isX: True or False): instead of creating a single model to identify the ten digits, we’ll create 10 models (one per digit), each returning True or False probabilities, using the H2O AutoML functionality.

In this case, the division may not be the best solution, but in a scenario where you really need to create individual classification models from the same knowledge data, with the same data processing and modeling for different outputs, this can be a good and faster option.


library(parallel)
library(data.table)
suppressPackageStartupMessages(library(h2o))
h2o.no_progress()


#Preparation
TRAIN <- read.csv("mnist_train.csv")
TRAIN <- as.data.table(TRAIN)
TEST <- read.csv("mnist_test.csv")
H2O_INIT_PORT <- 40000


predict_digit <- function(digit) {
  # Init H2O (port from argument)
  invisible(suppressWarnings(capture.output(
    h2o.init(
      port = H2O_INIT_PORT + digit * 3,
      nthreads = 1,
      max_mem_size = "1G"
    )
  )))
  
  # Creating a personalized digit iteration train dataset with the appropriate response column
  response <- paste0("is", digit)
  it_train <- copy(TRAIN)
  it_train <-
    it_train[, eval(response) := ifelse(label == digit, TRUE, FALSE)]
  it_train_hf <- as.h2o(it_train)
  predictors <- setdiff(names(it_train), c(response, "label"))
  it_test_hf <- as.h2o(TEST)
  
  # Modeling - AutoML
  automl <- h2o.automl(
    x = predictors,
    y = response,
    training_frame = it_train_hf,
    nfolds = 5,
    exclude_algos = c("GLM", "DeepLearning"),
    balance_classes = TRUE,
    max_runtime_secs = 300,
    seed = 1234
  )
  
  # Predict using automl leader model (best auc in xval)
  prediction <- h2o.predict(automl@leader, it_test_hf)
  prediction <- as.data.table(prediction)
  names(prediction)[names(prediction) == "TRUE."] <- response
  # Clear Iteration and shutdown h2o instance
  h2o.removeAll()
  h2o.shutdown(prompt = FALSE)
  # returns the probability from being the iteration digit of every row from test dataset
  return(prediction[, ..response])
}

We will use 10 cores, which means an H2O AutoML run is processed for all digits at the same time. As you can see in the script above, each iteration starts an H2O instance with 1 thread on its own port. Each iteration creates as many models as possible within a maximum of 300 seconds and then uses the best model (ranked automatically by H2O by the highest cross-validation AUC) to predict the labels of the test dataset.


system.time({
  predictions <- mclapply(
    X = 0:9,
    FUN = predict_digit,
    mc.preschedule = FALSE,
    mc.cores = 10,
    mc.cleanup = TRUE,
    mc.silent = FALSE
  )
})


predictions <- do.call(cbind, predictions)
predictions[, Label := colnames(.SD)[max.col(.SD, ties.method = "first")]]
predictions[, Label := gsub("is", "", Label)]

user    system   elapsed
87.369   9.970   408.359

The value shown for elapsed time is a good argument to confirm the advantage of carrying out a concurrent implementation for these types of cases. Had we not done so, the time would have been nine to ten times higher.


Although not important for the demonstration, the final result is a table with the probability of each digit for every row. With this table, you can build a final column with a decision supported by all models.
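How such a final column can be derived is sketched here in base R with a tiny, made-up probability table. The column names follow the isX convention used in the script, and the probabilities are invented for illustration only. For each row, max.col picks the column with the highest probability, and stripping the "is" prefix gives the predicted digit.

```r
# Illustrative per-digit probabilities for three test rows (made-up values).
probs <- data.frame(
  is0 = c(0.1, 0.8, 0.2),
  is1 = c(0.7, 0.1, 0.3),
  is2 = c(0.2, 0.1, 0.5)
)

# Index of the most probable column in each row; ties go to the first column.
best_col <- max.col(as.matrix(probs), ties.method = "first")

# Turn the winning column name ("is1") into the digit label ("1").
label <- sub("^is", "", names(probs)[best_col])
label
```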


Final Remarks

And that’s it: a small demonstration, among many other possibilities, of concurrent processing in R with the H2O features. As demonstrated, H2O already provides a very effective way to build several machine learning models while bypassing the limitations of R. If your purpose is to build a single model, selected for its performance, the grid search is a solid option and you may not need an extra R package to do it faster. But if your problem requires several different models, with similar data and preparation, to be built, evaluated and deployed automatically, these parallel R packages are well worth exploring and implementing. Depending on the volume of data, you will have to choose between a single H2O instance with more resources and multiple independent instances with fewer resources, weighing the pros and cons (time, job conflicts, memory management, stability, …).

If the problem is even more complex, you can dockerize your solution and use Azure or AWS Batch services, for example. The doAzureParallel package from Azure is very similar to doParallel and lets us distribute our processing across several machines at the same time. Perhaps a topic for an upcoming post!
