Notes on Using Sparkling Water (Sparkling Water Examples)

RSparkling > The best of R + H2O + Spark
What do you get from R + H2O + Spark?
R is great for statistical computing, graphics, and small-scale data preparation. H2O is a distributed machine learning platform designed for scale and speed. Spark is great for very fast data processing at very large scale. Combining all three gives you data science, machine learning, and data processing in one environment.

rsparkling: The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high-performance, distributed machine learning algorithms on Spark, using R.

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.2.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.

H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Sparkling Water integrates H2O’s fast, scalable machine learning engine with Spark. With Sparkling Water you can publish Spark data structures (RDDs, DataFrames, Datasets) as H2O frames and vice versa, use a DSL to pass Spark data structures as input to H2O’s algorithms, build ML applications that use both the Spark and H2O APIs, and use Sparkling Water directly from PySpark through its Python interface.

Installation Packages:
sparklyr 0.6.2
rsparkling 0.2.1
Apache Spark 2.1.x
Sparkling Water 2.1.14
RStudio (https://github.com/rstudio): optional, but great to have
Quick Start Script:

Sys.setenv(SPARK_HOME='/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.7')
options(rsparkling.sparklingwater.version = "2.1.14") 
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
library(rsparkling)
library(sparklyr)
sc = spark_connect(master = "local", version = "2.1.0")
sc
h2o_context(sc, strict_version_check = FALSE)
library(h2o)
h2o.clusterInfo()
h2o_flow(sc)
spark_disconnect(sc)

Important Settings for your environment:
master = "local" > Starts a local Spark cluster
master = "yarn-client" > Starts a cluster managed by YARN
To get a list of supported Sparkling Water versions, call h2o_release_table()
When you call spark_connect(), a new tab appears in RStudio
The "Spark" tab is used to launch the SparkUI
The "Log" tab is used to collect Spark logs
If sparklyr and your Spark installation disagree on the version, pass the exact version as shown above; otherwise you do not need to pass version.
Startup Script with config parameters to set executor settings:
These are the settings you will use to get your rsparkling/Spark session up and running in RStudio:

Sys.setenv(SPARK_HOME='/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.7')
options(rsparkling.sparklingwater.version = "2.1.14") 
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
library(rsparkling)
library(sparklyr)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances <- 3  # this will create 3 node instances
sc <- spark_connect(master = "local", config = config, version = '2.1.0')
sc
h2o_context(sc, strict_version_check = FALSE)
library(h2o)
h2o.clusterInfo()
spark_disconnect(sc)

Accessing SparkUI:
You can open the Spark UI by clicking the SparkUI button on the Spark tab.

Accessing H2O FLOW UI:
To open the H2O Flow UI in your browser, pass the Spark connection to h2o_flow():

h2o_flow(sc)

Building an H2O GLM model using rsparkling + sparklyr + H2O:
In this example we ingest the well-known mtcars (“CARS & MPG”) dataset and build a GLM (Generalized Linear Model) that predicts miles per gallon (mpg) from a car's weight (wt) and number of cylinders (cyl):

options(rsparkling.sparklingwater.location = "/tmp/sparkling-water-assembly_2.11-2.1.7-all.jar")
library(rsparkling)
library(sparklyr)
library(h2o)
sc <- spark_connect(master = "local", version = "2.1.0")
# Copy mtcars into Spark, replacing any existing table with the same name
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
# Convert the Spark DataFrame into an H2O frame
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)
# Fit a GLM that predicts mpg from weight and cylinder count
mtcars_glm <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = mtcars_h2o, lambda_search = TRUE)
mtcars_glm
spark_disconnect(sc)

That’s all, enjoy!!

Posted on October 20, 2017
Scoring H2O MOJO models with Spark UDF and Scala
A major advantage of H2O machine learning is that trained models can be exported as Java code, so you can use them for scoring on any platform that supports Java. H2O algorithms generate POJO and MOJO models, which do not require the H2O runtime for scoring; this makes them easy to deploy in an enterprise setting. You can learn more about POJO and MOJO models in the H2O documentation.

The following Spark Scala code shows how to score an H2O MOJO model: it loads the model from disk, fills a RowData object with one row of input, and passes that row to H2O's EasyPredictModelWrapper class:

import _root_.hex.genmodel.GenModel
import _root_.hex.genmodel.MojoModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.easy.prediction

// Load Mojo
val mojo = MojoModel.load("/Users/avkashchauhan/learn/customers/mojo_bin/gbm_model.zip")
val easyModel = new EasyPredictModelWrapper(mojo)

// Get Mojo Details
var features = mojo.getNames.toBuffer

// Creating the row
val r = new RowData
r.put("AGE", "68")
r.put("RACE", "2")
r.put("DCAPS", "2")
r.put("VOL", "0")
r.put("GLEASON", "6")

// Performing the Prediction
val prediction = easyModel.predictBinomial(r).classProbabilities

In the code above, the MOJO model is stored on the local file system as gbm_model.zip and loaded from that path inside the Scala code. The full execution of the code above is available here.
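
The row in the Scala code is filled by hand, one put call per column. When records arrive as delimited text, the same column-to-value mapping could be built in a loop. The following is a minimal sketch in plain Java, assuming a comma-separated header line and data line with no quoted fields. CsvRowBuilder and fromCsv are hypothetical names, not part of h2o-genmodel, and a plain Map stands in for RowData so the snippet runs without H2O on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of h2o-genmodel): turns one delimited record
// into the column-name -> string-value mapping that the example fills by hand.
class CsvRowBuilder {

    // Pairs each header name with the value at the same position.
    static Map<String, Object> fromCsv(String headerLine, String dataLine) {
        String[] names = headerLine.split(",", -1);
        String[] values = dataLine.split(",", -1);
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                "Expected " + names.length + " values but got " + values.length);
        }
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            // Values stay as strings, matching the r.put("AGE", "68") style.
            row.put(names[i].trim(), values[i].trim());
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row =
            fromCsv("AGE,RACE,DCAPS,VOL,GLEASON", "68,2,2,0,6");
        System.out.println(row); // prints {AGE=68, RACE=2, DCAPS=2, VOL=0, GLEASON=6}
    }
}
```

With h2o-genmodel available, the entries could presumably be copied into a RowData with putAll, on the assumption that RowData behaves like a standard map of column names to values.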

The following simple Java code shows how to perform the same scoring with an H2O MOJO model from a standalone Java application:

import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;
import java.util.Arrays;

public class main {
  public static void main(String[] args) throws Exception {
    // Load the MOJO once and reuse it for both metadata and scoring
    hex.genmodel.GenModel mojo = MojoModel.load("gbm_prostate_model.zip");
    EasyPredictModelWrapper model = new EasyPredictModelWrapper(mojo);

    System.out.println("isSupervised : " + mojo.isSupervised());
    System.out.println("Columns Names : " + Arrays.toString(mojo.getNames()));
    System.out.println("Number of columns : " + mojo.getNumCols());
    System.out.println("Response ID : " + mojo.getResponseIdx());
    System.out.println("Response Name : " + mojo.getResponseName());

    for (int i = 0; i < mojo.getNumCols(); i++) {
      String[] domainValues = mojo.getDomainValues(i);
      System.out.println(Arrays.toString(domainValues));
    }

    RowData row = new RowData();
    row.put("AGE", "68");
    row.put("RACE", "2");
    row.put("DCAPS", "2");
    row.put("VOL", "0");
    row.put("GLEASON", "6");

    BinomialModelPrediction p = model.predictBinomial(row);
    System.out.println("Has penetrated the prostatic capsule (1=yes; 0=no): " + p.label);
    System.out.print("Class probabilities: ");
    for (int i = 0; i < p.classProbabilities.length; i++) {
      if (i > 0) {
        System.out.print(",");
      }
      System.out.print(p.classProbabilities[i]);
    }
    System.out.println("");
  }
}
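
The print loop at the end of the program joins the class probabilities with commas by hand. As a small sketch of the same post-processing step, the helper below joins a probability array with a stream and also returns the index of the largest probability. ProbabilityFormat, join, and argMax are hypothetical names, the sample values are made up, and nothing here calls an H2O API. Note that the index of the largest probability is not guaranteed to match the label H2O reports for a binomial model, which may apply its own decision threshold.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper for post-processing a probability array such as
// p.classProbabilities; it does not depend on h2o-genmodel.
class ProbabilityFormat {

    // Joins the probabilities with commas, like the manual print loop.
    static String join(double[] probabilities) {
        return Arrays.stream(probabilities)
                .mapToObj(value -> Double.toString(value))
                .collect(Collectors.joining(","));
    }

    // Returns the index of the largest probability (first index wins on ties).
    static int argMax(double[] probabilities) {
        if (probabilities.length == 0) {
            throw new IllegalArgumentException("probabilities must not be empty");
        }
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] sample = {0.25, 0.75}; // made-up values for illustration
        System.out.println("Class probabilities: " + join(sample));
        System.out.println("Most likely class index: " + argMax(sample));
    }
}
```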

That's it, enjoy!!