Problems encountered while generating PMML models with spark-ml and jpmml-sparkml

Requirement: use PMML (Predictive Model Markup Language) to deploy machine learning models across platforms.

Introduction to PMML: see reference link 1.

How to export a model to PMML format: see reference link 3.

 

1. The approach that works: put all of the data transformations and the model into a single Pipeline; this produces a valid PMML file.

The code is as follows:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature._
import org.apache.spark.sql.SaveMode

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.jpmml.model.JAXBUtil
import org.jpmml.sparkml.PMMLBuilder
import org.dmg.pmml.PMML
import javax.xml.transform.stream.StreamResult
import java.io.FileOutputStream

import org.apache.spark.ml.linalg.DenseVector

import scala.collection.mutable.ArrayBuffer

object Test extends App{

    println("666666")
    val spark = SparkSession.builder().master("local").appName("TestPmml").getOrCreate()

    val str2Int: Map[String, Double] = Map(
        "Iris-setosa" -> 0.0,
        "Iris-versicolor" -> 1.0,
        "Iris-virginica" -> 2.0
    )
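    // map the Iris species label to a numeric class code via a UDF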
    val str2double = (x: String) => str2Int(x)
    val myFun = udf(str2double)
    val data = spark.read.textFile("...\\scalaProgram\\PMML\\iris1.txt").toDF()
        .withColumn("splitcol", split(col("value"), ","))
        .select(
            col("splitcol").getItem(0).as("sepal_length"),
            col("splitcol").getItem(1).as("sepal_width"),
            col("splitcol").getItem(2).as("petal_length"),
            col("splitcol").getItem(3).as("petal_width"),
            col("splitcol").getItem(4).as("label")
        )
        .withColumn("label", myFun(col("label")))
        .select(
            col("sepal_length").cast(DoubleType),
            col("sepal_width").cast(DoubleType),
            col("petal_length").cast(DoubleType),
            col("petal_width").cast(DoubleType),
            col("label").cast(DoubleType)
        )

    val data1 = data.na.drop()
    println("data: " + data1.count().toString)
    val schema = data1.schema
    println("data1 schema: " + schema)

    // merge multi-feature to vector features
    val features: Array[String] = Array("sepal_length", "sepal_width", "petal_length", "petal_width")
    val assembler: VectorAssembler = new VectorAssembler().setInputCols(features).setOutputCol("features")
    
    val rf: RandomForestClassifier = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setMaxDepth(8)
        .setNumTrees(30)
        .setSeed(1234)
        .setMinInfoGain(0)
        .setMinInstancesPerNode(1)


    val pipeline = new Pipeline().setStages(Array(assembler, rf))

    val pipelineModel = pipeline.fit(data1)
    println("success fit......")
    val pmml = new PMMLBuilder(schema, pipelineModel).build()
    val targetFile = "...\\scalaProgram\\PMML\\pipemodel.pmml"
    val fos: FileOutputStream = new FileOutputStream(targetFile)
    val fout: StreamResult = new StreamResult(fos)
    JAXBUtil.marshalPMML(pmml, fout)
    println("pmml success......")
}

Result: the fit succeeds and pipemodel.pmml is written out.
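Not part of the original post, but for completeness: a minimal sketch of consuming the exported file from the JVM with the JPMML evaluator library (org.jpmml:pmml-evaluator, 1.4.x-era API where fields are keyed by FieldName). The input row and the reuse of the post's placeholder path are my assumptions.

import java.io.File
import scala.collection.JavaConverters._

import org.jpmml.evaluator.{EvaluatorUtil, LoadingModelEvaluatorBuilder}

object TestScoring extends App {
    // load the pipemodel.pmml written by the code above
    val evaluator = new LoadingModelEvaluatorBuilder()
        .load(new File("...\\scalaProgram\\PMML\\pipemodel.pmml"))
        .build()
    evaluator.verify()

    // one hypothetical Iris observation, keyed by the original column names
    val row = Map(
        "sepal_length" -> 5.1, "sepal_width" -> 3.5,
        "petal_length" -> 1.4, "petal_width" -> 0.2)

    // let the evaluator coerce each raw value to its declared field type
    val arguments = evaluator.getInputFields.asScala.map { field =>
        field.getName -> field.prepare(row(field.getName.getValue))
    }.toMap.asJava

    // decodeAll unwraps the internal value objects into plain Java types
    println(EvaluatorUtil.decodeAll(evaluator.evaluate(arguments)))
}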

2. The VectorAssembler step in the code above merges several Double columns into a single Vector column. Now, due to a business requirement, we are handed a Vector column directly, and we need to train a model on it and save the model in PMML format.

Analysis: the data fed into model training must be of Vector type, which is why the code above uses VectorAssembler to merge the attribute columns into one column. Since we already have Vector-typed data, it seems we only need to put the model itself into Pipeline().setStages(). Let's try it.

The code:

object Test extends App{

    println("666666")
    val spark = SparkSession.builder().master("local").appName("TestPmml").getOrCreate()

    // convert features string to vector-data
    val string2vector = (x: String) => {
        val length = x.length()
        val a = x.substring(1, length - 1).split(",").map(_.toDouble)
        Vectors.dense(a)
    }
    val str2vec = udf(string2vector)
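    // load previously saved data whose "features" column is a String like "[5.1,3.5,1.4,0.2]"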
    val newdata1 = spark.read.load("...\\scalaProgram\\PMML\\data1.parquet")

    val newdata2 = newdata1.withColumn("features", str2vec(col("features")))
    println("newdata2: "+newdata2.schema)

    val rf: RandomForestClassifier = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setMaxDepth(8)
        .setNumTrees(30)
        .setSeed(1234)
        .setMinInfoGain(0)
        .setMinInstancesPerNode(1)


    val pipeline = new Pipeline().setStages(Array(rf))

    val pipelineModel = pipeline.fit(newdata2)
    println("success fit......")

    val pmml = new PMMLBuilder(newdata2.schema, pipelineModel).build()
    val targetFile = "...\\scalaProgram\\PMML\\pipemodel.pmml"
    val fos: FileOutputStream = new FileOutputStream(targetFile)
    val fout: StreamResult = new StreamResult(fos)
    JAXBUtil.marshalPMML(pmml, fout)
    println("pmml success......")

}

Running this throws an error. The root cause is that the datatypes in the schema handed to PMMLBuilder only support string, integral, double or boolean, which means the raw data passed into pipeline.fit() must already be one of these types.
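To make the constraint concrete, here is a hypothetical pre-check (my own sketch, not from the post; the supported set below is a rough approximation of "string, integral, double or boolean"):

import org.apache.spark.sql.types._

// flag any schema field PMMLBuilder would reject
val supported: Set[DataType] = Set(StringType, IntegerType, LongType, DoubleType, BooleanType)
newdata2.schema.fields
    .filterNot(f => supported.contains(f.dataType))
    .foreach(f => println(s"column not accepted by PMMLBuilder: ${f.name} (${f.dataType})"))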

The features column of the newdata2 we pass in, however, is of VectorUDT type. This suggests a workaround: convert that column of newdata2 from VectorUDT to String and save it in parquet format, then read it back from parquet so that the column comes out as String; run it through a VectorAssembler stage and put that stage, together with the model stage, into pipeline.setStages(). That looks like it might work too. Let's try it.

Note: parquet is typed; whatever type a column is written with is the type it is read back as.
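Before the full code, a minimal sketch of the round-trip just described (the vec2string UDF, the path, and the SaveMode are my assumptions, not from the original post). It writes the vector in the same "[...]" shape that the string2vector UDF above parses back:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SaveMode

// vector -> "[v1,v2,...]" so the column round-trips through parquet as a String
val vec2string = udf((v: Vector) => v.toArray.mkString("[", ",", "]"))

newdata2.withColumn("features", vec2string(col("features")))
    .write.mode(SaveMode.Overwrite)
    .parquet("...\\scalaProgram\\PMML\\data_str.parquet")

// read back: the features column now comes out as StringType
val strData = spark.read.parquet("...\\scalaProgram\\PMML\\data_str.parquet")
println(strData.schema)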

The code:

object TestPmml extends App{
    val spark = SparkSession.builder().master("local").appName("TestPmml").getOrCreate()

    val str2Int: Map[String, Double] = Map(
        "Iris-setosa" -> 0.0,
        "Iris-versicolor" -> 1.0,
        "Iris-virginica" -> 2.0
    )
    val str2double = (x: String) => str2Int(x)
    val myFun = udf(str2double)
    val data = spark.read.textFile("...\\scalaProgram\\PMML\\iris1.txt").toDF()
        .withColumn("splitcol", split(col("value"), ","))