Spark ML logistic regression demo

This post collects a few small examples from my Spark ML learning notes. All of the examples were run in spark-shell. For the full tutorial, see the official documentation:
http://spark.apache.org/docs/latest/ml-pipeline.html

Estimator, Transformer, and Param code example:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row
import spark.implicits._

// Create the SparkSession (in spark-shell a session named spark already exists, and getOrCreate() simply returns it)
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

// Prepare the training set
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Prepare the test set
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Create a LogisticRegression instance, then inspect and set its parameters
val lr = new LogisticRegression()
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")
lr.setMaxIter(10).setRegParam(0.01)

// Fit model1 on the training data and inspect the parameters it was trained with
val model1 = lr.fit(training)
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)
// Specify parameters with a ParamMap (the later put() of maxIter overrides the value given in the constructor)
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)
// Two ParamMaps can also be combined
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")
val paramMapCombined = paramMap ++ paramMap2
// Fit model2 using the combined ParamMap
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Make predictions on the test set with model2
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }
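
As an optional follow-up (not part of the original demo), the labeled test set can also be scored with Spark ML's BinaryClassificationEvaluator; the column names below are the Spark ML defaults.

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Minimal evaluation sketch: area under ROC for model2 on the test set.
// "rawPrediction" and "label" are the default column names produced/expected by Spark ML.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
val auc = evaluator.evaluate(model2.transform(test))
println(s"Test AUC = $auc")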

Pipeline code example:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Row
import spark.implicits._

// Create the SparkSession (again, in spark-shell getOrCreate() returns the existing session)
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

// Prepare the training documents
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Prepare the test documents (unlabeled)
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Configure the ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr (logistic regression)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
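
To build intuition for what the first two stages produce, they can also be applied by hand outside the pipeline (an optional sketch, not part of the original demo):

// Tokenizer and HashingTF are plain Transformers, so transform() can be called directly.
val tokenized = tokenizer.transform(training)   // adds the "words" column
val hashed = hashingTF.transform(tokenized)     // adds the "features" column
hashed.select("text", "words", "features").show(false)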

// Fit the pipeline to the training documents; the result is a PipelineModel, i.e. a Transformer
val model = pipeline.fit(training)
// Save the fitted model to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
// Save the unfit pipeline itself
pipeline.write.overwrite().save("/tmp/unfit-lr-model")
// Load the saved model back when it is needed
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
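
The reloaded PipelineModel behaves exactly like the original model, so it can be used for prediction in the same way (a quick check using the same test DataFrame):

// The reloaded model can be used just like the one it was saved from:
sameModel.transform(test).select("id", "prediction").show()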

// Make predictions on the test documents
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
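
To look at the fitted logistic regression weights inside the PipelineModel, the third stage can be cast back to a LogisticRegressionModel (a small sketch, not part of the original demo):

import org.apache.spark.ml.classification.LogisticRegressionModel

// stages(2) is the lr stage because it was the third element passed to setStages above.
val lrModel = model.stages(2).asInstanceOf[LogisticRegressionModel]
println(s"Coefficients: ${lrModel.coefficients}  Intercept: ${lrModel.intercept}")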