Big Data in Practice: Spark 2.0.0 + Adult Dataset + Logistic Regression Model Test (Scala)

Testing a logistic regression model on the Adult dataset with Spark


【Pre】


1. Download the Adult dataset (adult.data; save a local copy as adult.csv) from: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
2. Copy the file to the server at /usr/app/spark-2.0.0-bin-hadoop2.7/data/mllib/adult.csv, then upload it to HDFS:

hdfs dfs -put /usr/app/spark-2.0.0-bin-hadoop2.7/data/mllib/adult.csv /data/mllib/adult.csv

【1】 Imports needed for the workflow

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vectors
import scala.collection.mutable.ListBuffer
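
The snippets below are written as if running in spark-shell, where spark and the $"column" syntax are already available. When packaging the code as a standalone application (step 【9】), a SparkSession must be created first; a minimal sketch:

  // Minimal sketch for a standalone app (unnecessary inside spark-shell).
  val spark = SparkSession.builder()
    .appName("LogisticAdult")
    .getOrCreate()
  import spark.implicits._ // enables the $"column" syntax used later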

【2】 Load the Adult dataset

   val d = spark.read.format("csv").option("header", "false").option("inferSchema", "true").load("/data/mllib/adult.csv")
   val dataDF = d.toDF("age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income")
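
An optional sanity check confirms that the schema was inferred and the columns renamed as expected:

   // Optional: verify the inferred schema and preview a few rows.
   dataDF.printSchema()
   dataDF.show(5)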

【3】 Encode the categorical columns with OneHotEncoder


    1. Columns that need one-hot encoding

  val categoricalColumns = Array("workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country")

    2. Use a Pipeline to organize the machine-learning workflow (each indexer/encoder pair becomes a PipelineStage; a small illustration follows the loop below)

    val stagesArray = new ListBuffer[PipelineStage]()
    for (cate <- categoricalColumns) {
      val indexer = new StringIndexer().setInputCol(cate).setOutputCol(s"${cate}Index")
      val encoder = new OneHotEncoder().setInputCol(indexer.getOutputCol).setOutputCol(s"${cate}classVec")
      stagesArray.append(indexer, encoder)
    }
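
To see what one indexer/encoder pair produces, here is a minimal sketch applied to the "sex" column alone (illustration only; the loop above does the same for every categorical column). StringIndexer maps each category to a frequency-ordered numeric index, and OneHotEncoder turns that index into a sparse 0/1 vector:

    // Sketch: inspect the output of a single indexer/encoder pair.
    val sexIndexed = new StringIndexer()
      .setInputCol("sex").setOutputCol("sexIndex")
      .fit(dataDF).transform(dataDF)
    new OneHotEncoder()
      .setInputCol("sexIndex").setOutputCol("sexclassVec")
      .transform(sexIndexed)
      .select("sex", "sexIndex", "sexclassVec")
      .show(5)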

【4】 Merge all features into a single vector with VectorAssembler

    val numericCols = Array("age","fnlwgt","education-num","capital-gain","capital-loss","hours-per-week")
    val assemblerInputs = categoricalColumns.map(_ + "classVec") ++ numericCols
    val assembler = new VectorAssembler().setInputCols(assemblerInputs).setOutputCol("features")
    stagesArray.append(assembler)
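
VectorAssembler simply concatenates its input columns, in order, into one "features" vector; a quick check of what will be assembled (sketch):

    println(s"${assemblerInputs.length} assembler inputs: ${assemblerInputs.mkString(", ")}")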

【5】 Convert the target column income into a numeric label (NumericType)


  val indexer_string = new StringIndexer().setInputCol("income").setOutputCol("label")
  stagesArray.append(indexer_string)

【6】 Run all the PipelineStages as a Pipeline

  val pipeline = new Pipeline()
  pipeline.setStages(stagesArray.toArray)

    1. fit() computes whatever feature statistics the stages need

  val pipelineModel = pipeline.fit(dataDF)

    2. transform() actually transforms the features

  val dataset = pipelineModel.transform(dataDF)

    3. Show the processed dataset

  dataset.show()
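
Only two of the resulting columns matter to the classifier; a focused view (sketch):

  dataset.select("features", "label").show(5, truncate = false)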

【7】 Train the logistic regression model and make predictions


    1. Randomly split the data into training and test sets; fixing the seed makes the split reproducible

  val Array(trainingDF, testDF) = dataset.randomSplit(Array(0.9, 0.1), seed = 12345)   
  println(s"trainingDF size=${trainingDF.count()},testDF size=${testDF.count()}")

    2. Define the model and set its parameters (iteration count, regularization, etc.)

  val lrModel = new LogisticRegression()
    .setMaxIter(10000)
    .setRegParam(0.001)
    .setElasticNetParam(0.8)
    .setLabelCol("label")
    .setFeaturesCol("features")
    .fit(trainingDF)
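
With elasticNetParam = 0.8 the penalty is mostly L1, which drives many coefficients to exactly zero; a quick look at the fitted model (sketch):

  println(s"intercept = ${lrModel.intercept}")
  println(s"non-zero coefficients = ${lrModel.coefficients.toArray.count(_ != 0.0)} of ${lrModel.coefficients.size}")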

  val predictions = lrModel.transform(testDF)
    .select($"label", $"features", $"rawPrediction", $"probability", $"prediction")
  predictions.show()
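
The prediction column holds the numeric indices produced by the label StringIndexer in step 【5】. IndexToString can map them back to the original income strings; a sketch, assuming pipelineModel from step 【6】 (its last stage is the fitted label indexer):

  import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}
  val labelIndexer = pipelineModel.stages.last.asInstanceOf[StringIndexerModel]
  val converter = new IndexToString()
    .setInputCol("prediction").setOutputCol("predictedIncome")
    .setLabels(labelIndexer.labels)
  converter.transform(predictions).select("prediction", "predictedIncome").show(5)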

【8】 Model evaluation

  val evaluator = new BinaryClassificationEvaluator() // default metric: areaUnderROC
  val auc = evaluator.evaluate(predictions)
  println(s"areaUnderROC = $auc")
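
Note that BinaryClassificationEvaluator reports areaUnderROC rather than classification accuracy. If plain accuracy is wanted, it can be computed directly from the predictions (sketch):

  val total = predictions.count().toDouble
  val correct = predictions.filter($"label" === $"prediction").count()
  println(s"accuracy = ${correct / total}")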

【9】 Package the jar, upload it to the server, and submit it to Spark

  spark-submit --class main.LogisticAdult --master spark://hadoop11:7077 LogisticAdult.jar

【Problem】


(1) Column income must be of type NumericType but was actually of type StringType
Solution: income (the label) is a String and must be converted to a numeric type, which is exactly what the StringIndexer in step 【5】 does. The trace below comes from fitting with setLabelCol("income") directly:

scala> val lrModel = new LogisticRegression().setLabelCol("income").setFeaturesCol("features").fit(trainingDF)
java.lang.IllegalArgumentException: requirement failed: Column income must be of type NumericType but was actually of type StringType.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:73)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
  at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
  at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
  at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
  at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
  at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
  ... 48 elided

Related notes:

LogisticRegression model parameters:
elasticNetParam: Double. ElasticNet mixing parameter, in the range [0, 1] (0 = pure L2 penalty, 1 = pure L1).
featuresCol: String. Name of the features column.
fitIntercept: Boolean. Whether to fit an intercept term.
labelCol: String. Name of the label column.
maxIter: Int. Maximum number of iterations (>= 0).
predictionCol: String. Name of the prediction output column.
probabilityCol: String. Name of the column holding the predicted class conditional probabilities.
regParam: Double. Regularization parameter (>= 0).
standardization: Boolean. Whether to standardize the training features before fitting the model.
threshold: Double. Threshold for binary classification prediction, in the range [0, 1].
thresholds: Array[Double]. Per-class thresholds for multiclass prediction, used to adjust the predicted probability of each class.
tol: Double. Convergence tolerance for the iterative algorithm.
weightCol: String. Name of the instance weight column.
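
For example, a few of these parameters applied through their setters (sketch; the values are for illustration only):

  val lr = new LogisticRegression()
    .setFitIntercept(true)
    .setStandardization(true)
    .setThreshold(0.5)
    .setTol(1e-6)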
