Spark Machine Learning: Decision Trees, Two Cases (07)

This article introduces the decision tree algorithm for classification tasks, including methods such as ID3 and C4.5. It discusses the strengths of decision trees, such as simple principles, multi-class capability, and fast prediction, as well as their weaknesses, such as sensitivity to input features and degraded accuracy when there are many classes. The Iris dataset is used to demonstrate training and evaluation with Spark MLlib, followed by a second case that predicts gender from height and weight.

Decision Trees

A decision tree gets its name because its decision-making structure is the same as the tree data structure. The algorithm can be used for both classification and regression, though classification is the more common use. A chain of if-then rules is essentially a simple decision tree.
There are many ways to build decision trees, such as ID3 and C4.5, which use the concept of entropy from information theory.
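
For reference, these are the standard information-theoretic quantities that ID3-style trees rely on (added here for completeness; they are not in the original post). The entropy of the label distribution in a dataset $D$ with class proportions $p_k$ is

$$H(D) = -\sum_{k} p_k \log_2 p_k,$$

and the information gain from splitting $D$ on a feature $A$ is

$$IG(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v),$$

where $D_v$ is the subset of $D$ in which $A$ takes value $v$. ID3 picks the split with the highest information gain; C4.5 normalizes it into a gain ratio.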

Advantages

The principle is simple and the algorithm is easy to implement.
Decision trees naturally support multi-class classification.
They can make predictions over large data sources in a relatively short time, with good predictive performance.

Disadvantages

They place relatively high demands on the input features; in many cases the data needs preprocessing.
When there are many classes to distinguish, the probability of error increases.

Example

The figure below shows a decision tree for deciding whether to approve a loan.
[Figure: loan-approval decision tree]
There are usually many input features to choose from. One criterion for picking a feature to split on is how well it separates the classes: the more representative and discriminative a variable is, the better it works as a decision node. This discriminative power is measured by information gain, illustrated in the sketch below.
[Figure: choosing split features by information gain]
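
A minimal sketch of how information gain would be computed by hand, in plain Scala with a made-up boolean feature (the entropy helper and the sample values below are illustrative only; Spark's DecisionTreeClassifier does this internally):

// entropy of a label sequence, in bits
def entropy(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values
    .map(g => g.size / n)
    .map(p => -p * math.log(p) / math.log(2))
    .sum
}

// (featureValue, label) pairs for a hypothetical "income above threshold" feature
val samples = Seq((true, 1), (true, 1), (false, 0), (false, 1), (false, 0))

val total    = entropy(samples.map(_._2))
val weighted = samples.groupBy(_._1).values
  .map(g => g.size.toDouble / samples.size * entropy(g.map(_._2)))
  .sum
println(s"information gain = ${total - weighted}")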

Dataset (iris.data)

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

Read the data

// read the raw Iris CSV; with no header row, Spark names the string columns _c0 .. _c4
val file = spark.read.format("csv")
  .load("src/main/scala1/coding-271/ch8/iris/iris.data")

Parse the rows, map the species strings Iris-setosa / Iris-versicolor / Iris-virginica to labels 0, 1, 2, attach a random number used to shuffle the rows, and drop label 2 so this first run is a binary classification problem.

import spark.implicits._
import org.apache.spark.sql.functions.col
import scala.util.Random

val random = new Random()
val data = file.map(row => {
  // map the species string to a numeric label
  val label = row.getString(4) match {
    case "Iris-setosa" => 0
    case "Iris-versicolor" => 1
    case "Iris-virginica" => 2
  }
  (row.getString(0).toDouble,
    row.getString(1).toDouble,
    row.getString(2).toDouble,
    row.getString(3).toDouble,
    label,
    random.nextDouble())          // random column used only for shuffling
}).toDF("_c0", "_c1", "_c2", "_c3", "label", "rand")
  .sort("rand")                   // shuffle the rows
  .filter(col("label") =!= 2)     // keep only labels 0 and 1 (binary case)
data.show()

Assemble the feature columns into a single vector column

import org.apache.spark.ml.feature.VectorAssembler

// combine the four numeric columns into one "features" vector column
val assembler = new VectorAssembler().setInputCols(Array("_c0","_c1", "_c2", "_c3")).setOutputCol("features")
val dataset = assembler.transform(data)
dataset.show(false)

Split into training and test sets

val Array(train, test) = dataset.randomSplit(Array(0.8, 0.2), 13L)

Train with the decision tree algorithm

import org.apache.spark.ml.classification.DecisionTreeClassifier

// fit a decision tree on the training split and predict on the test split
val dt = new DecisionTreeClassifier().setFeaturesCol("features").setLabelCol("label")
val model = dt.fit(train)
val result = model.transform(test)
result.show()

Create an evaluator to measure prediction quality

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
 .setLabelCol("label")
 .setPredictionCol("prediction")
 .setMetricName("accuracy")

val accuracy = evaluator.evaluate(result)
println(accuracy) // print the accuracy
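
To see what the model actually learned, the trained DecisionTreeClassificationModel can be printed; this small addition is not in the original code:

// print the tree size and its full if/else structure
println(s"depth = ${model.depth}, nodes = ${model.numNodes}")
println(model.toDebugString)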

Case 2
Predict gender from height and weight.
The pattern parsing function below reads records like [161.2, 51.6], [167.5, 59.0] (height, weight), extracts them with a regular expression, and then converts them into numeric tuples.

import scala.util.Random

// parse a file of "[height, weight]" pairs and tag each record with a category label
val pattern = (filename: String, category: Int) => {
  val patternString = "\\[(.*?)\\]".r
  val rand = new Random()
  spark.sparkContext.textFile(filename)
    .flatMap(text => patternString.findAllIn(text.replace(" ", "")))
    .map(text => {
      val pairwise = text.substring(1, text.length - 1).split(",")
      (pairwise(0).toDouble, pairwise(1).toDouble, category, rand.nextDouble())
    })
}

val male = pattern("src/main/scala1/coding-271/ch8/gender/male.txt", 1)     // male records
val female = pattern("src/main/scala1/coding-271/ch8/gender/gender.txt", 2) // female records

Convert to DataFrames

val maleDF = spark.createDataFrame(male)
  .toDF("height", "weight", "category", "rand")

val femaleDF = spark.createDataFrame(female)
  .toDF("height", "weight", "category", "rand")

maleDF.show(false)
femaleDF.show(false)

Union the male and female tables

val dataset = maleDF.union(femaleDF).sort("rand") // shuffle by sorting on the random column
dataset.show(false)

Assemble the features into a vector column

val assembler = new VectorAssembler()
  .setInputCols(Array("height", "weight"))
  .setOutputCol("features")

val transformedDataset = assembler.transform(dataset)
transformedDataset.show()

Generate training and test sets

val Array(train, test) = transformedDataset.randomSplit(Array(0.8, 0.2))

Train a model with the decision tree algorithm

val classifier = new DecisionTreeClassifier()
  .setFeaturesCol("features")
  .setLabelCol("category")

val model = classifier.fit(train)
val result = model.transform(test)

Create an evaluator to measure prediction quality

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("category")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(result)
println(accuracy) // accuracy of the gender model
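
As a quick sanity check of the trained model, you can run it on a couple of hand-made records; the heights and weights below are invented for illustration and are not part of the original dataset:

// hypothetical new samples: (height, weight)
val newSamples = Seq((172.0, 70.0), (158.0, 48.0)).toDF("height", "weight")

// reuse the same VectorAssembler so the feature layout matches training
val newFeatures = assembler.transform(newSamples)
model.transform(newFeatures)
  .select("height", "weight", "prediction") // prediction: 1.0 = male, 2.0 = female
  .show()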