Spark(day10) -- MLlib(2)

一.Principle of the decision tree classification algorithm

1)Overview

The decision tree is a widely used classification algorithm.
Compared with Bayesian algorithms, the advantage of the decision tree is that constructing it does not require any domain knowledge or parameter setting.

In practical applications, the decision tree is well suited to exploratory knowledge discovery.


2)Algorithm thought

The key to the decision tree classification algorithm is to construct an optimal decision tree from the "prior" (training) data and then use it to predict the category of unknown data.


Decision tree: a tree structure (either a binary tree or a non-binary tree).
Each non-leaf node represents a test on a feature attribute, each branch represents the output of that test over a range of values, and each leaf node holds a category.

Making a decision with a decision tree is the process of starting from the root node, testing the corresponding feature attribute of the sample, choosing an output branch according to its value, and repeating this until a leaf node is reached; the category stored in that leaf node is the decision result.
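To make this concrete, here is a minimal Scala sketch of such a tree and of the decision process just described (the types and the one-level example tree are illustrative only, not part of Spark MLlib):

// A non-leaf node carries a test that maps a sample's feature values to a branch index;
// a leaf node just stores a category.
sealed trait Node
case class Leaf(category: String) extends Node
case class Internal(test: Map[String, Double] => Int, children: Vector[Node]) extends Node

// Start at the root, apply each internal node's test, follow the chosen branch,
// and stop at a leaf: the leaf's category is the decision result.
def classify(node: Node, sample: Map[String, Double]): String = node match {
  case Leaf(category)           => category
  case Internal(test, children) => classify(children(test(sample)), sample)
}

// e.g. a one-level tree for the apple example in the next section:
// val tree = Internal(s => if (s("A0") > 0.5) 0 else 1, Vector(Leaf("good"), Leaf("bad")))
// classify(tree, Map("A0" -> 1.0, "A1" -> 0.0))  // "good"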


3)The decision tree structure sample

The sample has two attributes: A0 indicates whether the apple is red, and A1 indicates whether the apple is big.
We want to build a decision tree that automatically judges whether an apple is good based on this data sample.

Since the data in this example has only two attributes, we can enumerate all the decision trees that might be constructed, as shown in the figure below:


Obviously, the decision tree on the left, which splits on A0 (red), is better than the one on the right, which splits on A1 (size).
Of course, that judgment is only intuitive.
Intuition is obviously not something a program can implement, so we need a quantitative way to evaluate the performance of the two trees.
The quantitative method used to evaluate decision trees is to calculate the information entropy gain of each split.

If splitting the data on an attribute decreases the information entropy the most, then that attribute is the optimal choice for the split.


4)The basis for the selection of attributes.

Entropy: the measure of information introduced by Shannon, the founder of information theory.
In short, entropy is the degree of "disorder" or "chaos".
Let's understand it through a calculation:
1. Entropy of the original sample data:
Sample size: 4
Good apples: 2
Bad apples: 2
Entropy: -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1
An information entropy of 1 represents the most chaotic, most disordered state.
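As a quick sanity check of this number, the entropy formula can be written in a few lines of Scala (a sketch only; log base 2, with the usual convention that a class with zero samples contributes nothing):

// Entropy of a two-class sample given the counts of positive and negative examples.
def log2(x: Double): Double = math.log(x) / math.log(2)
def entropy(pos: Int, neg: Int): Double = {
  val total = (pos + neg).toDouble
  Seq(pos, neg).filter(_ > 0).map { c =>
    val p = c / total
    -p * log2(p)            // each class contributes -p * log2(p)
  }.sum
}

entropy(2, 2)               // = 1.0 for the original sample of 2 good and 2 bad apples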


2. Calculating the entropy gain of the two decision trees.

1. Tree 1 splits first on A0; the information entropy of each child node is calculated as follows:


The leaf node containing samples 0 and 1 has 2 positive cases and 0 negative cases.

The information entropy is: e1 = -(2/2 * log(2/2) + 0/2 * log(0/2)) = 0.


The leaf node containing samples 2 and 3 has 0 positive cases and 2 negative cases.

Information entropy is: e2 = -(0/2 * log(0/2) + 2/2 * log(2/2)) = 0.


Therefore, the information entropy after splitting on A0 is the weighted sum of the information entropy of each child node: E = e1*2/4 + e2*2/4 = 0.


The information entropy gain is G(S, A0) = S - E = 1 - 0 = 1.


In fact, a decision tree leaf node means all of its samples already belong to the same category, so its information entropy must be 0.


2. Tree 2 splits first on A1; the information entropy of each child node is calculated as follows:


The child node containing samples 0 and 2 has 1 positive case and 1 negative case.

The information entropy is: e1 = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1.


The child node containing samples 1 and 3 has 1 positive case and 1 negative case.

The information entropy is: e2 = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1.


Therefore, the information entropy after splitting on A1 is the weighted sum of the information entropy of each child node: E = e1*2/4 + e2*2/4 = 1.

In other words, it is the same as not splitting at all!


The information entropy gain is G(S, A1) = S - E = 1 - 1 = 0.


So, before each split, we only need to find the split with the maximum information entropy gain.


As the data is split on the decision attributes, its disorder becomes lower and lower, that is, the information entropy gets smaller and smaller.


Go through the attributes in the data one by one:

Compare the information entropy gain obtained by splitting on each attribute, select the attribute with the maximum gain as the basis of the first split, then select the second attribute in the same way, and so on.
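The whole worked example above can be reproduced with a short, self-contained Scala sketch (the four apples are hand-encoded; this only illustrates the entropy-gain criterion and is not how MLlib's DecisionTree is implemented):

// The four apples from the sample: (A0 = red?, A1 = big?, good?)
val apples = Seq(
  (true,  true,  true),   // 0: red,   big,   good
  (true,  false, true),   // 1: red,   small, good
  (false, true,  false),  // 2: green, big,   bad
  (false, false, false)   // 3: green, small, bad
)

// Entropy of a subset, computed from its good/bad counts (log base 2, 0*log(0) treated as 0).
def subsetEntropy(s: Seq[(Boolean, Boolean, Boolean)]): Double = {
  val good = s.count(_._3)
  Seq(good, s.size - good).filter(_ > 0).map { c =>
    val p = c.toDouble / s.size
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information entropy gain of a split = root entropy - weighted sum of the child entropies.
def gain(attr: ((Boolean, Boolean, Boolean)) => Boolean): Double = {
  val (left, right) = apples.partition(attr)
  subsetEntropy(apples) -
    (left.size * subsetEntropy(left) + right.size * subsetEntropy(right)) / apples.size
}

println(gain(_._1))   // split on A0 (red):  G(S, A0) = 1.0
println(gain(_._2))   // split on A1 (size): G(S, A1) = 0.0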


二.Mahout

http://mahout.apache.org/docs/latest/


Run the Mahout synthetic control canopy clustering example (input is read from /user/root/testdata on HDFS):

hadoop fs -mkdir -p /user/root/testdata

hadoop fs -copyFromLocal /root/synthetic_control.data testdata

hadoop jar /root/usr/local/mahout/mahout-examples-0.13.0-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job


chmod +x ./bin/mahout

./bin/mahout clusterdump -i output/clusters-0-final -p output/clusteredPoints -o test.txt


三.SparkSql with Hive

1. hive-site.xml

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.16.100:9083</value>
  </property>
</configuration>

 

2. cp hive-site.xml into Spark's conf directory

3. bin/hive --service metastore

4. ./bin/spark-sql
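The program below reads two Hive tables, tb_stock and stock_detail, from a database named mllib. Their real schemas are not shown in this note; the following is only a sketch of what the query assumes (column names come from the query, while the types and storage format are guesses), executed through the same HiveContext:

// Assumed table layout, for illustration only; adjust to the real data.
sqlContext.sql("create database if not exists mllib")
sqlContext.sql("use mllib")
sqlContext.sql(
  "create table if not exists tb_stock(orderid string, orderlocation string) " +
  "row format delimited fields terminated by ','")
sqlContext.sql(
  "create table if not exists stock_detail(orderid string, itemqty int, itemamount double) " +
  "row format delimited fields terminated by ','")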


package SparkMLlib

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object KmeansFromHive {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val conf = new SparkConf().setAppName("KmeansFromHive").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("set spark.sql.shuffle.partitions = 1")
    sqlContext.sql("use mllib")
    // Aggregate total quantity and amount per order location from the two Hive tables
    val data = sqlContext.sql("select a.orderlocation,sum(b.itemqty) totalqty,sum(b.itemamount) totalamount from tb_stock a join stock_detail b on a.orderid = b.orderid group by a.orderlocation")
    // The rows must be turned into vectors before the model can be trained
    val parseData = data.rdd.map {
      case Row(_, totalqty, totalamount) =>
        // Extract the features
        val features = Array[Double](totalqty.toString.toDouble, totalamount.toString.toDouble)
        Vectors.dense(features)
    }
    // Train the model with the K-means algorithm
    val numCluster = 4
    val maxIteration = 40
    val model = KMeans.train(parseData, numCluster, maxIteration)
    // Use the model to predict a cluster for each row of our data
    val res = data.rdd.map {
      case Row(orderlocation, totalqty, totalamount) =>
        // Extract the features
        val features = Array[Double](totalqty.toString.toDouble, totalamount.toString.toDouble)
        // Turn the array into an MLlib vector
        val linesVector = Vectors.dense(features)
        val prediction = model.predict(linesVector)
        orderlocation + " " + totalqty + " " + totalamount + " " + prediction
    }
    res.collect().foreach(println)
  }
}
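As a follow-up inside main (right after the predictions are printed), the quality of the clustering can be checked and the model persisted for reuse. This assumes Spark 1.4+ where KMeansModel exposes computeCost and save, and the HDFS path is only an example:

// Within-set sum of squared errors: lower means tighter clusters.
val wssse = model.computeCost(parseData)
println(s"WSSSE = $wssse")
// Persist the trained model (example path); it can be reloaded later with KMeansModel.load.
model.save(sc, "/user/root/kmeans_model")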

