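1. Understanding association rules
The result of a market basket analysis is a set of association rules that describe patterns of relationships among items. A typical rule can be written as: {peanut butter, jam} -> {bread}
In plain language, this rule says: a customer who buys peanut butter and jam is also likely to buy bread. What we analyze is the relationships between things, that is, whether certain items are associated with one another.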
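As a small illustration of how such a rule can be read and applied, here is a minimal plain-Scala sketch. The names `RuleSketch`, `Rule`, and `suggest` are made up for this example and do not come from any library; the logic simply checks whether a basket contains everything on the left-hand side of the rule.

```scala
object RuleSketch {
  // A rule "antecedent -> consequent": baskets that contain every
  // antecedent item tend to contain the consequent items as well.
  final case class Rule(antecedent: Set[String], consequent: Set[String])

  // The example rule from the text: {peanut butter, jam} -> {bread}
  val rule: Rule = Rule(Set("peanut butter", "jam"), Set("bread"))

  // If the basket contains the whole antecedent, return the consequent
  // items that are not in the basket yet; otherwise return nothing.
  def suggest(rule: Rule, basket: Set[String]): Set[String] =
    if (rule.antecedent.subsetOf(basket)) rule.consequent -- basket
    else Set.empty

  def main(args: Array[String]): Unit = {
    println(suggest(rule, Set("peanut butter", "jam", "milk"))) // Set(bread)
    println(suggest(rule, Set("jam", "milk")))                  // Set()
  }
}
```

A rule only says "likely", not "always"; how likely is what the confidence measure quantifies.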
2. Test data
a,b,c
a,b,d
b,a,d
b,c,e
b,d,e
a,b,c
a,b,e
a,b,e
a,b,c
a,b,c
a,b
a,d
b,d
b,e
c,d,e
a,e
b,d
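The items a, b, c, d, and e can be thought of as products: product a, product b, and so on.
Each line above is one purchase transaction. To analyze how the products relate to one another we need two concepts, support and confidence; look them up (for example, on Baidu) if they are unfamiliar.
Support: the support of an itemset or rule is how frequently it occurs in the data.
Confidence: a measure of a rule's predictive power or accuracy.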
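For example, there are 17 transactions here. To find the support of a, we count the transactions that contain a: there are 11, so the support is 11/17. The confidence formula is confidence(x -> y) = support(x, y) / support(x), which means: among the transactions that contain x, the fraction that also contain y, i.e. the conditional probability of y given x.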
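To make the two definitions concrete, here is a minimal plain-Scala sketch (no Spark needed) that computes support and confidence over the 17 transactions listed above. The object name `BasketMetrics` and its helper functions are illustrative names chosen for this example, not part of any library.

```scala
object BasketMetrics {
  // The 17 sample transactions from the test data above.
  val transactions: Seq[Set[String]] = Seq(
    "a,b,c", "a,b,d", "b,a,d", "b,c,e", "b,d,e", "a,b,c",
    "a,b,e", "a,b,e", "a,b,c", "a,b,c", "a,b", "a,d",
    "b,d", "b,e", "c,d,e", "a,e", "b,d"
  ).map(_.split(",").toSet)

  // Number of transactions that contain every item in `items`.
  def count(items: Set[String]): Int =
    transactions.count(t => items.subsetOf(t))

  // support(X) = count(X) / total number of transactions
  def support(items: Set[String]): Double =
    count(items).toDouble / transactions.size

  // confidence(X -> Y) = support(X and Y together) / support(X)
  def confidence(x: Set[String], y: Set[String]): Double =
    support(x ++ y) / support(x)

  def main(args: Array[String]): Unit = {
    println(s"count(a)   = ${count(Set("a"))}")      // 11
    println(s"count(a,b) = ${count(Set("a", "b"))}") // 9
    println(s"support(a) = ${support(Set("a"))}")
    println(s"confidence(a -> b) = ${confidence(Set("a"), Set("b"))}")
  }
}
```

On this data, a occurs in 11 transactions and a and b occur together in 9, so support(a) = 11/17 (about 0.647) and confidence(a -> b) = 9/11 (about 0.818).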
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.mage.ml.association_rules

import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

/**
 * Association rules example.
 */
object FPGrowthExample {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .master("local")
      .appName("FPGrowth")
      .getOrCreate()
    import spark.implicits._

    // Load the data: one transaction per line
    val shoppings: Dataset[String] = spark.read.textFile("shopping_cart")
    // Split each line on commas and convert to a DataFrame with an array column "items"
    val df: DataFrame = shoppings.map(_.split(",")).toDF("items")

    val growth = new FPGrowth().setItemsCol("items")
    // Set the minimum confidence and minimum support thresholds
    growth.setMinConfidence(0.8)
    growth.setMinSupport(0.3)
    // Set the number of partitions
    growth.setNumPartitions(2)

    val model: FPGrowthModel = growth.fit(df)
    // Print the frequent itemsets
    model.freqItemsets.show()
    // Print the association rules that meet the confidence and support thresholds
    model.associationRules.show()

    spark.stop()
  }
}
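To see what the two thresholds in the code above (minimum support 0.3, minimum confidence 0.8) mean for this data set, here is a brute-force sketch in plain Scala. It is not FP-Growth and does not use Spark: it simply enumerates every itemset over the five items, which is only practical for tiny data like this. The object name `ThresholdCheck` is made up for this example, and restricting rules to a single-item consequent is an assumption made to mirror the kind of rules Spark's `associationRules` reports; the result has not been compared against actual Spark output.

```scala
object ThresholdCheck {
  // The 17 sample transactions from the test data above.
  val transactions: Seq[Set[String]] = Seq(
    "a,b,c", "a,b,d", "b,a,d", "b,c,e", "b,d,e", "a,b,c",
    "a,b,e", "a,b,e", "a,b,c", "a,b,c", "a,b", "a,d",
    "b,d", "b,e", "c,d,e", "a,e", "b,d"
  ).map(_.split(",").toSet)

  val minSupport = 0.3
  val minConfidence = 0.8

  // Fraction of transactions that contain every item in `items`.
  def support(items: Set[String]): Double =
    transactions.count(t => items.subsetOf(t)).toDouble / transactions.size

  // Every non-empty itemset whose support reaches minSupport.
  val frequent: List[Set[String]] =
    transactions.flatten.toSet.subsets()
      .filter(s => s.nonEmpty && support(s) >= minSupport)
      .toList

  // Rules X -> y built from frequent itemsets of size >= 2,
  // kept only when support(X with y) / support(X) reaches minConfidence.
  val rules: List[(Set[String], String, Double)] = for {
    itemset <- frequent if itemset.size > 1
    y <- itemset.toList
    x = itemset - y
    conf = support(itemset) / support(x)
    if conf >= minConfidence
  } yield (x, y, conf)

  def main(args: Array[String]): Unit = {
    frequent.foreach(s => println(s"frequent: ${s.mkString("{", ",", "}")}"))
    rules.foreach { case (x, y, c) =>
      println(s"rule: ${x.mkString("{", ",", "}")} -> {$y} (confidence $c)")
    }
  }
}
```

With 17 transactions, a support of 0.3 means an itemset must appear in at least 6 of them. By this brute-force count, the single items a, b, c, d, e and the pair {a, b} qualify, and the only rule that reaches confidence 0.8 is {a} -> {b} (9/11, about 0.818); the reverse rule {b} -> {a} has confidence 9/14 and is dropped.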
Reference: https://blog.csdn.net/qq_41455420/article/details/89532574