Code first:
import org.apache.spark.ml.fpm.FPGrowth
import spark.implicits._

// Toy transactions: one row per user id plus the item ids in that user's basket.
val df = spark.sparkContext.makeRDD(Seq(
  (1, Seq(123, 456, 789)),
  (2, Seq(123, 456)),
  (3, Seq(456, 789)),
  (4, Seq(666)),
  (5, Seq(555, 888, 666))
)).toDF("id", "ids")   // column name must match setItemsCol below, otherwise fit throws

val fp = new FPGrowth()
  .setItemsCol("ids")
  .setMinConfidence(0.1)
  .setMinSupport(0.001)  // minimum fraction of all transactions a frequent itemset (popular item) must appear in
  .setNumPartitions(3)

val fpModel = fp.fit(df)
fpModel.freqItemsets.show(false)
fpModel.associationRules.filter("size(antecedent) = 1 and antecedent[0] = 123456").show
Question 1: tree depth (frequent-itemset length) is unbounded
The length of the frequent itemsets is not limited at all, which feels unreasonable, much like a bag-of-words model whose bags can grow arbitrarily long. If I don't manually cap the transaction length up front, the job never finishes on my laptop; I have to restrict it to size(ids) < 30, as sketched below.
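A minimal sketch of that pre-filtering, assuming the df, fp and the ids column from the snippet above (size is the Spark SQL collection-length function):

import org.apache.spark.sql.functions.size
// Drop baskets with 30 or more items before fitting; the number of candidate
// itemsets grows combinatorially with basket size, which is what blows up the run.
val trimmed = df.filter(size($"ids") < 30)
val fpModel = fp.fit(trimmed)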
Question 2: the meaning of setMinSupport(0.001) // the minimum fraction of all transactions in which a frequent itemset (popular item) must appear
Tracing from fit into genericFit, you find minCount = math.ceil(minSupport * count); following that into genFreqItems, the parameter is documented as "minCount minimum count for frequent itemsets":
// org.apache.spark.ml.fpm.FPGrowth
override def fit(dataset: Dataset[_]): FPGrowthModel = {
  transformSchema(dataset.schema, logging = true)
  genericFit(dataset)
}

private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = instrumented { instr =>
  ...
  val parentModel = mllibFP.run(items)
  ...
}

// org.apache.spark.mllib.fpm.FPGrowth
def run[Item: ClassTag](data: RDD[Array[Item]]): FPGrowthModel[Item] = {
  val count = data.count()
  val minCount = math.ceil(minSupport * count).toLong
  ...
}

/**
 * Generates frequent items by filtering the input data using minimal support level.
 * @param minCount minimum count for frequent itemsets
 * @param partitioner partitioner used to distribute items
 * @return array of frequent patterns and their frequencies ordered by their frequencies
 */
private def genFreqItems[Item: ClassTag](
    data: RDD[Array[Item]],
    minCount: Long,
    partitioner: Partitioner)
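So minSupport is a fraction of the total transaction count, not an absolute count. A quick sanity check on the toy data above (5 transactions), just restating the formula from the source:

val count = 5L                                  // df.count() on the toy data above
val minCount = math.ceil(0.001 * count).toLong  // ceil(0.005) = 1
// With minCount = 1, any item that appears even once is kept as frequent,
// so a minSupport this small effectively disables pruning on small data.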
Question 3: comparison with collaborative filtering
1. userCF recommendations are richer; because FPGrowth's tree depth was capped at N (the transaction-length cap from Question 1), its association-rule recommendations are considerably fewer.
2. The two algorithms overlap heavily; userCF's results essentially cover FPGrowth's.
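For reference, the FPGrowth side of that comparison comes from the model's transform, which appends a prediction column holding the consequents of every rule whose antecedent is contained in the user's basket (a sketch using the fpModel fitted above; the userCF list would come from a separate pipeline):

val fpRecs = fpModel.transform(df)   // adds a "prediction" column of recommended items per user
fpRecs.select("id", "ids", "prediction").show(false)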