Apriori on Spark

Evan_Gu

Category: Spark. Tags: Apriori, frequent itemsets

Article link: https://blog.csdn.net/gdp12315_gu/article/details/50589930

2014 National University Cloud Computing Contest, Skills Competition

A Parallel Algorithm for Mining K-Frequent Itemsets
 Environment: the program must run on Apache Spark 1.0.1 and be written in Java or Scala.
 Problem: on the specified Chess standard dataset, with K = 8 and support = 85%, mine the 1-frequent through K-frequent itemsets.
 Dataset: the Chess standard dataset apriori_data; the download address is given on the contest website http://cloud.seu.edu.cn

 Program constraints: the program takes two input parameters. The first is the dataset path and the second is the output folder path. The results for the 1-frequent through K-frequent itemsets go into K files named result-1, result-2, …, result-8 (K=8). Each file has the following format:

a,b,c:0.85
a,b,d:0.90

The itemset and its support are separated by an ASCII colon (:). If an itemset contains more than one element, the elements are separated by ASCII commas (,).
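The support threshold and the output format described above can be sketched in a few lines of plain Scala. This is only an illustration: the object `SupportFormat` and its two helpers are hypothetical names that do not appear in the program below, and the sketch assumes that support means the fraction of transactions containing the itemset.

```scala
object SupportFormat {
  // Smallest number of transactions an itemset must appear in
  // to reach the given relative support (e.g. 0.85).
  def minSupportCount(totalTransactions: Long, minSupport: Double): Long =
    math.ceil(totalTransactions * minSupport).toLong

  // One result line: items joined by commas, then a colon,
  // then the relative support of the itemset.
  def formatLine(items: Seq[Int], count: Long, totalTransactions: Long): String =
    items.sorted.mkString(",") + ":" + count.toDouble / totalTransactions
}
```

With the 17974836 transactions assumed by the program below, `minSupportCount(17974836L, 0.85)` gives 15278611, which is the constant the program hard-codes as its support count.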

import scala.util.control.Breaks._
import scala.collection.mutable.ArrayBuffer
import java.util.BitSet
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._


object FrequentItemset {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <Datapath> <Output>")
      sys.exit(1)
    }
    // Initialize the SparkContext; master and application name are supplied by spark-submit.
    val sc = new SparkContext()
    val TRANSACTION_NUM = 17974836.0 // total number of transactions in the dataset
    val SUPPORT_NUM = 15278611 // minimum support count = ceil(TRANSACTION_NUM * 0.85)
    val K = 8


    // Drop the transaction ID, merge identical transactions, and encode each distinct
    // transaction as (BitSet of its items, number of occurrences).
    val transactions = sc.textFile(args(0))
      .map(line => line.substring(line.indexOf(" ") + 1).trim)
      .map((_, 1))
      .reduceByKey(_ + _)
      .map { case (items, count) =>
        val bitSet = new BitSet()
        items.split(" ").foreach(item => bitSet.set(item.toInt))
        (bitSet, count)
      }.cache()


    // 1-frequent itemsets. fi always holds the current frequent itemsets as (itemset, support count).
    var fi = transactions.flatMap { case (bitSet, count) =>
      val tmp = new ArrayBuffer[(String, Int)]
      for (i <- 0 until bitSet.length()) {
        if (bitSet.get(i)) tmp += ((i.toString, count))
      }
      tmp
    }.reduceByKey(_ + _).filter(_._2 >= SUPPORT_NUM).cache()
    fi.map { case (itemset, count) => itemset + ":" + count / TRANSACTION_NUM }
      .saveAsTextFile(args(1) + "/result-1")


    for (k <- 2 to K) {
      // Build the candidate k-itemsets on the driver from the frequent (k-1)-itemsets,
      // parse each candidate once, and broadcast the result to the executors.
      val candidateFI = getCandiateFI(fi.map(_._1).collect(), k)
        .map(itemset => (itemset, itemset.split(",").map(_.toInt)))
      val bccFI = sc.broadcast(candidateFI)
      val previousFI = fi
      // Count, for every candidate, the transactions that contain all of its items.
      fi = transactions.flatMap { case (bitSet, count) =>
        val tmp = new ArrayBuffer[(String, Int)]()
        bccFI.value.foreach { case (itemset, items) =>
          if (items.forall(item => bitSet.get(item))) tmp += ((itemset, count))
        }
        tmp
      }.reduceByKey(_ + _).filter(_._2 >= SUPPORT_NUM).cache()
      fi.map { case (itemset, count) => itemset + ":" + count / TRANSACTION_NUM }
        .saveAsTextFile(args(1) + "/result-" + k)
      // The previous level and the broadcast candidates are no longer needed.
      previousFI.unpersist()
      bccFI.unpersist()
    }
    sc.stop()
  }


  // Generate the candidate k-itemsets (k = tag) from the frequent (k-1)-itemsets in f.
  // Every itemset is a comma-separated string whose items are in ascending numeric order.
  def getCandiateFI(f: Array[String], tag: Int): Array[String] = {
    val separator = ","
    val frequent = f.toSet // constant-time lookups for the pruning step
    val arrayBuffer = ArrayBuffer[String]()
    for (i <- 0 until f.length; j <- i + 1 until f.length) {
      var tmp = ""
      if (2 == tag) {
        // Join step for k = 2: any two frequent items form a candidate pair.
        tmp = Array(f(i), f(j)).sortBy(_.toInt).mkString(separator)
      } else {
        // Join step for k > 2: join two (k-1)-itemsets only if they share their first k-2 items.
        val prefixI = f(i).substring(0, f(i).lastIndexOf(separator))
        val prefixJ = f(j).substring(0, f(j).lastIndexOf(separator))
        if (prefixI == prefixJ) {
          tmp = (f(i) + f(j).substring(f(j).lastIndexOf(separator)))
            .split(separator).sortBy(_.toInt).mkString(separator)
        }
      }
      if (tmp.nonEmpty) {
        val items = tmp.split(separator)
        // Prune step: keep the candidate only if every (k-1)-subset is frequent.
        val allSubsetsFrequent = items.indices.forall { skip =>
          val subItem = items.indices.filter(_ != skip).map(items(_)).mkString(separator)
          frequent.contains(subItem)
        }
        if (allSubsetsFrequent) arrayBuffer += tmp
      }
    } // for
    arrayBuffer.toArray
  }
}
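
The join and prune steps in getCandiateFI, and the containment test used for support counting, can also be written over typed collections instead of comma-separated strings. The following is a minimal sketch in plain Scala without Spark. The object `AprioriGenSketch`, the type alias `Itemset`, and both method names are hypothetical and are not part of the program above; the sketch assumes every itemset is non-empty and stores its items in ascending order.

```scala
import java.util.BitSet

object AprioriGenSketch {
  type Itemset = Vector[Int] // items in ascending order, never empty

  // Join step: merge two (k-1)-itemsets that agree on all but their last item.
  // Prune step: drop every candidate that has an infrequent (k-1)-subset.
  def candidates(frequent: Set[Itemset]): Set[Itemset] = {
    val joined = for {
      a <- frequent
      b <- frequent
      if a.init == b.init && a.last < b.last
    } yield a :+ b.last
    joined.filter { c =>
      c.indices.forall(skip => frequent.contains(c.take(skip) ++ c.drop(skip + 1)))
    }
  }

  // Support counting: a transaction encoded as a BitSet contains an itemset
  // when the bit of every item in the itemset is set.
  def containsAll(transaction: BitSet, itemset: Itemset): Boolean =
    itemset.forall(item => transaction.get(item))
}
```

For example, from the frequent pairs {1,2}, {1,3}, {2,3} and {2,4}, the join step produces {1,2,3} and {2,3,4}, and the prune step discards {2,3,4} because its subset {3,4} is not frequent, so only {1,2,3} remains as a candidate.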