FP-Tree: The Frequent Pattern Tree Algorithm

Reference: http://blog.csdn.net/sealyao/article/details/6460578
More data mining algorithms: https://github.com/linyiqun/DataMiningAlgorithm

Introduction

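The FP-Tree algorithm, short for Frequent Pattern Tree algorithm, mines frequent itemsets just as Apriori does. The difference is that FP-Tree improves on Apriori: Apriori generates a large number of candidate sets along the way, whereas FP-Tree finds frequent patterns without generating candidates. Once the frequent patterns have been mined, the step that produces association rules is the same as in Apriori.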

How the Algorithm Works

As the name suggests, an FP-tree ends up in the shape of a tree. The steps are:

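1. Create the root node and label it NULL.

2. Scan all the transactions and compute the total support count of each item (in the example below, the number of times each item ID occurs).

3. Read the transactions one by one. Take T1 with items 1, 2, 5: the items are sorted in descending order of total support count, so they are inserted in the order 2, 1, 5 and attached below the root.

4. Read the remaining transactions and add them to the FP-tree in the same way, following the existing path from the root where one exists and updating the support counts on the nodes.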

The result is a tree like this:

[Figure: the completed FP-tree]
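The four steps above can be sketched in a few lines of Java. This is only a minimal illustration, not the implementation given later in this post: the names FpTreeSketch, Node, SAMPLE, countSupport and build are made up for the example, and breaking ties between equally frequent items by item name is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal FP-tree construction sketch; all names are illustrative. */
class FpTreeSketch {
    /** One tree node: item name, support count, children keyed by item. */
    static final class Node {
        final String item;
        int count;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String item) {
            this.item = item;
        }
    }

    /** The nine sample transactions used throughout this post. */
    static final List<List<String>> SAMPLE = List.of(
            List.of("I1", "I2", "I5"), List.of("I2", "I4"), List.of("I2", "I3"),
            List.of("I1", "I2", "I4"), List.of("I1", "I3"), List.of("I2", "I3"),
            List.of("I1", "I3"), List.of("I1", "I2", "I3", "I5"),
            List.of("I1", "I2", "I3"));

    /** Step 2: count how many transactions contain each item. */
    static Map<String, Integer> countSupport(List<List<String>> transactions) {
        Map<String, Integer> support = new LinkedHashMap<>();
        for (List<String> t : transactions) {
            for (String item : t) {
                support.merge(item, 1, Integer::sum);
            }
        }
        return support;
    }

    /** Steps 1, 3 and 4: insert every sorted transaction below a null root. */
    static Node build(List<List<String>> transactions) {
        Map<String, Integer> support = countSupport(transactions);
        // Descending support; ties are broken by item name (an assumption).
        Comparator<String> order = Comparator
                .comparing((String item) -> support.get(item))
                .reversed()
                .thenComparing(Comparator.naturalOrder());
        Node root = new Node(null);
        for (List<String> t : transactions) {
            List<String> sorted = new ArrayList<>(t);
            sorted.sort(order);
            Node current = root;
            for (String item : sorted) {
                // Reuse the child if the prefix is shared, else create it.
                current = current.children.computeIfAbsent(item, Node::new);
                current.count++;
            }
        }
        return root;
    }

    /** Prints the tree, indenting each level by two spaces. */
    static void print(Node node, String indent) {
        for (Node child : node.children.values()) {
            System.out.println(indent + child.item + ":" + child.count);
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) {
        print(build(SAMPLE), "");
    }
}
```

For the sample data the root gets two children, I2:7 and I1:2. The I2 branch carries the transactions that contain I2, and the I1 branch holds the two transactions made of I1 and I3 only.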

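Next, build a header table that lists every item type together with its support count. It becomes very useful later on. The algorithm does not end here, though. It terminates only when the FP-tree consists of a single path, that is, when the tree is a straight line in which every node has at most one child and there are no branches. Such a path yields mined frequent patterns. Until that point, FP-trees have to be constructed recursively, as follows: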

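1. Start with I5, the item at the bottom, and add it to the suffix pattern. The suffix pattern is later combined with the patterns found in the conditional tree to form the final frequent patterns.

2. Collect the conditional pattern bases of I5, <I2, I1> and <I2, I1, I3>, each carrying the count of its I5 node. Then treat these two conditional pattern bases as input transactions and build a new FP-tree from them.

3. The new tree is a single path, which is the goal. One requirement remains: nodes whose support count is below the threshold must be dropped. Here I3:1 does not qualify, so the path left under the suffix I5 is <I2, I1>, and combining it with I5 gives the patterns <I2, I5>, <I1, I5>, and <I1, I2, I5>.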
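Step 3 can be made concrete with a small sketch. It covers only the single-path case: it drops the path nodes below the threshold and then pairs every non-empty subset of the remaining items with the suffix. The class and method names (SinglePathSketch, patterns, i5Path) are hypothetical and do not come from the implementation below.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: combine a single-path conditional tree with a suffix item. */
class SinglePathSketch {
    /**
     * @param path     item -> count along the single path, root side first
     * @param suffix   the suffix item, e.g. "I5"
     * @param minCount minimum support count
     * @return every non-empty subset of the frequent path items plus the suffix
     */
    static List<List<String>> patterns(Map<String, Integer> path, String suffix, int minCount) {
        // Drop path nodes whose count is below the threshold (I3:1 here).
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : path.entrySet()) {
            if (e.getValue() >= minCount) {
                kept.add(e.getKey());
            }
        }
        List<List<String>> result = new ArrayList<>();
        // Each bit mask from 1 to 2^n - 1 selects one non-empty subset.
        for (int mask = 1; mask < (1 << kept.size()); mask++) {
            List<String> pattern = new ArrayList<>();
            for (int i = 0; i < kept.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    pattern.add(kept.get(i));
                }
            }
            pattern.add(suffix);
            result.add(pattern);
        }
        return result;
    }

    /** The single path built from the conditional pattern bases of I5. */
    static Map<String, Integer> i5Path() {
        Map<String, Integer> path = new LinkedHashMap<>();
        path.put("I2", 2);
        path.put("I1", 2);
        path.put("I3", 1);
        return path;
    }

    public static void main(String[] args) {
        for (List<String> p : patterns(i5Path(), "I5", 2)) {
            System.out.println(p);
        }
    }
}
```

With a minimum support count of 2 this prints [I2, I5], [I1, I5] and [I2, I1, I5], the same three patterns as in step 3.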

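Mining the frequent patterns under I5 is fairly simple because no recursion is needed. The recursive construction under I3 is less simple. Applying the same steps to I3 produces the tree in the figure below:

[Figure: conditional FP-tree for the suffix I3]

This is still not a single path, so the recursion continues. The suffix pattern should now be I3 + I1, that is <I3, I1>, which leads to the situation in the figure below.

[Figure: conditional FP-tree for the suffix <I3, I1>]

The example later on explains this in more detail.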
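The difference between I5 and I3 can be checked with a short sketch that reads the conditional pattern bases directly off the support-ordered transactions. The prefix in front of an item in a sorted transaction is the same path that leads to the item's node in the tree. Only the first level is shown, with no recursion, and the names PatternBaseSketch, SORTED, bases and isSinglePath are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: conditional pattern bases taken from support-ordered transactions. */
class PatternBaseSketch {
    /** The nine sample transactions, already sorted by descending support. */
    static final List<List<String>> SORTED = List.of(
            List.of("I2", "I1", "I5"), List.of("I2", "I4"), List.of("I2", "I3"),
            List.of("I2", "I1", "I4"), List.of("I1", "I3"), List.of("I2", "I3"),
            List.of("I1", "I3"), List.of("I2", "I1", "I3", "I5"),
            List.of("I2", "I1", "I3"));

    /** Maps each prefix that precedes the item to the number of times it occurs. */
    static Map<List<String>, Integer> bases(List<List<String>> sorted, String item) {
        Map<List<String>, Integer> bases = new LinkedHashMap<>();
        for (List<String> t : sorted) {
            int pos = t.indexOf(item);
            if (pos > 0) { // pos == 0 means an empty prefix, which is skipped
                bases.merge(new ArrayList<>(t.subList(0, pos)), 1, Integer::sum);
            }
        }
        return bases;
    }

    /** The bases form a single path when each is a prefix of the longest one. */
    static boolean isSinglePath(Map<List<String>, Integer> bases) {
        List<String> longest = new ArrayList<>();
        for (List<String> b : bases.keySet()) {
            if (b.size() > longest.size()) {
                longest = b;
            }
        }
        for (List<String> b : bases.keySet()) {
            if (!longest.subList(0, b.size()).equals(b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<List<String>, Integer> forI3 = bases(SORTED, "I3");
        System.out.println(forI3 + " single path: " + isSinglePath(forI3));
        Map<List<String>, Integer> forI5 = bases(SORTED, "I5");
        System.out.println(forI5 + " single path: " + isSinglePath(forI5));
    }
}
```

For I3 the bases are {I2}, {I1} and {I2, I1}, each with count 2. They branch at the root, so another round of recursion is needed. For I5 the two bases lie on one path.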

Implementation

The input data is as follows:

Transaction ID	Item IDs
T100	I1, I2, I5
T200	I2, I4
T300	I2, I3
T400	I1, I2, I4
T500	I1, I3
T600	I2, I3
T700	I1, I3
T800	I1, I2, I3, I5
T900	I1, I2, I3

In the input file it looks like this:

 T1 1 2 5    
 T2 2 4    
 T3 2 3    
 T4 1 2 4    
 T5 1 3    
 T6 2 3    
 T7 1 3    
 T8 1 2 3 5    
 T9 1 2 3   

The tree node class:

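 package DataMining_FPTree;

 import java.util.ArrayList;

 /**
  * A node of the FP-tree.
  *
  * @author lyq
  *
  */
 public class TreeNode implements Comparable<TreeNode>, Cloneable{
     // Item name of the node
     private String name;
     // Support count
     private Integer count;
     // Parent node
     private TreeNode parentNode;
     // Child nodes; there may be several
     private ArrayList<TreeNode> childNodes;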
       
     public TreeNode(String name, int count){  
         this.name = name;  
         this.count = count;  
     }  
   
     public String getName() {  
         return name;  
     }  
   
     public void setName(String name) {  
         this.name = name;  
     }  
   
     public Integer getCount() {  
         return count;  
     }  
   
     public void setCount(Integer count) {  
         this.count = count;  
     }  
   
     public TreeNode getParentNode() {  
         return parentNode;  
     }  
   
     public void setParentNode(TreeNode parentNode) {  
         this.parentNode = parentNode;  
     }  
   
     public ArrayList<TreeNode> getChildNodes() {  
         return childNodes;  
     }  
   
     public void setChildNodes(ArrayList<TreeNode> childNodes) {  
         this.childNodes = childNodes;  
     }  
   
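     @Override
     public int compareTo(TreeNode o) {
         // Descending support count; ties fall back to the item name so that
         // items with equal counts are ordered the same way in every transaction
         int byCount = o.getCount().compareTo(this.getCount());
         return byCount != 0 ? byCount : this.getName().compareTo(o.getName());
     }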
   
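     @Override
     protected Object clone() throws CloneNotSupportedException {
         // The node holds references, so a plain field copy is not enough.
         // The parent chain is cloned recursively; the child list is copied,
         // but ArrayList.clone() is shallow, so the child nodes are shared.
         TreeNode node = (TreeNode)super.clone();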
         if(this.getParentNode() != null){  
             node.setParentNode((TreeNode) this.getParentNode().clone());  
         }  
           
         if(this.getChildNodes() != null){  
             node.setChildNodes((ArrayList<TreeNode>) this.getChildNodes().clone());  
         }  
           
         return node;  
     }  
       
 }  

The main implementation class:

 package DataMining_FPTree;  
   
 import java.io.BufferedReader;  
 import java.io.File;  
 import java.io.FileReader;  
 import java.io.IOException;  
 import java.util.ArrayList;  
 import java.util.Collections;  
 import java.util.HashMap;  
 import java.util.Map;  
   
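 /**
  * Utility class for the FP-Tree algorithm.
  *
  * @author lyq
  *
  */
 public class FPTreeTool {
     // Path of the input data file
     private String filePath;
     // Minimum support count threshold
     private int minSupportCount;
     // Item IDs of every transaction
     private ArrayList<String[]> totalGoodsID;
     // Map from item ID to its total count; the counts are used for sorting
     private HashMap<String, Integer> itemCountMap;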
   
     public FPTreeTool(String filePath, int minSupportCount) {  
         this.filePath = filePath;  
         this.minSupportCount = minSupportCount;  
         readDataFile();  
     }  
   
     /**
      * Reads the transaction data from the input file.
      */
     private void readDataFile() {
         File file = new File(filePath);
         ArrayList<String[]> dataArray = new ArrayList<String[]>();

         // try-with-resources closes the reader even if reading fails
         try (BufferedReader in = new BufferedReader(new FileReader(file))) {
             String str;
             String[] tempArray;
             while ((str = in.readLine()) != null) {
                 tempArray = str.split(" ");
                 dataArray.add(tempArray);
             }
         } catch (IOException e) {
             // Report the failure instead of silently discarding it
             e.printStackTrace();
         }
   
         String[] temp;  
         int count = 0;  
         itemCountMap = new HashMap<>();  
         totalGoodsID = new ArrayList<>();  
         for (String[] a : dataArray) {  
             temp = new String[a.length - 1];  
             System.arraycopy(a, 1, temp, 0, a.length - 1);  
             totalGoodsID.add(temp);  
             for (String s : temp) {  
                 if (!itemCountMap.containsKey(s)) {  
                     count = 1;  
                 } else {  
                     count = ((int) itemCountMap.get(s));  
                     // Increment the support count
                     count++;
                 }
                 // Update the map entry
                 itemCountMap.put(s, count);
             }  
         }  
     }  
   
     /**
      * Builds an FP-tree from the transaction records and mines it recursively.
      */
     private void buildFPTree(ArrayList<String> suffixPattern,
             ArrayList<ArrayList<TreeNode>> transctionList) {
         // Create an empty root node
         TreeNode rootNode = new TreeNode(null, 0);
         int count = 0;
         // Whether the node already exists
         boolean isExist = false;
         ArrayList<TreeNode> childNodes;
         ArrayList<TreeNode> pathList;
         // Nodes grouped by item name, used to build the next FP-tree
         HashMap<String, ArrayList<TreeNode>> linkedNode = new HashMap<>();
         HashMap<String, Integer> countNode = new HashMap<>();
         // Build the FP-tree step by step from the transaction records
         for (ArrayList<TreeNode> array : transctionList) {  
             TreeNode searchedNode;  
             pathList = new ArrayList<>();  
             for (TreeNode node : array) {  
                 pathList.add(node);  
                 nodeCounted(node, countNode);  
                 searchedNode = searchNode(rootNode, pathList);  
                 childNodes = searchedNode.getChildNodes();  
   
                 if (childNodes == null) {  
                     childNodes = new ArrayList<>();  
                     childNodes.add(node);  
                     searchedNode.setChildNodes(childNodes);  
                     node.setParentNode(searchedNode);  
                     nodeAddToLinkedList(node, linkedNode);  
                 } else {  
                     isExist = false;  
                     for (TreeNode node2 : childNodes) {  
                         // If a child with the same name exists, add to its support count
                         if (node.getName().equals(node2.getName())) {
                             count = node2.getCount() + node.getCount();
                             node2.setCount(count);
                             // Mark that the node's position was found
                             isExist = true;  
                             break;  
                         }  
                     }  
   
                     if (!isExist) {  
                     // Not found, so add the node as a new child
                         childNodes.add(node);  
                         node.setParentNode(searchedNode);  
                         nodeAddToLinkedList(node, linkedNode);  
                     }  
                 }  
   
             }  
         }  
   
         // If the FP-tree is already a single path, print its frequent patterns
         if (isSinglePath(rootNode)) {  
             printFrequentPattern(suffixPattern, rootNode);  
             System.out.println("-------");  
         } else {  
             ArrayList<ArrayList<TreeNode>> tList;  
             ArrayList<String> sPattern;  
             if (suffixPattern == null) {  
                 sPattern = new ArrayList<>();  
             } else {  
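                 // Work on a copy so that the caller's list is not affected
                 sPattern = new ArrayList<>(suffixPattern);
             }

             // Use the per-item node lists to construct new transactions
             for (Map.Entry<String, Integer> entry : countNode.entrySet()) {
                 // Add the item to the suffix pattern
                 sPattern.add(entry.getKey());
                 // The conditional pattern bases become the new transactions
                 tList = getTransactionList(entry.getKey(), linkedNode);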
                   
                 System.out.print("[后缀模式]：{");  
                 for(String s: sPattern){  
                     System.out.print(s + ", ");  
                 }  
                 System.out.print("}, 此时的条件模式基：");  
                 for(ArrayList<TreeNode> tnList: tList){  
                     System.out.print("{");  
                     for(TreeNode n: tnList){  
                         System.out.print(n.getName() + ", ");  
                     }  
                     System.out.print("}, ");  
                 }  
                 System.out.println();  
                 // Build the next FP-tree recursively
                 buildFPTree(sPattern, tList);
                 // Remove the item again so that the next suffix pattern starts clean
                 sPattern.remove((String) entry.getKey());
             }  
         }  
     }  
   
     /**
      * Adds a node to the list of nodes that share its item name.
      *
      * @param node
      *            the node to add
      * @param linkedList
      *            map from item name to node list
      */
     private void nodeAddToLinkedList(TreeNode node,  
             HashMap<String, ArrayList<TreeNode>> linkedList) {  
         String name = node.getName();  
         ArrayList<TreeNode> list;  
   
         if (linkedList.containsKey(name)) {  
             list = linkedList.get(name);  
             // Append the node to the existing list
             list.add(node);  
         } else {  
             list = new ArrayList<>();  
             list.add(node);  
             linkedList.put(name, list);  
         }  
     }  
   
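     /**
      * Builds new transactions (conditional pattern bases) from the node lists.
      *
      * @param name
      *            item name
      * @param linkedList
      *            map from item name to node list
      * @return the prefix path of every node with the given name
      */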
     private ArrayList<ArrayList<TreeNode>> getTransactionList(String name,  
             HashMap<String, ArrayList<TreeNode>> linkedList) {  
         ArrayList<ArrayList<TreeNode>> tList = new ArrayList<>();  
         ArrayList<TreeNode> targetNode = linkedList.get(name);  
         ArrayList<TreeNode> singleTansaction;  
         TreeNode temp;  
   
         for (TreeNode node : targetNode) {  
             singleTansaction = new ArrayList<>();  
   
             temp = node;  
             while (temp.getParentNode().getName() != null) {  
                 temp = temp.getParentNode();  
                 singleTansaction.add(new TreeNode(temp.getName(), 1));  
             }  
   
             // The path was collected bottom-up, so reverse it to root-first order
             Collections.reverse(singleTansaction);

             for (TreeNode node2 : singleTansaction) {
                 // Give every path node the support count of the suffix node
                 node2.setCount(node.getCount());  
             }  
   
             if (singleTansaction.size() > 0) {  
                 tList.add(singleTansaction);  
             }  
         }  
   
         return tList;  
     }  
   
     /**
      * Counts a node by item name.
      *
      * @param node
      *            the node being added
      * @param nodeCount
      *            map from item name to count
      */
     private void nodeCounted(TreeNode node, HashMap<String, Integer> nodeCount) {  
         int count = 0;  
         String name = node.getName();  
   
         if (nodeCount.containsKey(name)) {  
             count = nodeCount.get(name);  
             count++;  
         } else {  
             count = 1;  
         }  
   
         nodeCount.put(name, count);  
     }  
   
     /**
      * Displays the FP-tree.
      *
      * @param node
      *            the node to display
      * @param blankNum
      *            number of leading tabs, used to show the tree structure
      */
     private void showFPTree(TreeNode node, int blankNum) {
         System.out.println();
         for (int i = 0; i < blankNum; i++) {
             System.out.print("\t");
         }
         System.out.print("----");
         // Print every node, not only the leaves
         System.out.print("[I" + node.getName() + ":" + node.getCount() + "]");

         if (node.getChildNodes() != null) {
             // Display the children recursively, one level deeper
             for (TreeNode childNode : node.getChildNodes()) {
                 showFPTree(childNode, blankNum + 1);
             }
         }

     }
   
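     /**
      * Finds the node under which a new node should be inserted, searching
      * downward from the given node.
      *
      * @param node
      *            node to start searching from
      * @param list
      *            path of nodes leading to the node being inserted
      * @return the node that should become the parent
      */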
     private TreeNode searchNode(TreeNode node, ArrayList<TreeNode> list) {  
         ArrayList<TreeNode> pathList = new ArrayList<>();  
         TreeNode tempNode = null;  
         TreeNode firstNode = list.get(0);  
         boolean isExist = false;  
         // Copy the list so that the caller's list is not modified
         for (TreeNode node2 : list) {
             pathList.add(node2);
         }

         // No children yet: return this node so the new node is added below it
         if (node.getChildNodes() == null) {  
             return node;  
         }  
   
         for (TreeNode n : node.getChildNodes()) {  
             if (n.getName().equals(firstNode.getName()) && list.size() == 1) {  
                 tempNode = node;  
                 isExist = true;  
                 break;  
             } else if (n.getName().equals(firstNode.getName())) {  
                 // Not at the final position yet, keep descending
                 pathList.remove(firstNode);  
                 tempNode = searchNode(n, pathList);  
                 return tempNode;  
             }  
         }  
   
         // No matching child: the new node is added below the current node
         if (!isExist) {  
             tempNode = node;  
         }  
   
         return tempNode;  
     }  
   
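     /**
      * Checks whether the current FP-tree is a single path.
      *
      * @param rootNode
      *            root of the current FP-tree
      * @return true if no node has more than one child
      */
     private boolean isSinglePath(TreeNode rootNode) {
         // Assume a single path by default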
         boolean isSinglePath = true;  
         ArrayList<TreeNode> childList;  
         TreeNode node;  
         node = rootNode;  
   
         while (node.getChildNodes() != null) {  
             childList = node.getChildNodes();  
             if (childList.size() == 1) {  
                 node = childList.get(0);  
             } else {  
                 isSinglePath = false;  
                 break;  
             }  
         }  
   
         return isSinglePath;  
     }  
   
     /**
      * Starts building the FP-tree.
      */
     public void startBuildingTree() {  
         ArrayList<TreeNode> singleTransaction;  
         ArrayList<ArrayList<TreeNode>> transactionList = new ArrayList<>();  
         TreeNode tempNode;  
         int count = 0;  
   
         for (String[] idArray : totalGoodsID) {  
             singleTransaction = new ArrayList<>();  
             for (String id : idArray) {  
                 count = itemCountMap.get(id);  
                 tempNode = new TreeNode(id, count);  
                 singleTransaction.add(tempNode);  
             }  
   
             // Sort the items by descending total support count
             Collections.sort(singleTransaction);
             for (TreeNode node : singleTransaction) {
                 // Reset the support count to 1 for insertion into the tree
                 node.setCount(1);  
             }  
             transactionList.add(singleTransaction);  
         }  
   
         buildFPTree(null, transactionList);  
     }  
   
     /**
      * Prints the frequent patterns of a single-path FP-tree.
      *
      * @param suffixPattern
      *            the suffix pattern
      * @param rootNode
      *            root of the single-path FP-tree
      */
     private void printFrequentPattern(ArrayList<String> suffixPattern,  
             TreeNode rootNode) {  
         ArrayList<String> idArray = new ArrayList<>();  
         TreeNode temp;  
         temp = rootNode;  
         // Used to enumerate the combinations
         int length = 0;  
         int num = 0;  
         int[] binaryArray;  
   
         while (temp.getChildNodes() != null) {  
             temp = temp.getChildNodes().get(0);  
   
             // Keep only nodes whose support count reaches the minimum threshold
             if (temp.getCount() >= minSupportCount) {  
                 idArray.add(temp.getName());  
             }  
         }  
   
         length = idArray.size();  
         num = (int) Math.pow(2, length);  
         for (int i = 0; i < num; i++) {  
             binaryArray = new int[length];  
             numToBinaryArray(binaryArray, i);  
   
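             // Skip the empty subset when the suffix has at most one item;
             // otherwise the output would be empty or the suffix item itself
             boolean shortSuffix = suffixPattern == null
                     || suffixPattern.size() <= 1;
             if (shortSuffix && i == 0) {
                 continue;
             }

             System.out.print("频繁模式：{【后缀模式：");
             // Print the fixed suffix pattern first; it is null when the
             // initial FP-tree is already a single path
             if (suffixPattern != null) {
                 for (String s : suffixPattern) {
                     System.out.print(s + ", ");
                 }
             }
             System.out.print("】");
             // Print the combination of path items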
             for (int j = 0; j < length; j++) {  
                 if (binaryArray[j] == 1) {  
                     System.out.print(idArray.get(j) + ", ");  
                 }  
             }  
             System.out.println("}");  
         }  
     }  
   
     /**
      * Converts a number to binary form.
      *
      * @param binaryArray
      *            receives the binary digits, least significant first
      * @param num
      *            the number to convert
      */
     private void numToBinaryArray(int[] binaryArray, int num) {  
         int index = 0;  
         while (num != 0) {  
             binaryArray[index] = num % 2;  
             index++;  
             num /= 2;  
         }  
     }  
   
 }  

The test class that calls the algorithm:

 package DataMining_FPTree;

 /**
  * Client for the FP-Tree frequent pattern tree algorithm.
  * @author lyq
  *
  */
 public class Client {  
     public static void main(String[] args){  
         String filePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";  
         // Minimum support count threshold
         int minSupportCount = 2;  
           
         FPTreeTool tool = new FPTreeTool(filePath, minSupportCount);  
         tool.startBuildingTree();  
     }  
 }  

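The output is shown below. In it, 后缀模式 means suffix pattern, 条件模式基 means conditional pattern base, and 频繁模式 means frequent pattern: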

 [后缀模式]：{3, }, 此时的条件模式基：{2, }, {1, }, {2, 1, },   
 [后缀模式]：{3, 2, }, 此时的条件模式基：  
 频繁模式：{【后缀模式：3, 2, 】}  
 -------  
 [后缀模式]：{3, 1, }, 此时的条件模式基：{2, },   
 频繁模式：{【后缀模式：3, 1, 】}  
 频繁模式：{【后缀模式：3, 1, 】2, }  
 -------  
 [后缀模式]：{2, }, 此时的条件模式基：  
 -------  
 [后缀模式]：{1, }, 此时的条件模式基：{2, },   
 频繁模式：{【后缀模式：1, 】2, }  
 -------  
 [后缀模式]：{5, }, 此时的条件模式基：{2, 1, }, {2, 1, 3, },   
 频繁模式：{【后缀模式：5, 】2, }  
 频繁模式：{【后缀模式：5, 】1, }  
 频繁模式：{【后缀模式：5, 】2, 1, }  
 -------  
 [后缀模式]：{4, }, 此时的条件模式基：{2, }, {2, 1, },   
 频繁模式：{【后缀模式：4, 】2, }  
 -------  

Readers may want to work through the construction by hand to understand the process better, and then compare the result with my code.
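One way to check a hand construction is a brute-force count. The sketch below is not FP-Tree: it simply tests all 31 non-empty itemsets over the five items against the nine transactions. It is practical only because the example is tiny, and the names BruteForceCheck, DATA and frequent are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Brute-force cross-check; this is not FP-Tree, only a way to verify results. */
class BruteForceCheck {
    /** The nine sample transactions, using the numeric IDs from the input file. */
    static final List<Set<Integer>> DATA = List.of(
            Set.of(1, 2, 5), Set.of(2, 4), Set.of(2, 3), Set.of(1, 2, 4), Set.of(1, 3),
            Set.of(2, 3), Set.of(1, 3), Set.of(1, 2, 3, 5), Set.of(1, 2, 3));

    /** Returns every itemset over items 1..maxItem with support >= minCount. */
    static List<Set<Integer>> frequent(List<Set<Integer>> data, int maxItem, int minCount) {
        List<Set<Integer>> result = new ArrayList<>();
        // Each bit mask stands for one candidate itemset.
        for (int mask = 1; mask < (1 << maxItem); mask++) {
            Set<Integer> candidate = new TreeSet<>();
            for (int item = 1; item <= maxItem; item++) {
                if ((mask & (1 << (item - 1))) != 0) {
                    candidate.add(item);
                }
            }
            // Support = number of transactions containing the whole candidate.
            int support = 0;
            for (Set<Integer> t : data) {
                if (t.containsAll(candidate)) {
                    support++;
                }
            }
            if (support >= minCount) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Set<Integer> itemset : frequent(DATA, 5, 2)) {
            if (itemset.size() > 1) {
                System.out.println(itemset);
            }
        }
    }
}
```

With a minimum support count of 2 it lists eight itemsets with more than one item: {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}, {2, 4}, {1, 5}, {2, 5} and {1, 2, 5}. These are the same eight patterns that appear in the 频繁模式 lines of the output above.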

Difficulties in Coding the Algorithm

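1. When a new tree has to be built during the recursion, the original tree must not be modified. At first I reused the node objects of the old tree, which caused shared-reference problems, so I decided to create new TreeNode objects instead, copying only the name and count values from the original tree and rebuilding the parent-child relationships from scratch.

2. When generating the tree from the transactions, each transaction is mapped to a list of TreeNode objects, so the process reduces to adding a node or updating an existing node's count, which is much simpler. This may be hard to follow at first, but I find it more convenient: with plain String[] arrays, the constant conversion to and from TreeNode would be very tedious.

3. For computing the conditional pattern bases, I store the nodes in a HashMap<String, ArrayList<TreeNode>> map rather than in linked lists, and collect them all while the tree is being built.

4. The algorithm uses recursion in two places. One is searchNode(TreeNode node, ArrayList<TreeNode> list), which finds the node under which a new tree node should be added. The other is buildFPTree() itself. Neither is easy to understand at a glance, and I hope readers can follow my intent.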
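Points 1 and 3 can be illustrated together in a reduced form. In the sketch below a map from item name to node list stands in for the header table, and the prefix paths are rebuilt from fresh nodes that keep only a name and a count. It is a simplified illustration rather than a copy of the code above, and NodeIndexSketch, add, prefixPaths and sample are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of points 1 and 3; class and method names are illustrative. */
class NodeIndexSketch {
    static final class Node {
        final String name;
        final int count;
        final Node parent;

        Node(String name, int count, Node parent) {
            this.name = name;
            this.count = count;
            this.parent = parent;
        }
    }

    /** Plays the role of the header table: item name -> all nodes of that item. */
    final Map<String, List<Node>> index = new HashMap<>();

    /** Creates a node under the given parent and registers it in the index. */
    Node add(String name, int count, Node parent) {
        Node node = new Node(name, count, parent);
        index.computeIfAbsent(name, k -> new ArrayList<>()).add(node);
        return node;
    }

    /** Walks from every node of the item up to the root, copying names only. */
    List<List<Node>> prefixPaths(String name) {
        List<List<Node>> paths = new ArrayList<>();
        for (Node node : index.getOrDefault(name, Collections.emptyList())) {
            List<Node> path = new ArrayList<>();
            for (Node p = node.parent; p != null && p.name != null; p = p.parent) {
                // A fresh node with the suffix node's count and no links to the old tree.
                path.add(new Node(p.name, node.count, null));
            }
            Collections.reverse(path); // root side first
            if (!path.isEmpty()) {
                paths.add(path);
            }
        }
        return paths;
    }

    /** The two branches of the sample FP-tree that end in I5. */
    static NodeIndexSketch sample() {
        NodeIndexSketch tree = new NodeIndexSketch();
        Node root = new Node(null, 0, null);
        Node i2 = tree.add("I2", 7, root);
        Node i1 = tree.add("I1", 4, i2);
        tree.add("I5", 1, i1);
        Node i3 = tree.add("I3", 2, i1);
        tree.add("I5", 1, i3);
        return tree;
    }

    public static void main(String[] args) {
        for (List<Node> path : sample().prefixPaths("I5")) {
            StringBuilder line = new StringBuilder();
            for (Node n : path) {
                line.append(n.name).append(":").append(n.count).append(" ");
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

Because the copied nodes carry no parent or child links, building the next conditional tree from them cannot disturb the tree they came from.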