FP-Growth in Java on a single machine and on a cluster (association rule mining)

人非木石_xst

Tags: algorithms, java, spark, data mining, association rules

Article link: https://blog.csdn.net/shimin520shimin/article/details/49281381

1 A brief description of FP-Growth

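Like Apriori, FP-Growth is an algorithm for association rule mining. Apriori has to scan the transaction database once for every level of frequent k-itemsets it generates. On a large database this means heavy I/O, so Apriori is practical mainly for finding frequent itemsets in small datasets. FP-Growth scans the transaction database only twice: once to count item supports and build the header table of the FP-tree, and once to insert the filtered, sorted transactions into the tree. Preprocessing (removing transaction IDs, merging identical transactions, and so on) belongs to the first pass. All later mining works on the tree rather than on the database, which is why the algorithm stays efficient on large datasets. The main steps are: preprocess the data; build the header table, keeping only items that meet the minimum support; build the FP-tree; then traverse the header table and recursively derive each item's conditional pattern base and, from it, all frequent itemsets.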
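The first two steps (counting support, then filtering and ordering each transaction) can be sketched on their own. This is a minimal illustration and not the article's code: the class `SupportCountSketch` and its methods are hypothetical names, each transaction is assumed to contain no repeated items, and ties in support are broken alphabetically only to make the order deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the article's FPTree listing.
class SupportCountSketch {

    // First pass: count how many transactions contain each item
    // (assumes an item appears at most once per transaction).
    static Map<String, Integer> countSupport(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> transaction : transactions) {
            for (String item : transaction) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Preparation for the second pass: drop items below minSupport and order
    // the rest by descending support, breaking ties alphabetically.
    static List<String> filterAndSort(List<String> transaction,
                                      Map<String, Integer> counts,
                                      int minSupport) {
        List<String> kept = new ArrayList<>();
        for (String item : transaction) {
            if (counts.getOrDefault(item, 0) >= minSupport) {
                kept.add(item);
            }
        }
        kept.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> transactions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("a", "b"),
                Arrays.asList("a", "d"));
        Map<String, Integer> counts = countSupport(transactions);
        for (List<String> transaction : transactions) {
            System.out.println(filterAndSort(transaction, counts, 2));
        }
    }
}
```

With a minimum support of 2, items `c` and `d` are dropped, so the three transactions become `[a, b]`, `[a, b]` and `[a]`. Inserting transactions in this shared order is what lets common prefixes overlap in the FP-tree.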

2 Single-machine Java implementation of FP-Growth (source code)

package org.min.ml;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/**
 * FP-tree algorithm: mines the frequent itemsets of a transaction database; an improvement on the Apriori algorithm
 * 
 * @author ShiMin
 * @date   2015/10/17 
 */
public class FPTree
{
	private int minSupport;// minimum support count
	
	public FPTree(int support)
	{
		this.minSupport = support;
	}
	
	/**
	 * Load the transaction database.
	 * @param file path of the input file; the items on each line are separated by spaces
	 */
	public List<List<String>> loadTransaction(String file)
	{
		List<List<String>> transactions = new ArrayList<List<String>>();
		// try-with-resources closes the reader even if reading fails
		try (BufferedReader br = new BufferedReader(new FileReader(new File(file))))
		{
			String line;
			while((line = br.readLine()) != null)
			{
				transactions.add(Arrays.asList(line.split(" ")));
			}
		} catch (Exception e)
		{
			e.printStackTrace();
		}
		return transactions;
	}
	
	public void FPGrowth(List<List<String>> transactions, List<String> postPattern)
	{
		// build the header table
		List<TNode> headerTable = buildHeaderTable(transactions);
		// build the FP-tree
		TNode tree = bulidFPTree(headerTable, transactions);
		// stop when the tree is empty
		if (tree.getChildren()== null || tree.getChildren().size() == 0)
		{
			return;
		}
		// print the frequent itemsets
        if(postPattern!=null)
        {
            for (TNode head : headerTable) 
            {
                System.out.print(head.getCount() + " " + head.getItemName());
                for (String item : postPattern)
                {
                	System.out.print(" " + item);
                }
                System.out.println();
            }
        }
		// visit every entry of the header table
		for(TNode head : headerTable)
		{
			List<String> newPostPattern = new LinkedList<String>();
			newPostPattern.add(head.getItemName());// add the current item to the pattern
			// then append the suffix pattern accumulated so far
			if (postPattern != null)
			{
                newPostPattern.addAll(postPattern);
			}
			// the new (conditional) transaction database
		    List<List<String>> newTransaction = new LinkedList<List<String>>();
            TNode nextnode = head.getNext();
            // build the new transaction database from the prefix paths of the nodes named head.getItemName(), leaving out that item itself
            while(nextnode != null)
            {
            	int count = nextnode.getCount();
            	List<String> parentNodes = new ArrayList<String>();// all ancestors of nextnode
            	TNode parent = nextnode.getParent();
            	while(parent.getItemName() != null)
            	{
            		parentNodes.add(parent.getItemName());
            		parent = parent.getParent();
            	}
            	// add parentNodes to the new transaction database count times
            	while((count--) > 0)
            	{
            		newTransaction.add(parentNodes);// add the prefix path, so the final frequent itemsets have the form: parentNodes -> newPostPattern
            	}
            	// move on to the next node with the same item name
            	nextnode = nextnode.getNext();
            }
			// recurse: repeat all of the above for each header table entry
            FPGrowth(newTransaction, newPostPattern);
		}
	}
	
	/**
	 * Build the header table, sorted in descending order.
	 * @return the header table entries
	 */
	public List<TNode> buildHeaderTable(List<List<String>> transactions)
	{
		List<TNode> list = new ArrayList<TNode>();
		Map<String,TNode> nodesmap = new HashMap<String,TNode>();
		// build for each item
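
The loop in `FPGrowth` that follows the node links of a header table entry and collects the ancestors of each node is the step that produces the conditional transaction database. A minimal standalone sketch of that step follows. The `Node` class here is a hypothetical stand-in, since the article's `TNode` class is not shown in the listing, and its fields are assumptions inferred from the getters the listing calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Node only mimics what the listing's TNode appears to expose.
class PrefixPathSketch {

    static class Node {
        final String item;  // null marks the root
        final int count;    // number of transactions passing through this node
        final Node parent;
        Node next;          // next tree node carrying the same item

        Node(String item, int count, Node parent) {
            this.item = item;
            this.count = count;
            this.parent = parent;
        }
    }

    // For every node on the chain starting at first, collect its ancestors
    // (bottom-up, root excluded) and emit that path count times.
    static List<List<String>> conditionalTransactions(Node first) {
        List<List<String>> result = new ArrayList<>();
        for (Node node = first; node != null; node = node.next) {
            List<String> path = new ArrayList<>();
            for (Node p = node.parent; p != null && p.item != null; p = p.parent) {
                path.add(p.item);
            }
            for (int i = 0; i < node.count; i++) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tree: root -> a:3 -> b:2 -> c:1, and root -> b:1 -> c:1
        Node root = new Node(null, 0, null);
        Node a = new Node("a", 3, root);
        Node b1 = new Node("b", 2, a);
        Node c1 = new Node("c", 1, b1);
        Node b2 = new Node("b", 1, root);
        Node c2 = new Node("c", 1, b2);
        c1.next = c2;
        System.out.println(conditionalTransactions(c1)); // prints [[b, a], [b]]
    }
}
```

Running the recursion on `[[b, a], [b]]` with suffix `c` is then the same problem on a smaller database, which is how the listing reuses `FPGrowth` for every header table entry.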
