Finding Frequent Itemsets with the A-Priori Algorithm

Problem:

Suppose there are 10,000 items, numbered 1 to 10,000, and 10,000 baskets, also numbered 1 to
10,000. Provided is the item list, where each line represents the items in one basket. For
example, when one basket contains item 1, 2, and 3, the corresponding line will be:
1 2 3
Write the A-Priori algorithm in a map-reduce setting to answer the following questions.

  1. If the support threshold is 1000, which items are frequent?
  2. If the support threshold is 100, find the maximal frequent itemsets, i.e., frequent
    itemsets with the largest size

In short: we are given a file of 10,000 lines, where each line is one basket and each whitespace-separated token on a line is one item. Every basket that contains a given set of items contributes one to that itemset's support count. The task is to find the frequent itemsets, i.e., those whose support exceeds a given threshold.
We implement the search with the A-Priori algorithm and two rounds of MapReduce: the first round emits candidate frequent itemsets, and the second round verifies which of those candidates are truly frequent.

How the A-Priori algorithm works

[Figure: the A-Priori candidate-generation and pruning flow]
In each round, every candidate itemset whose support count reaches the threshold is kept and self-joined to build the next round's candidates; itemsets below the threshold are discarded.
[Figure: a round-by-round example of the pruning]
The core idea of the algorithm is that every subset of a frequent itemset must itself be frequent. For example, if {1,2,3} is a frequent itemset, then {1,2} and {1,3} must be frequent too. Read contrapositively, any candidate with an infrequent subset can be discarded without counting it, which is what the pruning above exploits.
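The code below does not implement this subset check explicitly (its candidates only ever come from joining frequent sets), but as an illustration, a size-k candidate can be vetted against the previous round's frequent (k-1)-itemsets before any counting is done. A minimal sketch (hasInfrequentSubset is a hypothetical helper, not part of the repo):

//Hypothetical prune step: a size-k candidate is viable only if every
//(k-1)-subset was frequent in the previous round (A-Priori monotonicity).
private static boolean hasInfrequentSubset(Set<String> candidate,
                                           List<Set<String>> prevFrequent) {
    for (String item : candidate) {
        //form the (k-1)-subset obtained by dropping one item
        Set<String> subset = new HashSet<>(candidate);
        subset.remove(item);
        if (!prevFrequent.contains(subset)) {
            return true; //some subset is infrequent, so the candidate cannot be frequent
        }
    }
    return false;
}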

Java implementation

First-pass Mapper:

First, the routine that counts the candidates over one chunk of baskets and keeps the locally frequent ones:

private Map<Set<String>, Integer> getOneRoundItems(List<Set<String>> candidate, int index, int perThreshold, int perSize, ArrayList[] buckets) {
    Map<Set<String>, Integer> keyValues = new ConcurrentHashMap<>();
    for (int j = index; j < index + perSize; j++) {
        //one basket
        List<String> bucket = buckets[j];
        //check whether the basket contains each candidate set
        for (Set<String> items : candidate) {
            boolean containsAll = true;
            for (String item : items) {
                if (!bucket.contains(item)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                keyValues.put(items, keyValues.getOrDefault(items, 0) + 1);
            }
        }
    }
    //drop candidates below the per-chunk threshold
    //(removal while iterating is safe on a ConcurrentHashMap)
    for (Map.Entry<Set<String>, Integer> entry : keyValues.entrySet()) {
        if (entry.getValue() < perThreshold) {
            keyValues.remove(entry.getKey());
        }
    }
    return keyValues;
}

Because this simulates MapReduce, the full set of baskets is split into several chunks, one thread per chunk, and each chunk uses a threshold equal to the global threshold divided by the number of chunks: with threshold 1000 and 4 chunks, each chunk's local threshold is 250. This is essentially the first pass of the SON algorithm.
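One performance note: bucket.contains on an ArrayList is a linear scan, so each candidate test costs O(basket size). Copying each basket into a HashSet once, before the passes start, would make membership tests O(1). A minimal sketch of that optional preprocessing (not in the original code):

//Optional preprocessing (not in the original code): copy each basket
//into a HashSet so that contains() becomes O(1) instead of a linear scan.
List<Set<String>> bucketSets = new ArrayList<>(buckets.length);
for (ArrayList bucket : buckets) {
    bucketSets.add(new HashSet<String>(bucket));
}
//a whole candidate can then be tested in one call: bucketSets.get(j).containsAll(items)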

Next, the self-join that combines the frequent itemsets produced in round k-1 into the candidate sets counted in round k:

//generate the next round's candidate sets from the current frequent itemsets
private List<Set<String>> generateCandidateSet(Map<Set<String>, Integer> keyValues) {
    List<Set<String>> candidateSet = new CopyOnWriteArrayList<>();
    List<Set<String>> resultSet = new CopyOnWriteArrayList<>();
    for (Map.Entry<Set<String>, Integer> entry : keyValues.entrySet()) {
        candidateSet.add(entry.getKey());
    }
    for (int i = 0; i < candidateSet.size(); i++) {
        Set<String> itemSet1 = candidateSet.get(i);
        for (int j = i + 1; j < candidateSet.size(); j++) {
            int notEqual = 0;
            Set<String> itemSet2 = candidateSet.get(j);
            //count how many items of itemSet1 are missing from itemSet2
            for (String item : itemSet1) {
                if (!itemSet2.contains(item)) {
                    notEqual++;
                }
                if (notEqual > 1) {
                    break;
                }
            }
            //join only when the two k-itemsets differ in exactly one item,
            //so that their union is a (k+1)-itemset
            if (notEqual <= 1) {
                Set<String> set = new HashSet<>();
                set.addAll(itemSet1);
                set.addAll(itemSet2);
                if (!resultSet.contains(set))
                    resultSet.add(set);
            }
        }
    }
    return resultSet;
}

The join rule: two itemsets are joined only when they differ in exactly one item, and the joined result is their union. For example, {1,2} and {1,3} join to produce the candidate {1,2,3}, while {1,2} and {3,4} are not joined.
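As a quick sanity check of the join (a hypothetical snippet, not in the post; it would have to live inside FirstMapper since generateCandidateSet is private):

//requires java.util imports; place inside FirstMapper for access to generateCandidateSet
public static void main(String[] args) {
    Map<Set<String>, Integer> frequent = new HashMap<>();
    frequent.put(new HashSet<>(Arrays.asList("1", "2")), 5);
    frequent.put(new HashSet<>(Arrays.asList("1", "3")), 4);
    frequent.put(new HashSet<>(Arrays.asList("2", "3")), 3);
    //each pair of these 2-itemsets differs in exactly one item, so all three
    //joins produce the same union {1,2,3}; the duplicate check keeps one copy
    System.out.println(new FirstMapper().generateCandidateSet(frequent));
    //prints a single candidate, e.g. [[1, 2, 3]] (set ordering may vary)
}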

The map method:

public void map(ArrayList[] buckets, int threshold) {
    //split the baskets into four chunks, one worker thread per chunk
    final int tasks = 4;
    //thread pool that manages the worker threads
    ExecutorService executorService = Executors.newCachedThreadPool();
    final CountDownLatch countDownLatch = new CountDownLatch(tasks);
    for (int i = 0; i < tasks; i++) {
        int perSize = buckets.length / tasks;
        int finalI = i;
        int perThreshold = threshold / tasks;
        executorService.execute(() -> {
            int index = finalI * perSize;
            List<Set<String>> beginningCandidate = new ArrayList<>();
            //generate the k=1 candidate itemsets (items are numbered 1..10000)
            for (int j = 1; j < 10001; j++) {
                Set<String> set = new HashSet<>();
                set.add(String.valueOf(j));
                beginningCandidate.add(set);
            }
            Map<Set<String>, Integer> originKeyValues = getOneRoundItems(beginningCandidate, index, perThreshold, perSize, buckets);
            List<Set<String>> lastCandidate = generateCandidateSet(originKeyValues);
            Map<Set<String>, Integer> lastKeyValues = originKeyValues;
            try {
                buffer[finalI].put(lastKeyValues);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            //loop until no k-itemset is locally frequent any more
            while (true) {
                Map<Set<String>, Integer> keyValues = getOneRoundItems(lastCandidate, index, perThreshold, perSize, buckets);
                if (keyValues.size() == 0) {
                    break;
                }
                lastCandidate = generateCandidateSet(keyValues);
                lastKeyValues = keyValues;
                try {
                    buffer[finalI].put(lastKeyValues);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            countDownLatch.countDown();
        });
    }
    //wait for all tasks to finish before shutting down the pool
    try {
        countDownLatch.await();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    executorService.shutdown();
}

The map method loops until some round k yields no itemset above the per-chunk threshold. At the end of every round, the chunk's locally frequent itemsets are put into a blocking queue, from which the reducer reads and processes them.
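The Transfer class holding these queues (Transfer.buffer, Transfer.bufferForSecondPass) is referenced in the classes below but never shown in the post; a minimal sketch consistent with how the mappers and reducers use it (the field shapes are inferred from that usage, so the repo's version may differ):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

//shared buffers between mappers and reducers: one deque per first-pass
//mapper thread, plus a single deque for the second pass
public class Transfer {
    public static final BlockingDeque[] buffer = new BlockingDeque[]{
            new LinkedBlockingDeque<Map<Set<String>, Integer>>(),
            new LinkedBlockingDeque<Map<Set<String>, Integer>>(),
            new LinkedBlockingDeque<Map<Set<String>, Integer>>(),
            new LinkedBlockingDeque<Map<Set<String>, Integer>>()
    };
    public static final BlockingDeque<Map<Set<String>, Integer>> bufferForSecondPass =
            new LinkedBlockingDeque<>();
}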

The complete FirstMapper class:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;

public class FirstMapper {
    private static final BlockingDeque[] buffer = Transfer.buffer;

    public void map(ArrayList[] buckets, int threshold) {
        //split the baskets into four chunks, one worker thread per chunk
        final int tasks = 4;
        //thread pool that manages the worker threads
        ExecutorService executorService = Executors.newCachedThreadPool();
        final CountDownLatch countDownLatch = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            int perSize = buckets.length / tasks;
            int finalI = i;
            int perThreshold = threshold / tasks;
            executorService.execute(() -> {
                int index = finalI * perSize;
                List<Set<String>> beginningCandidate = new ArrayList<>();
                //generate the k=1 candidate itemsets (items are numbered 1..10000)
                for (int j = 1; j < 10001; j++) {
                    Set<String> set = new HashSet<>();
                    set.add(String.valueOf(j));
                    beginningCandidate.add(set);
                }
                Map<Set<String>, Integer> originKeyValues = getOneRoundItems(beginningCandidate, index, perThreshold, perSize, buckets);
                List<Set<String>> lastCandidate = generateCandidateSet(originKeyValues);
                Map<Set<String>, Integer> lastKeyValues = originKeyValues;
                try {
                    buffer[finalI].put(lastKeyValues);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                //loop until no k-itemset is locally frequent any more
                while (true) {
                    Map<Set<String>, Integer> keyValues = getOneRoundItems(lastCandidate, index, perThreshold, perSize, buckets);
                    if (keyValues.size() == 0) {
                        break;
                    }
                    lastCandidate = generateCandidateSet(keyValues);
                    lastKeyValues = keyValues;
                    try {
                        buffer[finalI].put(lastKeyValues);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
                countDownLatch.countDown();
            });
        }
        //wait for all tasks to finish before shutting down the pool
        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        executorService.shutdown();
    }

    //filter the locally frequent itemsets over one chunk of baskets
    private Map<Set<String>, Integer> getOneRoundItems(List<Set<String>> candidate, int index, int perThreshold, int perSize, ArrayList[] buckets) {
        Map<Set<String>, Integer> keyValues = new ConcurrentHashMap<>();
        for (int j = index; j < index + perSize; j++) {
            //one basket
            List<String> bucket = buckets[j];
            //check whether the basket contains each candidate set
            for (Set<String> items : candidate) {
                boolean containsAll = true;
                for (String item : items) {
                    if (!bucket.contains(item)) {
                        containsAll = false;
                        break;
                    }
                }
                if (containsAll) {
                    keyValues.put(items, keyValues.getOrDefault(items, 0) + 1);
                }
            }
        }
        //drop candidates below the per-chunk threshold
        //(removal while iterating is safe on a ConcurrentHashMap)
        for (Map.Entry<Set<String>, Integer> entry : keyValues.entrySet()) {
            if (entry.getValue() < perThreshold) {
                keyValues.remove(entry.getKey());
            }
        }
        return keyValues;
    }

    //generate the next round's candidate sets from the current frequent itemsets
    private List<Set<String>> generateCandidateSet(Map<Set<String>, Integer> keyValues) {
        List<Set<String>> candidateSet = new CopyOnWriteArrayList<>();
        List<Set<String>> resultSet = new CopyOnWriteArrayList<>();
        for (Map.Entry<Set<String>, Integer> entry : keyValues.entrySet()) {
            candidateSet.add(entry.getKey());
        }
        for (int i = 0; i < candidateSet.size(); i++) {
            Set<String> itemSet1 = candidateSet.get(i);
            for (int j = i + 1; j < candidateSet.size(); j++) {
                int notEqual = 0;
                Set<String> itemSet2 = candidateSet.get(j);
                //count how many items of itemSet1 are missing from itemSet2
                for (String item : itemSet1) {
                    if (!itemSet2.contains(item)) {
                        notEqual++;
                    }
                    if (notEqual > 1) {
                        break;
                    }
                }
                //join only when the two k-itemsets differ in exactly one item,
                //so that their union is a (k+1)-itemset
                if (notEqual <= 1) {
                    Set<String> set = new HashSet<>();
                    set.addAll(itemSet1);
                    set.addAll(itemSet2);
                    if (!resultSet.contains(set))
                        resultSet.add(set);
                }
            }
        }
        return resultSet;
    }

}

First-pass Reducer:

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class FirstReducer {
    private static final BlockingDeque[] buffer = Transfer.buffer;
    private static final List<Set<String>> reduceResult = new CopyOnWriteArrayList<>();

    public List<Set<String>> reduce() {
        //four threads drain the buffers filled by the mapper; the deduplicated
        //union of all chunks' itemsets becomes the first pass's candidate list
        //(polling until empty is safe because map() has already returned)
        int tasks = 4;
        CountDownLatch countDownLatch = new CountDownLatch(tasks);
        Lock lock = new ReentrantLock();
        ExecutorService executorService = Executors.newCachedThreadPool();
        for (int i = 0; i < tasks; i++) {
            int finalI = i;
            executorService.execute(() -> {
                try {
                    while (!buffer[finalI].isEmpty()) {
                        Map<Set<String>, Integer> map = (Map<Set<String>, Integer>) buffer[finalI].take();
                        for (Map.Entry<Set<String>, Integer> entry : map.entrySet()) {
                            Set<String> items = entry.getKey();
                            //the lock makes the contains-then-add check atomic
                            lock.lock();
                            if (!reduceResult.contains(items)) {
                                reduceResult.add(items);
                                System.out.println("Add set to the first reducer's result: " + Util.setToString(items));
                            }
                            lock.unlock();
                        }
                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                countDownLatch.countDown();
            });
        }
        //wait for all tasks to finish
        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        executorService.shutdown();

        return reduceResult;
    }
}

The reducer's logic is simple: the mapper puts values of type Map<Set<String>, Integer> into the buffers, and the reducer strips the Integer (the per-chunk support count) and outputs only the deduplicated candidate frequent itemsets of the first pass, as a List<Set<String>>.
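A side note on this design: the explicit lock exists only to make the contains-then-add pair atomic. A concurrent set would give the same deduplication without any locking; a sketch of the alternative (not the author's code):

//lock-free alternative: add() on a concurrent set is already an atomic
//add-if-absent, so neither the lock nor the contains() check is needed
Set<Set<String>> reduceResult = ConcurrentHashMap.newKeySet();
//inside each worker thread:
//if (reduceResult.add(items)) { /* print the newly added set */ }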

Second-pass Mapper:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

//second-pass mapper: counts the true global support of the candidates
//produced by the first-pass reducer
public class SecondMapper {
    private static final BlockingDeque<Map<Set<String>, Integer>> buffer = Transfer.bufferForSecondPass;
    private static final Map<Set<String>, Integer> itemSet = new ConcurrentHashMap<>();

    public void map(ArrayList[] buckets, List<Set<String>> frequentItems, int threshold) {
        final int tasks = 4;
        ExecutorService executorService = Executors.newCachedThreadPool();
        final CountDownLatch countDownLatch = new CountDownLatch(tasks);
        Lock lock = new ReentrantLock();
        for (int i = 0; i < tasks; i++) {
            int perSize = buckets.length / tasks;
            int finalI = i;
            executorService.execute(() -> {
                int index = finalI * perSize;
                for (int j = index; j < index + perSize; j++) {
                    List<String> bucket = buckets[j];
                    for (Set<String> items : frequentItems) {
                        boolean containsAll = true;
                        for (String item : items) {
                            if (!bucket.contains(item)) {
                                containsAll = false;
                                break;
                            }
                        }
                        if (containsAll) {
                            //the lock makes the read-modify-write of the counter atomic
                            lock.lock();
                            itemSet.put(items, itemSet.getOrDefault(items, 0) + 1);
                            lock.unlock();
                        }
                    }
                }
                countDownLatch.countDown();
            });
        }
        //wait for all tasks to finish
        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        executorService.shutdown();
        //filter by the global threshold
        for (Map.Entry<Set<String>, Integer> entry : itemSet.entrySet()) {
            if (entry.getValue() < threshold) {
                itemSet.remove(entry.getKey());
            }
        }
        //hand the aggregated result to the second-pass buffer
        try {
            buffer.put(itemSet);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

The second-pass mapper counts the true global support of every candidate emitted by the first-pass reducer. In the first pass, an itemset that crosses a chunk's threshold in one thread is declared locally frequent, but its count over all baskets may still fall short of the global threshold, so candidates can be false positives: with threshold 1000 and 4 chunks, a set counted 300 times in one chunk passes the local threshold of 250 even if it appears only 600 times overall. The converse cannot happen: if a set reaches the global threshold, then by a pigeonhole argument it must reach the per-chunk threshold in at least one chunk, so the first pass misses no truly frequent itemset.

Second-pass Reducer:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingDeque;

//second-pass reducer: outputs the final, verified frequent itemsets
public class SecondReducer {
    private static final BlockingDeque<Map<Set<String>, Integer>> buffer = Transfer.bufferForSecondPass;

    public List<Set<String>> reduce() {
        List<Set<String>> result = new ArrayList<>();
        try {
            Map<Set<String>, Integer> itemSet = buffer.take();
            for (Map.Entry<Set<String>, Integer> entry : itemSet.entrySet()) {
                int support = entry.getValue();
                Set<String> set = entry.getKey();
                result.add(set);
                System.out.println("After the second reduce, Frequent Items set: " + Util.setToString(set) + ", support: " + support);
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return result;
    }
}

The second-pass reducer plays the same role as the first-pass one: it takes the aggregated map from the buffer and outputs the final result, now with verified global support counts.
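The run below answers question 1 (all frequent itemsets at threshold 1000). Question 2 additionally asks for the maximal frequent itemsets, which the problem defines as the frequent itemsets of the largest size; that filter does not appear in the post, but it is a small post-processing step over the second reducer's result. A sketch (the helper name maximalBySize is hypothetical):

//hypothetical post-processing for question 2: keep only the frequent
//itemsets of the largest size, per the problem's definition of "maximal"
static List<Set<String>> maximalBySize(List<Set<String>> frequent) {
    int maxSize = 0;
    for (Set<String> s : frequent) {
        maxSize = Math.max(maxSize, s.size());
    }
    List<Set<String>> maximal = new ArrayList<>();
    for (Set<String> s : frequent) {
        if (s.size() == maxSize) {
            maximal.add(s);
        }
    }
    return maximal;
}

Judging from the printed output below, applying this at threshold 1000 would keep {1,2,3,6} and {1,2,4,8}.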

Test driver:

public class Main {
    public static void main(String[] args) {
        ArrayList[] buckets = FileOperator.readFile("itemlist10000.txt");
        FirstMapper firstMapper = new FirstMapper();
        FirstReducer firstReducer = new FirstReducer();
        SecondMapper secondMapper = new SecondMapper();
        SecondReducer secondReducer = new SecondReducer();

        //pass 1: emit locally frequent candidates to the buffers
        firstMapper.map(buckets, 1000);
        //pass 2: count every candidate over all baskets and filter globally
        secondMapper.map(buckets, firstReducer.reduce(), 1000);
        //print the verified frequent itemsets with their support counts
        secondReducer.reduce();
    }
}
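FileOperator.readFile is not shown in the post; a minimal sketch consistent with how Main and the mappers use it (one basket per line, whitespace-separated item IDs; the repo's implementation may differ):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

//reads one basket per line; each whitespace-separated token is an item ID
public class FileOperator {
    public static ArrayList[] readFile(String path) {
        ArrayList<ArrayList<String>> baskets = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                ArrayList<String> basket = new ArrayList<>();
                for (String token : line.trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        basket.add(token);
                    }
                }
                baskets.add(basket);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return baskets.toArray(new ArrayList[0]);
    }
}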

Results (support threshold 1000):

Add set to the first reducer's result: {1}
Add set to the first reducer's result: {2}
Add set to the first reducer's result: {3}
Add set to the first reducer's result: {4}
Add set to the first reducer's result: {5}
Add set to the first reducer's result: {6}
Add set to the first reducer's result: {7}
Add set to the first reducer's result: {8}
Add set to the first reducer's result: {9}
Add set to the first reducer's result: {1,2}
Add set to the first reducer's result: {1,3}
Add set to the first reducer's result: {1,4}
Add set to the first reducer's result: {2,3}
Add set to the first reducer's result: {2,4}
Add set to the first reducer's result: {1,5}
Add set to the first reducer's result: {1,6}
Add set to the first reducer's result: {2,6}
Add set to the first reducer's result: {1,7}
Add set to the first reducer's result: {3,6}
Add set to the first reducer's result: {1,8}
Add set to the first reducer's result: {2,8}
Add set to the first reducer's result: {1,9}
Add set to the first reducer's result: {4,8}
Add set to the first reducer's result: {3,9}
Add set to the first reducer's result: {1,2,3}
Add set to the first reducer's result: {1,2,4}
Add set to the first reducer's result: {1,2,6}
Add set to the first reducer's result: {1,3,6}
Add set to the first reducer's result: {2,3,6}
Add set to the first reducer's result: {1,2,8}
Add set to the first reducer's result: {1,4,8}
Add set to the first reducer's result: {1,3,9}
Add set to the first reducer's result: {2,4,8}
Add set to the first reducer's result: {1,2,3,6}
Add set to the first reducer's result: {1,2,4,8}
Add set to the first reducer's result: {10}
Add set to the first reducer's result: {1,10}
Add set to the first reducer's result: {2,10}
Add set to the first reducer's result: {5,10}
Add set to the first reducer's result: {2,5}
Add set to the first reducer's result: {1,2,10}
Add set to the first reducer's result: {1,5,10}
Add set to the first reducer's result: {2,5,10}
Add set to the first reducer's result: {1,2,5}
Add set to the first reducer's result: {1,2,5,10}
After the second reduce, Frequent Items set: {1,2,3,6}, support: 1662
After the second reduce, Frequent Items set: {1,2,4,8}, support: 1248
After the second reduce, Frequent Items set: {1,2,3}, support: 1662
After the second reduce, Frequent Items set: {1,2,4}, support: 2497
After the second reduce, Frequent Items set: {1,2,6}, support: 1662
After the second reduce, Frequent Items set: {1,3,6}, support: 1662
After the second reduce, Frequent Items set: {2,3,6}, support: 1662
After the second reduce, Frequent Items set: {1,2,8}, support: 1248
After the second reduce, Frequent Items set: {1,3,9}, support: 1109
After the second reduce, Frequent Items set: {1,4,8}, support: 1248
After the second reduce, Frequent Items set: {2,4,8}, support: 1248
After the second reduce, Frequent Items set: {1,2}, support: 4996
After the second reduce, Frequent Items set: {1,3}, support: 3329
After the second reduce, Frequent Items set: {2,3}, support: 1662
After the second reduce, Frequent Items set: {1,4}, support: 2497
After the second reduce, Frequent Items set: {2,4}, support: 2497
After the second reduce, Frequent Items set: {1,5}, support: 1999
After the second reduce, Frequent Items set: {1,6}, support: 1662
After the second reduce, Frequent Items set: {2,6}, support: 1662
After the second reduce, Frequent Items set: {1,7}, support: 1427
After the second reduce, Frequent Items set: {3,6}, support: 1662
After the second reduce, Frequent Items set: {1,8}, support: 1248
After the second reduce, Frequent Items set: {1,9}, support: 1109
After the second reduce, Frequent Items set: {2,8}, support: 1248
After the second reduce, Frequent Items set: {3,9}, support: 1109
After the second reduce, Frequent Items set: {4,8}, support: 1248
After the second reduce, Frequent Items set: {1}, support: 10000
After the second reduce, Frequent Items set: {2}, support: 4996
After the second reduce, Frequent Items set: {3}, support: 3329
After the second reduce, Frequent Items set: {4}, support: 2497
After the second reduce, Frequent Items set: {5}, support: 1999
After the second reduce, Frequent Items set: {6}, support: 1662
After the second reduce, Frequent Items set: {7}, support: 1427
After the second reduce, Frequent Items set: {8}, support: 1248
After the second reduce, Frequent Items set: {9}, support: 1109

Project source on GitHub:
https://github.com/scientist272/frequent_item_set_1
