Finding Frequent Itemsets with the Apriori Algorithm

Latest recommended article published 2023-09-13 16:10:47

林下月光

Category: Machine Learning

Article link: https://blog.csdn.net/weixin_41857483/article/details/109580313

0. Preface

The previous post summarized the principles of the Apriori algorithm; this post implements it.

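1. Steps for finding frequent itemsets with Apriori

The Apriori algorithm takes a minimum support (minSupport) and a dataset as input. It first builds the list of all itemsets containing a single item, then scans every transaction and discards the itemsets that fall below the minimum support. The surviving itemsets are combined into itemsets of 2 items, the dataset is scanned again, and the itemsets below the minimum support are discarded. These steps repeat with larger itemsets until no itemset survives.
A concrete example makes this easier to follow. The dataset is shown below, with the minimum support set to 0.5: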

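ID	Items
1	instant noodles, ham sausage
2	instant noodles, diapers, beer, orange juice
3	ham sausage, diapers, beer, cola
4	instant noodles, ham sausage, diapers, beer
5	instant noodles, ham sausage, diapers, cola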

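Algorithm flow:
With the minimum support set to 0.5, the 1-item sets {orange juice} and {cola} have supports of 1/5 = 0.2 and 2/5 = 0.4, both below the threshold, so they are discarded. Itemsets with more items are evaluated the same way.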
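
The support calculation used here can be written as a formula. The notation is an assumption for illustration: D is the set of transactions and X is an itemset.

```latex
\[
\operatorname{support}(X) = \frac{\lvert \{\, t \in D : X \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\operatorname{support}(\{\text{orange juice}\}) = \frac{1}{5} = 0.2,
\quad
\operatorname{support}(\{\text{cola}\}) = \frac{2}{5} = 0.4
\]
```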

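The frequent itemsets finally obtained are:
{instant noodles}, {ham sausage}, {diapers}, {beer}; {instant noodles, ham sausage}, {instant noodles, diapers}, {ham sausage, diapers}, {diapers, beer}.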
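
The result of the shopping example can be checked with a small brute-force script. This is only a sketch for verification, not the Apriori procedure itself: it enumerates every possible itemset, which is feasible here because there are only six distinct items. The function names `support` and `brute_force_frequent` and the English item names are assumptions made for this illustration.

```python
from itertools import combinations

# Transactions from the table above (item names translated for illustration).
transactions = [
    {"instant noodles", "ham sausage"},
    {"instant noodles", "diapers", "beer", "orange juice"},
    {"ham sausage", "diapers", "beer", "cola"},
    {"instant noodles", "ham sausage", "diapers", "beer"},
    {"instant noodles", "ham sausage", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def brute_force_frequent(transactions, min_support):
    """Enumerate every non-empty itemset and keep those meeting min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                frequent[frozenset(combo)] = s
    return frequent


frequent = brute_force_frequent(transactions, 0.5)
# Print smaller itemsets first, then alphabetically within each size.
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The script reports four 1-item sets and four 2-item sets, and no 3-item set reaches the threshold (for example, {instant noodles, ham sausage, diapers} appears in only 2 of 5 transactions).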

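2. Implementing the frequent-itemset search in code

The implementation follows the book Machine Learning in Action (《机器学习实战》).

Create the dataset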

def load_dataset():
    """Build the sample dataset: each inner list is one transaction."""
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

Create the C1 set

def create_c1(dataSet):
    """Build the list of all candidate itemsets that contain a single item."""
    c1 = []
    for tran in dataSet:
        for item in tran:
            if [item] not in c1:
                c1.append([item])
    c1.sort()
    return c1

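Compute the support of the candidate itemsets

def scan_dataset(D, Ck, minSupport):
    """Count the support of each candidate in Ck over the transactions D.

    Returns the candidates whose support is at least minSupport, plus a dict
    with the support of every candidate that occurs in at least one transaction.
    """
    ssCnt = {}  # candidate itemset -> number of transactions containing it
    for tid in D:
        for can in map(frozenset, Ck):
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))  # total number of transactions
    retList = []  # itemsets that meet the minimum support
    supportData = {}  # support of every counted candidate, frequent or not
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:  # keep itemsets at or above the minimum support
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData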

Test

dataSet = load_dataset()
print("Dataset:\n", dataSet)
c1 = create_c1(dataSet)
print("Candidate 1-itemsets:\n", c1)
L1, supportData = scan_dataset(dataSet, c1, 0.5)  # minimum support of 0.5
print("Itemset support:\n", supportData)
print("Frequent itemsets meeting the minimum support:\n", L1)

Output:

Dataset:
 [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
Candidate 1-itemsets:
 [[1], [2], [3], [4], [5]]
Itemset support:
 {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75}
Frequent itemsets meeting the minimum support:
 [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]

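The Apriori algorithm

def aprioriGen(Lk, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets in Lk."""
    retList = []  # candidate k-itemsets
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Sort before slicing: set iteration order is not guaranteed,
            # so the first k-2 items must be taken from the sorted itemset.
            L1 = sorted(Lk[i])[: k - 2]
            L2 = sorted(Lk[j])[: k - 2]
            if L1 == L2:  # same first k-2 items, so the union has exactly k items
                retList.append(Lk[i] | Lk[j])
    return retList


def apriori(dataSet, minSupport=0.5):
    """Find all itemsets whose support is at least minSupport."""
    c1 = create_c1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scan_dataset(D, c1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scan_dataset(D, Ck, minSupport)
        supportData.update(supK)
        if len(Lk) == 0:  # no frequent k-itemsets left, stop
            break
        L.append(Lk)
        k += 1
    return L, supportData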
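
The join condition in aprioriGen can be examined in isolation. The sketch below is a standalone illustration, not a replacement for the code above: `join_step` is a hypothetical name, and it sorts each itemset before taking the first k-2 items. It compares the prefix-based join with a naive join that keeps every size-3 union of two 2-itemsets.

```python
def join_step(frequent_prev, k):
    """Join frequent (k-1)-itemsets that agree on their first k-2 sorted items."""
    candidates = []
    for i in range(len(frequent_prev)):
        for j in range(i + 1, len(frequent_prev)):
            prefix_i = sorted(frequent_prev[i])[: k - 2]
            prefix_j = sorted(frequent_prev[j])[: k - 2]
            if prefix_i == prefix_j:
                candidates.append(frequent_prev[i] | frequent_prev[j])
    return candidates


# Frequent 2-itemsets of the sample dataset at minimum support 0.5.
L2 = [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})]

C3 = join_step(L2, 3)
print(C3)  # → [frozenset({2, 3, 5})]

# Naive alternative: every distinct size-3 union of two 2-itemsets.
naive = {a | b for a in L2 for b in L2 if len(a | b) == 3}
print(len(naive))  # → 3
```

The prefix rule yields {2, 3, 5} exactly once, while the naive join also produces {1, 2, 3} and {1, 3, 5} and reaches {2, 3, 5} through several different pairs, so it needs de-duplication and counts support for more candidates.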

Test

dataSet = load_dataset()
print("Dataset:\n", dataSet)
L, supportData = apriori(dataSet)
print("Itemset support:\n", supportData)
print("Frequent itemsets meeting the minimum support:\n", L)
print("Candidate itemsets generated from L[0]:\n", aprioriGen(L[0], 2))

Lt, supportDatat = apriori(dataSet, minSupport=0.7)
print("Frequent itemsets with a minimum support of 70%:\n", Lt)

Output:

Dataset:
 [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
Itemset support:
 {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}
Frequent itemsets meeting the minimum support:
 [[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})]]
Candidate itemsets generated from L[0]:
 [frozenset({2, 5}), frozenset({3, 5}), frozenset({1, 5}), frozenset({2, 3}), frozenset({1, 2}), frozenset({1, 3})]
Frequent itemsets with a minimum support of 70%:
 [[frozenset({5}), frozenset({2}), frozenset({3})], [frozenset({2, 5})]]
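
The code above scans the dataset for every candidate that the join step produces. Apriori also relies on the property that every subset of a frequent itemset is itself frequent, so a candidate with an infrequent (k-1)-subset can be dropped before any scan. The following is a minimal sketch of such a prune step; the names `has_infrequent_subset` and `prune` are assumptions for illustration and are not part of the implementation above.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_prev):
    """Return True if any (k-1)-subset of candidate is missing from frequent_prev."""
    size = len(candidate) - 1
    return any(
        frozenset(sub) not in frequent_prev
        for sub in combinations(candidate, size)
    )


def prune(candidates, frequent_prev):
    """Keep only the candidates whose (k-1)-subsets are all frequent."""
    prev = set(frequent_prev)  # set lookup instead of list lookup
    return [c for c in candidates if not has_infrequent_subset(c, prev)]


# Frequent 2-itemsets of the sample dataset at minimum support 0.5.
L2 = [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})]
# Hypothetical 3-item candidates, two of which contain an infrequent pair.
candidates = [frozenset({2, 3, 5}), frozenset({1, 2, 3}), frozenset({1, 3, 5})]

print(prune(candidates, L2))  # → [frozenset({2, 3, 5})]
```

Here {1, 2, 3} is dropped because {1, 2} is not frequent, and {1, 3, 5} is dropped because {1, 5} is not frequent, so neither would need a pass over the dataset.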