Machine Learning: Association Analysis with Apriori

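Apriori is the second unsupervised learning algorithm we cover. It makes it easy to uncover relationships hidden in a dataset, for example which items occur most often and which items tend to occur together.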

This section covers:

  1. Background
  2. Frequent item sets
  3. Association rules

Parts of this section are adapted from "Machine Learning in Action".


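Background

Picture a supermarket that stocks N kinds of products, where a customer may buy any combination of them in one visit. Given a year of sales records, we want to find out which products are often bought together, and which products a customer is most likely to buy after buying certain others. A real analysis would also track further variables, such as weekends and holidays, store location, suppliers, and promotions; we ignore those here. Once we know which products are bought often, we can stock more of them; once we know the association rules, we can place related products close to each other. Sets of products that are bought frequently are called frequent item sets, and the relationships between item sets are called association rules.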

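How do we find frequent item sets and association rules?

We first introduce two concepts:

  1. Support: the number of records in the dataset that contain the item set, divided by the total number of records. For example, take the dataset {1}, {2,3}, {3,4}, {2,5}, {1,5}. Two records contain {1}, namely {1} and {1,5}, out of five records in total, so the support of the item set {1} is 2/5 = 0.4.
  2. Confidence: defined for a rule such as ({1} -> {5}), as support({1,5}) / support({1}). In the example above, confidence({1} -> {5}) = support({1,5}) / support({1}) = (1/5) / (2/5) = 0.5.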
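
The two definitions above can be checked directly on the five example records. The following is a minimal sketch; the helper names `support` and `confidence` are chosen here for illustration and are not part of the modules built later in this post.

```python
def support(dataset, itemset):
    """Fraction of records that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for record in dataset if itemset.issubset(record))
    return hits / len(dataset)


def confidence(dataset, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(dataset, both) / support(dataset, antecedent)


# The five records from the example above.
records = [{1}, {2, 3}, {3, 4}, {2, 5}, {1, 5}]

print(support(records, {1}))          # 0.4
print(confidence(records, {1}, {5}))  # 0.5
```

Note that confidence is directional: confidence({5} -> {1}) divides by support({5}) instead, so it need not equal confidence({1} -> {5}) in general.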

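With support and confidence defined quantitatively, we can set thresholds: item sets whose support exceeds some value, say 0.7, count as frequent item sets, and rules whose confidence exceeds some value, say 0.5, count as trustworthy association rules.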

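So how do we compute the support of every possible item set and the confidence of the corresponding rules?

Suppose there are 4 products, {0,1,2,3}. There are 15 possible non-empty combinations, from the single items {0}, {1}, {2}, {3} up to the full set {0,1,2,3}.

We would have to compute the support of every combination to decide whether it is a frequent item set. In general a set of N items has 2^N - 1 non-empty subsets, so checking all of them takes exponential time and quickly becomes impractical. Apriori addresses this by discarding candidates that cannot be frequent before their support is ever counted.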
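
To make the 2^N - 1 count concrete, the candidates can be enumerated by brute force. This is only an illustrative sketch using the standard `itertools.combinations`; the function name `all_itemsets` is made up for this example.

```python
from itertools import combinations


def all_itemsets(items):
    """Every non-empty subset of `items`, smallest sets first."""
    items = sorted(items)
    return [frozenset(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


candidates = all_itemsets({0, 1, 2, 3})
print(len(candidates))  # 15, i.e. 2**4 - 1
```

With 20 products the same call would already produce 1,048,575 candidates, which is why brute force does not scale.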

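Frequent item sets

When searching for frequent item sets, Apriori relies on one idea: if an item set is infrequent, every superset of it is infrequent as well. Any record that contains the superset also contains the subset, so the number of records containing the superset is <= the number containing the subset, and therefore support(superset) <= support(subset). If the subset already falls below the threshold, the superset must too. For example, if {2,3} is infrequent, then its supersets {0,2,3}, {1,2,3} and {0,1,2,3} are all infrequent.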
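
The claim that a superset never has higher support than its subset can be verified exhaustively on a small dataset. The sketch below uses four illustrative transactions and a throwaway `support` helper; both are assumptions made for this example only.

```python
from itertools import combinations

# Four illustrative transactions.
records = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def support(itemset):
    """Fraction of records that contain `itemset`."""
    return sum(1 for r in records if itemset <= r) / len(records)


items = sorted(set().union(*records))
itemsets = [frozenset(c)
            for n in range(1, len(items) + 1)
            for c in combinations(items, n)]

# Collect every (subset, superset) pair that would break the property.
violations = [(a, b) for a in itemsets for b in itemsets
              if a < b and support(b) > support(a)]
print(violations)  # []
```

An empty list means no pair of item sets in this dataset contradicts the property, which is what the counting argument above predicts.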

 

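With this principle we can start with the one-item sets, decide which are frequent, and discard the rest. The surviving sets are combined into two-item candidates, which are filtered the same way. We repeat this level by level until no new frequent item sets are found.

Let's implement frequent item set discovery in code. Create a module apriori.py with the following code: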

def load_dataset():
    # Toy dataset: each inner list is one transaction.
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]


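def create_c1(dataset):
    # C1: every distinct item as a one-item candidate set.
    c1 = []
    for data in dataset:
        for item in data:
            if [item] not in c1:
                c1.append([item])
    c1.sort()
    # frozenset is hashable, so candidates can be used as dict keys.
    return list(map(frozenset, c1))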


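def scan_d(d, ck, min_support):
    # Count how many transactions in d contain each candidate in ck.
    ss_cnt = {}
    for tid in d:
        for can in ck:
            if can.issubset(tid):
                if can not in ss_cnt:
                    ss_cnt[can] = 1
                else:
                    ss_cnt[can] += 1
    num_items = float(len(d))
    ret_list = []
    support_data = {}
    for key in ss_cnt:
        # Support = fraction of transactions containing the candidate.
        support = ss_cnt[key] / num_items
        if support >= min_support:
            ret_list.insert(0, key)
        # Record support for every counted candidate, frequent or not.
        support_data[key] = support
    return ret_list, support_data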


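def apriori_gen(lk, k):
    # Join step: merge two (k-1)-item sets whose first k-2 items (in sorted
    # order) match, so each k-item candidate is produced only once.
    ret_list = []
    len_lk = len(lk)
    for i in range(len_lk):
        for j in range(i + 1, len_lk):
            # Sort before slicing: set iteration order is not guaranteed.
            l1 = sorted(lk[i])[:k - 2]
            l2 = sorted(lk[j])[:k - 2]
            if l1 == l2:
                ret_list.append(lk[i] | lk[j])
    return ret_list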


def apriori(dataset, min_support):
    c1 = create_c1(dataset)
    d = list(map(set, dataset))
    l1, support_data = scan_d(d, c1, min_support)
    l = [l1]  # l[k - 1] holds the frequent k-item sets
    k = 2
    # Grow candidates level by level until a level has no frequent sets.
    while len(l[k - 2]) > 0:
        ck = apriori_gen(l[k - 2], k)
        lk, supk = scan_d(d, ck, min_support)
        support_data.update(supk)
        l.append(lk)
        k += 1
    return l, support_data


if __name__ == '__main__':
    print("---Test create_c1---")
    dataset = load_dataset()
    c1 = create_c1(dataset)
    print("C1: %r" % c1)
    print("---Test scan_d---")
    l1, support_data = scan_d(list(map(set, dataset)), c1, 0.5)
    print("L1: %r" % l1)
    print("Support data: %r" % support_data)
    print("---Test apriori_gen---")
    ret_list = apriori_gen(l1, 2)
    print(ret_list)
    print("---Test apriori---")
    l, support_data = apriori(dataset, 0.5)
    print("L: %r" % l)
    print("Support data: %r" % support_data)

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/apriori/apriori.py
---Test create_c1---
C1: [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})]
---Test scan_d---
L1: [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]
Support data: {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75}
---Test apriori_gen---
[frozenset({2, 5}), frozenset({3, 5}), frozenset({1, 5}), frozenset({2, 3}), frozenset({1, 2}), frozenset({1, 3})]
---Test apriori---
L: [[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})], []]
Support data: {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}

Process finished with exit code 0
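
As a cross-check on the result above, here is a compact, independent sketch of the same level-wise search written with Python sets. It is not the module above: the function name `frequent_itemsets` is hypothetical, and it adds an explicit prune step (every (k-1)-item subset of a candidate must itself be frequent) that the `apriori_gen` shown above does not perform.

```python
from itertools import combinations


def frequent_itemsets(records, min_support):
    """Return a list of levels; levels[k - 1] is the set of frequent k-item sets."""
    records = [frozenset(r) for r in records]
    total = len(records)

    def support(itemset):
        return sum(1 for r in records if itemset <= r) / total

    # Level 1: frequent single items.
    current = {frozenset([item]) for r in records for item in r}
    current = {s for s in current if support(s) >= min_support}

    levels = []
    k = 1
    while current:
        levels.append(current)
        k += 1
        # Join: unions of two frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-item subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return levels


records = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
for level in frequent_itemsets(records, 0.5):
    print(sorted(sorted(s) for s in level))
```

On this dataset the three printed levels contain the same item sets as the list L in the output above, without the trailing empty level.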

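Association rules

Association rules are built on top of frequent item sets; only rules derived from frequent item sets are worth considering. When searching for rules, Apriori again relies on a pruning idea: for an item set S, if a rule (S1 -> S2) built from S fails the minimum confidence requirement, then every rule built from S whose consequent is a superset of S2 fails it as well. For example:

Suppose ({0,1,2} -> {3}) fails the minimum confidence requirement. Then every other rule built from {0,1,2,3} whose consequent contains 3, such as ({0,1} -> {2,3}), ({1,2} -> {0,3}) or ({0} -> {1,2,3}), fails too, because each of these consequents is a superset of {3}.

This follows directly from the definition of confidence. Let the minimum confidence be C and suppose confidence({0,1,2} -> {3}) < C, that is, support({0,1,2,3}) / support({0,1,2}) < C. Since support({0,1,2}) <= support({1,2}), we have support({0,1,2,3}) / support({1,2}) <= support({0,1,2,3}) / support({0,1,2}) < C. Hence confidence({1,2} -> {0,3}) <= confidence({0,1,2} -> {3}) < C.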
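
The inequality in the proof can also be checked numerically. The sketch below uses four illustrative transactions and the item set {2,3,5}; the helper names are assumptions made for this example. For every pair of consequents where one is a proper subset of the other, it asserts that the rule with the larger consequent never has higher confidence.

```python
from itertools import combinations

# Four illustrative transactions.
records = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def support(itemset):
    return sum(1 for r in records if itemset <= r) / len(records)


def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)


def proper_subsets(itemset):
    """All non-empty proper subsets of `itemset`."""
    items = sorted(itemset)
    return [frozenset(c)
            for n in range(1, len(items))
            for c in combinations(items, n)]


full = frozenset({2, 3, 5})
for small in proper_subsets(full):
    for big in proper_subsets(full):
        if small < big:
            # Moving items from the antecedent into the consequent
            # can only lower (or keep) the confidence.
            assert confidence(full - big, big) <= confidence(full - small, small)

print(confidence(frozenset({2, 3}), frozenset({5})))  # 1.0
print(confidence(frozenset({2}), frozenset({3, 5})))
```

Here ({2,3} -> {5}) has confidence 1.0, while ({2} -> {3,5}), whose consequent is a superset of {5}, comes out lower at about 0.67.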

Now let's find association rules in code. Create a module apriori_rule.py with the following code:

import apriori as apriori_base


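def generate_rules(L, support_data, min_conf=0.7):
    big_rule_list = []
    for i in range(1, len(L)):  # only get the sets with two or more items
        for freq_set in L[i]:
            # Start from single-item consequents.
            h1 = [frozenset([item]) for item in freq_set]
            if i > 1:
                # Sets with 3+ items: merge consequents and test them.
                # Note: as written, this starts at two-item consequents, so
                # single-item consequents of these sets are never tested.
                rules_from_conseq(freq_set, h1, support_data, big_rule_list, min_conf)
            else:
                calc_conf(freq_set, h1, support_data, big_rule_list, min_conf)
    return big_rule_list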


def calc_conf(freq_set, H, support_data, brl, min_conf=0.7):
    # Keep the consequents in H whose rule (freq_set - conseq -> conseq)
    # reaches min_conf, and record those rules in brl.
    pruned_h = []
    for conseq in H:
        conf = support_data[freq_set] / support_data[freq_set - conseq]
        if conf >= min_conf:
            print("%r --> %r, conf: %r" % (freq_set - conseq, conseq, conf))
            brl.append((freq_set - conseq, conseq, conf))
            pruned_h.append(conseq)
    return pruned_h


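def rules_from_conseq(freq_set, H, support_data, brl, min_conf=0.7):
    m = len(H[0])  # size of the consequents currently in H
    if len(freq_set) > (m + 1):
        # Merge the m-item consequents into (m+1)-item candidates.
        hmp1 = apriori_base.apriori_gen(H, m + 1)
        # Only consequents that pass min_conf are extended further.
        hmp1 = calc_conf(freq_set, hmp1, support_data, brl, min_conf)
        if len(hmp1) > 1:
            rules_from_conseq(freq_set, hmp1, support_data, brl, min_conf)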


if __name__ == '__main__':
    dataset = apriori_base.load_dataset()
    l, support_data = apriori_base.apriori(dataset, 0.5)
    print(l)
    print(support_data)
    rules = generate_rules(l, support_data, 0.7)
    print(rules)

Output:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/apriori/apriori_rule.py
[[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})], []]
{frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}
frozenset({5}) --> frozenset({2}), conf: 1.0
frozenset({2}) --> frozenset({5}), conf: 1.0
frozenset({1}) --> frozenset({3}), conf: 1.0
[(frozenset({5}), frozenset({2}), 1.0), (frozenset({2}), frozenset({5}), 1.0), (frozenset({1}), frozenset({3}), 1.0)]

Process finished with exit code 0

Lowering the minimum confidence to 0.6 yields more rules:

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/apriori/apriori_rule.py
[[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})], []]
{frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}
frozenset({3}) --> frozenset({2}), conf: 0.6666666666666666
frozenset({2}) --> frozenset({3}), conf: 0.6666666666666666
frozenset({5}) --> frozenset({3}), conf: 0.6666666666666666
frozenset({3}) --> frozenset({5}), conf: 0.6666666666666666
frozenset({5}) --> frozenset({2}), conf: 1.0
frozenset({2}) --> frozenset({5}), conf: 1.0
frozenset({3}) --> frozenset({1}), conf: 0.6666666666666666
frozenset({1}) --> frozenset({3}), conf: 1.0
frozenset({5}) --> frozenset({2, 3}), conf: 0.6666666666666666
frozenset({3}) --> frozenset({2, 5}), conf: 0.6666666666666666
frozenset({2}) --> frozenset({3, 5}), conf: 0.6666666666666666
[(frozenset({3}), frozenset({2}), 0.6666666666666666), (frozenset({2}), frozenset({3}), 0.6666666666666666), (frozenset({5}), frozenset({3}), 0.6666666666666666), (frozenset({3}), frozenset({5}), 0.6666666666666666), (frozenset({5}), frozenset({2}), 1.0), (frozenset({2}), frozenset({5}), 1.0), (frozenset({3}), frozenset({1}), 0.6666666666666666), (frozenset({1}), frozenset({3}), 1.0), (frozenset({5}), frozenset({2, 3}), 0.6666666666666666), (frozenset({3}), frozenset({2, 5}), 0.6666666666666666), (frozenset({2}), frozenset({3, 5}), 0.6666666666666666)]

Process finished with exit code 0
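
One thing to notice in both runs: for the three-item set {2,3,5}, generate_rules hands the single-item consequents straight to rules_from_conseq, which begins by merging them into two-item consequents. Rules with a single-item consequent, such as ({2,3} -> {5}), are therefore never tested for that set. The following brute-force sketch is a cross-check only, not the post's method; the names `support` and `all_rules` are hypothetical, and supports are recomputed from the raw transactions rather than read from support_data.

```python
from itertools import combinations

records = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def support(itemset):
    return sum(1 for r in records if itemset <= r) / len(records)


def all_rules(freq_set, min_conf):
    """Try every non-empty proper subset of freq_set as a consequent."""
    items = sorted(freq_set)
    rules = []
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            conseq = frozenset(combo)
            antecedent = freq_set - conseq
            conf = support(freq_set) / support(antecedent)
            if conf >= min_conf:
                rules.append((antecedent, conseq, conf))
    return rules


for antecedent, conseq, conf in all_rules(frozenset({2, 3, 5}), 0.7):
    print(sorted(antecedent), "-->", sorted(conseq), "conf:", conf)
```

With a minimum confidence of 0.7 this reports ({3,5} -> {2}) and ({2,3} -> {5}), both with confidence 1.0, which do not appear in the output above. A brute-force scan gives up the consequent pruning, so it is only reasonable for small item sets.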

 
