The Apriori algorithm is the second unsupervised learning algorithm we will cover. With it we can easily discover relationships hidden in a dataset, for example which items occur frequently and which items tend to appear together.
This section covers:
- Background
- Frequent itemsets
- Association rules
Parts of this section are adapted from "Machine Learning in Action".
Background
Imagine the following scenario: a supermarket carries N kinds of goods, and a customer may buy any number of them in a single visit. We have one year of the supermarket's sales records and want to learn which goods are often bought together, and which goods a customer is most likely to buy next given what is already in the cart. In practice many more factors would be tracked, such as weekends and holidays, store location, suppliers, and promotions, but we ignore those here. Once we know which goods are frequently bought, we can stock more of them; once we know the association rules, we can shelve related goods close together. A group of items that is frequently bought together is called a frequent itemset; a relationship between itemsets is called an association rule.
How do we find frequent itemsets and association rules?
First, two definitions:
- Support: the number of records in the dataset that contain the itemset, divided by the total number of records. For example, given the dataset {1}, {2,3}, {3,4}, {2,5}, {1,5}, two of the five records contain {1} (namely {1} and {1,5}), so the support of {1} is 2/5 = 0.4.
- Confidence: defined for a specific rule, e.g. ({1} -> {5}), as support({1,5}) / support({1}). Using the dataset above, confidence({1} -> {5}) = support({1,5}) / support({1}) = (1/5) / (2/5) = 0.5.
With support and confidence quantified, we can declare any itemset whose support exceeds some threshold, say 0.7, a frequent itemset, and any rule whose confidence exceeds some threshold, say 0.5, a reliable association rule.
So how do we compute the support of every possible itemset, and the confidence of the corresponding rules?
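The two definitions can be checked directly in a few lines of Python. This is a quick sketch over the five-record example above; the helper names `support` and `confidence` are just for illustration:

```python
# The five-record example dataset from above.
dataset = [{1}, {2, 3}, {3, 4}, {2, 5}, {1, 5}]

def support(itemset):
    # Fraction of records that contain every element of itemset.
    return sum(itemset <= record for record in dataset) / len(dataset)

def confidence(antecedent, consequent):
    # Confidence of the rule antecedent -> consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({1}))          # 0.4
print(confidence({1}, {5}))  # 0.5
```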
Suppose there are 4 kinds of goods, {0,1,2,3}; there are 15 possible combinations.
We would have to compute the support of each combination to decide whether it is frequent. In general, a set of N items has 2^N − 1 non-empty combinations, so checking every one of them takes exponential time, which is impractical. The Apriori algorithm addresses this by cutting the amount of computation as much as possible.
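To see where the 2^N − 1 figure comes from, a quick enumeration (using `itertools.combinations` just to count the non-empty subsets of 4 items):

```python
from itertools import combinations

items = [0, 1, 2, 3]
# Every non-empty subset of the 4 items: sizes 1 through 4.
subsets = [set(c) for r in range(1, len(items) + 1)
           for c in combinations(items, r)]
print(len(subsets))  # 15, i.e. 2**4 - 1
```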
Frequent Itemsets
When searching for frequent itemsets, the basic idea of the Apriori algorithm is: if an itemset is infrequent, then every superset of it is also infrequent. Any record that contains the superset necessarily contains the subset, so the superset's support is at most the subset's support; if the subset is already infrequent, the superset must be too. For example, in the figure below, {2,3} is infrequent, so all of its supersets {0,2,3}, {1,2,3} and {0,1,2,3} are infrequent as well.
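This monotonicity is easy to verify on a toy dataset. A quick sketch, using the same four transactions that the code below works with; the `support` helper is defined here only for the check:

```python
# The same transactions returned by load_dataset() in the code below.
dataset = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    # Fraction of transactions that contain the whole itemset.
    return sum(itemset <= t for t in dataset) / len(dataset)

# Adding items to a set can only lower (or keep) its support:
print(support({2}))        # 0.75
print(support({2, 3}))     # 0.5
print(support({2, 3, 4}))  # 0.0
```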
With this principle, we first evaluate all 1-item sets and keep only the frequent ones. The surviving items are then combined into 2-item sets, which are filtered the same way, and so on, level by level, until no larger set can be formed.
The following code shows how to find frequent itemsets. Create a module apriori.py with the following code:
def load_dataset():
    # Toy transaction data: each inner list is one customer's purchase record.
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
def create_c1(dataset):
    # Build C1: all candidate itemsets of size 1, as frozensets so they
    # can later be used as dictionary keys.
    c1 = []
    for data in dataset:
        for item in data:
            if [item] not in c1:
                c1.append([item])
    c1.sort()
    return list(map(frozenset, c1))
def scan_d(d, ck, min_support):
    # Count how many transactions in d contain each candidate in ck,
    # then keep the candidates whose support >= min_support.
    ss_cnt = {}
    for tid in d:
        for can in ck:
            if can.issubset(tid):
                ss_cnt[can] = ss_cnt.get(can, 0) + 1
    num_items = float(len(d))
    ret_list = []
    support_data = {}
    for key in ss_cnt:
        support = ss_cnt[key] / num_items
        if support >= min_support:
            ret_list.insert(0, key)
        support_data[key] = support
    return ret_list, support_data
def apriori_gen(lk, k):
    # Merge pairs of frequent (k-1)-item sets whose first k-2 items are
    # equal; the union is then a candidate k-item set. Comparing only the
    # first k-2 items avoids generating the same candidate twice.
    ret_list = []
    len_lk = len(lk)
    for i in range(len_lk):
        for j in range(i + 1, len_lk):
            l1 = sorted(lk[i])[:k - 2]
            l2 = sorted(lk[j])[:k - 2]
            if l1 == l2:
                ret_list.append(lk[i] | lk[j])
    return ret_list
def apriori(dataset, min_support):
    c1 = create_c1(dataset)
    d = list(map(set, dataset))
    l1, support_data = scan_d(d, c1, min_support)
    l = [l1]
    k = 2
    # Keep generating candidates of size k until no frequent set survives.
    while len(l[k - 2]) > 0:
        ck = apriori_gen(l[k - 2], k)
        lk, supk = scan_d(d, ck, min_support)
        support_data.update(supk)
        l.append(lk)
        k += 1
    return l, support_data
if __name__ == '__main__':
print("---Test create_c1---")
dataset = load_dataset()
c1 = create_c1(dataset)
print("C1: %r" % c1)
print("---Test scan_d---")
l1, support_data = scan_d(list(map(set, dataset)), c1, 0.5)
print("L1: %r" % l1)
print("Support data: %r" % support_data)
print("---Test apriori_gen---")
ret_list = apriori_gen(l1, 2)
print(ret_list)
print("---Test apriori---")
l, support_data = apriori(dataset, 0.5)
print("L: %r" % l)
print("Support data: %r" % support_data)
Output:
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/apriori/apriori.py
---Test create_c1---
C1: [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})]
---Test scan_d---
L1: [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]
Support data: {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75}
---Test apriori_gen---
[frozenset({2, 5}), frozenset({3, 5}), frozenset({1, 5}), frozenset({2, 3}), frozenset({1, 2}), frozenset({1, 3})]
---Test apriori---
L: [[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})], []]
Support data: {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}
Process finished with exit code 0
Association Rules
Association rules are built on top of frequent itemsets; only rules over frequent itemsets are worth considering. When searching for rules, the Apriori algorithm uses this idea: for a frequent itemset S, if a rule (S1 -> S2) fails the minimum-confidence requirement, then every rule derived from S whose consequent is a superset of S2 also fails it. For example:
Given that ({0,1,2} -> {3}) fails the minimum-confidence requirement, all the rules drawn in black in the figure above fail it too, because their consequents are all supersets of {3}.
The definition of confidence makes this easy to prove. Let the minimum confidence be C and suppose confidence({0,1,2} -> {3}) < C, i.e. support({0,1,2,3}) / support({0,1,2}) < C. Since support({0,1,2}) <= support({1,2}), we have support({0,1,2,3}) / support({1,2}) <= support({0,1,2,3}) / support({0,1,2}) < C. Therefore confidence({1,2} -> {0,3}) <= confidence({0,1,2} -> {3}) < C.
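The same effect can be observed numerically on the toy dataset used by the code below: moving an item from the antecedent into the consequent never raises the confidence. A quick check, with `support` and `confidence` helpers defined here just for illustration:

```python
dataset = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(s):
    # Fraction of transactions containing the whole itemset.
    return sum(s <= t for t in dataset) / len(dataset)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# Growing the consequent from {5} to {2, 5} cannot raise the confidence:
print(confidence({2, 3}, {5}))  # 1.0
print(confidence({3}, {2, 5}))  # 0.666...
```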
The following code finds the association rules. Create a module apriori_rule.py with the following code:
import apriori as apriori_base
def generate_rules(L, support_data, min_conf=0.7):
    # Walk over every frequent itemset of two or more items and extract
    # the rules that meet the minimum confidence.
    big_rule_list = []
    for i in range(1, len(L)):  # L[0] holds the 1-item sets, which yield no rules
        for freq_set in L[i]:
            h1 = [frozenset([item]) for item in freq_set]
            if i > 1:
                # Sets of three or more items: grow consequents recursively.
                rules_from_conseq(freq_set, h1, support_data, big_rule_list, min_conf)
            else:
                # Two-item sets: test the single-item consequents directly.
                calc_conf(freq_set, h1, support_data, big_rule_list, min_conf)
    return big_rule_list
def calc_conf(freq_set, H, support_data, brl, min_conf=0.7):
    # Test each candidate consequent in H and keep those whose rule
    # (freq_set - conseq) -> conseq meets the confidence threshold.
    pruned_h = []
    for conseq in H:
        conf = support_data[freq_set] / support_data[freq_set - conseq]
        if conf >= min_conf:
            print("%r --> %r, conf: %r" % (freq_set - conseq, conseq, conf))
            brl.append((freq_set - conseq, conseq, conf))
            pruned_h.append(conseq)
    return pruned_h
def rules_from_conseq(freq_set, H, support_data, brl, min_conf=0.7):
    # H holds the current m-item consequent candidates; merge them into
    # (m + 1)-item consequents, test those, and recurse while the
    # consequent can still grow.
    m = len(H[0])
    if len(freq_set) > (m + 1):
        hmp1 = apriori_base.apriori_gen(H, m + 1)
        hmp1 = calc_conf(freq_set, hmp1, support_data, brl, min_conf)
        if len(hmp1) > 1:
            rules_from_conseq(freq_set, hmp1, support_data, brl, min_conf)
if __name__ == '__main__':
dataset = apriori_base.load_dataset()
l, support_data = apriori_base.apriori(dataset, 0.5)
print(l)
print(support_data)
rules = generate_rules(l, support_data, 0.7)
print(rules)
Output:
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/apriori/apriori_rule.py
[[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})], []]
{frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}
frozenset({5}) --> frozenset({2}), conf: 1.0
frozenset({2}) --> frozenset({5}), conf: 1.0
frozenset({1}) --> frozenset({3}), conf: 1.0
[(frozenset({5}), frozenset({2}), 1.0), (frozenset({2}), frozenset({5}), 1.0), (frozenset({1}), frozenset({3}), 1.0)]
Process finished with exit code 0
Lowering the minimum confidence to 0.6 yields more rules:
D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/apriori/apriori_rule.py
[[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})], []]
{frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}
frozenset({3}) --> frozenset({2}), conf: 0.6666666666666666
frozenset({2}) --> frozenset({3}), conf: 0.6666666666666666
frozenset({5}) --> frozenset({3}), conf: 0.6666666666666666
frozenset({3}) --> frozenset({5}), conf: 0.6666666666666666
frozenset({5}) --> frozenset({2}), conf: 1.0
frozenset({2}) --> frozenset({5}), conf: 1.0
frozenset({3}) --> frozenset({1}), conf: 0.6666666666666666
frozenset({1}) --> frozenset({3}), conf: 1.0
frozenset({5}) --> frozenset({2, 3}), conf: 0.6666666666666666
frozenset({3}) --> frozenset({2, 5}), conf: 0.6666666666666666
frozenset({2}) --> frozenset({3, 5}), conf: 0.6666666666666666
[(frozenset({3}), frozenset({2}), 0.6666666666666666), (frozenset({2}), frozenset({3}), 0.6666666666666666), (frozenset({5}), frozenset({3}), 0.6666666666666666), (frozenset({3}), frozenset({5}), 0.6666666666666666), (frozenset({5}), frozenset({2}), 1.0), (frozenset({2}), frozenset({5}), 1.0), (frozenset({3}), frozenset({1}), 0.6666666666666666), (frozenset({1}), frozenset({3}), 1.0), (frozenset({5}), frozenset({2, 3}), 0.6666666666666666), (frozenset({3}), frozenset({2, 5}), 0.6666666666666666), (frozenset({2}), frozenset({3, 5}), 0.6666666666666666)]
Process finished with exit code 0