Association Analysis with the Apriori Algorithm
Algorithm overview
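Association analysis is the task of finding interesting relationships in large datasets. These relationships take two forms: frequent itemsets and association rules. A frequent itemset is a collection of items that often appear together; an association rule suggests that a strong relationship may exist between two items. The following example illustrates both.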
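Transaction ID | Items |
---|---|
0 | soy milk, lettuce |
1 | lettuce, diapers, wine, beets |
2 | soy milk, diapers, wine, orange juice |
3 | lettuce, soy milk, diapers, wine |
4 | lettuce, soy milk, diapers, orange juice |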
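In the table, {wine, diapers, soy milk} is an example of a frequent itemset. We can also find association rules such as diapers -> wine, which says that a customer who buys diapers is also likely to buy wine. With frequent itemsets and association rules, retailers can better understand their customers' shopping habits.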
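To make the table concrete, the following minimal sketch counts how often each pair of items is bought together. The variable names (`transactions`, `pair_counts`, `triple_count`) and the English item names are assumptions chosen for illustration; they are not part of the example code in this article.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table (item names translated to English).
transactions = [
    {"soy milk", "lettuce"},
    {"lettuce", "diapers", "wine", "beets"},
    {"soy milk", "diapers", "wine", "orange juice"},
    {"lettuce", "soy milk", "diapers", "wine"},
    {"lettuce", "soy milk", "diapers", "orange juice"},
]

# Count every unordered pair of items that shares a transaction.
# Sorting the items gives each pair one canonical tuple key.
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Number of transactions that contain all of {wine, diapers, soy milk}.
triple = {"wine", "diapers", "soy milk"}
triple_count = sum(1 for items in transactions if triple <= items)

print(pair_counts[("diapers", "wine")])  # 3
print(triple_count)                      # 2
```

Diapers and wine appear together in three of the five transactions, and the three-item set appears in two of them.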
Frequent itemsets
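Our goal is to find sets of items that are frequently purchased together. A dataset with N distinct items has $2^N - 1$ possible non-empty itemsets, so checking every one of them quickly becomes infeasible as N grows. To reduce the computation time, researchers identified the Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. In practice it is used in the contrapositive form: if an itemset is infrequent, then all of its supersets are also infrequent. This is the prior ("a priori") knowledge that the algorithm exploits. Frequent itemsets are quantified by support: the support of an itemset is the fraction of transactions in the dataset that contain it.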
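The definitions above can be checked on the grocery table with a short sketch. The `support` helper and the variable names are assumptions introduced for illustration only. The sketch counts the non-empty itemsets, computes one support value, and verifies that no superset of an infrequent itemset has higher support than the itemset itself.

```python
from itertools import combinations

# The five grocery transactions (item names translated to English).
transactions = [
    frozenset({"soy milk", "lettuce"}),
    frozenset({"lettuce", "diapers", "wine", "beets"}),
    frozenset({"soy milk", "diapers", "wine", "orange juice"}),
    frozenset({"lettuce", "soy milk", "diapers", "wine"}),
    frozenset({"lettuce", "soy milk", "diapers", "orange juice"}),
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Enumerate every non-empty itemset over the 6 distinct items.
items = sorted(set().union(*transactions))
all_itemsets = [frozenset(c)
                for r in range(1, len(items) + 1)
                for c in combinations(items, r)]
print(len(all_itemsets))  # 63, i.e. 2**6 - 1

print(support({"wine", "diapers", "soy milk"}, transactions))  # 0.4

# Apriori principle, contrapositive form: {beets} has support 0.2,
# so no proper superset of it can have support above 0.2.
infrequent = frozenset({"beets"})
supersets = [s for s in all_itemsets if infrequent < s]
print(all(support(s, transactions) <= support(infrequent, transactions)
          for s in supersets))  # True
```

With a minimum support of 0.5, for example, none of the 31 supersets of {beets} would ever need to be counted.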
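Pseudocode for finding frequent itemsets:
    While the number of itemsets in the current list is greater than 0:
        Build a list of candidate itemsets of length k
        Scan the data to check whether each candidate itemset is frequent
        Keep the frequent itemsets and use them to build a list of candidate itemsets of length k+1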
Example code:
    def loadDataSet():
        """Return a small toy dataset; each inner list is one transaction."""
        return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]


    def createC1(dataSet):
        """Build C1, the list of all candidate itemsets of size 1."""
        C1 = []
        for transaction in dataSet:
            for item in transaction:
                if [item] not in C1:
                    C1.append([item])
        C1.sort()
        # A list cannot be used as a dict key, but a frozenset can.
        # In Python 3, map() returns an iterator, so wrap it in list().
        return list(map(frozenset, C1))


    def scanD(D, Ck, minSupport):
        """Return the candidates in Ck that meet minSupport, plus the support of every counted candidate."""
        ssCnt = {}
        for tid in D:
            for can in Ck:
                if can.issubset(tid):
                    # Python 3 dicts have no has_key(); use the `in` operator.
                    if can not in ssCnt:
                        ssCnt[can] = 1
                    else:
                        ssCnt[can] += 1
        numItems = float(len(D))
        retList = []
        supportData = {}
        for key in ssCnt:
            support = ssCnt[key] / numItems
            if support >= minSupport:
                retList.insert(0, key)
            supportData[key] = support
        return retList, supportData


    def aprioriGen(Lk, k):
        """Create candidate itemsets of size k by joining itemsets of size k-1."""
        retList = []
        lenLk = len(Lk)
        for i in range(lenLk):
            for j in range(i + 1, lenLk):
                # Sort before slicing: set iteration order is not guaranteed,
                # so the first k-2 items must come from a sorted list.
                L1 = sorted(Lk[i])[:k - 2]
                L2 = sorted(Lk[j])[:k - 2]
                if L1 == L2:
                    retList.append(Lk[i] | Lk[j])
        return retList


    def apriori(dataSet, minSupport=0.5):
        """Return the frequent itemsets, grouped by size, and their support values."""
        C1 = createC1(dataSet)
        D = list(map(set, dataSet))
        L1, supportData = scanD(D, C1, minSupport)
        L = [L1]
        k = 2
        while len(L[k - 2]) > 0:
            Ck = aprioriGen(L[k - 2], k)
            Lk, supK = scanD(D, Ck, minSupport)
            supportData.update(supK)
            L.append(Lk)
            k += 1
        return L, supportData
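Assuming the code above is saved as apriori.py:

    >>> import apriori
    >>> dataSet = apriori.loadDataSet()
    >>> L, supportData = apriori.apriori(dataSet)
    >>> print(L)
    [[frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})], [frozenset({3, 5}), frozenset({1, 3}), frozenset({2, 5}), frozenset({2, 3})], [frozenset({2, 3, 5})], []]

The order of the itemsets within each list may differ between Python versions. The trailing empty list comes from the final pass, which finds no frequent itemsets of size 4 and ends the loop.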
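As a sanity check on the result above, the following brute-force sketch enumerates every non-empty itemset of the toy dataset and keeps those with support of at least 0.5. It is not the Apriori algorithm (it does no pruning) and is only practical for tiny datasets; the names `transactions` and `frequent` are assumptions for illustration.

```python
from itertools import combinations

# Same toy dataset that loadDataSet() returns.
dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
transactions = [frozenset(t) for t in dataset]
items = sorted(set().union(*transactions))
min_support = 0.5

# Test every possible itemset directly against the data.
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        candidate = frozenset(combo)
        hits = sum(1 for t in transactions if candidate <= t)
        sup = hits / len(transactions)
        if sup >= min_support:
            frequent[candidate] = sup

# Print the frequent itemsets ordered by size, then by content.
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])

print(len(frequent))  # 9
```

The nine itemsets found this way (four of size 1, four of size 2, and {2, 3, 5}) are the same ones contained in `L`.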
Mining association rules from frequent itemsets
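Association rules are quantified by confidence. The confidence of a rule $P \rightarrow H$ is defined as $\mathrm{support}(P \cup H)/\mathrm{support}(P)$, where the support values were already computed while mining the frequent itemsets. Confidence obeys a pruning property similar to the Apriori principle. For rules generated from the same frequent itemset, if 0,1,2 -> 3 does not meet the minimum confidence, then no rule whose left-hand side is a subset of {0,1,2} (for example 0,1 -> 2,3) can meet it either, because the numerator stays the same while the support of the left-hand side can only grow.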
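The confidence formula can be worked through on the toy dataset `[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]`. This is a minimal sketch; the helper names `support` and `confidence` are assumptions for illustration, and the code assumes the left-hand side of a rule occurs in at least one transaction.

```python
transactions = [frozenset(t) for t in ([1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5])]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(P union H) / support(P) for the rule P -> H.

    Assumes support(P) > 0, otherwise the division fails.
    """
    p = frozenset(antecedent)
    h = frozenset(consequent)
    return support(p | h, transactions) / support(p, transactions)

# Confidence is not symmetric: 1 -> 3 always holds, 3 -> 1 does not.
print(confidence({1}, {3}, transactions))  # 1.0
print(confidence({3}, {1}, transactions))  # 0.6666666666666666

# Within the itemset {2, 3, 5}, moving an item from the left-hand side
# to the right-hand side never increases the confidence.
print(confidence({2, 3}, {5}, transactions))  # 1.0
print(confidence({2}, {3, 5}, transactions))  # 0.6666666666666666
```

The last two lines illustrate the pruning property: a rule with a smaller left-hand side drawn from the same itemset cannot have higher confidence.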
Example code for association rules:
    def generateRules(L, supportData, minConf=0.7):
        """Generate rules as (antecedent, consequent, confidence) tuples from the frequent itemsets in L."""
        bigRuleList = []
        for i in range(1, len(L)):  # only itemsets with two or more items
            for freqSet in L[i]:
                H1 = [frozenset([item]) for item in freqSet]
                if i > 1:
                    # Note: for itemsets with three or more items this call starts
                    # from two-item consequents, so single-item consequents are
                    # not evaluated for them.
                    rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
                else:
                    calcConf(freqSet, H1, supportData, bigRuleList, minConf)
        return bigRuleList


    def calcConf(freqSet, H, supportData, brl, minConf=0.7):
        """Append rules that meet minConf to brl and return the consequents that passed."""
        prunedH = []
        for conseq in H:
            conf = supportData[freqSet] / supportData[freqSet - conseq]
            if conf >= minConf:
                print(freqSet - conseq, '-->', conseq, 'conf:', conf)
                brl.append((freqSet - conseq, conseq, conf))
                prunedH.append(conseq)
        return prunedH


    def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
        """Recursively build larger consequents from the consequents in H."""
        m = len(H[0])
        if len(freqSet) > (m + 1):  # try further merging
            Hmp1 = aprioriGen(H, m + 1)  # create candidate consequents of size m+1
            Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
            if len(Hmp1) > 1:  # need at least two sets to merge
                rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
    >>> import apriori
    >>> dataSet = apriori.loadDataSet()
    >>> L, supportData = apriori.apriori(dataSet)
    >>> rules = apriori.generateRules(L, supportData, 0.5)
    >>> print(rules)
    frozenset({5}) --> frozenset({3}) conf: 0.6666666666666666
    frozenset({3}) --> frozenset({5}) conf: 0.6666666666666666
    frozenset({3}) --> frozenset({1}) conf: 0.6666666666666666
    frozenset({1}) --> frozenset({3}) conf: 1.0
    frozenset({5}) --> frozenset({2}) conf: 1.0
    frozenset({2}) --> frozenset({5}) conf: 1.0
    frozenset({3}) --> frozenset({2}) conf: 0.6666666666666666
    frozenset({2}) --> frozenset({3}) conf: 0.6666666666666666
    frozenset({5}) --> frozenset({2, 3}) conf: 0.6666666666666666
    frozenset({3}) --> frozenset({2, 5}) conf: 0.6666666666666666
    frozenset({2}) --> frozenset({3, 5}) conf: 0.6666666666666666

The lines shown are printed by calcConf while the rules are being generated; the list of tuples that print(rules) writes afterwards is omitted here.
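As an independent cross-check (not part of the original code), the sketch below enumerates every possible rule from every frequent itemset of the toy dataset by brute force, using a minimum support and minimum confidence of 0.5. On this dataset it finds 14 rules, three more than the 11 printed above. The extra rules are the ones with a two-item left-hand side, such as {2, 3} -> {5}, which `generateRules` does not evaluate because `rulesFromConseq` starts from two-item consequents. All names here are assumptions for illustration.

```python
from itertools import combinations

dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
transactions = [frozenset(t) for t in dataset]
min_support, min_conf = 0.5, 0.5

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Frequent itemsets with at least two items, found by brute force.
items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for size in range(2, len(items) + 1)
            for c in combinations(items, size)
            if support(frozenset(c)) >= min_support]

# Try every non-empty proper subset of each itemset as the consequent.
rules = []
for itemset in frequent:
    for size in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), size):
            conseq = frozenset(combo)
            antecedent = itemset - conseq
            conf = support(itemset) / support(antecedent)
            if conf >= min_conf:
                rules.append((antecedent, conseq, conf))

print(len(rules))  # 14

# Rules with a two-item left-hand side.
two_item_antecedents = [r for r in rules if len(r[0]) == 2]
for antecedent, conseq, conf in two_item_antecedents:
    print(sorted(antecedent), '-->', sorted(conseq), 'conf:', conf)
```

One possible fix, if those rules are wanted, is to also call `calcConf` on the single-item consequents before recursing for itemsets with three or more items.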
Characteristics of the algorithm
Pros: easy to implement in code
Cons: can be slow on large datasets
Works with: numeric or nominal data