关联规则挖掘——FPTree

FPTree

  • 前面介绍过用Aprior算法来挖掘关联规则,但是Aprior算法的最大缺陷就是生成了候选项集,通过求组合数可以知道候选项集的数量是非常庞大的。
  • 从候选项集到频繁项集,需要不断地扫描整个数据库来计算支持度是否满足min_sup的要求。
  • 因此为了消除这种“候选产生过程”,频繁增长模式(Frequent-Pattern Growth,FP-Growth)产生了。

算法思想

  • 采用分治策略
  • 首先,将代表频繁项集的数据库压缩到一颗频繁模式树(FP-Tree),该树仍然保留项集的关联信息。
  • 然后,把这种压缩后的数据库划分成一组条件数据库,每个数据库关联一个频繁项或者是“模式段”,并分别挖掘每个条件数据库。

算法实现过程

  • 数据库的第一次扫描和Aprior算法相同,他导出频繁1项集的集合,并得到他们的支持度计数。频繁1项集按照支持度递减进行排序,结果记为L。
  • 首先创造根节点,用“null”进行标记。
  • 第二次扫描数据库D,每个事物中的项都按照L中的次序处理(按照递减支持度计数排序),并且对每个事物创建一个分枝。

案例分析

事物ID
001r,z,h,j,p
002z,y,x,w,v,u,t,s
003z
004r,x,n,o,s
005y,r,x,z,q,t,p
006y,z,x,e,q,s,t,m

代码实现

实现过程

  • 首先遍历一次数据库,对数据库中的所有事物进行记录frozenDataDict{transaction:count},通过transaction构造初始的headerTable({item:count}),根据minSupport来筛选出频繁1项集,并且将headerTable修改为{item:[counter, treeNodeLink]}
  • 遍历fronzenDataDict中的所有key值,即每一个事物,从每一个事物当中筛选出属于频繁1项集当中的项,将筛选出来的项按照headerTable当中的counter值降序排列得到orderedItems
  • 将orderItems当中的每一项插入到FPTree当中,对于每一个item都包含两种情况:一种是当前的item已经在子节点children当中,此时只需要将children当中对应节点的counter值增加即可;第二种是children当中并不包含item,那么我们需要创建一个新的节点
  • 在新的节点创建完成之后,我们需要更新headerTable当中的treeNodeLink,此时又存在两种情况:一种是treeNodeLink值为None,那么这个时候只需要将treeNodeLink指向新创建的节点即可;第二种情况就是treeNodeLink不是None,那么这个时候需要遍历treeNodeLink指向的treeNode,直到treeNode的nodeLink为None,这个时候将这个空的nodeLink指向刚刚创建的节点即可
  • 检查完所有的事物之后FPTree创建完成

FPTree中节点的声明

class TreeNode(object):
    def __init__(self, nameValue: str, numOccur: int, parentNode):
        # 项的名字
        self.name = nameValue
        # 项在FPTree当中出现的次数
        self.count = numOccur
        # 相同项的下一个节点
        self.nodeLink = None
        # 父节点
        self.parentNode = parentNode
        # 子节点
        # for example, the children like 'milk': TreeNode('milk')
        self.children = {}

    def inc(self, count):
        self.count += count

    def show(self, ind=1):
        print('   ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.show(ind + 1)

FPTree类以及相关的函数声明

def makeHeaderTable(headerTable: dict) -> dict:
    for item in headerTable:
        headerTable[item] = [headerTable[item], None]

    return headerTable


def getHeaderTable(dataSet, minSupport=1) -> dict:
    headerTable = {}

    for key, value in dataSet.items():
        for item in key:
            headerTable[item] = headerTable.get(item, 0) + value

    lessThanMinSupportList = list(filter(lambda k: headerTable[k] < minSupport, headerTable))
    for x in lessThanMinSupportList:
        del headerTable[x]

    return headerTable


def updateHeaderTable(toastNode: TreeNode, targetNode: TreeNode):
    while toastNode.nodeLink is not None:
        toastNode = toastNode.nodeLink
    toastNode.nodeLink = targetNode


class FPTree:
    def __init__(self, frozenDataDict: dict, headerTable: dict, minSupport: int):
        self.treeNode = TreeNode('null', 1, None)
        # 'milk': [counter, nodeLink]
        self.headerTable = makeHeaderTable(headerTable)
        self.frozenDataDict = frozenDataDict
        self.minSupport = minSupport

    def updateTree(self, treeNode, items: list, count: int):
        item = items[0]
        if item in treeNode.children:
            treeNode.children[item].inc(count)
        else:
            treeNode.children[item] = TreeNode(item, count, treeNode)
            if self.headerTable[item][1] is None:
                self.headerTable[item][1] = treeNode.children[item]
            else:
                updateHeaderTable(self.headerTable[item][1], treeNode.children[item])
        if len(items) > 1:
            self.updateTree(treeNode.children[item], items[1::], count)

    def createFPTree(self):
        freqItems = set(self.headerTable.keys())

        if len(freqItems) == 0:
            self.headerTable = None
            return

        for transaction, count in self.frozenDataDict.items():
            learnSet = {}
            for item in transaction:
                if item in freqItems:
                    learnSet[item] = self.headerTable[item][0]

            if len(learnSet) > 0:
                orderedItems = [item[0] for item in sorted(learnSet.items(), key=lambda k: (k[1], k[0]), reverse=True)]
                self.updateTree(self.treeNode, orderedItems, count)

完整代码

def loadSimpleData():
    simpleData = [['beer', 'milk', 'chicken'], ['milk', 'bread'], ['milk', 'diaper'],
                  ['beer', 'milk', 'bread'], ['beer', 'diaper'], ['milk', 'diaper'],
                  ['beer', 'diaper'], ['beer', 'milk', 'diaper', 'chicken'], ['beer', 'milk', 'diaper']]
    return simpleData


def createInitSet(dataSet: list) -> dict:
    returnSet = {}

    for item in dataSet:
        frozenItem = frozenset(item)
        returnSet[frozenItem] = returnSet.get(frozenItem, 0) + 1

    return returnSet


class TreeNode(object):
    def __init__(self, nameValue: str, numOccur: int, parentNode):
        # 项的名字
        self.name = nameValue
        # 项在FPTree当中出现的次数
        self.count = numOccur
        # 相同项的下一个节点
        self.nodeLink = None
        # 父节点
        self.parentNode = parentNode
        # 子节点
        # for example, the children like 'milk': TreeNode('milk')
        self.children = {}

    def inc(self, count):
        self.count += count

    def show(self, ind=1):
        print('   ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.show(ind + 1)


def getHeaderTable(dataSet, minSupport=1) -> dict:
    headerTable = {}

    for key, value in dataSet.items():
        for item in key:
            headerTable[item] = headerTable.get(item, 0) + value

    lessThanMinSupportList = list(filter(lambda k: headerTable[k] < minSupport, headerTable))
    for x in lessThanMinSupportList:
        del headerTable[x]

    return headerTable


def makeHeaderTable(headerTable: dict) -> dict:
    for item in headerTable:
        headerTable[item] = [headerTable[item], None]

    return headerTable


def updateHeaderTable(toastNode: TreeNode, targetNode: TreeNode):
    while toastNode.nodeLink is not None:
        toastNode = toastNode.nodeLink
    toastNode.nodeLink = targetNode


class FPTree:
    def __init__(self, frozenDataDict: dict, headerTable: dict, minSupport: int):
        self.treeNode = TreeNode('null', 1, None)
        # 'milk': [counter, nodeLink]
        self.headerTable = makeHeaderTable(headerTable)
        self.frozenDataDict = frozenDataDict
        self.minSupport = minSupport

    def updateTree(self, treeNode, items: list, count: int):
        item = items[0]
        if item in treeNode.children:
            treeNode.children[item].inc(count)
        else:
            treeNode.children[item] = TreeNode(item, count, treeNode)
            if self.headerTable[item][1] is None:
                self.headerTable[item][1] = treeNode.children[item]
            else:
                updateHeaderTable(self.headerTable[item][1], treeNode.children[item])
        if len(items) > 1:
            self.updateTree(treeNode.children[item], items[1::], count)

    def createFPTree(self):
        freqItems = set(self.headerTable.keys())

        if len(freqItems) == 0:
            self.headerTable = None
            return

        for transaction, count in self.frozenDataDict.items():
            learnSet = {}
            for item in transaction:
                if item in freqItems:
                    learnSet[item] = self.headerTable[item][0]

            if len(learnSet) > 0:
                orderedItems = [item[0] for item in sorted(learnSet.items(), key=lambda k: (k[1], k[0]), reverse=True)]
                self.updateTree(self.treeNode, orderedItems, count)


def main():
    data = loadSimpleData()
    dataDict = createInitSet(data)
    headerTable = getHeaderTable(dataDict, 3)
    fpTree = FPTree(dataDict, headerTable, 3)
    fpTree.createFPTree()
    fpTree.treeNode.show()
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值