Implementing the Apriori Data Mining Algorithm in Python

SRE实战派

Category: Python学习 (Python Learning)

Article link: https://blog.csdn.net/qq_44924544/article/details/110873758

Implementation Approach

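First, obtain the transaction set and the minimum support. The transaction set can be entered by hand or predefined; here I predefine it as a dictionary and read the minimum support count from the user with input().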
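Since the minimum support comes from user input, it helps to validate it before the algorithm runs. The snippet below is a minimal sketch, not part of the original listing: `parse_min_support` and its `n_transactions` parameter are hypothetical names, and the accepted range of 1 to the number of transactions is an assumption about what counts as a sensible support count.

```python
def parse_min_support(text, n_transactions):
    """Convert raw user input into a support count (hypothetical helper)."""
    value = int(text.strip())  # int() raises ValueError for non-integer input
    # Assumption: a useful support count lies between 1 and the number of transactions
    if not 1 <= value <= n_transactions:
        raise ValueError(
            "minimum support count must be between 1 and {}".format(n_transactions))
    return value


print(parse_min_support("3", 5))  # 3
```

Using `int()` here rather than `eval()` means the input is only ever parsed as a number and never executed as code.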
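Next, iterate over every item of every transaction to count the support of each single item, which gives C1, also stored as a dictionary. C1 is traversed in sorted key order so that the leading items of every later itemset appear in a consistent order. Items that meet the minimum support go into the frequent 1-itemsets L1, whose keys are tuples of items and whose values are support counts; the rest are added to the global collection of infrequent itemsets.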
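The counting step can be sketched with `collections.Counter`. This is an illustration on a two-transaction subset of the sample data, with an assumed minimum support of 2; the variable names are mine, not the listing's.

```python
from collections import Counter

transactions = {
    'T100': ['M', 'O', 'N', 'K', 'E', 'Y'],
    'T300': ['M', 'A', 'K', 'E'],
}
min_support = 2  # assumed value for this illustration

# set(items) makes each item count once per transaction
C1 = Counter(item for items in transactions.values() for item in set(items))

# Walk C1 in sorted order so later itemsets keep a consistent item order
L1 = {(item,): C1[item] for item in sorted(C1) if C1[item] >= min_support}
infrequent = [(item,) for item in sorted(C1) if C1[item] < min_support]

print(L1)          # {('E',): 2, ('K',): 2, ('M',): 2}
print(infrequent)  # [('A',), ('N',), ('O',), ('Y',)]
```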
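Then, write the function that generates Ck from Lk-1. If every key in Lk-1 holds a single item, combine the keys pairwise. If every key holds at least two items, join each pair of itemsets whose first k-2 items are identical into one k-itemset. Once all combinations are built, this gives the initial Ck. Finally, go through the global collection of infrequent itemsets and delete from Ck every candidate that contains one of them as a subset; what remains is Ck.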
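The join and prune steps can be sketched as below. Note that this is a variant, not the approach described above: instead of consulting an accumulated list of infrequent itemsets, it prunes a candidate when any of its (k-1)-item subsets is missing from Lk-1, which is the usual textbook formulation. `join_and_prune` is a hypothetical name, and the input is assumed to consist of tuples whose items are already in sorted order.

```python
from itertools import combinations


def join_and_prune(prev_frequent):
    """Build k-item candidates from (k-1)-item tuples (hypothetical helper)."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        # Join: the two itemsets must agree on everything except the last item.
        # For 1-itemsets the shared prefix is empty, so every pair is joined.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune: every (k-1)-item subset of the candidate must be frequent
            if all(sub in prev_set
                   for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates


L2 = [('E', 'K'), ('E', 'O'), ('K', 'M'), ('K', 'O'), ('K', 'Y')]
print(join_and_prune(L2))  # [('E', 'K', 'O')]
```

Here ('K', 'M', 'O'), ('K', 'M', 'Y') and ('K', 'O', 'Y') are joined but then pruned, because ('M', 'O'), ('M', 'Y') and ('O', 'Y') are not in L2.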
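After that, write the function that generates Lk from Ck. For each candidate in Ck, scan the transaction set and add 1 to the candidate's support for every transaction that contains it. Once every candidate has a support count, remove the candidates below the minimum support and add them to the global collection of infrequent itemsets.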
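Support counting reduces to a subset test per transaction. The following is a small sketch on the five sample transactions; `count_support` is a hypothetical helper name, and the threshold of 3 is an assumed value.

```python
def count_support(transactions, candidates):
    """Map each candidate to the number of transactions containing it."""
    transaction_sets = [set(items) for items in transactions]
    return {c: sum(1 for t in transaction_sets if t.issuperset(c))
            for c in candidates}


# The five sample transactions, written as strings of single-letter items
transactions = [list("MONKEY"), list("DONKEY"), list("MAKE"),
                list("MUCKY"), list("COKIE")]
support = count_support(transactions, [('E', 'K'), ('M', 'O'), ('K', 'Y')])
print(support)  # {('E', 'K'): 4, ('M', 'O'): 1, ('K', 'Y'): 3}

min_support = 3  # assumed value for this illustration
frequent = {c: n for c, n in support.items() if n >= min_support}
print(frequent)  # {('E', 'K'): 4, ('K', 'Y'): 3}
```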
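Finally, write the main function. Set up the global variables (the final frequent itemsets, the infrequent itemsets, and the current level), call the function that returns the initial frequent itemsets L1, update the collection of infrequent itemsets, and generate C2 (assigned to C). While C is not empty, repeatedly compute the frequent itemsets Lk and the next candidates Ck (assigned to C), updating the final frequent itemsets each time; once C is empty, print the final frequent itemsets.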
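The level-by-level loop can be sketched as one compact, non-interactive function. This is an illustration under assumptions rather than the listing itself: `apriori` is a hypothetical name, it takes the minimum support as an argument instead of reading it with input(), it prunes against the previous level rather than a global infrequent list, and it returns every level instead of only the last one.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return [L1, L2, ...], each a dict of {itemset tuple: support count}."""
    transaction_sets = [set(items) for items in transactions]

    def support(candidate):
        return sum(1 for t in transaction_sets if t.issuperset(candidate))

    def keep_frequent(candidates):
        counted = {c: support(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= min_support}

    # Level 1: every distinct item, in sorted order, is a candidate
    items = sorted({item for t in transaction_sets for item in t})
    current = keep_frequent([(item,) for item in items])

    levels = []
    while current:  # stop as soon as a level has no frequent itemsets
        levels.append(current)
        candidates = []
        for a, b in combinations(sorted(current), 2):
            if a[:-1] == b[:-1]:  # join on the shared prefix
                cand = a + (b[-1],)
                if all(s in current
                       for s in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
        current = keep_frequent(candidates)
    return levels


sample = [list("MONKEY"), list("DONKEY"), list("MAKE"),
          list("MUCKY"), list("COKIE")]
levels = apriori(sample, 3)
print(levels[-1])  # {('E', 'K', 'O'): 3}
```

With a minimum support count of 3 on the sample data, the loop runs for three levels and ends with the single 3-itemset ('E', 'K', 'O').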

Source Code

# -*- coding: utf-8 -*-

"""
@Time        : 2020/12/7
@Author      : lixinci
@File        : Apriori algorithm
@Description :
"""


transaction = {
    'T100': ['M', 'O', 'N', 'K', 'E', 'Y'],
    'T200': ['D', 'O', 'N', 'K', 'E', 'Y'],
    'T300': ['M', 'A', 'K', 'E'],
    'T400': ['M', 'U', 'C', 'K', 'Y'],
    'T500': ['C', 'O', 'K', 'I', 'E']
}


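def find_frequent_one_itemsets(dataset, min_support=1):
    """
    Build L1 (the frequent 1-itemsets) and collect the infrequent 1-itemsets.
    """
    C1 = {}
    L1 = {}
    infrequent_item_sets = []
    # Build C1: count how many transactions contain each item.
    # dict.fromkeys() drops repeats within one transaction while keeping order.
    for items in dataset.values():
        for item in dict.fromkeys(items):
            C1[item] = C1.get(item, 0) + 1
    print("C1: {}".format(C1))
    # Walk C1 in sorted key order and keep the items that meet min_support
    for key in sorted(C1):
        if C1[key] >= min_support:
            L1[(key,)] = C1[key]
        else:
            infrequent_item_sets.append((key,))
    return L1, infrequent_item_sets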


def itemsets_gen(L, infrequent_item_sets=None):
    """
    Generate the candidate itemsets Ck from the frequent itemsets Lk-1.
    """
    C = []
    keys = list(L.keys())
    # Nothing to join if Lk-1 is empty (also avoids an IndexError on keys[0])
    if not keys:
        return C
    if len(keys[0]) == 1:
        # 1-itemsets: combine every pair
        for i in range(len(keys) - 1):
            for j in range(i + 1, len(keys)):
                C.append(keys[i] + keys[j])
    else:
        # Larger itemsets: join two itemsets that share their first k-2 items
        for i in range(len(keys) - 1):
            for j in range(i + 1, len(keys)):
                if keys[i][:-1] == keys[j][:-1]:
                    C.append(keys[i] + (keys[j][-1],))
    if infrequent_item_sets:
        # Prune every candidate that contains a known infrequent itemset.
        # Filtering replaces delete-by-index, which removed the wrong candidate
        # when one candidate matched several infrequent itemsets.
        C = [c for c in C
             if not any(set(item).issubset(c)
                        for item in infrequent_item_sets)]
    return C


def frequent_itemsets_gen(dataset, C, min_support=1):
    """
    From the candidates Ck, build the frequent itemsets Lk and the list of
    infrequent itemsets.
    """
    L = {}
    tmp_L = {}
    infrequent_item_sets = []
    # Count the support of every candidate
    for c in C:
        tmp_L[c] = 0
        for items in dataset.values():
            if set(c).issubset(items):
                tmp_L[c] += 1
    # Keep the candidates that meet min_support; record the rest as infrequent
    for key in tmp_L:
        if tmp_L[key] >= min_support:
            L[key] = tmp_L[key]
        else:
            infrequent_item_sets.append(tuple(key))
    return L, infrequent_item_sets


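# Main function
def main():
    global frequent_itemsets, infrequent_item_sets, count
    infrequent_item_sets = []  # infrequent itemsets found so far
    frequent_itemsets = {}  # frequent itemsets of the last non-empty level
    count = 1  # current level k
    # int() instead of eval(): eval() would execute whatever the user types
    min_support = int(input("Enter the minimum support count: "))

    # Get the initial frequent itemsets L1
    L1, infrequent_item_sets_part = find_frequent_one_itemsets(
        transaction, min_support)
    print("L{}: {}".format(count, L1))
    # Start with L1 as the frequent itemsets
    frequent_itemsets = L1
    # Update the infrequent itemsets
    infrequent_item_sets += infrequent_item_sets_part

    count += 1
    # Generate the second candidate set, C2 (empty if L1 is empty)
    C = itemsets_gen(L1, infrequent_item_sets) if L1 else []
    print("C{}: {}".format(count, C))
    # Go level by level until no candidates remain
    while C:
        # Split the candidates into frequent and infrequent itemsets
        L, infrequent_item_sets_part = frequent_itemsets_gen(
            transaction, C, min_support)
        print("L{}: {}".format(count, L))
        # Update the infrequent itemsets
        infrequent_item_sets += infrequent_item_sets_part
        # If no candidate is frequent, stop and keep the previous level
        if not L:
            break
        frequent_itemsets = L
        count += 1
        C = itemsets_gen(frequent_itemsets, infrequent_item_sets)
        print("C{}: {}".format(count, C))
    print("Final frequent itemsets: ", end='')
    print(frequent_itemsets)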


if __name__ == '__main__':
    main()