Machine Learning: Association Rule Mining (Part 1)

星石传说

Article link: https://blog.csdn.net/2301_78630677/article/details/132710345

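This article introduces association rule mining in machine learning: its principles (how support and confidence are computed), a shopping-basket analysis example, and a Python implementation that one-hot encodes the transactions and then computes the support and confidence of products to uncover associations between them.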

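Preface

Data mining is an important area closely tied to machine learning, and association rule mining is one of its techniques. This article gives a brief introduction to association rule mining.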

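1. Principles

Association rule mining is a data mining technique for discovering frequent itemsets and association rules in a dataset. It reveals which items tend to occur together, and is used in market basket analysis, recommender systems, and similar applications.

1.1. Example

A classic example is shopping-basket analysis.
The manager of a supermarket noticed the following:
diapers and beer often appeared in the same order. Analysis of the data showed that diapers were mostly bought by fathers, and that if they happened to see beer while buying diapers, they were very likely to buy it as well, which raised beer sales.

Based on this association rule between diapers and beer, the store can shelve the two products together and use a recommender system to suggest beer to customers who buy diapers, which increases the store's revenue.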

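1.2. Support and Confidence

Association rules are generated by computing support and confidence.
Support is the frequency with which an itemset appears in the dataset.
Confidence is the probability that the consequent itemset appears given that the antecedent itemset appears.

For example, the support of products A and B being sold together is:
support = (number of transactions containing both A and B) / (total number of transactions)
The confidence of the rule A → B is the share of transactions containing A that also contain B:
confidence = (number of transactions containing both A and B) / (number of transactions containing A)

If both support and confidence are high, A and B are strongly associated, and the rule can be used as an association rule in a recommender system.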
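
To make the two formulas concrete, here is a minimal sketch in plain Python. The five toy transactions and the helper names `count`, `support`, and `confidence` are made up for illustration; they are not part of the dataset used later in this article.

```python
# Hypothetical toy data: each transaction is the set of items in one order.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B"},
    {"A", "B"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    n_both = count(set(antecedent) | set(consequent), transactions)
    return n_both / n_antecedent

print(support({"A", "B"}, transactions))
print(confidence({"A"}, {"B"}, transactions))
```

Confidence is computed from raw counts rather than from two support values, which gives the same ratio while avoiding an extra floating-point division.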

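Itemset: a set of one or more items (each thing that appears in a transaction is an item; a carton of milk, for example, is one item).
Minimum support: a threshold on support, usually chosen by the analyst.
Frequent itemset: an itemset whose support is at least the minimum support.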
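
As a rough illustration of these definitions, the sketch below enumerates every possible itemset over the six baskets used in the code section of this article and keeps those that reach a minimum support. The threshold of 0.3 is an assumption chosen for the example, and the brute-force enumeration grows exponentially with the number of items, so it is only suitable for tiny datasets.

```python
from itertools import combinations

# The six baskets from the code section, one set of items per transaction.
transactions = [
    {"bread", "apples"},
    {"bread", "apples", "milk"},
    {"milk", "cheese"},
    {"bread", "butter", "cheese"},
    {"eggs", "milk"},
    {"bread", "milk", "butter", "cheese"},
]

min_support = 0.3  # assumed threshold, picked only for this example

# All distinct items, sorted so that candidate tuples have a stable order.
items = sorted(set().union(*transactions))

frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        hits = sum(1 for t in transactions if set(candidate) <= t)
        sup = hits / len(transactions)
        if sup >= min_support:
            frequent[candidate] = sup

for itemset, sup in frequent.items():
    print(itemset, round(sup, 3))
```

With this threshold an itemset must appear in at least two of the six baskets, so `('eggs',)` is dropped while `('bread', 'butter', 'cheese')` is kept.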

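2. Code Implementation

2.1. Splitting the shopping-basket data

# Split the shopping-basket data: one row per transaction, one column per item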
import numpy as np
import pandas as pd
data = {'products': ['bread apples', 'bread apples milk', 'milk cheese',
                     'bread butter cheese', 'eggs milk',
                     'bread milk butter cheese']}
transactions = pd.DataFrame(data=data, index=range(1,7))
print(transactions)
                   products
1              bread apples
2         bread apples milk
3               milk cheese
4       bread butter cheese
5                 eggs milk
6  bread milk butter cheese

expanded = transactions["products"].str.split(expand=True)
print(expanded)
       0       1       2       3
1  bread  apples    None    None
2  bread  apples    milk    None
3   milk  cheese    None    None
4  bread  butter  cheese    None
5   eggs    milk    None    None
6  bread    milk  butter  cheese

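2.2. Building the de-duplicated product list

# Collect every distinct product that was purchased
products = set()
for col in expanded.columns:
    for product in expanded[col].unique():
        # Skip the None/NaN padding that split(expand=True) adds to short rows
        if pd.notna(product):
            products.add(product)

products = sorted(products)
print(products)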
['apples', 'bread', 'butter', 'cheese', 'eggs', 'milk']
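
The same list can be obtained more compactly. The sketch below is one possible alternative, not the approach used above: it splits each basket into a list and flattens the lists with `Series.explode`, so no padding values ever appear.

```python
import pandas as pd

data = {'products': ['bread apples', 'bread apples milk', 'milk cheese',
                     'bread butter cheese', 'eggs milk',
                     'bread milk butter cheese']}
transactions = pd.DataFrame(data=data, index=range(1, 7))

# split() without expand=True yields one list per row; explode() gives one row per item
all_items = transactions["products"].str.split().explode()

# unique() removes duplicates, sorted() returns a plain sorted Python list
products = sorted(all_items.unique())
print(products)
```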

2.3. One-hot encoding

# One-hot encoding: one row per transaction, one column per product
transactions_encoded = np.zeros((len(expanded), len(products)), dtype="int8")
print(transactions_encoded)  # the initialized all-zero matrix
[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]


# Set a cell to 1 when the product appears in that transaction
for i, row in enumerate(expanded.values):
    for j, product in enumerate(products):
        if product in row:
            transactions_encoded[i, j] = 1

transactions_encoded_df = pd.DataFrame(data=transactions_encoded, columns=products)
print(transactions_encoded_df)
   apples  bread  butter  cheese  eggs  milk
0       1      1       0       0     0     0
1       1      1       0       0     0     1
2       0      0       0       1     0     1
3       0      1       1       1     0     0
4       0      0       0       0     1     1
5       0      1       1       1     0     1
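
For comparison, pandas can produce an equivalent table in one call. The sketch below uses `Series.str.get_dummies` with a space as the separator; it is offered as a shortcut under the assumption that item names never contain spaces, which holds for this dataset.

```python
import pandas as pd

data = {'products': ['bread apples', 'bread apples milk', 'milk cheese',
                     'bread butter cheese', 'eggs milk',
                     'bread milk butter cheese']}
transactions = pd.DataFrame(data=data, index=range(1, 7))

# Each distinct space-separated token becomes a 0/1 indicator column
encoded = transactions["products"].str.get_dummies(sep=" ")
print(encoded)
```

Unlike the manual loop, this keeps the original index (1 to 6) instead of starting again at 0.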

2.4. Computing product support

# Support of each individual product: column sum divided by the number of transactions
support = transactions_encoded_df.sum() / len(transactions_encoded_df)
print(support)
apples    0.333333
bread     0.666667
butter    0.333333
cheese    0.500000
eggs      0.166667
milk      0.666667
dtype: float64

# Support of several products together (e.g. butter and bread)
sup_butter_bread = (
    len(transactions_encoded_df.query("butter==1 and bread==1"))
    /
    len(transactions_encoded_df)
)
print(sup_butter_bread)
0.3333333333333333
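
Writing a new `query` string for every combination gets tedious. A small helper along the following lines can compute the support of any itemset from the one-hot table; the function name `itemset_support` is hypothetical, and the table is rebuilt from literal values here only so that the snippet runs on its own.

```python
import pandas as pd

products = ["apples", "bread", "butter", "cheese", "eggs", "milk"]
rows = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 1],
]
encoded = pd.DataFrame(rows, columns=products)

def itemset_support(df, items):
    """Share of rows in which every column listed in `items` equals 1."""
    # all(axis=1) is True for a row only if all selected columns are non-zero
    return float(df[list(items)].all(axis=1).mean())

print(itemset_support(encoded, ["butter", "bread"]))
print(itemset_support(encoded, ["bread", "butter", "cheese"]))
```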

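2.5. Computing confidence

# Confidence of the rule cheese -> bread:
# transactions with both cheese and bread / transactions with cheese
conf_cheese_bread = (
    len(transactions_encoded_df.query("cheese==1 and bread==1"))
    /
    len(transactions_encoded_df.query("cheese==1"))
)
print(conf_cheese_bread)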
0.6666666666666666

conf_butter_cheese = (
    len(transactions_encoded_df.query("butter==1 and cheese==1"))
    /
    len(transactions_encoded_df.query("butter==1"))
)
print(conf_butter_cheese)
1.0
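
The two calculations above can be generalized to every ordered pair of products. The sketch below scans all pairs and keeps the rules that pass both thresholds; the values `min_support = 0.3` and `min_confidence = 0.6` are assumptions chosen for illustration, and the one-hot table is rebuilt from literal values so that the snippet is self-contained.

```python
from itertools import permutations

import pandas as pd

products = ["apples", "bread", "butter", "cheese", "eggs", "milk"]
rows = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 1],
]
df = pd.DataFrame(rows, columns=products)

min_support = 0.3      # assumed threshold
min_confidence = 0.6   # assumed threshold

rules = []
# permutations gives ordered pairs, so a -> b and b -> a are checked separately
for a, b in permutations(products, 2):
    n_a = int((df[a] == 1).sum())
    n_both = int(((df[a] == 1) & (df[b] == 1)).sum())
    if n_a == 0:
        continue  # confidence is undefined when the antecedent never occurs
    sup = n_both / len(df)
    conf = n_both / n_a
    if sup >= min_support and conf >= min_confidence:
        rules.append((a, b, sup, conf))

rules_df = pd.DataFrame(
    rules, columns=["antecedent", "consequent", "support", "confidence"]
)
print(rules_df.sort_values("confidence", ascending=False))
```

Confidence is directional: butter -> cheese passes with confidence 1.0, while bread -> butter is rejected at 0.5 even though both rules involve pairs with the same support.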