R Language and Data Mining, Part 8: Association Rule Analysis

Preface

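In short, association analysis uses quantitative measures to describe whether items influence one another and how strong that influence is.
[Figure: overview of common association analysis algorithms]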

Apriori Association Rules

Basic terminology:

  • Transaction: simply put, the set of all items on one shopper's receipt.
  • Item: a single product on the receipt, such as product A.
  • Itemset: a set of one or more items, as opposed to a single item. This gives rise to 1-itemsets, 2-itemsets, ..., k-itemsets.
  • Notation: in X => Y, X is called the antecedent and Y the consequent.
  • Support: in short, the probability (relative frequency) that X and Y occur together.
    $$\mathrm{Support}(X \Rightarrow Y) = \frac{P(X, Y)}{P(I)} = \frac{P(X \cup Y)}{P(I)} = \frac{\mathrm{num}(X \cup Y)}{\mathrm{num}(I)}$$
    Here I denotes the full set of transactions, and num() counts how many transactions contain a given itemset. For example, num(I) is the total number of transactions, and num(X∪Y) is the number of transactions that contain both X and Y.
  • Frequent itemset (Large Itemset): an itemset whose support meets the minimum support threshold.
  • Confidence
    Confidence is the probability that Y occurs given that X has occurred, i.e. how reliably the rule X => Y holds.
    $$\mathrm{Confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{P(X \cup Y)}{P(X)}$$
  • Lift
    Lift is the ratio of the probability of Y among transactions that contain X to the overall probability of Y.
    $$\mathrm{Lift}(X \Rightarrow Y) = \frac{P(Y \mid X)}{P(Y)}$$
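
To make the three measures concrete, here is a minimal base-R sketch that computes them by hand for the rule {a} => {b} on a five-transaction toy dataset. The names `toy` and `contains` are illustrative only and do not come from any package.

```r
# Toy transactions: each element is one "receipt" (illustrative data)
toy <- list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "b", "d"),
  c("c", "e"),
  c("a", "b", "d", "e")
)

# TRUE for each transaction that contains every item in `items`
contains <- function(items) {
  vapply(toy, function(tr) all(items %in% tr), logical(1))
}

n    <- length(toy)                     # num(I): total number of transactions
p_x  <- sum(contains("a")) / n          # P(X)
p_y  <- sum(contains("b")) / n          # P(Y)
p_xy <- sum(contains(c("a", "b"))) / n  # P(X, Y): X and Y in the same transaction

support    <- p_xy
confidence <- p_xy / p_x
lift       <- confidence / p_y

print(c(support = support, confidence = confidence, lift = lift))
```

In this toy data, a and b each appear in four of the five transactions and always together, so the support is 0.8, the confidence is 1, and the lift is 1.25. A lift above 1 means X and Y co-occur more often than they would if they were independent.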

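In R, the Apriori algorithm is available through the functions in the arules package, while a companion package, arulesViz, visualizes the resulting rules.
arules provides three mining functions: apriori(), eclat(), and weclat(). Note that eclat() and weclat() mine frequent itemsets; rules can then be derived from those itemsets with ruleInduction(). The functions for each algorithm were listed in a table in the original post:
[Figure: table of arules functions for each algorithm]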
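
As a rough sketch of the two-step route, the block below mines frequent itemsets with `eclat()` and then derives rules with `ruleInduction()`. The thresholds (support 0.01, confidence 0.5) and the variable names are my own choices for illustration, not values taken from the original post.

```r
library(arules)
data("Groceries")

# Step 1: mine frequent itemsets with eclat (confidence is not involved yet)
itemsets <- eclat(Groceries,
                  parameter = list(support = 0.01, minlen = 2),
                  control = list(verbose = FALSE))

# Step 2: derive rules from those itemsets, keeping confidence >= 0.5
rules_eclat <- ruleInduction(itemsets, Groceries, confidence = 0.5)

# Show the three rules with the highest lift
inspect(head(sort(rules_eclat, by = "lift"), 3))
```

With `apriori()` both steps happen in one call, which is why the rest of this post uses it directly.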

Code Implementation

The data usually have to be converted first, typically into the transactions class.

# Converting data to the transactions class
library(arules)

# Convert a list to transactions
a_list <- list(c("a", "b", "c"), c("a", "b"), c("a", "b", "d"), c("c", "e"), 
               c("a", "b", "d", "e"))
names(a_list) <- paste0("Tr", 1:5)  # name the list elements Tr1..Tr5
trans <- as(a_list, "transactions")  # coerce the list to transactions
inspect(trans)  # check that the conversion worked

# Convert a data frame to transactions (all columns must be factors)
a_df <- data.frame(age = as.factor(c(6, 8, 7, 6, 9, 5)), 
                   grade = as.factor(c(1, 3, 1, 1, 4, 1)))
trans2 <- as(a_df, "transactions")  # coerce the data frame to transactions
inspect(trans2)  # check that the conversion worked
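
Real purchase data often arrive in long format, with one row per (transaction, item) pair. A minimal sketch of that case, assuming hypothetical column names `id` and `item`, is to group the items with `split()` and coerce the resulting list:

```r
library(arules)

# Hypothetical long-format data: one row per (transaction, item) pair
long_df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  item = c("a", "b", "c", "a", "b", "c"),
  stringsAsFactors = FALSE
)

# split() groups the items by transaction ID, giving a named list
item_list <- split(long_df$item, long_df$id)

# The list is then coerced exactly like a_list above
trans3 <- as(item_list, "transactions")
inspect(trans3)
```

This reuses the list-to-transactions coercion, so the only extra work is the grouping step.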

A Practical Example

# Association rule analysis
library(arules)  # load the arules package
data("Groceries")  # load the Groceries dataset
# Summary statistics for the dataset, covering both transactions and items
summary(Groceries)

out:

transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146 

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt 
            2513             1903             1809             1715             1372 
         (Other) 
           34055 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18 
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14 
  19   20   21   22   23   24   26   27   28   29   32 
  14    9   11    4    6    1    1    1    1    3    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   4.409   6.000  32.000 

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
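
The density and the mean transaction size in this summary can be cross-checked from the item counts alone. The sketch below redoes that arithmetic in base R using only numbers copied from the output above; the variable names are illustrative.

```r
n_trans <- 9835   # rows (transactions)
n_items <- 169    # columns (distinct items)

# Counts from "most frequent items", including the (Other) bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
total_items <- sum(item_counts)        # total number of item occurrences

# Density: share of non-empty cells in the transactions-by-items matrix
density   <- total_items / (n_trans * n_items)
# Mean size: average number of items per transaction
mean_size <- total_items / n_trans

print(c(total = total_items, density = density, mean_size = mean_size))
```

The total is 43367 item occurrences, which gives a density of about 0.0261 and a mean of about 4.409 items per transaction, matching the reported values.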

# Mine association rules with a minimum support of 0.001 and a minimum confidence of 0.5
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))

out:

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.5    0.1    1 none FALSE            TRUE       5   0.001      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 9 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.02s].
writing ... [5668 rule(s)] done [0.01s].
creating S4 object  ... done [0.00s].
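
The log reports "Absolute minimum support count: 9", which looks like 0.001 × 9835 = 9.835 with the fraction dropped (that reading is my assumption, not something the log states). The base-R sketch below works out the smallest whole count that actually satisfies the support threshold.

```r
n_trans     <- 9835
min_support <- 0.001

threshold <- n_trans * min_support   # 9.835 transactions
# Counts are whole numbers, so the smallest count with
# count / n_trans >= min_support is the next integer up
min_count <- ceiling(threshold)

print(min_count)
print(min_count / n_trans)           # smallest attainable support
```

A count of 10 corresponds to a support of about 0.001017, which is consistent with the minimum count and minimum support shown by summary(rules).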

# Summarize the mined rules
summary(rules)

out:

set of 5668 rules

rule length distribution (lhs + rhs):sizes
   2    3    4    5    6 
  11 1461 3211  939   46 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.00    4.00    3.92    4.00    6.00 

summary of quality measures:
    support           confidence        coverage             lift            count      
 Min.   :0.001017   Min.   :0.5000   Min.   :0.001017   Min.   : 1.957   Min.   : 10.0  
 1st Qu.:0.001118   1st Qu.:0.5455   1st Qu.:0.001729   1st Qu.: 2.464   1st Qu.: 11.0  
 Median :0.001322   Median :0.6000   Median :0.002135   Median : 2.899   Median : 13.0  
 Mean   :0.001668   Mean   :0.6250   Mean   :0.002788   Mean   : 3.262   Mean   : 16.4  
 3rd Qu.:0.001729   3rd Qu.:0.6842   3rd Qu.:0.002949   3rd Qu.: 3.691   3rd Qu.: 17.0  
 Max.   :0.022267   Max.   :1.0000   Max.   :0.043416   Max.   :18.996   Max.   :219.0  

mining info:
      data ntransactions support confidence
 Groceries          9835   0.001        0.5
                                                                           call
 apriori(data = Groceries, parameter = list(support = 0.001, confidence = 0.5))

# Look up the support of individual items in Groceries
# Support of the first three items in Groceries
itemFrequency(Groceries[, 1:3])

out:

frankfurter     sausage  liver loaf 
0.058973055 0.093950178 0.005083884 

# Support of the items "whole milk" and "other vegetables"
itemFrequency(Groceries[, c("whole milk", "other vegetables")])

out:

      whole milk other vegetables 
       0.2555160        0.1934926 
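
# Plot item frequencies
# Bar plot of the items whose support is at least 0.1
itemFrequencyPlot(Groceries, support = 0.1)

[Figure: item frequency plot for items with support of at least 0.1]

# Bar plot of the 20 items with the highest support
itemFrequencyPlot(Groceries, topN = 20)

[Figure: item frequency plot for the 20 most frequent items]
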
# Inspect the data and the rules
# Inspect the first five transactions in Groceries
inspect(Groceries[1:5])

out:

 items                     
[1] {citrus fruit,            
     semi-finished bread,     
     margarine,               
     ready soups}             
[2] {tropical fruit,          
     yogurt,                  
     coffee}                  
[3] {whole milk}              
[4] {pip fruit,               
     yogurt,                  
     cream cheese ,           
     meat spreads}            
[5] {other vegetables,        
     whole milk,              
     condensed milk,          
     long life bakery product}

# Inspect the first five association rules
inspect(rules[1:5])

out:

   lhs                    rhs          support     confidence coverage    lift     count
[1] {honey}             => {whole milk} 0.001118454 0.7333333  0.001525165 2.870009 11   
[2] {tidbits}           => {rolls/buns} 0.001220132 0.5217391  0.002338587 2.836542 12   
[3] {cocoa drinks}      => {whole milk} 0.001321810 0.5909091  0.002236909 2.312611 13   
[4] {pudding powder}    => {whole milk} 0.001321810 0.5652174  0.002338587 2.212062 13   
[5] {cooking chocolate} => {whole milk} 0.001321810 0.5200000  0.002541942 2.035097 13  
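
The columns of this table are tied together by the definitions given earlier. As a check, the following base-R sketch recomputes confidence and lift for rule [1], {honey} => {whole milk}, from values copied out of the outputs above; the variable names are illustrative.

```r
n_trans <- 9835

count_xy <- 11     # transactions with both honey and whole milk (count column)
count_y  <- 2513   # transactions with whole milk (from summary(Groceries))

support    <- count_xy / n_trans     # support of the rule
coverage   <- 0.001525165            # support of the lhs {honey}, read from the table
confidence <- support / coverage     # P(Y | X)
lift       <- confidence / (count_y / n_trans)

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

The results agree with the table: a confidence of about 0.7333 and a lift of about 2.87. In other words, shoppers who buy honey are nearly three times as likely to buy whole milk as shoppers overall.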

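# Compute additional interest measures for the rules
# Measures: "coverage", "fishersExactTest", "conviction", "chiSquared"
# Note: apriori() already reported coverage above, so the merged table
# below shows the coverage column twice
qualityMeasures <- interestMeasure(rules, measure = c("coverage", "fishersExactTest", 
                                                      "conviction", "chiSquared"), 
                                   transactions = Groceries) 
summary(qualityMeasures)                        
quality(rules) <- cbind(quality(rules), qualityMeasures)  # append the new measures
quality(rules) <- round(quality(rules), digits = 3)  # round to 3 decimal places
inspect(head(rules))  # inspect the rules with the merged measures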

out:

    lhs                    rhs          support confidence coverage lift  count coverage
[1] {honey}             => {whole milk} 0.001   0.733      0.002    2.870 11    0.002   
[2] {tidbits}           => {rolls/buns} 0.001   0.522      0.002    2.837 12    0.002   
[3] {cocoa drinks}      => {whole milk} 0.001   0.591      0.002    2.313 13    0.002   
[4] {pudding powder}    => {whole milk} 0.001   0.565      0.002    2.212 13    0.002   
[5] {cooking chocolate} => {whole milk} 0.001   0.520      0.003    2.035 13    0.003   
[6] {cereals}           => {whole milk} 0.004   0.643      0.006    2.516 36    0.006   
    fishersExactTest conviction chiSquared
[1] 0.000            2.792      18.030    
[2] 0.000            1.706      17.526    
[3] 0.001            1.820      13.039    
[4] 0.002            1.712      11.624    
[5] 0.004            1.551       9.217    
[6] 0.000            2.085      44.420  
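
Of the added measures, conviction is easy to verify by hand. It is commonly defined as (1 - P(Y)) / (1 - Confidence(X => Y)). The sketch below applies that definition to rule [1], using unrounded counts from the earlier outputs (11 joint transactions out of 15 containing honey, and 2513 transactions containing whole milk).

```r
supp_y     <- 2513 / 9835   # support of the rhs {whole milk}
confidence <- 11 / 15       # confidence of {honey} => {whole milk}

# Conviction compares how often the rule would fail under independence
# with how often it actually fails
conviction <- (1 - supp_y) / (1 - confidence)
print(round(conviction, 3))
```

This gives about 2.792, matching the conviction column for rule [1].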

# Sorting rules
# Sort the rules by decreasing support
sort(rules, by = "support")
# Inspect the five rules with the highest support
inspect(sort(rules, by = "support")[1:5])

out:

    lhs                                       rhs          support confidence coverage lift 
[1] {other vegetables, yogurt}             => {whole milk} 0.022   0.513      0.043    2.007
[2] {other vegetables, whipped/sour cream} => {whole milk} 0.015   0.507      0.029    1.984
[3] {tropical fruit, yogurt}               => {whole milk} 0.015   0.517      0.029    2.025
[4] {root vegetables, yogurt}              => {whole milk} 0.015   0.563      0.026    2.203
[5] {pip fruit, other vegetables}          => {whole milk} 0.014   0.518      0.026    2.025
    count coverage fishersExactTest conviction chiSquared
[1] 219   0.043    0                1.528      155.428   
[2] 144   0.029    0                1.510       97.261   
[3] 149   0.029    0                1.543      106.934   
[4] 143   0.026    0                1.704      129.583   
[5] 133   0.026    0                1.543       95.223  
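
Sorting and then indexing can also be done in one step. A possible shortcut, assuming a reasonably recent arules version in which `head()` accepts a `by` argument, looks like this (the name `top5` is illustrative):

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.5),
                 control = list(verbose = FALSE))

# head() with `by` sorts in decreasing order and keeps the first n rules
top5 <- head(rules, n = 5, by = "support")
inspect(top5)
```

This avoids materializing the fully sorted rule set when only the top few rules are needed.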

# Extracting rules
# Keep the rules whose consequent is "whole milk" and whose lift is at least 1.2
subset(rules, subset = rhs %in% "whole milk" & lift >= 1.2)
# Inspect the first five rules that meet both conditions
inspect(subset(rules, subset = rhs %in% "whole milk" & lift >= 1.2)[1:5])

out:

   lhs                    rhs          support confidence coverage lift  count coverage
[1] {honey}             => {whole milk} 0.001   0.733      0.002    2.870 11    0.002   
[2] {cocoa drinks}      => {whole milk} 0.001   0.591      0.002    2.313 13    0.002   
[3] {pudding powder}    => {whole milk} 0.001   0.565      0.002    2.212 13    0.002   
[4] {cooking chocolate} => {whole milk} 0.001   0.520      0.003    2.035 13    0.003   
[5] {cereals}           => {whole milk} 0.004   0.643      0.006    2.516 36    0.006   
    fishersExactTest conviction chiSquared
[1] 0.000            2.792      18.030    
[2] 0.001            1.820      13.039    
[3] 0.002            1.712      11.624    
[4] 0.004            1.551       9.217    
[5] 0.000            2.085      44.420  
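
Besides `%in%`, arules provides two related matching operators that are handy in `subset()`: `%ain%` (all listed items must be present) and `%pin%` (partial match on item labels). The sketch below shows both; the item choices, the 0.8 confidence cutoff, and the variable names are arbitrary examples of mine.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.5),
                 control = list(verbose = FALSE))

# %ain%: the antecedent must contain ALL of the listed items
both <- subset(rules, subset = lhs %ain% c("other vegetables", "yogurt"))

# %pin%: partial match on item labels, here any item whose label contains "fruit"
fruity <- subset(rules, subset = lhs %pin% "fruit" & confidence >= 0.8)

length(both)    # number of rules matched by %ain%
length(fruity)  # number of rules matched by %pin% and the confidence filter
```

Conditions on `lhs`, `rhs`, and any quality column can be combined with `&` and `|` in the same call.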

# Association rule analysis: the full workflow, including visualization
library(arules)  # load the arules package
library(arulesViz)  # load the arulesViz package

data("Groceries")  # load the Groceries dataset
summary(Groceries)  # summary statistics for transactions and items
inspect(Groceries[1:10])  # inspect the first 10 transactions
Size <- size(Groceries)  # number of items in each transaction
# Support of every item in Groceries
ItemFrequency <- itemFrequency(Groceries) 
# Support of the items "whole milk" and "other vegetables"
itemFrequency(Groceries[, c("whole milk", "other vegetables")])

# Bar plot of the 20 items with the highest support
itemFrequencyPlot(Groceries, topN = 20)

# Mine association rules with support of at least 0.001 and confidence of at least 0.5
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))
inspect(rules[1:10])  # inspect the first ten rules

# Compute additional quality measures
# Measures: "coverage", "fishersExactTest", "conviction", "chiSquared"
qualityMeasures <- interestMeasure(rules, measure = c("coverage", "fishersExactTest", 
                                                      "conviction", "chiSquared"), 
                                   transactions = Groceries)    
summary(qualityMeasures)
quality(rules) <- cbind(quality(rules), qualityMeasures)  # append the new measures
quality(rules) <- round(quality(rules), digits = 3)  # round to 3 decimal places
inspect(head(rules))  # inspect the rules with the merged measures

# Sorting rules
# Sort by decreasing lift
rules.sorted <- sort(rules, by = "lift")
# Inspect the first five rules after sorting
inspect(rules.sorted[1:5])
# Keep the rules whose consequent is "whole milk" and whose lift is at least 1.2
rules.subset <- subset(rules, subset = rhs %in% "whole milk" & lift >= 1.2)
# Inspect the first five rules that meet both conditions
inspect(rules.subset[1:5])

# Draw a scatter plot of the association rules
plot(rules, method = "scatterplot", interactive = TRUE)

[Figure: scatter plot of the association rules]
