Python Machine Learning & Data Analysis: Association Rules

Notes compiled from a machine learning course.

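1. Background on Association Rules

Association rules

  • In the United States, some young fathers often stopped at the supermarket after work to buy diapers. The supermarket noticed a pattern: 30% to 40% of the young fathers who bought diapers also bought beer. It then rearranged its shelves to put diapers and beer together, and sales rose noticeably.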

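  • If the values of two or more variables follow some regular pattern, they are said to be associated.

  • Association rule mining looks for correlations between different items that appear in the same event, for example between the different products bought in a single purchase.

  • Example: "30% of the customers who buy a computer also buy a printer."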

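Support: how frequently an itemset or rule appears across all transactions. σ(X) denotes the support count of itemset X, and N is the total number of transactions.

  • Support of itemset X: s(X) = σ(X)/N
  • The support of the rule X ==> Y is the probability that itemsets X and Y occur together: s(X ==> Y) = σ(X∪Y)/N
  • If 100 customers shop at a store on a given day and 30 of them buy both beer and diapers, the support of that rule is 30%.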
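
The support definition can be checked with a few lines of plain Python. The sketch below rebuilds the 100-customer example as a hypothetical list of baskets (the 70 "other" baskets are made up for illustration), and the helper name `support` is an assumption, not a library function.

```python
# Hypothetical transactions: 30 baskets with both beer and diapers,
# 70 baskets with an unrelated item (invented for illustration).
transactions = [{"beer", "diaper"}] * 30 + [{"milk"}] * 70


def support(itemset, transactions):
    """s(X) = sigma(X) / N: fraction of transactions containing every item in X."""
    itemset = set(itemset)
    count = sum(1 for basket in transactions if itemset <= basket)  # sigma(X)
    return count / len(transactions)


rule_support = support({"beer", "diaper"}, transactions)
print(rule_support)  # 0.3
```

The subset test `itemset <= basket` is what makes this the support of the whole itemset rather than of any single item.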

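Confidence: how often Y appears in transactions that contain X. c(X → Y) = σ(X∪Y)/σ(X)

  • Equivalently, p(Y|X) = p(XY)/p(X).
  • Confidence reflects how reliable a rule is: how likely a customer who bought the items in X is to also buy the items in Y.
  • If 50% of the customers who buy potato chips also buy cola, the confidence is 50%.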
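
As a rough illustration of the confidence formula, the sketch below counts σ(X) and σ(X∪Y) over a small, invented list of baskets in which half of the chip buyers also buy cola. The function name `confidence` and the basket contents are assumptions made for this example.

```python
def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    x = set(antecedent)
    xy = x | set(consequent)
    count_x = sum(1 for basket in transactions if x <= basket)
    count_xy = sum(1 for basket in transactions if xy <= basket)
    if count_x == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return count_xy / count_x


# Hypothetical baskets: 4 contain chips, and 2 of those also contain cola.
baskets = [
    {"chips", "cola"},
    {"chips", "cola", "bread"},
    {"chips"},
    {"chips", "milk"},
    {"cola"},
    {"bread", "milk"},
]

chips_to_cola = confidence({"chips"}, {"cola"}, baskets)
cola_to_chips = confidence({"cola"}, {"chips"}, baskets)
print(chips_to_cola)  # 0.5
print(cola_to_chips)  # 2 of the 3 cola baskets also contain chips
```

Note that confidence is directional: chips → cola and cola → chips share the same joint count but divide by different antecedent counts.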

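Lift: how much the presence of itemset A changes the probability that itemset B appears.

  • lift(A==>B) = confidence(A==>B)/support(B) = p(B|A)/p(B)
  • Suppose there are 1000 shoppers, 500 of whom buy tea. Of those 500, 450 also buy coffee and the other 50 do not. Since confidence(tea => coffee) = 450/500 = 90%, one might conclude that people who like tea also tend to like coffee. But suppose that of the other 500 shoppers, who did not buy tea, 450 also buy coffee. That is the same high confidence of 90%, so people who do not drink tea like coffee just as much. Whether someone buys coffee therefore has nothing to do with whether they buy tea: the two are independent, and the lift is 90% / [(450+450)/1000] = 1.

Lift therefore makes up for this shortcoming of confidence. If lift = 1, X and Y are independent and X does nothing to raise the likelihood of Y. The larger the value (lift > 1), the more X raises the likelihood of Y and the stronger the association. A lift below 1 means X makes Y less likely.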
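
The tea and coffee numbers can be reproduced directly. This is only a worked version of the arithmetic above; the variable names are chosen for this sketch.

```python
n_total = 1000
tea_and_coffee = 450   # bought tea and coffee
tea_only = 50          # bought tea, no coffee
coffee_only = 450      # bought coffee, no tea

n_tea = tea_and_coffee + tea_only          # 500 tea buyers
n_coffee = tea_and_coffee + coffee_only    # 900 coffee buyers

confidence_tea_coffee = tea_and_coffee / n_tea   # p(coffee | tea)
support_coffee = n_coffee / n_total              # p(coffee)
lift_tea_coffee = confidence_tea_coffee / support_coffee

print(confidence_tea_coffee, support_coffee, lift_tea_coffee)  # 0.9 0.9 1.0
```

A confidence of 0.9 looks strong in isolation, but dividing by the 0.9 baseline for coffee shows that tea adds no information.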

2. An Example with a Custom Shopping Dataset

Install mlxtend from the Anaconda prompt:

conda install -c conda-forge mlxtend

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Define a small shopping dataset:

data = {"ID":[1,2,3,4,5,6],
       "Onion":[1,0,0,1,1,1],
       "Potato":[1,1,0,1,1,1],
       "Burger":[1,1,0,0,1,1],
       "Milk":[0,1,1,1,0,1],
       "Beer":[0,0,1,0,1,0]}
df = pd.DataFrame(data)
df = df[['ID', 'Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]]
df
   ID  Onion  Potato  Burger  Milk  Beer
0   1      1       1       1     0     0
1   2      0       1       1     1     0
2   3      0       0       0     1     1
3   4      1       1       0     1     0
4   5      1       1       1     0     1
5   6      1       1       1     1     0

Set a support threshold to select the frequent itemsets.

  • Use a minimum support of 50%

  • apriori(df, min_support=0.5, use_colnames=True)

frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]],min_support=0.5, use_colnames=True)
frequent_itemsets
    support                 itemsets
0  0.666667                  (Onion)
1  0.833333                 (Potato)
2  0.666667                 (Burger)
3  0.666667                   (Milk)
4  0.666667          (Onion, Potato)
5  0.500000          (Onion, Burger)
6  0.666667         (Potato, Burger)
7  0.500000           (Milk, Potato)
8  0.500000  (Onion, Potato, Burger)
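
To sanity-check the table above without mlxtend, the following sketch enumerates every item combination and keeps those whose support reaches the threshold. It is a brute-force check, not the Apriori algorithm (there is no candidate pruning), and the function name `frequent_itemsets_bruteforce` is hypothetical.

```python
from itertools import combinations

items = ["Onion", "Potato", "Burger", "Milk", "Beer"]
# Same six baskets as the DataFrame above, one row per basket.
rows = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
]
baskets = [{item for item, flag in zip(items, row) if flag} for row in rows]


def frequent_itemsets_bruteforce(baskets, items, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(baskets)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for basket in baskets if set(candidate) <= basket)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
    return result


frequent = frequent_itemsets_bruteforce(baskets, items, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 6))
```

With five items this exhaustive scan is trivial; Apriori exists because the number of candidates grows exponentially with the number of items.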

Compute the rules

  • association_rules(frequent_itemsets, metric='lift', min_threshold=1)
  • Different metrics and minimum thresholds can be specified
rules = association_rules(frequent_itemsets,metric="lift",min_threshold=1)
rules
    antecedents       consequents       antecedent support  consequent support  support   confidence  lift   leverage  conviction
0   (Onion)           (Potato)          0.666667            0.833333            0.666667  1.00        1.200  0.111111  inf
1   (Potato)          (Onion)           0.833333            0.666667            0.666667  0.80        1.200  0.111111  1.666667
2   (Onion)           (Burger)          0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
3   (Burger)          (Onion)           0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
4   (Potato)          (Burger)          0.833333            0.666667            0.666667  0.80        1.200  0.111111  1.666667
5   (Burger)          (Potato)          0.666667            0.833333            0.666667  1.00        1.200  0.111111  inf
6   (Onion, Potato)   (Burger)          0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
7   (Onion, Burger)   (Potato)          0.500000            0.833333            0.500000  1.00        1.200  0.083333  inf
8   (Potato, Burger)  (Onion)           0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
9   (Onion)           (Potato, Burger)  0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
10  (Potato)          (Onion, Burger)   0.833333            0.500000            0.500000  0.60        1.200  0.083333  1.250000
11  (Burger)          (Onion, Potato)   0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
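
The leverage and conviction columns are not defined earlier in these notes. A minimal sketch, assuming the usual definitions leverage = support(A∪B) − support(A)·support(B) and conviction = (1 − support(B)) / (1 − confidence), reproduces two rows of the table from the three support values alone. The helper `rule_metrics` is a hypothetical name for this example.

```python
import math


def rule_metrics(support_a, support_b, support_ab):
    """Derive rule metrics for A ==> B from the three supports."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = math.inf if confidence >= 1 else (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# (Potato) ==> (Onion): supports 5/6, 4/6 and 4/6 from the frequent itemsets.
potato_onion = rule_metrics(5 / 6, 4 / 6, 4 / 6)
# (Onion) ==> (Potato): same joint support, antecedent and consequent swapped.
onion_potato = rule_metrics(4 / 6, 5 / 6, 4 / 6)

print({k: round(v, 6) for k, v in potato_onion.items()})
print({k: round(v, 6) for k, v in onion_potato.items()})
```

Lift and leverage are symmetric in A and B, while confidence and conviction depend on the direction of the rule, which is why the two rows differ only in those columns.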
rules[ ( rules["lift"] > 1.125) & (rules["confidence"] > 0.8) ]
   antecedents      consequents  antecedent support  consequent support  support   confidence  lift  leverage  conviction
0  (Onion)          (Potato)     0.666667            0.833333            0.666667  1.0         1.2   0.111111  inf
5  (Burger)         (Potato)     0.666667            0.833333            0.666667  1.0         1.2   0.111111  inf
7  (Onion, Burger)  (Potato)     0.500000            0.833333            0.500000  1.0         1.2   0.083333  inf

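These remaining rules are the more useful ones:

  • (Onion, Potato) and (Burger, Potato) can be sold as bundles.
  • If Onion and Burger are both in the basket, the customer is also likely to buy Potato; if it is not in the basket yet, it is worth recommending.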

3. A Simulated Real-World Shopping Example

retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                         'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                   ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                   ['Soda', 'Chips', 'Milk'],
                                   ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                   ['Soda', 'Coffee', 'Milk', 'Bread'],
                                   ['Beer', 'Chips']
                                  ]
                         }
retail = pd.DataFrame(retail_shopping_basket)
retail = retail[["ID","Basket"]]
pd.options.display.max_colwidth=100
retail
   ID  Basket
0   1  [Beer, Diaper, Pretzels, Chips, Aspirin]
1   2  [Diaper, Beer, Chips, Lotion, Juice, BabyFood, Milk]
2   3  [Soda, Chips, Milk]
3   4  [Soup, Beer, Diaper, Milk, IceCream]
4   5  [Soda, Coffee, Milk, Bread]
5   6  [Beer, Chips]

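Note:

The baskets are lists of strings and must be converted to a numeric (one-hot) encoding before mining.

retail_id = retail.drop(columns="Basket")  # keyword form; the positional axis argument was removed in pandas 2.0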
retail_id
   ID
0   1
1   2
2   3
3   4
4   5
5   6
retail_Basket = retail.Basket.str.join(",")
retail_Basket
0              Beer,Diaper,Pretzels,Chips,Aspirin
1    Diaper,Beer,Chips,Lotion,Juice,BabyFood,Milk
2                                 Soda,Chips,Milk
3                  Soup,Beer,Diaper,Milk,IceCream
4                          Soda,Coffee,Milk,Bread
5                                      Beer,Chips
Name: Basket, dtype: object
retail_Basket = retail_Basket.str.get_dummies(",")
retail_Basket
   Aspirin  BabyFood  Beer  Bread  Chips  Coffee  Diaper  IceCream  Juice  Lotion  Milk  Pretzels  Soda  Soup
0        1         0     1      0      1       0       1         0      0       0     0         1     0     0
1        0         1     1      0      1       0       1         0      1       1     1         0     0     0
2        0         0     0      0      1       0       0         0      0       0     1         0     1     0
3        0         0     1      0      0       0       1         1      0       0     1         0     0     1
4        0         0     0      1      0       1       0         0      0       0     1         0     1     0
5        0         0     1      0      1       0       0         0      0       0     0         0     0     0
retail = retail_id.join(retail_Basket)
retail
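   ID  Aspirin  BabyFood  Beer  Bread  Chips  Coffee  Diaper  IceCream  Juice  Lotion  Milk  Pretzels  Soda  Soup
0   1        1         0     1      0      1       0       1         0      0       0     0         1     0     0
1   2        0         1     1      0      1       0       1         0      1       1     1         0     0     0
2   3        0         0     0      0      1       0       0         0      0       0     1         0     1     0
3   4        0         0     1      0      0       0       1         1      0       0     1         0     0     1
4   5        0         0     0      1      0       1       0         0      0       0     1         0     1     0
5   6        0         0     1      0      1       0       0         0      0       0     0         0     0     0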
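
The join / get_dummies steps above amount to one-hot encoding each basket. For readers who want to see the idea without pandas, here is a minimal plain-Python sketch of the same transformation; the variable names are assumptions for this example.

```python
baskets = [
    ["Beer", "Diaper", "Pretzels", "Chips", "Aspirin"],
    ["Diaper", "Beer", "Chips", "Lotion", "Juice", "BabyFood", "Milk"],
    ["Soda", "Chips", "Milk"],
    ["Soup", "Beer", "Diaper", "Milk", "IceCream"],
    ["Soda", "Coffee", "Milk", "Bread"],
    ["Beer", "Chips"],
]

# One column per distinct item, in sorted order (as get_dummies produces).
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: 1 if the item is in the basket, else 0.
one_hot = [[1 if item in basket else 0 for item in columns] for basket in baskets]

print(columns)
for row in one_hot:
    print(row)
```

mlxtend also ships a TransactionEncoder class in mlxtend.preprocessing for this purpose; it is left out here only to keep the sketch dependency-free.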
frequent_items_2 = apriori(retail.drop(columns="ID"),use_colnames=True)  # default min_support is 0.5
frequent_items_2
    support        itemsets
0  0.666667          (Beer)
1  0.666667         (Chips)
2  0.500000        (Diaper)
3  0.666667          (Milk)
4  0.500000   (Chips, Beer)
5  0.500000  (Diaper, Beer)

Judged by support alone, [Beer, Chips] and [Beer, Diaper] are both frequent (each has support 0.5). Which combination is more strongly associated?

association_rules(frequent_items_2,metric="lift")
   antecedents  consequents  antecedent support  consequent support  support  confidence  lift   leverage  conviction
0  (Chips)      (Beer)       0.666667            0.666667            0.5      0.75        1.125  0.055556  1.333333
1  (Beer)       (Chips)      0.666667            0.666667            0.5      0.75        1.125  0.055556  1.333333
2  (Diaper)     (Beer)       0.500000            0.666667            0.5      1.00        1.500  0.166667  inf
3  (Beer)       (Diaper)     0.666667            0.500000            0.5      0.75        1.500  0.166667  2.000000

{Diaper, Beer} is clearly the more strongly associated pair: its lift is 1.5, against 1.125 for {Chips, Beer}.
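
The two lift values can be recomputed from the raw baskets. This is a small plain-Python cross-check, with helper names (`support`, `lift`) invented for the sketch.

```python
baskets = [
    {"Beer", "Diaper", "Pretzels", "Chips", "Aspirin"},
    {"Diaper", "Beer", "Chips", "Lotion", "Juice", "BabyFood", "Milk"},
    {"Soda", "Chips", "Milk"},
    {"Soup", "Beer", "Diaper", "Milk", "IceCream"},
    {"Soda", "Coffee", "Milk", "Bread"},
    {"Beer", "Chips"},
]


def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def lift(a, b):
    """lift(A ==> B) = support(A u B) / (support(A) * support(B))."""
    return support(a | b) / (support(a) * support(b))


lift_diaper_beer = lift({"Diaper"}, {"Beer"})
lift_chips_beer = lift({"Chips"}, {"Beer"})
print(round(lift_diaper_beer, 3), round(lift_chips_beer, 3))  # 1.5 1.125
```

Both pairs appear in three of six baskets, but Diaper on its own is rarer than Chips, so the same joint support is a bigger departure from independence.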

4. An Example on Movie Genre Associations

Dataset source: MovieLens (small)

movies = pd.read_csv("ml-latest-small/movies.csv")
movies.head(10)
   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy
5        6  Heat (1995)                         Action|Crime|Thriller
6        7  Sabrina (1995)                      Comedy|Romance
7        8  Tom and Huck (1995)                 Adventure|Children
8        9  Sudden Death (1995)                 Action
9       10  GoldenEye (1995)                    Action|Adventure|Thriller

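The data contains each movie's title and its genre labels. As before, the first step is to convert the labels to one-hot format (str.get_dummies splits on "|" by default, which matches the genres column).

movies_one = movies.drop(columns="genres").join(movies.genres.str.get_dummies())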
pd.options.display.max_columns=100
movies_one.head()
   movieId  title                               (no genres listed)  Action  Adventure  Animation  Children  Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  IMAX  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
0        1  Toy Story (1995)                                     0       0          1          1         1       1      0            0      0        1          0       0     0        0        0        0       0         0    0        0
1        2  Jumanji (1995)                                       0       0          1          0         1       0      0            0      0        1          0       0     0        0        0        0       0         0    0        0
2        3  Grumpier Old Men (1995)                              0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        1       0         0    0        0
3        4  Waiting to Exhale (1995)                             0       0          0          0         0       1      0            0      1        0          0       0     0        0        0        1       0         0    0        0
4        5  Father of the Bride Part II (1995)                   0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        0       0         0    0        0
movies_one.shape
(9125, 22)

The dataset contains 9125 movies and 20 genre columns (one of which is the placeholder "(no genres listed)").
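
To make the encoding concrete, the sketch below applies the same split-on-"|" idea to the ten genre strings shown in movies.head(10) and computes the support of one genre pair. It covers only that ten-row sample, so the numbers are not comparable to the full-dataset results; all variable names are assumptions.

```python
# Genre strings of the first ten movies, copied from movies.head(10).
sample_genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",  # Toy Story (1995)
    "Adventure|Children|Fantasy",                   # Jumanji (1995)
    "Comedy|Romance",                               # Grumpier Old Men (1995)
    "Comedy|Drama|Romance",                         # Waiting to Exhale (1995)
    "Comedy",                                       # Father of the Bride Part II (1995)
    "Action|Crime|Thriller",                        # Heat (1995)
    "Comedy|Romance",                               # Sabrina (1995)
    "Adventure|Children",                           # Tom and Huck (1995)
    "Action",                                       # Sudden Death (1995)
    "Action|Adventure|Thriller",                    # GoldenEye (1995)
]

genre_sets = [set(g.split("|")) for g in sample_genres]
all_genres = sorted(set().union(*genre_sets))

# One-hot rows, one column per genre seen in the sample.
one_hot = [[int(g in s) for g in all_genres] for s in genre_sets]

# Support of {Adventure, Children} within this ten-movie sample only.
pair = {"Adventure", "Children"}
pair_support = sum(pair <= s for s in genre_sets) / len(genre_sets)
print(all_genres)
print(pair_support)  # 0.3
```

Each movie plays the role of a transaction and each genre the role of an item, which is why the same apriori call works here as in the shopping examples.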

movies_one.set_index(["movieId","title"],inplace=True)
movies_one.head()
                                             (no genres listed)  Action  Adventure  Animation  Children  Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  IMAX  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
movieId  title
1        Toy Story (1995)                                     0       0          1          1         1       1      0            0      0        1          0       0     0        0        0        0       0         0    0        0
2        Jumanji (1995)                                       0       0          1          0         1       0      0            0      0        1          0       0     0        0        0        0       0         0    0        0
3        Grumpier Old Men (1995)                              0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        1       0         0    0        0
4        Waiting to Exhale (1995)                             0       0          0          0         0       1      0            0      1        0          0       0     0        0        0        1       0         0    0        0
5        Father of the Bride Part II (1995)                   0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        0       0         0    0        0
frequent_itemsets_movies = apriori(movies_one,use_colnames=True,min_support=0.025)
frequent_itemsets_movies
     support  itemsets
 0  0.169315  (Action)
 1  0.122411  (Adventure)
 2  0.048986  (Animation)
 3  0.063890  (Children)
 4  0.363288  (Comedy)
 5  0.120548  (Crime)
 6  0.054247  (Documentary)
 7  0.478356  (Drama)
 8  0.071671  (Fantasy)
 9  0.096110  (Horror)
10  0.043178  (Musical)
11  0.059507  (Mystery)
12  0.169315  (Romance)
13  0.086795  (Sci-Fi)
14  0.189479  (Thriller)
15  0.040219  (War)
16  0.058301  (Action, Adventure)
17  0.037589  (Comedy, Action)
18  0.038247  (Action, Crime)
19  0.051178  (Action, Drama)
20  0.040986  (Action, Sci-Fi)
21  0.062904  (Thriller, Action)
22  0.029260  (Children, Adventure)
23  0.036712  (Comedy, Adventure)
24  0.032438  (Drama, Adventure)
25  0.030685  (Fantasy, Adventure)
26  0.027726  (Sci-Fi, Adventure)
27  0.027068  (Children, Animation)
28  0.032877  (Children, Comedy)
29  0.032438  (Comedy, Crime)
30  0.104000  (Comedy, Drama)
31  0.026959  (Fantasy, Comedy)
32  0.090082  (Comedy, Romance)
33  0.067616  (Crime, Drama)
34  0.057863  (Thriller, Crime)
35  0.031671  (Mystery, Drama)
36  0.101260  (Drama, Romance)
37  0.087123  (Thriller, Drama)
38  0.031014  (War, Drama)
39  0.043397  (Horror, Thriller)
40  0.036055  (Thriller, Mystery)
41  0.028932  (Thriller, Sci-Fi)
42  0.035068  (Comedy, Drama, Romance)
43  0.032000  (Crime, Thriller, Drama)
rules_movies = association_rules(frequent_itemsets_movies,metric="lift",min_threshold=1.25)
rules_movies
    antecedents        consequents        antecedent support  consequent support  support   confidence  lift      leverage  conviction
0   (Action)           (Adventure)        0.169315            0.122411            0.058301  0.344337    2.812955  0.037575  1.338475
1   (Adventure)        (Action)           0.122411            0.169315            0.058301  0.476276    2.812955  0.037575  1.586111
2   (Action)           (Crime)            0.169315            0.120548            0.038247  0.225890    1.873860  0.017836  1.136081
3   (Crime)            (Action)           0.120548            0.169315            0.038247  0.317273    1.873860  0.017836  1.216716
4   (Action)           (Sci-Fi)           0.169315            0.086795            0.040986  0.242071    2.789015  0.026291  1.204870
5   (Sci-Fi)           (Action)           0.086795            0.169315            0.040986  0.472222    2.789015  0.026291  1.573929
6   (Thriller)         (Action)           0.189479            0.169315            0.062904  0.331984    1.960746  0.030822  1.243510
7   (Action)           (Thriller)         0.169315            0.189479            0.062904  0.371521    1.960746  0.030822  1.289654
8   (Children)         (Adventure)        0.063890            0.122411            0.029260  0.457976    3.741299  0.021439  1.619096
9   (Adventure)        (Children)         0.122411            0.063890            0.029260  0.239033    3.741299  0.021439  1.230158
10  (Fantasy)          (Adventure)        0.071671            0.122411            0.030685  0.428135    3.497518  0.021912  1.534608
11  (Adventure)        (Fantasy)          0.122411            0.071671            0.030685  0.250671    3.497518  0.021912  1.238881
12  (Sci-Fi)           (Adventure)        0.086795            0.122411            0.027726  0.319444    2.609607  0.017101  1.289519
13  (Adventure)        (Sci-Fi)           0.122411            0.086795            0.027726  0.226500    2.609607  0.017101  1.180614
14  (Children)         (Animation)        0.063890            0.048986            0.027068  0.423671    8.648758  0.023939  1.650122
15  (Animation)        (Children)         0.048986            0.063890            0.027068  0.552573    8.648758  0.023939  2.092205
16  (Children)         (Comedy)           0.063890            0.363288            0.032877  0.514580    1.416453  0.009666  1.311672
17  (Comedy)           (Children)         0.363288            0.063890            0.032877  0.090498    1.416453  0.009666  1.029255
18  (Comedy)           (Romance)          0.363288            0.169315            0.090082  0.247964    1.464511  0.028572  1.104581
19  (Romance)          (Comedy)           0.169315            0.363288            0.090082  0.532039    1.464511  0.028572  1.360609
20  (Thriller)         (Crime)            0.189479            0.120548            0.057863  0.305379    2.533256  0.035022  1.266089
21  (Crime)            (Thriller)         0.120548            0.189479            0.057863  0.480000    2.533256  0.035022  1.558693
22  (Drama)            (Romance)          0.478356            0.169315            0.101260  0.211684    1.250236  0.020267  1.053746
23  (Romance)          (Drama)            0.169315            0.478356            0.101260  0.598058    1.250236  0.020267  1.297810
24  (War)              (Drama)            0.040219            0.478356            0.031014  0.771117    1.612015  0.011775  2.279087
25  (Drama)            (War)              0.478356            0.040219            0.031014  0.064834    1.612015  0.011775  1.026321
26  (Horror)           (Thriller)         0.096110            0.189479            0.043397  0.451539    2.383052  0.025186  1.477810
27  (Thriller)         (Horror)           0.189479            0.096110            0.043397  0.229034    2.383052  0.025186  1.172413
28  (Thriller)         (Mystery)          0.189479            0.059507            0.036055  0.190283    3.197672  0.024779  1.161509
29  (Mystery)          (Thriller)         0.059507            0.189479            0.036055  0.605893    3.197672  0.024779  2.056601
30  (Thriller)         (Sci-Fi)           0.189479            0.086795            0.028932  0.152689    1.759206  0.012486  1.077769
31  (Sci-Fi)           (Thriller)         0.086795            0.189479            0.028932  0.333333    1.759206  0.012486  1.215781
32  (Comedy, Drama)    (Romance)          0.104000            0.169315            0.035068  0.337197    1.991536  0.017460  1.253291
33  (Romance)          (Comedy, Drama)    0.169315            0.104000            0.035068  0.207120    1.991536  0.017460  1.130057
34  (Drama, Crime)     (Thriller)         0.067616            0.189479            0.032000  0.473258    2.497673  0.019188  1.538742
35  (Thriller, Drama)  (Crime)            0.087123            0.120548            0.032000  0.367296    3.046884  0.021497  1.389989
36  (Crime)            (Thriller, Drama)  0.120548            0.087123            0.032000  0.265455    3.046884  0.021497  1.242778
37  (Thriller)         (Drama, Crime)     0.189479            0.067616            0.032000  0.168884    2.497673  0.019188  1.121845
rules_movies[(rules_movies.lift>4)].sort_values(by=['lift'], ascending=False)
    antecedents  consequents  antecedent support  consequent support  support   confidence  lift      leverage  conviction
14  (Children)   (Animation)  0.063890            0.048986            0.027068  0.423671    8.648758  0.023939  1.650122
15  (Animation)  (Children)   0.048986            0.063890            0.027068  0.552573    8.648758  0.023939  2.092205

Children and Animation are the two most strongly associated genres.
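
The Children/Animation row can be approximately rebuilt from the three rounded supports printed in the frequent itemsets table. Because the inputs are rounded to six decimals, the results agree with the table only to about four significant digits; the variable names are chosen for this sketch.

```python
# Rounded supports copied from the frequent itemsets output.
support_children = 0.063890
support_animation = 0.048986
support_both = 0.027068

conf_children_to_animation = support_both / support_children
conf_animation_to_children = support_both / support_animation
lift_value = support_both / (support_children * support_animation)

print(round(conf_children_to_animation, 4))  # about 0.4237
print(round(conf_animation_to_children, 4))  # about 0.5526
print(round(lift_value, 2))                  # about 8.65
```

The lift is identical in both directions, while the confidence is higher from Animation to Children because Animation is the rarer genre.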

movies[(movies.genres.str.contains('Children')) & (~movies.genres.str.contains('Animation'))]
      movieId  title                                               genres
   1        2  Jumanji (1995)                                      Adventure|Children|Fantasy
   7        8  Tom and Huck (1995)                                 Adventure|Children
  26       27  Now and Then (1995)                                 Children|Drama
  32       34  Babe (1995)                                         Children|Drama
  36       38  It Takes Two (1995)                                 Children|Comedy
  51       54  Big Green, The (1995)                               Children|Comedy
  56       60  Indian in the Cupboard, The (1995)                  Adventure|Children|Fantasy
  74       80  White Balloon, The (Badkonake sefid) (1995)         Children|Drama
  81       87  Dunston Checks In (1996)                            Children|Comedy
  98      107  Muppet Treasure Island (1996)                       Adventure|Children|Comedy|Musical
 114      126  NeverEnding Story III, The (1994)                   Adventure|Children|Fantasy
 125      146  Amazing Panda Adventure, The (1995)                 Adventure|Children
 137      158  Casper (1995)                                       Adventure|Children
 148      169  Free Willy 2: The Adventure Home (1995)             Adventure|Children|Drama
 160      181  Mighty Morphin Power Rangers: The Movie (1995)      Action|Children
 210      238  Far From Home: The Adventures of Yellow Dog (1995)  Adventure|Children
 213      241  Fluke (1995)                                        Children|Drama
 215      243  Gordy (1995)                                        Children|Comedy|Fantasy
 222      250  Heavyweights (Heavy Weights) (1995)                 Children|Comedy
 230      258  Kid in King Arthur's Court, A (1995)                Adventure|Children|Comedy|Fantasy|Romance
 234      262  Little Princess, A (1995)                           Children|Drama
 280      314  Secret of Roan Inish, The (1994)                    Children|Drama|Fantasy|Mystery
 308      343  Baby-Sitters Club, The (1995)                       Children
 320      355  Flintstones, The (1994)                             Children|Comedy|Fantasy
 326      362  Jungle Book, The (1994)                             Adventure|Children|Romance
 338      374  Richie Rich (1994)                                  Children|Comedy
 361      410  Addams Family Values (1993)                         Children|Comedy|Fantasy
 371      421  Black Beauty (1994)                                 Adventure|Children|Drama
 404      455  Free Willy (1993)                                   Adventure|Children|Drama
 431      484  Lassie (1994)                                       Adventure|Children
 ...      ...  ...                                                 ...
7707    83177  Yogi Bear (2010)                                    Children|Comedy
7735    84312  Home Alone 4 (2002)                                 Children|Comedy|Crime
7823    87383  Curly Top (1935)                                    Children|Musical|Romance
7900    89881  Superman and the Mole-Men (1951)                    Children|Mystery|Sci-Fi
7929    90866  Hugo (2011)                                         Children|Drama|Mystery
7935    91094  Muppets, The (2011)                                 Children|Comedy|Musical
7942    91286  Little Colonel, The (1935)                          Children|Comedy|Crime|Drama
7971    91886  Dolphin Tale (2011)                                 Children|Drama
8096    95740  Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)  Children|Musical|Mystery
8199    98441  Rebecca of Sunnybrook Farm (1938)                   Children|Comedy|Drama|Musical
8200    98458  Baby Take a Bow (1934)                              Children|Comedy|Drama
8377   104074  Percy Jackson: Sea of Monsters (2013)               Adventure|Children|Fantasy
8450   106441  Book Thief, The (2013)                              Children|Drama|War
8558   110461  We Are the Best! (Vi är bäst!) (2013)               Children|Comedy|Drama
8592   111659  Maleficent (2014)                                   Action|Adventure|Children|IMAX
8689   115139  Challenge to Lassie (1949)                          Children|Drama
8761   118997  Into the Woods (2014)                               Children|Comedy|Fantasy|Musical
8765   119155  Night at the Museum: Secret of the Tomb (2014)      Adventure|Children|Comedy|Fantasy
8766   119655  Seventh Son (2014)                                  Adventure|Children|Fantasy
8792   122932  Elsa & Fred (2014)                                  Children|Comedy|Romance
8845   130073  Cinderella (2015)                                   Children|Drama|Fantasy|Romance
8850   130450  Pan (2015)                                          Adventure|Children|Fantasy
8871   132046  Tomorrowland (2015)                                 Action|Adventure|Children|Mystery|Sci-Fi
8917   135266  Zenon: The Zequel (2001)                            Adventure|Children|Comedy|Sci-Fi
8918   135268  Zenon: Z3 (2004)                                    Adventure|Children|Comedy
8960   139620  Everything's Gonna Be Great (1998)                  Adventure|Children|Comedy|Drama
8967   140152  Dreamcatcher (2015)                                 Children|Crime|Documentary
8981   140747  16 Wishes (2010)                                    Children|Drama|Fantasy
9052   149354  Sisters (2015)                                      Children|Comedy

336 rows × 3 columns
