Automated Analysis of User Purchasing Patterns in the Era of Big Data

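Statement: This article is original work, and the author reserves all rights of interpretation. Reposting is permitted; if you quote it, please include a link to this article.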

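Abstract: As big data matures, medium and large merchants increasingly rely on it to understand their customers in more depth. To maximize profit, they often sample user behaviour dynamically, collect an appropriate amount of personal information, and match users with the products they actually want. Knowing a product's repurchase rate is therefore especially important.

This article uses Python to automate the analysis of how well products match users, how similar users' purchases are, and the heat relationship between products (the repurchase rate). The goal is to understand, and simulate in a simple way, the underlying logic that medium and large merchants use for personalized recommendation.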

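Note: before starting, download the Excel file that accompanies this article (a cleaned dataset of purchases at a supermarket from 2011 to 2014) and save it in the project's root directory.

(Copyright note on the Excel file: the file was batch-generated by AI and does not reflect real data; the author takes no responsibility for it.)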


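Contents

I. Similarity analysis

① Underlying logic

② Repurchase rate analysis

Basic environment setup

Filtering a user's relevant information

Product statistics

Extending to all categories

Similarity between columns

Computing the repurchase rate

II. Lift association and confidence analysis

① Introduction to apriori and Lift association

② apriori in practice

③ Drawing a heat map with seaborn

④ Practical significance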


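I. Similarity analysis

This part analyses the similarity of the products that users buy.

① Underlying logic

The idea of similarity between users' purchases is easy to grasp. Still, you may have wondered why the big e-commerce platforms keep showing you products that you might actually want to buy. Some would answer that big data is constantly "stealing" your private information, but that may not be the real explanation. As regulation in China has tightened what apps may access, covert data collection has become less common. So is something going on in the database instead?

These questions can be explained with a single picture: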

Figure 1: Product filtering under different criteria
(Source: Lu Mingdong's blog, lumingdong.cn)

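Consider Figure 1. The left half shows "User-Based Filtering", which links users to one another. In the left diagram (call the users A, B, and C from top to bottom), user A bought grapes, an apple, a pineapple, and a pear; user B bought only an apple; user C bought an apple and a pineapple. Comparing just these three purchase histories, users A and C are the most similar, sharing two products (apple and pineapple). It is then reasonable to ask whether user C might also like grapes and pears. The system can link the two users and, the next time user C visits, recommend those two products. This is the logic of User-Based Filtering.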
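The user-based logic described above can be sketched in plain Python. This is a minimal illustration rather than the article's method: the `purchases` dictionary, the function name `recommend_user_based`, and the choice of "number of shared items" as the similarity measure are all assumptions made to mirror the fruit example.

```python
# Purchases from the left half of Figure 1 (users A, B, C as named in the text).
purchases = {
    "A": {"grape", "apple", "pineapple", "pear"},
    "B": {"apple"},
    "C": {"apple", "pineapple"},
}


def recommend_user_based(target, purchases):
    """Return the most similar user and the items they bought that `target` has not."""
    others = [user for user in purchases if user != target]
    # Assumed similarity measure: how many items the two users have in common.
    nearest = max(others, key=lambda user: len(purchases[user] & purchases[target]))
    # Recommend what the nearest user bought that the target has not bought yet.
    return nearest, sorted(purchases[nearest] - purchases[target])


print(recommend_user_based("C", purchases))
```

For user C the nearest neighbour is A (two shared items), so the suggestions are the two items only A has bought.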

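The right half shows "Item-Based Filtering", which links products to one another. In the right diagram, user A bought grapes, an apple, a pineapple, and a pear; user B bought grapes and a pineapple; user C bought a pineapple. Clearly the pineapple is widely liked. Users A and B also have something in common: both bought grapes along with the pineapple. The system can therefore recommend grapes to user C, because grapes and pineapples tend to be chosen together. This is the logic of Item-Based Filtering.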
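The item-based logic can be sketched the same way. The code below is an illustrative assumption, not the article's implementation: it counts how often each candidate item was bought together with something the target user already owns, and returns the most frequent one. The names `purchases` and `recommend_item_based` are made up for this example.

```python
from collections import Counter

# Purchases from the right half of Figure 1 (users A, B, C as named in the text).
purchases = {
    "A": {"grape", "apple", "pineapple", "pear"},
    "B": {"grape", "pineapple"},
    "C": {"pineapple"},
}


def recommend_item_based(target, purchases):
    """Return the item most often co-purchased with the target's items, with its count."""
    owned = purchases[target]
    co_counts = Counter()
    for user, items in purchases.items():
        if user == target:
            continue
        # For every owned item this user also bought, credit each item the target lacks.
        for _ in items & owned:
            for candidate in items - owned:
                co_counts[candidate] += 1
    return co_counts.most_common(1)[0]


print(recommend_item_based("C", purchases))
```

Grapes co-occur with the pineapple for both A and B, while apple and pear co-occur only once each, so grapes come out on top for user C.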

② Repurchase rate analysis

Let us start by analysing the repurchase rate of the products.

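Basic environment setup

Before starting, make sure the matplotlib, pandas, and numpy libraries are installed in the project's Python environment. If they are not, install them with pip or conda:

$ > pip install matplotlib
$ > pip install pandas
$ > pip install numpy

------or------

$ > conda install matplotlib
$ > conda install pandas
$ > conda install numpy

If needed, pip's -i (custom package index) or --pre (pre-release versions) options can be added.

After installation, import the libraries and load the Excel file at the top of the program, then filter out malformed rows: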

# Import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Basic display settings
pd.set_option('display.width', 60)

# Load the full dataset
df = pd.read_excel('supermarket_data_clean.xlsx', index_col=0)

# Drop rows whose values have the wrong data type
def is_number(value):
    return isinstance(value, (int, float))

def is_string(value):
    return isinstance(value, str)

df = df[df["Quantity"].map(is_number)]
df = df[df["Sub-Category"].map(is_string)]
print(df)

The output is:

                 Order Date  Order Date Year  \
Order ID                                       
AG-2011-2040       1/1/2011             2011   
IN-2011-47883      1/1/2011             2011   
HU-2011-1220       1/1/2011             2011   
IT-2011-3647632    1/1/2011             2011   
IN-2011-47883      1/1/2011             2011   
...                     ...              ...   
CA-2014-115427   31-12-2014             2014   
MO-2014-2560     31-12-2014             2014   
MX-2014-110527   31-12-2014             2014   
MX-2014-114783   31-12-2014             2014   
CA-2014-156720   31-12-2014             2014   

                 Order Date Month  Order Date Day  \
Order ID                                            
AG-2011-2040                    1               1   
IN-2011-47883                   1               1   
HU-2011-1220                    1               1   
IT-2011-3647632                 1               1   
IN-2011-47883                   1               1   
...                           ...             ...   
CA-2014-115427                 12              31   
MO-2014-2560                   12              31   
MX-2014-110527                 12              31   
MX-2014-114783                 12              31   
CA-2014-156720                 12              31   

                Ship Date  Ship Date Year  \
Order ID                                    
AG-2011-2040     6/1/2011            2011   
IN-2011-47883    8/1/2011            2011   
HU-2011-1220     5/1/2011            2011   
IT-2011-3647632  5/1/2011            2011   
IN-2011-47883    8/1/2011            2011   
...                   ...             ...   
CA-2014-115427   4/1/2015            2015   
MO-2014-2560     5/1/2015            2015   
MX-2014-110527   2/1/2015            2015   
MX-2014-114783   6/1/2015            2015   
CA-2014-156720   4/1/2015            2015   

                 Ship Date Month  Ship Date Day  \
Order ID                                          
AG-2011-2040                   6              1   
IN-2011-47883                  8              1   
HU-2011-1220                   5              1   
IT-2011-3647632                5              1   
IN-2011-47883                  8              1   
...                          ...            ...   
CA-2014-115427                 4              1   
MO-2014-2560                   5              1   
MX-2014-110527                 2              1   
MX-2014-114783                 6              1   
CA-2014-156720                 4              1   

                      Ship Mode Customer ID  ...  \
Order ID                                     ...   
AG-2011-2040     Standard Class    TB-11280  ...   
IN-2011-47883    Standard Class    JH-15985  ...   
HU-2011-1220       Second Class      AT-735  ...   
IT-2011-3647632    Second Class    EM-14140  ...   
IN-2011-47883    Standard Class    JH-15985  ...   
...                         ...         ...  ...   
CA-2014-115427   Standard Class    EB-13975  ...   
MO-2014-2560     Standard Class     LP-7095  ...   
MX-2014-110527     Second Class    CM-12190  ...   
MX-2014-114783   Standard Class    TD-20995  ...   
CA-2014-156720   Standard Class    JM-15580  ...   

                                            Sub-Category  \
Order ID                                                   
AG-2011-2040                                     Storage   
IN-2011-47883                                   Supplies   
HU-2011-1220                                     Storage   
IT-2011-3647632                                    Paper   
IN-2011-47883                                Furnishings   
...                                                  ...   
CA-2014-115427                                   Binders   
MO-2014-2560     Wilson Jones Hole Reinforcements, Clear   
MX-2014-110527                                    Labels   
MX-2014-114783                                    Labels   
CA-2014-156720                                 Fasteners   

                                                    Product Name  \
Order ID                                                           
AG-2011-2040                                 Tenex Lockers, Blue   
IN-2011-47883                           Acme Trimmer, High Speed   
HU-2011-1220                             Tenex Box, Single Width   
IT-2011-3647632                      Enermax Note Cards, Premium   
IN-2011-47883                         Eldon Light Bulb, Duo Pack   
...                                                          ...   
CA-2014-115427   Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl   
MO-2014-2560                                                3.99   
MX-2014-110527            Hon Color Coded Labels, 5000 Label Set   
MX-2014-114783            Hon Legal Exhibit Labels, Alphabetical   
CA-2014-156720                               Bagged Rubber Bands   

                   Sales Quantity Discount  Profit  \
Order ID                                             
AG-2011-2040       408.3        2     0.00  106.14   
IN-2011-47883    120.366        3     0.10  36.036   
HU-2011-1220       66.12        4     0.00   29.64   
IT-2011-3647632   44.865        3     0.50 -26.055   
IN-2011-47883     113.67        5     0.10   37.77   
...                  ...      ...      ...     ...   
CA-2014-115427    13.904        2     0.20  4.5188   
MO-2014-2560           1        0     0.42    0.49   
MX-2014-110527      26.4        3     0.00   12.36   
MX-2014-114783      7.12        1     0.00    0.56   
CA-2014-156720     3.024        3     0.20 -0.6048   

                Shipping Cost Order Priority Unnamed: 24  \
Order ID                                                   
AG-2011-2040            35.46         Medium         NaN   
IN-2011-47883            9.72         Medium         NaN   
HU-2011-1220             8.17           High         NaN   
IT-2011-3647632          4.82           High         NaN   
IN-2011-47883             4.7         Medium         NaN   
...                       ...            ...         ...   
CA-2014-115427           0.89         Medium         NaN   
MO-2014-2560           Medium            NaN         NaN   
MX-2014-110527           0.35         Medium         NaN   
MX-2014-114783            0.2         Medium         NaN   
CA-2014-156720           0.17         Medium         NaN   

                Unnamed: 25  
Order ID                     
AG-2011-2040            NaN  
IN-2011-47883           NaN  
HU-2011-1220            NaN  
IT-2011-3647632         NaN  
IN-2011-47883           NaN  
...                     ...  
CA-2014-115427          NaN  
MO-2014-2560            NaN  
MX-2014-110527          NaN  
MX-2014-114783          NaN  
CA-2014-156720          NaN  

[51152 rows x 30 columns]

Next, we can count the number of order records for each user:

cc = df.groupby("Customer Name").agg({"count"})
print(cc)

Part of the output looks like this:

                   Order Date Order Date Year  \
                        count           count   
Customer Name                                   
Aaron Bergman              89              89   
Aaron Hawkins              56              56   
Aaron Smayling             60              60   
Adam Bellavance            68              68   
Adam Hart                  82              82   
...                       ...             ...   
Xylona Preis               61              61   
Yana Sorensen              62              62   
Yoseph Carroll             56              56   
Zuschuss Carroll           85              85   
Zuschuss Donatelli         54              54  

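For example, the user Aaron Bergman has 89 rows in the dataset. Each row is one order line, so this counts purchased line items over the whole 2011 to 2014 period rather than separate visits in a single year; one order can span several rows. We will use this user as a simple example of how to filter a user's relevant information.

Filtering a user's relevant information

When analysing the user Aaron Bergman, we look first at how often he purchased, how much each purchase cost, which items he bought, and in what quantity. The Excel sheet already lists all of this: Sub-Category is the item purchased and Quantity is the number bought, so we can select these columns from df directly: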

u1 = df[df['Customer Name'] == 'Aaron Bergman'][['Sub-Category', 'Sales', 'Quantity', 'Order Date']]
print(u1)

The output is:

                Sub-Category   Sales Quantity  Order Date
Order ID                                                 
MX-2011-127215        Phones   82.26        1   3/11/2011
ES-2011-4146320          Art    50.7        2    4/4/2011
ES-2011-4146320       Labels    32.4        3    4/4/2011
CA-2011-156587        Chairs  48.712        1    7/3/2011
CA-2011-156587           Art   17.94        3    7/3/2011
...                      ...     ...      ...         ...
ES-2011-4184901  Furnishings   75.96        4  30-08-2011
US-2013-123806        Chairs  59.328        3  31-05-2013
US-2013-123806           Art  21.816        1  31-05-2013
US-2013-103450        Chairs  86.416        1  31-10-2013
US-2013-103450     Fasteners   27.18        3  31-10-2013

[89 rows x 4 columns]

Product statistics

With this done, each user's purchases are clear at a glance, and we can easily compute the total quantity the user bought in each sub-category:

u1sc = u1[["Quantity", 'Sub-Category']].groupby('Sub-Category').agg({'sum'})
print(u1sc)

The output is:

             Quantity
                  sum
Sub-Category         
Accessories        26
Art                14
Binders            23
Bookcases           9
Chairs             16
Copiers            36
Envelopes           1
Fasteners          16
Furnishings        35
Labels             18
Machines           20
Phones             27
Storage            29
Supplies           31

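This gives the total quantity Aaron Bergman has bought in every sub-category that appears in his purchase history.

Extending to all categories

We can now compute the total quantity purchased in every product category across all users: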

sc = df[['Sub-Category', 'Quantity']].groupby('Sub-Category').agg({'sum'})
sc.columns = ['All']
print(sc)

The output is:

                                             All
Sub-Category                                    
Accessories                                10806
Acco 3-Hole Punch, Recycled                    0
Acco Binder, Economy                           0
Acco Binding Machine, Recycled                 0
Acco Hole Reinforcements, Durable              0
...                                          ...
Wilson Jones Index Tab, Economy                0
Xerox Cards & Envelopes, Multicolor            0
Xerox Cards & Envelopes, Recycled              0
Xerox Computer Printout Paper, Multicolor      0
Xerox Parchment Paper, Multicolor              0

[480 rows x 1 columns]

We can then run the same analysis for several users and concatenate the results:

users = ["Aaron Bergman", "Aaron Hawkins", "Aaron Smayling", "Adam Bellavance"]
for user in users:
    u1 = df[df['Customer Name'] == user][['Sub-Category', 'Sales', 'Quantity', 'Order Date']]
    u1sc = u1[["Quantity", 'Sub-Category']].groupby('Sub-Category').agg({'sum'})
    u1sc.columns = [user]
    sc = pd.concat([sc, u1sc], axis=1)
sc = sc.fillna(0)
print(sc)
                                               All  \
Sub-Category                                         
Accessories                                10806.0   
Acco 3-Hole Punch, Recycled                    0.0   
Acco Binder, Economy                           0.0   
Acco Binding Machine, Recycled                 0.0   
Acco Hole Reinforcements, Durable              0.0   
...                                            ...   
Wilson Jones Index Tab, Economy                0.0   
Xerox Cards & Envelopes, Multicolor            0.0   
Xerox Cards & Envelopes, Recycled              0.0   
Xerox Computer Printout Paper, Multicolor      0.0   
Xerox Parchment Paper, Multicolor              0.0   

                                           Aaron Bergman  \
Sub-Category                                               
Accessories                                           26   
Acco 3-Hole Punch, Recycled                            0   
Acco Binder, Economy                                   0   
Acco Binding Machine, Recycled                         0   
Acco Hole Reinforcements, Durable                      0   
...                                                  ...   
Wilson Jones Index Tab, Economy                        0   
Xerox Cards & Envelopes, Multicolor                    0   
Xerox Cards & Envelopes, Recycled                      0   
Xerox Computer Printout Paper, Multicolor              0   
Xerox Parchment Paper, Multicolor                      0   

                                           Aaron Hawkins  \
Sub-Category                                               
Accessories                                           10   
Acco 3-Hole Punch, Recycled                            0   
Acco Binder, Economy                                   0   
Acco Binding Machine, Recycled                         0   
Acco Hole Reinforcements, Durable                      0   
...                                                  ...   
Wilson Jones Index Tab, Economy                        0   
Xerox Cards & Envelopes, Multicolor                    0   
Xerox Cards & Envelopes, Recycled                      0   
Xerox Computer Printout Paper, Multicolor              0   
Xerox Parchment Paper, Multicolor                      0   

                                           Aaron Smayling  \
Sub-Category                                                
Accessories                                            20   
Acco 3-Hole Punch, Recycled                             0   
Acco Binder, Economy                                    0   
Acco Binding Machine, Recycled                          0   
Acco Hole Reinforcements, Durable                       0   
...                                                   ...   
Wilson Jones Index Tab, Economy                         0   
Xerox Cards & Envelopes, Multicolor                     0   
Xerox Cards & Envelopes, Recycled                       0   
Xerox Computer Printout Paper, Multicolor               0   
Xerox Parchment Paper, Multicolor                       0   

                                           Adam Bellavance  
Sub-Category                                                
Accessories                                           13.0  
Acco 3-Hole Punch, Recycled                            0.0  
Acco Binder, Economy                                   0.0  
Acco Binding Machine, Recycled                         0.0  
Acco Hole Reinforcements, Durable                      0.0  
...                                                    ...  
Wilson Jones Index Tab, Economy                        0.0  
Xerox Cards & Envelopes, Multicolor                    0.0  
Xerox Cards & Envelopes, Recycled                      0.0  
Xerox Computer Printout Paper, Multicolor              0.0  
Xerox Parchment Paper, Multicolor                      0.0  

[480 rows x 5 columns]

Similarity between columns

With the steps above complete, we can analyse how similar the columns are to one another:

similar = sc.corr(method = 'pearson', min_periods=1)
print(similar)
                      All  Aaron Bergman  Aaron Hawkins  \
All              1.000000       0.816475       0.809099   
Aaron Bergman    0.816475       1.000000       0.722000   
Aaron Hawkins    0.809099       0.722000       1.000000   
Aaron Smayling   0.898259       0.660771       0.594060   
Adam Bellavance  0.875031       0.588859       0.676802   

                 Aaron Smayling  Adam Bellavance  
All                    0.898259         0.875031  
Aaron Bergman          0.660771         0.588859  
Aaron Hawkins          0.594060         0.676802  
Aaron Smayling         1.000000         0.752868  
Adam Bellavance        0.752868         1.000000
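For intuition about what `corr(method='pearson')` computes for each pair of columns, the sketch below evaluates the Pearson coefficient by hand. The function name `pearson` and the three quantity vectors are invented for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical quantities bought per sub-category by three made-up users.
user_a = [1, 2, 3, 4]
user_b = [2, 4, 6, 8]   # buys in the same proportions as user_a
user_c = [4, 3, 2, 1]   # buys in the opposite pattern

print(pearson(user_a, user_b))
print(pearson(user_a, user_c))
```

A value near 1 means two users spread their purchases across categories in a similar way, and a value near -1 means the opposite pattern. The coefficient ignores scale, which is why the All column can correlate strongly with individual users despite its much larger totals.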

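We can also pick out the user most similar to a given user, after excluding the All column and the user themselves:

name = "Aaron Bergman"
us = similar[[name]]
us = us.drop('All')
us = us.drop(name)
# idxmax returns the row label (the user name) with the highest correlation
similar_user = us[name].idxmax()
print(similar_user)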

We can also find that user's most-purchased product category:

suf = sc[[similar_user]]
line = suf.iloc[suf[similar_user].argmax()]
print(line)

Computing the repurchase rate

With all of this in place, we can compute the repurchase rate:

# Select one year of orders (2012 here)
df_year = df[df['Order Date Year'] == 2012]

# Count order lines per customer (rows) and per month (columns)
df_pivot_counts = df_year.pivot_table(index='Customer ID', columns='Order Date Month', values='Quantity', aggfunc='count')
#print(df_pivot_counts)
# Map counts of 2 or more to 1, a count of 1 to 0, and keep missing values as NaN
df_pivot_counts_repurchase = df_pivot_counts.applymap(lambda x: 1 if x >= 2 else 0 if pd.notnull(x) else np.nan)
# Monthly repurchase rate = repeat buyers / all buyers in that month
(df_pivot_counts_repurchase.sum()/df_pivot_counts_repurchase.count()).plot(marker='o',figsize=(12, 6))
plt.title('Monthly repurchase rate')
plt.grid(linestyle='-.')
plt.show()

Running this produces the line chart shown in Figure 2:

Figure 2: Line chart of the repurchase rate produced by the analysis
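The pivot-table calculation can be checked against a tiny hand-made example. The sketch below uses only the standard library; the `order_lines` data and the function name `monthly_repurchase_rate` are hypothetical. It applies the same rule as the code above: a customer counts as a repeat buyer in a month if they have two or more order lines in that month.

```python
from collections import Counter

# Hypothetical (customer_id, month) pairs, one pair per order line.
order_lines = [
    ("C1", 1), ("C1", 1), ("C2", 1),
    ("C1", 2), ("C2", 2), ("C2", 2), ("C3", 2), ("C3", 2),
]


def monthly_repurchase_rate(order_lines):
    """Share of each month's buyers who have two or more order lines in that month."""
    lines_per_customer_month = Counter(order_lines)
    buyers = Counter()         # month -> customers with at least one line
    repeat_buyers = Counter()  # month -> customers with two or more lines
    for (customer, month), n_lines in lines_per_customer_month.items():
        buyers[month] += 1
        if n_lines >= 2:
            repeat_buyers[month] += 1
    return {month: repeat_buyers[month] / buyers[month] for month in sorted(buyers)}


rates = monthly_repurchase_rate(order_lines)
print(rates)
```

In month 1 one of the two buyers repeats, and in month 2 two of the three do. Customers with no purchases in a month are left out of that month's denominator, which is what the NaN handling in the pivot table achieves.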

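II. Lift association and confidence analysis

This part uses the apyori library to run an association analysis on the data table, computing the support, confidence, and lift of product combinations, and then draws a heat map.

① Introduction to apriori and Lift association

apriori is a function provided by the apyori library that implements the Apriori association-rule mining algorithm. Each rule it reports is described by three metrics: Support, Confidence, and Lift, defined as follows:

Support is the proportion of all transactions that contain both A and B.

Confidence is, among the transactions that contain A, the proportion that also contain B; that is, the number of transactions containing both A and B divided by the number containing A.

Lift is the ratio of "the proportion of transactions containing A that also contain B" to "the proportion of transactions containing B". As a formula: Lift = ( P(A & B) / P(A) ) / P(B) = P(A & B) / ( P(A) P(B) ).

Lift reflects how A and B are correlated in an association rule. A lift above 1 indicates positive correlation, stronger the higher it is; a lift below 1 indicates negative correlation, stronger the lower it is; a lift of exactly 1 indicates no correlation.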

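Here,

Support(X, Y) = P(XY) = \frac{num(XY)}{num(\text{all samples})},

Confidence(X \Leftarrow Y) = P(X \mid Y) = \frac{P(XY)}{P(Y)}

In the second formula Y is the antecedent (the role of A above) and X is the consequent (the role of B above).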
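To make the three definitions concrete, the sketch below computes them directly on a handful of made-up transactions. The `transactions` list and the helper names `support`, `confidence`, and `lift` are assumptions for illustration only; they are not part of apyori.

```python
# Five hypothetical transactions, each a set of item labels.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
    {"C"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = antecedent | consequent
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print(support({"A", "B"}, transactions))
print(confidence({"A"}, {"B"}, transactions))
print(lift({"A"}, {"B"}, transactions))
```

Here A and B appear together in 2 of 5 transactions, A appears in 3 and B in 3, so the rule A → B has support 2/5, confidence 2/3, and a lift slightly above 1, indicating a weak positive association.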

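② apriori in practice

With apriori introduced, we can analyse support, confidence, and lift on the dataset.

First, make sure the apyori library is installed in the project's Python environment. If it is not, install it with pip or conda: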

$ > pip install apyori

------or------

$ > conda install apyori

After installation, import pandas and apyori at the top of the program and load the dataset file:

import pandas as pd
from apyori import apriori

# Load the data
file_path = "supermarket_data_clean.xlsx"
dfs = pd.read_excel(file_path, index_col=0)

To prepare the data for apriori, we first drop the rows whose Profit, Customer ID, or Product Name has the wrong data type:

df = dfs[dfs["Profit"].map(lambda x: isinstance(x, (int, float)))]
df = df[df["Customer ID"].map(lambda x: isinstance(x, str))]
df = df[df["Product Name"].map(lambda x: isinstance(x, str))]
print(df)

The output is:

                 Order Date  Order Date Year  Order Date Month  \
Order ID                                                         
AG-2011-2040       1/1/2011             2011                 1   
IN-2011-47883      1/1/2011             2011                 1   
HU-2011-1220       1/1/2011             2011                 1   
IT-2011-3647632    1/1/2011             2011                 1   
IN-2011-47883      1/1/2011             2011                 1   
...                     ...              ...               ...   
MX-2014-108574   31-12-2014             2014                12   
CA-2014-115427   31-12-2014             2014                12   
MX-2014-110527   31-12-2014             2014                12   
MX-2014-114783   31-12-2014             2014                12   
CA-2014-156720   31-12-2014             2014                12   

                 Order Date Day Ship Date  Ship Date Year  Ship Date Month  \
Order ID                                                                     
AG-2011-2040                  1  6/1/2011            2011                6   
IN-2011-47883                 1  8/1/2011            2011                8   
HU-2011-1220                  1  5/1/2011            2011                5   
IT-2011-3647632               1  5/1/2011            2011                5   
IN-2011-47883                 1  8/1/2011            2011                8   
...                         ...       ...             ...              ...   
MX-2014-108574               31  4/1/2015            2015                4   
CA-2014-115427               31  4/1/2015            2015                4   
MX-2014-110527               31  2/1/2015            2015                2   
MX-2014-114783               31  6/1/2015            2015                6   
CA-2014-156720               31  4/1/2015            2015                4   

                 Ship Date Day       Ship Mode Customer ID  ... Sub-Category  \
Order ID                                                    ...                
AG-2011-2040                 1  Standard Class    TB-11280  ...      Storage   
IN-2011-47883                1  Standard Class    JH-15985  ...     Supplies   
HU-2011-1220                 1    Second Class      AT-735  ...      Storage   
IT-2011-3647632              1    Second Class    EM-14140  ...        Paper   
IN-2011-47883                1  Standard Class    JH-15985  ...  Furnishings   
...                        ...             ...         ...  ...          ...   
MX-2014-108574               1  Standard Class    JB-16045  ...       Labels   
CA-2014-115427               1  Standard Class    EB-13975  ...      Binders   
MX-2014-110527               1    Second Class    CM-12190  ...       Labels   
MX-2014-114783               1  Standard Class    TD-20995  ...       Labels   
CA-2014-156720               1  Standard Class    JM-15580  ...    Fasteners   

                                                    Product Name    Sales  \
Order ID                                                                    
AG-2011-2040                                 Tenex Lockers, Blue    408.3   
IN-2011-47883                           Acme Trimmer, High Speed  120.366   
HU-2011-1220                             Tenex Box, Single Width    66.12   
IT-2011-3647632                      Enermax Note Cards, Premium   44.865   
IN-2011-47883                         Eldon Light Bulb, Duo Pack   113.67   
...                                                          ...      ...   
MX-2014-108574          Novimex Legal Exhibit Labels, Adjustable    16.74   
CA-2014-115427   Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl   13.904   
MX-2014-110527            Hon Color Coded Labels, 5000 Label Set     26.4   
MX-2014-114783            Hon Legal Exhibit Labels, Alphabetical     7.12   
CA-2014-156720                               Bagged Rubber Bands    3.024   

                Quantity Discount  Profit Shipping Cost Order Priority  \
Order ID                                                                 
AG-2011-2040           2      0.0  106.14         35.46         Medium   
IN-2011-47883          3      0.1  36.036          9.72         Medium   
HU-2011-1220           4      0.0   29.64          8.17           High   
IT-2011-3647632        3      0.5 -26.055          4.82           High   
IN-2011-47883          5      0.1   37.77           4.7         Medium   
...                  ...      ...     ...           ...            ...   
MX-2014-108574         3      0.0    0.66          1.32         Medium   
CA-2014-115427         2      0.2  4.5188          0.89         Medium   
MX-2014-110527         3      0.0   12.36          0.35         Medium   
MX-2014-114783         1      0.0    0.56           0.2         Medium   
CA-2014-156720         3      0.2 -0.6048          0.17         Medium   

                Unnamed: 24 Unnamed: 25  
Order ID                                 
AG-2011-2040            NaN         NaN  
IN-2011-47883           NaN         NaN  
HU-2011-1220            NaN         NaN  
IT-2011-3647632         NaN         NaN  
IN-2011-47883           NaN         NaN  
...                     ...         ...  
MX-2014-108574          NaN         NaN  
CA-2014-115427          NaN         NaN  
MX-2014-110527          NaN         NaN  
MX-2014-114783          NaN         NaN  
CA-2014-156720          NaN         NaN  

[50629 rows x 30 columns]

Next, merge each customer's products into a dictionary:

combine_dict = df.groupby('Customer ID').apply(
    lambda x: {col: x[col].tolist()[0] if col != 'Product Name' else x[col].tolist() for col in x.columns}).to_dict()
new_data_dict = []
order_item_list = []
for key, value in combine_dict.items():
    sample = {"Customer ID": key, "Product Name": value["Product Name"]}
    new_data_dict.append(sample)
    order_item_list.append(value["Product Name"])
print(new_data_dict[0])
print(order_item_list[0])
{'Customer ID': 'AA-10315', 'Product Name': ['Fiskars Trimmer, Serrated', 'Avery Shipping Labels, Alphabetical', 'SanDisk Numeric Keypad, USB', 'Elite Shears, High Speed', 'Tenex Personal Project File with Scoop Front Design, Black', 'High Speed Automatic Electric Letter Opener', 'Polycom VVX 310 VoIP phone', 'Verbatim 25 GB 6x Blu-ray Single Layer Recordable Disc, 1/Pack', 'Acco Banker\'s Clasps, 5 3/4"-Long', 'SanDisk Memo Slips, Multicolor', 'BIC Highlighters, Blue', 'Samsung Audio Dock, Full Size', 'Nokia Speaker Phone, Full Size', 'Bush Floating Shelf Set, Metal', 'Apple Speaker Phone, with Caller ID', 'Smead Lockers, Wire Frame', 'Cardinal Binder Covers, Durable', 'Binney & Smith Pencil Sharpener, Water Color', 'Tenex Folders, Blue', 'Konica Calculator, Wireless', 'Hon File Folder Labels, Laser Printer Compatible', 'Master Caster Door Stop, Large Neon Orange', 'Staples', 'Sauder Library with Doors, Traditional', 'Motorola Signal Booster, Full Size', 'Motorola Headset, with Caller ID', 'Avery Index Tab, Economy', 'Eldon Box, Industrial', 'Cardinal Binder, Durable', 'Advantus Stacking Tray, Erganomic', 'Samsung Audio Dock, Cordless', 'Ibico Binding Machine, Recycled', 'Ibico Binder Covers, Durable', 'Wilson Jones Hole Reinforcements, Economy', 'Hamilton Beach Toaster, Black', 'Deflect-O Frame, Duo Pack', 'Boston Canvas, Fluorescent', "Belkin 325VA UPS Surge Protector, 6'", 'Avery Binding System Hidden Tab Executive Style Index Sets', 'GBC DocuBind 200 Manual Binding Machine', 'Fellowes Advanced Computer Series Surge Protectors', 'Bush Stackable Bookrack, Pine']}
['Fiskars Trimmer, Serrated', 'Avery Shipping Labels, Alphabetical', 'SanDisk Numeric Keypad, USB', 'Elite Shears, High Speed', 'Tenex Personal Project File with Scoop Front Design, Black', 'High Speed Automatic Electric Letter Opener', 'Polycom VVX 310 VoIP phone', 'Verbatim 25 GB 6x Blu-ray Single Layer Recordable Disc, 1/Pack', 'Acco Banker\'s Clasps, 5 3/4"-Long', 'SanDisk Memo Slips, Multicolor', 'BIC Highlighters, Blue', 'Samsung Audio Dock, Full Size', 'Nokia Speaker Phone, Full Size', 'Bush Floating Shelf Set, Metal', 'Apple Speaker Phone, with Caller ID', 'Smead Lockers, Wire Frame', 'Cardinal Binder Covers, Durable', 'Binney & Smith Pencil Sharpener, Water Color', 'Tenex Folders, Blue', 'Konica Calculator, Wireless', 'Hon File Folder Labels, Laser Printer Compatible', 'Master Caster Door Stop, Large Neon Orange', 'Staples', 'Sauder Library with Doors, Traditional', 'Motorola Signal Booster, Full Size', 'Motorola Headset, with Caller ID', 'Avery Index Tab, Economy', 'Eldon Box, Industrial', 'Cardinal Binder, Durable', 'Advantus Stacking Tray, Erganomic', 'Samsung Audio Dock, Cordless', 'Ibico Binding Machine, Recycled', 'Ibico Binder Covers, Durable', 'Wilson Jones Hole Reinforcements, Economy', 'Hamilton Beach Toaster, Black', 'Deflect-O Frame, Duo Pack', 'Boston Canvas, Fluorescent', "Belkin 325VA UPS Surge Protector, 6'", 'Avery Binding System Hidden Tab Executive Style Index Sets', 'GBC DocuBind 200 Manual Binding Machine', 'Fellowes Advanced Computer Series Surge Protectors', 'Bush Stackable Bookrack, Pine']
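The grouping step boils down to turning one row per purchase into one list of products per customer. The sketch below shows that idea with the standard library only; the `rows` data and the function name `build_baskets` are hypothetical. With pandas, a comparable result can be obtained from `df.groupby('Customer ID')['Product Name'].apply(list)`.

```python
from collections import defaultdict

# Hypothetical (customer_id, product_name) rows, one per order line.
rows = [
    ("AA-1", "Staples"),
    ("BB-2", "Paper"),
    ("AA-1", "Binder"),
    ("BB-2", "Staples"),
    ("CC-3", "Labels"),
]


def build_baskets(rows):
    """Group product names by customer, keeping the original row order."""
    baskets = defaultdict(list)
    for customer_id, product_name in rows:
        baskets[customer_id].append(product_name)
    return dict(baskets)


baskets = build_baskets(rows)
# apriori-style input: one list of items per customer.
order_item_list = list(baskets.values())
print(baskets)
print(order_item_list)
```

Each inner list plays the role of one transaction, so in this setup the association rules describe products bought by the same customer over the whole period rather than products in the same order.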

Now we can set up the association analysis:

results = apriori(order_item_list, min_support=0.005, min_confidence=0.25)

Once it is defined, iterate over the results:

list1, list2, list3, list4 = [], [], [], []
for result in results:
    # Get the support, rounded to 3 decimal places
    support = round(result.support, 3)
    # Iterate over the ordered_statistics entries
    for rule in result.ordered_statistics:
        # Get the antecedent and consequent and convert them to lists
        head_set = list(rule.items_base)
        tail_set = list(rule.items_add)
        # Skip rules with an empty antecedent
        if not head_set:
            continue
        # Join antecedent and consequent into an association-rule string
        related_category = str(head_set) + '→' + str(tail_set)
        # Extract the confidence, rounded to 3 decimal places
        confidence = round(rule.confidence, 3)
        # Extract the lift, rounded to 3 decimal places
        lift = round(rule.lift, 3)
        # Print the rule with its support, confidence, and lift
        print(related_category, support, confidence, lift)
        list1.append(related_category)
        list2.append(support)
        list3.append(confidence)
        list4.append(lift)

The output is:

['Acco Binder Covers, Durable']→['Staples'] 0.005 0.258 2.171
['Acco Binder, Durable']→['Staples'] 0.007 0.268 2.257
['Acco Index Tab, Durable']→['Staples'] 0.008 0.351 2.956
['Acme Trimmer, Easy Grip']→['Staples'] 0.005 0.4 3.365
['Advantus Clamps, Assorted Sizes']→['Staples'] 0.005 0.333 2.804
['Advantus Door Stop, Erganomic']→['Staples'] 0.008 0.429 3.605
['Advantus Light Bulb, Black']→['Staples'] 0.005 0.348 2.926
['Advantus Stacking Tray, Erganomic']→['Staples'] 0.006 0.25 2.103
['Ames Peel and Seal, Set of 50']→['Staples'] 0.005 0.32 2.692
['Apple Signal Booster, Cordless']→['Staples'] 0.005 0.308 2.589
['Avery Binder Covers, Durable']→['Staples'] 0.006 0.281 2.366
['Avery Binder Covers, Economy']→['Staples'] 0.006 0.257 2.163
['BIC Canvas, Blue']→['Staples'] 0.006 0.29 2.442
['BIC Markers, Water Color']→['Staples'] 0.006 0.25 2.103
['BIC Pens, Fluorescent']→['Staples'] 0.005 0.286 2.404
['Binney & Smith Highlighters, Easy-Erase']→['Staples'] 0.005 0.267 2.243
['Binney & Smith Pencil Sharpener, Water Color']→['Staples'] 0.009 0.269 2.265
['Boston Sketch Pad, Easy-Erase']→['Staples'] 0.006 0.281 2.366
['Cardinal Binding Machine, Durable']→['Staples'] 0.005 0.25 2.103
['Cardinal Hole Reinforcements, Economy']→['Staples'] 0.006 0.294 2.474
['Cisco Office Telephone, VoIP']→['Staples'] 0.005 0.333 2.804
['Cisco Smart Phone, Cordless']→['Staples'] 0.005 0.296 2.493
['Dania Classic Bookcase, Pine']→['Staples'] 0.005 0.381 3.205
['Deflect-O Clock, Black']→['Staples'] 0.005 0.348 2.926
['Deflect-O Stacking Tray, Black']→['Staples'] 0.005 0.333 2.804
['Elite Letter Opener, Easy Grip']→['Staples'] 0.005 0.296 2.493
['Elite Trimmer, Serrated']→['Staples'] 0.005 0.364 3.059
['Enermax Flash Drive, Erganomic']→['Staples'] 0.007 0.55 4.627
['Fellowes Shelving, Single Width']→['Staples'] 0.006 0.257 2.163
['Fellowes Trays, Single Width']→['Staples'] 0.005 0.267 2.243
['Fiskars Letter Opener, Easy Grip']→['Staples'] 0.005 0.258 2.171
['Fiskars Trimmer, Easy Grip']→['Staples'] 0.005 0.296 2.493
['GBC Standard Therm-A-Bind Covers']→['Staples'] 0.005 0.727 6.118
['Harbour Creations Round Labels, Alphabetical']→['Staples'] 0.005 0.364 3.059
['Hon Executive Leather Armchair, Black']→['Staples'] 0.005 0.308 2.589
['Hon Shipping Labels, Laser Printer Compatible']→['Staples'] 0.005 0.296 2.493
['Hon Steel Folding Chair, Adjustable']→['Staples'] 0.005 0.296 2.493
['Hon Swivel Stool, Black']→['Staples'] 0.006 0.25 2.103
['Ibico 3-Hole Punch, Recycled']→['Staples'] 0.006 0.27 2.274
['Ibico Binding Machine, Recycled']→['Staples'] 0.008 0.302 2.543
['Kleencut Box Cutter, High Speed']→['Staples'] 0.005 0.276 2.321
['Kraft Business Envelopes, Recycled']→['Staples'] 0.005 0.267 2.243
['Memorex Flash Drive, Programmable']→['Staples'] 0.005 0.4 3.365
['Novimex Bag Chairs, Black']→['Staples'] 0.005 0.276 2.321
['Novimex Chairmat, Set of Two']→['Staples'] 0.005 0.296 2.493
['Novimex Executive Leather Armchair, Red']→['Staples'] 0.006 0.321 2.704
['OIC Push Pins, Bulk Pack']→['Staples'] 0.005 0.444 3.739
['Office Star Rocking Chair, Black']→['Staples'] 0.005 0.267 2.243
['Office Star Rocking Chair, Set of Two']→['Staples'] 0.007 0.393 3.305
['Rogers Folders, Wire Frame']→['Staples'] 0.006 0.281 2.366
['Rogers Lockers, Industrial']→['Staples'] 0.005 0.258 2.171
['SAFCO Executive Leather Armchair, Adjustable']→['Staples'] 0.006 0.346 2.912
['SAFCO Executive Leather Armchair, Black']→['Staples'] 0.005 0.286 2.404
['SAFCO Rocking Chair, Red']→['Staples'] 0.005 0.4 3.365
['SAFCO Steel Folding Chair, Red']→['Staples'] 0.006 0.37 3.116
['Smead Lockers, Wire Frame']→['Staples'] 0.006 0.265 2.227
['Smead Trays, Single Width']→['Staples'] 0.009 0.359 3.02
['Stiletto Shears, Serrated']→['Staples'] 0.005 0.471 3.959
['Stockwell Paper Clips, Assorted Sizes']→['Staples'] 0.011 0.295 2.482
['Stockwell Thumb Tacks, Bulk Pack']→['Staples'] 0.007 0.289 2.435
['Tenex Lockers, Single Width']→['Staples'] 0.005 0.267 2.243
['Tenex Trays, Single Width']→['Staples'] 0.006 0.263 2.214
['Wilson Jones Hole Reinforcements, Economy']→['Staples'] 0.008 0.324 2.728
['Xerox 1881']→['Staples'] 0.005 0.727 6.118

Each matched rule is now printed with its support, confidence, and lift, in that order. We can also save these association rules to a csv file:

df = pd.DataFrame()
df['related_category'] = list1
df['support'] = list2
df['confidence'] = list3
df['lift'] = list4
df = df.sort_values('lift', ascending=False)
df.to_csv("./shopping_basket_result.csv")
print('Save successfully.')

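After running it, a csv file named shopping_basket_result is created in the current directory. It contains the same rules as the console output above, now sorted by lift in descending order.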
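As a quick sanity check on the saved rules, recall the standard definition lift = confidence / support(consequent). Every rule listed above has the same consequent (Staples), so dividing confidence by lift should give roughly the same number for every row. The sketch below checks this on three rows copied from the console output; the helper name implied_consequent_support is made up for this illustration and is not part of any library.

```python
# Rows copied from the console output above:
# (antecedent, support, confidence, lift), consequent is always 'Staples'
rules = [
    ("GBC Standard Therm-A-Bind Covers", 0.005, 0.727, 6.118),
    ("Enermax Flash Drive, Erganomic", 0.007, 0.55, 4.627),
    ("Apple Signal Booster, Cordless", 0.005, 0.308, 2.589),
]


def implied_consequent_support(confidence, lift):
    """Hypothetical helper: lift = confidence / support(consequent),
    so support(consequent) = confidence / lift."""
    return confidence / lift


for name, support, confidence, lift in rules:
    # The printed values are rounded to 3 decimals, so expect small differences
    print(name, round(implied_consequent_support(confidence, lift), 3))
```

All three rows come out at about 0.119, which suggests that Staples appears in roughly 12% of the transactions, and that a lift of 6.118 means buyers of that item pick up Staples about six times as often as the average customer.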

③ Plotting heatmaps with seaborn

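seaborn is a widely used open-source visualization library. Because it is built on matplotlib and works well with it, we can use the two libraries together to visualize support, confidence, and lift.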

First make sure the seaborn library is installed in the project's Python environment. If it is not, install it with pip or conda:

$ > pip install seaborn

------or------

$ > conda install seaborn

Once it is installed, import both libraries at the top of the program:

import matplotlib.pyplot as plt
import seaborn as sns

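With the imports in place, we can plot the three metrics: a bar chart for lift and a heatmap for support and confidence. The complete code is as follows:
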

# Set a font that can render the Chinese plot labels used below
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Load the results from the saved file
df = pd.read_csv("./shopping_basket_result.csv", index_col=0)

# The 10 rules with the highest lift
top_10_lift = df.sort_values('lift', ascending=False).head(10)

# Draw a bar chart
plt.figure(figsize=(12, 8))
sns.barplot(data=top_10_lift, y='related_category', x='lift', palette='viridis')
plt.title('Lift值最高的前10个关联规则')
plt.xlabel('Lift值')
plt.ylabel('关联规则')
plt.tight_layout()
plt.savefig('Lift值最高的前10个关联规则.png')
plt.show()

# Heatmap of support and confidence
plt.figure(figsize=(12, 8))
pivot_table = df.pivot(index='related_category', columns='support', values='confidence')
sns.heatmap(pivot_table, cmap="YlGnBu", cbar_kws={'label': '置信度'})
plt.title('支持度和置信度的热图')
plt.xlabel('支持度')
plt.ylabel('关联规则')
plt.tight_layout()
plt.savefig('支持度和置信度热图.png')
plt.show()

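Running the code displays two figures, a bar chart and a heatmap, and also saves both images to the current directory, as shown in Figures 3 and 4: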

Figure 3: The 10 association rules with the highest lift
Figure 4: Heatmap of support and confidence
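To read Figure 4, it helps to see what df.pivot builds before it is handed to sns.heatmap. The sketch below reproduces the same pivot call on three rows copied from the console output. It assumes that each related_category entry is stored as the rule string shown in that output, which is a guess about how list1 was built.

```python
import pandas as pd

# Three rules copied from the console output (rule string format is assumed)
rules = pd.DataFrame({
    "related_category": [
        "['GBC Standard Therm-A-Bind Covers']→['Staples']",
        "['Enermax Flash Drive, Erganomic']→['Staples']",
        "['Xerox 1881']→['Staples']",
    ],
    "support": [0.005, 0.007, 0.005],
    "confidence": [0.727, 0.55, 0.727],
})

# Same call as in the plotting code: one row per rule,
# one column per distinct support value, cells hold confidence
table = rules.pivot(index="related_category", columns="support", values="confidence")
print(table)

# Each rule has exactly one support value, so each row has one filled cell
filled = int(table.notna().sum().sum())
print(table.shape, filled)
```

Because every rule has a single support value, each row of the pivoted table holds exactly one confidence value and the remaining cells are NaN, which the heatmap leaves blank. The chart is therefore best read as "which support column does this rule fall in, and how dark is its confidence", not as a dense matrix.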

④ Practical significance

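At this point we have visualized the association rules. From a merchant's point of view we can see, for example, that customers who buy GBC Standard Therm-A-Bind Covers are among the most likely to also buy Staples (lift 6.118, tied with Xerox 1881). The merchant can therefore recommend Staples to customers who have bought only GBC Standard Therm-A-Bind Covers, and a simple personalized recommendation is complete.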
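The recommendation step described above can be sketched as a simple lookup over the mined rules. This is only a minimal illustration, not how a production recommender works: the function name recommend, the min_lift threshold, and the hard-coded RULES list (three rows copied from the console output) are all assumptions made for this example.

```python
# (antecedent, consequent, lift) triples copied from the console output above
RULES = [
    ("GBC Standard Therm-A-Bind Covers", "Staples", 6.118),
    ("Xerox 1881", "Staples", 6.118),
    ("Enermax Flash Drive, Erganomic", "Staples", 4.627),
]


def recommend(basket, rules=RULES, min_lift=1.0):
    """Hypothetical helper: suggest consequents of rules whose antecedent is in
    the basket, skipping items already bought, strongest lift first."""
    best = {}
    for antecedent, consequent, lift in rules:
        if antecedent in basket and consequent not in basket and lift >= min_lift:
            best[consequent] = max(lift, best.get(consequent, 0.0))
    return sorted(best, key=best.get, reverse=True)


# A customer who bought only the binding covers gets Staples suggested
print(recommend({"GBC Standard Therm-A-Bind Covers"}))

# A customer who already has Staples in the basket gets nothing new
print(recommend({"GBC Standard Therm-A-Bind Covers", "Staples"}))
```

Filtering on lift above 1 keeps only rules where the two items are bought together more often than chance would predict; a real system would load the rules from shopping_basket_result.csv instead of hard-coding them.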


The end.
