Association rules in R with RStudio
Background
Retailers typically hold a wealth of customer transaction data: the type of items a customer purchased, their value, and the date of purchase. Unless the retailer runs a loyalty rewards system, it may have no demographic information on its customers, such as height, age, gender and address. Any suggestion about what a customer might want to buy in the future, i.e. which products to recommend, therefore has to be based on that customer’s purchase history and on the purchase histories of other customers.
In collaborative filtering, recommendations are made by finding similarities between customers’ purchase histories. If Customers A and B both purchase Product A, and Customer B also purchases Product B, then Customer A is likely to be interested in Product B as well. This is a very simple example; various algorithms can be used to measure how similar customers are in order to make recommendations.
One such algorithm is k-nearest neighbours, where the objective is to find the k customers that are most similar to the target customer. It involves choosing a value for k and a similarity metric (Euclidean distance being the most common). The algorithm rests on the idea that points closest to each other in space are also likely to be most similar to each other.
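As a rough illustration of the k-nearest-neighbours idea, the sketch below finds the k customers closest to a target customer using Euclidean distance. The purchase matrix, the customer names, and k = 2 are all invented for illustration; none of them come from the dataset used in this article.

```r
# Toy purchase matrix: rows are customers, columns are products (1 = purchased).
# All values are invented for illustration.
purchases <- rbind(
  cust_A = c(candle = 1, matches = 1, tent = 0, peg = 0),
  cust_B = c(candle = 1, matches = 1, tent = 0, peg = 1),
  cust_C = c(candle = 0, matches = 0, tent = 1, peg = 1),
  cust_D = c(candle = 1, matches = 0, tent = 0, peg = 0)
)

# Pairwise Euclidean distances between customers
d <- as.matrix(dist(purchases, method = "euclidean"))

# k nearest neighbours of the target customer (excluding the customer itself)
k <- 2
target <- "cust_A"
others <- d[target, setdiff(rownames(d), target)]
neighbours <- names(sort(others))[seq_len(k)]
print(neighbours)
```

Products bought by the neighbours but not by the target customer would then be the candidates to recommend.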
Another technique is basket analysis, also known as association rules. Here the aim is to find out which items are bought together (put in the same basket) and how frequently. The output of the algorithm is a series of if-then rules, e.g. if a customer buys a candle, then they are also likely to buy matches. Association rules can assist retailers with the following:
- Modifying store layout so that associated items are stocked together;
- Sending emails to customers with product recommendations based on their previous purchases (e.g. we noticed you bought a candle, perhaps these matches may interest you?); and
- Gaining insights into customer behaviour.
Let’s now apply association rules to a dummy dataset.
The dataset
A dataset of 2,178,282 observations/rows and 16 variables/features was provided.
The first thing I did with this dataset was to check quickly for missing values (NAs). No missing values were found.
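A missing-value check of this kind can be sketched in base R as follows. The toy data frame stands in for the retail data; its values are invented, and only the column names BasketID, SalesGross, and StoreState are borrowed from the article.

```r
# Toy data frame standing in for the retail data (values are invented)
retail_toy <- data.frame(
  BasketID   = c(1, 1, 2, 3),
  SalesGross = c(12.5, 40.0, 7.9, 99.0),
  StoreState = c("VIC", "VIC", "QLD", "NT"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(retail_toy))
print(na_counts)

# TRUE if any value anywhere in the data frame is missing
any_missing <- any(is.na(retail_toy))
print(any_missing)
```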
The variables were all read in as either numeric or string variables. To interpret categorical variables meaningfully, they need to be converted to factors. As such, the following changes were made.
library(dplyr)  # provides %>% and mutate()

retail <- retail %>%
  mutate(
    # Categorical variables: convert to factors
    MerchCategoryName = as.factor(MerchCategoryName),
    CategoryName      = as.factor(CategoryName),
    SubCategoryName   = as.factor(SubCategoryName),
    StoreState        = as.factor(StoreState),
    OrderType         = as.factor(OrderType),
    ProductName       = as.factor(ProductName),
    # Identifier and code columns: convert to numeric
    # (non-numeric strings become NA, with a warning)
    BasketID          = as.numeric(BasketID),
    MerchCategoryCode = as.numeric(MerchCategoryCode),
    CategoryCode      = as.numeric(CategoryCode),
    SubCategoryCode   = as.numeric(SubCategoryCode)
  )
Then, all the numeric variables were summarised (minimum, median, maximum, standard deviation, and mean) to identify any outliers in the data. This summary showed that the features MerchCategoryCode, CategoryCode, and SubCategoryCode contained a large number of NAs. On further inspection, the majority of these code values consisted of digits; the ones that had been converted to NAs contained characters, such as “Freight” or the letter “C”. As these codes are not related to customer purchases, these observations were removed.
Negative gross sales and negative quantities indicate either erroneous values or customer returns. This may be interesting information, but it is not related to the objective of this analysis, so these observations were omitted.
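The two clean-up steps described above (dropping rows whose codes are not numeric, and dropping negative sales or quantities) might look like the following on a toy data frame. The values are invented, and the column name Quantity is an assumption, since the article does not name the quantity column.

```r
# Toy data: codes read in as strings; "Freight" and "C" are not numeric
retail_toy <- data.frame(
  MerchCategoryCode = c("101", "Freight", "205", "C", "310"),
  SalesGross        = c(25.0, 10.0, -15.0, 5.0, 60.0),
  Quantity          = c(1, 1, -1, 1, 2),  # hypothetical column name
  stringsAsFactors  = FALSE
)

# Non-numeric codes become NA (suppressWarnings hides the coercion warning)
retail_toy$MerchCategoryCode <-
  suppressWarnings(as.numeric(retail_toy$MerchCategoryCode))

# Keep rows with a valid code and non-negative sales and quantity
keep <- !is.na(retail_toy$MerchCategoryCode) &
  retail_toy$SalesGross >= 0 &
  retail_toy$Quantity >= 0
retail_clean <- retail_toy[keep, ]
print(retail_clean)
```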
Data Exploration
It is always a good idea to explore the data to see if you can see any trends or patterns within the dataset. Later on, you can use an algorithm/machine learning model to validate these trends.
The graph below shows that the highest number of transactions comes from Victoria, followed by Queensland. If a retailer wants to know where to increase sales, this plot may be useful, as the number of sales is proportionately low in all other states.
The plot below shows that most gross sales values fall between $0 and $40 (the median is $37.60).
We can also break this plot down by state, as below. However, the transactions from Victoria and Queensland swamp the information for the other states, so boxplots may be a better visualisation.
The boxplots below (though hard to read because the outliers stretch the scale) show that most sales across all states are close to the overall median. There is an abnormally high outlier for NT and a couple for VIC. Since we are only interested in understanding which products customers buy together in order to make recommendations, we do not need to deal with these outliers.
Now that we have had a look at sales by state, let’s try to get a better understanding of the products purchased by customers.
The plot below is coloured based on the frequency of purchases per item. Lighter shades of blue indicate higher frequencies.
Some key takeaways are:
- No sales for team sports in ACT, NSW, SA, and WA. This could be because these products are not stocked there, or perhaps they need to be marketed better.
- No sales for ski products in ACT, NSW, SA, and WA. I find this quite surprising, as NSW and ACT are quite close to major ski resorts such as Thredbo. It is odd that there are ski product sales in QLD, which has a warm climate throughout the year. Either these products have been mislabelled or they were not stocked in NSW and ACT.
- Paint and panel sales in WA only.
- Bike sales in VIC only.
- Camping and apparel recorded the highest sales in VIC, followed by Gas, Fuel and BBQing.
Given the distribution of sales by product and state, any association rules we come up with will be based mainly on sales from VIC and QLD. Furthermore, as not all products were stocked or sold in all states, the association rules are expected to be limited to a very small number of products. However, since I have already embarked on this mode of analysis, let’s continue and see what we get.
We have two years’ worth of data, 2016 and 2017, so I decided to compare the number of transactions and the mean gross sales for the two years.
Despite the higher number of transactions in 2016 (about 2.5 times as many as in 2017), mean gross sales were higher in 2017 than in 2016. This seems quite counter-intuitive, so I decided to dig deeper by looking at monthly sales.
Year    # of Transactions    Mean Gross Sales ($)
2016    1,481,922            69.0
2017    593,315              86.0
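A yearly comparison like this one could be produced along the following lines. The toy data frame and the Year column are assumptions made for illustration; only the two transaction counts at the end are taken from the table.

```r
# Toy transactions (invented values); Year is a hypothetical column
sales_toy <- data.frame(
  Year       = c(2016, 2016, 2016, 2017, 2017),
  SalesGross = c(50, 70, 90, 80, 100)
)

# Number of transactions and mean gross sales per year
n_by_year    <- table(sales_toy$Year)
mean_by_year <- tapply(sales_toy$SalesGross, sales_toy$Year, mean)
print(n_by_year)
print(mean_by_year)

# Ratio of the 2016 to 2017 transaction counts reported in the table
ratio <- 1481922 / 593315
print(round(ratio, 2))
```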
In 2016, the highest numbers of sales were recorded in January and March, with steep declines from September to November and then an increase in December. Transactions continued to decline through 2017, with an increase in December (the Christmas season).
Deduction: As the highest numbers of sales are for Camping, Apparel, and BBQ & Gas, it makes sense that sales of these products are high during the holiday season.
Recommendation to the retailer: It may be worth exploring whether stores hold sufficient stock of these products in December and January, as they are the most popular.
Deduction: Despite the steady decline in the number of transactions, mean gross sales continued to increase month on month, peaking in December 2017. This indicates that fewer customers made purchases, but those who did bought products of greater value.
Recommendation: What can the retailer do to keep purchases steady throughout the year, rather than rising towards a peak at the end of the year? The retailer pays overheads, employee salaries, and other costs to run its stores all year round.
Basket Analysis/Association Rules
Let’s go back to our objective.
Aim: To determine which products customers are likely to buy together, in order to make product recommendations.
I used the arules package and its read.transactions function to convert the dataset into a transaction object. A summary of this object gives the following output:
## transactions as itemMatrix in sparse format with
## 1019952 rows (elements/itemsets/transactions) and
## 21209 columns (items) and a density of 9.531951e-05
##
## most frequent items:
##       GAS BOTTLE REFILL 9KG*       GAS BOTTLE REFILL 4KG*
##                        30628                        11724
## 6 PACK BUTANE - WILD COUNTRY SNAP HOOK ALUMINIUM GRIPWELL
##                         9209                         7086
## PEG TENT GALV 225X6.3MM P04G                      (Other)
##                         6948                      1996372
##
## element (itemset/transaction) length distribution:
## sizes
##      1      2      3      4      5      6      7      8      9     10
## 546138 234643 109888  55319  30185  16656   9878   6018   3716   2332
##     11     12     13     14     15     16     17     18     19     20
##   1611    993    751    490    353    237    157    140     99     88
##     21     22     23     24     25     26     27     28     29     30
##     53     48     28     31     20     13     12     15      8      1
##     31     32     33     34     35     36     37     38     39     40
##      4      2      4      3      4      1      4      2      1      4
##     43     46
##      1      1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   1.000   2.022   2.000  46.000
##
## includes extended item information - examples:
## labels
## 1 10
## 2 11
## 3 11/12
Based on the output above, we can conclude the following.
- There are 1,019,952 baskets (collections of items) and 21,209 distinct items.
Density measures the proportion of non-zero cells in a sparse matrix: the total number of items purchased divided by the number of cells in the matrix (transactions × items). You can use the density to calculate how many items were purchased: 1,019,952 × 21,209 × 0.0000953 ≈ 2,061,545.
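The density arithmetic can be reproduced directly from the figures printed by summary(). The sketch below uses the unrounded density from the output (9.531951e-05), so its result differs slightly from the hand calculation with 0.0000953.

```r
n_transactions <- 1019952       # rows of the transaction matrix
n_items        <- 21209         # columns of the transaction matrix
density        <- 9.531951e-05  # as printed by summary()

# Total number of item purchases (non-zero cells in the matrix)
n_purchases <- n_transactions * n_items * density
print(round(n_purchases))

# Average basket size implied by the density
mean_basket_size <- n_purchases / n_transactions
print(round(mean_basket_size, 3))
```

With the unrounded density this comes to about 2,061,967 purchases, which equals the sum of the item counts listed under "most frequent items" and agrees with the reported mean basket size of 2.022.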
Element (itemset/transaction) length distribution: This tells you how many transactions there are for 1-itemsets, 2-itemsets, and so on. The first row gives the number of items in a basket and the second row gives the number of transactions of that size.
- The majority of baskets (87%) contain between 1 and 3 items.
- The minimum number of items in a basket is 1 and the maximum is 46 (only one basket).
- The most popular items are gas bottle refills (9 kg and 4 kg), butane 6-packs, Gripwell aluminium snap hooks, and galvanised tent pegs.
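The 87% figure follows from the first three entries of the length distribution. A quick check, with the counts copied from the summary output:

```r
# Basket counts for sizes 1 to 3, copied from the length distribution
size_counts    <- c(`1` = 546138, `2` = 234643, `3` = 109888)
n_transactions <- 1019952

# Share of baskets that contain between 1 and 3 items
share_1_to_3 <- sum(size_counts) / n_transactions
print(round(100 * share_1_to_3, 1))
```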
We can look at this information graphically via absolute frequency and relative frequency plots.
Both plots are in descending order of purchase frequency. The absolute frequency plot tells us that the highest numbers of sales are for gas-related products. The relative frequency plot shows the same ranking, with each item’s count expressed as a proportion of all transactions. Note that items sitting next to each other in the bar chart are similar in popularity, which does not by itself mean they are bought together; that is what the association rules measure. A recommendation one can make to the retailer is to stock these popular products together in the store, or to send customers an EDM recommending related products they have not yet purchased.
The next step is to generate rules for our transaction object. The output is as follows.
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 1019
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[21209 item(s), 1019952 transaction(s)] done [2.52s].
## sorting and recoding items ... [317 item(s)] done [0.04s].
## creating transaction tree ... done [0.84s].
## checking subsets of size 1 2 done [0.04s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.25s].
The above output shows us that 7 rules were generated.
Details of these rules are shown below.
## set of 7 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
##     support           confidence          lift            count
##  Min.   :0.001128   Min.   :0.5458   Min.   : 26.30   Min.   :1150
##  1st Qu.:0.001464   1st Qu.:0.6395   1st Qu.: 80.36   1st Qu.:1493
##  Median :0.001650   Median :0.6634   Median :154.58   Median :1683
##  Mean   :0.001652   Mean   :0.6759   Mean   :154.48   Mean   :1685
##  3rd Qu.:0.001668   3rd Qu.:0.7265   3rd Qu.:245.30   3rd Qu.:1701
##  Max.   :0.002524   Max.   :0.7898   Max.   :249.14   Max.   :2574
##
## mining info:
## data ntransactions support confidence
## tr 1019952 0.001 0.5
Each of these rules has support, confidence, and lift values.
Let’s start with support, which is the proportion of all transactions used to generate the rules (1,019,952) that contain the two items together. For example, 1,150 / 1,019,952 = 0.0011, or 0.11%, where the count (1,150) is the number of transactions that contain both items.
Confidence is the proportion of transactions in which the two items are bought together, out of all transactions in which the first item is purchased. As these are apriori rules, the probability of buying item B is conditional on the purchase of item A.
Mathematically, this looks like the following:
Confidence(A => B) = P(A ∩ B) / P(A) = frequency(A, B) / frequency(A)
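As a worked example, support and confidence can be recomputed for the first rule listed in the Rules inspection section (9 kg gas bottle => 9 kg refill). The count of 1,683 and the total of 1,019,952 come from the output; the count of baskets containing the gas bottle (2,131) is not printed anywhere and is inferred here from the reported confidence, so treat it as an assumption.

```r
n_transactions <- 1019952  # total number of transactions
count_both     <- 1683     # baskets containing both items (count of rule [1])
count_lhs      <- 2131     # baskets containing the 9 kg gas bottle;
                           # inferred from the reported confidence (assumption)

# Support: share of all baskets that contain both items
support <- count_both / n_transactions

# Confidence: share of baskets with item A that also contain item B
confidence <- count_both / count_lhs

print(round(support, 6))
print(round(confidence, 4))
```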
In the results above, confidence ranges from roughly 55% to 79%: given that a customer buys item A, the probability that the basket also contains item B lies in that range. Buying item A has a positive effect on buying item B, as all lift values are greater than 1.
Note: When I ran the algorithm, I experimented with higher support and confidence thresholds, because the more transactions there are in which two items are bought together, the higher the confidence in the rule. However, when I ran the algorithm with a confidence threshold of 80% or more, I obtained zero rules.
This was expected given the sparsity of the data: 1-item baskets are the most common, and the majority of purchased items relate to camping or gas products.
Thus, the algorithm was run with the following parameters.
# Requires library(arules); tr is the transaction object created above
association.rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.5, maxlen = 10))
Lift indicates how strongly two items are associated. A lift greater than 1 indicates that buying item A makes a purchase of item B more likely than it would otherwise be; a lift of 1 indicates that the two purchases are independent. Mathematically, lift is calculated as follows.
Lift(A => B) = Supp(A ∩ B) / (Supp(A) × Supp(B))
All our rules have lift values well above 1, indicating that buying item A makes a purchase of item B more likely.
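Because Supp(A ∩ B) / Supp(A) is the confidence, lift can also be written as Confidence(A => B) / Supp(B). The sketch below uses that form to reproduce the lift of the first rule (9 kg gas bottle => 9 kg refill), taking the refill count of 30,628 from the "most frequent items" output and the confidence from the rules listing.

```r
n_transactions <- 1019952    # total number of transactions
count_rhs      <- 30628      # baskets containing GAS BOTTLE REFILL 9KG*
confidence     <- 0.7897701  # reported confidence of rule [1]

# Lift = Supp(A and B) / (Supp(A) * Supp(B)) = Confidence(A => B) / Supp(B)
support_rhs <- count_rhs / n_transactions
lift <- confidence / support_rhs
print(round(lift, 2))
```

The refill appears in about 3% of all baskets but in 79% of baskets that contain the gas bottle, which is why the lift is so far above 1.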
Rules inspection
Let’s now inspect the rules.
##     lhs                                       rhs                            support     confidence      lift count
## [1] {GAS BOTTLE 9KG POL CODE 2 DC}         => {GAS BOTTLE REFILL 9KG*}       0.001650078 0.7897701   26.30036 1683
## [2] {WEBER BABY Q (Q1000) ROASTING TRIVET} => {WEBER BABY Q CONVECTION TRAY} 0.001127504 0.6526674  241.45428 1150
## [3] {GAS BOTTLE 2KG CODE 4 DC}             => {GAS BOTTLE REFILL 2KG*}       0.001344181 0.7308102  154.58137 1371
## [4] {GAS BOTTLE 4KG POL CODE 2 DC}         => {GAS BOTTLE REFILL 4KG*}       0.001583408 0.7222719   62.83544 1615
## [5] {YTH L J PP THERMAL OE}                => {YTH LS TOP PP THERMAL OE}     0.001667726 0.6634165  249.13587 1701
## [6] {YTH LS TOP PP THERMAL OE}             => {YTH L J PP THERMAL OE}        0.001667726 0.6262887  249.13587 1701
## [7] {UNI L J PP THERMAL OE}                => {UNI L S TOP PP THERMAL OE}    0.002523648 0.5458015   97.88840 2574
Interpretation of the first rule is as follows:
If a customer buys the 9 kg gas bottle, there is a 79% chance that the customer will also buy its refill. This pattern occurs in 1,683 transactions in the dataset.
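The workflow in this section, from baskets to inspected rules, can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the baskets are invented, the thresholds are chosen to suit six baskets, and the transactions are built with as(..., "transactions") from an in-memory list rather than with read.transactions as in the article.

```r
library(arules)  # provides the transactions class, apriori() and inspect()

# Toy baskets, invented for illustration
baskets <- list(
  c("candle", "matches"),
  c("candle", "matches"),
  c("candle", "matches", "lighter"),
  c("candle", "matches"),
  c("tent", "peg"),
  c("lighter")
)

# Convert the list of baskets into a transactions object
tr_toy <- as(baskets, "transactions")

# Mine rules; thresholds suit the toy data only, minlen = 2 skips empty-LHS rules
rules_toy <- apriori(
  tr_toy,
  parameter = list(supp = 0.5, conf = 0.5, minlen = 2),
  control   = list(verbose = FALSE)
)

# Show the rules ordered by lift, with support, confidence, lift and count
inspect(sort(rules_toy, by = "lift"))
```

In this toy data only candle and matches appear together often enough to pass the support threshold, so the rules relate those two items in both directions, much like rules [5] and [6] above.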
Now, let’s look at these rules visually.
All rules have a confidence value greater than 0.5 with lift ranging from 26 to 249.
The parallel coordinates plot for the seven rules shows how the purchase of one product influences the purchase of another. RHS is the item we propose the customer buy. For LHS, 2 is the most recent addition to the basket and 1 is the item that the customer previously purchased.
Looking at the first arrow, we can see that if a customer has the Weber Baby Q (Q1000) roasting trivet in their basket, then they are likely to purchase the Weber Baby Q convection tray.
The plots below would be more useful if we could visualise rules involving more than two items.
Wrapping up
You have now learnt how to make recommendations to customers based on which items are most frequently purchased together, using apriori rules. However, there are some important things to note about this analysis.
- The most popular/frequent items have confounded the analysis to some extent: we can make recommendations with confidence for only seven association rules. This is due to the uneven distribution of items by frequency across baskets.
- Customer segmentation may be another approach for this dataset, where customers are grouped by spend (SalesGross), product type (e.g. CategoryCode), StoreState, and time of sale (e.g. month/year). However, it would be useful to have more features on customers to do this effectively.
Code and dataset: https://github.com/shedoesdatascience/basketanalysis