Implementing Apriori Algorithm in R

421 篇文章 15 订阅

There are a bunch of blogs out there posted that show how to implement apriori algorithm in R. However, when I was working on the same, I hit a roadblock since the data was neither in single format, nor in basket(Step 2 explains what a basket format is). I spent quite some time converting the data into the required format to be able to find the association rules.
So, here goes…

Step 1: Read the data

Read the groceries csv file. Here is a link to the csv file.

df_groceries <- read.csv("groceries.csv")

The data consists of three columns:
Member_number: An ID that can help distinguish different purchases by different customers.
Date: The date of transaction
ItemDescription: The description of the actual item that was bought.

Step 2: Data cleaning and manipulations using R

The data required for Apriori must be in the following basket format:

image10
The basket format must have first column as a unique identifier of each transaction, something like a unique receipt number. The second columns consists of the items bought in that transaction, separated by spaces or commas or some other separator.

However, the data we have is something like this:

MEMBER NUMBER DATE ITEM DESCRIPTION
168812202019912/26/2014Citrus fruit
168812202019910/05/2011Whole milk
168812202019910/05/2011chocolates
161809036829903/29/2011dishes

Since the structure of the data is not in the format necessary to find association rules, we have to perform some data manipulations before finding the relationships.

Lets first make sure that the Member numbers are of numeric data type and then sort the dataframe based on the Member_number.

df_sorted <- df_groceries[order(df_groceries$Member_number),]
df_sorted$Member_number <- as.numeric(df_sorted$Member_number)

Learn more about vectors, matrices and data frames in R, or check those videos.

Now, we have to convert the dataframe into transactions format such that we have all the items bought at the same time in one row. For this, we use a function called ddply, offered by package plyr.

install.packages(“plyr”, dependencies= TRUE)

Make sure that you do not have package ‘dplyr’ attached to the session. You might end up getting something like this:
‘You have loaded plyr after dplyr – this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr)library(dplyr)

Hence, detach dplyr package first and then load the package

if(sessionInfo()['basePkgs']=="dplyr" | sessionInfo()['otherPkgs']=="dplyr"){
  detach(package:dplyr, unload=TRUE)
}

library(plyr)

The next step is to actually convert the dataframe into basket format, based on the Member_number and Date of transaction

df_itemList <- ddply(df_groceries,c("Member_number","Date"), 
                       function(df1)paste(df1$itemDescription, 
                       collapse = ","))

The above function ddply() checks the date and member number and pivots the item descriptions with same date and same member number in one line, separated by commas.

Something like this:

MEMBER NUMBER DATE ITEM DESCRIPTION
168812202019912/26/2014Citrus fruit
168812202019910/05/2011Whole milk
168812202019910/05/2011chocolates
161809036829903/29/2011dishes

becomes:

MEMBER NUMBER DATE ITEM DESCRIPTION
168812202019912/26/2014Citrus fruit
168812202019910/05/2011Whole milk,chocolates
161809036829903/29/2011dishes

Notice how member number 1688122020199 bought Whole milk and dishes on the same date; which means they were bought together. Thus we group them together in one row, separated by commas.
Thus, we now have the data in the necessary basket format. We can now implement Apriori on this data. The ddply function works pretty well even with larger datasets, I have tried it with a million rows and it takes only a few minutes to pivot the table.

Once we have the transactions, we no longer need the date and member numbers in our analysis. Go ahead and delete those columns.

df_itemList$Member_number <- NULL
df_itemList$Date <- NULL

#Rename column headers for ease of use
colnames(df_itemList) <- c("itemList")

Write the resulting table to a csv file. The reason we do this is, when we write a dataframe to a .csv file, it attaches a row number by default. (unless, of course you were to explicitly tell it not to, by using the argument “row.names=FALSE” in the write.csv function).
We can simply use these row numbers as transaction IDs, as they would be unique to each transaction. Convenient?

Write dataframe to a csv file using write.csv()

write.csv(df_itemList,"ItemList.csv", row.names = TRUE)

Step 3: Find the association rules

Read the csv file u just saved and you will automatically get the transaction IDs in the dataframe
Run algorithm on ItemList.csv to find relationships among the items. Apriori find these relations based on the frequency of items bought together.

For implementation in R, there is a package called ‘arules’ available that provides functions to read the transactions and find association rules.

So, install and load the package:

install.packages(“arules”, dependencies=”TRUE”)
library(arules)

Using the read.transactions() functions, we can read the file ItemList.csv and convert it to a transaction format

txn = read.transactions(file="ItemList.csv", rm.duplicates= TRUE, format="basket",sep=",",cols=1);

Parameters: Transaction file: ItemList.csv
rm.duplicates : to make sure that we have no duplicate transaction entried
format : basket (row 1: transaction ids, row 2: list of items)
sep: separator between items, in this case commas
cols : column number of transaction IDs

Quotes are introduced in transactions, which are unnecessary and result in some incorrect results. So, we must get rid of them:

txn@itemInfo$labels <- gsub("\"","",,txn@itemInfo$labels)

Finally, run the apriori algorithm on the transactions by specifying minimum values for support and confidence.

basket_rules <- apriori(txn,parameter = list(sup = 0.01, conf = 0.5,target="rules"));

Print the association rules. To print the association rules, we use a function called inspect(). However, if you have package ‘tm’ attached in the session, it creates a conflict with the arules package. Thus, we need to check and detach the package.

if(sessionInfo()['basePkgs']=="tm" | sessionInfo()['otherPkgs']=="tm"){
    detach(package:tm, unload=TRUE)
  }

inspect(basket_rules)

#Alternative to inspect() is to convert rules to a dataframe and then use View()
df_basket <- as(basket_rules,"data.frame")
View(df_basket)

Plot a few graphs that can help you visualize the rules. Install and load the ‘arulesViz’ library for association rules specific visualizations:

library(arulesViz)
plot(basket_rules)
plot(basket_rules, method = "grouped", control = list(k = 5))
plot(basket_rules, method="graph", control=list(type="items"))
plot(basket_rules, method="paracoord",  control=list(alpha=.5, reorder=TRUE))
plot(basket_rules,measure=c("support","lift"),shading="confidence",interactive=T)

Graph to display top 5 items

itemFrequencyPlot(txn, topN = 5)

Thats’s all Folks! I hope it was simple to understand and implement. I also have my code on githubif you dont want to type everything.

A special thanks to this blogpost, where I first learned the basics of implementing apriori in R. Also, this is my first attempt at writing a blog. Please feel free to reach out if you have any suggestions and comments.!

Thank you.

    Related Post

    1. Handling missing data with MICE package; a simple approach
    2. Best packages for data manipulation in R
    3. Identify, describe, plot, and remove the outliers from the dataset
    4. Learn R By Intensive Practice – Part 2
    5. Working with databases in R
    评论
    添加红包

    请填写红包祝福语或标题

    红包个数最小为10个

    红包金额最低5元

    当前余额3.43前往充值 >
    需支付:10.00
    成就一亿技术人!
    领取后你会自动成为博主和红包主的粉丝 规则
    hope_wisdom
    发出的红包
    实付
    使用余额支付
    点击重新获取
    扫码支付
    钱包余额 0

    抵扣说明:

    1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
    2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

    余额充值