RFM Analysis of E-commerce (Kaggle Program)

RFM analysis of user behaviour(Kaggle program)
Lu Zhang

Program Description

Background

The project data comes from Kaggle, which is the transactional data of UK retailers, including all transactions that occurred in non-physical online retail registered in the UK from December 1, 2010 to December 9, 2011. The company mainly sells unique all-occassion gifts. Many of the company’s customers are wholesalers.

link of kaggle data:
link

Aim of analysis

Analyze user behavior and perform user stratification using the RFM model.

Descriptions of Values

The dataset is composed of:
InvoiceNo --> Invoice number: 6-digit string
StockCode --> Stock code: 6-digit string
Description --> Description of the product
Quantity --> Number of trading products
InvoiceDate --> Order date: the date and time the order occurred UnitPrice --> Unit price: float
CustomerID --> Customer ID: 5-digit string
Country --> Customer location

Get the dataSQL

Prepare the Data (MySQL)

Data Cleaning

1.Processing of missing value
Check if there are any missing values in each column

SELECT
SUM(CASE WHEN InvoiceNo IS NULL THEN 1 ELSE 0 END) AS Sum_invoice,
SUM(CASE WHEN Stockcode IS NULL THEN 1 ELSE 0 END) AS Stockcode,
SUM(CASE WHEN Description IS NULL THEN 1 ELSE 0 END) AS Description,
SUM(CASE WHEN Quantity IS NULL THEN 1 ELSE 0 END) AS Quantity,
SUM(CASE WHEN InvoiceDate IS NULL THEN 1 ELSE 0 END) AS InvoiceDate,
SUM(CASE WHEN UnitPrice IS NULL THEN 1 ELSE 0 END) AS UnitPrice,
SUM(CASE WHEN CustomerID IS NULL THEN 1 ELSE 0 END) AS CustomerID,
SUM(CASE WHEN Country IS NULL THEN 1 ELSE 0 END) AS Country

在这里插入图片描述
We found that there are missing values in the column of ‘Description’ and ‘CustomerID’, and the product description lacks 1453 rows. Considering that we are studying customer behavior, so the missing values of product description do not greatly affect our analysis, we would not deal with it temporarily; and the customer number is missing 132796 rows, since we cannot know the reason for the missing, we intend to replace the missing value with 0, and store the replaced result in the table data1.

CREATE TABLE data1 LIKE data;
INSERT INTO data1 SELECT InvoiceNo, Stockcode,Description,Quantity, InvoiceDate, UnitPrice, COALESCE(CustomerID,0), Country FROM data;

在这里插入图片描述
There are 539109 rows in data1

2.Outlier handling
Whether the unit price and quantity are abnormal: Check whether the unit price and quantity have negative or zero values.

SELECT MIN(Quantity) as Min_Quantity ,
MAX(Quantity)as Max_Quantity,
MIN(UnitPrice)as Min_UnitPrice,
MAX(UnitPrice)as Max_UnitPrice FROM data1;

在这里插入图片描述
Details of abnormal data
The query result shows that the minimum quantity and minimum unit price have negative values Query the number of rows with outliers in the column of Quantity and UnitPrice.

SELECT COUNT(Quantity)
FROM data1
WHERE quantity<=0 OR UnitPrice<=0

在这里插入图片描述
There are 11779 rows that the outlier exists.

SELECT * FROM data1
WHERE quantity<=0;
SELECT * FROM data1
WHERE UnitPrice<=0;

在这里插入图片描述
在这里插入图片描述
Delete data
Since we don’t know the reason for the appearance of outliers, and these kinds of data are useless for the analysis, we decided to to delete the data.

DELETE FROM data1
WHERE quantity<=0;
DELETE FROM data1
WHERE UnitPrice<=0;

在这里插入图片描述
在这里插入图片描述
After the processing of the abnormal values, there are 527330 valid data remained

3、Process in consistant format
Currently, there are only unit prices and Quantities of the goods in the data set, in order to facilitate subsequent analysis, we add a column of total consumption amount, we name it as Amount.
A m o u n t s = Q u a n t i t y ∗ U n i t P r i c e Amounts=Quantity*UnitPrice Amounts=QuantityUnitPrice

ALTER TABLE data1 ADD Amounts float not null;
UPDATE data1 SET Amounts=Quantity*UnitPrice

在这里插入图片描述
Also, we standardize the format of InvoiceDate

ALTER TABLE data1 ADD InvoiceTime VARCHAR(255) NOT NULL;
UPDATE data1 SET InvoiceTime=STR_TO_DATE(InvoiceDate,’%m/%d/%Y %H:%i’)

在这里插入图片描述
4. Remove the duplicated values

CREATE TABLE data2 LIKE data1;
INSERT INTO data2 SELECT DISTINCT * FROM data1;
DROP TABLE data1;
ALTER TABLE data2 RENAME TO data1

We have 522107 rows of data right now.

Data Analysis

RFM Analysis

RFM analysis is an essential customer segmentation technique that used to measure customer value and customer profitability, it describe the customer’s value status through three indicators regarding to users’
past purchase behaviors: Recency, Frequency, Monetary.

R e c e n c y ( R ) Recency(R) Recency(R)
Days since last consumption time, representing a customer’s recent purchase behavior

F r e q u e n c y ( F ) Frequency(F) Frequency(F)
the frequency of each customers’s comsumption (number of purchases)

M o n e t a r y ( M ) Monetary(M) Monetary(M)
Total money the customer spent.

RFM helps divide customers into various categories to identify important customers who are more likely to respond to promotions and consume goods.

SELECT MIN(InvoiceTime),MAX(InvoiceTime),COUNT(),COUNT(distinct CustomerID) FROM data1
WHERE CustomerID!=0
*
在这里插入图片描述
There were a total of 392,194 valid orders and 4,335 customers (not including customers with CustomerID 0) between Dec 1,2010 and Dec 9,2011.

Classify the data
create those 3 customer attributes for each customer:Last consumption time (the interval between the last consumption and the reference time-Recency), the frequency of consumption (Frequency), and the amount of consumption (Monetary);
max invoicetime is the last purchase time of the customer, we take the last date we have in the table as reference. Thus,
R e c e n c y = 09 / 12 / 2011 − L a s t P u r c h a s e T I m e Recency=09/12/2011-LastPurchaseTIme Recency=09/12/2011LastPurchaseTIme

CREATE TABLE Customers
SELECT CustomerID,
DATEDIFF(‘2011-12-09’,MAX(InvoiceTime))as Recency,
COUNT(DISTINCT InvoiceNo)as Frequency,
ROUND(SUM(Amounts),2) as Monetary
FROM data1 GROUP BY CustomerID
ORDER BY Recency DESC,Monetary DESC,Frequency DESC;

在这里插入图片描述

Evaluation

*create table Customers2
select ,
(case when Recency<=30 then 5
when Recency >30 and Recency <=90 then 4
when Recency>90 and Recency<=180 then 3
when Recency>180 and Recency<=365 then 2
else 1 END)as ‘R_score’,
(case when Frequency<=10 then 1
when Frequency >10 and Frequency <=30 then 2
when Frequency>30 and Frequency<=50 then 3
when Frequency>50 and Frequency<=80 then 4
else 5 END)as ‘F_score’,
(case when Monetary<=1000 then 1
when Monetary>1000 and Monetary <=3000 then 2
when Monetary>3000 and Monetary<=5000 then 3
when Monetary>5000 and Monetary<=8000 then 4
else 5 END)as ‘M_score’
from Customers;
select * from Customers2

在这里插入图片描述
Client segmentaion threshold

After the RFM scoring evaluation, compute the average value of each column, then use the average value as the threshold for customer division to do the final customer stratification process.

SELECT ROUND(AVG(R_score),1)as R_avg,
ROUND(AVG(F_score),1)as F_avg,
ROUND(AVG(M_score),1)as M_avg from Customers2;

在这里插入图片描述
Compare the customer’s score with the average and store the result in the table Customers3.

Create table Customers3
SELECT * ,
(case when R_score>3.8 then 1 else 0 end) as ‘R_value’,
(case when F_score>1.1 then 1 else 0 end) as ‘F_value’,
(case when M_score>1.6 then 1 else 0 end) as ‘M_value’
from Customers2;
SELECT *
FROM Customers3
WHERE UnitPrice<=0;

在这里插入图片描述

Client Description

Describe each segmetaion in detail.
在这里插入图片描述
Customer stratification
According to the RFM value, scocre and threshold, the Client is processed hierarchically. (“↑” means greater than the mean value, “↓” means less than the mean value)

在这里插入图片描述
CREATE TABLE Customers4 SELECT * ,
(case when R_value=1 and F_value=1 and M_value=1 then ‘Champions’ when R_value=0 and F_value=1 and M_value=1 then ‘Can’t lose’
when R_value=1 and F_value=0 and M_value=1 then ‘Potential Customers’
when R_value=0 and F_value=0 and M_value=1 then ‘Promising Customers’
when R_value=1 and F_value=1 and M_value=0 then ‘Core Customers’
when R_value=0 and F_value=1 and M_value=0 then ‘At Risk’ when R_value=1 and F_value=0 and M_value=0 then ‘Recent Customers’ else ‘Lost Customers’ end) as Clients from Customers3;
select from Customers4

在这里插入图片描述
SELECT Clients, COUNT( * )
FROM Customers4
GROUP BY Clients

在这里插入图片描述
There are 6 categories of Customers based on the given data.

在这里插入图片描述
在这里插入图片描述

在这里插入图片描述

在这里插入图片描述
在这里插入图片描述
Plot the structure of customer stratification.

在这里插入图片描述
在这里插入图片描述
Interpretation
Among the visualizational analysis, we have a deeper understanding of our segmentations from the graphs above, and we can make plan to target or deal with each segment.

在这里插入图片描述
Reference
[1]: https://docs.exponea.com/docs/rfm-segmentation link
[2]:https://zhuanlan.zhihu.com/p/82914157 link

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值