This is an optional capstone project from the Google Data Analytics Course no. 8: Capstone Project.
The analysis follows the 6 steps of Data Analysis taught in the Google course: Ask, Prepare, Process, Analyse, Share and Act.
Step 1: Ask
In this step, we define the problem and objectives of our case study and its desired outcome.
1.0 Background:
This file contains behavior data for 5 months (Oct 2019 – Feb 2020) from a medium cosmetics online store. Here the data in Jan 2020 was selected. Data was collected by Open CDP (https://rees46.com/en/open-cdp) on large eCommerce store (1.2M visitors per month). All sensitive and personal data was removed. Broken data (like NULL price or products without categories) was removed also.
1.1 Business Task:
Analyse the cosmetics online store sales data in Jan 2020 to gain insights into consumer behaviours and discover trends for the cosmetics online store marketing strategy.
1.2 Business Objectives:
- What are the trends identified?
- How could these trends help improve sales revenue?
1.3 Deliverables:
- A clear summary of the business task
- A description of all data sources used
- A summary of analysis
- Supporting visualizations and key findings
- High-level content recommendations based on the analysis
STEP 2: PREPARE
In the Prepare phase, we identify the data being used and its limitations.
2.1 Information on Data Source:
- Data is publicly available on Kaggle: eCommerce Events History in Cosmetics Shop.
- Data collected by the Open CDP project.
- Each row in the file represents an event. All events are related to products and users. Each event is like a many-to-many relation between products and users.
2.2 Is Data ROCCC?
A good data source is ROCCC which stands for Reliable, Original, Comprehensive, Current, and Cited.
- Reliable — HIGH — It has 4,264,752 observations in total
- Original — LOW — Third party provider
- Comprehensive — MED — Some values in columns like brands and category_id are missing
- Current — MED — Data is 2 years old
- Cited — LOW — Data collected from third party, hence unknown
Overall, the dataset is considered good quality data and it is recommended to produce analytical reports based on this data.
2.3 Data Selection
The following file is selected and copied for analysis.
2020-Jan.csv
2.4 Tool
We are using MySQL for data analysis.
STEP 3: PROCESS
Here, we will process the data by cleaning and ensuring that it is correct, relevant, complete and free of error and outlier by performing:
- Explore and observe data
- Check for and treat missing or null values
- Transform data — format data type
- Perform preliminary statistical analysis
3.1 Data cleaning and manipulation
- Observe and familiarize with data
- Check for null or missing values
- Perform sanity check of data
3.1.1 How to read the data
There are different types of events. See below.
Semantics (or how to read it):
User userid during session usersession added to shopping cart (property eventtype is equal cart) product productid of brand brand of category categorycode (categorycode) with price price at event_time
File structure
Property | Description |
---|---|
event_time | Time when event happened at (in UTC). |
event_type | Only one kind of event: purchase. |
product_id | ID of a product |
category_id | Product's category ID |
category_code | Product's category taxonomy (code name) if it was possible to make it. Usually present for meaningful categories and skipped for different kinds of accessories. |
brand | Downcased string of brand name. Can be missed. |
price | Float price of a product. Present. |
user_id | Permanent user ID. |
** user_session** | Temporary user's session ID. Same for each user's session. Is changed every time user come back to online store from a long pause. |
Event types
Events can be:
view
- a user viewed a productcart
- a user added a product to shopping cartremove_from_cart
- a user removed a product from shopping cartpurchase
- a user purchased a product
Multiple purchases per session
A session can have multiple purchase events. It's ok, because it's a single order.
3.1.2. Find out the outliers
SELECT *
FROM Cosmetics_Shop
WHERE price < 0
39 rows are returned. It is impossible to have products whose price is less than 0 so those 39 rows will be removed. There are no more outliers then. Thus, we can start to analyze this this dataset though there are some missing value in category_code and brand columns.
STEP 4: ANALYSE
4.1 Key Performance Indicators (KPIs) for Ecommerce
4.1.1 Users behaviour KPI
-- numbers of each event type
SELECT SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS view_num ,
SUM(CASE WHEN event_type = 'cart' THEN 1 ELSE 0 END) AS cart_num ,
SUM(CASE WHEN event_type = 'remove_from_cart' THEN 1 ELSE 0 END) AS remove_from_cart_num ,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_num ,
COUNT(DISTINCT user_id) AS uv ,
ROUND(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) / COUNT(DISTINCT user_id),0) AS 'uv/pv'
FROM Cosmetics_Shop
view_num | cart_num | remove_from_cart_num | purchase_num | uv | uv/pv |
2037608 | 1148323 | 815024 | 263797 | 410073 | 5 |
Page View(PV): 2,037,608
Unique Visitors(UV): 410,073
Page View Per Visitor(UV/PV): 5
4.1.2 Users repeat purchasing rate
-- the purchasing frequency of each customer
CREATE VIEW purchase_times AS
SELECT user_id ,
event_time ,
DENSE_RANK()OVER(PARTITION BY user_id ORDER BY DAY(event_time)) AS purchase_times
FROM Cosmetics_Shop
WHERE event_type = 'purchase'
ORDER BY user_id , event_time ;
-- the rate of repeat purchasing
SELECT COUNT(user_id) AS purchase ,
(SELECT COUNT(user_id)
FROM purchase_times
WHERE purchase_times > 1) AS repeat_purchase ,
(SELECT COUNT(user_id)
FROM purchase_times
WHERE purchase_times > 1) / COUNT(user_id) AS repeat_purchase_rate
FROM purchase_times
purchase | repeat_purchase | repeat_purchase_rate |
263797 | 34057 | 0.1291 |
The repear purchasing rate is 12.91% , which is not high enough. Thus, we need to take actions to keep customers.
4.2 Analyzing conversion funnel
Funnel: view--cart--purchase
-- numbers of each event type
SELECT event_type , COUNT(event_type) AS event_num
FROM Cosmetics_Shop
GROUP BY event_type ;
--- customers only view
SELECT COUNT(user_id)
FROM Cosmetics_Shop
WHERE user_id NOT IN
(SELECT user_id
FROM Cosmetics_Shop
WHERE event_type = 'cart' OR event_type = 'remove_from_cart' OR event_type = 'purchase')
The number of the behaviour only viewing without adding to cart or purchasing is 707,542 , which means the bouncing rate of view-to-cart part is 707,542 / 2,037,608 = 34.7%.
--- customers only add products into cart
SELECT COUNT(user_id)
FROM Cosmetics_Shop
WHERE user_id NOT IN
(SELECT user_id
FROM Cosmetics_Shop
WHERE event_type = 'remove_from_cart' OR event_type = 'purchase') AND
event_type = 'cart'
The number of the behaviour only adding products into cart without purchasing is 37,960 , which means the bouncing rate of cart-to-purchase of adding products into cart without purchasing part is 156,163 / 1,148,323 = 13.6%.
--- customers remove products from cart after adding in
SELECT COUNT(user_id)
FROM Cosmetics_Shop
WHERE user_id NOT IN
(SELECT user_id
FROM Cosmetics_Shop
WHERE event_type = 'purchase') AND
event_type = 'remove_from_cart' AND
user_id IN
(SELECT user_id
FROM Cosmetics_Shop
WHERE event_type = 'cart')
The number of the behaviour removing products from cart after adding in is 357,594 , which means the bouncing rate of cart-to-purchase of removing products from cart after adding in part is 357,594 / 1,148,323 = 31.1%.
In conclusion , we can see many customers bounced at the stage of view-to-cart and they tended to remove products from cart after adding in. Though the purchase rate was decent(13%) , we need to try to lower the boucing rates of each part as much as we can.
4.3 Analyzing the reason behind high bouncing rate
4.3.1 The reason why customers only view without adding to cart
I pick Top 100 sales products and Top 100 views products to see how many of them are matched.
CREATE VIEW Top_100_sales AS
SELECT product_id , COUNT(product_id)
FROM Cosmetics_Shop
WHERE event_type = 'purchase'
GROUP BY product_id
ORDER BY COUNT(product_id) DESC
LIMIT 100 ;
CREATE VIEW Top_100_view AS
SELECT product_id , COUNT(product_id)
FROM Cosmetics_Shop
WHERE event_type = 'view'
GROUP BY product_id
ORDER BY COUNT(product_id) DESC
LIMIT 100 ;
SELECT *
FROM Top_100_view
JOIN Top_100_sales
USING (product_id)
Only 37 of 100 are matched. It can be concluded those products viewed most are not those that are purchased under most circumstances. Thus , it is important to improve our algorithm and recommendation system so that customers can see products they are interested in more often.
4.4 Analyzing average hourly and daily view activity
SELECT hr ,
AVG(view) AS avg_view,
AVG(CASE WHEN wkday = 0 THEN view END) AS mon ,
AVG(CASE WHEN wkday = 1 THEN view END) AS tue ,
AVG(CASE WHEN wkday = 2 THEN view END) AS wed ,
AVG(CASE WHEN wkday = 3 THEN view END) AS thu ,
AVG(CASE WHEN wkday = 4 THEN view END) AS fri ,
AVG(CASE WHEN wkday = 5 THEN view END) AS sat ,
AVG(CASE WHEN wkday = 6 THEN view END) AS sun
FROM (
SELECT HOUR(event_time) AS hr,
WEEKDAY(event_time) AS wkday,
COUNT(event_type) AS view
FROM Cosmetics_Shop
WHERE event_type = 'view'
GROUP BY 1 , 2 ) AS hourly_view
GROUP BY hr
hr | avg_view | mon | tue | wed | thu | fri | sat | sun |
0 | 2991.7143 | 2791 | 3058 | 3153 | 3449 | 3197 | 2683 | 2611 |
1 | 2421 | 2023 | 2255 | 2810 | 3121 | 2649 | 2000 | 2089 |
2 | 2546.7143 | 2336 | 2494 | 2508 | 3186 | 3099 | 2039 | 2165 |
3 | 3543.5714 | 3404 | 3500 | 3624 | 4697 | 4506 | 2538 | 2536 |
4 | 5000.5714 | 4972 | 5070 | 5340 | 6115 | 6536 | 3542 | 3429 |
5 | 7397 | 7197 | 8217 | 8142 | 8759 | 9109 | 5113 | 5242 |
6 | 10128.1429 | 9844 | 11238 | 11112 | 12172 | 12539 | 6950 | 7042 |
7 | 12603 | 13159 | 13378 | 13617 | 14237 | 14839 | 9388 | 9603 |
8 | 14031.1429 | 13325 | 14881 | 15690 | 16322 | 16451 | 10544 | 11005 |
9 | 15467.2857 | 14950 | 17299 | 16810 | 18075 | 16753 | 11818 | 12566 |
10 | 16124.4286 | 15129 | 16534 | 18036 | 18580 | 18109 | 12672 | 13811 |
11 | 16764.7143 | 15851 | 17562 | 18012 | 19819 | 18822 | 12762 | 14525 |
12 | 17288.7143 | 18667 | 17660 | 19612 | 19431 | 19070 | 11941 | 14640 |
13 | 16970.2857 | 19481 | 16311 | 17521 | 19687 | 18836 | 12145 | 14811 |
14 | 15801.8571 | 16412 | 15554 | 16585 | 17856 | 17054 | 12719 | 14433 |
15 | 15346.4286 | 14902 | 14973 | 16372 | 16336 | 16524 | 13149 | 15169 |
16 | 15940 | 15985 | 14962 | 17792 | 16705 | 16747 | 13797 | 15592 |
17 | 17522.7143 | 17531 | 16967 | 18522 | 18736 | 18479 | 15705 | 16719 |
18 | 19101.8571 | 18877 | 18321 | 20372 | 21281 | 19525 | 16552 | 18785 |
19 | 20014.1429 | 20689 | 18582 | 21281 | 22189 | 20193 | 17395 | 19770 |
20 | 18586.8571 | 18476 | 17363 | 20530 | 20258 | 19185 | 16094 | 18202 |
21 | 12676.7143 | 11757 | 11522 | 13760 | 13699 | 13265 | 12287 | 12447 |
22 | 7997.2857 | 7388 | 7199 | 9428 | 7676 | 8220 | 8057 | 8013 |
23 | 4820.7143 | 4612 | 4395 | 5338 | 5029 | 4592 | 4867 | 4912 |
As we can see, more customers view our products on weekdays than they did on weekends. The view behaviour mainly happened during 11AM to 1PM and 6PM to 8PM and the reason is that it’s the time period people take a rest during the working hours and finishing their work. To improve the conversion rate , it’s better to push notifications and send them emails or messages during this time period.
4.5 Analyzing Top 10 spending customers
SELECT user_id , ROUND(SUM(price),0) AS total_spend
FROM Cosmetics_Shop
WHERE event_type = 'purchase'
GROUP BY user_id
ORDER BY ROUND(SUM(price),0) DESC
LIMIT 10
-- Then we analyse what did the 3rd customer No.469299888 purchasing record
SELECT product_id , COUNT(product_id)
FROM Cosmetics_Shop
WHERE event_type = 'purchase' AND user_id = 469299888
GROUP BY product_id
ORDER BY COUNT(product_id) DESC
As we can see , the customer No.469299888 bought 55 different kinds of products. We need to spend more time on finding out what products these users are really interested in so that they will ‘fall in love’ with our store.
4.6 Analyzing customers values with RFM model
4.6.1 Calculate recency and frequency of each customer
-- calculate recency and frequency of each customer
CREATE VIEW recency_and_frequency AS
SELECT DISTINCT user_id,
DATEDIFF('2020-01-31',MAX(event_time)) AS recency,
COUNT(*) AS frequency,
DATE(MAX(event_time)) AS latest_purchase_time
FROM Cosmetics_Shop
WHERE event_type = 'purchase'
GROUP BY user_id
4.6.2 Define the value of each customer
Score | Recency | Frequency |
1 | 0-7 | >6 |
2 | 8-15 | 4-6 |
3 | 16-23 | 2-3 |
4 | 24-30 | 0-1 |
-- define the value of each customer
CREATE VIEW rf_score AS
SELECT user_id,
(CASE WHEN recency BETWEEN 24 AND 30 THEN 4
WHEN recency BETWEEN 16 AND 23 THEN 3
WHEN recency BETWEEN 8 AND 15 THEN 2
WHEN recency BETWEEN 0 AND 7 THEN 1 END) AS r_score ,
(CASE WHEN frequency BETWEEN 0 AND 1 THEN 4
WHEN frequency BETWEEN 2 AND 3 THEN 3
WHEN frequency BETWEEN 4 AND 6 THEN 2
WHEN frequency > 6 THEN 1 END) AS f_score
FROM recency_and_frequency ;
-- calculate the average of recency and frequency scores
SELECT AVG(r_score) , AVG(f_score)
FROM rf_score ;
Average R_score : 2.2188
Average F_score : 1.9153
-- separate customers into different categories
CREATE VIEW ranking AS
SELECT user_id ,
(CASE WHEN r_score >= 2.2188 AND f_score >= 1.9153 THEN 'A'
WHEN r_score >= 2.2188 AND f_score < 1.9153 THEN 'B'
WHEN r_score < 2.2188 AND f_score >= 1.9153 THEN 'C'
WHEN r_score < 2.2188 AND f_score < 1.9153 THEN 'D' END) AS ranking
FROM rf_score ;
-- number of each ranking
SELECT ranking , COUNT(ranking)
FROM ranking
GROUP BY ranking
ORDER BY ranking
ranking | COUNT(ranking) |
A | 6402 |
B | 4997 |
C | 8482 |
D | 8339 |
The proportion of category A : 22.69%
The proportion of category B : 17.71%
The proportion of category C : 30.06%
The proportion of category D : 29.55%
As we can see , category C and D accounts for the majority of customers. Thus, we should find out products they are interested in and put them on their homepages more often so that they are more likely to buy our products. We can also set up questionnaires to understand the customers' needs and where we can improve on.
4.6 Product analysis
SELECT product_id ,
SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS view_num ,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_num ,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) / SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS CVR ,
ROUND(SUM(CASE WHEN event_type = 'purchase' THEN price END),2) AS Total_Revenue ,
ROUND((SUM(CASE WHEN event_type = 'purchase' THEN price END) / SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END)),2) AS average_order_value ,
ROUND((SUM(CASE WHEN event_type = 'purchase' THEN price END) / SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END)),2) AS revenue_per_view
FROM Cosmetics_Shop
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 20
product_id | View_Num | Purchase_Num | CVR | Total_Revenue | Average_Order_Value | revenue_per_view |
5560754 | 3092 | 85 | 0.0275 | 16527.4 | 194.44 | 5.35 |
5809910 | 24044 | 2029 | 0.0844 | 10631.96 | 5.24 | 0.44 |
5751422 | 4703 | 798 | 0.1697 | 8738.1 | 10.95 | 1.86 |
5849033 | 4320 | 676 | 0.1565 | 6976.32 | 10.32 | 1.61 |
5560756 | 1545 | 33 | 0.0214 | 6862.02 | 207.94 | 4.44 |
5751383 | 5234 | 648 | 0.1238 | 6687.36 | 10.32 | 1.28 |
5792800 | 2937 | 586 | 0.1995 | 6047.52 | 10.32 | 2.06 |
5809912 | 7678 | 953 | 0.1241 | 4993.72 | 5.24 | 0.65 |
5917178 | 10955 | 293 | 0.0267 | 4790.55 | 16.35 | 0.44 |
5751742 | 1502 | 318 | 0.2117 | 4519.6 | 14.21 | 3.01 |
5846437 | 2774 | 99 | 0.0357 | 4478.76 | 45.24 | 1.61 |
5810480 | 485 | 194 | 0.4 | 4372.76 | 22.54 | 9.02 |
5855507 | 1435 | 53 | 0.0369 | 4198.13 | 79.21 | 2.93 |
5823970 | 224 | 156 | 0.6964 | 4185.48 | 26.83 | 18.69 |
5861760 | 1611 | 53 | 0.0329 | 3785.79 | 71.43 | 2.35 |
5820270 | 180 | 141 | 0.7833 | 3783.03 | 26.83 | 21.02 |
5528035 | 2782 | 394 | 0.1416 | 3750.88 | 9.52 | 1.35 |
5822650 | 468 | 184 | 0.3932 | 3505.2 | 19.05 | 7.49 |
5877453 | 2198 | 64 | 0.0291 | 3474.56 | 54.29 | 1.58 |
5692527 | 1467 | 75 | 0.0511 | 3451.5 | 46.02 | 2.35 |
There are different categories in the top 20 revenue products.
First, for products like 5560754 and 5560756, their prices are quite high, the conversion rates and revenue per views are low. Thus, we need to improve our marketing in promoting campaigns like Black Friday so that we can target people who are willing to buy these products. As a result, we can witness a rapid growth in sales revenue.
Secondly, for products like 5810480 , 5823970 , 5820270 , their view numbers are low, however , their purchase numbers and conversion rates are very impressive. We can invest more or increase our marketing budgets in these products so that we can optimize our marketing investments.
STEP 5: Share
STEP 6: ACT
In the final step, we will deliver our insights and providing recommendations based on our analysis.
Here, we revisit our business questions and share with you our high-level business recommendations.
1. What are the trends identified?
- Majority of customers were bouncing after viewing the products (34.7%) and removing products from the cart (31.1%).
- Customers prefer to view products during weekdays as compared to weekends - perhaps because they are willing to spend their time fragmented when they are taking a break or on their way home.
2. How could these trends help improve sales revenue?
- Marketing team can build a more effective recommendation system by improving the accuracy so that people can find products fitting their requirements more , which is beneficial for us to keep our customers loyal to our store.
- On weekends, we can also prompt notification to encourage users to view and recommend our products or store to their friends when they are spending time together.