How do you prevent customer churn?

Customer-facing businesses often face the challenge of losing customers and need to take action to keep them. How do you prevent customer churn? Predicting churn is a common task that data scientists encounter in many customer-facing businesses: once likely churners are identified, the company can target them with promotions to retain them. But how do you predict the customer churn rate?
 
 

First, Business Understanding


Understanding the company's business is the first step. Sparkify is a music provider that collects a log of its users' behavior. Users can sign up for Sparkify for free, and members can share their uploaded music at no cost. If users upgrade to a premium account with a fixed monthly fee, they can enjoy songs on the site without uploading any. If users don't like the site, they can downgrade or cancel their accounts.
According to this business process, when users cancel, their accounts are removed. This event can happen for both paid and free users. The churn rate is therefore the ratio of churned users to retained users. But which events can be used to predict churn? To answer this question, we move to the next step: data understanding.


Second, Data Understanding


We explore the medium-sized dataset of Sparkify user behavior logs first. It yields the following summary statistics:


 

Pandas is used to examine correlations between the numeric columns by generating scatter plots of them. For the Pandas workload, we don't want to pull the entire dataset into the Spark driver, as that might exhaust the available RAM and throw an out-of-memory exception. Instead, we randomly sample 10% of the data to get a rough idea of what it looks like.
 
According to the figures above, we can draw the following conclusions:

  • "userId" identifies the same user as "firstName" and "lastName", so "firstName" and "lastName" can be removed.
  • "method" relates to the HTTP tooling rather than the business, so it can be removed.
  • "status" shows no correlation with the other columns (its scatter plot is a flat line), so it can be removed.
  • "sessionId" is highly correlated with "ts": the (sessionId, registration) plot is the same as the (ts, registration) plot, so "sessionId" can be removed.
  • The other columns are kept for further analysis.

We then perform further exploratory data analysis to compare the behavior of users who retained versus users who churned. We start by exploring aggregates on these two groups, observing how often they performed a specific action per time unit or per number of songs played.

 
The figures above show that the following page events have the largest differences (churned vs. retained) in their boxplots. Therefore, they are used as the basis for our feature selection.

  •     About: Visits to the About page
  •     Add Friend: Adding friends
  •     Add to Playlist: Adding songs to a playlist
  •     Help: Visiting the Help page
  •     Home: Visiting the Home page
  •     NextSong: Playing a song
  •     Settings: Viewing the Settings page
  •     Thumbs Up: Giving a thumbs up
  •     Downgrade: Downgrading the account
  •     Error: An error occurred

We can also query the page events related to service level:

According to the analysis above, the key churn event in the page column is “Cancellation Confirmation”. When users trigger “Cancellation Confirmation”, their accounts are removed. So the “Cancellation Confirmation” event defines churn, and it happens for both paid and free users. We also know which page events carry a high weight for churn. But which features can be used to predict churn? To answer this question, we move to the next step: data preparation.
 

Third, Data Preparation

Data Preprocessing


•    Convert the ts column from Unix time to a timestamp.
•    Combine all the numeric features by user_id and then scale them with MinMaxScaler.
•    Drop duplicate user events based on the user_id, page, level, and ts columns.
•    Remove all log activities with a missing user_id via the join actions in the feature engineering step.


Feature Engineering


After the exploratory analysis, we create features by aggregating all log activities to the userId level using PySpark. We then check their correlation with churn using a heatmap. The heatmap figures follow:


 
According to the heatmap graphs and the correlations between features and churn, the following features are removed:

  • st_use_trend, mt_use_trend, and lt_use_trend are highly correlated with st_songs_trend, mt_songs_trend, and lt_songs_trend, so st_use_trend, mt_use_trend, and lt_use_trend can be dropped.
  • enjoy_songs_last_month is highly correlated with lm_diffsong_avg, so enjoy_songs_last_month is removed.
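The pruning rule above can be illustrated on the Pandas side; the column values below are hypothetical, chosen only to show one feature being dropped for near-perfect correlation with another:

```python
# Illustration: drop one feature from each highly correlated pair.
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "st_songs_trend": [1.0, 2.0, 3.0, 4.0],
    "st_use_trend":   [2.0, 4.0, 6.0, 8.0],  # duplicate signal, corr = 1.0
    "churn":          [0.0, 0.0, 1.0, 1.0],
})

corr = pdf.corr().abs()
# keep only the upper triangle so each pair is examined exactly once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
```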

Features selection 

 

After four rounds of feature selection, the following features are chosen based on feature importance and the correlations above:

  • negative_actions: negative behavior (downgrade/cancel events) in the last month
  • avg_songs_per_home: the average NextSong count per Home visit for each user
  • l2week_active_ratio: active days (login days) in the last 2 weeks / registration duration up to the last 2 weeks
  • week_active_ratio: active days (login days) in the last week / registration duration up to the last week
  • mt_trend_of_add_playlist: the average number of add_playlist events in the last 2 weeks / the average number in the last month
  • mt_songs_trend: the average number of songs played in the last 2 weeks / the average number in the last month
  • lt_songs_trend: the average number of songs played in the last month / the average number in the last 2 months
  • lt_active_trend: month_active_ratio / l2month_active_ratio, where active_ratio = activity days (login days) / registration duration
  • regist_days: registration duration up to the last date in the log

Their heatmap follows:
 
 

Fourth, Modeling

Stratified Sampling


We compute statistics on the Sparkify dataset and find 92 churned users and 355 retained users, so the churn distribution is imbalanced.
 
Stratified sampling on the raw dataset must be performed to form the sampled dataset, so that churned records are split between the training and test datasets in proportion to the split ratio.


Metrics selection:


The F1 score is used to evaluate and select the model.

  • Churn in the dataset is imbalanced; the churn rate is usually not very high across the whole population.
  • Accuracy is not very helpful for user churn: it may be very high even when the model misses most churned users, and because only a few users churn, prediction errors on them heavily distort the result.
  • Precision (of those predicted to churn, how many truly churn) and recall should be the focus in our scenario, and the F1 score balances precision and recall.

Therefore, the F1 score is a better metric for what we care about.
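A back-of-the-envelope check with this dataset's class counts (92 churned, 355 retained) makes the point: a degenerate model that predicts "retained" for everyone still scores high accuracy while never catching a single churner.

```python
# Why accuracy misleads on imbalanced churn data.
churned, retained = 92, 355
accuracy = retained / (churned + retained)   # every prediction is "retained"
churn_recall = 0 / churned                   # no churner is ever flagged
print(f"accuracy={accuracy:.3f}, churn recall={churn_recall:.1f}")
# → accuracy=0.794, churn recall=0.0
```

So a model can be "79% accurate" while having zero recall (and thus an F1 of zero) on the class we actually care about.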


Model

We would like to determine which parameter values of the classifiers produce the best model. We use k-fold cross-validation, where the data is randomly split into k partitions.

  • Each partition is used once as the test set, while the rest are used for training.
  • Models are generated from the training sets and evaluated on the test sets, resulting in k performance measurements.
  • The average of the performance scores is taken as the overall score of the model for a given set of build parameters.
  • RandomForestClassifier, LogisticRegression, and GBTClassifier models are used. For model selection, their parameter grids are searched and their cross-validation performances compared. The parameter values leading to the highest performance metric produce the best model.

 
A Pipeline is built that includes the following stages:

  • VectorAssembler: transforms the features into a single feature vector, which the next stage requires.
  • MinMaxScaler: scales individual features to a common range between 0 and 1.
  • RandomForestClassifier / LogisticRegression / GBTClassifier: consumes the feature vector and the label column.

Each model was trained on the large dataset with its classifier's default settings.


Fifth, Model Evaluation

Finally, three classifiers get the following scores:

  • GBTClassifier
    • Training: accuracy: 95.39%, Precision: 97.02%, Recall: 95.39%, F1 Score: 95.22%
    • Test:     accuracy: 84.93%, Precision: 87.40%, Recall: 84.93%, F1 Score: 84.28%
  • LogisticRegression:
    • Training: accuracy: 80.22%, Precision: 79.28%, Recall: 81.03%, F1 Score: 74.16%
    • Test:     accuracy: 79.45%, Precision: 62.67%, Recall: 76.71%, F1 Score: 74.39%
  • RandomForestClassifier
    •  Training: accuracy: 90.24%, Precision: 89.84%, Recall: 90.79%, F1 Score: 90.40%
    • Test:     accuracy: 90.41%, Precision: 90.38%, Recall: 90.41%, F1 Score: 89.65%

The highest average F1 score is 89.65%, and the best-performing model for this first iteration is RandomForestClassifier. Comparing its training scores with its test scores, both its variance and bias are low, which means RandomForestClassifier generalizes to the test dataset better than the other classifiers. While this is a good first iteration, there is still a lot of work to do before the model is released into production.
RandomForestClassifier outperforms the other models here because it benefits from random feature selection and bagging. Random forest training can be highly parallelized, which helps training speed on large samples; since tree nodes split on randomly selected features, the model can still be trained efficiently when the feature dimension is very high; and because of random sampling, the trained model has low variance and strong generalization ability. This works well in our case, especially since we want to correctly predict which users churn.

Therefore, we choose the random forest as the best model with tuned parameters (maxDepth=4, numTrees=100). We fit the best model on the validation set and examine the feature importances as well.


Finally, Improvement

We can improve the model's performance in two ways: further iterations of feature engineering on one hand, and continued hyperparameter tuning on the other.
First, feature engineering should iterate further. Most page events are currently not analyzed in enough depth. The four rounds of feature selection show that different combinations of features improve the model's performance, so future improvement is possible by studying more features.
Second, more optimization work is needed on the model's parameters and hyperparameters. At present the random forest mostly uses its basic settings, but better performance could be achieved by tuning more parameters such as maxDepth, maxBins, numTrees, minInfoGain, and so on.
Finally, a churn prediction model is a methodology; it does not directly reduce customer churn. Its purpose should be to improve the effectiveness of retention care and keep customers as active as possible, rather than to reduce the churn rate by itself.
The best time to keep customers is before you lose them. Facing increasingly fierce market competition, most companies pay more and more attention to customer care, investing continuously in care and retention work to retain as many customers as possible.

 
