Customer Segmentation Report for Arvato Financial Services

@ Udacity Data Scientist Nanodegree

Project Overview

In this project, I will analyze more than 190,000 demographics records for customers of a mail-order sales company in Germany, comparing them against more than 890,000 demographics records for the general population.
The tasks involved are the following:
(1) Get to know the data
(2) Customer Segmentation Report
(3) Supervised learning model
By analyzing these data, mail-order sales companies can better understand the characteristics of their customer segments and then run more targeted marketing campaigns.

Problem Statement

The goal is to create an unsupervised machine learning model to better understand the characteristics of customer segments, and a supervised machine learning model to help the company run more targeted marketing campaigns.
Through analysis, I hope to solve the following three problems:

  1. What is special about customers compared with the general population?
  2. How can we segment customers?
  3. How can we better identify potential customers?

Metrics

Accuracy is a common metric for binary classifiers; it takes into account both true positives and true negatives with equal weight.

Accuracy = (True positives + True negatives) / dataset size

Since positive cases are far rarer than negative cases in our data, recall is also a very important metric.

Recall = True positives / (True positives + False negatives)
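
As a quick illustration, both metrics are available in scikit-learn; the label arrays below are made up for the example.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical ground-truth labels and model predictions (1 = responded).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

# Accuracy = (TP + TN) / dataset size
print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.75

# Recall = TP / (TP + FN): the share of true responders we actually catch.
print("Recall:", recall_score(y_true, y_pred))  # 0.666...
```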

Analysis

Data Exploration

Now let’s dive into the project. Here we analyzed the distributions of several indicators across the two populations and drew the following conclusions.
Azdias contains demographics data for the general population of Germany: 891,211 persons (rows) × 366 features (columns).
Customers contains demographics data for customers of a mail-order company: 191,652 persons (rows) × 369 features (columns).

(1) Screen out the columns with over 90% null cells.

These columns include ‘ALTER_KIND1’, ‘ALTER_KIND2’, ‘ALTER_KIND3’, and ‘ALTER_KIND4’.
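
A minimal pandas sketch of this screening step; the file name follows the project’s data files, and `azdias` stands for the general-population DataFrame.

```python
import pandas as pd

# Load the general-population data (semicolon-separated in this project).
azdias = pd.read_csv("Udacity_AZDIAS_052018.csv", sep=";")

# Fraction of missing cells per column.
null_fraction = azdias.isnull().mean()

# Columns with more than 90% missing values are dropped.
cols_to_drop = null_fraction[null_fraction > 0.9].index.tolist()
print(cols_to_drop)  # expected: the four ALTER_KIND* columns
azdias = azdias.drop(columns=cols_to_drop)
```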

(2) Compare the distributions of values between the two populations for three columns with no or few missing values.

a) These customers may be wealthier.

Financial Investment
In the figure above, each population is split into five groups labeled 1 to 5, meaning very high, high, average, low, and very low financial investment. More than half of the customers fall into the “very high” group, whereas in the general population that group accounts for only about 25%.
Household
In the figure above, each population is split into five groups labeled 1 to 5, meaning very high, high, average, low, and very low home ownership. The top three groups among customers are “average”, “high”, and “very high”, whereas in the general population they are “average”, “very low”, and “high”.

b) These customers may have lower affinity.

Affinity
In the figure above, each population is split into seven groups labeled 1 to 7, meaning highest, very high, high, average, low, very low, and lowest affinity. Among our customers, group 6 is by far the largest group, whereas in the general population the gaps between groups are smaller.

Algorithms and Techniques

1. PCA: Principal component analysis

sklearn.decomposition.PCA
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.
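
A minimal sketch of how PCA can be applied here; `X_scaled` stands for the cleaned, standard-scaled feature matrix produced in the preprocessing steps below, and 30 components anticipates the cut-off chosen in the Implementation section.

```python
from sklearn.decomposition import PCA

# Project the scaled data onto the first 30 principal components.
pca = PCA(n_components=30)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by the retained components.
print(pca.explained_variance_ratio_.sum())
```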

2. Kmeans clustering

sklearn.cluster.KMeans
The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a wide range of application areas in many different fields.
The k-means algorithm divides a set of samples into disjoint clusters, each described by the mean of the samples in the cluster; the means are commonly called the cluster “centroids”.
kmeans
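
A minimal sketch of clustering the PCA-reduced data; `X_pca` is the matrix from the PCA sketch above, and k=6 anticipates the cluster count chosen in the Implementation section.

```python
from sklearn.cluster import KMeans

# Fit k-means and assign each person to one of 6 clusters.
kmeans = KMeans(n_clusters=6, random_state=42)
labels = kmeans.fit_predict(X_pca)

# The centroids live in the same 30-dimensional PCA space as X_pca.
print(kmeans.cluster_centers_.shape)  # (6, 30)
```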

3. SGDClassifier

sklearn.linear_model.SGDClassifier
SGDClassifier implements linear classifiers (SVM, logistic regression, among others) with SGD training.
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning; see the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.
This implementation works with data represented as dense or sparse arrays of floating point values for the features. The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).
The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.
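
A minimal sketch of the classifier as configured here; `X_train`/`y_train` stand for the balanced training set built in the Implementation section below, and the hyperparameters shown are scikit-learn defaults.

```python
from sklearn.linear_model import SGDClassifier

# loss="hinge" (the default) fits a linear SVM; penalty="l2" is the
# default regularizer, with alpha controlling its strength.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                    max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

# Predict responses for unseen individuals.
y_pred = clf.predict(X_test)
```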

Methodology

Data Preprocessing

The preprocessing done in the notebook consists of the following steps:

1. Remove the columns which have over 90% null cells.

Here, ‘ALTER_KIND1’, ‘ALTER_KIND2’, ‘ALTER_KIND3’, ‘ALTER_KIND4’ were removed.

2. Re-encode categorical features.

“OST_WEST_KZ” was re-encoded to numerical values.

3. Re-encode string features.

String values in “CAMEO_DEUG_2015” and “CAMEO_INTL_2015” were re-encoded to numerical values.

4. Re-encode date features.

Time stamps in “EINGEFUEGT_AM” were re-encoded to numerical values.

5. Impute NaNs in the data.

As there are many categorical features, we filled the NaNs with each column’s most frequent value.

6. Standard-scale the data.

For best results using SGDClassifier, the data should have zero mean and unit variance.
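
A condensed sketch of steps 1-6 above; the column names follow the text, while the exact category-to-number mappings are illustrative assumptions.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def preprocess(df):
    # 1. Drop the columns with over 90% null cells.
    df = df.drop(columns=["ALTER_KIND1", "ALTER_KIND2",
                          "ALTER_KIND3", "ALTER_KIND4"])

    # 2. Re-encode the categorical flag (assumed mapping: O/W -> 0/1).
    df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"O": 0, "W": 1})

    # 3. Re-encode string columns; non-numeric codes such as "X"
    #    become NaN and are imputed below.
    for col in ["CAMEO_DEUG_2015", "CAMEO_INTL_2015"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # 4. Re-encode the timestamp column as a plain year number.
    df["EINGEFUEGT_AM"] = pd.to_datetime(df["EINGEFUEGT_AM"]).dt.year

    # 5. Fill NaNs with each column's most frequent value.
    imputer = SimpleImputer(strategy="most_frequent")
    imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    # 6. Standard-scale to zero mean and unit variance.
    return StandardScaler().fit_transform(imputed)
```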

Implementation

The implementation process can be split into four main stages:

  1. Perform dimensionality reduction.
  2. Train the unsupervised model.
  3. Train the supervised model.
  4. Use the trained model to predict.

Perform dimensionality reduction

After the first components, which together explain about 10% of the variability, the amount of variance explained by each additional component flattens out. For this reason, I would not go beyond ~30 components. It is fair to say the first 2-3 components are likely to hold the most interpretable information about the relationships (and latent features) in the original data.
Top30 PCA components
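
The cut-off can be read off a cumulative explained-variance curve; a sketch of how such a plot is produced, assuming a full PCA fit on the scaled data `X_scaled`.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with all components to inspect the variance spectrum.
pca_full = PCA().fit(X_scaled)

cumulative = pca_full.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```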

Perform Kmeans Clustering

After the number of clusters reaches 6, the k-means score curve flattens out. For this reason, I chose to split the customers into 6 subgroups.
kmeans score
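
A sketch of the elbow search behind this choice; `KMeans.score` returns the negative inertia, so the curve flattens where additional clusters stop helping.

```python
from sklearn.cluster import KMeans

# Negative within-cluster sum of squares for k = 2..10; the elbow
# around k = 6 motivates the choice of 6 customer subgroups.
scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42).fit(X_pca)
    scores.append(km.score(X_pca))
print(list(zip(range(2, 11), scores)))
```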

Supervised Classification

Now that we’ve found which parts of the population are more likely to be customers of the mail-order company, it’s time to build a prediction model. Ideally, we should be able to use the demographic information from each individual to decide whether or not it is worth including that person in the campaign.
The “MAILOUT” data has a column, “RESPONSE”, that states whether or not a person became a customer of the company following the campaign. Only 532 people became customers following the campaign, while 42,430 did not. So we up-sampled the positive cases in the following steps (a sketch follows the list):
1. Clean and standard-scale the data.
2. Randomly hold out one third of the positive samples as test data, together with an equal number of negative samples.
3. Up-sample the remaining two thirds of the positive samples to match the number of remaining negative samples; use these as training data.
4. Train a sklearn SGDClassifier model to predict which individuals are most likely to respond to a mailout campaign.
5. Test the model in the Kaggle competition.
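
A sketch of the split and up-sampling logic above; `X` and `y` stand for the cleaned, scaled MAILOUT features and the RESPONSE column, and sklearn.utils.resample does the sampling with replacement.

```python
import numpy as np
from sklearn.utils import resample, shuffle

# Shuffle positives and negatives separately (random split, step 2).
pos = shuffle(X[y == 1], random_state=42)
neg = shuffle(X[y == 0], random_state=42)

# Step 2: hold out one third of the positives, plus an equal number of
# negatives, as a balanced test set.
n_test = len(pos) // 3
X_test = np.vstack([pos[:n_test], neg[:n_test]])
y_test = np.array([1] * n_test + [0] * n_test)

# Step 3: up-sample the remaining positives (with replacement) until
# they match the remaining negatives; this becomes the training set.
pos_rest, neg_rest = pos[n_test:], neg[n_test:]
pos_up = resample(pos_rest, replace=True,
                  n_samples=len(neg_rest), random_state=42)
X_train = np.vstack([pos_up, neg_rest])
y_train = np.array([1] * len(pos_up) + [0] * len(neg_rest))
```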

Result

Customer Characteristics

(1) No. 1 principal component of customers: Financial.

Financial
Through analysis, we found that the first principal component is mainly about health and financial risk. It is mainly composed of the following features:
VERS_TYP: insurance typology
NATIONALITAET_KZ: nationality (scored by prename analysis)
HEALTH_TYP: health typology
ALTER_HH: main age within the household
SEMIO_VERT: affinity indicating in what way the person is dreamful

(2) No. 2 principal component of customers: Houses.

House
The second principal component is mainly about houses and family. It is mainly composed of the following features:
PLZ8_ANTG3: number of 6-10 family houses in the PLZ8
ORTSGR_KLS9: classified number of inhabitants

(3) No. 3 principal component of customers: Cars.

Car
The third principal component is mainly about cars. It is mainly composed of the following features:
KBA13_KMH_211: share of cars with a max speed greater than 210 km/h within the PLZ8
KBA13_KMH_250: share of cars with max speed between 210 and 250 km/h within the PLZ8
KBA05_MOTOR: most common engine size in the microcell
KBA13_HERST_BMW_BENZ: share of BMW & Mercedes-Benz within the PLZ8
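
These interpretations come from inspecting each component’s feature weights; a sketch of how to list the top-weighted features, assuming `feature_names` holds the column names that went into PCA.

```python
import pandas as pd

def top_features(pca, feature_names, component, n=5):
    """Return the n features with the largest absolute weight in the
    given principal component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    order = weights.abs().sort_values(ascending=False).index
    return weights.reindex(order)[:n]

# Component 0 is the "Financial" component discussed above.
print(top_features(pca, feature_names, component=0))
```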

Customer Segments

Here, I used PCA to describe the salient characteristics of clusters of the company’s existing customers.
PCA_Kmeans
Furthermore, we found that people in cluster 2 are more likely to be single-buyers, while people in cluster 3 are more likely to be multi-buyers; people in cluster 1 are more likely to buy cosmetics, while people in clusters 2, 4, and 5 are not.
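
Which segments are over-represented among customers can be read off by comparing cluster shares between the two populations; a minimal sketch, assuming `customers_pca` is the customer data pushed through the same scaler, PCA, and k-means model.

```python
import numpy as np

# Cluster assignments for both populations.
general_labels = kmeans.predict(X_pca)
customer_labels = kmeans.predict(customers_pca)

for k in range(6):
    gen_share = np.mean(general_labels == k)
    cust_share = np.mean(customer_labels == k)
    # Clusters where cust_share exceeds gen_share are over-represented
    # among customers and are promising marketing targets.
    print(f"cluster {k}: general {gen_share:.2%}, customers {cust_share:.2%}")
```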

Classification Model Evaluation and Validation

During development, a held-out test set was used to evaluate the model. The precision is about 0.6 and the recall about 0.65.
Report

Conclusion

PCA

  • When there are many features, we can reduce the time and space complexity of the problem by reducing the dimensionality.
  • By analyzing the principal components of the data, we can understand some meaningful relationships between features, and then understand the data more deeply.

KMeans

  • When we don’t know the number of clusters, we can determine it experimentally by finding the inflection point where the clustering score stops improving.
  • PCA can be used to visualize the clustering of high-dimensional data and help us to judge the clustering quality.

Supervised machine learning

  • When the positive and negative samples are very unbalanced, we can use up-sampling to increase the number of positive examples and improve the performance of the model.