

1. Project Overview

1.1 Introduction

This is a capstone project for the Udacity data science nanodegree program.


In this project, I analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. I use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, I use a supervised model to predict which individuals are most likely to convert into becoming customers for the company.


1.2 Data sets

The data is provided by Bertelsmann Arvato Analytics and represents a real-life data science task. There are four data files associated with this project:


  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).


  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).


  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 columns.

  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 columns.

There are also two Excel spreadsheets that provide more information about the columns in the data files.


  • DIAS Information Levels — Attributes 2017.xlsx is a top-level list of attributes and descriptions, organized by the informational category.

  • DIAS Attributes — Values 2017.xlsx is a detailed mapping of data values for each feature in alphabetical order.


1.3 Problem and Approach

There are four parts in this project:


  1. Get to know the data


In this part, I will explore the data and then clean it: handling missing values, converting data types, imputing the remaining gaps, and scaling the features. The cleaned data is used in the parts that follow.

2. Customer segmentation report


In this part, I will compare the demographics data for customers against the information for the general population to identify the core customer base of the company. I will use an unsupervised learning technique (k-means) to perform customer segmentation, with principal component analysis (PCA) to reduce the dimensionality first.

3. Supervised learning model


Here, I will use supervised learning methods to predict which individuals are most likely to become customers of the company. I will compare four different models and tune the best one with GridSearchCV.

4. Kaggle competition

The predictions will be submitted to the Kaggle competition.


1.4 Metrics

I will use the area under the receiver operating characteristic curve (ROC_AUC) for model selection. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at all possible thresholds. The ideal curve passes close to the top-left corner. The area under the ROC curve (AUC) summarizes the curve in a single number, which makes it possible to compare models and select the best one. I use ROC_AUC because this is a classification problem with imbalanced classes, and for this kind of problem ROC_AUC is often much more meaningful than accuracy.
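To illustrate why accuracy can mislead on imbalanced classes, here is a minimal sketch on synthetic labels. The 1% positive rate and the score distributions are assumptions chosen for illustration; they are not taken from the MAILOUT data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic, highly imbalanced labels: roughly 1% positives (illustrative assumption).
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that gives everyone the same score and so always predicts the negative class.
constant_scores = np.zeros(len(y_true))
acc_constant = accuracy_score(y_true, (constant_scores > 0.5).astype(int))
auc_constant = roc_auc_score(y_true, constant_scores)

# A model whose scores tend to be higher for the positive class.
informative_scores = rng.normal(loc=y_true * 1.5, scale=1.0)
auc_informative = roc_auc_score(y_true, informative_scores)

print(f"constant model:    accuracy={acc_constant:.3f}, ROC AUC={auc_constant:.3f}")
print(f"informative model: ROC AUC={auc_informative:.3f}")
```

The constant model reaches a very high accuracy simply because almost every label is negative, yet its ROC AUC stays at chance level, while the informative model is clearly separated from it by ROC AUC.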


2. Analysis, Methodology and Results

2.1 Data processing

I first explored four data files associated with this project:


  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).


Figure: The first few rows of the AZDIAS data set

  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).


Figure: The first few rows of the CUSTOMERS data set

  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 columns.

  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 columns.

Each row of the demographics files represents a single person, including information about their household, building, and neighborhood. I will use the information from the first two files to figure out how customers (“CUSTOMERS”) are similar to or differ from the general population at large (“AZDIAS”), then make predictions on the other two files (“MAILOUT”), predicting which recipients are most likely to become customers of the mail-order company.

2.1.1 Process missing data

Some encoded attribute values mean "unknown" or "no value available", so I need to convert them to NaNs before processing missing data. I created a data frame in which one column holds the attribute name and another holds the values that indicate unknown or missing information. Based on this data frame, I converted every encoded value in azdias that means unknown to NaN.

Figure: Attributes and values that mean unknown
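A conversion like the one described above could look roughly like the following sketch. The attribute names, the codes, and the helper unknown_to_nan are hypothetical stand-ins; the real lookup table comes from the DIAS spreadsheets.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: one row per attribute with the codes that mean "unknown".
unknown_codes = pd.DataFrame({
    "attribute": ["ATTR_A", "ATTR_B"],
    "unknown_values": [[-1, 0], [-1, 9]],
})

# Tiny stand-in for the azdias data frame.
azdias_toy = pd.DataFrame({
    "ATTR_A": [-1, 2, 0, 3],
    "ATTR_B": [1, 9, 3, -1],
})

def unknown_to_nan(df, lookup):
    """Return a copy of df in which the codes listed in lookup are replaced by NaN."""
    df = df.copy()
    for attribute, codes in zip(lookup["attribute"], lookup["unknown_values"]):
        if attribute in df.columns:
            # Keep values that are not "unknown" codes; everything else becomes NaN.
            df[attribute] = df[attribute].where(~df[attribute].isin(codes), np.nan)
    return df

azdias_nan = unknown_to_nan(azdias_toy, unknown_codes)
print(azdias_nan)
```

Working on a copy keeps the raw data frame intact, which makes it easy to rerun the cleaning with a different lookup table.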

Then I studied the proportion of missing values in each column and row. The following figure illustrates the distribution of missing values per column. According to this figure, most columns have a proportion of missing values below 0.3, so I dropped the columns in which the proportion of missing values is greater than 0.3.

Figure: Distribution of the proportion of missing values per column

The following figure illustrates the distribution of missing values per row. According to this figure, most rows have a proportion of missing values below 0.1, so I dropped the rows in which the proportion of missing values is greater than 0.1.

Figure: Distribution of the proportion of missing values per row
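The two thresholds can be applied with a few lines of pandas. This is a sketch on toy data; the helper name drop_sparse is hypothetical, and computing the row proportions after the sparse columns are removed is an assumption about the order of the steps.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, col_threshold=0.3, row_threshold=0.1):
    """Drop columns, then rows, whose share of missing values exceeds the thresholds."""
    col_missing = df.isnull().mean(axis=0)
    df = df.loc[:, col_missing <= col_threshold]
    # Row proportions are computed on the remaining columns (an assumption).
    row_missing = df.isnull().mean(axis=1)
    return df.loc[row_missing <= row_threshold]

# Toy data: 20 rows x 12 columns of random numbers.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 12)), columns=[f"f{i}" for i in range(12)])
toy.loc[:9, "f0"] = np.nan          # 50% missing, so column f0 is dropped
toy.loc[0, ["f1", "f2"]] = np.nan   # row 0 misses 2 of the 11 remaining values, so it is dropped

cleaned = drop_sparse(toy)
print(cleaned.shape)
```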

2.1.2 Process data types

Six columns have the object data type. They need to be converted or dropped before the data can be transformed.


The approach is to re-encode ‘X’ in CAMEO_DEUG_2015 and ‘XX’ in CAMEO_INTL_2015 as NaN, and then convert CAMEO_DEUG_2015 and CAMEO_INTL_2015 to float; re-encode ‘W’ and ‘O’ in OST_WEST_KZ as 1 and 0; and drop the columns CAMEO_DEU_2015, D19_LETZTER_KAUF_BRANCHE, and EINGEFUEGT_AM.

After this process, the azdias dataset has 737 241 rows and 322 columns.
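The re-encoding of the object columns can be sketched as follows. The column names come from the text above, but the cell values in the toy frame are made up for illustration, and the helper name reencode_object_columns is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the six object columns; all cell values are invented placeholders.
toy = pd.DataFrame({
    "CAMEO_DEUG_2015": ["1", "X", "8", np.nan],
    "CAMEO_INTL_2015": ["51", "XX", "14", "22"],
    "OST_WEST_KZ": ["W", "O", "W", np.nan],
    "CAMEO_DEU_2015": ["a", "b", "c", "d"],
    "D19_LETZTER_KAUF_BRANCHE": ["placeholder"] * 4,
    "EINGEFUEGT_AM": ["placeholder"] * 4,
})

def reencode_object_columns(df):
    """Re-encode or drop the object columns as described in the text."""
    df = df.copy()
    # 'X' / 'XX' mark unknown values: mask them to NaN, then cast the rest to float.
    df["CAMEO_DEUG_2015"] = df["CAMEO_DEUG_2015"].mask(df["CAMEO_DEUG_2015"] == "X").astype(float)
    df["CAMEO_INTL_2015"] = df["CAMEO_INTL_2015"].mask(df["CAMEO_INTL_2015"] == "XX").astype(float)
    # Binary flag: 'W' becomes 1 and 'O' becomes 0; missing values stay NaN.
    df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"W": 1, "O": 0})
    return df.drop(columns=["CAMEO_DEU_2015", "D19_LETZTER_KAUF_BRANCHE", "EINGEFUEGT_AM"])

encoded = reencode_object_columns(toy)
print(encoded)
```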


2.1.3 Cleaning Customer dataset

I wrapped the previous cleaning steps in functions and used them to clean the Customer dataset. I also dropped the extra columns in the Customer dataset (i.e., ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’, ‘PRODUCT_GROUP’). After the cleaning, the Customer dataset has 134 245 rows and 322 columns.

2.1.4 Data imputation and feature scaling

Before applying unsupervised learning techniques, the data is imputed and then scaled. Missing values are imputed with the column mean, and StandardScaler is used for feature scaling. The following figure shows the azdias dataset after data imputation and feature scaling.

Figure: azdias dataset after data imputation and feature scaling
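Mean imputation followed by standard scaling can be chained in a scikit-learn pipeline. The sketch below uses a tiny toy frame; combining the two steps in a single pipeline is a design choice made here, not necessarily how the original notebook is organized.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned data, with a few missing values.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Step 1 fills NaNs with the column mean, step 2 standardizes each column.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
scaled = pd.DataFrame(preprocess.fit_transform(toy), columns=toy.columns)
print(scaled)
```

Fitting the pipeline once on the general population and reusing it with preprocess.transform on the customer data would keep both datasets on the same scale.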

2.2 Customer Segmentation Report

2.2.1 PCA

Since there are 322 columns in the datasets, I used Principal Component Analysis (PCA) to reduce the dimensionality. I plotted the cumulative explained variance against the number of components, as shown below. After about 200 components, adding more components increases the cumulative explained variance only slightly, so I chose to retain 200 components.

Figure: Cumulative explained variance versus number of PCA components
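The cumulative explained variance curve can be computed directly from a fitted PCA object. The sketch below uses random correlated data, and the 90% threshold is an assumption for illustration; the article picks 200 components by inspecting the plot rather than by a fixed threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data standing in for the scaled azdias matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30)) @ rng.normal(size=(30, 30))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%.
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components_90, cumulative[n_components_90 - 1])
```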

2.2.2 Clustering

I used the k-means clustering method on the PCA-transformed data. Before applying it, I needed to find a suitable number of clusters, so I plotted the k-means score against the number of clusters. The plot shows that the score decreases rapidly at first and then levels off after about 9 clusters. So, I selected 9 as the number of clusters for the analysis.

Figure: k-means score versus number of clusters
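An elbow curve of this kind can be produced by fitting k-means for a range of cluster counts and recording a score for each fit. The sketch below uses synthetic blobs and the within-cluster sum of squares (inertia_) as the score, which is an assumption about what the article calls the "k-means score".

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters, standing in for the PCA-transformed data.
X, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)

cluster_counts = list(range(1, 9))
scores = []
for k in cluster_counts:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centroid.
    scores.append(model.inertia_)

for k, score in zip(cluster_counts, scores):
    print(k, round(score, 1))
```

Plotting scores against cluster_counts and looking for the point where the curve flattens gives the kind of elbow described above.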

2.2.3 K-means

I used the k-means method for unsupervised learning. The model is fitted on the cleaned azdias dataset and then used to assign a cluster to every row of both the azdias and the customers datasets.

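from sklearn.cluster import KMeans

# Fit k-means with 9 clusters on the PCA-transformed general population data
azdias_kmeans = KMeans(n_clusters=9)
azdias_model = azdias_kmeans.fit(azdias_pca_200_transf)
# Assign each person in both datasets to one of the fitted clusters
azdias_predicted = azdias_model.predict(azdias_pca_200_transf)
customers_predicted = azdias_model.predict(customers_pca_200_transf)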


Then I compared the proportion of each cluster in the azdias and customers datasets. Clusters 8 and 5 are the most over-represented among customers compared to the general population, while clusters 3 and 4 are the most under-represented. The next step is to find the most important attributes in those clusters.

Figure: Proportion of each cluster in the azdias and customers datasets
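The comparison of cluster proportions boils down to counting labels in each dataset and normalizing. Here is a minimal sketch with made-up labels and three clusters; the helper name cluster_shares is hypothetical.

```python
import numpy as np
import pandas as pd

def cluster_shares(labels, n_clusters):
    """Share of samples assigned to each cluster."""
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()

# Made-up cluster labels for illustration only.
general_labels = np.array([0, 0, 1, 1, 2, 2, 2, 2])
customer_labels = np.array([0, 0, 0, 1, 2, 2])

comparison = pd.DataFrame({
    "general": cluster_shares(general_labels, 3),
    "customers": cluster_shares(customer_labels, 3),
})
# Positive values mean the cluster is over-represented among customers.
comparison["difference"] = comparison["customers"] - comparison["general"]
print(comparison.sort_values("difference", ascending=False))
```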

To find the top attributes of a cluster, I defined two functions: get_top_component(model, n) finds the top component of cluster n, and get_top_attributes(component_num, top_num) finds the top top_num attributes of component component_num.
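The article does not show the bodies of these functions. One plausible way to implement the idea is sketched below, under the assumptions that the "top component" is the principal component with the largest absolute centroid coordinate and that the "top attributes" are those with the largest absolute PCA weights. The names top_component and top_attributes and their signatures are hypothetical, not the author's.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def top_component(kmeans, cluster):
    """Index of the principal component with the largest absolute centroid coordinate."""
    return int(np.argmax(np.abs(kmeans.cluster_centers_[cluster])))

def top_attributes(pca, feature_names, component, top_num):
    """Original attributes with the largest absolute weights in one principal component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    order = weights.abs().sort_values(ascending=False).index
    return weights.loc[order].head(top_num)

# Toy data: 200 rows and 6 attributes with placeholder names.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"attr_{i}" for i in range(6)])
pca = PCA(n_components=3).fit(data)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pca.transform(data))

component = top_component(kmeans, cluster=0)
top = top_attributes(pca, data.columns, component, top_num=3)
print(component)
print(top)
```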


I found that the top 10 attributes of the most over-represented cluster mainly concern the share of middle and upper-class cars. This suggests that people who own middle and upper-class cars are more likely to become customers.

Figure: Top 10 attributes of the most over-represented cluster

Similarly, I found that the top 10 attributes of the second most over-represented cluster mainly concern financial status and age.


Figure: Top 10 attributes of the second most over-represented cluster

The top 10 attributes of the two most under-represented clusters are the same, and they mainly concern the number of family houses, household income, and density of inhabitants.


Figure: Top 10 attributes of the two most under-represented clusters

In summary, the major attributes in the over-represented population are the share of middle and upper-class cars, financial status, and age. The major attributes in the under-represented population are the number of family houses, household income, and density of inhabitants.

2.3 Supervised Learning Model

Now I will build a supervised learning model to predict whether or not an individual will become a customer. The “MAILOUT” data has been split into two approximately equal parts, each with almost 43 000 data rows. Each of the rows in the “MAILOUT” data files represents an individual who was targeted in a mailout campaign. I will train and validate the model on the “TRAIN” partition, which includes a column, “RESPONSE”, that states whether or not a person became a customer. Then I will create predictions on the “TEST” partition, where the “RESPONSE” column has been withheld.

The training data is cleaned using the previously defined cleaning functions, followed by imputation and feature scaling. I tested four different methods to find the best classifier: Logistic Regression, Random Forest Classifier, AdaBoostClassifier, and Gradient Boosting Classifier. I used 5-fold cross-validation, with ROC_AUC as the score to evaluate performance, since this is a problem with very imbalanced classes. The classifier function is defined as follows:

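from sklearn.model_selection import GridSearchCV

# X and y (cleaned training features and RESPONSE labels) must already be defined
def classifier(estimator, param_grid, X=X, y=y):
    # Grid search with 5-fold cross-validation, scored by ROC AUC
    grid = GridSearchCV(estimator=estimator, param_grid=param_grid, scoring='roc_auc', cv=5)
    grid.fit(X, y)
    print('Estimator:', grid.best_estimator_)
    print('Score:', grid.best_score_)
    return grid.best_estimator_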
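A comparison of the four model families with 5-fold cross-validated ROC AUC could be run along these lines. The sketch uses a small synthetic, imbalanced dataset and default hyperparameters, both of which are assumptions; the scores it prints say nothing about the MAILOUT data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (about 10% positives) standing in for the training set.
X_toy, y_toy = make_classification(n_samples=600, n_features=20, n_informative=5,
                                   weights=[0.9, 0.1], random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=42),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=42),
}

# Mean ROC AUC over 5 stratified folds for each model.
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, scoring="roc_auc", cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```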


The results show ROC_AUC scores of 0.68, 0.52, 0.76, and 0.78 for Logistic Regression, Random Forest Classifier, AdaBoostClassifier, and Gradient Boosting Classifier, respectively. A higher score indicates a better model, one that achieves a higher recall while keeping the false positive rate low. So, I selected Gradient Boosting Classifier as the estimator and optimized its parameters using GridSearchCV. I tested different learning rates and numbers of estimators:

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [10, 100, 200]
}


The optimized parameters are learning_rate = 0.1 and n_estimators=100, and the score is 0.78. This model is used as the final estimator.


The final estimator is:


GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto', random_state=42,
                           subsample=1.0, tol=0.0001, validation_fraction=0.1,
                           verbose=0, warm_start=False)

I checked the top 10 most important features in the final model. D19_SOZIALES is the most important one, but no description is available for it. Other important features include the number of cars, the share of Ford, the year of building, the number of professional title holders, etc. This finding broadly agrees with the result of the customer segmentation study.

Figure: Top 10 most important features
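A ranking like this can be read from the feature_importances_ attribute of a fitted gradient boosting model. The sketch below fits a model on synthetic data with placeholder feature names, so the resulting ranking is purely illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with placeholder feature names.
X_arr, y = make_classification(n_samples=400, n_features=15, n_informative=4,
                               random_state=42)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(15)])

model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```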

2.4 Kaggle Competition

Now it is time to apply the model to the TEST dataset and submit the predictions to the Kaggle competition. The entry to the competition is a CSV file with two columns. The first column is “LNR”, which acts as an ID number for each individual in the “TEST” partition. The second column, “RESPONSE”, is the predicted probability that each individual became a customer.


I submitted it to Kaggle, and the final score is 0.68.
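A two-column submission of this shape can be built from predict_proba. In the sketch below the training data, the test data, and the LNR values are synthetic placeholders; only the column names “LNR” and “RESPONSE” and the two tuned parameters come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-ins for the cleaned TRAIN and TEST partitions.
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, y_train, X_test = X_all[:200], y_all[:200], X_all[200:]
lnr = np.arange(1000, 1000 + len(X_test))  # hypothetical ID numbers

model = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class.
submission = pd.DataFrame({
    "LNR": lnr,
    "RESPONSE": model.predict_proba(X_test)[:, 1],
})

# to_csv returns the CSV text when no path is given; pass a file path to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```

Submitting probabilities rather than hard 0/1 labels matters here because ROC AUC is computed from the ranking of the scores.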


3. Conclusions

In this project, I analyzed the data of the customers of a mail-order sales company in Germany, which led to some interesting findings.


  • I used an unsupervised learning technique (k-means) to perform customer segmentation. It turns out that the major features in the over-represented population are the share of middle and upper-class cars, financial status, and age. On the other hand, the major features in the under-represented population are the number of family houses, household income, and density of inhabitants.
  • I compared four different supervised learning methods to predict which individuals are most likely to become customers of the company. Gradient Boosting Classifier gave a better result than Logistic Regression, Random Forest Classifier, and AdaBoostClassifier. The most important features include the number of cars, the share of Ford, the year of building, the number of professional title holders, etc. The predictions of the optimized model were submitted to Kaggle, and the score is 0.68.

The project could be improved in several respects. For data processing, one could try MinMaxScaler instead of StandardScaler and impute the data using different strategies. For unsupervised learning, exploring more attributes from more components and clusters would give a better understanding of the customer segmentation. For supervised learning, the model could be further optimized by testing more parameters.

The main findings and code are also available in my GitHub repo (https://github.com/tyuion/Customer-Segmentation).

Translated from: https://medium.com/@tyuion1215/customer-segmentation-report-for-arvato-financial-services-16a4cfe1acd4




