Udacity Data Scientist Nanodegree Capstone Project: Using Unsupervised and Supervised Algorithms

As the title states, this is the capstone project of the Udacity Data Scientist Nanodegree. The goal of this project is to analyze demographics data for customers of a mail-order sales company in Germany.

The project is divided into four main steps, each with its own goals:

  1. Pre-process the data

The goal of this step is to get familiar with the provided data and perform different cleaning steps so the data can be used in the next stage.

Some things I did:

  • Checked missing values (columns and rows)
  • Transformed features (created dummy variables)
  • Imputed values to remove missing values
  • Scaled features
  • Dropped highly correlated features

2. Use unsupervised learning algorithms to perform customer segmentation

The objective in this step is to find features that differentiate between customers and the general population.

Some things I did:

  • Used PCA to reduce the dimensionality
  • Interpreted the first components to get an understanding of the attributes
  • Used KMeans to cluster the attributes and compared the two different groups

3. Use supervised learning algorithms to predict if an individual will become a customer

In this step a new dataset was introduced, which had the same attributes as before but with an additional column 'RESPONSE'. This column indicates if an individual became a customer.

The goal is to train a classification algorithm on that data.

Some things I did:

  • Checked multiple classifiers to find the best one
  • Hyperparameter tuning for the best classifier

4. Make predictions on an unseen dataset and upload the results to Kaggle

In the final step the trained classification algorithm is used to make predictions on unseen data and upload the results to the Kaggle competition.

Pre-processing of the data

In this part I will explain the steps I took to make the data usable. But first let's take a look at the datasets. Udacity provided four datasets for this project and two Excel files with descriptions of the attributes, since they are in German:

Udacity_AZDIAS_052018.csv:

  • Demographics data for the general population of Germany
  • 891 211 persons (rows) x 366 features (columns)

Udacity_CUSTOMERS_052018.csv:

  • Demographics data for customers of a mail-order company
  • 191 652 persons (rows) x 369 features (columns)

Udacity_MAILOUT_052018_TRAIN.csv:

  • Demographics data for individuals who were targets of a marketing campaign
  • 42 982 persons (rows) x 367 (columns)

Udacity_MAILOUT_052018_TEST.csv:

  • Demographics data for individuals who were targets of a marketing campaign
  • 42 833 persons (rows) x 366 (columns)

The first step I took was to check for missing values. From the visual assessment I noticed that the dataset AZDIAS contained missing values (NaNs), but there were also other encodings for missing or unknown data like '-1'. A quick check with the Excel file revealed that missing or unknown values are also encoded with -1, 0 or 9.

It wasn't possible to simply replace those numbers with np.NaN, because 9 or 0 also carry valid, different meanings for other attributes. So, I loaded the Excel file into pandas and created a DataFrame with the name of each attribute and the corresponding values for missing or unknown data. With a for-loop I went through the AZDIAS DataFrame and only performed the transformation for attributes that use -1, 0 or 9 as an encoding for missing or unknown data.
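
The sketch below shows one way this re-encoding can be implemented. It assumes the AZDIAS data is loaded as `azdias` and the attribute descriptions as `attr_info` with the hypothetical columns 'Attribute' and 'MissingCodes', where 'MissingCodes' holds a list like [-1, 9]; the real Excel file may name and structure these columns differently.

```python
import numpy as np
import pandas as pd

def replace_missing_codes(df: pd.DataFrame, attr_info: pd.DataFrame) -> pd.DataFrame:
    """Replace attribute-specific 'missing/unknown' codes with np.nan."""
    df = df.copy()
    # Map each attribute to its own list of codes meaning "missing or unknown"
    code_map = dict(zip(attr_info['Attribute'], attr_info['MissingCodes']))
    for col, codes in code_map.items():
        if col in df.columns:
            # Only this column's own codes are replaced, so a 9 or 0 that carries
            # a real meaning in another attribute stays untouched.
            df[col] = df[col].replace(codes, np.nan)
    return df

# azdias = replace_missing_codes(azdias, attr_info)
```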

At that point I also noticed that the number of attributes in the Excel file doesn't match the number of columns in the AZDIAS DataFrame. After further inspection I found that only 272 attributes appear in both the DataFrame and the Excel file. Based on the idea that I can only use attributes for which I have a description, I dropped those that weren't in both files, about 94 attributes in total. In the limitations section of this article I will come back to this decision, as it turned out to be quite a unique approach.

Now that all the missing values are accounted for, I plotted the count of missing values in each column in a histogram:

[Figure: count of missing values per column]

Based on the histogram I removed columns with more than 200,000 missing values. I also checked for missing values on the row level.

[Figure: count of missing values per row]

Based on this histogram I decided to remove rows that had more than 50 missing values. So all in all I removed 7 columns and 153955 rows.
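
A minimal sketch of this threshold-based dropping, assuming `azdias` already has its missing-value codes converted to NaN (the thresholds are the ones described above):

```python
import pandas as pd

def drop_sparse(df: pd.DataFrame, col_thresh: int = 200_000, row_thresh: int = 50) -> pd.DataFrame:
    """Drop columns, then rows, whose NaN count exceeds the given thresholds."""
    col_missing = df.isnull().sum()
    df = df.drop(columns=col_missing[col_missing > col_thresh].index)
    # Keep only rows with at most `row_thresh` missing entries
    return df[df.isnull().sum(axis=1) <= row_thresh]

# azdias = drop_sparse(azdias)  # in my run this removed 7 columns and 153955 rows
```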

Check non-numeric attributes

If I want to use the attributes in the learning algorithms, I need to make sure that all of them are numeric. The following attributes were marked as objects.

  • CAMEO_DEU_2015: detailed classification variable with more than 44 items on the scale
  • CAMEO_DEUG_2015: classification variable for social status with 9 items, but encoded in different dtypes (ints and floats in the same column), and some rows contained 'XX'
  • OST_WEST_KZ: indicator for the former region (West Germany, East Germany) encoded with W and O

I made the necessary transformations to CAMEO_DEUG_2015 and OST_WEST_KZ and decided to drop CAMEO_DEU_2015, because it has too many items.
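
A sketch of these transformations, assuming `azdias` is the working DataFrame; the numeric mapping chosen for OST_WEST_KZ (W -> 0, O -> 1) is my own choice, not prescribed by the data dictionary:

```python
import pandas as pd

def fix_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # CAMEO_DEUG_2015: mixed ints/floats/strings with some 'XX' entries.
    # errors='coerce' turns anything non-numeric (like 'XX') into NaN.
    df['CAMEO_DEUG_2015'] = pd.to_numeric(df['CAMEO_DEUG_2015'], errors='coerce')
    # OST_WEST_KZ: 'W' (West) / 'O' (East) becomes a binary numeric flag
    df['OST_WEST_KZ'] = df['OST_WEST_KZ'].map({'W': 0, 'O': 1})
    # CAMEO_DEU_2015 has too many categories to encode sensibly, so it is dropped
    return df.drop(columns=['CAMEO_DEU_2015'])

# azdias = fix_object_columns(azdias)
```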

Now that all of the attributes are numeric, I manually checked for categorical data that needed to be transformed to dummy variables. I discovered 11 categorical features:

  • ANREDE_KZ, CJT_GESAMTTYP, GEBAEUDETYP, GEBAEUDETYP_RASTER, HEALTH_TYP, KBA05_HERSTTEMP, KBA05_MAXHERST, KBA05_MODTEMP, NATIONALITAET_KZ, SHOPPER_TYP, VERS_TYP

In the same step I also noted which attributes needed to be dropped, because they would add too much complexity to the model (mainly attributes with a scale of more than 10 items); a small sketch of both steps follows below:

  • GFK_URLAUBERTYP, LP_FAMILIE_FEIN, LP_LEBENSPHASE_GROB, LP_FAMILIE_GROB, LP_LEBENSPHASE_FEIN
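
A minimal sketch of this re-encoding, assuming `azdias` is the cleaned DataFrame from the previous steps:

```python
import pandas as pd

categorical = ['ANREDE_KZ', 'CJT_GESAMTTYP', 'GEBAEUDETYP', 'GEBAEUDETYP_RASTER',
               'HEALTH_TYP', 'KBA05_HERSTTEMP', 'KBA05_MAXHERST', 'KBA05_MODTEMP',
               'NATIONALITAET_KZ', 'SHOPPER_TYP', 'VERS_TYP']
too_complex = ['GFK_URLAUBERTYP', 'LP_FAMILIE_FEIN', 'LP_LEBENSPHASE_GROB',
               'LP_FAMILIE_GROB', 'LP_LEBENSPHASE_FEIN']

def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the high-cardinality attributes listed above
    df = df.drop(columns=too_complex)
    # One indicator column per category level for the 11 categorical features
    return pd.get_dummies(df, columns=categorical)

# azdias = encode_categoricals(azdias)
```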

Now, with the main cleaning steps finished, I created a function that applies the same cleaning to the customers dataset (Udacity_CUSTOMERS_052018.csv). For the next steps it is very important that both datasets have the same shape and columns. I had to remove one dummy-variable column, 'GEBAEUDETYP_5.0', from the customers dataset.

Imputation and scaling features

To use columns with missing values I imputed the median for each column. I decided to use the median because most of the attributes are ordinally scaled, which means that they are categorical but have a quasi-linear ordering. In that case the median is the 'best' way to impute.

The second to last thing I did in the preprocessing step was to scale the features. I chose to standardize them, which means that the transformed data has a mean of 0 and a standard deviation of 1.
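
A sketch of both steps with scikit-learn; fitting the imputer and scaler on the general population and reusing the fitted objects on the customers data is one way to keep the two datasets comparable (the variable names are assumptions):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

def impute_and_scale(df: pd.DataFrame, fit: bool = True) -> pd.DataFrame:
    """Median-impute and standardize; fit=True learns the statistics from df."""
    values = imputer.fit_transform(df) if fit else imputer.transform(df)
    values = scaler.fit_transform(values) if fit else scaler.transform(values)
    return pd.DataFrame(values, columns=df.columns)

# azdias_scaled = impute_and_scale(azdias, fit=True)
# customers_scaled = impute_and_scale(customers, fit=False)
```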

And finally, the last thing I did was to eliminate columns that had a correlation above 0.95.

  • KBA13_HERST_SONST, KBA13_KMH_250, LP_STATUS_GROB, ANREDE_KZ_2.0, KBA05_MODTEMP_5.0
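
A minimal sketch of this correlation filter, assuming `azdias_scaled` holds the scaled features; for every pair of columns with an absolute correlation above the threshold, one of the two is dropped:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# azdias_scaled = drop_correlated(azdias_scaled)
```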

2. Use unsupervised learning algorithms to perform customer segmentation

The first objective was to reduce the number of dimensions. After the preprocessing step I still had close to 300 features. To reduce dimensionality, we can use Principal Component Analysis (PCA). PCA uses the Singular Value Decomposition of the data to project it onto a lower-dimensional space. Simply put, it reduces the complexity of a model with many features.

[Figure: explained variance per number of principal components]

According to the book 'Hands-On Machine Learning', it is important to choose a number of dimensions that adds up to a sufficiently large portion of the variance, like 95%. So, in this case I can reduce the number of dimensions by roughly 50%, down to 150.
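
A sketch of the reduction step; passing 150 components follows the reasoning above, and alternatively `PCA(n_components=0.95)` lets scikit-learn pick the smallest number of components that explains 95% of the variance:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=150, random_state=42)
# azdias_pca = pca.fit_transform(azdias_scaled)
# customers_pca = pca.transform(customers_scaled)    # same projection for the customers
# print(pca.explained_variance_ratio_.sum())         # should be close to 0.95
```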

I will briefly explain the first three components to give an idea of what they are about. I printed the top positive and negative weights for each component:
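
A small sketch of how these weights can be listed, assuming `pca` is the fitted PCA object and `feature_names` are the columns of the scaled data:

```python
import pandas as pd

def top_weights(pca, feature_names, component: int, n: int = 5) -> pd.Series:
    """Return the n most positive and n most negative feature weights of a component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    return pd.concat([weights.nlargest(n), weights.nsmallest(n)])

# print(top_weights(pca, azdias_scaled.columns, component=0))
```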

Principal Component 1

The first component is mainly related to wealth, status and the number of family houses in the region (PLZ8):

  • HH_EINKOMMEN_SCORE: estimated household income
  • CAMEO_DEUG_2015: social status of the individual (upper class -> urban working class)
  • PLZ8_ANTG1/3/4: number of family houses in the neighborhood
  • MOBI_REGIO: moving patterns (high mobility -> very low mobility)
  • LP_STATUS_FEIN: social status (typical low-income earners -> top earners)

Principal Component 2

The second component is related to cars:

  • KBA13_HERST_BMW_BENZ: share of BMW & Mercedes Benz within the PLZ8
  • KBA13_SEG_OBERMITTELKLASSE: share of upper middle-class cars and upper-class cars (BMW5er, BMW7er etc.)
  • KBA13_HALTER_50/55/20: age of car owner

Principal Component 3

The third component is related to age, financial decisions and transactions:

  • PRAEGENDE_JUGENDJAHRE: dominating movement in the person's youth (avantgarde or mainstream)
  • FINANZ_SPARER: financial typology: money saver (very high -> very low)
  • D19_GESAMT_ANZ_24: transaction activity TOTAL POOL in the last 24 months (no transactions -> very high activity)
  • FINANZ_VORSORGER: financial typology: be prepared (very high -> very low)
  • ALTERSKATEGORIE_GROB: age classification through first-name analysis

Clustering

Now that we have reduced the number of dimensions in both datasets and gained a brief understanding of the first components, it is time to cluster them and see if there are any differences between the clusters from the general population and the ones from the customers population. To achieve this, I will use KMeans.

[Figure: KMeans inertia for different numbers of clusters (elbow plot)]

We can use the 'elbow' method to get the 'right' number of clusters. An 'elbow' is the point in the above chart where the decrease in inertia almost flattens. In my case there isn't a clear elbow point; 10 seems to be a good choice, enough clusters to compare against but not so many that they add unnecessary complexity.
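
A minimal sketch of the elbow search, assuming `azdias_pca` is the PCA-transformed general population data:

```python
from sklearn.cluster import KMeans

def compute_inertias(data, k_range=range(2, 21)):
    """Fit KMeans for each k and record the inertia (sum of squared distances)."""
    inertias = []
    for k in k_range:
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        km.fit(data)
        inertias.append(km.inertia_)
    return list(k_range), inertias

# ks, inertias = compute_inertias(azdias_pca)
# Plotting ks against inertias gives the elbow chart shown above.
```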

Comparing the AZDIAS clusters with the CUSTOMERS clusters

[Figure: share of individuals per cluster, general population vs. customers]

It is clear to see that almost every cluster differentiates between the customers and the general population. When looking at the bars we can easily see which clusters are overrepresented among the customers, which means that customers can be described by the features of those clusters. Customers can be described with the features from clusters 0, 7 and 6.

We can also describe individuals that won't become our customers by looking at the clusters where the general population is overrepresented, like clusters 8, 3 and 9.
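
A sketch of how this comparison can be computed. It assumes KMeans is fit on the general population and then used to assign the customers to the same clusters; the variable names follow the earlier sketches:

```python
import pandas as pd
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
# azdias_labels = kmeans.fit_predict(azdias_pca)
# customers_labels = kmeans.predict(customers_pca)

def cluster_shares(labels, n_clusters: int = 10) -> pd.Series:
    """Fraction of individuals assigned to each cluster."""
    counts = pd.Series(labels).value_counts(normalize=True)
    return counts.reindex(range(n_clusters), fill_value=0.0)

# comparison = pd.DataFrame({'general': cluster_shares(azdias_labels),
#                            'customers': cluster_shares(customers_labels)})
# comparison['diff'] = comparison['customers'] - comparison['general']
# A positive 'diff' means the cluster is overrepresented among customers.
```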

The main customers of the company

An individual that is part of cluster 0:

  • lives in an area with mostly family homes and low unemployment
  • has a higher affinity for a combative ('fightfull') attitude and is financially prepared
  • but has low financial interest, is not an investor and not good at saving money
  • is not really culturally minded

An individual that is part of cluster 7 is mainly described by their car choice:

  • has a high income and a high share of upper-class cars (BMW 7er etc.)
  • high share of cars per household
  • very few cars with a max speed between 110 and 210, and few built between 2000 and 2003, so mostly newer cars
  • has far fewer vans in their area, compared to the country average

An individual that is part of cluster 6:

  • lives in a low-density area in an old building, with only a few family houses around
  • has low purchasing power but still a higher car share per household
  • is more minimalistic / independent
  • low financial interest
  • high online affinity

Now let's look at the clusters where customers are underrepresented.

An individual that is part of cluster 8:

  • has high purchasing power, but a lower income
  • is part of the lower middle-class / working-class
  • has a low number of family homes in the area
  • low online affinity
  • low car share per household

An individual that is part of cluster 3:

  • has high mobility, but a low number of cars with less than 5 seats
  • drives mostly small cars (a high number of very small cars (Ford Fiesta, Ford Ka) and a low number of BMW and Mercedes)
  • is mostly between 21 and 25 and drives cars from Asian manufacturers
  • high share of car owners below 31 within the PLZ8
  • and, interestingly, a high number of campers

For cluster 9 it is almost the same.

3. Use supervised learning algorithms to predict if an individual will become a customer

Now that I have found which parts of the population are more likely to be customers of the mail-order company, it's time to build the prediction model.

I used the provided dataset Udacity_MAILOUT_052018_TRAIN.csv to train various models, selected the best one and did some hyper-parameter tuning to increase the effectiveness of my model. But first I had to clean the data.

The cleaning was relatively simple because I could reuse the cleaning function created in the first part of the project. After that I checked the missing values:

[Figures: missing values per column and per row in the MAILOUT training set]

Based on the histograms I decided to only drop rows with more than 30% missing values.

To use the data for the learning algorithm I imputed the median, standardized the data and dropped highly correlated features (> 0.95). In each step where I dropped columns, I made sure to drop the same columns in the test set (Udacity_MAILOUT_052018_TEST.csv), so that I could later make predictions on this unseen dataset.

Finally, the fun part begins: creating the model and making predictions.

First, I checked the distribution of the target variable 'RESPONSE'. As it turns out, 'RESPONSE' is highly imbalanced: 0: 34565 and 1: 435.

If the variable of interest is imbalanced it is important to make sure of the following things:

  • Use stratification for the training and validation sets: stratification is a technique to distribute the samples based on their classes, so that the training set and the validation set have a similar ratio of classes.
  • Choose the right evaluation metric: simply choosing accuracy won't give you 'accurate' evaluations. The best one in this case would be the ROC-AUC score; this article explains it.
  • Use a gradient boosting algorithm: I will run multiple classification algorithms and choose the best one.

There are also some more advanced techniques to deal with imbalance that I won't implement. You can read about them here.

Classification algorithms

I tested the following classification algorithms with cross-validation:

  • LogisticRegression
  • DecisionTreeClassifier
  • RandomForestClassifier
  • AdaBoostClassifier
  • GradientBoostingClassifier

I used sklearn's StratifiedKFold method to make sure I used stratification when evaluating the classifiers.

I created a pipeline in which each model in the classifier dictionary gets evaluated with the 'roc_auc' scoring metric; a minimal sketch of this loop follows.
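
The sketch assumes `X` and `y` hold the cleaned MAILOUT training features and the 'RESPONSE' column:

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=42),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'AdaBoostClassifier': AdaBoostClassifier(random_state=42),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# for name, clf in classifiers.items():
#     scores = cross_val_score(clf, X, y, cv=cv, scoring='roc_auc')
#     print(f'{name}: {scores.mean():.4f}')
```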

Results

  • LogisticRegression: 0.5388771114225904
  • DecisionTreeClassifier: 0.5065447241661969
  • RandomForestClassifier: 0.5025457616916987
  • AdaBoostClassifier: 0.5262902976401282
  • GradientBoostingClassifier: 0.5461740415775044

As expected, the classifier using gradient boosting got the best result. In the next step I used grid search to find the best hyperparameters for the GradientBoostingClassifier.

With all other options left at their defaults, the best parameters were (the search itself is sketched below):

  • learning_rate: 0.1
  • max_depth: 5
  • n_estimators: 200
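
A sketch of the tuning step with scikit-learn's GridSearchCV; the parameter grid below is illustrative, and the values listed above came out as the best combination in my run:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5],
    'n_estimators': [100, 200],
}
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
# grid.fit(X, y)
# print(grid.best_params_, grid.best_score_)
```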

I increased the score from 0.546 to 0.594.

4. Make predictions on an unseen dataset and upload the results to Kaggle

Now that I have tuned and trained the best model, I can finally make predictions on the unseen dataset (Udacity_MAILOUT_052018_TEST.csv).

For the final part I just had to impute the missing values, standardize the new dataset, make sure that the columns were the same, and run the trained model on the new data.

I transformed the output to the requirements of the Kaggle competition and uploaded my submission file.
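
A sketch of this final step; `best_model` is assumed to be the tuned classifier, `X_test` the cleaned test features, and the id column name ('LNR') is an assumption about the competition's submission format:

```python
import pandas as pd

def make_submission(best_model, X_test, ids, path='submission.csv') -> pd.DataFrame:
    # Submitting the probability of the positive class is a common choice
    # when the competition is scored on ROC-AUC.
    probabilities = best_model.predict_proba(X_test)[:, 1]
    submission = pd.DataFrame({'LNR': ids, 'RESPONSE': probabilities})
    submission.to_csv(path, index=False)
    return submission
```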

I got a score of 0.536 in the Kaggle competition.

Conclusions

To recap, the first goal of this project was to use an unsupervised learning algorithm to uncover differences between customers and the general population. The second goal was to train a supervised learning algorithm to predict if an individual would become a customer, and the last goal was to use this trained model to make predictions on unseen data and upload the results to Kaggle.

The first part (unsupervised learning) was very challenging for me. It was the first time that I worked with a huge data file (> 1 GB). So, at first it was quite frustrating to work in the provided workspace, since some operations took a while. I decided to download the data and work on it on my local machine.

Besides the huge dataset, the data cleaning was also very challenging, and I quite frequently used methods that I hadn't used before, so on the other hand it was quite rewarding to implement a new method and get the expected result.

Again, it became clear that most of the work of a data scientist is the cleaning step.

Limitations

My final score is relatively low compared to others on Kaggle. I looked at a few other notebooks on GitHub to get an idea why. It seems that my approach of only keeping the columns that appear both in the dataset and in the Excel file is quite unique. To recap, I dropped 94 columns that weren't in both files, on the idea that I could only use attributes for which I had a description. After the analysis I inspected the Excel file and noticed that some attributes are simply spelled differently in the Excel file and the dataset. So, all in all, I probably dropped some columns that might have increased my score.

Another thing I noticed is that I dropped rows in the supervised learning part, which is debatable: the variable of interest is highly imbalanced, and one can argue that it would be better to keep rows with missing values so that there is a higher chance for the minority class to appear.

All in all, here are some things that could be checked to enhance the final score:

  • get a better understanding of the attributes and check if more of them can be used without dropping them (keep attributes with more than 10 items)
  • don't drop attributes just because they aren't in the Excel file
  • use more advanced methods to impute missing values (imputation based on distributions, or even a learning algorithm to predict the missing values)
  • use more advanced techniques to deal with imbalanced data (resampling to get more balanced data, weighted classes / cost-sensitive learning)

If you are interested in the code, you can take a look at this GitHub repo.

Translated from: https://medium.com/@markusmller_92879/udacity-data-scientist-nanodegree-capstone-project-using-unsupervised-and-supervised-algorithms-c1740532820a
