Exploring the Data Analysis Pipeline: A Case-Based Approach

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Sir Arthur Conan Doyle, Sherlock Holmes

The Business problem –

A Portuguese bank is facing a decline in revenues. Upon investigation, they have identified the root cause: clients are not making deposits as frequently as before.

A term deposit allows the bank to hold onto a deposit for a specific amount of time, which it can invest in higher-yield financial products for a profit. Further, the bank typically stands a better chance of convincing term deposit holders to buy other products, such as funds or insurance, to further increase its revenues.

As a result, the bank wishes to identify existing customers that have a greater chance of being a term deposit subscriber and focus marketing efforts on such clients.

Framing the ML problem –

Upon understanding the business problem, the first task as a data scientist is to match the business problem to a machine learning problem.

The target variable (y), the answer to the question ‘is this customer likely to be a long-term deposit subscriber?’, takes the form ‘Yes’ or ‘No’. Since we need to predict a category, this can be identified as a binary classification problem.

Getting a ‘feel’ of the dataset –

Before jumping into the ‘brainy’ stuff, it’s always handy to make sense of the data. By this I mean understanding the various feature variables or columns in the dataset, identifying numeric or categorical data types, the total number of rows, and so on.

To create an easy reference for understanding all the feature variables, let’s build a data dictionary that briefly describes each one.

[Image: Data Dictionary]

Using dataframe.shape, we learn there are 32,950 rows and 16 columns, and dataframe.head gives us a snapshot of the first five rows (the default) across the feature columns.
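Before running any of the snippets below, the following imports are assumed (the original post doesn’t show them, so this exact list is our reconstruction):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve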

dataframe = pd.read_csv('/content/new_train.csv')
print(dataframe.shape)
dataframe.head()

With dataframe.info we get the non-null count and the datatype of each variable; here, none of the feature columns contain null values.

dataframe.info()

With dataframe.describe we get a quantitative summary of the numerical variables.

dataframe.describe()

Data preparation –

Now we get to the most ‘unglamorous’ but crucial portion of our task. One of the first steps at this stage is to check for missing values and carry out missing value imputation based on the type of feature variable. Fortunately, the dataset at hand has no missing values.

# To identify the number of missing values in every feature
total = dataframe.isnull().sum()

# Converting the counts of missing values into fractions of the dataset
percent = (total/len(dataframe))
print(percent)

Another important aspect is to check for class imbalance: a predominance of one class in the target variable, typically anything beyond an 80:20 split. This matters because imbalance can lead to a systematic bias, i.e. a tendency of an ML model to favour the majority class. Our dataset shows an 89:11 imbalance in favour of ‘No’.

# Finding the percentage of each class in the target 'y'
class_values = (dataframe['y'].value_counts()/dataframe['y'].value_counts().sum())*100
print(class_values)

Exploratory Data Analysis –

The Sherlock Holmes stuff begins with the Exploratory Data Analysis. It is carried out in two stages: univariate and bivariate analysis. To create multiple visualisations at once, let’s split the data frame into numeric and categorical columns and use a simple for loop.

# Identifying the numerical features
numeric_data = dataframe.select_dtypes(include=np.number)
numeric_col = numeric_data.columns
print(numeric_data.head())
# Identifying the categorical features
categorical_data = dataframe.select_dtypes(include=['object'])
categorical_col = categorical_data.columns
print(categorical_data.head())

Univariate analysis –

Taking one feature variable at a time, we can study the mean, variance, range, median, mode, distribution, category imbalances and so on: essentially a descriptive analysis that identifies the important characteristics of our customer base.

Firstly, let’s deal with the categorical variables.

plt.style.use('ggplot')

# Plotting a bar chart for each of the categorical variables
for column in categorical_col:
    plt.figure(figsize=(20,4))
    plt.subplot(121)
    dataframe[column].value_counts().plot(kind='bar')
    plt.title(column)
[Images: bar charts for each categorical variable]

Observations –

1. Over 65% of our total customers belong to Admin, Blue Collar and Technician job categories. These are steady middle-income jobs that are not very reactive to economic shocks in the short run.

2. About 60% of the customers are married, implying financial responsibility not only towards themselves but also towards a spouse and children, which demands financial savviness on their part. Additionally, some of these could be double-income families highly interested in future savings and investments.

3. Most of the customers have a ‘University Degree’ as their highest educational qualification, so it is safe to assume that, compared with the other education categories, they are relatively better informed about the economy, current affairs and personal finance, whether directly or through friends and relatives.

4. Close to 79% of our customers have never defaulted on credit before, which is a good indicator of their creditworthiness and financial responsibility.

5. There is almost an equal split between customers who have opted for a housing loan and those who have not, but about 83% of our customers have not opted for a personal loan. Housing loans have lower interest rates than personal loans, and since most families would have only one loan, we can expect them to have surplus funds for future savings after their monthly EMI deductions.

6. Cell phones seem to be the most favoured mode of reaching out to the customers, so campaigns should be designed with this behaviour in mind.

7. Most of the customers have previously been contacted in the months of May, June, July and August, possibly in line with their appraisal timelines.

8. The target variable (y) is imbalanced — 89% ‘No’ vs 11% ‘Yes’

Since some categorical columns contain ‘unknown’ values, let’s do a mode imputation, replacing ‘unknown’ with the mode category of that feature variable.

# Imputing the 'unknown' values of categorical features with the mode
for column in categorical_col:
    mode = dataframe[column].mode()[0]
    dataframe[column] = dataframe[column].replace('unknown', mode)

Now, let’s analyse the numerical columns.

# Plotting the distribution of each numerical variable
# (sns.distplot is deprecated in newer seaborn; sns.histplot(..., kde=True) is the modern equivalent)
for column in numeric_col:
    plt.figure(figsize=(20,5))
    plt.subplot(121)
    sns.distplot(dataframe[column])
    plt.title(column)
[Images: distribution plots for each numerical variable]

Observations from analysis of Numeric Feature Variables –

1. Looking at the distribution curves for ‘age’, ‘duration’ and ‘campaign’, we can observe that they are skewed to the right, indicating the presence of outliers (a quick numeric check follows these observations).

2. The plot for ‘pdays’ shows that a huge number of customers have never been contacted previously, revealing a large, untapped set of customers.

3. The feature columns ‘pdays’ and ‘previous’ mostly contain a single value (999 and 1 respectively), so their predictive power would be very low, and it is safe to remove them from our dataset before model building.

4. The duration column has a long tail: some customers have spent a large amount of time conversing with the bank during previous campaigns, suggesting high interest. These customers should be given special emphasis in future campaigns, as they may be more likely to actually subscribe to our long-term products.
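As the quick numeric check promised in observation 1 (our addition, not in the original post), pandas can quantify skewness directly; clearly positive values confirm the right skew seen in the plots:

# Skewness per numeric column: values well above 0 indicate a right skew
print(dataframe[numeric_col].skew())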

# Dropping the low-information columns
dataframe.drop(['pdays','previous'], axis=1, inplace=True)

# Refreshing the numeric column list so later loops skip the dropped columns
numeric_col = dataframe.select_dtypes(include=np.number).columns

Bivariate Analysis –

The next layer of analysis is a bivariate analysis of the feature variables against the target variable, using bar charts for visualisation.

# Plotting each categorical variable split by the target 'y'
for column in categorical_col:
    plt.figure(figsize=(20,4))
    plt.subplot(121)
    sns.countplot(x=column, hue='y', data=dataframe)
    plt.title(column)
    plt.xticks(rotation=90)
[Images: categorical variables vs. the target variable]

Observations -

1. Customers from ‘admin’ have the highest number of subscribers, followed by ‘technician’ and ‘blue collar’.

2. Most of the customers who have subscribed to long-term products are married.

3. ‘University Degree’ customers have the highest number of subscribers, followed by ‘high school’.

4. None of our subscribers have defaulted on credit before.

5. Our customers have subscribed more or less equally, irrespective of whether they have taken a housing loan. Customers who have not taken a personal loan account for more subscribers, which is in line with our expectation from the univariate analysis.

6. Customers that ended up subscribing were reached via cell phone rather than landline; hence cell phones should be the mode of communication for our campaigns.

Treating outliers and label encoding –

There are different techniques for treating outliers, depending on the skewness of the feature variable.

The numeric feature variables in our data are skewed to the right, so a transformation that compresses large values would normally be used. But some data points are zero, so applying logs or roots won’t solve the problem, and we proceed with winsorization instead. All values below the 5th percentile are set equal to the value at the 5th percentile, and likewise all values above the 95th percentile are set equal to the value at the 95th percentile, retaining the central 90% of the value range.

# Capping each numeric column at the 5th and 95th percentiles, as described above
for col in numeric_col:
    dataframe[col] = winsorize(dataframe[col], limits=[0.05, 0.05], inclusive=(True, True))
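A small sanity check of the capping behaviour (our illustrative example on a toy array): winsorizing the values 1 to 100 at 5% per side replaces the bottom five values with 6 and the top five with 95.

# Toy example: winsorize values 1..100 at the 5th/95th percentiles
a = np.arange(1, 101)
w = winsorize(a, limits=[0.05, 0.05])
print(w.min(), w.max())   # 6 95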

Label encoding is applied to the categorical variables and the target variable, since our ML models can only make sense of numeric data.

le = LabelEncoder()

# Iterating through each of the categorical columns and label encoding them
for feature in categorical_col:
    try:
        dataframe[feature] = le.fit_transform(dataframe[feature])
    except:
        print('Error encoding ' + feature)

# Saving the label encoded columns in our dataset
dataframe.to_csv('/content/new_train.csv', index=False)
dataframe.head()
[Image: label-encoded dataset]

Applying Vanilla models and model selection –

We call these ‘Vanilla’ models because we are not tweaking their default parameters.

Alright! We have arrived at the most ‘glamorous’ portion of our data analysis pipeline — Model building.

Since we have a classification problem at hand, let’s experiment with the usual suspects –

· Logistic Regression

· Decision Tree Classifier

· Random Forest Classifier

To select our final model, we will use area under the ROC curve — AUC score.

# Independent variables
X = dataframe.iloc[:,:-1]

# Target variable
y = dataframe.iloc[:,-1]

# Dividing the data into train and validation subsets
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
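Since the three models below repeat the same fit/predict/plot steps, one option (our refactoring suggestion, not part of the original post) is to wrap them in a small helper; the expanded snippets are kept below for clarity:

# Helper: fit a classifier, report ROC AUC on the validation set, and plot the ROC curve
def evaluate_model(model):
    model.fit(x_train, y_train)
    y_scores = model.predict(x_val)
    fpr, tpr, thresholds = roc_curve(y_val, y_scores)
    print('ROC_AUC_SCORE is', roc_auc_score(y_val, y_scores))
    plt.plot(fpr, tpr)
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.title('ROC curve')
    plt.show()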

Logistic Regression

# Defining and fitting the model
model = LogisticRegression()
model.fit(x_train, y_train)

# Predicting the values
y_scores = model.predict(x_val)

# Plotting the ROC curve
auc = roc_auc_score(y_val, y_scores)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is', roc_auc_score(y_val, y_scores))
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()

We have obtained a ROC AUC score of 0.5714; let’s see if we can find a better model.

Decision Tree Classifier

model = DecisionTreeClassifier()
model.fit(x_train, y_train)
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is', roc_auc_score(y_val, y_scores))
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()

That’s a significant improvement over the logistic regression model, with a score of 0.6953.

Random Forest Classifier

model = RandomForestClassifier()
model.fit(x_train, y_train)
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is', roc_auc_score(y_val, y_scores))
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
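One note we would add: the ROC curves above are drawn from hard 0/1 predictions, which yields a single operating point. All three classifiers expose predicted probabilities via predict_proba, which usually produces a smoother and more informative curve:

# Using class-1 probabilities instead of hard predictions for the ROC curve
y_probs = model.predict_proba(x_val)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_probs)
print('ROC_AUC_SCORE is', roc_auc_score(y_val, y_probs))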

Voila!

Decision Tree Classifier is the winning model with a score of 0.6953.

Using the decision tree classifier, we can predict the ‘y’ variable with an AUC of close to 0.70. We can apply the model to unlabeled data to predict which customers are more likely to be long-term deposit subscribers and target them with relevant marketing campaigns; a sketch of that scoring step follows.
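This illustration is ours: the file name new_customers.csv is hypothetical, and the new data must first go through the same imputation, winsorization and encoding pipeline as the training data.

# Scoring unlabeled customers; assumes identical preprocessing to the training set
new_customers = pd.read_csv('new_customers.csv')
predictions = model.predict(new_customers)
likely_subscribers = new_customers[predictions == 1]
print(likely_subscribers.head())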

The journey ahead –

We have successfully applied a supervised learning technique to tackle our ML problem, but it does not end here. Treating the class imbalance and carrying out hyperparameter tuning can further improve our model metrics.

The future scope would be along the lines of –

· Treatment of class imbalance using SMOTE (see the sketch after this list)

· Feature selection techniques using RFE or Random Forest inbuilt feature selectors

· Carrying out hyper parameter tuning

· Experimenting with other model evaluation metrics

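As a hedged sketch of what the SMOTE and hyperparameter-tuning steps might look like (assuming the imbalanced-learn package is available; the parameter grid is purely illustrative):

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

# Oversample the minority class in the training split only,
# so the validation set stays untouched
x_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(x_train, y_train)

# Tune the winning decision tree with an illustrative grid, scored by ROC AUC
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [3, 5, 10, None], 'min_samples_leaf': [1, 10, 50]},
    scoring='roc_auc',
    cv=5,
)
grid.fit(x_train_bal, y_train_bal)
print(grid.best_params_, grid.best_score_)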

Translated from: https://medium.com/@imgvenkatesh/exploring-the-data-analysis-pipeline-a-case-based-approach-75fef6ba5dde
