部分词典
列名 | 对照 |
---|---|
Recency | 距离最后一次到店的时间 |
Kidhome | 家里有多少个儿童 |
NumStorePurchases | 到店购买数量 |
AcceptedCmp3 | 参加活动3 |
Response | 参加最后一次活动 |
注:对于定性变量,1为是,0为否
前言
数据集github链接
总体目标:你是一名营销分析师,首席营销官告诉你,最近的营销活动没有预期的那么有效。您需要分析数据集来理解这个问题,并提出数据驱动的解决方案。)
一、探索性数据分析
导入库,读取文件
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv('ml_project1_data.csv')
1.是否存在空值或异常值,如何处理
df.isna().sum()
输出:
ID 0 Year_Birth 0 Education 0 Marital_Status 0 Income 24 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Z_CostContact 0 Z_Revenue 0 Response 0 dtype: int64
Income列有24个空值,先看看它的分布再决定如何处理空值
sns.boxplot(data=df,y='Income')
输出:
有几个异常大的值,为了不改变数据分布所以这里采用中位数填充空值
df['Income'] = df['Income'].fillna(df['Income'].median())
2.有变量需要转换数据类型吗
先看一下数据类型
df.info()
输出:
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Z_CostContact 2240 non-null int64 27 Z_Revenue 2240 non-null int64 28 Response 2240 non-null int64 dtypes: float64(1), int64(25), object(3) memory usage: 507.6+ KB
有三个object类型的列,下面看看它们分别是什么类型的数据
df[['Education','Marital_Status','Dt_Customer']].head()
输出:
Education | Marital_Status | Dt_Customer | |
---|---|---|---|
0 | Graduation | Single | 2012-09-04 |
1 | Graduation | Single | 2014-03-08 |
2 | Graduation | Together | 2013-08-21 |
3 | Graduation | Together | 2014-02-10 |
4 | PhD | Married | 2014-01-19 |
为方便处理,前两列需要map为数字,时间类就转为datetime类
不过再map之前需要先看下前两列的分布
f, axes = plt.subplots(1, 2, figsize=(16, 4),sharey=True)
for ax,col in zip(axes,['Education','Marital_Status']):
sns.countplot(x=df[col],ax=ax)
输出:
Marital_Status有三个明显异常值,先看看这些数据正不正常
df[df['Marital_Status'].isin(['Alone','Absurd','YOLO'])]
输出:
ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
131 | 433 | 1958 | Master | Alone | 61331.0 | 1 | 1 | 2013-03-10 | 42 | 534 | ... | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
138 | 7660 | 1973 | PhD | Alone | 35860.0 | 1 | 1 | 2014-05-19 | 37 | 15 | ... | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
153 | 92 | 1988 | Graduation | Alone | 34176.0 | 1 | 0 | 2014-05-12 | 12 | 5 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
2093 | 7734 | 1993 | Graduation | Absurd | 79244.0 | 0 | 0 | 2012-12-19 | 58 | 471 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 11 | 1 |
2134 | 4369 | 1957 | Master | Absurd | 65487.0 | 0 | 0 | 2014-01-10 | 48 | 240 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
2177 | 492 | 1973 | PhD | YOLO | 48432.0 | 0 | 1 | 2012-10-18 | 3 | 322 | ... | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
2202 | 11133 | 1973 | PhD | YOLO | 48432.0 | 0 | 1 | 2012-10-18 | 3 | 322 | ... | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
7 rows × 29 columns
婚姻状态值为Alone或YOLO的数据显示这些顾客家中都有儿女,因而可以将其归为Divorced类,而值为Absurd的顾客家中无儿女,这里简单的将其归为Single类
可以看到婚姻状况为YOLO两行数据除了ID列外其余列的数据相同,因此认定为重复项,因为数据集还可能有其它重复项,所以在这里先简单处理一下
print('除去',len(df)-len(df.drop_duplicates(subset=df.columns[1:])),'重复项')
df=df.drop_duplicates(subset=df.columns[1:])
输出:
除去 182 重复项
现在回头map前两列数据以及把时间列转为datetime类型
df['Marital_Status']=df['Marital_Status'].map({
'Single':'1','Together':'4','Married':'5',
'Divorced':'2','Widow':'3','Alone':'2',
'Absurd':'1','YOLO':'2'}).astype(int)
df['Education']=df['Education'].map({
'Graduation':'2', 'PhD':'5', 'Master':'4',
'Basic':'1', '2n Cycle':'3'}).astype(int)
df['Dt_Customer']=pd.to_datetime(df['Dt_Customer'])
注:数值大小根据来源为生活经验
3.你注意到数据中有什么异常吗?
现在看看其余列的数据分布是否有异常
df.drop(['ID','Education','Marital_Status','Dt_Customer'],1).plot(subplots=True, kind='box',
layout=(5,5), figsize=(16,16), patch_artist=True)
输出: