Exploratory Data Analysis for a Kaggle Competition

I have recently been working on a Kaggle competition: using feature data from a bank's customer samples to predict whether each customer will take out a loan.

Exploratory Data Analysis (EDA)

Data overview

import numpy as np
import pandas as pd

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head()

 

(Output: the first 5 rows of train_df, with 202 columns: ID_code, target, and var_0 … var_199.)
train_df.shape, test_df.shape

((200000, 202), (200000, 201))

Checking for missing values

def missing_data(data):
    # Count missing values per column, their percentage, and each column's dtype.
    total = data.isnull().sum()
    percent = total / data.isnull().count() * 100
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    tt['Types'] = [str(data[col].dtype) for col in data.columns]
    return np.transpose(tt)

%%time
missing_data(train_df)

(Output: Total and Percent are 0 for every column; ID_code is object, target is int64, and var_0 … var_199 are float64.)
%%time
missing_data(test_df)

(Output: Total and Percent are 0 for every column; ID_code is object and var_0 … var_199 are float64.)

Conclusion

Neither the training set nor the test set contains any missing values.
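The column-by-column check can also be collapsed into a single aggregate. A minimal sketch on a toy DataFrame (the real check would run on train_df loaded from train.csv):

```python
import pandas as pd

# Toy frame standing in for train_df; the real data comes from train.csv.
df = pd.DataFrame({'var_0': [1.0, 2.0, 3.0], 'var_1': [0.5, 0.7, 0.9]})

# Total number of missing cells across the entire frame.
total_missing = int(df.isnull().sum().sum())
print(total_missing)  # 0 means no missing values anywhere
```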

Inspecting the values

train_df.describe()
(Output: count, mean, std, min, 25%, 50%, 75%, and max for target and var_0 … var_199. For example, var_0 has mean ≈ 10.68, std ≈ 3.04, min ≈ 0.41, max ≈ 20.32.)
test_df.describe()

 

(Output: the same summary statistics for var_0 … var_199 in test_df. For example, var_0 has mean ≈ 10.66 and std ≈ 3.04, very close to the training values.)

Conclusions

1. The standard deviations (std) are large for most features.

2. min, max, mean, and std are very close between the training and test sets.

3. The per-feature mean values span a wide range.
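Conclusion 2 can be checked numerically instead of by eye. A minimal sketch with synthetic frames standing in for train_df and test_df:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = [f'var_{i}' for i in range(4)]
# Two synthetic sets drawn from the same distribution, standing in for train/test.
a = pd.DataFrame(rng.normal(10, 3, size=(1000, 4)), columns=cols)
b = pd.DataFrame(rng.normal(10, 3, size=(1000, 4)), columns=cols)

# Largest absolute gap in per-feature mean and std between the two sets.
mean_gap = (a.mean() - b.mean()).abs().max()
std_gap = (a.std() - b.std()).abs().max()
print(mean_gap, std_gap)  # small gaps => the two sets have similar statistics
```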

Visualizations

Scatter plots of a subset of features

def plot_feature_scatter(df1, df2, features):
    # Scatter each feature in df1 against the same feature in df2 (row-aligned),
    # on a 4x4 grid, so it expects up to 16 features.
    sns.set_style('whitegrid')
    fig, ax = plt.subplots(4, 4, figsize=(14, 14))
    for i, feature in enumerate(features, start=1):
        plt.subplot(4, 4, i)
        plt.scatter(df1[feature], df2[feature], marker='+')
        plt.xlabel(feature, fontsize=9)
    plt.show()

The plots above do not reveal much; the points all look tightly clustered.

Are the classes balanced?

import matplotlib.pyplot as plt
import seaborn as sns
sns.countplot(train_df['target'])

positive_num = train_df.target[train_df.target == 1].count()
negative_num = train_df.target[train_df.target == 0].count()

Positive samples: positive_num = 20098

Negative samples: negative_num = 179902

print("Proportion of target == 1: {}%".format(100 * train_df["target"].value_counts()[1]/train_df.shape[0]))

Proportion of target == 1: 10.049%

Clearly the classes are highly imbalanced, so class balancing techniques are worth considering.
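One straightforward balancing option is to randomly undersample the majority class. A minimal sketch on a toy frame (column names follow the competition data; oversampling or per-class weights in the model are equally reasonable alternatives):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Toy imbalanced frame: roughly 10% positives, mimicking the real target rate.
df = pd.DataFrame({'target': (rng.random(1000) < 0.1).astype(int),
                   'var_0': rng.normal(size=1000)})

pos = df[df['target'] == 1]
neg = df[df['target'] == 0].sample(n=len(pos), random_state=42)  # downsample majority
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=42)  # shuffle rows

print(balanced['target'].value_counts())  # equal counts of 0 and 1
```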

Per-feature sample distributions

First, look at the differences between target = 0 and target = 1.

def plot_feature_distribution(df1, df2, label1, label2, features):
    i = 0
    sns.set_style('whitegrid')
    plt.figure()
    fig, ax = plt.subplots(10,10,figsize=(18,22))

    for feature in features:
        i += 1
        plt.subplot(10,10,i)
        sns.kdeplot(df1[feature], bw=0.5,label=label1)
        sns.kdeplot(df2[feature], bw=0.5,label=label2)
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.show();
t0 = train_df.loc[train_df['target'] == 0]
t1 = train_df.loc[train_df['target'] == 1]
features = train_df.columns.values[2:102]
plot_feature_distribution(t0, t1, '0', '1', features)

features = train_df.columns.values[102:202]
plot_feature_distribution(t0, t1, '0', '1', features)

From the plots above we can see which features have noticeably different distributions for target 0 and target 1.

Distributions of the same features in train and test

features = train_df.columns.values[2:102]
plot_feature_distribution(train_df, test_df, 'train', 'test', features)

features = train_df.columns.values[102:202]
plot_feature_distribution(train_df, test_df, 'train', 'test', features)

The differences are barely visible, which means the test set matches the training set very well; this is good news for prediction.
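The visual match can also be quantified per feature with a two-sample Kolmogorov-Smirnov test. A sketch on synthetic stand-in columns (scipy is an extra dependency not used elsewhere in this post):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Two samples from the same distribution, standing in for one feature in train/test.
train_col = rng.normal(10, 3, 5000)
test_col = rng.normal(10, 3, 5000)

stat, p_value = ks_2samp(train_col, test_col)
print(stat, p_value)  # a large p-value gives no evidence the distributions differ
```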

Distributions of summary statistics

Mean values in train and test

# Plot the distribution of per-row means in the train and test sets
plt.figure(figsize=(16,6))
features = train_df.columns.values[2:202]
plt.title("Distribution of mean values per row in the train and test set")
sns.distplot(train_df[features].mean(axis=1),color="green", kde=True,bins=120, label='train')
sns.distplot(test_df[features].mean(axis=1),color="blue", kde=True,bins=120, label='test')
plt.legend()
plt.show()
# Plot the distribution of per-column means in the train and test sets
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values per column in the train and test set")
sns.distplot(train_df[features].mean(axis=0),color="magenta",kde=True,bins=120, label='train')
sns.distplot(test_df[features].mean(axis=0),color="darkblue", kde=True,bins=120, label='test')
plt.legend()
plt.show()

Standard deviations (std) in train and test

# Plot the distribution of per-row std values in the train and test sets
plt.figure(figsize=(16,6))
plt.title("Distribution of std values per row in the train and test set")
sns.distplot(train_df[features].std(axis=1),color="black", kde=True,bins=120, label='train')
sns.distplot(test_df[features].std(axis=1),color="red", kde=True,bins=120, label='test')
plt.legend();plt.show()
# Plot the distribution of per-column std values in the train and test sets
plt.figure(figsize=(16,6))
plt.title("Distribution of std values per column in the train and test set")
sns.distplot(train_df[features].std(axis=0),color="blue",kde=True,bins=120, label='train')
sns.distplot(test_df[features].std(axis=0),color="green", kde=True,bins=120, label='test')
plt.legend(); plt.show()

Mean values for target 0 and target 1

t0 = train_df.loc[train_df['target'] == 0]
t1 = train_df.loc[train_df['target'] == 1]
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values per row in the train set")
sns.distplot(t0[features].mean(axis=1),color="red", kde=True,bins=120, label='target = 0')
sns.distplot(t1[features].mean(axis=1),color="blue", kde=True,bins=120, label='target = 1')
plt.legend(); plt.show()

Min values in train and test

# Plot the distribution of per-row min values in the train and test sets
plt.figure(figsize=(16,6))
features = train_df.columns.values[2:202]
plt.title("Distribution of min values per row in the train and test set")
sns.distplot(train_df[features].min(axis=1),color="red", kde=True,bins=120, label='train')
sns.distplot(test_df[features].min(axis=1),color="orange", kde=True,bins=120, label='test')
plt.legend()
plt.show()
# Plot the distribution of per-column min values in the train and test sets
plt.figure(figsize=(16,6))
features = train_df.columns.values[2:202]
plt.title("Distribution of min values per column in the train and test set")
sns.distplot(train_df[features].min(axis=0),color="magenta", kde=True,bins=120, label='train')
sns.distplot(test_df[features].min(axis=0),color="darkblue", kde=True,bins=120, label='test')
plt.legend()
plt.show()

Max distributions in train and test

Min distributions for target 0 and target 1 in train

Max distributions for target 0 and target 1 in train

Skewness (skew) and kurtosis

# Plot the distribution of per-row skew in the train and test sets
plt.figure(figsize=(16,6))
plt.title("Distribution of skew per row in the train and test set")
sns.distplot(train_df[features].skew(axis=1),color="red", kde=True,bins=120, label='train')
sns.distplot(test_df[features].skew(axis=1),color="orange", kde=True,bins=120, label='test')
plt.legend()
plt.show()
# Plot the distribution of per-column skew in the train and test sets
plt.figure(figsize=(16,6))
plt.title("Distribution of skew per column in the train and test set")
sns.distplot(train_df[features].skew(axis=0),color="magenta", kde=True,bins=120, label='train')
sns.distplot(test_df[features].skew(axis=0),color="darkblue", kde=True,bins=120, label='test')
plt.legend()
plt.show()

The same approach continues with the skew distributions in train and test, the kurtosis distributions in train and test, and the skew and kurtosis distributions for target 0 and target 1 within train.
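For reference, pandas computes both statistics directly; a minimal sketch on synthetic data showing the row-wise (axis=1) and column-wise (axis=0) variants used in these plots (note that pandas reports excess kurtosis, i.e. 0 for a normal distribution):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 10)),
                  columns=[f'var_{i}' for i in range(10)])

row_skew = df.skew(axis=1)       # one skewness value per row (per sample)
col_kurt = df.kurtosis(axis=0)   # one excess-kurtosis value per column (per feature)
print(row_skew.shape, col_kurt.shape)  # (100,) (10,)
```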

Feature correlations

%%time
features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
correlations = train_df[features].corr().abs().unstack().sort_values(kind="quicksort").reset_index()
correlations = correlations[correlations['level_0'] != correlations['level_1']]
correlations.tail(10)
# show the 10 feature pairs with the highest absolute correlation

 

       level_0  level_1         0
39790  var_183  var_189  0.009359
39791  var_189  var_183  0.009359
39792  var_174  var_81   0.009490
39793  var_81   var_174  0.009490
39794  var_81   var_165  0.009714
39795  var_165  var_81   0.009714
39796  var_53   var_148  0.009788
39797  var_148  var_53   0.009788
39798  var_26   var_139  0.009844
39799  var_139  var_26   0.009844

As the table shows, even the strongest pairwise correlations are tiny; the features are essentially uncorrelated.
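Note that corr().unstack() lists every pair twice (var_a/var_b and var_b/var_a), which is why the rows above come in mirrored pairs. A sketch of one way to keep each pair once, using the strict upper triangle (synthetic data standing in for train_df[features]):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(500, 5)),
                  columns=[f'var_{i}' for i in range(5)])

corr = df.corr().abs()
# Mask everything except the strict upper triangle, so each unordered pair
# of distinct features appears exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.unstack().dropna().sort_values(ascending=False)
print(pairs.head())  # the strongest pairs, each listed once
```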

Checking duplicate values in each column

%%time
features = train_df.columns.values[2:202]
unique_max_train = []
unique_max_test = []
for feature in features:
    values = train_df[feature].value_counts()
    unique_max_train.append([feature, values.max(), values.idxmax()])
    values = test_df[feature].value_counts()
    unique_max_test.append([feature, values.max(), values.idxmax()])
# top 15 most-duplicated values in train
np.transpose((pd.DataFrame(unique_max_train, columns=['Feature', 'Max duplicates', 'Value'])).\
            sort_values(by = 'Max duplicates', ascending=False).head(15))
# top 15 most-duplicated values in test
np.transpose((pd.DataFrame(unique_max_test, columns=['Feature', 'Max duplicates', 'Value'])).\
            sort_values(by = 'Max duplicates', ascending=False).head(15))
(Output of the last expression, i.e. the test set:)

Feature   Max duplicates    Value
var_68              1104   5.0197
var_126              307  11.5357
var_108              302  14.1999
var_12               188  13.5546
var_91                86   6.9939
var_103               78   1.4659
var_148               74   4.0004
var_161               69   5.7114
var_25                60  13.5965
var_71                60   0.5389
var_43                58  11.5738
var_166               53   2.8446
var_125               53  12.2189
var_169               51   5.8455
var_133               50   6.6873

Observation: the most-duplicated features, their duplicate counts, and the corresponding values are all very similar between train and test.
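The core of the duplicate check above is just value_counts(); a minimal sketch on a toy column:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, 2.0, 3.0])  # toy stand-in for one feature column
counts = s.value_counts()
# The most frequent value and how many times it occurs.
print(counts.max(), counts.idxmax())  # 3 2.0
```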

Reference: https://www.kaggle.com/gpreda/santander-eda-and-prediction
