Santander Customer Transaction Prediction: EDA and Baseline

1 Description

At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

2 Prepare The Data

2.1 Import and preparation

First we import the packages that we might need in the solution.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import six.moves.urllib as urllib
import sklearn
import scipy
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score, roc_curve
import lightgbm as lgb
%matplotlib inline
PATH='E:/kaggle/santander-customer-transaction-prediction/'
train=pd.read_csv(PATH+'train.csv')
test=pd.read_csv(PATH+'test.csv')

Check the basic information of the data.

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 202 entries, ID_code to var_199
dtypes: float64(200), int64(1), object(1)
memory usage: 308.2+ MB

Check the dimensions of the data.

train.shape
(200000, 202)
train.head()
  ID_code  target    var_0   var_1    var_2   var_3    var_4   var_5   var_6    var_7  ...  var_190  var_191  var_192  var_193  var_194  var_195  var_196  var_197  var_198  var_199
0 train_0       0   8.9255 -6.7863  11.9081  5.0930  11.4607 -9.2834  5.1187  18.6266  ...   4.4354   3.9642   3.1364   1.6910  18.5227  -2.3978   7.8784   8.5635  12.7803  -1.0914
1 train_1       0  11.5006 -4.1473  13.8588  5.3890  12.3622  7.0433  5.6208  16.5338  ...   7.6421   7.7214   2.5837  10.9516  15.4305   2.0339   8.1267   8.7889  18.3560   1.9518
2 train_2       0   8.6093 -2.7457  12.0805  7.8928  10.5825 -9.0837  6.9427  14.6155  ...   2.9057   9.7905   1.6704   1.6858  21.6042   3.1417  -6.5213   8.2675  14.7222   0.3965
3 train_3       0  11.0604 -2.1518   8.9522  7.1957  12.5846 -1.8361  5.8428  14.9250  ...   4.4666   4.7433   0.7178   1.4214  23.0347  -1.2706  -2.9275  10.2922  17.9697  -8.9996
4 train_4       0   9.8369 -1.4834  12.8746  6.6375  12.2772  2.4486  5.9405  19.2514  ...  -1.4905   9.5214  -0.1508   9.1942  13.2876  -1.5121   3.9267   9.5031  17.9974  -8.8104

5 rows × 202 columns

This gives us a first look at the data. The column names are anonymized, so neither the names nor the raw values tell us what the features mean; we will have to explore further. Before that, let us first check whether there are any missing values.

2.2 Check the Data

# check the missing values
data_na=(train.isnull().sum()/len(train))*100
data_na=data_na.drop(data_na[data_na==0].index).sort_values(ascending=False)
missing_data=pd.DataFrame({'MissingRatio':data_na})
print(missing_data)
Empty DataFrame
Columns: [MissingRatio]
Index: []

We can see there are no missing values.

train.target.value_counts()
0    179902
1     20098
Name: target, dtype: int64

The dataset is quite unbalanced: roughly 90 percent of the rows have target ‘0’ while only about 10 percent have target ‘1’.
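To see the imbalance directly as proportions (a quick check, not part of the original notebook), we can normalize the counts:

# proportion of each target class (roughly 90% zeros, 10% ones)
print(train['target'].value_counts(normalize=True))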

Next, we extract the list of feature columns (everything except ID_code and target).

features=[col for col in train.columns if col not in ['ID_code','target']]

3 EDA

3.1 Check the Train-test Distribution

Before modeling, we want to understand how the data are distributed. Ideally, the train and test sets should look similar in every respect, so we examine this first.

First we check the mean values per row.

# check the distribution
plt.figure(figsize=(18,10))
plt.title('Distribution of mean values per row in the train and test set')
sns.distplot(train[features].mean(axis=1),color='green',kde=True,bins=120,label='train')
sns.distplot(test[features].mean(axis=1),color='red',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: distribution of mean values per row in the train and test set]

Then we apply the same operation to the columns.

plt.figure(figsize=(18,10))
plt.title('Distribution of mean values per column in the train and test set')
sns.distplot(train[features].mean(axis=0),color='purple',kde=True,bins=120,label='train')
sns.distplot(test[features].mean(axis=0),color='orange',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: distribution of mean values per column in the train and test set]

The standard deviation is also worth examining.

plt.figure(figsize=(18,10))
plt.title('Distribution of std values per rows in the train and test set')
sns.distplot(train[features].std(axis=1),color='black',kde=True,bins=120,label='train')
sns.distplot(test[features].std(axis=1),color='yellow',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: distribution of std values per row in the train and test set]

plt.figure(figsize=(18,10))
plt.title('Distribution of std values per column in the train and test set')
sns.distplot(train[features].std(axis=0),color='blue',kde=True,bins=120,label='train')
sns.distplot(test[features].std(axis=0),color='green',kde=True,bins=120,label='test')
plt.legend()
plt.show()

[Figure: distribution of std values per column in the train and test set]

The row-wise and column-wise distributions of the train and test sets are very similar, so the two sets appear well balanced.
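As an optional quantitative check (not part of the original analysis), a two-sample Kolmogorov–Smirnov test on the row means gives a rough sense of how close the two distributions are:

# KS test on per-row means of train vs test; a small statistic (and large p-value)
# means the two distributions are hard to tell apart -- just a rough sanity check
from scipy.stats import ks_2samp
stat, p_value = ks_2samp(train[features].mean(axis=1), test[features].mean(axis=1))
print('KS statistic: {:.4f}, p-value: {:.4f}'.format(stat, p_value))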

3.2 Check the Feature Correlation

# check the feature correlation
corrmat=train.corr()
plt.subplots(figsize=(18,18))
sns.heatmap(corrmat,vmax=0.9,square=True)
<matplotlib.axes._subplots.AxesSubplot at 0x25c953f7358>

[Figure: heatmap of the feature correlation matrix]

We can see that the correlations between features are negligible. It is still worth checking the largest correlation values explicitly.

%%time
correlations=train[features].corr().unstack().sort_values(kind='quicksort').reset_index()
correlations=correlations[correlations['level_0']!=correlations['level_1']]
Wall time: 16.2 s
correlations.tail(10)
       level_0  level_1         0
39790  var_122  var_132  0.008956
39791  var_132  var_122  0.008956
39792  var_146  var_169  0.009071
39793  var_169  var_146  0.009071
39794  var_189  var_183  0.009359
39795  var_183  var_189  0.009359
39796  var_174   var_81  0.009490
39797   var_81  var_174  0.009490
39798  var_165   var_81  0.009714
39799   var_81  var_165  0.009714
correlations.head(10)
   level_0  level_1         0
0   var_26  var_139 -0.009844
1  var_139   var_26 -0.009844
2  var_148   var_53 -0.009788
3   var_53  var_148 -0.009788
4   var_80    var_6 -0.008958
5    var_6   var_80 -0.008958
6    var_1   var_80 -0.008855
7   var_80    var_1 -0.008855
8   var_13    var_2 -0.008795
9    var_2   var_13 -0.008795

The maximum absolute pairwise correlation is below 0.01, so the feature correlations are unlikely to provide any useful information.
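To confirm this figure programmatically (a small check, not in the original notebook), we can take the maximum absolute value of the off-diagonal correlations directly:

# after unstack/reset_index, the correlation values live in the column labelled 0
print('Max absolute off-diagonal correlation: {:.6f}'.format(correlations[0].abs().max()))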

3.3 Further Exploring

What about the distribution of each individual feature? Here we plot all 200 distributions on a single grid of subplots.

# check the distribution of each feature
def plot_features(df1,df2,label1,label2,features):
    sns.set_style('whitegrid')
    plt.figure()
    fig,ax=plt.subplots(10,20,figsize=(18,22))
    i=0
    for feature in features:
        i+=1
        plt.subplot(10,20,i)
        sns.distplot(df1[feature],hist=False,label=label1)
        sns.distplot(df2[feature],hist=False,label=label2)
        plt.xlabel(feature,fontsize=9)
        locs, labels=plt.xticks()
        plt.tick_params(axis='x',which='major',labelsize=6,pad=-6)
        plt.tick_params(axis='y',which='major',labelsize=6)
    plt.show()
        
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
features=train.columns.values[2:202]
plot_features(t0,t1,'0','1',features)
<Figure size 432x288 with 0 Axes>

[Figure: per-feature distributions, target 0 vs target 1]

features=train.columns.values[2:202]
plot_features(train,test,'train','test',features)
<Figure size 432x288 with 0 Axes>

[Figure: per-feature distributions, train set vs test set]

The train and test distributions are nearly identical for every feature, which makes our work much more convenient.

3.4 Other Statistical Indicators Worth Checking

To get a more comprehensive grasp of the data, we can check other statistical indicators that might provide more insight, such as the per-row and per-column minimum, maximum, skewness, and kurtosis.

# Distribution of min and max
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
plt.figure(figsize=(18,10))
plt.title('Distribution of min values per row in the train set')
sns.distplot(t0[features].min(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].min(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: distribution of min values per row in the train set]

plt.figure(figsize=(18,10))
plt.title('Distribution of min values per column in the train set')
sns.distplot(t0[features].min(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].min(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.plot()

[Figure: distribution of min values per column in the train set]

plt.figure(figsize=(18,10))
plt.title('Distribution of max values per row in the train set')
sns.distplot(t0[features].max(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].max(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: distribution of max values per row in the train set]

plt.figure(figsize=(18,10))
plt.title('Distribution of max values per column in the train set')
sns.distplot(t0[features].max(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].max(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: distribution of max values per column in the train set]

# skewness and kurtosis
plt.figure(figsize=(18,10))
plt.title('Distribution of skew values per row in the train set')
sns.distplot(t0[features].skew(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].skew(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: distribution of skew values per row in the train set]

plt.figure(figsize=(18,10))
plt.title('Distribution of skew values per column in the train set')
sns.distplot(t0[features].skew(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].skew(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: distribution of skew values per column in the train set]

plt.figure(figsize=(18,10))
plt.title('Distribution of kurtosis values per row in the train set')
sns.distplot(t0[features].kurtosis(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].kurtosis(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: distribution of kurtosis values per row in the train set]

plt.figure(figsize=(18,10))
plt.title('Distribution of kurtosis values per column in the train set')
sns.distplot(t0[features].kurtosis(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].kurtosis(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()

[Figure: distribution of kurtosis values per column in the train set]

4 Feature Engineering and Modeling

4.1 Create New Features

We can add these row-wise statistics to the dataset as new features for modeling; they may carry useful signal.

# creating new features
idx=features=train.columns.values[2:202]
for df in [train,test]:
    df['sum']=df[idx].sum(axis=1)
    df['min']=df[idx].min(axis=1)
    df['max']=df[idx].max(axis=1)
    df['mean']=df[idx].mean(axis=1)
    df['std']=df[idx].std(axis=1)
    df['skew']=df[idx].skew(axis=1)
    df['kurt']=df[idx].kurtosis(axis=1)
    df['med']=df[idx].median(axis=1)
train[train.columns[202:]].head(10)
         sum      min      max      mean        std      skew      kurt      med
0  1456.3182 -21.4494  43.1127  7.281591   9.331540  0.101580  1.331023  6.77040
1  1415.3636 -47.3797  40.5632  7.076818  10.336130 -0.351734  4.110215  7.22315
2  1240.8966 -22.4038  33.8820  6.204483   8.753387 -0.056957  0.546438  5.89940
3  1288.2319 -35.1659  38.1015  6.441160   9.594064 -0.480116  2.630499  6.70260
4  1354.2310 -65.4863  41.1037  6.771155  11.287122 -1.463426  9.787399  6.94735
5  1272.3216 -44.7257  35.2664  6.361608   9.313012 -0.920439  4.581343  6.23790
6  1509.4490 -29.9763  39.9599  7.547245   9.246130 -0.133489  1.816453  7.47605
7  1438.5083 -27.2543  31.9043  7.192541   9.162558 -0.300415  1.174273  6.97300
8  1369.7375 -31.7855  42.4798  6.848688   9.837520  0.084047  1.997040  6.32870
9  1303.1155 -39.3042  34.4640  6.515577   9.943238 -0.670024  2.521160  6.36320
test[test.columns[201:]].head(10)
         sum      min      max      mean        std      skew      kurt      med
0  1416.6404 -31.9891  42.0248  7.083202   9.910632 -0.088518  1.871262  7.31440
1  1249.6860 -41.1924  35.6020  6.248430   9.541267 -0.559785  3.391068  6.43960
2  1430.2599 -34.3488  39.3654  7.151300   9.967466 -0.135084  2.326901  7.26355
3  1411.4447 -21.4797  40.3383  7.057224   8.257204 -0.167741  2.253054  6.89675
4  1423.7364 -24.8254  45.5510  7.118682  10.043542  0.293484  2.044943  6.83375
5  1273.1592 -19.8952  30.2647  6.365796   8.728466 -0.031814  0.113763  5.83800
6  1440.7387 -18.7481  37.4611  7.203693   8.676615 -0.045407  0.653782  6.66335
7  1429.5281 -22.7363  33.2387  7.147640   9.697687 -0.017784  0.713021  7.44665
8  1270.4978 -17.4719  28.1225  6.352489   8.257376 -0.138639  0.342360  6.55820
9  1271.6875 -32.8776  38.3319  6.358437   9.489171 -0.354497  1.934290  6.83960

Now let’s check the distributions of the new features.

def plot_new_features(df1,df2,label1,label2,features):
    sns.set_style('whitegrid')
    plt.figure()
    fig,ax=plt.subplots(2,4,figsize=(18,8))
    i=0
    for feature in features:
        i+=1
        plt.subplot(2,4,i)
        sns.kdeplot(df1[feature],bw=0.5,label=label1)
        sns.kdeplot(df2[feature],bw=0.5,label=label2)
        plt.xlabel(feature,fontsize=11)
        locs,labels=plt.xticks()
        plt.tick_params(axis='x',which='major',labelsize=8)
        plt.tick_params(axis='y',which='major',labelsize=8)
    plt.show()
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
features=train.columns.values[202:]
plot_new_features(t0,t1,'0','1',features)
<Figure size 432x288 with 0 Axes>

[Figure: distributions of the new features, target 0 vs target 1]

print('Columns in train_set:{} Columns in test_set:{}'.format(len(train.columns),len(test.columns)))
Columns in train_set:210 Columns in test_set:209
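The one-column difference is simply the target column, which exists only in the train set; a quick check (not in the original notebook) confirms this:

# the only column present in train but not in test should be 'target'
print(set(train.columns) - set(test.columns))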

4.2 Training the Model

Here’s a baseline model that uses LightGBM.

# training the model
features=[col for col in train.columns if col not in ['ID_code','target']]
target=train['target']
param={
    'bagging_freq':5,
    'bagging_fraction':0.4,
    'boost':'gbdt',
    'boost_from_average':'false',
    'feature_fraction':0.05,
    'learning_rate':0.01,
    'max_depth':-1,
    'metric':'auc',
    'min_data_in_leaf':80,
    'min_sum_hessian_in_leaf':10.0,
    'num_leaves':13,
    'num_threads':8,
    'tree_learner':'serial',
    'objective':'binary',
    'verbosity':1
}
folds = StratifiedKFold(n_splits=10, shuffle=False)  # random_state only matters when shuffle=True
oof = np.zeros(len(train))
predictions = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 3000)
    oof[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
Fold 0
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900229	valid_1's auc: 0.881617
[2000]	training's auc: 0.91128	valid_1's auc: 0.889429
[3000]	training's auc: 0.918765	valid_1's auc: 0.893439
[4000]	training's auc: 0.924616	valid_1's auc: 0.895931
[5000]	training's auc: 0.929592	valid_1's auc: 0.897636
[6000]	training's auc: 0.933838	valid_1's auc: 0.898786
[7000]	training's auc: 0.937858	valid_1's auc: 0.899318
[8000]	training's auc: 0.941557	valid_1's auc: 0.899733
[9000]	training's auc: 0.94517	valid_1's auc: 0.899901
[10000]	training's auc: 0.948529	valid_1's auc: 0.900143
[11000]	training's auc: 0.951807	valid_1's auc: 0.900281
[12000]	training's auc: 0.954903	valid_1's auc: 0.900269
[13000]	training's auc: 0.957815	valid_1's auc: 0.900107
[14000]	training's auc: 0.960655	valid_1's auc: 0.89994
Early stopping, best iteration is:
[11603]	training's auc: 0.953681	valid_1's auc: 0.900347
Fold 1
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900404	valid_1's auc: 0.882765
[2000]	training's auc: 0.911307	valid_1's auc: 0.889508
[3000]	training's auc: 0.918917	valid_1's auc: 0.893254
[4000]	training's auc: 0.924779	valid_1's auc: 0.895682
[5000]	training's auc: 0.929704	valid_1's auc: 0.897004
[6000]	training's auc: 0.933907	valid_1's auc: 0.897785
[7000]	training's auc: 0.93784	valid_1's auc: 0.89799
[8000]	training's auc: 0.941511	valid_1's auc: 0.898383
[9000]	training's auc: 0.945033	valid_1's auc: 0.898701
[10000]	training's auc: 0.94837	valid_1's auc: 0.898763
[11000]	training's auc: 0.951605	valid_1's auc: 0.89877
[12000]	training's auc: 0.954709	valid_1's auc: 0.898751
[13000]	training's auc: 0.957618	valid_1's auc: 0.898634
Early stopping, best iteration is:
[10791]	training's auc: 0.950935	valid_1's auc: 0.89889
Fold 2
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.90084	valid_1's auc: 0.87531
[2000]	training's auc: 0.911957	valid_1's auc: 0.883717
[3000]	training's auc: 0.919463	valid_1's auc: 0.888423
[4000]	training's auc: 0.925317	valid_1's auc: 0.891101
[5000]	training's auc: 0.930106	valid_1's auc: 0.892821
[6000]	training's auc: 0.93436	valid_1's auc: 0.89362
[7000]	training's auc: 0.938282	valid_1's auc: 0.89429
[8000]	training's auc: 0.941897	valid_1's auc: 0.894544
[9000]	training's auc: 0.945462	valid_1's auc: 0.894652
[10000]	training's auc: 0.948798	valid_1's auc: 0.894821
[11000]	training's auc: 0.952036	valid_1's auc: 0.894888
[12000]	training's auc: 0.955136	valid_1's auc: 0.894657
[13000]	training's auc: 0.958081	valid_1's auc: 0.894511
[14000]	training's auc: 0.960904	valid_1's auc: 0.894327
Early stopping, best iteration is:
[11094]	training's auc: 0.952334	valid_1's auc: 0.894948
Fold 3
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900276	valid_1's auc: 0.882173
[2000]	training's auc: 0.911124	valid_1's auc: 0.889171
[3000]	training's auc: 0.918758	valid_1's auc: 0.893614
[4000]	training's auc: 0.92463	valid_1's auc: 0.89627
[5000]	training's auc: 0.929475	valid_1's auc: 0.897519
[6000]	training's auc: 0.933971	valid_1's auc: 0.898018
[7000]	training's auc: 0.937925	valid_1's auc: 0.898396
[8000]	training's auc: 0.941684	valid_1's auc: 0.898475
[9000]	training's auc: 0.945229	valid_1's auc: 0.898597
[10000]	training's auc: 0.948626	valid_1's auc: 0.898725
[11000]	training's auc: 0.951822	valid_1's auc: 0.898657
[12000]	training's auc: 0.95488	valid_1's auc: 0.898504
[13000]	training's auc: 0.957871	valid_1's auc: 0.898503
Early stopping, best iteration is:
[10712]	training's auc: 0.950891	valid_1's auc: 0.898759
Fold 4
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900213	valid_1's auc: 0.883231
[2000]	training's auc: 0.911052	valid_1's auc: 0.890297
[3000]	training's auc: 0.918649	valid_1's auc: 0.894252
[4000]	training's auc: 0.924548	valid_1's auc: 0.896724
[5000]	training's auc: 0.92951	valid_1's auc: 0.897923
[6000]	training's auc: 0.93393	valid_1's auc: 0.898887
[7000]	training's auc: 0.937896	valid_1's auc: 0.899048
[8000]	training's auc: 0.941556	valid_1's auc: 0.899335
[9000]	training's auc: 0.945033	valid_1's auc: 0.899469
[10000]	training's auc: 0.94841	valid_1's auc: 0.899536
[11000]	training's auc: 0.951679	valid_1's auc: 0.899371
[12000]	training's auc: 0.954731	valid_1's auc: 0.899314
[13000]	training's auc: 0.95771	valid_1's auc: 0.899024
Early stopping, best iteration is:
[10307]	training's auc: 0.949415	valid_1's auc: 0.899591
Fold 5
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899832	valid_1's auc: 0.887942
[2000]	training's auc: 0.910762	valid_1's auc: 0.895511
[3000]	training's auc: 0.918306	valid_1's auc: 0.899303
[4000]	training's auc: 0.924334	valid_1's auc: 0.901522
[5000]	training's auc: 0.929353	valid_1's auc: 0.902569
[6000]	training's auc: 0.933747	valid_1's auc: 0.903396
[7000]	training's auc: 0.937725	valid_1's auc: 0.903844
[8000]	training's auc: 0.941422	valid_1's auc: 0.904181
[9000]	training's auc: 0.944946	valid_1's auc: 0.904167
[10000]	training's auc: 0.948326	valid_1's auc: 0.903872
[11000]	training's auc: 0.951534	valid_1's auc: 0.903846
Early stopping, best iteration is:
[8408]	training's auc: 0.942866	valid_1's auc: 0.904303
Fold 6
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899935	valid_1's auc: 0.884744
[2000]	training's auc: 0.910967	valid_1's auc: 0.892097
[3000]	training's auc: 0.918595	valid_1's auc: 0.896277
[4000]	training's auc: 0.924503	valid_1's auc: 0.898606
[5000]	training's auc: 0.929414	valid_1's auc: 0.89991
[6000]	training's auc: 0.933745	valid_1's auc: 0.900743
[7000]	training's auc: 0.937714	valid_1's auc: 0.901066
[8000]	training's auc: 0.94139	valid_1's auc: 0.900995
[9000]	training's auc: 0.944926	valid_1's auc: 0.901016
Early stopping, best iteration is:
[6986]	training's auc: 0.937661	valid_1's auc: 0.901085
Fold 7
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899968	valid_1's auc: 0.881017
[2000]	training's auc: 0.910826	valid_1's auc: 0.889131
[3000]	training's auc: 0.918484	valid_1's auc: 0.893968
[4000]	training's auc: 0.924432	valid_1's auc: 0.896794
[5000]	training's auc: 0.929348	valid_1's auc: 0.898531
[6000]	training's auc: 0.933656	valid_1's auc: 0.899541
[7000]	training's auc: 0.937572	valid_1's auc: 0.899903
[8000]	training's auc: 0.941255	valid_1's auc: 0.900259
[9000]	training's auc: 0.944865	valid_1's auc: 0.900205
[10000]	training's auc: 0.948314	valid_1's auc: 0.900135
[11000]	training's auc: 0.951556	valid_1's auc: 0.900281
[12000]	training's auc: 0.954647	valid_1's auc: 0.900202
[13000]	training's auc: 0.957629	valid_1's auc: 0.900083
[14000]	training's auc: 0.960473	valid_1's auc: 0.900019
Early stopping, best iteration is:
[11028]	training's auc: 0.951647	valid_1's auc: 0.900328
Fold 8
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.899642	valid_1's auc: 0.889764
[2000]	training's auc: 0.91067	valid_1's auc: 0.897589
[3000]	training's auc: 0.918364	valid_1's auc: 0.901604
[4000]	training's auc: 0.92421	valid_1's auc: 0.903614
[5000]	training's auc: 0.929197	valid_1's auc: 0.904601
[6000]	training's auc: 0.933471	valid_1's auc: 0.905101
[7000]	training's auc: 0.93741	valid_1's auc: 0.905128
[8000]	training's auc: 0.941136	valid_1's auc: 0.905215
[9000]	training's auc: 0.944594	valid_1's auc: 0.905207
[10000]	training's auc: 0.948042	valid_1's auc: 0.905092
[11000]	training's auc: 0.951259	valid_1's auc: 0.905037
Early stopping, best iteration is:
[8028]	training's auc: 0.941228	valid_1's auc: 0.905247
Fold 9
Training until validation scores don't improve for 3000 rounds.
[1000]	training's auc: 0.900193	valid_1's auc: 0.884426
[2000]	training's auc: 0.911194	valid_1's auc: 0.891741
[3000]	training's auc: 0.918785	valid_1's auc: 0.895999
[4000]	training's auc: 0.924653	valid_1's auc: 0.8984
[5000]	training's auc: 0.929607	valid_1's auc: 0.899584
[6000]	training's auc: 0.933898	valid_1's auc: 0.900395
[7000]	training's auc: 0.937896	valid_1's auc: 0.900785
[8000]	training's auc: 0.941574	valid_1's auc: 0.900916
[9000]	training's auc: 0.945132	valid_1's auc: 0.901081
[10000]	training's auc: 0.948568	valid_1's auc: 0.901075
[11000]	training's auc: 0.951714	valid_1's auc: 0.901069
[12000]	training's auc: 0.954815	valid_1's auc: 0.901025
[13000]	training's auc: 0.957792	valid_1's auc: 0.901129
Early stopping, best iteration is:
[10567]	training's auc: 0.950365	valid_1's auc: 0.901193
CV score: 0.90025 
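Although the notebook only reports the AUC number, the roc_curve import can also be used to visualise the out-of-fold predictions; here is a minimal sketch of such a plot (not part of the original run):

# optional: plot the ROC curve of the out-of-fold predictions
fpr, tpr, _ = roc_curve(target, oof)
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, label='OOF ROC (AUC = {:.5f})'.format(roc_auc_score(target, oof)))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()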

We are also interested in feature importance: which features contribute most to the prediction?

cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.show()

[Figure: feature importance averaged over folds]

5 Submission and Final Result

submission=pd.DataFrame({"ID_code":test['ID_code'].values})
submission['target']=predictions
submission.to_csv(PATH+'submission.csv',index=False)
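Before uploading, it can help to sanity-check the submission (an optional step, not in the original notebook):

# quick look at the shape and first rows of the submission file
print(submission.shape)
print(submission.head())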

This simple submission scores 0.89889 on the public leaderboard and 0.90021 on the private leaderboard, which ranks 329/8780, i.e. the top 3.7%.
