Santander Customer Transaction Prediction: EDA and Baseline
1 Description
At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.
Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?
In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.
2 Prepare The Data
2.1 Import and preparation
First, we import the packages we may need for this solution.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import six.moves.urllib as urllib
import sklearn
import scipy
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score, roc_curve
import lightgbm as lgb
%matplotlib inline
PATH='E:/kaggle/santander-customer-transaction-prediction/'
train=pd.read_csv(PATH+'train.csv')
test=pd.read_csv(PATH+'test.csv')
Check the data information
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 202 entries, ID_code to var_199
dtypes: float64(200), int64(1), object(1)
memory usage: 308.2+ MB
Check the dimension of the data
train.shape
(200000, 202)
train.head()
| | ID_code | target | var_0 | var_1 | var_2 | var_3 | var_4 | var_5 | var_6 | var_7 | ... | var_190 | var_191 | var_192 | var_193 | var_194 | var_195 | var_196 | var_197 | var_198 | var_199 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | train_0 | 0 | 8.9255 | -6.7863 | 11.9081 | 5.0930 | 11.4607 | -9.2834 | 5.1187 | 18.6266 | ... | 4.4354 | 3.9642 | 3.1364 | 1.6910 | 18.5227 | -2.3978 | 7.8784 | 8.5635 | 12.7803 | -1.0914 |
1 | train_1 | 0 | 11.5006 | -4.1473 | 13.8588 | 5.3890 | 12.3622 | 7.0433 | 5.6208 | 16.5338 | ... | 7.6421 | 7.7214 | 2.5837 | 10.9516 | 15.4305 | 2.0339 | 8.1267 | 8.7889 | 18.3560 | 1.9518 |
2 | train_2 | 0 | 8.6093 | -2.7457 | 12.0805 | 7.8928 | 10.5825 | -9.0837 | 6.9427 | 14.6155 | ... | 2.9057 | 9.7905 | 1.6704 | 1.6858 | 21.6042 | 3.1417 | -6.5213 | 8.2675 | 14.7222 | 0.3965 |
3 | train_3 | 0 | 11.0604 | -2.1518 | 8.9522 | 7.1957 | 12.5846 | -1.8361 | 5.8428 | 14.9250 | ... | 4.4666 | 4.7433 | 0.7178 | 1.4214 | 23.0347 | -1.2706 | -2.9275 | 10.2922 | 17.9697 | -8.9996 |
4 | train_4 | 0 | 9.8369 | -1.4834 | 12.8746 | 6.6375 | 12.2772 | 2.4486 | 5.9405 | 19.2514 | ... | -1.4905 | 9.5214 | -0.1508 | 9.1942 | 13.2876 | -1.5121 | 3.9267 | 9.5031 | 17.9974 | -8.8104 |
5 rows × 202 columns
We now have a basic picture of the data. The column names are anonymized, so we cannot infer anything concrete from them or from the raw values, and we will have to explore further. Before that, let's first check whether there are any missing values.
2.2 Check the Data
# check the missing values
data_na=(train.isnull().sum()/len(train))*100
data_na=data_na.drop(data_na[data_na==0].index).sort_values(ascending=False)
missing_data=pd.DataFrame({'MissingRatio':data_na})
print(missing_data)
Empty DataFrame
Columns: [MissingRatio]
Index: []
We can see there are no missing values.
train.target.value_counts()
0 179902
1 20098
Name: target, dtype: int64
The dataset is quite imbalanced: almost 90 percent of the rows have target '0' while only about 10 percent have target '1'. This imbalance is why we later use stratified cross-validation and AUC as the evaluation metric.
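A quick way to confirm the exact proportions (a small addition using only pandas on the train DataFrame loaded above):
# class proportions in the training set
train['target'].value_counts(normalize=True)
This returns roughly 0.8995 for class 0 and 0.1005 for class 1, matching the counts shown above.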
We first extract all the features here.
features=[col for col in train.columns if col not in ['ID_code','target']]
3 EDA
3.1 Check the Train-test Distribution
Before we start modeling, we are interested in how the data is distributed. Ideally the train and test sets should be similarly distributed in every respect, so we examine this point first.
First we check the mean values per row.
# check the distribution
plt.figure(figsize=(18,10))
plt.title('Distribution of mean values per row in the train and test set')
sns.distplot(train[features].mean(axis=1),color='green',kde=True,bins=120,label='train')
sns.distplot(test[features].mean(axis=1),color='red',kde=True,bins=120,label='test')
plt.legend()
plt.show()
Then we apply the same operation to the columns.
plt.figure(figsize=(18,10))
plt.title('Distribution of mean values per column in the train and test set')
sns.distplot(train[features].mean(axis=0),color='purple',kde=True,bins=120,label='train')
sns.distplot(test[features].mean(axis=0),color='orange',kde=True,bins=120,label='test')
plt.legend()
plt.show()
The standard deviation is also worth examining.
plt.figure(figsize=(18,10))
plt.title('Distribution of std values per rows in the train and test set')
sns.distplot(train[features].std(axis=1),color='black',kde=True,bins=120,label='train')
sns.distplot(test[features].std(axis=1),color='yellow',kde=True,bins=120,label='test')
plt.legend()
plt.show()
plt.figure(figsize=(18,10))
plt.title('Distribution of std values per column in the train and test set')
sns.distplot(train[features].std(axis=0),color='blue',kde=True,bins=120,label='train')
sns.distplot(test[features].std(axis=0),color='green',kde=True,bins=120,label='test')
plt.legend()
plt.show()
We can see that the distributions of these per-row and per-column statistics are nearly identical between the train set and the test set.
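Beyond eyeballing the plots, we can quantify how similar each feature's train and test distributions are with a two-sample Kolmogorov-Smirnov test. This is an optional sketch, not part of the original notebook; it only assumes the train, test, and features objects defined above and uses scipy.stats.ks_2samp.
from scipy.stats import ks_2samp
# For each feature, compare the train and test marginal distributions.
# A large KS statistic flags a feature whose distribution shifts between the two sets.
ks_stats = pd.Series(
    {f: ks_2samp(train[f], test[f]).statistic for f in features},
    name='ks_statistic'
)
print(ks_stats.sort_values(ascending=False).head(10))
With 200,000 rows even tiny differences become statistically significant, so the KS statistic itself is more informative than the p-value here.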
3.2 Check the Feature Correlation
# check the feature correlation
corrmat=train.corr()
plt.subplots(figsize=(18,18))
sns.heatmap(corrmat,vmax=0.9,square=True)
<matplotlib.axes._subplots.AxesSubplot at 0x25c953f7358>
We can see that the correlations between features are extremely weak. It is also worth checking the largest correlation values explicitly.
%%time
correlations=train[features].corr().unstack().sort_values(kind='quicksort').reset_index()
correlations=correlations[correlations['level_0']!=correlations['level_1']]
Wall time: 16.2 s
correlations.tail(10)
| | level_0 | level_1 | 0 |
---|---|---|---|
39790 | var_122 | var_132 | 0.008956 |
39791 | var_132 | var_122 | 0.008956 |
39792 | var_146 | var_169 | 0.009071 |
39793 | var_169 | var_146 | 0.009071 |
39794 | var_189 | var_183 | 0.009359 |
39795 | var_183 | var_189 | 0.009359 |
39796 | var_174 | var_81 | 0.009490 |
39797 | var_81 | var_174 | 0.009490 |
39798 | var_165 | var_81 | 0.009714 |
39799 | var_81 | var_165 | 0.009714 |
correlations.head(10)
| | level_0 | level_1 | 0 |
---|---|---|---|
0 | var_26 | var_139 | -0.009844 |
1 | var_139 | var_26 | -0.009844 |
2 | var_148 | var_53 | -0.009788 |
3 | var_53 | var_148 | -0.009788 |
4 | var_80 | var_6 | -0.008958 |
5 | var_6 | var_80 | -0.008958 |
6 | var_1 | var_80 | -0.008855 |
7 | var_80 | var_1 | -0.008855 |
8 | var_13 | var_2 | -0.008795 |
9 | var_2 | var_13 | -0.008795 |
The maximum absolute pairwise correlation is below 0.01, so we are unlikely to extract any useful information from feature correlations.
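To read off the extreme value directly instead of scanning the head and tail of the table, we can take the maximum absolute value of the unstacked correlations (a one-line addition reusing the correlations DataFrame built above; the value column is labelled 0 by reset_index):
# largest absolute off-diagonal correlation
print('max |corr| = {:.6f}'.format(correlations[0].abs().max()))
This confirms that no pair of features is correlated beyond roughly 0.01 in absolute value.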
3.3 Further Exploring
What about the distribution of each individual feature? Here we plot all 200 feature distributions on a single figure.
# check the distribution of each feature
def plot_features(df1,df2,label1,label2,features):
    # overlay the density estimate of every feature for two dataframes on a 10x20 grid
    sns.set_style('whitegrid')
    fig,ax=plt.subplots(10,20,figsize=(18,22))
    i=0
    for feature in features:
        i+=1
        plt.subplot(10,20,i)
        sns.distplot(df1[feature],hist=False,label=label1)
        sns.distplot(df2[feature],hist=False,label=label2)
        plt.xlabel(feature,fontsize=9)
        locs, labels=plt.xticks()
        plt.tick_params(axis='x',which='major',labelsize=6,pad=-6)
        plt.tick_params(axis='y',which='major',labelsize=6)
    plt.show()
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
features=train.columns.values[2:202]
plot_features(t0,t1,'0','1',features)
features=train.columns.values[2:202]
plot_features(train,test,'train','test',features)
All of the per-feature distributions are closely matched, especially between the train and test sets, which makes our work much more convenient.
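While the class-conditional curves for target 0 and target 1 look broadly similar, the small shifts between them are what a classifier can exploit. As a rough, optional way to rank features by how far apart the class means are (an addition reusing t0, t1, and features from above, not part of the original notebook):
# standardized difference of class-conditional means per feature
mean_shift = (t1[features].mean() - t0[features].mean()) / train[features].std()
print(mean_shift.abs().sort_values(ascending=False).head(10))
Features with a large standardized mean shift are natural candidates to appear near the top of the feature-importance plot later on.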
3.4 Other Statistical Indicators Worth Checking
To get a more comprehensive picture of the data, we can also check other statistical indicators that might provide additional insight: the minimum, maximum, skewness, and kurtosis per row and per column, split by target class.
# Distribution of min and max
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
plt.figure(figsize=(18,10))
plt.title('Distribution of min values per row in the train set')
sns.distplot(t0[features].min(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].min(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()
plt.figure(figsize=(18,10))
plt.title('Distribution of min values per column in the train set')
sns.distplot(t0[features].min(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].min(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()
plt.figure(figsize=(18,10))
plt.title('Distribution of max values per row in the train set')
sns.distplot(t0[features].max(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].max(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()
plt.figure(figsize=(18,10))
plt.title('Distribution of max values per column in the train set')
sns.distplot(t0[features].max(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].max(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()
# skewness and kurtosis
plt.figure(figsize=(18,10))
plt.title('Distribution of skew values per row in the train set')
sns.distplot(t0[features].skew(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].skew(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()
plt.figure(figsize=(18,10))
plt.title('Distribution of skew values per column in the train set')
sns.distplot(t0[features].skew(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].skew(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()
plt.figure(figsize=(18,10))
plt.title('Distribution of kurtosis values per row in the train set')
sns.distplot(t0[features].kurtosis(axis=1),color='orange',kde=True,bins=120,label='0')
sns.distplot(t1[features].kurtosis(axis=1),color='red',kde=True,bins=120,label='1')
plt.legend()
plt.show()
plt.figure(figsize=(18,10))
plt.title('Distribution of kurtosis values per column in the train set')
sns.distplot(t0[features].kurtosis(axis=0),color='blue',kde=True,bins=120,label='0')
sns.distplot(t1[features].kurtosis(axis=0),color='green',kde=True,bins=120,label='1')
plt.legend()
plt.show()
4 Feature Engineering and Modeling
4.1 Create New Features
We can add these statistical indicators to the dataset as new features for modeling; they may prove useful.
# creating new features
idx=features=train.columns.values[2:202]
for df in [train,test]:
    df['sum']=df[idx].sum(axis=1)
    df['min']=df[idx].min(axis=1)
    df['max']=df[idx].max(axis=1)
    df['mean']=df[idx].mean(axis=1)
    df['std']=df[idx].std(axis=1)
    df['skew']=df[idx].skew(axis=1)
    df['kurt']=df[idx].kurtosis(axis=1)
    df['med']=df[idx].median(axis=1)
train[train.columns[202:]].head(10)
| | sum | min | max | mean | std | skew | kurt | med |
---|---|---|---|---|---|---|---|---|
0 | 1456.3182 | -21.4494 | 43.1127 | 7.281591 | 9.331540 | 0.101580 | 1.331023 | 6.77040 |
1 | 1415.3636 | -47.3797 | 40.5632 | 7.076818 | 10.336130 | -0.351734 | 4.110215 | 7.22315 |
2 | 1240.8966 | -22.4038 | 33.8820 | 6.204483 | 8.753387 | -0.056957 | 0.546438 | 5.89940 |
3 | 1288.2319 | -35.1659 | 38.1015 | 6.441160 | 9.594064 | -0.480116 | 2.630499 | 6.70260 |
4 | 1354.2310 | -65.4863 | 41.1037 | 6.771155 | 11.287122 | -1.463426 | 9.787399 | 6.94735 |
5 | 1272.3216 | -44.7257 | 35.2664 | 6.361608 | 9.313012 | -0.920439 | 4.581343 | 6.23790 |
6 | 1509.4490 | -29.9763 | 39.9599 | 7.547245 | 9.246130 | -0.133489 | 1.816453 | 7.47605 |
7 | 1438.5083 | -27.2543 | 31.9043 | 7.192541 | 9.162558 | -0.300415 | 1.174273 | 6.97300 |
8 | 1369.7375 | -31.7855 | 42.4798 | 6.848688 | 9.837520 | 0.084047 | 1.997040 | 6.32870 |
9 | 1303.1155 | -39.3042 | 34.4640 | 6.515577 | 9.943238 | -0.670024 | 2.521160 | 6.36320 |
test[test.columns[201:]].head(10)
| | sum | min | max | mean | std | skew | kurt | med |
---|---|---|---|---|---|---|---|---|
0 | 1416.6404 | -31.9891 | 42.0248 | 7.083202 | 9.910632 | -0.088518 | 1.871262 | 7.31440 |
1 | 1249.6860 | -41.1924 | 35.6020 | 6.248430 | 9.541267 | -0.559785 | 3.391068 | 6.43960 |
2 | 1430.2599 | -34.3488 | 39.3654 | 7.151300 | 9.967466 | -0.135084 | 2.326901 | 7.26355 |
3 | 1411.4447 | -21.4797 | 40.3383 | 7.057224 | 8.257204 | -0.167741 | 2.253054 | 6.89675 |
4 | 1423.7364 | -24.8254 | 45.5510 | 7.118682 | 10.043542 | 0.293484 | 2.044943 | 6.83375 |
5 | 1273.1592 | -19.8952 | 30.2647 | 6.365796 | 8.728466 | -0.031814 | 0.113763 | 5.83800 |
6 | 1440.7387 | -18.7481 | 37.4611 | 7.203693 | 8.676615 | -0.045407 | 0.653782 | 6.66335 |
7 | 1429.5281 | -22.7363 | 33.2387 | 7.147640 | 9.697687 | -0.017784 | 0.713021 | 7.44665 |
8 | 1270.4978 | -17.4719 | 28.1225 | 6.352489 | 8.257376 | -0.138639 | 0.342360 | 6.55820 |
9 | 1271.6875 | -32.8776 | 38.3319 | 6.358437 | 9.489171 | -0.354497 | 1.934290 | 6.83960 |
Now let’s check the distributions of the new features.
def plot_new_features(df1,df2,label1,label2,features):
    # compare the distributions of the eight engineered features for two dataframes
    sns.set_style('whitegrid')
    fig,ax=plt.subplots(2,4,figsize=(18,8))
    i=0
    for feature in features:
        i+=1
        plt.subplot(2,4,i)
        sns.kdeplot(df1[feature],bw=0.5,label=label1)
        sns.kdeplot(df2[feature],bw=0.5,label=label2)
        plt.xlabel(feature,fontsize=11)
        locs,labels=plt.xticks()
        plt.tick_params(axis='x',which='major',labelsize=8)
        plt.tick_params(axis='y',which='major',labelsize=8)
    plt.show()
t0=train.loc[train['target']==0]
t1=train.loc[train['target']==1]
features=train.columns.values[202:]
plot_new_features(t0,t1,'0','1',features)
print('Columns in train_set:{} Columns in test_set:{}'.format(len(train.columns),len(test.columns)))
Columns in train_set:210 Columns in test_set:209
The train set has one more column than the test set because it still contains the target column.
4.2 Training the Model
Here’s a baseline model that uses LightGBM.
# training the model
features=[col for col in train.columns if col not in ['ID_code','target']]
target=train['target']
param={
    'bagging_freq':5,               # perform bagging every 5 iterations
    'bagging_fraction':0.4,         # use 40% of the rows in each bagging step
    'boost':'gbdt',
    'boost_from_average':'false',
    'feature_fraction':0.05,        # each iteration sees only 5% of the features
    'learning_rate':0.01,
    'max_depth':-1,                 # no depth limit; num_leaves controls complexity
    'metric':'auc',
    'min_data_in_leaf':80,
    'min_sum_hessian_in_leaf':10.0,
    'num_leaves':13,                # small trees to limit overfitting
    'num_threads':8,
    'tree_learner':'serial',
    'objective':'binary',
    'verbosity':1
}
# random_state only takes effect when shuffle=True (newer scikit-learn versions
# raise an error otherwise), so we keep the original unshuffled folds
folds = StratifiedKFold(n_splits=10, shuffle=False)
oof = np.zeros(len(train))            # out-of-fold predictions on the train set
predictions = np.zeros(len(test))     # test predictions averaged over the folds
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=target.iloc[val_idx])
    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=1000, early_stopping_rounds=3000)
    oof[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    predictions += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits
print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
Fold 0
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.900229 valid_1's auc: 0.881617
[2000] training's auc: 0.91128 valid_1's auc: 0.889429
[3000] training's auc: 0.918765 valid_1's auc: 0.893439
[4000] training's auc: 0.924616 valid_1's auc: 0.895931
[5000] training's auc: 0.929592 valid_1's auc: 0.897636
[6000] training's auc: 0.933838 valid_1's auc: 0.898786
[7000] training's auc: 0.937858 valid_1's auc: 0.899318
[8000] training's auc: 0.941557 valid_1's auc: 0.899733
[9000] training's auc: 0.94517 valid_1's auc: 0.899901
[10000] training's auc: 0.948529 valid_1's auc: 0.900143
[11000] training's auc: 0.951807 valid_1's auc: 0.900281
[12000] training's auc: 0.954903 valid_1's auc: 0.900269
[13000] training's auc: 0.957815 valid_1's auc: 0.900107
[14000] training's auc: 0.960655 valid_1's auc: 0.89994
Early stopping, best iteration is:
[11603] training's auc: 0.953681 valid_1's auc: 0.900347
Fold 1
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.900404 valid_1's auc: 0.882765
[2000] training's auc: 0.911307 valid_1's auc: 0.889508
[3000] training's auc: 0.918917 valid_1's auc: 0.893254
[4000] training's auc: 0.924779 valid_1's auc: 0.895682
[5000] training's auc: 0.929704 valid_1's auc: 0.897004
[6000] training's auc: 0.933907 valid_1's auc: 0.897785
[7000] training's auc: 0.93784 valid_1's auc: 0.89799
[8000] training's auc: 0.941511 valid_1's auc: 0.898383
[9000] training's auc: 0.945033 valid_1's auc: 0.898701
[10000] training's auc: 0.94837 valid_1's auc: 0.898763
[11000] training's auc: 0.951605 valid_1's auc: 0.89877
[12000] training's auc: 0.954709 valid_1's auc: 0.898751
[13000] training's auc: 0.957618 valid_1's auc: 0.898634
Early stopping, best iteration is:
[10791] training's auc: 0.950935 valid_1's auc: 0.89889
Fold 2
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.90084 valid_1's auc: 0.87531
[2000] training's auc: 0.911957 valid_1's auc: 0.883717
[3000] training's auc: 0.919463 valid_1's auc: 0.888423
[4000] training's auc: 0.925317 valid_1's auc: 0.891101
[5000] training's auc: 0.930106 valid_1's auc: 0.892821
[6000] training's auc: 0.93436 valid_1's auc: 0.89362
[7000] training's auc: 0.938282 valid_1's auc: 0.89429
[8000] training's auc: 0.941897 valid_1's auc: 0.894544
[9000] training's auc: 0.945462 valid_1's auc: 0.894652
[10000] training's auc: 0.948798 valid_1's auc: 0.894821
[11000] training's auc: 0.952036 valid_1's auc: 0.894888
[12000] training's auc: 0.955136 valid_1's auc: 0.894657
[13000] training's auc: 0.958081 valid_1's auc: 0.894511
[14000] training's auc: 0.960904 valid_1's auc: 0.894327
Early stopping, best iteration is:
[11094] training's auc: 0.952334 valid_1's auc: 0.894948
Fold 3
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.900276 valid_1's auc: 0.882173
[2000] training's auc: 0.911124 valid_1's auc: 0.889171
[3000] training's auc: 0.918758 valid_1's auc: 0.893614
[4000] training's auc: 0.92463 valid_1's auc: 0.89627
[5000] training's auc: 0.929475 valid_1's auc: 0.897519
[6000] training's auc: 0.933971 valid_1's auc: 0.898018
[7000] training's auc: 0.937925 valid_1's auc: 0.898396
[8000] training's auc: 0.941684 valid_1's auc: 0.898475
[9000] training's auc: 0.945229 valid_1's auc: 0.898597
[10000] training's auc: 0.948626 valid_1's auc: 0.898725
[11000] training's auc: 0.951822 valid_1's auc: 0.898657
[12000] training's auc: 0.95488 valid_1's auc: 0.898504
[13000] training's auc: 0.957871 valid_1's auc: 0.898503
Early stopping, best iteration is:
[10712] training's auc: 0.950891 valid_1's auc: 0.898759
Fold 4
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.900213 valid_1's auc: 0.883231
[2000] training's auc: 0.911052 valid_1's auc: 0.890297
[3000] training's auc: 0.918649 valid_1's auc: 0.894252
[4000] training's auc: 0.924548 valid_1's auc: 0.896724
[5000] training's auc: 0.92951 valid_1's auc: 0.897923
[6000] training's auc: 0.93393 valid_1's auc: 0.898887
[7000] training's auc: 0.937896 valid_1's auc: 0.899048
[8000] training's auc: 0.941556 valid_1's auc: 0.899335
[9000] training's auc: 0.945033 valid_1's auc: 0.899469
[10000] training's auc: 0.94841 valid_1's auc: 0.899536
[11000] training's auc: 0.951679 valid_1's auc: 0.899371
[12000] training's auc: 0.954731 valid_1's auc: 0.899314
[13000] training's auc: 0.95771 valid_1's auc: 0.899024
Early stopping, best iteration is:
[10307] training's auc: 0.949415 valid_1's auc: 0.899591
Fold 5
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.899832 valid_1's auc: 0.887942
[2000] training's auc: 0.910762 valid_1's auc: 0.895511
[3000] training's auc: 0.918306 valid_1's auc: 0.899303
[4000] training's auc: 0.924334 valid_1's auc: 0.901522
[5000] training's auc: 0.929353 valid_1's auc: 0.902569
[6000] training's auc: 0.933747 valid_1's auc: 0.903396
[7000] training's auc: 0.937725 valid_1's auc: 0.903844
[8000] training's auc: 0.941422 valid_1's auc: 0.904181
[9000] training's auc: 0.944946 valid_1's auc: 0.904167
[10000] training's auc: 0.948326 valid_1's auc: 0.903872
[11000] training's auc: 0.951534 valid_1's auc: 0.903846
Early stopping, best iteration is:
[8408] training's auc: 0.942866 valid_1's auc: 0.904303
Fold 6
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.899935 valid_1's auc: 0.884744
[2000] training's auc: 0.910967 valid_1's auc: 0.892097
[3000] training's auc: 0.918595 valid_1's auc: 0.896277
[4000] training's auc: 0.924503 valid_1's auc: 0.898606
[5000] training's auc: 0.929414 valid_1's auc: 0.89991
[6000] training's auc: 0.933745 valid_1's auc: 0.900743
[7000] training's auc: 0.937714 valid_1's auc: 0.901066
[8000] training's auc: 0.94139 valid_1's auc: 0.900995
[9000] training's auc: 0.944926 valid_1's auc: 0.901016
Early stopping, best iteration is:
[6986] training's auc: 0.937661 valid_1's auc: 0.901085
Fold 7
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.899968 valid_1's auc: 0.881017
[2000] training's auc: 0.910826 valid_1's auc: 0.889131
[3000] training's auc: 0.918484 valid_1's auc: 0.893968
[4000] training's auc: 0.924432 valid_1's auc: 0.896794
[5000] training's auc: 0.929348 valid_1's auc: 0.898531
[6000] training's auc: 0.933656 valid_1's auc: 0.899541
[7000] training's auc: 0.937572 valid_1's auc: 0.899903
[8000] training's auc: 0.941255 valid_1's auc: 0.900259
[9000] training's auc: 0.944865 valid_1's auc: 0.900205
[10000] training's auc: 0.948314 valid_1's auc: 0.900135
[11000] training's auc: 0.951556 valid_1's auc: 0.900281
[12000] training's auc: 0.954647 valid_1's auc: 0.900202
[13000] training's auc: 0.957629 valid_1's auc: 0.900083
[14000] training's auc: 0.960473 valid_1's auc: 0.900019
Early stopping, best iteration is:
[11028] training's auc: 0.951647 valid_1's auc: 0.900328
Fold 8
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.899642 valid_1's auc: 0.889764
[2000] training's auc: 0.91067 valid_1's auc: 0.897589
[3000] training's auc: 0.918364 valid_1's auc: 0.901604
[4000] training's auc: 0.92421 valid_1's auc: 0.903614
[5000] training's auc: 0.929197 valid_1's auc: 0.904601
[6000] training's auc: 0.933471 valid_1's auc: 0.905101
[7000] training's auc: 0.93741 valid_1's auc: 0.905128
[8000] training's auc: 0.941136 valid_1's auc: 0.905215
[9000] training's auc: 0.944594 valid_1's auc: 0.905207
[10000] training's auc: 0.948042 valid_1's auc: 0.905092
[11000] training's auc: 0.951259 valid_1's auc: 0.905037
Early stopping, best iteration is:
[8028] training's auc: 0.941228 valid_1's auc: 0.905247
Fold 9
Training until validation scores don't improve for 3000 rounds.
[1000] training's auc: 0.900193 valid_1's auc: 0.884426
[2000] training's auc: 0.911194 valid_1's auc: 0.891741
[3000] training's auc: 0.918785 valid_1's auc: 0.895999
[4000] training's auc: 0.924653 valid_1's auc: 0.8984
[5000] training's auc: 0.929607 valid_1's auc: 0.899584
[6000] training's auc: 0.933898 valid_1's auc: 0.900395
[7000] training's auc: 0.937896 valid_1's auc: 0.900785
[8000] training's auc: 0.941574 valid_1's auc: 0.900916
[9000] training's auc: 0.945132 valid_1's auc: 0.901081
[10000] training's auc: 0.948568 valid_1's auc: 0.901075
[11000] training's auc: 0.951714 valid_1's auc: 0.901069
[12000] training's auc: 0.954815 valid_1's auc: 0.901025
[13000] training's auc: 0.957792 valid_1's auc: 0.901129
Early stopping, best iteration is:
[10567] training's auc: 0.950365 valid_1's auc: 0.901193
CV score: 0.90025
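Since roc_curve was imported at the beginning but never used, we can also visualize the out-of-fold predictions as a ROC curve. This is an optional addition, not part of the original notebook; it simply reuses the target and oof arrays from the training loop.
# ROC curve of the out-of-fold predictions
fpr, tpr, _ = roc_curve(target, oof)
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, label='LightGBM baseline (AUC = {:.5f})'.format(roc_auc_score(target, oof)))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Out-of-fold ROC curve')
plt.legend()
plt.show()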
We are also interested in feature importance: which features matter most in the prediction process?
cols = (feature_importance_df[["Feature", "importance"]]
.groupby("Feature")
.mean()
.sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]
plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.show()
5 Submission and Final Result
submission=pd.DataFrame({"ID_code":test['ID_code'].values})
submission['target']=predictions
submission.to_csv(PATH+'submission.csv',index=False)
This simple submission scores 0.89889 on the public leaderboard and 0.90021 on the private leaderboard, ranking 329/8780, roughly the top 3.7% on the private leaderboard.