2018/7/12
1. Description
Home Credit uses alternative data, including telecom and other transactional data, to predict a client's repayment ability (as a probability).
2. Evaluation
Area under the ROC curve (ROC AUC)
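As a reminder of what this metric measures: ROC AUC equals the probability that a randomly chosen positive example (TARGET = 1, failure to repay) receives a higher predicted score than a randomly chosen negative one, with ties counted as one half. Below is a minimal pure-Python sketch of that pairwise definition. The function name `roc_auc` and the toy labels and scores are made up for illustration; in practice `sklearn.metrics.roc_auc_score` would be the usual choice.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = failed to repay, 0 = repaid
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

The double loop is quadratic in the number of rows, so it is only meant to show the definition, not to score the full dataset.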
3. Data
application.csv
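Gender, car ownership, number of children, income, price of the goods for consumer loans, loan credit amount, loan annuity, who accompanied the client when applying, income source, education, family status, housing type, population of the region of residence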
2018/7/14
EDA
Visual exploratory analysis of application_train
"""
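Look at the distribution of the label
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# app_train is assumed to be the application_train table, already loaded as a pandas DataFrame
app_train['TARGET'].value_counts()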
"""
The classes are clearly imbalanced
"""
app_train['TARGET'].astype(int).plot.hist()
plt.show()
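To put a number on the imbalance rather than reading it off the histogram, the class shares can be computed with `value_counts(normalize=True)`. The sketch below uses a tiny made-up `toy` frame in place of `app_train`, so the printed proportions say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for app_train; the real TARGET distribution will differ
toy = pd.DataFrame({"TARGET": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Share of each class instead of raw counts
class_share = toy["TARGET"].value_counts(normalize=True)
print(class_share)

# Ratio of the majority class share to the minority class share
imbalance_ratio = class_share.max() / class_share.min()
print(imbalance_ratio)
```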
"""
Check for missing values
"""
def missing_values_table(df):
    # Number of missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)
    # Combine counts and percentages into one table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only columns that have missing values, sorted by percentage (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(40)
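Correlation between the features and the label
# Find correlations with the target and sort
# (restricted to numeric/bool columns, since corr() in recent pandas raises on string columns)
correlations = app_train.select_dtypes(include=['number', 'bool']).corr()['TARGET'].sort_values()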
# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
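Bin age into intervals and compute statistics per bin
# Age information into a separate dataframe (copy to avoid SettingWithCopyWarning)
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
# Assumption: DAYS_BIRTH is stored as a negative day count, so take the absolute value
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'].abs() / 365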
# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)
# Group by the bin and calculate averages
age_groups = age_data.groupby('YEARS_BINNED').mean()
age_groups
plt.figure(figsize = (8, 8))
# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
# Plot labeling
plt.xticks(rotation=75)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group')
Feature engineering with PolynomialFeatures
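These notes do not say which columns are expanded or which degree is used, so the following is only a sketch under assumptions: a toy frame with two hypothetical numeric columns `f1` and `f2`, and `degree=2`. It shows how scikit-learn's `PolynomialFeatures` generates powers and interaction terms from existing features.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical numeric features; the column names are placeholders
toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [0.5, 1.5, 2.5]})

# degree=2 is an assumed choice; include_bias=False drops the constant column
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(toy[["f1", "f2"]])

# Output columns are ordered: f1, f2, f1^2, f1*f2, f2^2
print(expanded.shape)
print(expanded[1])
```

`PolynomialFeatures` does not accept NaN inputs, so columns with missing values (see the missing-value check above) would need to be imputed before this step.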