Features
Check for missing values and anomalies
x_train.isnull().sum()
x_test.isnull().sum()
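The calls above only count missing values; they do not look for anomalies. A minimal sketch of both checks follows, assuming numeric columns and using the 1.5 x IQR rule as one possible definition of an outlier. The toy frame `df`, the helper `iqr_outlier_mask`, and the threshold `k` are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for x_train
df = pd.DataFrame({
    "f1": [1.0, 2.0, np.nan, 4.0, 100.0, 3.0, 2.5, 3.5],
    "f2": [0.5, np.nan, np.nan, 0.7, 0.6, 0.4, 0.55, 0.65],
})

# Missing values: count and fraction per column, worst column first
missing = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "frac_missing": df.isnull().mean(),
}).sort_values("n_missing", ascending=False)
print(missing)


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaNs are never flagged."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Number of flagged values per column
outlier_counts = df.apply(lambda col: iqr_outlier_mask(col).sum())
print(outlier_counts)
```

Reporting the missing fraction alongside the raw count makes columns comparable when the train and test sets differ in size. The IQR rule is only a rough screen; flagged rows should be inspected before anything is dropped.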
Skewness and kurtosis
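import matplotlib.pyplot as plt
import seaborn as sns
# Distribution of the label, with its skewness and kurtosis
# (distplot is deprecated since seaborn 0.11)
sns.distplot(x_train['label'])
print("Skewness: %f" % x_train['label'].skew())
print("Kurtosis: %f" % x_train['label'].kurt())
# Distribution of the per-column kurtosis over the numeric columns
sns.distplot(x_train.kurt(numeric_only=True), color='orange', axlabel='Kurtosis')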
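To make the two statistics concrete: pandas reports sample skewness and excess kurtosis (Fisher's definition, so a normal distribution scores about 0 on both). The sketch below uses a synthetic right-skewed series as a hypothetical stand-in for a real column, and shows a log transform as one common way to reduce right skew. The data and the choice of `np.log1p` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for a feature or label column
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Large positive skew: long right tail; large positive kurtosis: heavy tails
print("raw:   skew=%.3f  kurt=%.3f" % (s.skew(), s.kurt()))

# log1p compresses the right tail, which pulls skewness toward 0
s_log = np.log1p(s)
print("log1p: skew=%.3f  kurt=%.3f" % (s_log.skew(), s_log.kurt()))
```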
Labels
Understand the distribution of the target values
x_train['label']
x_train['label'].value_counts()
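import scipy.stats as st
y = x_train['label']
# Empirical distribution: histogram, KDE, and rug plot
plt.figure(1); plt.title('Default')
sns.distplot(y, rug=True, bins=20)
# Histogram with a fitted normal density overlaid
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
# Histogram with a fitted log-normal density overlaid
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
# Plain frequency histogram, on its own figure so it is not drawn over figure 3
plt.figure(4)
plt.hist(x_train['label'], orientation='vertical', histtype='bar', color='red')
plt.show()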
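The plots above compare the fits by eye. A rough numeric comparison is sketched below, assuming the label is continuous and strictly positive (if it is a class ID, neither fit is meaningful). The synthetic `y` is a hypothetical stand-in for `x_train['label']`, and using the Kolmogorov-Smirnov statistic to rank the candidates is an illustrative choice, not the notebook's method.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Hypothetical positive, right-skewed target standing in for x_train['label']
y = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

# Fit each candidate distribution by maximum likelihood
norm_params = st.norm.fit(y)                 # (loc, scale)
lognorm_params = st.lognorm.fit(y, floc=0)   # (shape, loc, scale), loc fixed at 0

# KS statistic: largest gap between the empirical CDF and the fitted CDF
ks_norm = st.kstest(y, "norm", args=norm_params).statistic
ks_lognorm = st.kstest(y, "lognorm", args=lognorm_params).statistic
print("KS normal:     %.4f" % ks_norm)
print("KS log-normal: %.4f" % ks_lognorm)
```

A smaller statistic means the fitted curve tracks the data more closely. Because the parameters are estimated from the same sample, the accompanying p-values are not reliable, so only the statistics are compared here.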
Generate a data report with pandas_profiling
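# Note: pandas_profiling has since been renamed ydata-profiling (import ydata_profiling)
import pandas_profiling
# Build an HTML profile of every column in the training set
pfr = pandas_profiling.ProfileReport(x_train)
pfr.to_file("./data_analyse.html")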