EDA: Exploratory Analysis for Data Preprocessing
Counting NaN values
Train_data.isnull().sum()
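As a small illustration of what this count gives you, here is a sketch on a toy DataFrame. The frame `toy` and its values are made up for the example and are not taken from the dataset; the names `n_missing` and `ratio` are also just illustrative. It shows the per-column NaN count next to the missing ratio, keeping only the columns that have gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Train_data (values are invented)
toy = pd.DataFrame({
    "price": [1500.0, np.nan, 3200.0, 800.0],
    "bodyType": [1.0, np.nan, np.nan, 0.0],
    "model": [30.0, 40.0, 115.0, 109.0],
})

n_missing = toy.isnull().sum()   # number of NaN values per column
ratio = toy.isnull().mean()      # fraction of rows that are NaN per column

# Combine both views, keep only columns with missing values, largest first
summary = pd.DataFrame({"n_missing": n_missing, "ratio": ratio})
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```

The ratio is often easier to act on than the raw count, since it does not depend on the number of rows.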
Data overview
Test_data.info()
Train_data.describe()
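Visualizing missing values
import missingno as msno
# Plot on a random sample of rows to keep the figures readable
msno.matrix(Train_data.sample(250))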
msno.bar(Train_data.sample(1000))
Inspecting a column's value distribution and replacing placeholder values
Train_data['notRepairedDamage'].value_counts()
"""
0.0 111361
- 24324
1.0 14315
Name: notRepairedDamage, dtype: int64
"""
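import numpy as np
# '-' is a placeholder for a missing value. Assign the result back instead of
# calling replace(..., inplace=True) on a column selection, which may not
# update Train_data in recent pandas versions.
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)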
Train_data['notRepairedDamage'].value_counts()
"""
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
"""
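After the replacement the column may still hold its values as strings, since '-' forced it to be read as text. A minimal sketch of a follow-up conversion to numbers, assuming the raw values are strings like '0.0', '1.0', and '-' (the Series `raw` below is a made-up stand-in, not the real column):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Train_data['notRepairedDamage'] as read from file
raw = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# Step 1: turn the '-' placeholder into a real missing value
cleaned = raw.replace("-", np.nan)

# Step 2: convert the remaining strings to numbers; NaN stays NaN
numeric = pd.to_numeric(cleaned)

# dropna=False makes the missing values visible in the counts
print(numeric.value_counts(dropna=False))
```

Note that `value_counts()` drops NaN by default, which is why the '-' row simply disappears from the output above rather than showing up as NaN.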
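Previewing and visualizing the distribution of the prediction target
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
y = Train_data['price']
# Overlay three candidate distributions on the histogram of price.
# Note: sns.distplot is deprecated since seaborn 0.11 and emits a warning.
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)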
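The plots compare the fits by eye. As a numeric complement, here is a sketch that compares skewness before and after a log transform and the log-likelihood of a normal versus a log-normal fit. It runs on synthetic log-normal data as a stand-in for price (an assumption, the real price column may behave differently), and the helper `loglik` is a name introduced only for this example.

```python
import numpy as np
import scipy.stats as st

# Synthetic, right-skewed stand-in for Train_data['price'] (assumption)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=8.0, sigma=1.0, size=5000)

# Skewness of the raw target versus the log-transformed target
skew_raw = st.skew(y)
skew_log = st.skew(np.log1p(y))
print("skew raw:", skew_raw, "skew log1p:", skew_log)

def loglik(dist, data, **fit_kwargs):
    """Fit dist to data by maximum likelihood and return the total log-likelihood."""
    params = dist.fit(data, **fit_kwargs)
    return float(np.sum(dist.logpdf(data, *params)))

ll_norm = loglik(st.norm, y)
ll_lognorm = loglik(st.lognorm, y, floc=0)  # fix location at 0 for a plain log-normal
print("log-likelihood normal:", ll_norm, "log-normal:", ll_lognorm)
```

A higher log-likelihood means the fitted distribution explains the data better. If the target is strongly right-skewed, a transform such as `np.log1p` is one common option before fitting a regression model.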
… to be completed