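Pandas provides the `describe` method, a powerful way to view each column's count, mean, minimum, maximum, and other summary statistics. The following shows a way to extend that summary with additional per-feature descriptions of your own.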
df.describe().T.assign(missing_rate=df.apply(lambda x: (len(x) - x.count()) / float(len(x))))
`T` transposes the result so that each original column becomes a row, and `assign` appends the new column.
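To see how the two pieces fit together, here is a minimal sketch on a tiny made-up frame (the column names `a` and `b` and their values are purely illustrative and do not come from the data set shown in this post). It relies on `assign` aligning a Series to the frame's index by label, which is why the per-column missing rate lines up with the transposed summary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" has one missing value out of four.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, np.nan, 30.0, 40.0]})

# After transposing, each row of the summary is one original column.
summary = df.describe().T

# Per-column missing rate: a Series indexed by column name.
missing = df.apply(lambda x: (len(x) - x.count()) / float(len(x)))

# assign matches `missing` to `summary` by index label and appends it as a new column.
result = summary.assign(missing_rate=missing)
print(result[["count", "missing_rate"]])
```

Because the alignment is by label, the order of the columns in `df` does not matter here.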
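This one-liner shows each column's count, mean, standard deviation, minimum, maximum, and the first, second, and third quartiles (25%, 50%, 75%), and adds a computed missing rate.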
feature | count | mean | std | min | 25% | 50% | 75% | max | missing_rate |
---|---|---|---|---|---|---|---|---|---|
SeriousDlqin2yrs | 150000.0 | 0.066840 | 0.249746 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 |
RevolvingUtilizationOfUnsecuredLines | 150000.0 | 6.048438 | 249.755371 | 0.0 | 0.029867 | 0.154181 | 0.559046 | 50708.0 | 0.000000 |
age | 150000.0 | 52.295207 | 14.771866 | 0.0 | 41.000000 | 52.000000 | 63.000000 | 109.0 | 0.000000 |
NumberOfTime30-59DaysPastDueNotWorse | 150000.0 | 0.421033 | 4.192781 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 | 0.000000 |
DebtRatio | 150000.0 | 353.005076 | 2037.818523 | 0.0 | 0.175074 | 0.366508 | 0.868254 | 329664.0 | 0.000000 |
MonthlyIncome | 120269.0 | 6670.221237 | 14384.674215 | 0.0 | 3400.000000 | 5400.000000 | 8249.000000 | 3008750.0 | 0.198207 |
NumberOfOpenCreditLinesAndLoans | 150000.0 | 8.452760 | 5.145951 | 0.0 | 5.000000 | 8.000000 | 11.000000 | 58.0 | 0.000000 |
NumberOfTimes90DaysLate | 150000.0 | 0.265973 | 4.169304 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 | 0.000000 |
NumberRealEstateLoansOrLines | 150000.0 | 1.018240 | 1.129771 | 0.0 | 0.000000 | 1.000000 | 2.000000 | 54.0 | 0.000000 |
NumberOfTime60-89DaysPastDueNotWorse | 150000.0 | 0.240387 | 4.155179 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 | 0.000000 |
NumberOfDependents | 146076.0 | 0.757222 | 1.115086 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 20.0 | 0.026160 |
The last column, `missing_rate`, is the custom missing rate.
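Since `assign` accepts several keyword arguments, the same pattern extends to other per-column descriptions. The sketch below is one possible way to package it. The function name `describe_plus` and the extra `n_unique` column are illustrative choices, not part of pandas or of the original snippet, and `isna().mean()` is used here as an equivalent way to compute the missing rate.

```python
import numpy as np
import pandas as pd


def describe_plus(df: pd.DataFrame) -> pd.DataFrame:
    """Transposed describe() plus extra per-column statistics (hypothetical helper)."""
    return df.describe().T.assign(
        missing_rate=df.isna().mean(),  # fraction of missing values per column
        n_unique=df.nunique(),          # number of distinct non-null values per column
    )


# Made-up example data, for illustration only.
toy = pd.DataFrame({"x": [1.0, 1.0, 2.0, np.nan], "y": [5.0, 6.0, 7.0, 8.0]})
report = describe_plus(toy)
print(report[["count", "missing_rate", "n_unique"]])
```

Note that `describe()` summarizes only numeric columns by default, so the extra statistics computed for any non-numeric columns are dropped when they are aligned to the summary's index.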