题目:初识pandas库与缺失数据的补全
- 打开数据(csv文件、excel文件)
- 查看数据(尺寸信息、查看列名等方法)
- 查看空值
- 众数、中位数填补空值
- 利用循环补全所有列的空值
数据的读取与查看
# 读取数据
import pandas as pd
# 使用pandas中的read_csv读取csv文件
data = pd.read_csv(r'data.csv')
# pandas读取excel文件先安装openpyel库
# 使用pandas中的read_excel读取excel文件
data2 = pd.read_excel("data.xlsx")
type(data) # 类
输出:
pandas.core.frame.DataFrame
注:
DataFrame类型:类似二维的、表格型的数据结构,类似excel表格和SQL数据库中的表
# 布尔矩阵显示缺失值,返回一个布尔矩阵,也是dataframe对象
# True表示对应位置的值缺失,False表示对应位置的值存在
data.isnull()
输出:
Id | Home Ownership | Annual Income | Years in current job | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Purpose | Term | Current Loan Amount | Current Credit Balance | Monthly Debt | Credit Score | Credit Default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7495 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
7496 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
7497 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
7498 | False | False | True | True | False | False | False | False | False | True | False | False | False | False | False | False | True | False |
7499 | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False |
7500 rows × 18 columns
# 显示数据前5行,已确认数据正确加载
data.head()
输出:
Id | Home Ownership | Annual Income | Years in current job | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Purpose | Term | Current Loan Amount | Current Credit Balance | Monthly Debt | Credit Score | Credit Default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Own Home | 482087.0 | NaN | 0.0 | 11.0 | 26.3 | 685960.0 | 1.0 | NaN | 1.0 | debt consolidation | Short Term | 99999999.0 | 47386.0 | 7914.0 | 749.0 | 0 |
1 | 1 | Own Home | 1025487.0 | 10+ years | 0.0 | 15.0 | 15.3 | 1181730.0 | 0.0 | NaN | 0.0 | debt consolidation | Long Term | 264968.0 | 394972.0 | 18373.0 | 737.0 | 1 |
2 | 2 | Home Mortgage | 751412.0 | 8 years | 0.0 | 11.0 | 35.0 | 1182434.0 | 0.0 | NaN | 0.0 | debt consolidation | Short Term | 99999999.0 | 308389.0 | 13651.0 | 742.0 | 0 |
3 | 3 | Own Home | 805068.0 | 6 years | 0.0 | 8.0 | 22.5 | 147400.0 | 1.0 | NaN | 1.0 | debt consolidation | Short Term | 121396.0 | 95855.0 | 11338.0 | 694.0 | 0 |
4 | 4 | Rent | 776264.0 | 8 years | 0.0 | 13.0 | 13.6 | 385836.0 | 1.0 | NaN | 0.0 | debt consolidation | Short Term | 125840.0 | 93309.0 | 7180.0 | 719.0 | 0 |
数据信息的查看
此时data是dataframe类型的一个对象,可以看作是dataframe类的一个实例,具有类的属性和方法。
data.info() # 列名、非空值、数据类型
输出:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 7500 non-null int64
1 Home Ownership 7500 non-null object
2 Annual Income 5943 non-null float64
3 Years in current job 7129 non-null object
4 Tax Liens 7500 non-null float64
5 Number of Open Accounts 7500 non-null float64
6 Years of Credit History 7500 non-null float64
7 Maximum Open Credit 7500 non-null float64
8 Number of Credit Problems 7500 non-null float64
9 Months since last delinquent 3419 non-null float64
10 Bankruptcies 7486 non-null float64
11 Purpose 7500 non-null object
12 Term 7500 non-null object
13 Current Loan Amount 7500 non-null float64
14 Current Credit Balance 7500 non-null float64
15 Monthly Debt 7500 non-null float64
16 Credit Score 5943 non-null float64
17 Credit Default 7500 non-null int64
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB
data.shape # (行数,列数) data的属性
输出:(7500, 18)
代表数据有7500行,18列
data.columns # 所有的列名 data的属性
输出:
Index(['Id', 'Home Ownership', 'Annual Income', 'Years in current job',
'Tax Liens', 'Number of Open Accounts', 'Years of Credit History',
'Maximum Open Credit', 'Number of Credit Problems',
'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
'Credit Score', 'Credit Default'],
dtype='object')
data.describe() # 数值列的基本统计量
输出:
Id | Annual Income | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Current Loan Amount | Current Credit Balance | Monthly Debt | Credit Score | Credit Default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7500.000000 | 5.943000e+03 | 7500.000000 | 7500.000000 | 7500.000000 | 7.500000e+03 | 7500.000000 | 3419.000000 | 7486.000000 | 7.500000e+03 | 7.500000e+03 | 7500.000000 | 5943.000000 | 7500.000000 |
mean | 3749.500000 | 1.366392e+06 | 0.030133 | 11.130933 | 18.317467 | 9.451537e+05 | 0.170000 | 34.692600 | 0.117152 | 1.187318e+07 | 2.898332e+05 | 18314.454133 | 1151.087498 | 0.281733 |
std | 2165.207842 | 8.453392e+05 | 0.271604 | 4.908924 | 7.041946 | 1.602622e+07 | 0.498598 | 21.688806 | 0.347192 | 3.192612e+07 | 3.178714e+05 | 11926.764673 | 1604.451418 | 0.449874 |
min | 0.000000 | 1.645970e+05 | 0.000000 | 2.000000 | 4.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 1.124200e+04 | 0.000000e+00 | 0.000000 | 585.000000 | 0.000000 |
25% | 1874.750000 | 8.443410e+05 | 0.000000 | 8.000000 | 13.500000 | 2.792295e+05 | 0.000000 | 16.000000 | 0.000000 | 1.801690e+05 | 1.142565e+05 | 10067.500000 | 711.000000 | 0.000000 |
50% | 3749.500000 | 1.168386e+06 | 0.000000 | 10.000000 | 17.000000 | 4.781590e+05 | 0.000000 | 32.000000 | 0.000000 | 3.095730e+05 | 2.093230e+05 | 16076.500000 | 731.000000 | 0.000000 |
75% | 5624.250000 | 1.640137e+06 | 0.000000 | 14.000000 | 21.800000 | 7.935015e+05 | 0.000000 | 50.000000 | 0.000000 | 5.198820e+05 | 3.604062e+05 | 23818.000000 | 743.000000 | 1.000000 |
max | 7499.000000 | 1.014934e+07 | 7.000000 | 43.000000 | 57.700000 | 1.304726e+09 | 7.000000 | 118.000000 | 4.000000 | 1.000000e+08 | 6.506797e+06 | 136679.000000 | 7510.000000 | 1.000000 |
注:count:非缺失值的数量,mean:平均值,std:标准差
data.dtypes # data types的缩写,查看每一列的数据类型,是data的属性
输出:
Id int64
Home Ownership object
Annual Income float64
Years in current job object
Tax Liens float64
Number of Open Accounts float64
Years of Credit History float64
Maximum Open Credit float64
Number of Credit Problems float64
Months since last delinquent float64
Bankruptcies float64
Purpose object
Term object
Current Loan Amount float64
Current Credit Balance float64
Monthly Debt float64
Credit Score float64
Credit Default int64
dtype: object
data["Annual Income"].dtype # 单独查看某一列的数据类型
输出:float64
data.isnull().sum() # 统计每列中缺失值个数
输出:
Id 0
Home Ownership 0
Annual Income 1557
Years in current job 371
Tax Liens 0
Number of Open Accounts 0
Years of Credit History 0
Maximum Open Credit 0
Number of Credit Problems 0
Months since last delinquent 4081
Bankruptcies 14
Purpose 0
Term 0
Current Loan Amount 0
Current Credit Balance 0
Monthly Debt 0
Credit Score 1557
Credit Default 0
dtype: int64
缺失值填补
先要知道哪些列有缺失值,就知道哪些列需要进行缺失值的填充
需要填补的列有Annual Income、Years in current job、Months since last delinquent、Bankruptcies、Credit Score
中位数填充(Annual Income为例)
# 计算Annual Income的中位数,自动忽略缺失值
median_Income = data["Annual Income"].median()
median_Income
输出:1168386.0
# 使用计算出的中位数填充缺失值
# fillna()方法用于填充缺失值,inplace=True表示直接在原DataFrame上进行修改,否则会返回一个新的DataFrame
# 修改的是data,而不是data.csv
data['Annual Income'].fillna(median_Income, inplace=True)
# 检查Annual Income列是否还有缺失值
data['Annual Income'].isnull().sum()
输出:0
众数填充(Annual Income为例)
data = pd.read_csv('data.csv') # 重新读取数据
mode = data['Annual Income'].mode() # 众数
# mode()会返回数据中出现频率最高的所有值
mode
输出:
0 969475.0
1 1043651.0
2 1058376.0
3 1161660.0
4 1338113.0
Name: Annual Income, dtype: float64
# 返回了5个频次最高的值,这里保留第一个
mode = mode[0]
# 众数填补
data['Annual Income'].fillna(mode, inplace=True)
# 检查是否有缺失值
data['Annual Income'].isnull().sum()
输出:0
for循环填补所有数值型缺失值
# 使用tolist()方法,将numpy数组和pandas对象转化为列表list
# 方便使用for循环对每一列进行遍历
c = data.columns.tolist()
c
输出:输出结果是一个列表,可以使用for循环遍历
['Id',
'Home Ownership',
'Annual Income',
'Years in current job',
'Tax Liens',
'Number of Open Accounts',
'Years of Credit History',
'Maximum Open Credit',
'Number of Credit Problems',
'Months since last delinquent',
'Bankruptcies',
'Purpose',
'Term',
'Current Loan Amount',
'Current Credit Balance',
'Monthly Debt',
'Credit Score',
'Credit Default']
# 循环遍历列表c中的每一项
for i in c:
# 找出数据类型是数值型的列,是则继续
if data[i].dtype != 'object':
# 判断列中是否有缺失值,有则继续
if data[i].isnull().sum() > 0:
# 找中位数进行补全
mean_value = data[i].median()
data[i].fillna(mean_value, inplace=True)
data.isnull().sum()
输出:
Id 0
Home Ownership 0
Annual Income 0
Years in current job 371
Tax Liens 0
Number of Open Accounts 0
Years of Credit History 0
Maximum Open Credit 0
Number of Credit Problems 0
Months since last delinquent 0
Bankruptcies 0
Purpose 0
Term 0
Current Loan Amount 0
Current Credit Balance 0
Monthly Debt 0
Credit Score 0
Credit Default 0
dtype: int64
可见,除了object类型的Years in current job,所有数值型列都已无缺失值。