DAY 4 缺失值的处理

题目:初识pandas库与缺失数据的补全
  1. 打开数据(csv文件、excel文件)
  2. 查看数据(尺寸信息、查看列名等方法)
  3. 查看空值
  4. 众数、中位数填补空值
  5. 利用循环补全所有列的空值

数据的读取与查看

# 读取数据
import pandas as pd
# 使用pandas中的read_csv读取csv文件
data = pd.read_csv(r'data.csv')

# pandas读取excel文件先安装openpyel库
# 使用pandas中的read_excel读取excel文件
data2 = pd.read_excel("data.xlsx")
type(data) # 类

输出:

pandas.core.frame.DataFrame

注:

DataFrame类型:类似二维的、表格型的数据结构,类似excel表格和SQL数据库中的表

# 布尔矩阵显示缺失值,返回一个布尔矩阵,也是dataframe对象
# True表示对应位置的值缺失,False表示对应位置的值存在
data.isnull()

输出:

IdHome OwnershipAnnual IncomeYears in current jobTax LiensNumber of Open AccountsYears of Credit HistoryMaximum Open CreditNumber of Credit ProblemsMonths since last delinquentBankruptciesPurposeTermCurrent Loan AmountCurrent Credit BalanceMonthly DebtCredit ScoreCredit Default
0FalseFalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
.........................................................
7495FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
7496FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
7497FalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
7498FalseFalseTrueTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseTrueFalse
7499FalseFalseTrueFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseTrueFalse

7500 rows × 18 columns

# 显示数据前5行,已确认数据正确加载
data.head()

输出:

IdHome OwnershipAnnual IncomeYears in current jobTax LiensNumber of Open AccountsYears of Credit HistoryMaximum Open CreditNumber of Credit ProblemsMonths since last delinquentBankruptciesPurposeTermCurrent Loan AmountCurrent Credit BalanceMonthly DebtCredit ScoreCredit Default
00Own Home482087.0NaN0.011.026.3685960.01.0NaN1.0debt consolidationShort Term99999999.047386.07914.0749.00
11Own Home1025487.010+ years0.015.015.31181730.00.0NaN0.0debt consolidationLong Term264968.0394972.018373.0737.01
22Home Mortgage751412.08 years0.011.035.01182434.00.0NaN0.0debt consolidationShort Term99999999.0308389.013651.0742.00
33Own Home805068.06 years0.08.022.5147400.01.0NaN1.0debt consolidationShort Term121396.095855.011338.0694.00
44Rent776264.08 years0.013.013.6385836.01.0NaN0.0debt consolidationShort Term125840.093309.07180.0719.00

数据信息的查看

        此时data是dataframe类型的一个对象,可以看作是dataframe类的一个实例,具有类的属性和方法。

data.info()  # 列名、非空值、数据类型

输出:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   object 
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object 
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object 
 12  Term                          7500 non-null   object 
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64  
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

data.shape  # (行数,列数) data的属性

输出:(7500, 18)

代表数据有7500行,18列

data.columns  # 所有的列名 data的属性

输出:

Index(['Id', 'Home Ownership', 'Annual Income', 'Years in current job',
       'Tax Liens', 'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
       'Credit Score', 'Credit Default'],
      dtype='object')

data.describe()  # 数值列的基本统计量 

输出:

IdAnnual IncomeTax LiensNumber of Open AccountsYears of Credit HistoryMaximum Open CreditNumber of Credit ProblemsMonths since last delinquentBankruptciesCurrent Loan AmountCurrent Credit BalanceMonthly DebtCredit ScoreCredit Default
count7500.0000005.943000e+037500.0000007500.0000007500.0000007.500000e+037500.0000003419.0000007486.0000007.500000e+037.500000e+037500.0000005943.0000007500.000000
mean3749.5000001.366392e+060.03013311.13093318.3174679.451537e+050.17000034.6926000.1171521.187318e+072.898332e+0518314.4541331151.0874980.281733
std2165.2078428.453392e+050.2716044.9089247.0419461.602622e+070.49859821.6888060.3471923.192612e+073.178714e+0511926.7646731604.4514180.449874
min0.0000001.645970e+050.0000002.0000004.0000000.000000e+000.0000000.0000000.0000001.124200e+040.000000e+000.000000585.0000000.000000
25%1874.7500008.443410e+050.0000008.00000013.5000002.792295e+050.00000016.0000000.0000001.801690e+051.142565e+0510067.500000711.0000000.000000
50%3749.5000001.168386e+060.00000010.00000017.0000004.781590e+050.00000032.0000000.0000003.095730e+052.093230e+0516076.500000731.0000000.000000
75%5624.2500001.640137e+060.00000014.00000021.8000007.935015e+050.00000050.0000000.0000005.198820e+053.604062e+0523818.000000743.0000001.000000
max7499.0000001.014934e+077.00000043.00000057.7000001.304726e+097.000000118.0000004.0000001.000000e+086.506797e+06136679.0000007510.0000001.000000

注:count:非缺失值的数量,mean:平均值,std:标准差

data.dtypes  # data types的缩写,查看每一列的数据类型,是data的属性

输出:

Id                                int64
Home Ownership                   object
Annual Income                   float64
Years in current job             object
Tax Liens                       float64
Number of Open Accounts         float64
Years of Credit History         float64
Maximum Open Credit             float64
Number of Credit Problems       float64
Months since last delinquent    float64
Bankruptcies                    float64
Purpose                          object
Term                             object
Current Loan Amount             float64
Current Credit Balance          float64
Monthly Debt                    float64
Credit Score                    float64
Credit Default                    int64
dtype: object

data["Annual Income"].dtype # 单独查看某一列的数据类型

输出:float64

data.isnull().sum()  # 统计每列中缺失值个数

输出:

Id                                 0
Home Ownership                     0
Annual Income                   1557
Years in current job             371
Tax Liens                          0
Number of Open Accounts            0
Years of Credit History            0
Maximum Open Credit                0
Number of Credit Problems          0
Months since last delinquent    4081
Bankruptcies                      14
Purpose                            0
Term                               0
Current Loan Amount                0
Current Credit Balance             0
Monthly Debt                       0
Credit Score                    1557
Credit Default                     0
dtype: int64

缺失值填补

先要知道哪些列有缺失值,就知道哪些列需要进行缺失值的填充

需要填补的列有Annual Income、Years in current job、Months since last delinquent、Bankruptcies、Credit Score

中位数填充(Annual Income为例)

# 计算Annual Income的中位数,自动忽略缺失值
median_Income = data["Annual Income"].median()
median_Income

输出:1168386.0

# 使用计算出的中位数填充缺失值
# fillna()方法用于填充缺失值,inplace=True表示直接在原DataFrame上进行修改,否则会返回一个新的DataFrame
# 修改的是data,而不是data.csv
data['Annual Income'].fillna(median_Income, inplace=True)
# 检查Annual Income列是否还有缺失值
data['Annual Income'].isnull().sum()

输出:0

众数填充(Annual Income为例)

data = pd.read_csv('data.csv')  # 重新读取数据
mode = data['Annual Income'].mode()  # 众数
# mode()会返回数据中出现频率最高的所有值
mode

输出:

0     969475.0
1    1043651.0
2    1058376.0
3    1161660.0
4    1338113.0
Name: Annual Income, dtype: float64
# 返回了5个频次最高的值,这里保留第一个
mode = mode[0]
# 众数填补
data['Annual Income'].fillna(mode, inplace=True)
# 检查是否有缺失值
data['Annual Income'].isnull().sum()

输出:0

for循环填补所有数值型缺失值

# 使用tolist()方法,将numpy数组和pandas对象转化为列表list
# 方便使用for循环对每一列进行遍历
c = data.columns.tolist()
c

输出:输出结果是一个列表,可以使用for循环遍历

['Id',
 'Home Ownership',
 'Annual Income',
 'Years in current job',
 'Tax Liens',
 'Number of Open Accounts',
 'Years of Credit History',
 'Maximum Open Credit',
 'Number of Credit Problems',
 'Months since last delinquent',
 'Bankruptcies',
 'Purpose',
 'Term',
 'Current Loan Amount',
 'Current Credit Balance',
 'Monthly Debt',
 'Credit Score',
 'Credit Default']

# 循环遍历列表c中的每一项
for i in c:
  # 找出数据类型是数值型的列,是则继续
  if data[i].dtype != 'object':
    # 判断列中是否有缺失值,有则继续
    if data[i].isnull().sum() > 0:
      # 找中位数进行补全
      mean_value = data[i].median()
      data[i].fillna(mean_value, inplace=True)
data.isnull().sum()      

输出:

Id                                0
Home Ownership                    0
Annual Income                     0
Years in current job            371
Tax Liens                         0
Number of Open Accounts           0
Years of Credit History           0
Maximum Open Credit               0
Number of Credit Problems         0
Months since last delinquent      0
Bankruptcies                      0
Purpose                           0
Term                              0
Current Loan Amount               0
Current Credit Balance            0
Monthly Debt                      0
Credit Score                      0
Credit Default                    0
dtype: int64

可见,除了object类型的Years in current job,所有数值型列都已无缺失值。

@浙大疏锦行

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值