4.8 Handling missing values (NaN) in a Pandas DataFrame (Python)

ibun.song

Last modified 2023-02-10 14:39:09

Category: Python. Tags: python, pandas, data analysis

First published 2023-02-10 11:20:05

Article link: https://blog.csdn.net/qq_40805441/article/details/128964372

I. Creating the sample DataFrame

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

# Sample data: 'weight' mixes a string, integers and NaN, so its dtype is object
data = {
    'name': ['NAME0', 'NAME1', 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6', 'NAME7', 'NAME8', 'NAME9'],
    'age': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'weight': ["weight0", 101, 102, np.nan, np.nan, 105, np.nan, 107, 108, 109],
    'is_single_dog': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']
}

# Row labels for the DataFrame
indexs = ['index0', 'index1', 'index2', 'index3', 'index4', 'index5', 'index6', 'index7', 'index8', 'index9']

df = pd.DataFrame(data, index=indexs)

print(df)

Console output:

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3      NaN           yes
index4  NAME4    4      NaN            no
index5  NAME5    5      105            no
index6  NAME6    6      NaN            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no
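Before replacing anything, it can help to see where the missing values are. The following is a minimal sketch, not part of the original article: it uses a shortened copy of the sample data (first five rows, two columns) and the standard `isna()` method to count NaN per column and to list the affected rows.

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample data: first five rows, two columns only.
df = pd.DataFrame(
    {
        "age": [0, 1, 2, 3, 4],
        "weight": ["weight0", 101, 102, np.nan, np.nan],
    },
    index=["index0", "index1", "index2", "index3", "index4"],
)

# isna() marks each cell True (missing) or False; sum() counts the True cells per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan.index.tolist())
```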

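II. Handling missing values (NaN) in a DataFrame

1. Replacing missing values with a specific value

# Replace NaN with 0 (each snippet in this article is applied to the original df)
df = df.where(df.notnull(), 0)

print(df)

Console output: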

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3        0           yes
index4  NAME4    4        0            no
index5  NAME5    5      105            no
index6  NAME6    6        0            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no

# Replace NaN with an empty string ''
df = df.where(df.notnull(), '')

print(df)

Console output:

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3                    yes
index4  NAME4    4                     no
index5  NAME5    5      105            no
index6  NAME6    6                     no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no

# Replace NaN with None
df = df.where(df.notnull(), None)

print(df)

Console output:

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3     None           yes
index4  NAME4    4     None            no
index5  NAME5    5      105            no
index6  NAME6    6     None            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no

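2. Replacing missing values with the fillna function

Parameters of fillna:

Parameter	Description
value	Scalar or dict-like object used to fill the missing values
method	Fill method: 'ffill'/'pad' (forward) or 'bfill'/'backfill' (backward); the default is None, and the parameter is deprecated since pandas 2.1 in favor of ffill()/bfill()
axis	Axis along which to fill; axis=0 (the default behavior) fills down each column, axis=1 fills across each row
inplace	If True, modify the calling object instead of returning a new copy; the default is False
limit	Maximum number of consecutive NaN values to fill when filling forward or backward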
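The `value` parameter also accepts a dict, which the snippets in this section do not show. The sketch below illustrates that case under stated assumptions: the `city` column and the fill values are hypothetical and do not appear in the article's data. It also checks that `fillna` returns a new object when `inplace` is left at its default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "weight": [101.0, np.nan, 105.0],
        "city": ["A", None, "C"],  # hypothetical column, not in the article's data
    }
)

# A dict maps column name -> fill value, so each column gets its own replacement.
filled = df.fillna(value={"weight": 0, "city": "unknown"})
print(filled)

# fillna returns a new object by default, so the original still has its NaN.
print(df["weight"].isna().sum())
```

Columns that are not listed in the dict are left untouched.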

# Replace NaN with 0
df = df.fillna(value=0)

print(df)

Console output:

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3        0           yes
index4  NAME4    4        0            no
index5  NAME5    5      105            no
index6  NAME6    6        0            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no

# Replace NaN with an empty string ''
df = df.fillna(value='')

print(df)

Console output:

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3                    yes
index4  NAME4    4                     no
index5  NAME5    5      105            no
index6  NAME6    6                     no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no

# Replace NaN with None (this raises an error, because value=None means "no value given")
df = df.fillna(value=None)

print(df)

Console output:

ValueError: Must specify a fill 'value' or 'method'.

# Workaround: replace NaN with the string 'None' instead of the None object
df = df.fillna(value='None')

print(df)

Console output:

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3     None           yes
index4  NAME4    4     None            no
index5  NAME5    5      105            no
index6  NAME6    6     None            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no
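The printed table looks the same as with `where(..., None)`, but the cells now hold the string `'None'`, not the `None` object. If a real `None` is needed (for example before writing rows to a database), one commonly used approach is to cast to `object` first and then use `where`. This is a sketch on a three-row copy of the `weight` column, not the author's method; the exact behavior of `None` in numeric columns has changed between pandas versions, which is why the cast is included.

```python
import numpy as np
import pandas as pd

# Three-row copy of the article's 'weight' column.
df = pd.DataFrame(
    {"weight": ["weight0", 101, np.nan]},
    index=["index0", "index1", "index2"],
)

# Cast to object so every column can hold a Python None,
# then keep the non-missing cells and put None everywhere else.
with_none = df.astype(object).where(df.notna(), None)

print(with_none["weight"].tolist())

# pandas still treats None as a missing value.
print(with_none["weight"].isna().sum())
```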

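3. Filling missing values from neighboring rows

# Fill with the value from the previous row, at most 1 consecutive row
# (df.ffill replaces fillna(method='ffill'), which is deprecated since pandas 2.1)
df = df.ffill(limit=1)

print(df)

Console output: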

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3      102           yes
index4  NAME4    4      NaN            no
index5  NAME5    5      105            no
index6  NAME6    6      105            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no

Note: if a column has two consecutive NaN rows and the fill limit is 1, only the first of them is filled; the second remains NaN.
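The effect of `limit` is easiest to see on a single run of NaN values. The sketch below uses a small hypothetical Series (values taken from the `weight` column) and `Series.ffill`, the method form of the same forward fill.

```python
import numpy as np
import pandas as pd

# One run of two consecutive NaN values between 102 and 105.
s = pd.Series([102.0, np.nan, np.nan, 105.0])

one_step = s.ffill(limit=1)   # fills only the first NaN of the run
two_steps = s.ffill(limit=2)  # fills both NaN values

print(one_step.tolist())
print(two_steps.tolist())
```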

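# Fill with the value from the next row, at most 2 consecutive rows
# (df.bfill replaces fillna(method='backfill'), which is deprecated since pandas 2.1)
df = df.bfill(limit=2)

print(df)

Console output: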

         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3      105           yes
index4  NAME4    4      105            no
index5  NAME5    5      105            no
index6  NAME6    6      107            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no

4. Filling missing values with the mode

# Fill NaN with the most frequent value in the column
df['weight'] = df['weight'].fillna(df['weight'].mode()[0])

print(df)

Console output:

UserWarning: Unable to sort modes: '<' not supported between instances of 'str' and 'int'
  warn(f"Unable to sort modes: {err}")
  
         name  age   weight is_single_dog
index0  NAME0    0  weight0           yes
index1  NAME1    1      101           yes
index2  NAME2    2      102            no
index3  NAME3    3      101           yes
index4  NAME4    4      101            no
index5  NAME5    5      105            no
index6  NAME6    6      101            no
index7  NAME7    7      107           yes
index8  NAME8    8      108            no
index9  NAME9    9      109            no
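The warning appears because the column mixes a string with integers, so pandas cannot sort the candidate modes. One possible way around it, sketched here on a short hypothetical Series, is to convert the column to numbers first with `pd.to_numeric(..., errors="coerce")`. Note the assumption this makes: a non-numeric placeholder such as `"weight0"` is turned into NaN and is then filled as well, which may or may not be what is wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: one string, numbers, and one NaN.
weight = pd.Series(["weight0", 101, 101, np.nan, 105], dtype=object)

# errors="coerce" turns values that cannot be parsed as numbers into NaN.
numeric = pd.to_numeric(weight, errors="coerce")

# mode() ignores NaN by default and returns the modes sorted,
# so [0] is the smallest one if several values tie.
most_common = numeric.mode()[0]

filled = numeric.fillna(most_common)
print(filled.tolist())
```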

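5. Filling missing values with the mean

# Fill NaN with the column mean (the column must be numeric,
# so the complete code below uses 101 in place of "weight0")
df['weight'] = df['weight'].fillna(df['weight'].mean())

print(df)

Complete code: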

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

# Same sample data as before, but 'weight' is now purely numeric
data = {
    'name': ['NAME0', 'NAME1', 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6', 'NAME7', 'NAME8', 'NAME9'],
    'age': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'weight': [101, 101, 102, np.nan, np.nan, 105, np.nan, 107, 108, 109],
    'is_single_dog': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']
}

indexs = ['index0', 'index1', 'index2', 'index3', 'index4', 'index5', 'index6', 'index7', 'index8', 'index9']

df = pd.DataFrame(data, index=indexs)

print(df)
print()

# Fill NaN with the column mean
df['weight'] = df['weight'].fillna(df['weight'].mean())

print(df)

Console output:

         name  age  weight is_single_dog
index0  NAME0    0   101.0           yes
index1  NAME1    1   101.0           yes
index2  NAME2    2   102.0            no
index3  NAME3    3     NaN           yes
index4  NAME4    4     NaN            no
index5  NAME5    5   105.0            no
index6  NAME6    6     NaN            no
index7  NAME7    7   107.0           yes
index8  NAME8    8   108.0            no
index9  NAME9    9   109.0            no

         name  age      weight is_single_dog
index0  NAME0    0  101.000000           yes
index1  NAME1    1  101.000000           yes
index2  NAME2    2  102.000000            no
index3  NAME3    3  104.714286           yes
index4  NAME4    4  104.714286            no
index5  NAME5    5  105.000000            no
index6  NAME6    6  104.714286            no
index7  NAME7    7  107.000000           yes
index8  NAME8    8  108.000000            no
index9  NAME9    9  109.000000            no
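The fill value 104.714286 comes from the seven non-missing weights only: `mean()` skips NaN by default, so the sum 733 is divided by 7, not by 10. The short sketch below repeats that arithmetic on the `weight` column alone and also checks that filling with the mean leaves the column mean unchanged.

```python
import numpy as np
import pandas as pd

# The numeric 'weight' column from the complete code.
weight = pd.Series([101, 101, 102, np.nan, np.nan, 105, np.nan, 107, 108, 109])

# mean() skips NaN by default: 733 / 7 non-missing values.
mean_value = weight.mean()
print(round(mean_value, 6))

filled = weight.fillna(mean_value)

# Filling with the mean does not change the mean of the column.
print(round(filled.mean(), 6))
```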