Pandas空值识别,空值填充,空值过滤方法超详细解读

目录

Pandas默认空值处理方式

空值识别函数

isnull()/notnull()

用于查看不同行或列的空值个数

用于查看不同行或列的空值占比

空值填充函数

fillna()

使用0来填充空值

使用每列均值来填充空值

使用每列前面一个值来填充空值

使用每列后面一个值来填充空值

使用每行前一个值来填充空值

使用每行后一个值来填充空值

interpolate()

空值过滤函数

使用isnull()/notnull+any()/all()进行过滤

筛选出不含空值的行

筛选出有空值的行

dropna()


Pandas默认空值处理方式

pandas中默认会将None值转为np.nan进行处理,np.nan为float类型数据

以下为本篇博客使用的带有空值的DataFrame,可以看到输出的DataFrame中的NaN就是np.nan

import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1)
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
#     0   1     2     3     4     5
# 0  16  29  24.0  20.0  17.0  16.0
# 1  28  20   NaN  13.0  17.0  12.0
# 2  11  21  15.0  11.0  10.0  21.0
# 3  21  26  19.0  25.0  24.0   NaN
# 4  28  21  29.0   NaN   NaN  28.0
# 5  16  18  16.0  27.0  13.0  23.0

空值识别函数

isnull()/notnull()

isnull()的作用是将DataFrame中为空值的地方变为True,其余地方为False,notnull()则相反,不为空值的地方为True,其余为False

import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1)
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.isnull())
print(data.notnull())
#     0   1     2     3     4     5
# 0  16  29  24.0  20.0  17.0  16.0
# 1  28  20   NaN  13.0  17.0  12.0
# 2  11  21  15.0  11.0  10.0  21.0
# 3  21  26  19.0  25.0  24.0   NaN
# 4  28  21  29.0   NaN   NaN  28.0
# 5  16  18  16.0  27.0  13.0  23.0
#        0      1      2      3      4      5
# 0  False  False  False  False  False  False
# 1  False  False   True  False  False  False
# 2  False  False  False  False  False  False
# 3  False  False  False  False  False   True
# 4  False  False  False   True   True  False
# 5  False  False  False  False  False  False
#       0     1      2      3      4      5
# 0  True  True   True   True   True   True
# 1  True  True  False   True   True   True
# 2  True  True   True   True   True   True
# 3  True  True   True   True   True  False
# 4  True  True   True  False  False   True
# 5  True  True   True   True   True   True

用于查看不同行或列的空值个数

若需要显示不同行或列的空值个数,由于布尔类型数据在参与运算的时候True会默认转为1,False会默认转为0,所以只需要isnull()再加上sum()聚合函数即可,通过设置其axis参数来进行每行或每列的求和,值就为空值个数

import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.isnull().sum(axis=0))
print(data.isnull().sum(axis=1))
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
# a    0
# b    0
# c    1
# d    1
# e    1
# f    1
# dtype: int64
# A    0
# B    1
# C    0
# D    1
# E    2
# F    0
# dtype: int64

可以看到这里值为0的时候是沿行方向进行每一列的求和运算,为1的时候是沿列方向进行每一行的求和运算,得到的就为空值个数

用于查看不同行或列的空值占比

若要查看空值占比同样也只需要加上mean()聚合函数即可,mean()为和再除上个数,在布尔类型数据进行如此运算后,结果就为空值占比,同样也可以指定axis来选择行或列进行计算

import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.isnull().mean(axis=0))
print(data.isnull().mean(axis=1))
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
# a    0.000000
# b    0.000000
# c    0.166667
# d    0.166667
# e    0.166667
# f    0.166667
# dtype: float64
# A    0.000000
# B    0.166667
# C    0.000000
# D    0.166667
# E    0.333333
# F    0.000000
# dtype: float64

空值填充函数

空值填充一般会用到fillna()函数

fillna()

fillna()可以选用指定值作为空值填充,通过配置value参数即可设置填充值,并且支持列索引对齐的广播赋值,注意这里不支持行索引对齐的广播赋值机制,即使改变axis的值也不可以,并且支持通过配置method参数进行指定方式的空值赋值,method可选的有‘ffill’和‘bfill’,分别指用前一个值来填充空值和用后一个值来填充空值

使用0来填充空值
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(value=0))

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   0.0  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   0.0
# E  28  21  29.0   0.0   0.0  28.0
# F  16  18  16.0  27.0  13.0  23.0

使用每列均值来填充空值
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(value=data.mean(axis = 0)))

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20  20.6  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0  20.0
# E  28  21  29.0  19.2  16.2  28.0
# F  16  18  16.0  27.0  13.0  23.0

使用每列前面一个值来填充空值
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='ffill',axis=0))

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20  24.0  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0  21.0
# E  28  21  29.0  25.0  24.0  28.0
# F  16  18  16.0  27.0  13.0  23.0

这里设置了axis为0表示沿行方向计算

使用每列后面一个值来填充空值
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='bfill',axis=0))
# 
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20  15.0  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0  28.0
# E  28  21  29.0  27.0  13.0  28.0
# F  16  18  16.0  27.0  13.0  23.0

使用每行前一个值来填充空值
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='ffill',axis=1))

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#       a     b     c     d     e     f
# A  16.0  29.0  24.0  20.0  17.0  16.0
# B  28.0  20.0  20.0  13.0  17.0  12.0
# C  11.0  21.0  15.0  11.0  10.0  21.0
# D  21.0  26.0  19.0  25.0  24.0  24.0
# E  28.0  21.0  29.0  29.0  29.0  28.0
# F  16.0  18.0  16.0  27.0  13.0  23.0

这里axis为1表示沿列方向计算 

使用每行后一个值来填充空值
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.fillna(method='bfill',axis=1))

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#       a     b     c     d     e     f
# A  16.0  29.0  24.0  20.0  17.0  16.0
# B  28.0  20.0  13.0  13.0  17.0  12.0
# C  11.0  21.0  15.0  11.0  10.0  21.0
# D  21.0  26.0  19.0  25.0  24.0   NaN
# E  28.0  21.0  29.0  28.0  28.0  28.0
# F  16.0  18.0  16.0  27.0  13.0  23.0

interpolate()

interpolate()可以进行线性插值,通过设置axis来选择进行行方向的线性插值还是列方向的线性插值

import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.interpolate(axis=0))

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20  19.5  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0  24.5
# E  28  21  29.0  26.0  18.5  28.0
# F  16  18  16.0  27.0  13.0  23.0
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.interpolate(axis=1))

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#       a     b     c          d          e     f
# A  16.0  29.0  24.0  20.000000  17.000000  16.0
# B  28.0  20.0  16.5  13.000000  17.000000  12.0
# C  11.0  21.0  15.0  11.000000  10.000000  21.0
# D  21.0  26.0  19.0  25.000000  24.000000  24.0
# E  28.0  21.0  29.0  28.666667  28.333333  28.0
# F  16.0  18.0  16.0  27.000000  13.000000  23.0

空值过滤函数

使用isnull()/notnull+any()/all()进行过滤

筛选出不含空值的行
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.loc[data.notnull().all(axis=1)])

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# C  11  21  15.0  11.0  10.0  21.0
# F  16  18  16.0  27.0  13.0  23.0

筛选出有空值的行
import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.loc[data.isnull().any(axis=1)])

#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c     d     e     f
# B  28  20   NaN  13.0  17.0  12.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0

dropna()

dropna()相当于是集成后的删除空值所在列或行的函数,

axis参数用于指定是删除行还是列,为0表示删除有空值的行,为1代表删除有空值的列,

how参数可以指定为‘any’或‘all’,any表示有空值就删,all表示全为空则删,

subset参数要求赋值一个列表,比如在进行行删除的时候,可以选择用于判断是否具有空值的列,反之在进行列删除的时候可以选择判断是否具有空值的行,

inplace参数表示是否覆盖原内存

import pandas as pd
import numpy as np

np.random.seed(42)
l1 = np.random.randint(10,30,(6,6))
data = pd.DataFrame(data = l1,index=list('ABCDEF'),columns=list('abcdef'))
data.iloc[1,2] = None
data.iloc[3,5] = None
data.iloc[4,4] = None
data.iloc[4,3] = np.nan
print(data)
print(data.dropna(axis=1,how='any',subset=['E','D']))
print(data.dropna(axis=0,how='any',subset=['e','d']))


#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# E  28  21  29.0   NaN   NaN  28.0
# F  16  18  16.0  27.0  13.0  23.0
#     a   b     c
# A  16  29  24.0
# B  28  20   NaN
# C  11  21  15.0
# D  21  26  19.0
# E  28  21  29.0
# F  16  18  16.0
#     a   b     c     d     e     f
# A  16  29  24.0  20.0  17.0  16.0
# B  28  20   NaN  13.0  17.0  12.0
# C  11  21  15.0  11.0  10.0  21.0
# D  21  26  19.0  25.0  24.0   NaN
# F  16  18  16.0  27.0  13.0  23.0

PandasPython中用于数据分析和处理的一个强大库,其中处理空值(NaN,即"Not a Number")是数据清洗中的常见任务。在Pandas中,可以使用多种方法填充DataFrame或Series中的空值。以下是一些常见的空值填充方法: 1. `fillna()` 方法:这是Pandas提供的直接用于填充空值方法。你可以指定一个常数值或者使用其他数据点的值来填充空值。例如,使用一个常数值填充所有空值: ```python import pandas as pd df = pd.DataFrame({ 'A': [1, 2, None, 4], 'B': [5, None, None, 8] }) df_filled = df.fillna(0) ``` 或者使用某个列的均值、中位数等来填充: ```python mean_value = df['A'].mean() df['A'].fillna(mean_value, inplace=True) ``` 2. `replace()` 方法:虽然主要用于替换DataFrame中的特定值,但也可以用来替换NaN值。例如: ```python df.replace(to_replace=None, value=0) ``` 3. `interpolate()` 方法:如果数据是按顺序排列的,`interpolate()` 方法可以根据前后数据点进行插值填充空值,适用于时间序列数据等场景: ```python df.interpolate() ``` 4. 使用其他列的数据填充:你可以用一个列的值来填充另一个列的空值,例如: ```python df['A'].fillna(df['B'], inplace=True) ``` 5. 向前或向后填充:`bfill()` 和 `ffill()` 方法分别用于向后和向前填充空值,即用最近的非空值填充空值: ```python df.bfill() # 向后填充 df.ffill() # 向前填充 ```
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值