Imports
import numpy as np
import pandas as pd
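There are two kinds of missing data (null values):
1. None
None is Python's built-in null object. It cannot take part in any arithmetic. An array that contains None has dtype object, and operations on object arrays are much slower than operations on int arrays: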
%timeit np.arange(1e6, dtype=object).sum()
114 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.arange(1e6, dtype=np.int32).sum()
3.81 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
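The claim that None cannot take part in arithmetic can be checked directly. The following is a minimal sketch (the variable names are illustrative, not from the original notebook): adding a number to None raises TypeError, and NumPy falls back to dtype object when an array contains None.

```python
import numpy as np

# Arithmetic with None raises TypeError.
try:
    None + 1
    none_error = None
except TypeError as exc:
    none_error = type(exc).__name__

# An array that contains None cannot use a numeric dtype,
# so NumPy stores generic Python objects instead.
arr = np.array([1, 2, None])

print(none_error)  # TypeError
print(arr.dtype)   # object
```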
2. np.nan
np.nan is a float, so it can take part in arithmetic, but the result is always NaN.
type(np.nan)
float
However, the np.nan*() family of functions (np.nansum, np.nanmean, and so on) can compute over data that contains NaN; they skip the NaN values.
n = np.array([1, 2, 3, np.nan, 5, 6])
n
array([ 1.,  2.,  3., nan,  5.,  6.])
np.sum(n)
nan
np.nansum(n)
17.0
np.nan + 10
nan
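As a further illustration of the same idea, here is a small sketch comparing a plain reduction with a few nan-aware ones on the array above. np.nanmean and np.nanmax are not used in the original notebook; they are shown here only as other members of the np.nan*() family.

```python
import numpy as np

n = np.array([1, 2, 3, np.nan, 5, 6])

# A plain reduction propagates NaN.
plain_mean = np.mean(n)

# The nan-aware versions skip the NaN entry.
total = np.nansum(n)    # 1 + 2 + 3 + 5 + 6
mean = np.nanmean(n)    # 17 / 5
largest = np.nanmax(n)

print(plain_mean, total, mean, largest)
```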
3. None and NaN in pandas
(1) In numeric columns, pandas treats both None and np.nan as np.nan
data = np.random.randint(0, 100, size=(5, 5))
df = pd.DataFrame(data=data, columns=list("ABCDE"))
df
    A   B   C   D   E
0  60  81   1  75  82
1   9  66  92  75  75
2  50  98  24  60  77
3  12  54  16  62  11
4   2  62  20  31  67
df.loc[2, "B"] = np.nan
df.loc[3, "C"] = None
df
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
2  50   NaN  24.0  60  77
3  12  54.0   NaN  62  11
4   2  62.0  20.0  31  67
df.loc[2, "B"], df.loc[3, "C"]
(nan, nan)
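The conversion shown above can be reproduced on a small Series. This is a minimal sketch with made-up values: integer data cannot hold NaN, so pandas stores the Series as float64 and turns None into NaN.

```python
import numpy as np
import pandas as pd

# Integer data plus None is stored as float64, with None converted to NaN.
s = pd.Series([1, 2, None])

print(s.dtype)                # float64
print(s.isnull().tolist())    # [False, False, True]
print(np.isnan(s.iloc[2]))    # True
```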
(2) Working with None and np.nan in pandas
df
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
2  50   NaN  24.0  60  77
3  12  54.0   NaN  62  11
4   2  62.0  20.0  31  67
df.isnull()
       A      B      C      D      E
0  False  False  False  False  False
1  False  False  False  False  False
2  False   True  False  False  False
3  False  False   True  False  False
4  False  False  False  False  False
df.notnull()
      A      B      C     D     E
0  True   True   True  True  True
1  True   True   True  True  True
2  True  False   True  True  True
3  True   True  False  True  True
4  True   True   True  True  True
df.isnull().any()
A    False
B     True
C     True
D    False
E    False
dtype: bool
df.isnull().all()
A    False
B    False
C    False
D    False
E    False
dtype: bool
df.notnull().all()
A     True
B    False
C    False
D     True
E     True
dtype: bool
df.notnull().any()
A    True
B    True
C    True
D    True
E    True
dtype: bool
df.isnull().any(axis=1)
0    False
1    False
2     True
3     True
4    False
dtype: bool
df.notnull().all(axis=1)
0     True
1     True
2    False
3    False
4     True
dtype: bool
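Besides any() and all(), the boolean frame returned by isnull() can be summed to count missing values. The sketch below uses a small made-up DataFrame rather than the random one above; sum() treats True as 1, so it yields the number of missing entries per column or, with axis=1, per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [60, 9, 50],
    "B": [81.0, np.nan, 98.0],
    "C": [np.nan, np.nan, 24.0],
})

# True counts as 1, so summing the mask counts the missing values.
missing_per_column = df.isnull().sum()
missing_per_row = df.isnull().sum(axis=1)

print(missing_per_column.tolist())  # [0, 1, 2]
print(missing_per_row.tolist())     # [1, 2, 0]
```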
df
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
2  50   NaN  24.0  60  77
3  12  54.0   NaN  62  11
4   2  62.0  20.0  31  67
cond = df.isnull().any(axis=1)
display(~cond)
df[~cond]
0     True
1     True
2    False
3    False
4     True
dtype: bool
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
4   2  62.0  20.0  31  67
cond = df.notnull().all(axis=1)
df[cond]
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
4   2  62.0  20.0  31  67
cond = df.isnull().any()
df.loc[:, ~cond]
    A   D   E
0  60  75  82
1   9  75  75
2  50  60  77
3  12  62  11
4   2  31  67
cond = df.notnull().all()
df.loc[:, cond]
    A   D   E
0  60  75  82
1   9  75  75
2  50  60  77
3  12  62  11
4   2  31  67
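The boolean-mask filters above should give the same result as DataFrame.dropna, which the next part introduces. The following sketch checks this on a small DataFrame with fixed values (chosen here for reproducibility, not taken from the random data above).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [60, 9, 50, 12, 2],
    "B": [81, 66, np.nan, 54, 62],
    "C": [1, 92, 24, np.nan, 20],
})

# Keep rows (or columns) in which every value is present.
rows_by_mask = df[df.notnull().all(axis=1)]
cols_by_mask = df.loc[:, df.notnull().all()]

print(rows_by_mask.equals(df.dropna()))        # True
print(cols_by_mask.equals(df.dropna(axis=1)))  # True
```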
With dropna(), you can choose whether to drop rows or columns that contain missing values (rows by default).
df
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
2  50   NaN  24.0  60  77
3  12  54.0   NaN  62  11
4   2  62.0  20.0  31  67
df.dropna()
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
4   2  62.0  20.0  31  67
df.dropna(axis=1)
    A   D   E
0  60  75  82
1   9  75  75
2  50  60  77
3  12  62  11
4   2  31  67
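You can also choose the filtering rule with how: "any" (the default) drops a row or column if it contains any missing value, while "all" drops it only if every value is missing.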
df.dropna(how="any")
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
4   2  62.0  20.0  31  67
df.dropna(how="all")
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
2  50   NaN  24.0  60  77
3  12  54.0   NaN  62  11
4   2  62.0  20.0  31  67
df.dropna(how="all", axis=1)
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
2  50   NaN  24.0  60  77
3  12  54.0   NaN  62  11
4   2  62.0  20.0  31  67
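In the outputs above, how="all" drops nothing because no row or column is entirely missing. To make the difference visible, here is a minimal sketch on a made-up DataFrame in which one row has no valid values at all.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
})

# how="any": drop a row as soon as one value is missing.
any_rows = df.dropna(how="any")

# how="all": drop a row only when every value is missing.
all_rows = df.dropna(how="all")

print(any_rows.index.tolist())  # [2]
print(all_rows.index.tolist())  # [0, 2]
```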
inplace=True modifies the original data instead of returning a new DataFrame.
df2 = df.copy()
df2
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
2  50   NaN  24.0  60  77
3  12  54.0   NaN  62  11
4   2  62.0  20.0  31  67
df2.dropna(inplace=True)
df2
    A     B     C   D   E
0  60  81.0   1.0  75  82
1   9  66.0  92.0  75  75
4   2  62.0  20.0  31  67
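One point worth checking when using inplace=True is the return value. The sketch below (illustrative values and names) contrasts the default call, which returns a new DataFrame and leaves the original untouched, with the in-place call, which returns None.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# Default: a new DataFrame is returned, df keeps all three rows.
copy_result = df.dropna()
n_before = len(df)

# inplace=True: df itself is modified and the call returns None.
inplace_result = df.dropna(inplace=True)

print(len(copy_result), n_before)   # 2 3
print(inplace_result, len(df))      # None 2
```

Assigning the result of an in-place call back to the variable (df = df.dropna(inplace=True)) would therefore replace the DataFrame with None.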
(3) Filling missing values: fillna() on Series/DataFrame
data = np.random.randint(0, 100, size=(5, 5))
df = pd.DataFrame(data=data, columns=list("ABCDE"))
df
df.loc[2, "B"] = np.nan
df.loc[3, "C"] = None
df
    A     B     C   D   E
0  60  83.0  94.0  82  24
1  24  93.0  37.0  14  45
2  52   NaN  72.0  36  27
3  51  70.0   NaN  68  13
4  11  51.0  57.0  20  94
df.fillna(value=100)
    A      B      C   D   E
0  60   83.0   94.0  82  24
1  24   93.0   37.0  14  45
2  52  100.0   72.0  36  27
3  51   70.0  100.0  68  13
4  11   51.0   57.0  20  94
df2 = df.copy()
df2.loc[1, "B"] = np.nan
df2.loc[2, "C"] = np.nan
df2
    A     B     C   D   E
0  60  83.0  94.0  82  24
1  24   NaN  37.0  14  45
2  52   NaN   NaN  36  27
3  51  70.0   NaN  68  13
4  11  51.0  57.0  20  94
df2.fillna(value=100, limit=1, inplace=True)
df2
    A      B      C   D   E
0  60   83.0   94.0  82  24
1  24  100.0   37.0  14  45
2  52    NaN  100.0  36  27
3  51   70.0    NaN  68  13
4  11   51.0   57.0  20  94
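As the output above suggests, limit=1 fills at most one missing value per column, counting from the top. The following sketch reproduces that on fixed values and also shows a per-column fill using a dict, which is not part of the original notebook; the fill values 0 and -1 are arbitrary placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "B": [83.0, np.nan, np.nan, 70.0],
    "C": [94.0, 37.0, np.nan, np.nan],
})

# limit=1: only the first missing value in each column is filled.
limited = df.fillna(value=100, limit=1)

# A dict maps each column name to its own fill value.
per_column = df.fillna(value={"B": 0, "C": -1})

print(limited["B"].isnull().tolist())  # [False, False, True, False]
print(per_column["C"].tolist())        # [94.0, 37.0, -1.0, -1.0]
```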
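You can choose between forward fill ("ffill", which copies the previous valid value) and backward fill ("backfill", which copies the next valid value).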
df
    A     B     C   D   E
0  60  83.0  94.0  82  24
1  24  93.0  37.0  14  45
2  52   NaN  72.0  36  27
3  51  70.0   NaN  68  13
4  11  51.0  57.0  20  94
df.fillna(method="ffill")
    A     B     C   D   E
0  60  83.0  94.0  82  24
1  24  93.0  37.0  14  45
2  52  93.0  72.0  36  27
3  51  70.0  72.0  68  13
4  11  51.0  57.0  20  94
df.fillna(method="backfill")
    A     B     C   D   E
0  60  83.0  94.0  82  24
1  24  93.0  37.0  14  45
2  52  70.0  72.0  36  27
3  51  70.0  57.0  68  13
4  11  51.0  57.0  20  94
df.fillna(method="ffill", axis=1)
      A     B     C     D     E
0  60.0  83.0  94.0  82.0  24.0
1  24.0  93.0  37.0  14.0  45.0
2  52.0  52.0  72.0  36.0  27.0
3  51.0  70.0  70.0  68.0  13.0
4  11.0  51.0  57.0  20.0  94.0
df.fillna(method="backfill", axis=1)
      A     B     C     D     E
0  60.0  83.0  94.0  82.0  24.0
1  24.0  93.0  37.0  14.0  45.0
2  52.0  72.0  72.0  36.0  27.0
3  51.0  70.0  68.0  68.0  13.0
4  11.0  51.0  57.0  20.0  94.0
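Recent pandas releases (2.1 and later) emit a deprecation warning for the method argument of fillna and recommend the dedicated ffill() and bfill() methods instead. The sketch below shows these equivalents on a small DataFrame with fixed values; it is an illustration under that assumption about your pandas version, not a change to the calls above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "B": [83.0, 93.0, np.nan, 70.0],
    "C": [94.0, 37.0, 72.0, np.nan],
})

# Forward fill: each gap takes the previous valid value in its column.
forward = df.ffill()

# Backward fill: each gap takes the next valid value in its column.
# The last value of C has nothing after it, so it stays NaN.
backward = df.bfill()

print(forward["B"].tolist())   # [83.0, 93.0, 93.0, 70.0]
print(forward["C"].tolist())   # [94.0, 37.0, 72.0, 72.0]
print(backward["B"].tolist())  # [83.0, 93.0, 70.0, 70.0]
```

Both methods also accept axis=1 to fill along rows instead of down columns.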