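I. Handling Missing Data
Pandas uses numpy.nan (NaN) to represent missing data; the older alias numpy.NaN was removed in NumPy 2.0. By default, NaN values are excluded from computations such as sum() and mean().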
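As a minimal sketch of what "excluded from computations" means (the Series `s` below is made up for illustration and is not part of the original walkthrough), reductions skip NaN by default, while `skipna=False` lets it propagate:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one missing value among three entries.
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                     # NaN is skipped: 1.0 + 3.0
average = s.mean()                  # mean over the two non-missing values
strict_total = s.sum(skipna=False)  # NaN propagates when skipping is disabled

print(total, average, strict_total)
```

Element-wise arithmetic behaves differently from reductions: any operation with a NaN operand yields NaN.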
# Using Jupyter as an example, import the modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create a table
dates = pd.date_range('2019-01-01', periods=6)  # generate a date sequence
# The values are random, so the numbers differ on every run
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
'''
A B C D
2019-01-01 0.388482 0.513801 0.919633 -0.126457
2019-01-02 1.686455 0.839041 -0.782675 -0.763016
2019-01-03 -1.390973 0.072897 1.169534 0.302198
2019-01-04 1.088851 0.466736 -0.568150 -0.967236
2019-01-05 -1.446803 -0.836174 0.296793 -1.080340
2019-01-06 -0.153770 -1.415490 1.378581 0.125360
'''
# Keep the first four rows and add a new column 'E'; cells without assigned values default to NaN
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1
'''
A B C D E
2019-01-01 -1.880641 0.722718 -1.520591 0.143024 NaN
2019-01-02 -0.571412 0.155547 -0.673400 -0.462290 NaN
2019-01-03 0.090011 0.218589 -0.615705 0.529711 NaN
2019-01-04 -0.397896 -0.214237 -1.788470 0.802196 NaN
'''
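The reindex step can be checked in isolation. The sketch below rebuilds a frame of the same shape, confirms that the new column is entirely NaN, and shows the `fill_value` parameter of `reindex` as an alternative default; the names `df2` and `e_all_nan` are illustrative only:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Keep the first four rows and request a column 'E' that does not exist yet.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
e_all_nan = bool(df1["E"].isna().all())

# Same column layout, but new cells are filled with 0.0 instead of NaN.
df2 = df.reindex(columns=list(df.columns) + ["E"], fill_value=0.0)

print(df1.shape, e_all_nan, df.shape)
```

`reindex` returns a new object, so the original `df` keeps its six rows and four columns.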
1. Assigning values to missing entries
df1.loc[dates[1:3], 'E'] = 0.2  # set E for 2019-01-02 and 2019-01-03
df1
'''
A B C D E
2019-01-01 -1.643946 -0.164271 -0.330477 0.425510 NaN
2019-01-02 0.654953 -0.637758 1.356877 -0.696242 0.2
2019-01-03 -0.196444 -0.783897 2.368650 -0.109155 0.2
2019-01-04 -0.411827 0.558476 -0.465239 0.241478 NaN
'''
The same kind of assignment can be written with a label slice, which includes both endpoints:
df1.loc['2019-01-02':'2019-01-03', 'E'] = 0.5
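To make the inclusive-endpoint behaviour concrete, here is a small sketch on a made-up single-column frame (not the tutorial's data). It assigns through a label slice and then reads the same rows back through the index object:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", periods=4)
df1 = pd.DataFrame({"E": [np.nan] * 4}, index=dates)

# Label-based slice: both '2019-01-02' and '2019-01-03' are included.
df1.loc["2019-01-02":"2019-01-03", "E"] = 0.5

# Equivalent row selection through the index object itself.
same_rows = df1.loc[dates[1:3], "E"]

print(df1)
```

Note the contrast: the positional slice `dates[1:3]` excludes its end position, whereas the label slice includes its end label, yet both select the same two rows here.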
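2. Dropping rows and columns with missing data (dropna)
Use dropna(). It returns a new DataFrame and leaves df1 unchanged:
df1.dropna()  # drops rows containing NaN by default; axis=1 drops columns instead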
'''
A B C D E
2019-01-02 0.654953 -0.637758 1.356877 -0.696242 0.5
2019-01-03 -0.196444 -0.783897 2.368650 -0.109155 0.5
'''
The subset parameter restricts the check to the listed columns; here, every row with NaN in column E is dropped:
df1.dropna(subset=['E'])
'''
A B C D E
2019-01-02 0.654953 -0.637758 1.356877 -0.696242 0.5
2019-01-03 -0.196444 -0.783897 2.368650 -0.109155 0.5
'''
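The dropna variants are easier to compare on a small deterministic frame. The data below is invented for illustration; `axis`, `subset` and `how` are standard `DataFrame.dropna` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical data: B has one NaN, E has two.
df1 = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [1.0, np.nan, 3.0, 4.0],
        "E": [np.nan, 0.5, 0.5, np.nan],
    }
)

rows_any = df1.dropna()            # drop rows with at least one NaN
cols_any = df1.dropna(axis=1)      # drop columns with at least one NaN
rows_e = df1.dropna(subset=["E"])  # only column E decides which rows go
rows_all = df1.dropna(how="all")   # drop rows only if every value is NaN

print(rows_any.index.tolist(), cols_any.columns.tolist())
print(rows_e.index.tolist(), rows_all.index.tolist())
```

In the tutorial's own df1 only column E contains NaN, which is why `dropna()` and `dropna(subset=['E'])` produce the same result there.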
3. Filling all missing data (fillna)
df1.fillna(value=0.2)  # returns a new DataFrame; df1 itself is unchanged
'''
A B C D E
2019-01-01 -1.643946 -0.164271 -0.330477 0.425510 0.2
2019-01-02 0.654953 -0.637758 1.356877 -0.696242 0.5
2019-01-03 -0.196444 -0.783897 2.368650 -0.109155 0.5
2019-01-04 -0.411827 0.558476 -0.465239 0.241478 0.2
'''
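A brief sketch of two fillna details, using a made-up two-column frame: the call returns a copy rather than modifying the original, and `value` may also be a dict that maps column names to per-column fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each column.
df1 = pd.DataFrame({"D": [0.4, np.nan], "E": [np.nan, 0.5]})

filled = df1.fillna(value=0.2)                        # one value for every NaN
per_column = df1.fillna(value={"D": 0.0, "E": 0.2})   # a different value per column

remaining_in_original = int(df1.isna().sum().sum())   # df1 is untouched

print(filled)
print(per_column)
print(remaining_in_original)
```

To keep the filled result, assign it back, for example `df1 = df1.fillna(value=0.2)`.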
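4. Checking for NaN values
NaN (Not a Number) is a special floating-point value meaning "not a number". It is neither positive infinity nor negative infinity, and it does not compare equal to anything, including itself, so it cannot be detected with ==.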
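These properties can be verified directly. The sketch below uses only the standard library and NumPy; the variable names are illustrative:

```python
import math
import numpy as np

nan = float("nan")

equals_itself = (nan == nan)      # False: NaN never compares equal
is_nan_numpy = bool(np.isnan(nan))
is_nan_math = math.isnan(np.nan)  # np.nan is an ordinary Python float
is_infinite = bool(np.isinf(nan)) # NaN is not an infinity

print(equals_itself, is_nan_numpy, is_nan_math, is_infinite)
```

Because equality tests fail, dedicated predicates such as `np.isnan`, `math.isnan` or the pandas `isna` functions are the reliable way to detect NaN.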
If any NaN is present, the check returns True.
# Check whether df1 contains any missing values; returns True if it does, e.g.
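One common way to perform such a check, sketched here on a small made-up frame and not necessarily the exact call the text has in mind, combines `isna()` with `any()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column E contains a missing value.
df1 = pd.DataFrame({"A": [1.0, 2.0], "E": [np.nan, 0.5]})

per_column = df1.isna().any()             # one boolean per column
has_nan = bool(df1.isna().any().any())    # single boolean for the whole frame
nan_count = int(df1.isna().sum().sum())   # total number of missing cells

print(per_column)
print(has_nan, nan_count)
```

The first `any()` reduces each column to a boolean and the second reduces those to a single answer for the whole DataFrame.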