Chapter 5: Data Cleaning and Wrangling

5.1 Data Cleaning

5.1.1 Handling Missing Values

1. Detecting missing values
>>> from pandas import DataFrame
>>> import numpy as np
>>> df1 = DataFrame([[3, 5, 3], [1, 6, np.nan],
...                  ['lili', np.nan, 'pop'], [np.nan, 'a', 'b']])
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3 non-null      object
 1   1       3 non-null      object
 2   2       3 non-null      object
dtypes: object(3)
memory usage: 224.0+ bytes
>>> df1.isnull()
       0      1      2
0  False  False  False
1  False  False   True
2  False   True  False
3   True  False  False
>>> df1.isnull().sum()
0    1
1    1
2    1
dtype: int64
>>> df1.isnull().sum().sum()
3
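The checks above can be combined into a small per-column summary. The following is a minimal sketch, not part of the original session: the helper name `missing_report` and its column names `n_missing` and `fraction` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same data as df1 in the session above.
df1 = pd.DataFrame([[3, 5, 3], [1, 6, np.nan],
                    ['lili', np.nan, 'pop'], [np.nan, 'a', 'b']])


def missing_report(df):
    """Hypothetical helper: missing count and missing fraction per column."""
    counts = df.isnull().sum()
    return pd.DataFrame({'n_missing': counts,
                         'fraction': counts / len(df)})


report = missing_report(df1)
print(report)

# Rows that contain at least one missing value.
rows_with_nan = df1[df1.isnull().any(axis=1)]
print(rows_with_nan.index.tolist())
```

Looking at the fraction rather than the raw count can help when deciding whether a column is worth keeping or should be dropped or filled.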
2. Dropping missing values
>>> df1.dropna()
   0  1  2
0  3  5  3
>>> df2 = DataFrame(np.arange(12).reshape(3, 4))
>>> df2[4] = np.nan
>>> df2
   0  1   2   3   4
0  0  1   2   3 NaN
1  4  5   6   7 NaN
2  8  9  10  11 NaN
>>> df2 = df2.astype(float)
>>> df2.loc[2] = np.nan
>>> df2
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  NaN  NaN  NaN  NaN NaN
>>> df2.dropna(how='all',axis=1)
     0    1    2    3
0  0.0  1.0  2.0  3.0
1  4.0  5.0  6.0  7.0
2  NaN  NaN  NaN  NaN
>>> df2.dropna(how='all')
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
>>> df2
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  NaN  NaN  NaN  NaN NaN
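Besides `how` and `axis`, `dropna` also accepts `thresh` and `subset`, which give finer control over what gets removed. The sketch below rebuilds a frame shaped like `df2` (created directly as floats here, which is an assumption made to keep the example self-contained) and shows the three variants side by side.

```python
import numpy as np
import pandas as pd

# Rebuild a frame like df2: an all-NaN column 4 and an all-NaN row 2.
df2 = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))
df2[4] = np.nan
df2.loc[2] = np.nan

# Drop all-NaN columns first, then all-NaN rows.
cleaned = df2.dropna(how='all', axis=1).dropna(how='all')

# thresh=N keeps only rows with at least N non-missing values.
at_least_four = df2.dropna(thresh=4)

# subset restricts the missing-value check to the listed columns.
by_col0 = df2.dropna(subset=[0])

print(cleaned)
print(at_least_four)
print(by_col0)
```

Note that none of these calls modify `df2`; each returns a new DataFrame, which is why the session above still shows the original frame after calling `dropna`.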
3. Filling missing values
>>> df2.fillna(0)
     0    1    2    3    4
0  0.0  1.0  2.0  3.0  0.0
1  4.0  5.0  6.0  7.0  0.0
2  0.0  0.0  0.0  0.0  0.0
>>> df2
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  NaN  NaN  NaN  NaN NaN
>>> df2.fillna({1: 6, 3: 0})
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  NaN  6.0  NaN  0.0 NaN
>>> df2.fillna({1: 6, 3: 0}, inplace=True)
>>> df2
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  NaN  6.0  NaN  0.0 NaN
>>> df2.ffill()  # same as the older fillna(method='ffill'), which recent pandas deprecates
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  4.0  6.0  6.0  0.0 NaN
>>> df2
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  NaN  6.0  NaN  0.0 NaN
>>> df2[0] = df2[0].fillna(df2[0].mean())
>>> df2
     0    1    2    3   4
0  0.0  1.0  2.0  3.0 NaN
1  4.0  5.0  6.0  7.0 NaN
2  2.0  6.0  NaN  0.0 NaN
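The single-column mean fill above extends to every column at once by passing the Series of column means to `fillna`. This is a sketch under the assumption that mean imputation suits the data; the frame is rebuilt here so the example runs on its own, and the names `filled` and `final` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild a frame like df2: an all-NaN column 4 and an all-NaN row 2.
df2 = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))
df2[4] = np.nan
df2.loc[2] = np.nan

# df2.mean() skips NaN by default and returns one mean per column;
# fillna matches those means to columns by label.
filled = df2.fillna(df2.mean())
print(filled)

# Column 4 has no observed values, so its mean is NaN and it stays empty.
# A constant fallback is one way to handle such columns.
final = filled.fillna(0)
print(final)
```

A column that is entirely missing cannot be imputed from its own values, so it needs either a fallback value as shown or removal with `dropna(how='all', axis=1)`.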

5.1.2 Removing Duplicate Data

>>> datac = {
...     'name': ['张三', '李四', '张三', '小明'],
...     'sex': ['female', 'male', 'female', 'male'],
...     'year': [2001, 2002, 2001, 2002],
...     'city': ['北京', '上海', '北京', '北京']
... }
>>> datac
{'name': ['张三', '李四', '张三', '小明'], 'sex': ['female', 'male', 'female', 'male'], 'year': [2001, 2002, 2001, 2002], 'city': ['北京', '上海', '北京', '北京']}
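The original session breaks off at this point. As a hedged sketch of how duplicates in this data could be found and removed, the example below uses the standard pandas methods `duplicated` and `drop_duplicates`; the variable names `dfc`, `dup_mask`, `deduped`, `first_per_group`, and `last_per_group` are assumptions and do not come from the source.

```python
import pandas as pd

# Same dictionary as datac above; rows 0 and 2 are identical.
datac = {
    'name': ['张三', '李四', '张三', '小明'],
    'sex': ['female', 'male', 'female', 'male'],
    'year': [2001, 2002, 2001, 2002],
    'city': ['北京', '上海', '北京', '北京'],
}
dfc = pd.DataFrame(datac)

# True for every row that repeats an earlier row in full.
dup_mask = dfc.duplicated()
print(dup_mask.tolist())

# Keep the first occurrence of each fully identical row.
deduped = dfc.drop_duplicates()
print(deduped)

# Judge duplicates on selected columns only; keep= picks which copy survives.
first_per_group = dfc.drop_duplicates(subset=['sex', 'year'])
last_per_group = dfc.drop_duplicates(subset=['sex', 'year'], keep='last')
print(first_per_group.index.tolist(), last_per_group.index.tolist())
```

Like `dropna` and `fillna`, `drop_duplicates` returns a new DataFrame and leaves `dfc` unchanged unless `inplace=True` is passed.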