Data Cleaning and Preparation

  1. Filtering out missing values: dropna

    )1. On a Series, dropna filters out the entries that hold missing values
    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame
    from numpy import nan as NA
    data = Series(['a','b',NA,'d'])
    print(data.dropna())
    0    a
    1    b
    3    d
    dtype: object
    
    
    )2. On a DataFrame, dropna by default drops every row that contains a missing value
    data2 = DataFrame([[1,2,3],[4,NA,6],[7,8,9],[NA,NA,NA]])
    print(data2)
         0    1    2
    0  1.0  2.0  3.0
    1  4.0  NaN  6.0
    2  7.0  8.0  9.0
    3  NaN  NaN  NaN
    
    print(data2.dropna())
         0    1    2
    0  1.0  2.0  3.0
    2  7.0  8.0  9.0
    
    
    )3. With how='all', only rows whose values are all NA are dropped
    print(data2.dropna(how='all'))
         0    1    2
    0  1.0  2.0  3.0
    1  4.0  NaN  6.0
    2  7.0  8.0  9.0
    
    
    )4. With axis=1, columns that contain missing values are dropped
    print(data2.dropna(axis=1))
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3]
        
    print(data2.dropna(how='all',axis=1))
         0    1    2
    0  1.0  2.0  3.0
    1  4.0  NaN  6.0
    2  7.0  8.0  9.0
    3  NaN  NaN  NaN
    
    
    )5. With thresh=n, only rows/columns that have at least n non-NA values are kept
    thresh: short for "threshold"
    df = DataFrame(np.random.randn(7,4))
    df.iloc[:4,:2]=NA
    df.iloc[:2,2]=NA
    print(df)
              0         1         2         3
    0       NaN       NaN       NaN  0.552708
    1       NaN       NaN       NaN -0.032440
    2       NaN       NaN -0.451361 -1.666976
    3       NaN       NaN  0.289092 -0.750890
    4 -0.508130 -1.409111  0.133071 -0.033718
    5  1.253351 -0.399418 -1.522084  0.264536
    6  0.794228 -0.781878 -0.872452  0.933511
    
    print(df.dropna(thresh=3))
              0         1         2         3
    4 -0.508130 -1.409111  0.133071 -0.033718
    5  1.253351 -0.399418 -1.522084  0.264536
    6  0.794228 -0.781878 -0.872452  0.933511
    
    # the output below comes from a separate run, so the random values differ from the frame above
    print(df.dropna(thresh=2))
              0         1         2         3
    2       NaN       NaN -0.102649 -1.341281
    3       NaN       NaN  1.699763  0.445703
    4 -0.248189  1.021283 -1.104852 -1.672537
    5  0.519167  1.364827 -0.119368 -0.688406
    6 -1.350202 -1.876677 -0.250996 -0.405626
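
The random frames above change on every run, so here is a minimal sketch of the same dropna options on a small fixed frame. The column names a, b, c and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result does not depend on random numbers.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [4.0, np.nan, 6.0],
     [np.nan, np.nan, 9.0],
     [np.nan, np.nan, np.nan]],
    columns=["a", "b", "c"],
)

any_na = df.dropna()                 # drop rows with at least one NA
all_na = df.dropna(how="all")        # drop rows where every value is NA
at_least_2 = df.dropna(thresh=2)     # keep rows with 2 or more non-NA values
cols = df.dropna(axis=1, how="all")  # drop columns that are entirely NA

print(any_na.index.tolist())      # [0]
print(all_na.index.tolist())      # [0, 1, 2]
print(at_least_2.index.tolist())  # [0, 1]
print(cols.columns.tolist())      # ['a', 'b', 'c']
```

None of these calls modifies df; each one returns a new object.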
    
  2. Filling in missing values: fillna

    )1. Replace every NA with a constant
    df = DataFrame(np.random.randn(7,4))
    df.iloc[:4,:2]=NA
    df.iloc[:2,2]=NA
    print(df)
              0         1         2         3
    0       NaN       NaN       NaN -1.342383
    1       NaN       NaN       NaN  0.457186
    2       NaN       NaN  0.688483 -0.793584
    3       NaN       NaN -1.441301  0.850932
    4  0.517123  1.534299  0.587808 -0.074566
    5 -0.630781  1.533494 -0.682704 -2.778620
    6  0.735230 -0.268278  1.510993 -1.083807
    
    print(df.fillna(0))
              0         1         2         3
    0  0.000000  0.000000  0.000000 -1.342383
    1  0.000000  0.000000  0.000000  0.457186
    2  0.000000  0.000000  0.688483 -0.793584
    3  0.000000  0.000000 -1.441301  0.850932
    4  0.517123  1.534299  0.587808 -0.074566
    5 -0.630781  1.533494 -0.682704 -2.778620
    6  0.735230 -0.268278  1.510993 -1.083807
    
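    )2. Pass a dict keyed by column label to use a different fill value for each column
    # the output below comes from a separate run, so the random values differ from the frame above
    print(df.fillna({0:1,1:2,2:3}))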
              0         1         2         3
    0  1.000000  2.000000  3.000000  1.790328
    1  1.000000  2.000000  3.000000 -0.080825
    2  1.000000  2.000000 -0.305895 -0.956160
    3  1.000000  2.000000 -0.912206  1.343991
    4  0.634614 -1.312406  1.119154 -1.272266
    5 -0.959586 -1.988487  0.638590 -1.002639
    6 -1.338498  0.657485  0.667352  0.032378
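
A small deterministic sketch of the two fillna forms above. The third call, filling each column with its own mean, is not from the notes; it is one common choice, shown here as an assumption about what you might want.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, np.nan]})

constant = df.fillna(0)                     # same value everywhere
per_column = df.fillna({"x": -1, "y": -2})  # dict keys are column labels
by_mean = df.fillna(df.mean())              # each column filled with its own mean

print(constant)
print(per_column)
print(by_mean)
print(df.isna().sum().sum())  # fillna returned new objects; df still holds 3 NA
```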
    
  3. Removing duplicates: duplicated, drop_duplicates

    )1. duplicated: returns a boolean Series that marks each row repeating an earlier row
    data = DataFrame({'k1':['one','two']*3+['two'],
                      'k2':[1,1,2,3,3,4,4]})
    print(data)
        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4
    6  two   4
    
    print(data.duplicated())
    0    False
    1    False
    2    False
    3    False
    4    False
    5    False
    6     True
    dtype: bool
      
    )2. drop_duplicates: returns a DataFrame with the duplicate rows removed
    print(data.drop_duplicates())
        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4
    
    )3. To deduplicate on specific columns, pass the column names
    data['k3']=range(7)
    print(data)
        k1  k2  k3
    0  one   1   0
    1  two   1   1
    2  one   2   2
    3  two   3   3
    4  one   3   4
    5  two   4   5
    6  two   4   6
    
    print(data.drop_duplicates(['k1']))
        k1  k2  k3
    0  one   1   0
    1  two   1   1
    
    print(data.drop_duplicates(['k2']))
        k1  k2  k3
    0  one   1   0
    2  one   2   2
    3  two   3   3
    5  two   4   5
    
    )4. With keep='last', the last occurrence of each duplicate is kept (the default keeps the first)
    print(data.drop_duplicates(['k1'],keep='last'))
        k1  k2  k3
    4  one   3   4
    6  two   4   6
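
The relationship between duplicated and drop_duplicates can be checked on a tiny frame. This is a sketch with made-up data; subset= is the keyword form of the column list passed positionally above.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                     "k2": [1, 1, 1, 1]})

mask = data.duplicated()                     # True for rows that repeat an earlier row
first = data.drop_duplicates()               # keeps the first of each repeated row
last = data.drop_duplicates(keep="last")     # keeps the last of each repeated row
by_k2 = data.drop_duplicates(subset=["k2"])  # compares only column k2

print(mask.tolist())          # [False, False, True, True]
print(first.index.tolist())   # [0, 1]
print(last.index.tolist())    # [2, 3]
print(by_k2.index.tolist())   # [0]
```

With the default keep='first', drop_duplicates() gives the same rows as data[~data.duplicated()].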
    
  4. Transforming data with a function or mapping: map

    The map method of a Series (for example, one column of a DataFrame) accepts a function or a dict-like object that holds the mapping
    
    data = DataFrame({'name':['zxw','zhj']*4,
                      'grade':[71,0,75,0,100,0,124,0]})
    print(data)
      name  grade
    0  zxw     71
    1  zhj      0
    2  zxw     75
    3  zhj      0
    4  zxw    100
    5  zhj      0
    6  zxw    124
    7  zhj      0
    
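    # maps each name to a sex label; 'nan' and 'nv' are pinyin for male and female
    # ('nan' here is an ordinary string, not a missing value)
    sex_to_people={
        'zxw':'nan','zhj':'nv'
    }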
    data['sex'] = data['name'].map(sex_to_people)
    print(data)
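
A short sketch of both ways to call map. The lookup table name_to_group and the extra name "abc" are hypothetical, added to show what happens to a key that the dict does not contain.

```python
import pandas as pd

names = pd.Series(["zxw", "zhj", "abc"])

# Hypothetical lookup table; "abc" is deliberately missing from it.
name_to_group = {"zxw": "A", "zhj": "B"}

by_dict = names.map(name_to_group)  # values without a matching key become NaN
by_func = names.map(str.upper)      # a function is applied element by element

print(by_dict.isna().tolist())  # [False, False, True]
print(by_func.tolist())         # ['ZXW', 'ZHJ', 'ABC']
```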
    
  5. Replacing values: replace

    data = Series([1,2,4,9,16])
    print(data)
    0     1
    1     2
    2     4
    3     9
    4    16
    dtype: int64
        
    )1. Replace a single value
    print(data.replace(1,4))
    0     4
    1     2
    2     4
    3     9
    4    16
    dtype: int64
        
    )2. Replace several values with the same value
    print(data.replace([4,9],0))
    0     1
    1     2
    2     0
    3     0
    4    16
    
    )3. Use lists to replace several values with different values
    print(data.replace([4,9],[-4,-9]))
    0     1
    1     2
    2    -4
    3    -9
    4    16
    
    )4. Use a dict to replace several values with different values
    print(data.replace({16:-16,9:-9}))
    0     1
    1     2
    2     4
    3    -9
    4   -16
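
One place replace fits well with the earlier sections is turning sentinel codes into real missing values. The sketch below assumes, purely for illustration, that -999 and -1000 are such codes in a series of readings.

```python
import numpy as np
import pandas as pd

# Hypothetical readings in which -999 and -1000 stand for "no measurement".
readings = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

cleaned = readings.replace([-999.0, -1000.0], np.nan)

print(cleaned.isna().sum())      # 3
print(readings.isna().sum())     # 0, because replace returned a new Series
print(cleaned.dropna().tolist()) # [1.0, 2.0, 3.0]
```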
    
  6. Renaming axis labels: rename

    data = DataFrame(np.arange(16).reshape((4,4)),
                     index = ['one','two','three','four'])
    print(data)
            0   1   2   3
    one     0   1   2   3
    two     4   5   6   7
    three   8   9  10  11
    four   12  13  14  15
    
    print(data.rename(index={'one':'11'},
                      columns={0:'one'}))
           one   1   2   3
    11       0   1   2   3
    two      4   5   6   7
    three    8   9  10  11
    four    12  13  14  15
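
Besides dicts, rename also accepts a function that is applied to every label. A minimal sketch on the same frame; the "col" prefix is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["one", "two", "three", "four"])

# A function renames every label; a dict renames only the labels it lists.
renamed = data.rename(index=str.upper, columns=lambda c: f"col{c}")

print(renamed.index.tolist())    # ['ONE', 'TWO', 'THREE', 'FOUR']
print(renamed.columns.tolist())  # ['col0', 'col1', 'col2', 'col3']
print(data.index.tolist())       # ['one', 'two', 'three', 'four'], original unchanged
```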
    
  7. Discretization and binning: cut, qcut (group the data and place it into discrete bins)

    ages = [20,22,25,27,21,23,37,31,61,45,41,32]
    bins = [18,25,35,60,100]
    c = pd.cut(ages,bins)
    )1. print(c): shows the interval that each value falls into
    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
                                                                        
    )2. print(c.codes): shows the index of the bin that each value falls into
    [0 0 0 1 0 0 2 1 3 2 2 1]
    
    )3. print(pd.value_counts(c)): counts how many values fall into each bin
    (18, 25]     5
    (35, 60]     3
    (25, 35]     3
    (60, 100]    1                                                                         
    
    )4. The right parameter controls which side of each interval is closed; right=False closes the left side instead
    ages = [20,22,25,27,21,23,37,31,61,45,41,32]
    bins = [18,25,35,60,100]
    c = pd.cut(ages,bins,right=False)
    print(c)
    [[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
    Length: 12
    Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
    
    )5. labels: pass custom bin names
    ages = [20,22,25,27,21,23,37,31,61,45,41,32]
    bins = [18,25,35,60,100]
    name =['youth','young adult','middle-aged','senior']
    c = pd.cut(ages,bins,right=False,labels=name)
    print(c)
    [youth, youth, young adult, young adult, youth, ..., young adult, senior, middle-aged, middle-aged, young adult]
    Length: 12
    Categories (4, object): [youth < young adult < middle-aged < senior]

    )6. qcut: bins the data by sample quantiles, so each bin holds roughly the same number of values; the bins are generally not equal in width.
    data = np.random.randn(20)
    c = pd.qcut(data,4)
    print(pd.value_counts(c))
    (0.689, 2.849]       5
    (-0.0909, 0.689]     5
    (-1.052, -0.0909]    5
    (-1.757, -1.052]     5
    dtype: int64 
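
The difference between the two closing rules, and between cut and qcut, shows up clearly on the ages list itself. A sketch reusing the data above; wrapping the result in pd.Series before counting is a choice made here, not something the notes do.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

right_closed = pd.cut(ages, bins)              # (18, 25], (25, 35], ...
left_closed = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35), ...

# Age 25 sits on a bin edge, so the two calls place it in different bins.
print(right_closed.codes[2], left_closed.codes[2])  # 0 1

# qcut with 4 bins cuts at the sample quartiles: 12 distinct ages, 3 per bin.
quartiles = pd.qcut(ages, 4)
print(pd.Series(quartiles).value_counts().tolist())  # [3, 3, 3, 3]
```

cut uses the edges you supply, so bin counts can be very uneven; qcut picks the edges from the data so that the counts come out even.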
    
  8. Detecting and filtering outliers

    )1. describe(): prints summary statistics for each column.
    data2 = DataFrame([[1,2,3],[4,NA,6],[7,8,9],[NA,NA,NA]])
    print(data2.describe())
             0         1    2
    count  3.0  2.000000  3.0
    mean   4.0  5.000000  6.0
    std    3.0  4.242641  3.0
    min    1.0  2.000000  3.0
    25%    2.5  3.500000  4.5
    50%    4.0  5.000000  6.0
    75%    5.5  6.500000  7.5
    max    7.0  8.000000  9.0
    
    )2. Find the values in one column whose absolute value is greater than 1
    abs: absolute value
    data = pd.DataFrame(np.random.randn(10,4))
    col = data[2]
    print(col[np.abs(col)>1])
    4    1.079749
    7    1.628768
    8    1.082464
    9    1.110299
    Name: 2, dtype: float64
    
    )3. np.sign(data): returns 1.0 for positive values and -1.0 for negative values (0.0 for zeros)
    print(np.sign(data))
         0    1    2    3
    0  1.0  1.0  1.0  1.0
    1  1.0  1.0  1.0 -1.0
    2 -1.0 -1.0 -1.0  1.0
    3  1.0 -1.0 -1.0  1.0
    4 -1.0 -1.0 -1.0 -1.0
    5 -1.0  1.0 -1.0 -1.0
    6  1.0 -1.0  1.0  1.0
    7  1.0  1.0  1.0  1.0
    8 -1.0  1.0 -1.0 -1.0
    9 -1.0 -1.0  1.0 -1.0    
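
The notes stop at np.sign without using it. One typical use, sketched here as an assumption about the intent, is capping outliers: values outside [-1, 1] are replaced by their sign, which pins them to the nearest bound. The frame and the threshold of 1 are made up for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [0.5, -2.0, 0.3], "b": [3.0, 0.2, -0.7]})

# Rows that contain at least one value with absolute value above 1.
outlier_rows = data[(data.abs() > 1).any(axis=1)]

# Keep values inside [-1, 1]; replace the rest with their sign (+1.0 or -1.0).
capped = data.where(data.abs() <= 1, np.sign(data))

print(outlier_rows.index.tolist())  # [0, 1]
print(capped["a"].tolist())         # [0.5, -1.0, 0.3]
print(capped["b"].tolist())         # [1.0, 0.2, -0.7]
```

For a symmetric bound like this one, data.clip(-1, 1) gives the same result in a single call.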
    
  9. Permutation and random sampling: permutation, take, sample

    )1. permutation: given the length of the axis, produces an integer array holding a new ordering
        take: applies that ordering to the rows
    df = pd.DataFrame(np.arange(20).reshape((5,4)))
    ss = np.random.permutation(5)
    print(ss)
    [3 4 2 0 1]
    print(df.take(ss))
        0   1   2   3
    3  12  13  14  15
    4  16  17  18  19
    2   8   9  10  11
    0   0   1   2   3
    1   4   5   6   7
    
    )2. sample: draws a random subset of rows, in no fixed order
    print(df.sample(n=4))
        0   1   2   3
    0   0   1   2   3
    1   4   5   6   7
    2   8   9  10  11
    4  16  17  18  19
    
    replace=True: allows sampling with replacement, so the same row can be drawn more than once
    print(df.sample(n=4,replace=True))
        0   1   2   3
    3  12  13  14  15
    3  12  13  14  15
    3  12  13  14  15
    4  16  17  18  19
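
The outputs above differ on every run. A sketch of a repeatable version follows; the seeded generator from np.random.default_rng and the random_state argument are additions made here, not part of the notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((5, 4)))

rng = np.random.default_rng(0)    # seeded generator, added for repeatability
order = rng.permutation(len(df))  # the integers 0..4 in a new order
shuffled = df.take(order)         # rows reordered by position

subset = df.sample(n=3, random_state=0)                   # 3 distinct rows
resampled = df.sample(n=8, replace=True, random_state=0)  # n may exceed len(df)

print(order)
print(shuffled)
print(subset)
print(resampled)
```

Without replace=True, asking for more rows than the frame has raises an error; with it, some rows necessarily repeat.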
    
  10. Computing indicator/dummy variables: get_dummies

    )1. Select one column of a DataFrame and build its indicator matrix (one column per distinct value)
    df = pd.DataFrame({'key1':['a','b','c','c','b','a'],
                       'data1':range(6)})
    print(pd.get_dummies(df['data1']))
       0  1  2  3  4  5
    0  1  0  0  0  0  0
    1  0  1  0  0  0  0
    2  0  0  1  0  0  0
    3  0  0  0  1  0  0
    4  0  0  0  0  1  0
    5  0  0  0  0  0  1
    
    )2. df[['data1']]: returns a one-column DataFrame, so the column name is kept
        df['data1']: returns a Series.
    
    )3. prefix: adds a prefix to the generated column names
        join: a DataFrame method that merges in other data (aligned on the index)
    dummy = pd.get_dummies(df['key1'],prefix='key')
    print(dummy)
       key_a  key_b  key_c
    0      1      0      0
    1      0      1      0
    2      0      0      1
    3      0      0      1
    4      0      1      0
    5      1      0      0
    
    print(df[['data1']].join(dummy))
    print(dummy.join(df['data1']))
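
The two join calls above are printed without output, so here is a sketch of what the first one produces. Passing dtype=int is an addition: recent pandas versions return boolean columns by default, and this keeps the 0/1 form shown in these notes.

```python
import pandas as pd

df = pd.DataFrame({"key1": ["a", "b", "c", "c", "b", "a"],
                   "data1": range(6)})

# dtype=int forces 0/1 columns; recent pandas versions default to booleans.
dummy = pd.get_dummies(df["key1"], prefix="key", dtype=int)

# df[["data1"]] is a one-column DataFrame, so it has a join method.
combined = df[["data1"]].join(dummy)

print(combined.columns.tolist())  # ['data1', 'key_a', 'key_b', 'key_c']
print(combined.loc[2].tolist())   # [2, 0, 0, 1]
```

Each row of dummy has exactly one 1, in the column that matches that row's key1 value.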
    