Python pandas Basics 2

春夏秋冬又一年

Category: python. Tags: python, pandas

Article link: https://blog.csdn.net/huangxia73/article/details/38086161

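This article is compiled from the book "Python for Data Analysis". The sessions below assume `from pandas import Series, DataFrame` and `import numpy as np` have already been run.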

1 Arithmetic and data alignment

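An important pandas feature is arithmetic between objects that have different indexes (say obj1 and obj2). The operands are aligned by label, and the index of the result is the union of the two indexes. For example: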

(1) Arithmetic between objects with different indexes

In [126]: s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [127]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [128]: s1
Out[128]:
a    7.3
c   -2.5
d    3.4
e    1.5

In [129]: s2
Out[129]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1

Adding the two gives the result below. A label that appears in only one of the operands gets NaN in the result:

In [130]: s1 + s2
Out[130]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
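The alignment rule can be checked directly. The following is a minimal sketch using the current `import pandas as pd` convention rather than the bare `Series` name used in the session above; the variable name `total` is only for illustration.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2

# The result index is the union of the two input indexes.
print(list(total.index))

# Labels found in only one operand ('d', 'f', 'g') come out as NaN.
print(total[total.isna()].index.tolist())
```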

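The add method takes a fill_value argument. Both objects are first aligned to the union of their labels; wherever a value is missing in only one of them, fill_value is substituted before the operation is carried out. In effect, the smaller DataFrame (or Series) is padded out to the shape of the larger one.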

In [136]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
In [137]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
In [138]: df1
Out[138]:
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [139]: df2
Out[139]:
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

In [141]: df1.add(df2, fill_value=0)
Out[141]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
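To make the effect of fill_value concrete, the sketch below compares plain `+` with `add(..., fill_value=0)` on the same two frames. The small Series `a` and `b` at the end are hypothetical data added only to show one edge case: when a value is missing on both sides, the result stays NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

plain = df1 + df2                      # cells missing from either frame become NaN
filled = df1.add(df2, fill_value=0)    # a cell missing from one frame is treated as 0

print(plain)
print(filled)

# Edge case: fill_value is not used when BOTH operands are missing at a label.
a = pd.Series([1.0, np.nan], index=['x', 'y'])
b = pd.Series([np.nan, np.nan], index=['x', 'y'])
both = a.add(b, fill_value=0)
print(both)
```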

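(2) Arithmetic between a DataFrame and a Series

The Series is broadcast (along rows or along columns) to match the shape of the DataFrame before the operation is carried out. This works the same way as broadcasting a one-dimensional NumPy array against a two-dimensional one: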

In [143]: arr = np.arange(12.).reshape((3, 4))
In [144]: arr
Out[144]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [145]: arr[0]
Out[145]: array([ 0.,  1.,  2.,  3.])

In [146]: arr - arr[0]
Out[146]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows:

In [147]: frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [148]: series = frame.iloc[0]  # first row; the original used frame.ix[0], but .ix was removed in pandas 1.0
In [149]: frame
Out[149]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [150]: series
Out[150]:
b    0
d    1
e    2

In [151]: frame - series
Out[151]:
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9
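The sketch below repeats the column-matching case and adds two variations that the text does not show: a Series whose labels only partly overlap the columns, and matching on the row index instead by calling `sub` with `axis=0`. The names `row`, `partial`, and `col` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: Series labels are matched against the columns, broadcast down the rows.
row = frame.iloc[0]
by_row = frame - row

# Labels that do not match on both sides produce NaN columns in the union.
partial = pd.Series([0.0, 1.0, 2.0], index=['b', 'e', 'f'])
mismatched = frame + partial

# To match on the row index and broadcast across columns, pass axis=0.
col = frame['d']
by_col = frame.sub(col, axis=0)

print(by_row)
print(mismatched)
print(by_col)
```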

(3) Function application and mapping

NumPy universal functions (element-wise array functions) such as abs and exp can be applied directly to a DataFrame.
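As a small illustration, the sketch below applies `np.abs` and `np.exp` to a DataFrame. The frame and its values are made up for this example; the point is that the result is still a DataFrame with the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only to show the sign change.
frame = pd.DataFrame([[-1.5, 2.0], [3.0, -4.0]],
                     columns=['x', 'y'], index=['r1', 'r2'])

abs_frame = np.abs(frame)   # element-wise absolute value
exp_frame = np.exp(frame)   # element-wise exponential

print(abs_frame)
print(type(exp_frame).__name__)
```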

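Another frequently used operation is the DataFrame apply method, which applies a function to one-dimensional slices of the data: each column by default, or each row with axis=1.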

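# Note: judging by the outputs, frame here holds random float values under the same
# labels as before, not the arange frame above; its definition is not shown in the original.
In [161]: f = lambda x: x.max() - x.min()
In [162]: frame.apply(f)
Out[162]:
b    1.802165
d    1.684034
e    2.689627

In [163]: frame.apply(f, axis=1)
Out[163]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656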

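Most common statistics such as sum and mean are already DataFrame methods, so apply is not needed for them. Note that the function passed to apply does not have to return a scalar; it can also return a Series: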

In [164]: def f(x):
   .....:     return Series([x.min(), x.max()], index=['min', 'max'])
In [165]: frame.apply(f)
Out[165]:
            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406

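An element-wise Python function can also be applied to a DataFrame, using applymap (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):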

In [166]: format = lambda x: '%.2f' % x
In [167]: frame.applymap(format)
Out[167]:
            b     d      e
Utah    -0.20  0.48  -0.52
Ohio    -0.56  1.97   1.39
Texas    0.09  0.28   0.77
Oregon   1.25  1.01  -1.30
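The apply examples above can be reproduced with fixed data. This sketch assumes a frame built from the two-decimal values printed in the last output (so the numbers differ slightly from the full-precision results above). For the element-wise step it uses `Series.map` on each column, which is one way to get the same effect as applymap on both old and new pandas versions; the names `spread`, `min_max`, and `fmt` are illustrative.

```python
import pandas as pd

# Rounded values taken from the formatted output above; not the original random data.
frame = pd.DataFrame([[-0.20, 0.48, -0.52],
                      [-0.56, 1.97, 1.39],
                      [0.09, 0.28, 0.77],
                      [1.25, 1.01, -1.30]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    """Range of a 1-D slice: max minus min."""
    return x.max() - x.min()

per_column = frame.apply(spread)          # one value per column
per_row = frame.apply(spread, axis=1)     # one value per row

def min_max(x):
    """Return a Series so that apply builds a DataFrame of results."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = frame.apply(min_max)

# Element-wise formatting, column by column.
fmt = lambda v: '%.2f' % v
formatted = frame.apply(lambda column: column.map(fmt))

print(per_column)
print(per_row)
print(summary)
print(formatted)
```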

2 Sorting and ranking (an element's position within a row or column)

(1) Sorting by label with sort_index

For a Series:

In [169]: obj = Series(range(4), index=['d', 'a', 'b', 'c'])
In [170]: obj.sort_index()
Out[170]:
a    1
b    2
c    3
d    0

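For a DataFrame, sort_index works as well, and the axis parameter selects whether the row labels (the default) or the column labels (axis=1) are sorted: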

In [171]: frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [172]: frame.sort_index()
Out[172]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [173]: frame.sort_index(axis=1)
Out[173]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

sort_index sorts in ascending order by default. This can be changed with the ascending parameter, much like ASC and DESC in SQL.

In [174]: frame.sort_index(axis=1, ascending=False)
Out[174]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
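The label-sorting calls above can be run as a script. This is a minimal sketch with the `pd.` prefix; it also shows that sort_index returns a new object and leaves the original order untouched. The result names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
sorted_obj = obj.sort_index()

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

rows_sorted = frame.sort_index()                         # sort row labels
cols_sorted = frame.sort_index(axis=1)                   # sort column labels
cols_desc = frame.sort_index(axis=1, ascending=False)    # descending column labels

print(sorted_obj)
print(rows_sorted)
print(cols_sorted)
print(cols_desc)

# The original frame keeps its order; sort_index returned new objects.
print(list(frame.index), list(frame.columns))
```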

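(2) Sorting by value

For a Series, the original text uses the order method. order was removed in later pandas versions and replaced by sort_values, which is used below. Note that NaN values are placed at the end by default.

In [177]: obj = Series([4, np.nan, 7, np.nan, -3, 2])
In [178]: obj.sort_values()  # originally obj.order()
Out[178]:
4    -3
5     2
0     4
2     7
1   NaN
3   NaN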

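For a DataFrame, you may want to sort by the values in one or more columns. The original text does this with sort_index and its by argument; in later pandas versions the equivalent call is sort_values(by=...), which is used below. With several columns, rows are ordered by the first column and ties are broken by the next.

In [179]: frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [182]: frame.sort_values(by=['a', 'b'])  # originally frame.sort_index(by=['a', 'b'])
Out[182]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7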
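A runnable version of the value-sorting examples, as a sketch against current pandas. The `na_position='first'` call is an addition not covered in the text, included to show how the default NaN placement can be changed.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

by_value = obj.sort_values()                        # NaN goes last by default
nan_first = obj.sort_values(na_position='first')    # NaN goes first instead

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by column 'a', breaking ties with column 'b'.
by_cols = frame.sort_values(by=['a', 'b'])

print(by_value)
print(nan_first)
print(by_cols)
```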

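(3) rank

The examples below show what rank does:

<1> It returns the position of each element within the Series or DataFrame when the values are sorted.

<2> Ranks start at 1.

<3> Tied values share the mean of the positions they occupy by default. For example, if three values tie for positions 3, 4, and 5, each gets rank (3+4+5)/3 = 4.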

In [183]: obj = Series([7, -5, 7, 4, 2, 0, 4])
In [184]: obj.rank()
Out[184]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5

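The ascending parameter changes the ranking order, and the method parameter changes how ties are ranked. The example below ranks in descending order and gives each group of tied values the largest rank in the group: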

In [186]: obj.rank(ascending=False, method='max')
Out[186]:
0    2
1    7
2    2
3    4
4    5
5    6
6    4
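To compare tie-breaking rules side by side, the sketch below ranks the same Series with three settings. `method='first'` is not used in the text; it is added here as a contrast, and it ranks tied values in the order they appear in the data.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                                  # default: ties share the mean rank
first = obj.rank(method='first')                      # ties ranked by order of appearance
desc_max = obj.rank(ascending=False, method='max')    # descending, ties take the largest rank

comparison = pd.DataFrame({'value': obj,
                           'average': average,
                           'first': first,
                           'desc_max': desc_max})
print(comparison)
```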

For a DataFrame, rank can be computed down the columns (the default) or across each row with axis=1:

In [187]: frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],  'c': [-2, 5, 8, -2.5]})
In [188]: frame
Out[188]:
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5

In [189]: frame.rank(axis=1)
Out[189]:
   a  b  c
0  2  3  1
1  1  3  2
2  2  1  3
3  2  3  1

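(4) Axis indexes with duplicate labels

The is_unique property of the index tells you whether its labels are unique; False means duplicate labels exist.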

In [190]: obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
In [191]: obj
Out[191]:
a    0
a    1
b    2
b    3
c    4

In [192]: obj.index.is_unique
Out[192]: False

If a label matches more than one entry, selecting that label returns all of the matching entries:

In [195]: df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
In [196]: df
Out[196]:
          0         1         2
a  0.274992  0.228913  1.352917
a  0.886429 -2.001637 -0.371843
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228

In [197]: df.loc['b']  # originally df.ix['b']; .ix was removed in pandas 1.0
Out[197]:
          0         1         2
b  1.669025 -0.438570 -0.539741
b  0.476985  3.248944 -1.021228
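The sketch below shows how the return type depends on whether a label is duplicated. It uses `np.arange` instead of random values so the results are reproducible, which is a substitution for this example only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
print(obj.index.is_unique)    # duplicates present, so this is False

dup = obj['a']      # duplicated label: a Series with every match
single = obj['c']   # unique label: a single scalar value
print(dup)
print(single)

# Deterministic stand-in for the random frame in the text.
df = pd.DataFrame(np.arange(12).reshape((4, 3)), index=['a', 'a', 'b', 'b'])
rows_b = df.loc['b']          # both rows labelled 'b'
print(rows_b)
```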