Python pandas Basics 3

Note: this article is compiled from *Python for Data Analysis*.

 

1. Computing descriptive statistics

(1) The sum method

In [198]: df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
   .....:                index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [199]: df
Out[199]:     one  two
           a  1.40  NaN
           b  7.10 -4.5
           c   NaN  NaN
           d  0.75 -1.3

Use the sum method to compute column totals:

In [200]: df.sum()
Out[200]: one    9.25
          two   -5.80

Passing the axis argument to sum changes the axis along which the statistic is computed:

In [201]: df.sum(axis=1)
Out[201]: a    1.40
          b    2.60
          c     NaN
          d   -0.55

NaN values are skipped by default; use the skipna parameter to keep them in the computation instead:

In [202]: df.mean(axis=1, skipna=False)
Out[202]: a      NaN
          b    1.300
          c      NaN
          d   -0.275
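The default-skipping behavior can be checked end to end with a minimal sketch built from the same df as the examples above:

```python
import numpy as np
import pandas as pd

# Same df as in the examples above.
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                        # NaNs skipped by default
row_means = df.mean(axis=1, skipna=False)  # any NaN in a row yields NaN

print(col_sums)    # one 9.25, two -5.80
print(row_means)   # a NaN, b 1.300, c NaN, d -0.275
```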

(2) Other methods, such as idxmin and idxmax, return the index of the result:

In [203]: df.idxmax()
Out[203]: one    b
          two    d

Accumulation methods, e.g. cumsum:

In [204]: df.cumsum()
Out[204]:    one  two
          a 1.40  NaN
          b 8.50 -4.5
          c  NaN  NaN
          d 9.25 -5.8
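Both calls can be reproduced on the same df; a minimal sketch (values and labels follow the examples above):

```python
import numpy as np
import pandas as pd

# Same df as in the examples above.
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

maxes = df.idxmax()   # index label where each column attains its maximum
csum = df.cumsum()    # running totals down each column; NaNs are skipped
print(maxes)          # one -> 'b', two -> 'd'
print(csum)
```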

The describe method reminds me of R's summary...

For a numeric DataFrame:

In [205]: df.describe() 

Out[205]:             one       two 

          count  3.000000  2.000000 

          mean   3.083333 -2.900000 

          std    3.493685  2.262742 

          min    0.750000 -4.500000 

          25%    1.075000 -3.700000 

          50%    1.400000 -2.900000 

          75%    4.250000 -2.100000 

          max    7.100000 -1.300000

For non-numeric data (here a Series of strings):

In [206]: obj = Series(['a', 'a', 'b', 'c'] * 4)
In [207]: obj.describe() 

Out[207]: count     16 

          unique    3 

          top       a 

          freq      8
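The two behaviors can be contrasted side by side; a small sketch (the numeric Series is made-up illustration data):

```python
import pandas as pd

# describe adapts to the dtype: numeric data yields count/mean/std/quantiles,
# object data yields count/unique/top/freq.
nums = pd.Series([1.0, 2.0, 3.0, 4.0])      # hypothetical numeric data
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)   # same obj as above

num_stats = nums.describe()
obj_stats = obj.describe()
print(num_stats)
print(obj_stats)   # count 16, unique 3, top 'a', freq 8
```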

 

(3) The related methods and arguments are listed below:

Method            Description

count             Number of non-NA values

describe          Compute set of summary statistics for Series or each DataFrame column

min, max          Compute minimum and maximum values

argmin, argmax    Compute index locations (integers) at which minimum or maximum value obtained, respectively

idxmin, idxmax    Compute index values at which minimum or maximum value obtained, respectively

quantile          Compute sample quantile ranging from 0 to 1

sum               Sum of values

mean              Mean of values

median            Arithmetic median (50% quantile) of values

mad               Mean absolute deviation from mean value

var               Sample variance of values

std               Sample standard deviation of values

skew              Sample skewness (3rd moment) of values

kurt              Sample kurtosis (4th moment) of values

cumsum            Cumulative sum of values

cummin, cummax    Cumulative minimum or maximum of values, respectively

cumprod           Cumulative product of values

diff              Compute 1st arithmetic difference (useful for time series)

pct_change        Compute percent changes
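A handful of the table's methods, sketched on a small made-up Series (mad is left out because newer pandas releases no longer provide it):

```python
import pandas as pd

# Hypothetical data chosen so the reductions are easy to check by hand.
s = pd.Series([2.0, 4.0, 8.0, 16.0])

median = s.quantile(0.5)   # 50% sample quantile
variance = s.var()         # sample variance (ddof=1)
diffs = s.diff()           # first arithmetic differences
pct = s.pct_change()       # percent change between consecutive values
print(median, variance)    # 6.0 and ~38.33
print(diffs)               # NaN, 2, 4, 8
print(pct)                 # NaN, 1.0, 1.0, 1.0
```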

2. Correlation and covariance

Both are computed over pairs of arguments.

The following code downloads stock data for a few companies from Yahoo! Finance:

import pandas.io.data as web   # note: this module has since moved to the separate pandas-datareader package

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})

To compute percent changes, use pct_change:

In [209]: returns = price.pct_change()
In [210]: returns.tail()
Out[210]:               AAPL      GOOG       IBM      MSFT
          Date
          2009-12-24  0.034339  0.011117  0.004420  0.002747
          2009-12-28  0.012294  0.007098  0.013282  0.005479
          2009-12-29 -0.011861 -0.005571 -0.003474  0.006812
          2009-12-30  0.012147  0.005376  0.005468 -0.013532
          2009-12-31 -0.004300 -0.004416 -0.012609 -0.015432

(1) For Series, corr computes the correlation of the overlapping, non-NaN, index-aligned values of two Series, and cov computes the covariance:

In [211]: returns.MSFT.corr(returns.IBM)
Out[211]: 0.49609291822168838

In [212]: returns.MSFT.cov(returns.IBM)
Out[212]: 0.00021600332437329015


(2) On a DataFrame, corr and cov return the full correlation and covariance matrices:

In [213]: returns.corr()
Out[213]:           AAPL      GOOG       IBM      MSFT
          AAPL  1.000000  0.470660  0.410648  0.424550
          GOOG  0.470660  1.000000  0.390692  0.443334
          IBM   0.410648  0.390692  1.000000  0.496093
          MSFT  0.424550  0.443334  0.496093  1.000000

In [214]: returns.cov()
Out[214]:           AAPL      GOOG       IBM      MSFT
          AAPL  0.001028  0.000303  0.000252  0.000309
          GOOG  0.000303  0.000580  0.000142  0.000205
          IBM   0.000252  0.000142  0.000367  0.000216
          MSFT  0.000309  0.000205  0.000216  0.000516
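Since the old Yahoo! Finance endpoint no longer works, here is a self-contained sketch of corr and cov on synthetic "returns" (all values are made up; the shared base component only serves to induce some positive correlation):

```python
import numpy as np
import pandas as pd

# Hypothetical daily "returns" for two tickers, sharing a common component.
rng = np.random.default_rng(0)
base = rng.normal(0.0, 0.01, 250)
returns = pd.DataFrame({
    'AAPL': base + rng.normal(0.0, 0.01, 250),
    'MSFT': base + rng.normal(0.0, 0.01, 250),
})

pair_corr = returns['AAPL'].corr(returns['MSFT'])  # single pairwise value
cov_matrix = returns.cov()                         # full covariance matrix
print(pair_corr)
print(cov_matrix)
```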


3. Unique values (unique), value counts (frequency statistics with value_counts()), and membership

Suppose we have the Series:

In [217]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']) 

To get the distinct values appearing in obj, use unique:

In [218]: uniques = obj.unique()
In [219]: uniques 

Out[219]: array([c, a, d, b], dtype=object)

To count how many times each distinct value occurs, use value_counts:

In [220]: obj.value_counts() 

Out[220]: c    3 

          a    3 

          b    2 

          d    1
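A runnable sketch of the counting above; the sort=False variant shown at the end is an extra value_counts option, not part of the original example:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()   # frequencies, sorted descending by default
print(counts)                 # a/c -> 3, b -> 2, d -> 1
# sort=False skips the descending sort
print(obj.value_counts(sort=False))
```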

isin performs a vectorized set-membership check:

In [222]: mask = obj.isin(['b', 'c'])
In [223]: mask
Out[223]: 0     True
          1    False
          2    False
          3    False
          4    False
          5     True
          6     True
          7     True
          8     True

In [224]: obj[mask]
Out[224]: 0    c
          5    b
          6    b
          7    c
          8    c
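The mask-and-filter pattern above as a self-contained sketch:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

mask = obj.isin(['b', 'c'])   # True wherever the value is in the set
subset = obj[mask]            # the boolean mask doubles as a filter
print(subset)                 # keeps positions 0, 5, 6, 7, 8
```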


4. Handling missing data

pandas represents all missing values as NaN; Python's built-in None is also treated as NaN.

In [229]: string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [232]: string_data[0] = None
In [233]: string_data.isnull() 
Out[233]:      0     True 
               1     False 
               2     True 
               3    False

(1) Handling missing values: dropping

Use dropna(), or equivalently data[data.notnull()]:

In [234]: from numpy import nan as NA
In [235]: data = Series([1, NA, 3.5, NA, 7])
In [236]: data.dropna()
Out[236]: 0    1.0
          2    3.5
          4    7.0

Because dropna() drops every row that contains any missing value, you can pass the how argument to control this; for example, how='all' drops only rows consisting entirely of NaN (data here has been reassigned to a DataFrame):

In [242]: data.dropna(how='all')
Out[242]:     0    1   2
          0   1  6.5   3
          1   1  NaN NaN
          3 NaN  6.5   3

You can also pass axis=1 to drop columns instead of rows:

In [243]: data[4] = NA
In [244]: data
Out[244]:     0    1   2   4
          0   1  6.5   3 NaN
          1   1  NaN NaN NaN
          2 NaN  NaN NaN NaN
          3 NaN  6.5   3 NaN

In [245]: data.dropna(axis=1, how='all')
Out[245]:     0    1   2
          0   1  6.5   3
          1   1  NaN NaN
          2 NaN  NaN NaN
          3 NaN  6.5   3
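A runnable sketch of the dropna variants on similar data; the thresh option (keep only rows with at least that many non-NA values) is an extra dropna parameter not shown in the original examples:

```python
import numpy as np
import pandas as pd

# Data shaped like the example above.
df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_dropped = df.dropna()           # drops every row containing a NaN
all_dropped = df.dropna(how='all')  # drops only the all-NaN row
thresh2 = df.dropna(thresh=2)       # requires >= 2 non-NA values per row
print(all_dropped)
print(thresh2)
```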

(2) Handling missing values: filling

Use the fillna method.

The data object df:

In [248]: df
Out[248]:          0         1         2
          0 -0.577087       NaN       NaN
          1  0.523772       NaN       NaN
          2 -0.713544       NaN       NaN
          3 -1.860761       NaN  0.560145
          4 -1.265934       NaN -1.063512
          5  0.332883 -2.359419 -0.199543
          6 -1.541996 -0.970736 -1.307030

Fill the NaN values with 0:

In [250]: df.fillna(0)
Out[250]:          0         1         2
          0 -0.577087  0.000000  0.000000
          1  0.523772  0.000000  0.000000
          2 -0.713544  0.000000  0.000000
          3 -1.860761  0.000000  0.560145
          4 -1.265934  0.000000 -1.063512
          5  0.332883 -2.359419 -0.199543
          6 -1.541996 -0.970736 -1.307030

You can also pass a dict as the fill strategy, specifying which value to use for each column's NaNs (below, NaNs in column 1 are filled with 0.5; column 3 does not exist, so the -1 has no effect):

In [251]: df.fillna({1: 0.5, 3: -1})
Out[251]:          0         1         2
          0 -0.577087  0.500000       NaN
          1  0.523772  0.500000       NaN
          2 -0.713544  0.500000       NaN
          3 -1.860761  0.500000  0.560145
          4 -1.265934  0.500000 -1.063512
          5  0.332883 -2.359419 -0.199543
          6 -1.541996 -0.970736 -1.307030

You can also use a computed value, such as the mean, as the fill value, and specify axis to fill along rows or columns:

In [259]: data = Series([1., NA, 3.5, NA, 7])
In [260]: data.fillna(data.mean()) 

Out[260]:      0    1.000000 

               1    3.833333 

               2    3.500000 

               3    3.833333 

               4    7.000000
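The mean-fill example above as a runnable sketch; the ffill call at the end (carrying the last valid value forward) is a related fill method added here for contrast:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

mean_filled = s.fillna(s.mean())  # fill gaps with the computed mean (~3.833)
ffilled = s.ffill()               # or propagate the last valid value forward
print(mean_filled)
print(ffilled)
```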

 

5. Hierarchical indexing

Hierarchical indexing is an important feature of pandas: it lets you work with higher-dimensional data in a lower-dimensional form.

(1) Basic use

In [261]: data = Series(np.random.randn(10),
   .....:               index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
   .....:                      [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
In [262]: data
Out[262]: a  1    0.670216
             2    0.852965
             3   -0.955869
          b  1   -0.023493
             2   -2.304234
             3   -0.652469
          c  1   -1.218302
             2   -1.332610
          d  2    1.074623
             3    0.723642

(2) A hierarchical index allows more precise selection of data subsets:

In [265]: data['b':'c']
Out[265]: b  1   -0.023493
             2   -2.304234
             3   -0.652469
          c  1   -1.218302
             2   -1.332610

You can also select subsets at a deeper level:

In [267]: data[:, 2] 

Out[267]: a    0.852965

          b   -2.304234 

          c   -1.332610 

          d    1.074623
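Both selection styles in one self-contained sketch, on a smaller hierarchical Series with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values; two outer labels, two inner labels each.
data = pd.Series(np.arange(4.0),
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

outer = data['b']     # all entries under outer label 'b'
inner = data[:, 2]    # every outer label's entry at inner label 2
print(outer)
print(inner)
```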

(3) The unstack method pivots hierarchically indexed data into a DataFrame, and stack restores the hierarchical form:

In [268]: data.unstack()
Out[268]:         1         2         3
          a  0.670216  0.852965 -0.955869
          b -0.023493 -2.304234 -0.652469
          c -1.218302 -1.332610       NaN
          d       NaN  1.074623  0.723642
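The unstack/stack round trip can be sketched on a small hierarchical Series (values are made up, chosen so each cell is easy to check):

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(6.0),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

wide = data.unstack()  # inner index level pivots into DataFrame columns
back = wide.stack()    # stack reverses the pivot
print(wide)
print(back)
```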

(4) Either axis can carry a hierarchical index:

In [270]: frame = DataFrame(np.arange(12).reshape((4, 3)),
   .....:                   index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
   .....:                   columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
In [271]: frame
Out[271]:      Ohio     Colorado
              Green Red    Green
          a 1     0   1        2
            2     3   4        5
          b 1     6   7        8
            2     9  10       11
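The frame above can be built and partially selected in a self-contained sketch; the selection calls shown are standard pandas indexing, not part of the original transcript:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

# An outer column label selects that whole group of columns;
# .loc with an outer row label does partial selection on the row index.
print(frame['Ohio'])   # the Green and Red columns
print(frame.loc['a'])  # rows under outer label 'a'
```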
