pandas官方文档cookbook（6）中Split&Pivot&Apply翻译

最新推荐文章于 2021-10-02 16:19:48 发布

三分之一给你

最新推荐文章于 2021-10-02 16:19:48 发布

阅读量578

点赞数

分类专栏： pandas 0.20.3 documentation 翻译文章标签： python 文档数据

pandas 0.20.3 documentation 翻译专栏收录该内容

9 篇文章

订阅专栏

文档版本：0.20.3
这些例子是用python3.4写出来的。对于较早的python版本需要对代码做些相应的调整。
Pandas（pd）和Numpy（np）是唯一两个默认导入的包。其余的包会显示导入给新用户看。
若有翻译不当的地方，请多多指教。

这份文档中的例子都是从Stack-Overflow和Github中别人提问的比较经典的问题，作者从中进行提炼与总结。

分割

通过对行的逻辑描绘分割数据框，使其成为一个数据框的列表。

In [125]: df = pd.DataFrame(data={'Case' : ['A','A','A','B','A','A','B','A','A'],
   .....:                         'Data' : np.random.randn(9)})
   .....: 

In [126]: dfs = list(zip(*df.groupby((1*(df['Case']=='B')).cumsum().rolling(window=3,min_periods=1).median())))[-1]

In [127]: dfs[0]
Out[127]: 
    Case      Data
0    A  0.174068
1    A -0.439461
2    A -0.741343
3    B -0.079673

In [128]: dfs[1]
Out[128]: 
  Case      Data
4    A -0.922875
5    A  0.303638
6    B -0.917368

In [129]: dfs[2]
Out[129]: 
  Case      Data
7    A -1.624062
8    A -0.758514

枢纽表

部分的加总和小记

In [130]: df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
   .....:                          'City' : ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
   .....:                          'Sales' : [13,6,16,8,4,3,1]})
   .....: 

In [131]: table = pd.pivot_table(df,values=['Sales'],index=['Province'],columns=['City'],aggfunc=np.sum,margins=True)

In [132]: table.stack('City')
Out[132]: 
                    Sales
Province City            
AL       All         12.0
         Calgary      8.0
         Edmonton     4.0
BC       All         16.0
         Vancouver   16.0
MN       All          3.0
         Winnipeg     3.0
...                   ...
All      Calgary      8.0
         Edmonton     4.0
         Montreal     6.0
         Toronto     13.0
         Vancouver   16.0
         Windsor      1.0
         Winnipeg     3.0

[20 rows x 1 columns]

像R中的plyr的频率表

In [133]: grades = [48,99,75,80,42,80,72,68,36,78]

In [134]: df = pd.DataFrame( {'ID': ["x%d" % r for r in range(10)],
   .....:                     'Gender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
   .....:                     'ExamYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
   .....:                     'Class': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
   .....:                     'Participated': ['yes','yes','yes','yes','no','yes','yes','yes','yes','yes'],
   .....:                     'Passed': ['yes' if x > 50 else 'no' for x in grades],
   .....:                     'Employed': [True,True,True,False,False,False,False,True,True,False],
   .....:                     'Grade': grades})
   .....: 

In [135]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
   .....:                     'Passed': lambda x: sum(x == 'yes'),
   .....:                     'Employed' : lambda x : sum(x),
   .....:                     'Grade' : lambda x : sum(x) / len(x)})
   .....: 
Out[135]: 
      Participated  Passed  Employed      Grade
ExamYear                                           
2007                 3       2         3  74.000000
2008                 3       3         0  68.500000
2009                 3       2         2  60.666667

创建一个年和月的交叉列表

In [136]: df = pd.DataFrame({'value': np.random.randn(36)},
   .....:                   index=pd.date_range('2011-01-01', freq='M', periods=36))
   .....: 

In [137]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
   .....:                values='value', aggfunc='sum')
   .....: 
Out[137]: 
        2011      2012      2013
1  -0.560859  0.120930  0.516870
2  -0.589005 -0.210518  0.343125
3  -1.070678 -0.931184  2.137827
4  -1.681101  0.240647  0.452429
5   0.403776 -0.027462  0.483103
6   0.609862  0.033113  0.061495
7   0.387936 -0.658418  0.240767
8   1.815066  0.324102  0.782413
9   0.705200 -1.403048  0.628462
10 -0.668049 -0.581967 -0.880627
11  0.242501 -1.233862  0.777575
12  0.313421 -3.520876 -0.779367

Apply

通过对嵌套列表的数据框进行循环处理使其转化为不含列表的多重索引结构

In [138]: df = pd.DataFrame(data={'A' : [[2,4,8,16],[100,200],[10,20,30]], 'B' : [['a','b','c'],['jj','kk'],['ccc']]},index=['I','II','III'])

In [139]: def SeriesFromSubList(aList):
   .....:    return pd.Series(aList)
   .....: 

In [140]: df_orgz = pd.concat(dict([ (ind,row.apply(SeriesFromSubList)) for ind,row in df.iterrows() ]))

从一个从Series的标量被返回之前对多个列进行循环应用用于计算Series的函数

In [141]: df = pd.DataFrame(data=np.random.randn(2000,2)/10000,
   .....:                   index=pd.date_range('2001-01-01',periods=2000),
   .....:                   columns=['A','B']); df
   .....: 
Out[141]: 
                   A         B
2001-01-01  0.000032 -0.000004
2001-01-02 -0.000001  0.000207
2001-01-03  0.000120 -0.000220
2001-01-04 -0.000083 -0.000165
2001-01-05 -0.000047  0.000156
2001-01-06  0.000027  0.000104
2001-01-07  0.000041 -0.000101
...              ...       ...
2006-06-17 -0.000034  0.000034
2006-06-18  0.000002  0.000166
2006-06-19  0.000023 -0.000081
2006-06-20 -0.000061  0.000012
2006-06-21 -0.000111  0.000027
2006-06-22 -0.000061 -0.000009
2006-06-23  0.000074 -0.000138

[2000 rows x 2 columns]

In [142]: def gm(aDF,Const):
   .....:    v = ((((aDF.A+aDF.B)+1).cumprod())-1)*Const
   .....:    return (aDF.index[0],v.iloc[-1])
   .....: 

In [143]: S = pd.Series(dict([ gm(df.iloc[i:min(i+51,len(df)-1)],5) for i in range(len(df)-50) ])); S
Out[143]: 
2001-01-01   -0.001373
2001-01-02   -0.001705
2001-01-03   -0.002885
2001-01-04   -0.002987
2001-01-05   -0.002384
2001-01-06   -0.004700
2001-01-07   -0.005500
            ...   
2006-04-28   -0.002682
2006-04-29   -0.002436
2006-04-30   -0.002602
2006-05-01   -0.001785
2006-05-02   -0.001799
2006-05-03   -0.000605
2006-05-04   -0.000541
Length: 1950, dtype: float64

对多重列进行循环应用返回标量的函数

In [144]: rng = pd.date_range(start = '2014-01-01',periods = 100)

In [145]: df = pd.DataFrame({'Open' : np.random.randn(len(rng)),
   .....:                    'Close' : np.random.randn(len(rng)),
   .....:                    'Volume' : np.random.randint(100,2000,len(rng))}, index=rng); df
   .....: 
Out[145]: 
               Close      Open  Volume
2014-01-01 -0.653039  0.011174    1581
2014-01-02  1.314205  0.214258    1707
2014-01-03 -0.341915 -1.046922    1768
2014-01-04 -1.303586 -0.752902     836
2014-01-05  0.396288 -0.410793     694
2014-01-06 -0.548006  0.648401     796
2014-01-07  0.481380  0.737320     265
...              ...       ...     ...
2014-04-04 -2.548128  0.120378     564
2014-04-05  0.223346  0.231661    1908
2014-04-06  1.228841  0.952664    1090
2014-04-07  0.552784 -0.176090    1813
2014-04-08 -0.795389  1.781318    1103
2014-04-09 -0.018815 -0.753493    1456
2014-04-10  1.138197 -1.047997    1193

[100 rows x 3 columns]

In [146]: def vwap(bars): return ((bars.Close*bars.Volume).sum()/bars.Volume.sum())

In [147]: window = 5

In [148]: s = pd.concat([ (pd.Series(vwap(df.iloc[i:i+window]), index=[df.index[i+window]])) for i in range(len(df)-window) ]);

In [149]: s.round(2)
Out[149]: 
2014-01-06   -0.03
2014-01-07    0.07
2014-01-08   -0.40
2014-01-09   -0.81
2014-01-10   -0.63
2014-01-11   -0.86
2014-01-12   -0.36
          ... 
2014-04-04   -1.27
2014-04-05   -1.36
2014-04-06   -0.73
2014-04-07    0.04
2014-04-08    0.21
2014-04-09    0.07
2014-04-10    0.25
Length: 95, dtype: float64

pandas的cookbook中翻译到现在还剩下一些作图、读写文件的部分没有翻译，打算弃坑以后有兴趣再来填。