文档版本:0.20.3
这些例子是用python3.4写出来的。对于较早的python版本需要对代码做些相应的调整。
Pandas(pd)和Numpy(np)是唯一两个默认导入的包。其余的包会显示导入给新用户看。
若有翻译不当的地方,请多多指教。
这份文档中的例子都是从Stack-Overflow和Github中别人提问的比较经典的问题,作者从中进行提炼与总结。
分割
通过对行的逻辑描绘分割数据框,使其成为一个数据框的列表。
In [125]: df = pd.DataFrame(data={'Case' : ['A','A','A','B','A','A','B','A','A'],
.....: 'Data' : np.random.randn(9)})
.....:
In [126]: dfs = list(zip(*df.groupby((1*(df['Case']=='B')).cumsum().rolling(window=3,min_periods=1).median())))[-1]
In [127]: dfs[0]
Out[127]:
Case Data
0 A 0.174068
1 A -0.439461
2 A -0.741343
3 B -0.079673
In [128]: dfs[1]
Out[128]:
Case Data
4 A -0.922875
5 A 0.303638
6 B -0.917368
In [129]: dfs[2]
Out[129]:
Case Data
7 A -1.624062
8 A -0.758514
枢纽表
部分的加总和小记
In [130]: df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
.....: 'City' : ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
.....: 'Sales' : [13,6,16,8,4,3,1]})
.....:
In [131]: table = pd.pivot_table(df,values=['Sales'],index=['Province'],columns=['City'],aggfunc=np.sum,margins=True)
In [132]: table.stack('City')
Out[132]:
Sales
Province City
AL All 12.0
Calgary 8.0
Edmonton 4.0
BC All 16.0
Vancouver 16.0
MN All 3.0
Winnipeg 3.0
... ...
All Calgary 8.0
Edmonton 4.0
Montreal 6.0
Toronto 13.0
Vancouver 16.0
Windsor 1.0
Winnipeg 3.0
[20 rows x 1 columns]
像R中的plyr的频率表
In [133]: grades = [48,99,75,80,42,80,72,68,36,78]
In [134]: df = pd.DataFrame( {'ID': ["x%d" % r for r in range(10)],
.....: 'Gender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
.....: 'ExamYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
.....: 'Class': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
.....: 'Participated': ['yes','yes','yes','yes','no','yes','yes','yes','yes','yes'],
.....: 'Passed': ['yes' if x > 50 else 'no' for x in grades],
.....: 'Employed': [True,True,True,False,False,False,False,True,True,False],
.....: 'Grade': grades})
.....:
In [135]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
.....: 'Passed': lambda x: sum(x == 'yes'),
.....: 'Employed' : lambda x : sum(x),
.....: 'Grade' : lambda x : sum(x) / len(x)})
.....:
Out[135]:
Participated Passed Employed Grade
ExamYear
2007 3 2 3 74.000000
2008 3 3 0 68.500000
2009 3 2 2 60.666667
创建一个年和月的交叉列表
In [136]: df = pd.DataFrame({'value': np.random.randn(36)},
.....: index=pd.date_range('2011-01-01', freq='M', periods=36))
.....:
In [137]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
.....: values='value', aggfunc='sum')
.....:
Out[137]:
2011 2012 2013
1 -0.560859 0.120930 0.516870
2 -0.589005 -0.210518 0.343125
3 -1.070678 -0.931184 2.137827
4 -1.681101 0.240647 0.452429
5 0.403776 -0.027462 0.483103
6 0.609862 0.033113 0.061495
7 0.387936 -0.658418 0.240767
8 1.815066 0.324102 0.782413
9 0.705200 -1.403048 0.628462
10 -0.668049 -0.581967 -0.880627
11 0.242501 -1.233862 0.777575
12 0.313421 -3.520876 -0.779367
Apply
通过对嵌套列表的数据框进行循环处理使其转化为不含列表的多重索引结构
In [138]: df = pd.DataFrame(data={'A' : [[2,4,8,16],[100,200],[10,20,30]], 'B' : [['a','b','c'],['jj','kk'],['ccc']]},index=['I','II','III'])
In [139]: def SeriesFromSubList(aList):
.....: return pd.Series(aList)
.....:
In [140]: df_orgz = pd.concat(dict([ (ind,row.apply(SeriesFromSubList)) for ind,row in df.iterrows() ]))
从一个从Series的标量被返回之前对多个列进行循环应用用于计算Series的函数
In [141]: df = pd.DataFrame(data=np.random.randn(2000,2)/10000,
.....: index=pd.date_range('2001-01-01',periods=2000),
.....: columns=['A','B']); df
.....:
Out[141]:
A B
2001-01-01 0.000032 -0.000004
2001-01-02 -0.000001 0.000207
2001-01-03 0.000120 -0.000220
2001-01-04 -0.000083 -0.000165
2001-01-05 -0.000047 0.000156
2001-01-06 0.000027 0.000104
2001-01-07 0.000041 -0.000101
... ... ...
2006-06-17 -0.000034 0.000034
2006-06-18 0.000002 0.000166
2006-06-19 0.000023 -0.000081
2006-06-20 -0.000061 0.000012
2006-06-21 -0.000111 0.000027
2006-06-22 -0.000061 -0.000009
2006-06-23 0.000074 -0.000138
[2000 rows x 2 columns]
In [142]: def gm(aDF,Const):
.....: v = ((((aDF.A+aDF.B)+1).cumprod())-1)*Const
.....: return (aDF.index[0],v.iloc[-1])
.....:
In [143]: S = pd.Series(dict([ gm(df.iloc[i:min(i+51,len(df)-1)],5) for i in range(len(df)-50) ])); S
Out[143]:
2001-01-01 -0.001373
2001-01-02 -0.001705
2001-01-03 -0.002885
2001-01-04 -0.002987
2001-01-05 -0.002384
2001-01-06 -0.004700
2001-01-07 -0.005500
...
2006-04-28 -0.002682
2006-04-29 -0.002436
2006-04-30 -0.002602
2006-05-01 -0.001785
2006-05-02 -0.001799
2006-05-03 -0.000605
2006-05-04 -0.000541
Length: 1950, dtype: float64
对多重列进行循环应用返回标量的函数
In [144]: rng = pd.date_range(start = '2014-01-01',periods = 100)
In [145]: df = pd.DataFrame({'Open' : np.random.randn(len(rng)),
.....: 'Close' : np.random.randn(len(rng)),
.....: 'Volume' : np.random.randint(100,2000,len(rng))}, index=rng); df
.....:
Out[145]:
Close Open Volume
2014-01-01 -0.653039 0.011174 1581
2014-01-02 1.314205 0.214258 1707
2014-01-03 -0.341915 -1.046922 1768
2014-01-04 -1.303586 -0.752902 836
2014-01-05 0.396288 -0.410793 694
2014-01-06 -0.548006 0.648401 796
2014-01-07 0.481380 0.737320 265
... ... ... ...
2014-04-04 -2.548128 0.120378 564
2014-04-05 0.223346 0.231661 1908
2014-04-06 1.228841 0.952664 1090
2014-04-07 0.552784 -0.176090 1813
2014-04-08 -0.795389 1.781318 1103
2014-04-09 -0.018815 -0.753493 1456
2014-04-10 1.138197 -1.047997 1193
[100 rows x 3 columns]
In [146]: def vwap(bars): return ((bars.Close*bars.Volume).sum()/bars.Volume.sum())
In [147]: window = 5
In [148]: s = pd.concat([ (pd.Series(vwap(df.iloc[i:i+window]), index=[df.index[i+window]])) for i in range(len(df)-window) ]);
In [149]: s.round(2)
Out[149]:
2014-01-06 -0.03
2014-01-07 0.07
2014-01-08 -0.40
2014-01-09 -0.81
2014-01-10 -0.63
2014-01-11 -0.86
2014-01-12 -0.36
...
2014-04-04 -1.27
2014-04-05 -1.36
2014-04-06 -0.73
2014-04-07 0.04
2014-04-08 0.21
2014-04-09 0.07
2014-04-10 0.25
Length: 95, dtype: float64
pandas的cookbook中翻译到现在还剩下一些作图、读写文件的部分没有翻译,打算弃坑以后有兴趣再来填。