A fairly good pandas guide for readers who already know pandas and want to pick up some advanced usage; it is not for people completely new to pandas. Most of the material is a collection of common Stack Overflow questions.
Original pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html
I add to, trim, and adapt the original, and insert notes of my own.
Previous post: Advanced pandas usage (1): filter conditions, MultiIndex, missing values
Grouping
Unlike agg, the callable passed to apply receives the whole sub-DataFrame, so it has access to all columns.
- Using apply
#df
animal size weight adult
0 cat S 8 False
1 dog S 10 False
2 cat M 11 False
3 fish M 1 False
4 dog M 20 False
5 cat L 12 True
6 cat L 12 True
# For each animal, the size of the individual with the greatest weight
In [106]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[106]:
animal
cat L
dog M
fish M
dtype: object
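As an aside (not in the cookbook), the same result can be had without apply: idxmax on the grouped weight column gives each group's row label directly. A minimal sketch, reconstructing the example data above:

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'cat', 'fish', 'dog', 'cat', 'cat'],
                   'size': ['S', 'S', 'M', 'M', 'M', 'L', 'L'],
                   'weight': [8, 10, 11, 1, 20, 12, 12]})

# idxmax on the grouped weight column returns each group's row label of the
# maximum; .loc then pulls the matching 'size' values in one shot.
heaviest = df.loc[df.groupby('animal')['weight'].idxmax()]
result = heaviest.set_index('animal')['size']
```

This avoids calling a Python function per group, at the cost of two passes over the frame.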
- Using get_group
In [107]: gb = df.groupby(['animal'])
In [108]: gb.get_group('cat')
Out[108]:
animal size weight adult
0 cat S 8 False
2 cat M 11 False
5 cat L 12 True
6 cat L 12 True
- Applying different computations to items within the same group
In [109]: def GrowUp(x):
.....: avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
.....: avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
.....: avg_weight += sum(x[x['size'] == 'L'].weight)
.....: avg_weight /= len(x)
.....: return pd.Series(['L', avg_weight, True],
.....: index=['size', 'weight', 'adult'])
.....:
In [110]: expected_df = gb.apply(GrowUp)
In [111]: expected_df
Out[111]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
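The three per-size sums inside GrowUp can also be collapsed into one vectorized product via map. A sketch of that rewrite (mine, not the cookbook's), reconstructing the animal data:

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'cat', 'fish', 'dog', 'cat', 'cat'],
                   'size': ['S', 'S', 'M', 'M', 'M', 'L', 'L'],
                   'weight': [8, 10, 11, 1, 20, 12, 12]})

# map turns each size into its weight multiplier, so the three per-size
# sums inside GrowUp become a single vectorized product plus a group mean.
factor = df['size'].map({'S': 1.5, 'M': 1.25, 'L': 1.0})
adult_weight = (df['weight'] * factor).groupby(df['animal']).mean()
```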
- Expanding apply
In [112]: S = pd.Series([i / 100.0 for i in range(1, 11)])
In [113]: def cum_ret(x, y):
.....: return x * (1 + y)
.....:
In [114]: def red(x):
.....: return functools.reduce(cum_ret, x, 1.0)
.....:
In [115]: S.expanding().apply(red, raw=True)
Out[115]:
0 1.010000
1 1.030200
2 1.061106
3 1.103550
4 1.158728
5 1.228251
6 1.314229
7 1.419367
8 1.547110
9 1.701821
dtype: float64
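For this particular cum_ret, the expanding reduce just compounds the (1 + y) factors, so it is equivalent to a cumulative product — a vectorized sketch:

```python
import pandas as pd

S = pd.Series([i / 100.0 for i in range(1, 11)])

# red() folds cum_ret over the expanding window, i.e. it multiplies the
# running (1 + y) factors -- exactly a cumulative product of (1 + S).
compounded = (1 + S).cumprod()
```

expanding().apply is the general tool; reach for it when no such closed form exists.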
- Replacing values with the mean of the rest of the group
In [116]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})
In [117]: gb = df.groupby('A')
In [118]: def replace(g):
.....: mask = g < 0
.....: return g.where(~mask, g[~mask].mean())
.....:
In [119]: gb.transform(replace)
Out[119]:
B
0 1.0
1 1.0
2 1.0
3 2.0
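A sketch of the same goal — negatives replaced by the mean of the rest of the group — done by masking the negatives to NaN first and then filling with group means:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})

# mask() turns negatives into NaN; transform('mean') skips NaN, so the
# fill value is the mean of each group's remaining (non-negative) values.
masked = df['B'].mask(df['B'] < 0)
filled = masked.fillna(masked.groupby(df['A']).transform('mean'))
```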
- Sorting rows by aggregated group totals
In [120]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
.....: 'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
.....: 'flag': [False, True] * 3})
.....:
In [121]: code_groups = df.groupby('code')
In [122]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
In [123]: sorted_df = df.loc[agg_n_sort_order.index]
In [124]: sorted_df
Out[124]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
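An equivalent spelling keeps the group total as a named helper column; note that sorting by ['total', 'data'] also orders rows within each code, which is a slightly different tie-break than the stable sort shown above:

```python
import pandas as pd

df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
                   'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
                   'flag': [False, True] * 3})

# Broadcast each code's total back onto its rows, sort by it, then drop
# the helper column; groups come out ordered by their totals.
totals = df.groupby('code')['data'].transform('sum')
sorted_df = (df.assign(total=totals)
               .sort_values(['total', 'data'])
               .drop(columns='total'))
```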
- Creating multiple aggregated columns
In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq='2min')
In [126]: ts = pd.Series(data=list(range(10)), index=rng)
In [127]: def MyCust(x):
.....: if len(x) > 2:
.....: return x[1] * 1.234
.....: return pd.NaT
.....:
In [128]: mhc = {'Mean': np.mean, 'Max': np.max, 'Custom': MyCust}
In [129]: ts.resample("5min").apply(mhc)
Out[129]:
Mean 2014-10-07 00:00:00 1
2014-10-07 00:05:00 3.5
2014-10-07 00:10:00 6
2014-10-07 00:15:00 8.5
Max 2014-10-07 00:00:00 2
2014-10-07 00:05:00 4
2014-10-07 00:10:00 7
2014-10-07 00:15:00 9
Custom 2014-10-07 00:00:00 1.234
2014-10-07 00:05:00 NaT
2014-10-07 00:10:00 7.404
2014-10-07 00:15:00 NaT
dtype: object
In [130]: ts
Out[130]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
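When the aggregations are all built-ins, passing a list to .agg yields one column per function — a DataFrame, which is usually easier to consume than the stacked Series above. A small sketch on the same ts:

```python
import pandas as pd

rng = pd.date_range(start='2014-10-07', periods=10, freq='2min')
ts = pd.Series(data=list(range(10)), index=rng)

# One column per aggregation: bins of 3, 2, 3, 2 points give means
# 1.0, 3.5, 6.0, 8.5 and maxes 2, 4, 7, 9.
out = ts.resample('5min').agg(['mean', 'max'])
```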
- Creating a value-counts column on a DataFrame
#df
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [133]: df['Counts'] = df.groupby(['Color']).transform(len)
In [134]: df
Out[134]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
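The same counts can be written with the named aggregation string, which reads a little more clearly than transform(len). A sketch on the reconstructed data:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Red', 'Blue'],
                   'Value': [100, 150, 50, 50]})

# transform('size') broadcasts each group's row count back onto its rows,
# the string-named equivalent of transform(len) above.
df['Counts'] = df.groupby('Color')['Value'].transform('size')
```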
- Shifting groups of values in a column, based on the index
#df
line_race beyer
Last Gunfighter 10 99
Last Gunfighter 10 102
Last Gunfighter 8 103
Paynter 10 103
Paynter 10 88
Paynter 8 100
In [137]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
In [138]: df
Out[138]:
line_race beyer beyer_shifted
Last Gunfighter 10 99 NaN
Last Gunfighter 10 102 99.0
Last Gunfighter 8 103 102.0
Paynter 10 103 NaN
Paynter 10 88 103.0
Paynter 8 100 88.0
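A quick sketch reconstructing this frame; when what you actually want is the change from the previous row, diff() is the shorthand for subtracting the shifted column:

```python
import pandas as pd

df = pd.DataFrame({'line_race': [10, 10, 8, 10, 10, 8],
                   'beyer': [99, 102, 103, 103, 88, 100]},
                  index=['Last Gunfighter'] * 3 + ['Paynter'] * 3)

# shift(1) within each index group leaves NaN at every group's first row;
# diff() is the shorthand when the row-to-row change is what you want.
df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
df['beyer_change'] = df.groupby(level=0)['beyer'].diff()
```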
- Selecting the row with the maximum value in each group
#df
no
host service
other mail 1
web 2
that mail 1
this mail 2
web 1
In [140]: mask = df.groupby(level=0).agg('idxmax')  # index label of each group's maximum
# no
#host
#other (other, web)
#that (that, mail)
#this (this, mail)
In [141]: df_count = df.loc[mask['no']].reset_index()  # reset_index turns the old index into columns
In [142]: df_count
Out[142]:
host service no
0 other web 2
1 that mail 1
2 this mail 2
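A common idxmax-free alternative is to sort by the value and keep the last row per group — a sketch, reconstructing the host/service frame:

```python
import pandas as pd

df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
                   'service': ['mail', 'web', 'mail', 'mail', 'web'],
                   'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])

# After sorting ascending by 'no', the last row of each host group is its
# maximum; tail(1) keeps exactly that row.
df_count = (df.sort_values('no')
              .groupby(level='host').tail(1)
              .reset_index()
              .sort_values('host')
              .reset_index(drop=True))
```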
- Grouping like Python's itertools.groupby
In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
In [144]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
Out[144]:
{1: Int64Index([0], dtype='int64'),
2: Int64Index([1], dtype='int64'),
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5], dtype='int64'),
5: Int64Index([6], dtype='int64'),
6: Int64Index([7, 8], dtype='int64')}
In [145]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
Out[145]:
0 0
1 1
2 0
3 1
4 2
5 3
6 0
7 1
8 2
Name: A, dtype: int64
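The same run-labelling trick answers the classic "length of each consecutive run" question directly — a small sketch:

```python
import pandas as pd

s = pd.Series([0, 1, 0, 1, 1, 1, 0, 1, 1])

# Every value change starts a new run; cumsum of the change marker labels
# the runs, so .size() gives each consecutive run's length.
run_id = (s != s.shift()).cumsum()
run_lengths = s.groupby(run_id).size()
```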
Splitting
Split rows into a list of DataFrames according to some logic.
- Splitting into three groups, with the 'B' rows as boundaries
In [146]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A','A'],'Data': np.random.randn(9)})
In [147]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
.....: .rolling(window=3, min_periods=1).median())))[-1]
.....:
# dfs is a tuple containing three DataFrames
In [148]: dfs[0]
Out[148]:
Case Data
0 A 0.276232
1 A -1.087401
2 A -0.673690
3 B 0.113648
In [149]: dfs[1]
Out[149]:
Case Data
4 A -1.478427
5 A 0.524988
6 B 0.404705
In [150]: dfs[2]
Out[150]:
Case Data
7 A 0.577046
8 A -1.715002
A small trick here: multiplying a boolean Series by 1 converts True/False into 1/0. The multiplication is actually unnecessary, though; calling .cumsum() directly on the boolean Series gives the same result:
>>> (df['Case'] == 'B').cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
Name: Case, dtype: int64
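The rolling-median call above only serves to pull each 'B' row back into the group it closes; shifting the cumulative count achieves the same thing more directly. A sketch (with made-up Data values, since the original uses random numbers):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'],
                   'Data': np.arange(9.0)})

# Shifting the cumulative 'B' count down one row keeps each 'B' row inside
# the group it closes -- same split as the rolling-median trick above.
group_id = (df['Case'] == 'B').cumsum().shift(fill_value=0)
dfs = [g for _, g in df.groupby(group_id)]
```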
Pivoting
- Partial sums and subtotals
#df
Province City Sales
0 ON Toronto 13
1 QC Montreal 6
2 BC Vancouver 16
3 AL Calgary 8
4 AL Edmonton 4
5 MN Winnipeg 3
6 ON Windsor 1
In [152]: table = pd.pivot_table(
df,
values=["Sales"],
index=["Province"],
columns=["City"],
aggfunc=np.sum,
margins=True,
)
# table: a pivot of df, with subtotals by province and by city
Sales
City Calgary Edmonton Montreal Toronto Vancouver Windsor Winnipeg All
Province
AL 8.0 4.0 NaN NaN NaN NaN NaN 12
BC NaN NaN NaN NaN 16.0 NaN NaN 16
MN NaN NaN NaN NaN NaN NaN 3.0 3
ON NaN NaN NaN 13.0 NaN 1.0 NaN 14
QC NaN NaN 6.0 NaN NaN NaN NaN 6
All 8.0 4.0 6.0 13.0 16.0 1.0 3.0 51
In [153]: table.stack('City')
Out[153]:
Sales
Province City
AL All 12.0
Calgary 8.0
Edmonton 4.0
BC All 16.0
Vancouver 16.0
... ...
All Montreal 6.0
Toronto 13.0
Vancouver 16.0
Windsor 1.0
Winnipeg 3.0
[20 rows x 1 columns]
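pd.crosstab is a shorthand for this kind of pivot; with values/aggfunc and margins=True it produces the same table with 'All' subtotals. A sketch on the reconstructed sales data:

```python
import pandas as pd

df = pd.DataFrame({'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                   'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
                            'Edmonton', 'Winnipeg', 'Windsor'],
                   'Sales': [13, 6, 16, 8, 4, 3, 1]})

# crosstab with values/aggfunc mirrors the pivot_table call above;
# margins=True adds the same 'All' subtotal row and column.
table = pd.crosstab(df['Province'], df['City'],
                    values=df['Sales'], aggfunc='sum', margins=True)
```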
- Frequency tables like plyr in R
#df
ID Gender ExamYear Class Participated Passed Employed Grade
0 x0 F 2007 algebra yes no True 48
1 x1 M 2007 stats yes yes True 99
2 x2 F 2007 bio yes yes True 75
3 x3 M 2008 algebra yes yes False 80
4 x4 F 2008 algebra no no False 42
5 x5 M 2008 stats yes yes False 80
6 x6 F 2008 stats yes yes False 72
7 x7 M 2009 algebra yes yes True 68
8 x8 M 2009 bio yes no True 36
9 x9 M 2009 bio yes yes False 78
>>>df.groupby("ExamYear").agg(
{
"Participated": lambda x: x.value_counts()["yes"],
"Passed": lambda x: sum(x == "yes"),
"Employed": lambda x: sum(x),
"Grade": lambda x: sum(x) / len(x),
}
)
>>>
Participated Passed Employed Grade
ExamYear
2007 3 2 3 74.000000
2008 3 3 0 68.500000
2009 3 2 2 60.666667
Here x.value_counts()["yes"] and sum(x == "yes") give the same count and are interchangeable — as long as "yes" actually occurs in the group. If it does not, value_counts()["yes"] raises a KeyError, while sum(x == "yes") simply returns 0.
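A tiny demonstration of that edge case, on made-up data with no "yes" at all:

```python
import pandas as pd

x = pd.Series(['no', 'no'])  # a group with no 'yes' at all

# (x == 'yes').sum() degrades gracefully to 0, while
# x.value_counts()['yes'] would raise KeyError; .get() is the safe spelling.
count_a = (x == 'yes').sum()
count_b = x.value_counts().get('yes', 0)
```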
- Creating a year-and-month cross-tabulation:
In [157]: df = pd.DataFrame(
{"value": np.random.randn(36)},
index=pd.date_range("2011-01-01", freq="M", periods=36),
)
In [158]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
.....: values='value', aggfunc='sum')
.....:
Out[158]:
2011 2012 2013
1 -1.039268 -0.968914 2.565646
2 -0.370647 -1.294524 1.431256
3 -1.157892 0.413738 1.340309
4 -1.344312 0.276662 -1.170299
5 0.844885 -0.472035 -0.226169
6 1.075770 -0.013960 0.410835
7 -0.109050 -0.362543 0.813850
8 1.643563 -0.006154 0.132003
9 -1.469388 -0.923061 -0.827317
10 0.357021 0.895717 -0.076467
11 -0.674600 0.805244 -1.187678
12 -1.776904 -1.206412 1.130127
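The same month-by-year table can be built with groupby + unstack instead of pivot_table. A sketch using deterministic values (the original uses random data) and an explicit month-start index:

```python
import pandas as pd
import numpy as np

# 36 consecutive months across 2011-2013, with value = running month number.
idx = pd.to_datetime([f'{y}-{m:02d}-01'
                      for y in (2011, 2012, 2013) for m in range(1, 13)])
df = pd.DataFrame({'value': np.arange(36.0)}, index=idx)

# Group by (month, year) and move the year level into the columns.
out = df['value'].groupby([df.index.month, df.index.year]).sum().unstack()
```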
Apply
- Turning embedded lists into a MultiIndex DataFrame
#df
A B
I [2, 4, 8, 16] [a, b, c]
II [100, 200] [jj, kk]
III [10, 20, 30] [ccc]
In [160]: def SeriesFromSubList(aList):
.....: return pd.Series(aList)
.....:
In [161]: df_orgz = pd.concat({ind: row.apply(SeriesFromSubList)
.....: for ind, row in df.iterrows()})
# The same thing without defining a named function:
In [161]: df_orgz = pd.concat({ind: row.apply(lambda y: pd.Series(y))
.....: for ind, row in df.iterrows()})
In [162]: df_orgz
Out[162]:
0 1 2 3
I A 2 4 8 16.0
B a b c NaN
II A 100 200 NaN NaN
B jj kk NaN NaN
III A 10 20 30 NaN
B ccc NaN NaN NaN
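An even shorter spelling of the same reshape, sketched on the reconstructed frame: stack() turns the frame into a Series of lists keyed by (index, column), and applying pd.Series expands each list into the 0..3 columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
                   'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
                  index=['I', 'II', 'III'])

# stack() yields a Series of lists with a (row, column) MultiIndex;
# apply(pd.Series) expands each list, padding shorter ones with NaN.
df_orgz = df.stack().apply(pd.Series)
```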
- Rolling apply returning a Series
Rolling apply over multiple columns, where the function computes a Series before returning a scalar from that Series
# df
A B
2001-01-01 -0.000144 -0.000141
2001-01-02 0.000161 0.000102
2001-01-03 0.000057 0.000088
2001-01-04 -0.000221 0.000097
2001-01-05 -0.000201 -0.000041
... ... ...
2006-06-19 0.000040 -0.000235
2006-06-20 -0.000123 -0.000021
2006-06-21 -0.000113 0.000114
2006-06-22 0.000136 0.000109
2006-06-23 0.000027 0.000030
[2000 rows x 2 columns]
In [165]: def gm(df, const):
.....: v = ((((df.A + df.B) + 1).cumprod()) - 1) * const
.....: return v.iloc[-1]
.....:
In [166]: s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
.....: for i in range(len(df) - 50)})
.....:
In [167]: s
Out[167]:
2001-01-01 0.000930
2001-01-02 0.002615
2001-01-03 0.001281
2001-01-04 0.001117
2001-01-05 0.002772
...
2006-04-30 0.003296
2006-05-01 0.002629
2006-05-02 0.002081
2006-05-03 0.004247
2006-05-04 0.003928
Length: 1950, dtype: float64
- Rolling apply returning a scalar
Rolling apply over multiple columns, where the function returns a scalar (the volume-weighted average price, VWAP)
In [168]: rng = pd.date_range(start='2014-01-01', periods=100)
In [169]: df = pd.DataFrame({'Open': np.random.randn(len(rng)),
.....: 'Close': np.random.randn(len(rng)),
.....: 'Volume': np.random.randint(100, 2000, len(rng))},
.....: index=rng)
.....:
In [170]: df
Out[170]:
Open Close Volume
2014-01-01 -1.611353 -0.492885 1219
2014-01-02 -3.000951 0.445794 1054
2014-01-03 -0.138359 -0.076081 1381
2014-01-04 0.301568 1.198259 1253
2014-01-05 0.276381 -0.669831 1728
... ... ... ...
2014-04-06 -0.040338 0.937843 1188
2014-04-07 0.359661 -0.285908 1864
2014-04-08 0.060978 1.714814 941
2014-04-09 1.759055 -0.455942 1065
2014-04-10 0.138185 -1.147008 1453
[100 rows x 3 columns]
In [171]: def vwap(bars):
.....: return ((bars.Close * bars.Volume).sum() / bars.Volume.sum())
.....:
In [172]: window = 5
In [173]: s = pd.concat([(pd.Series(vwap(df.iloc[i:i + window]),
.....: index=[df.index[i + window]]))
.....: for i in range(len(df) - window)])
.....:
In [174]: s.round(2)
Out[174]:
2014-01-06 0.02
2014-01-07 0.11
2014-01-08 0.10
2014-01-09 0.07
2014-01-10 -0.29
...
2014-04-06 -0.63
2014-04-07 -0.02
2014-04-08 -0.03
2014-04-09 0.34
2014-04-10 0.29
Length: 95, dtype: float64
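For VWAP specifically, rolling sums express the same computation without the Python-level loop over slices. A sketch on seeded stand-in data (the original uses unseeded randoms, and I drop the unused Open column); shift(1) reproduces labelling each window by the bar that follows it:

```python
import pandas as pd
import numpy as np

rng = pd.date_range(start='2014-01-01', periods=100)
rs = np.random.RandomState(0)  # seeded stand-in for the random data above
df = pd.DataFrame({'Close': rs.randn(len(rng)),
                   'Volume': rs.randint(100, 2000, len(rng))}, index=rng)

window = 5
# sum(Close * Volume) / sum(Volume) over each trailing window; shift(1)
# assigns each window's VWAP to the bar after the window, as in the loop.
vwap = ((df.Close * df.Volume).rolling(window).sum()
        / df.Volume.rolling(window).sum()).shift(1).dropna()
```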