一篇比较好的pandas指南,适合已经熟悉pandas,并想掌握一些进阶用法的读者,不适合对pandas完全不了解的新人。文章大部分是Stack Overflow常见问题集合。
pandas 官网 原文连接: https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html
我会在原文基础上进行增删改,添加一些注释。
常见用法
if-then
- 对一列数据执行 if-then / if-then-else 操作,把计算结果赋值给一列或多列。
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
对其中一列筛选,并操作另一列:
In [3]: df.loc[df.AAA >= 5, 'BBB'] = -1
In [4]: df
Out[4]:
AAA BBB CCC
0 4 10 100
1 5 -1 50
2 6 -1 -30
3 7 -1 -50
操作两列:
In [5]: df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555
In [6]: df
Out[6]:
AAA BBB CCC
0 4 10 100
1 5 555 555
2 6 555 555
3 7 555 555
可以通过相反逻辑的代码来实现else,如下
df.loc[df.AAA < 5, ["BBB", "CCC"]] = 2000
或者使用mask和pandas的where来实现,下述表示在mask中,True
的地方转化为-1000
In [9]: df_mask = pd.DataFrame({'AAA': [True] * 4,
...: 'BBB': [False] * 4,
...: 'CCC': [True, False] * 2})
...:
In [10]: df.where(df_mask, -1000)
Out[10]:
AAA BBB CCC
0 4 -1000 2000
1 5 -1000 -1000
2 6 -1000 555
3 7 -1000 -1000
也可以使用numpyd where()函数实现
In [13]: df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
In [14]: df
Out[14]:
AAA BBB CCC logic
0 4 10 100 low
1 5 20 50 low
2 6 30 -30 high
3 7 40 -50 high
Splitting
通过布尔值切分
In [17]: df[df.AAA <= 5]
Out[17]:
AAA BBB CCC
0 4 10 100
1 5 20 50
In [18]: df[df.AAA > 5]
Out[18]:
AAA BBB CCC
2 6 30 -30
3 7 40 -50
条件设置
还是原来的数据df:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
- 通过“和”符号、或符号筛选,返回Series:
In [21]: df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']
#切选bbb列小于25且ccc列大于-40的aaa列值
Out[21]:
0 4
1 5
Name: AAA, dtype: int64
In [22]: df.loc[(df['BBB'] > 25) | (df['CCC'] >= -40), 'AAA']
#切选bbb列小于25或ccc列大于-40的aaa列值
Out[22]:
0 4
1 5
2 6
3 7
Name: AAA, dtype: int64
- 筛选后赋值原Dataframe
In [23]: df.loc[(df['BBB'] > 25) | (df['CCC'] >= 75), 'AAA'] = 0.1
#切选bbb列小于25或ccc列大于-40的aaa列值,并赋值为0.1
In [24]: df
Out[24]:
AAA BBB CCC
0 0.1 10 100
1 5.0 20 50
2 0.1 30 -30
3 0.1 40 -50
用 argsort 选择最接近指定值的行
In [27]: aValue = 43.0
In [28]: df.loc[(df.CCC - aValue).abs().argsort()]
Out[28]:
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
用二进制运算符动态减少条件列表
#原始数据
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [31]: Crit1 = df.AAA <= 5.5
In [32]: Crit2 = df.BBB == 10.0
In [33]: Crit3 = df.CCC > -40.0
#方法1:通过直接硬编码
In [34]: AllCrit = Crit1 & Crit2 & Crit3
#方法2:通过动态生成条件列表,适用于非常多的条件
In [35]: import functools
In [36]: CritList = [Crit1, Crit2, Crit3]
In [37]: AllCrit = functools.reduce(lambda x, y: x & y, CritList)
In [38]: df[AllCrit]
Out[38]:
AAA BBB CCC
0 4 10 100
有关functools.reduce的官方文档
Selection
- 行标签与值作为条件
用逆运算符 (~)提取掩码,获取反向的内容
In [41]: df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
Out[41]:
AAA BBB CCC
0 4 10 100
2 6 30 -30
In [52]: df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
Out[52]:
AAA BBB CCC
1 5 20 50
3 7 40 -50
- 标签切片与位置切片
数据:
AAA BBB CCC
foo 4 10 100
bar 5 20 50
boo 6 30 -30
kar 7 40 -50
三种切片方法:
#法1:使用loc根据标签切片,包含结尾
n [43]: df.loc['bar':'kar'] #或通用写成df["bar":"kar"]
Out[43]:
AAA BBB CCC
bar 5 20 50
boo 6 30 -30
kar 7 40 -50
# 使用iloc根据位置切片,不包含结尾
In [44]: df.iloc[0:3] #或通用写成df[0:3]
Out[44]:
AAA BBB CCC
foo 4 10 100
bar 5 20 50
boo 6 30 -30
注意,如果索引由整数组成但不从 0 开始,或不是逐步递增的,会引发歧义,可以通过iloc和loc进行区分。
df2
AAA BBB CCC
1 4 10 100
2 5 20 50
3 6 30 -30
4 7 40 -50
df2.iloc[1:3] # Position-oriented
Out[48]:
AAA BBB CCC
2 5 20 50
3 6 30 -30
In [49]: df2.loc[1:3] # Label-oriented
Out[49]:
AAA BBB CCC
1 4 10 100
2 5 20 50
3 6 30 -30
- 用 applymap 动态生成新列
AAA BBB CCC
0 1 1 2
1 2 1 1
2 1 2 3
3 3 2 1
In [55]: source_cols = df.columns # Or some subset would work too
In [56]: new_cols = [str(x) + "_cat" for x in source_cols]#生成新列名
In [57]: categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}
In [58]: df[new_cols] = df[source_cols].applymap(categories.get)
In [59]: df
Out[59]:
AAA BBB CCC AAA_cat BBB_cat CCC_cat
0 1 1 2 Alpha Alpha Beta
1 2 1 1 Beta Alpha Alpha
2 1 2 3 Alpha Beta Charlie
3 3 2 1 Charlie Beta Alpha
- 使用groupby时保留其他列
In [61]: df
Out[61]:
AAA BBB
0 1 2
1 1 1
2 1 3
3 2 4
4 2 5
5 2 1
6 3 2
7 3 3
#方法1:用 idxmin() 提取对应索引
In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]:
AAA BBB
1 1 1
5 2 1
6 3 2
#方法2:先排序,再提取每组的第一个值
In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[63]:
AAA BBB
0 1 1
1 2 1
2 3 2
这里需注意,虽然提取的结果是一样的,但两者索引不同
多重索引
很多dataframe在处理后,是以multiindex形式存在的,如describe,pivottable。同样可以通过多重索引进行删减改查的操作。
- 创建
row One_X One_Y Two_X Two_Y
0 0 1.1 1.2 1.11 1.22
1 1 1.1 1.2 1.11 1.22
2 2 1.1 1.2 1.11 1.22
# 设置索引标签
In [66]: df = df.set_index('row')
In [67]: df
Out[67]:
One_X One_Y Two_X Two_Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
# 多层索引的列
In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_'))
....: for c in df.columns])
....:
In [69]: df
Out[69]:
One Two
X Y X Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
# 先 stack,然后 Reset 索引
In [70]: df = df.stack(0).reset_index(1)
In [71]: df
Out[71]:
level_1 X Y
row
0 One 1.10 1.20
0 Two 1.11 1.22
1 One 1.10 1.20
1 Two 1.11 1.22
2 One 1.10 1.20
2 Two 1.11 1.22
# 修整标签,注意自动添加了标签 `level_1`
In [72]: df.columns = ['Sample', 'All_X', 'All_Y']
In [73]: df
Out[73]:
Sample All_X All_Y
row
0 One 1.10 1.20
0 Two 1.11 1.22
1 One 1.10 1.20
1 Two 1.11 1.22
2 One 1.10 1.20
2 Two 1.11 1.22
- 运算
通过广播进行运算
In [74]: cols = pd.MultiIndex.from_tuples([(x, y) for x in ['A', 'B', 'C']
....: for y in ['O', 'I']])
....:
In [75]: df = pd.DataFrame(np.random.randn(2, 6), index=['n', 'm'], columns=cols)
In [76]: df
Out[76]:
A B C
O I O I O I
n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215
m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804
In [77]: df = df.div(df['C'], level=1)
In [78]: df
Out[78]:
A B C
O I O I O I
n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0
m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0
- 切片
如下数据框架df
MyData
AA one 11
six 22
BB one 33
two 44
six 55
1:用 xs 切片多层索引
#提取第一层
# 注意:level 与 axis 是可选项,默认为 0
In [83]: df.xs('BB', level=0, axis=0)
Out[83]:
MyData
one 33
two 44
six 55
#提取第二层
In [84]: df.xs('six', level=1, axis=0)
Out[84]:
MyData
AA 22
BB 55
方法2:
构造一个行索引和列索引都是多重索引的数据集
In [85]: import itertools
In [86]: index = list(itertools.product(['Ada', 'Quinn', 'Violet'],
....: ['Comp', 'Math', 'Sci']))
In [87]: headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
In [88]: indx = pd.MultiIndex.from_tuples(index, names=['Student', 'Course'])
In [89]: cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named
In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
In [91]: df = pd.DataFrame(data, indx, cols)
In [92]: df
Out[92]:
Exams Labs
I II I II
Student Course
Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
进行切片
In [93]: All = slice(None)
In [94]: df.loc['Violet']
Out[94]:
Exams Labs
I II I II
Course
Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
In [95]: df.loc[(All, 'Math'), All]
Out[95]:
Exams Labs
I II I II
Student Course
Ada Math 71 73 75 74
Quinn Math 74 76 78 77
Violet Math 77 79 81 80
In [96]: df.loc[(slice('Ada', 'Quinn'), 'Math'), All]
Out[96]:
Exams Labs
I II I II
Student Course
Ada Math 71 73 75 74
Quinn Math 74 76 78 77
In [97]: df.loc[(All, 'Math'), ('Exams')]
Out[97]:
I II
Student Course
Ada Math 71 73
Quinn Math 74 76
Violet Math 77 79
In [98]: df.loc[(All, 'Math'), (All, 'II')]
Out[98]:
Exams Labs
II II
Student Course
Ada Math 73 74
Quinn Math 76 77
Violet Math 79 80
- 排序
用多层索引按指定列或列序列表排序,和单层相同,只要传入一个元组即可
In [99]: df.sort_values(by=('Labs', 'II'), ascending=False)
Out[99]:
Exams Labs
I II I II
Student Course
Violet Sci 78 81 81 81
Math 77 79 81 80
Comp 76 77 78 79
Quinn Sci 75 78 78 78
Math 74 76 78 77
Comp 73 74 75 76
Ada Sci 72 75 75 75
Math 71 73 75 74
Comp 70 71 72 73
- 层级问题
官网举例了两个stackoverflow的问题,在此总结如下
- 为多层索引添加一层
数据
# Vals
# A B
# a1 b1 -1.632460
# b2 0.596027
# a2 b3 -0.619130
# a3 b4 -0.002009
#通过concat增加一层索引
pd.concat([df], keys=['Foo'], names=['Firstlevel'])
#或者
pd.concat({'Foo': df}, names=['Firstlevel'])
df
# Vals
# FirstLevel A B
# Foo a1 b1 -1.632460
# b2 0.596027
# a2 b3 -0.619130
# a3 b4 -0.002009
- 将多层索引展开
df.columns = df.columns.get_level_values(0)
Exams Exams Labs Labs
Student Course
Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
#如果需要将列名重命名,可通过join重组:
df.columns = [' '.join(col).strip() for col in df.columns.values]
Exams I Exams II Labs I Labs II
Student Course
Ada Comp 70 71 72 73
Math 71 73 75 74
Sci 72 75 75 75
Quinn Comp 73 74 75 76
Math 74 76 78 77
Sci 75 78 78 78
Violet Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
缺失值
- 根据时间序列补充缺失值
In [100]: df = pd.DataFrame(np.random.randn(6, 1),
.....: index=pd.date_range('2013-08-01', periods=6, freq='B'),
.....: columns=list('A'))
.....:
In [101]: df.loc[df.index[3], 'A'] = np.nan
In [102]: df
Out[102]:
A
2013-08-01 0.721555
2013-08-02 -0.706771
2013-08-05 -1.039575
2013-08-06 NaN
2013-08-07 -0.424972
2013-08-08 0.567020
In [103]: df.reindex(df.index[::-1]).ffill()
Out[103]:
A
2013-08-08 0.567020
2013-08-07 -0.424972
2013-08-06 -0.424972
2013-08-05 -1.039575
2013-08-02 -0.706771
2013-08-01 0.721555
- 替换
一个stackoverflow的案例,使用df.replace()
df = pd.DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']],columns=['A'])
如果直接进行正则匹配,则会返回一个元组
>>>ref = df['A']
>>>ref.str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE)
0 [(someproduct_step2, ;)]
Name: A, dtype: object
如果只想要一个列的话(结果应为someproduct_step2
):
In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE).apply(lambda x: Series(x[0][0],index=['first']))
Out[26]:
first
0 someproduct_step2
#或者在0.11.1之后
In [34]: df.replace({ 'A' : "http:.+\d\d\/(.*?)(;|\\?).*$"}, { 'A' : r'\1'} ,regex=True)
Out[34]:
A
0 someproduct_step2
一个官网更简单清晰的例子解释replace里使用正则
>>>df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
'B': ['abc', 'bar', 'xyz']})
>>>df.replace(to_replace=r'^ba.$', value='new', regex=True)
A B
0 new abc
1 foo new
2 bait xyz
#或者传入字典指定列
>>>df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
A B
0 new abc
1 foo bar
2 bait xyz
#或者更简单的
>>>df.replace(regex=r'^ba.$', value='new')
A B
0 new abc
1 foo new
2 bait xyz
#正则多个条件
>>>df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
A B
0 new abc
1 xyz new
2 bait xyz
#正则多个条件更改为同一值
>>>df.replace(regex=[r'^ba.$', 'foo'], value='new')
A B
0 new abc
1 new new
2 bait xyz