3.8 合并数据集:合并与连接
pd的基本特性之一就是高性能的内存式数据连接join与合并merge操作。pd的主接口是merge函数。
3.8.1 关系代数
合并的理论基础是关系代数
3.8.2 数据连接的类型
merge实现三种数据连接类型:一对一,多对一,多对多。
import pandas as pd import numpy as np class display(object): """Display HTML representation of multiple objects""" template = """<div style="float: left; padding: 10px;"> <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1} </div>""" def __init__(self, *args): self.args = args def _repr_html_(self): return '\n'.join(self.template.format(a, eval(a)._repr_html_()) for a in self.args) def __repr__(self): return '\n\n'.join(a + '\n' + repr(eval(a)) for a in self.args)
一对一连接
是最简单的数据合并类型,与3.7节介绍的按列合并十分相似。
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'], 'group': ['Accounting', 'Engineering', 'Engineering', 'HR']}) df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'], 'hire_date': [2004, 2008, 2012, 2014]}) display('df1', 'df2')
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
若要将上边两个DF合并为一个,用merge函数:
df3 = pd.merge(df1, df2) df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
merge函数自动将两个DF共有的列employee作为键进行连接,生成一个新DF,原来DF的行索引自动丢弃,自动生成新行索引。
多对一连接
这种连接中,在需要连接的两个列中,有一列的值有重复。通过多对一连接的结果DF会保留重复值。如:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'], 'supervisor': ['Carly', 'Guido', 'Steve']}) display('df3', 'df4', 'pd.merge(df3, df4)')
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
df4
group | supervisor | |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
pd.merge(df3, df4)
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Bob | Accounting | 2008 | Carly |
1 | Jake | Engineering | 2012 | Guido |
2 | Lisa | Engineering | 2004 | Guido |
3 | Sue | HR |