5、根据索引合并
DataFrame中用于合并的键是它的索引。在这种情况下,你可以传递left_index=True或right_index=True(或者都传)来表示索引需要用来作为合并的键:
df1 = pd.DataFrame({'col1':range(5), 'col2':['a', 'b', 'c', 'd', 'e']})
df1
Out[91]:
col1 col2
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
df2 = pd.DataFrame({'col2_2':['mingming', 'yueyue']})
df2
Out[93]:
col2_2
0 mingming
1 yueyue
pd.merge(df1, df2, left_on='col1', right_index=True)
Out[94]:
col1 col2 col2_2
0 0 a mingming
1 1 b yueyue
5.1、多层索引进行合并
df1 = pd.DataFrame({'key1':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'key2':['2000', '2001' , '2002', '2001', '2002'], 'data': np.arange(5)})
df1
Out[98]:
key1 key2 data
0 Ohio 2000 0
1 Ohio 2001 1
2 Ohio 2002 2
3 Nevada 2001 3
4 Nevada 2002 4
df2 = pd.DataFrame(np.arange(12).reshape((6, 2)), index=[['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Ohio'], ['2000', '2001' , '2002', '2001', '2002', '2003']], columns=['col1', 'col2'])
df2
Out[104]:
col1 col2
Ohio 2000 0 1
2001 2 3
2002 4 5
Nevada 2001 6 7
2002 8 9
Ohio 2003 10 11
pd.merge(df1, df2, left_on=['key1', 'key2'], right_index=True)
Out[106]:
key1 key2 data col1 col2
0 Ohio 2000 0 0 1
1 Ohio 2001 1 2 3
2 Ohio 2002 2 4 5
3 Nevada 2001 3 6 7
4 Nevada 2002 4 8 9
pd.merge(df1, df2, left_on=['key1', 'key2'], right_index=True, how='outer')
Out[107]:
key1 key2 data col1 col2
0 Ohio 2000 0.0 0 1
1 Ohio 2001 1.0 2 3
2 Ohio 2002 2.0 4 5
3 Nevada 2001 3.0 6 7
4 Nevada 2002 4.0 8 9
4 Ohio 2003 NaN 10 11
5.2、两边都是用索引进行合并
df1 = pd.DataFrame(np.arange(6).reshape((3,2)), index=['a', 'b', 'c'], columns=['ohio', 'nevada'])
df1
Out[110]:
ohio nevada
a 0 1
b 2 3
c 4 5
df2 = pd.DataFrame(np.arange(8).reshape((4,2)), index=['d', 'b', 'c','e'], columns=['missouri', 'alabama'])
df2
Out[112]:
missouri alabama
d 0 1
b 2 3
c 4 5
e 6 7
pd.merge(df1, df2, left_index=True, right_index=True)
Out[113]:
ohio nevada missouri alabama
b 2 3 2 3
c 4 5 4 5
6、延轴向连接
另一种数据组合操作可互换地称为拼接、绑定或堆叠。NumPy的concatenate函数可以在NumPy数组上实现该功能:
arr = np.arange(16).reshape((4, 4))
np.concatenate([arr, arr], axis=1)
Out[118]:
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11],
[12, 13, 14, 15, 12, 13, 14, 15]])
6.1、在axis=0轴上连接 Series
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1, s2, s3])
Out[127]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
6.2、在axis=1轴上连接 Series
pd.concat([s1,s2,s3], axis=1)
Out[128]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
你可以看到排序后的索引合集(‘outer’ join外连接)。你也可以传入join=‘inner’:
s4 = pd.concat([s1, s3])
s4
Out[130]:
a 0
b 1
f 5
g 6
dtype: int64
pd.concat([s1, s4], axis=1)
Out[131]:
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
# join='inner’的选项,'f’和’g’标签消失了。
pd.concat([s1, s4], axis=1, join='inner')
Out[132]:
0 1
a 0 0
b 1 1
6.3、 使用keys关键字,在连接结果的Series上设置多层索引
df = pd.concat([s1, s4], keys=['updown', 'down'])
type(df)
Out[147]: pandas.core.series.Series
df.index
Out[146]:
MultiIndex([('updown', 'a'),
('updown', 'b'),
( 'down', 'a'),
( 'down', 'b'),
( 'down', 'f'),
( 'down', 'g')],
)
df
Out[148]:
updown a 0
b 1
down a 0
b 1
f 5
g 6
df.unstack()
Out[149]:
a b f g
updown 0.0 1.0 NaN NaN
down 0.0 1.0 5.0 6.0
6.4、 连接Dataframe
df1 = pd.DataFrame(np.arange(8).reshape((4, 2)), columns=['up', 'down'], index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.arange(6).reshape((3, 2)), columns=['up', 'down'], index=['e', 'f', 'g'])
df1
Out[159]:
up down
a 0 1
b 2 3
c 4 5
d 6 7
df2
Out[160]:
up down
e 0 1
f 2 3
g 4 5
pd.concat([df1, df2])
Out[158]:
up down
a 0 1
b 2 3
c 4 5
d 6 7
e 0 1
f 2 3
g 4 5
pd.concat([df1, df2], axis=1)
Out[161]:
up down up down
a 0.0 1.0 NaN NaN
b 2.0 3.0 NaN NaN
c 4.0 5.0 NaN NaN
d 6.0 7.0 NaN NaN
e NaN NaN 0.0 1.0
f NaN NaN 2.0 3.0
g NaN NaN 4.0 5.0
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
Out[162]:
level1 level2
up down up down
a 0.0 1.0 NaN NaN
b 2.0 3.0 NaN NaN
c 4.0 5.0 NaN NaN
d 6.0 7.0 NaN NaN
e NaN NaN 0.0 1.0
f NaN NaN 2.0 3.0
g NaN NaN 4.0 5.0
使用ignore_index=True ,可以忽略索引,直接连接(注意,下面的索引被重建了)
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1
Out[166]:
a b c d
0 -0.544383 0.110923 -1.150994 0.375698
1 -0.600639 -0.291694 -0.601707 1.852278
2 -0.013497 -1.057711 0.822545 -1.220844
df2
Out[167]:
b d a
0 0.208864 -1.959670 -1.328186
1 0.196861 0.738467 0.171368
pd.concat([df1, df2])
Out[168]:
a b c d
0 -0.544383 0.110923 -1.150994 0.375698
1 -0.600639 -0.291694 -0.601707 1.852278
2 -0.013497 -1.057711 0.822545 -1.220844
0 -1.328186 0.208864 NaN -1.959670
1 0.171368 0.196861 NaN 0.738467
pd.concat([df1, df2], ignore_index=True)
Out[169]:
a b c d
0 -0.544383 0.110923 -1.150994 0.375698
1 -0.600639 -0.291694 -0.601707 1.852278
2 -0.013497 -1.057711 0.822545 -1.220844
3 -1.328186 0.208864 NaN -1.959670
4 0.171368 0.196861 NaN 0.738467