第7章 数据规整化:清理、转换、合并、重塑(2)
DataFrame的连接键可能位于索引中,可用merge的参数left_index=True或right_index=True(或者两个同时传)来说明索引应该被用为连接键。例:
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
left1 = DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
right1 = DataFrame({'group_val':[3.5,7]},index=['a','b'])
#注意index的位置,{}内为DataFrame的一个参数data,index是另一个参数
其中left1为
key | value |
---|---|
0 | a |
1 | b |
2 | a |
3 | a |
4 | b |
5 | c |
right1为
group_val |
---|
a |
b |
pd.merge(left1,right1,left_on='key',right_index=True)
#left_on='key'表示merge方法中左侧DataFrame键是key列,right_index=True表示右侧DataFrame中的键在索引中,默认how="inner"求交集
输出结果如下
key | values | group_val |
---|---|---|
0 | a | 0 |
2 | a | 2 |
3 | a | 3 |
1 | b | 1 |
4 | b | 4 |
pd.merge(left1,right1,left_on='key',right_index=True,how='outer')
#修改连接方式为外连接
Output
key | values | group_val |
---|---|---|
0 | a | 0 |
2 | a | 2 |
3 | a | 3 |
1 | b | 1 |
4 | b | 4 |
5 | c | 5 |
同时使用双方的索引如下
left2 = DataFrame([[1.,2.],[3.,4.],[5.,6.]],index=['a','c','e'],columns=['Ohio','Nevada'])
right2 = DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],index=['b','c','d','e'],columns=['Missouri','Alabama'])
In [24]:left2
Out[24]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
In [25]:right2
Out[25]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
Out[26]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
层次化索引数据,较为复杂。例
lefth = DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada','Nevada'],'key2':[2000,2001,2002,2001,2002],'data':np.arange(5.)})
righth = DataFrame(np.arange(12).reshape(6,2),index = [['Nevada','Nevada','Ohio','Ohio','Ohio','Ohio'],[2001,2000,2000,2000,2001,2002]],columns = ['event1','event2'])
In [17]:righth
Out[17]:
data key1 key2
0 0.0 Ohio 2000
1 1.0 Ohio 2001
2 2.0 Ohio 2002
3 3.0 Nevada 2001
4 4.0 Nevada 2002
In [18]:righth
Out[18]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
##合并时必须list形式指明用作合并键的多个列
##注意索引的重复值
##内连接
In [19]:pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)
Out[19]:
data key1 key2 event1 event2
0 0.0 Ohio 2000 4 5
0 0.0 Ohio 2000 6 7
1 1.0 Ohio 2001 8 9
2 2.0 Ohio 2002 10 11
3 3.0 Nevada 2001 0 1
##外连接
In [20]:pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
Out[20]:
data key1 key2 event1 event2
0 0.0 Ohio 2000 4.0 5.0
0 0.0 Ohio 2000 6.0 7.0
1 1.0 Ohio 2001 8.0 9.0
2 2.0 Ohio 2002 10.0 11.0
3 3.0 Nevada 2001 0.0 1.0
4 4.0 Nevada 2002 NaN NaN
4 NaN Nevada 2000 2.0 3.0
DataFrame.join()
可以更加方便的实现按索引合并,可用于合并多个带有相同或相似索引的DataFrame对象而不管它们之间有没有重叠的列。
left2和right2的例子可以改为。
left2.join(right2,how='outer')
Out[27]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
###由于历史原因,join方法默认实在连接键上做左连接,即默认how='left',也支持参数DataFrame的索引和调用者DataFrame的某列之间连接
left1.join(right1,on='key')#没有显式定义how,how='left'
#right1为参数,合并的是键,后面on='key'指调用者left1合并的列
Out[28]:
key values group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
###对于简单的索引合并,可以传入一组DataFrame
another = DataFrame([[7,8],[9,10],[11,12],[16,17]],index=['a','c','e','f'],columns=['New York','Oregon'])
In [33]: another
Out[33]:
New York Oregon
a 7 8
c 9 10
e 11 12
f 16 17
In [34]: left2.join([right2,another]) #默认how='left'
Out[34]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7 8
c 3.0 4.0 9.0 10.0 9 10
e 5.0 6.0 13.0 14.0 11 12
In [35]:left2.join([right2,another],how='outer')
Out[35]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
b NaN NaN 7.0 8.0 NaN NaN
c 3.0 4.0 9.0 10.0 9.0 10.0
d NaN NaN 11.0 12.0 NaN NaN
e 5.0 6.0 13.0 14.0 11.0 12.0
f NaN NaN NaN NaN 16.0 17.0
轴向连接
合并需要一个键,连接不需要。
numpy中有一个用于合并numpy数组的concatenation函数
arr = np.arange(12).reshape(3,4)
In [37]:arr
Out[37]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.concatenate([arr,arr],axis=1)#默认axis=0
Out[38]:
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
pandas中函数concat提供类似功能
例如对于Series对象来说
s1 = Series([0,1],index=['a','b'])
Out[43]:
a 0
b 1
dtype: int64
s2 = Series([2,3,4],index=['c','d','e'])
Out[44]:
c 2
d 3
e 4
dtype: int64
s3 = Series([5,6],index=['f','g'])
Out[45]:
f 5
g 6
dtype: int64
pd.concat([s1,s2,s3])#注意[],三个Series输入为第一个参数
#默认axis=0
Out[46]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
pd.concat([s1,s2,s3],axis=1)
Out[47]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
s4 = pd.concat([s1*5,s3])
Out[49]:
a 0
b 5
f 5
g 6
dtype: int64
pd.concat([s1,s4],axis=1)
Out[50]:
0 1
a 0.0 0
b 1.0 5
f NaN 5
g NaN 6
##设定join的值可以求交集
pd.concat([s1,s4],axis=1,join='inner')
Out[51]:
0 1
a 0 0
b 1 5
###join_axes指定索引
pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']])
##注意join_axes参数中两层[]括号,一层报错
Out[10]:
0 1
a 0.0 0.0
c NaN NaN
b 1.0 5.0
e NaN NaN
##keys参数区分连接的片段,建立层次化索引
result = pd.concat([s1,s2,s3],keys=['one','two','three'])#默认axis=0
In [12]:result
Out[12]:
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
###指定axis=1时,keys成为列头
pd.concat([s1,s2,s3],keys=['one','two','three'],axis=1)
Out[13]:
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
##和pd.concat([s1,s2,s3],axis=1)的区别在于改变了自动填充的列名
######对于DataFrame类型来说类似
df1 = DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
Out[16]:
one two
a 0 1
b 2 3
c 4 5
df2 = DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
three four
a 5 6
c 7 8
pd.concat([df1,df2],axis=1,keys=['level1','level2'])
Out[19]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
####如果传入字典,则字典的键就被视为keys值
pd.concat({'level1':df1,'level2':df2},axis=1)
Out[20]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
###对于无意义行索引的处理,ignore_index
df3 = DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
Out[22]:
a b c d
0 0.653379 1.585120 1.030270 -0.210029
1 -0.515544 0.816960 -0.466465 -0.970533
2 2.899299 0.980801 0.392106 0.411685
df4 = DataFrame(np.random.randn(2,3),columns=['b','d','a'])
Out[24]:
b d a
0 -0.647891 -0.516086 -0.770869
1 0.452534 0.655894 1.819712
pd.concat([df3,df4])
Out[25]:
a b c d
0 0.653379 1.585120 1.030270 -0.210029
1 -0.515544 0.816960 -0.466465 -0.970533
2 2.899299 0.980801 0.392106 0.411685
0 -0.770869 -0.647891 NaN -0.516086
1 1.819712 0.452534 NaN 0.655894
#####注意对索引号的处理,没有ignore_index=True时,索引号是两个DataFrame的叠加,加上之后索引正常
pd.concat([df3,df4],ignore_index=True)
Out[26]:
a b c d
0 0.653379 1.585120 1.030270 -0.210029
1 -0.515544 0.816960 -0.466465 -0.970533
2 2.899299 0.980801 0.392106 0.411685
3 -0.770869 -0.647891 NaN -0.516086
4 1.819712 0.452534 NaN 0.655894
concat函数参数表
参数 | 说明 |
---|---|
obj | 参与连接的pandas对象的列表或字典,必须参数 |
axis | 指明连接的轴 |
join | ‘inner’,’outer’,默认是’outer’ |
join_axes | 指明索引,不进行交、并集运算 |
keys | 与连接对象有关的值,可以形成层次化索引,可以是列表、数组、远足数组、数组列表 |
levels | 指定层次化索引各级别上的索引,如果设置keys的话,即在keys上一层索引 |
names | 分层级别的名称,如果keys和levels设置 |
verify_integrity | 检查结果对象新轴上的重复情况,如果发现则引发异常,默认为False(允许重复) |
ignore_index | 不保留连接轴上的索引,产生一组新的索引range(total_length) |