8 Data Wrangling: Join, Combine, and Reshape - Hierarchical Indexing and Merging
1. Hierarchical Indexing
Basic structure
- A Series with a multi-level index, and partial indexing with it
data=pd.Series(np.random.randn(9),index=[['a','a','a','b','b','c','c','d','d'],[1,2,3,1,3,1,2,2,3]])
data
Out[11]:
a 1 1.330320
2 0.168819
3 -0.778886
b 1 -0.684871
3 1.255565
c 1 -0.064847
2 1.329889
d 2 -1.139215
3 0.247478
dtype: float64
data.index
Out[12]:
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
data['b']
Out[13]:
1 -0.684871
3 1.255565
dtype: float64
data['b':'c']
Out[14]:
b 1 -0.684871
3 1.255565
c 1 -0.064847
2 1.329889
dtype: float64
data[['b','d']]
Out[15]:
b 1 -0.684871
3 1.255565
d 2 -1.139215
3 0.247478
dtype: float64
data.loc[:,2]
Out[16]:
a 0.168819
c 1.329889
d -1.139215
dtype: float64
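The inner-level selection above can also be written with Series.xs; a minimal sketch (the data is random, so the values differ from the session above):

```python
import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# data.loc[:, 2] and data.xs(2, level=1) both take the cross-section
# of rows whose inner label is 2, keeping the outer labels as the index
print(data.xs(2, level=1).equals(data.loc[:, 2]))  # prints True
```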
- The unstack() and stack() methods
data.unstack()
Out[17]:
1 2 3
a 1.330320 0.168819 -0.778886
b -0.684871 NaN 1.255565
c -0.064847 1.329889 NaN
d NaN -1.139215 0.247478
data.unstack().stack()
Out[18]:
a 1 1.330320
2 0.168819
3 -0.778886
b 1 -0.684871
3 1.255565
c 1 -0.064847
2 1.329889
d 2 -1.139215
3 0.247478
dtype: float64
type(data.unstack())
Out[19]: pandas.core.frame.DataFrame
- A DataFrame with a multi-level index
frame=pd.DataFrame(np.arange(12).reshape(4,3),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
frame
Out[22]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
Out[27]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame['Ohio']
Out[28]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
- Creating a MultiIndex: pd.MultiIndex.from_arrays
pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names=['state','color'])
Out[30]:
MultiIndex([( 'Ohio', 'Green'),
( 'Ohio', 'Red'),
('Colorado', 'Green')],
names=['state', 'color'])
1.1 Reordering and Sorting Levels
- swaplevel: takes two level numbers or level names and swaps those levels
- sort_index(): sorts the data using the values in a single level
frame
Out[31]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.swaplevel('key1','key2')
Out[32]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
frame.sort_index(level=1)
Out[33]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
frame.swaplevel(0,1).sort_index(level=0)
Out[34]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
1.2 Summary Statistics by Level
Many descriptive and summary statistics on Series and DataFrame accept a level option, which lets you specify the level you want to aggregate by on a particular axis.
frame.sum(level='key2')
Out[35]:
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
frame.sum(level='color',axis=1)
Out[36]:
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
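Note that the level argument of sum used above was deprecated in pandas 1.3 and removed in 2.0; the surviving spelling is a groupby on the index level. A sketch reproducing both sums:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# frame.sum(level='key2') -> group the rows by the 'key2' index level
print(frame.groupby(level='key2').sum())

# frame.sum(level='color', axis=1) -> group the columns by 'color';
# transposing first avoids the deprecated axis=1 groupby
print(frame.T.groupby(level='color').sum().T)
```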
1.3 Indexing with a DataFrame's Columns
In plain terms, this section covers two operations: promoting one or more of a DataFrame's columns to be the index, and demoting the index back into the columns.
- set_index creates a new DataFrame using one or more of its columns as the index. Passing drop=False keeps the indexed columns in the DataFrame as well.
- reset_index is the inverse of set_index: the levels of a hierarchical index are moved into the columns.
frame=pd.DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})
frame
Out[38]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
frame.set_index(['c','d'],drop=False)
Out[40]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
frame2=frame.set_index(['c','d'])
frame2
Out[42]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
frame2.reset_index()
Out[43]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
2. Combining and Merging Datasets: merge / join / concat
pd.merge connects rows based on one or more keys, i.e., on the values of certain columns.
pd.concat stacks objects together along an axis.
pd.combine_first fills missing values in one object with values from another.
2.1 Database-Style merge
merge parameters
Parameter | Description |
---|---|
left | DataFrame on the left side of the merge |
right | DataFrame on the right side of the merge |
how | Join type: 'inner' (default) intersection; 'outer' union; 'left' keep all rows of the left table; 'right' keep all rows of the right table |
on | Column name(s) to join on; must be present in both tables |
left_on | Column(s) in the left table to use as join key(s) |
right_on | Column(s) in the right table to use as join key(s) |
left_index | Use the left table's row index as its join key |
right_index | Use the right table's row index as its join key |
sort | Sort the result lexicographically by the join keys; defaults to False in current pandas |
suffixes | Suffixes appended to overlapping column names that are not join keys; defaults to ('_x', '_y') |
copy | If False, avoid copying data into the resulting data structure in some cases; copies by default |
indicator | Adds a special _merge column indicating each row's source: 'left_only', 'right_only', or 'both' |
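Among the parameters above, indicator is the only one not exercised in the examples that follow, so here is a minimal sketch of it:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# indicator=True appends a _merge column recording each row's origin
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(merged)
```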
Example: the on parameter
df1=pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
df1
Out[46]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2
Out[47]:
key data2
0 a 0
1 b 1
2 d 2
pd.merge(df1,df2)
Out[48]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
pd.merge(df1,df2,on='key')
Out[49]:
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
Example: the left_on / right_on parameters
df3=pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})
df4=pd.DataFrame({'rkey':['a','b','d'],'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')
Out[54]:
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
Example: the how parameter
df1
Out[57]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2
Out[58]:
key data2
0 a 0
1 b 1
2 d 2
pd.merge(df1,df2,how='outer')
Out[59]:
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
df1=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
df2=pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
df1
Out[62]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
df2
Out[63]:
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
pd.merge(df1,df2,on='key',how='outer')
Out[64]:
key data1 data2
0 b 0.0 1.0
1 b 0.0 3.0
2 b 1.0 1.0
3 b 1.0 3.0
4 b 5.0 1.0
5 b 5.0 3.0
6 a 2.0 0.0
7 a 2.0 2.0
8 a 4.0 0.0
9 a 4.0 2.0
10 c 3.0 NaN
11 d NaN 4.0
pd.merge(df1,df2,on='key',how='inner')
Out[65]:
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2
Example: merging on multiple keys
left=pd.DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one'],'lval':[1,2,3]})
right=pd.DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one','one','two'],'lval':[4,5,6,7]})
left
Out[68]:
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
right
Out[69]:
key1 key2 lval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
pd.merge(left,right,on=['key1','key2'],how='outer')
Out[70]:
key1 key2 lval_x lval_y
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
Example: the suffixes parameter
left
Out[72]:
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
right
Out[73]:
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
pd.merge(left,right,on='key1')
Out[74]:
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
pd.merge(left,right,on='key1',suffixes=('_left','_right'))
Out[75]:
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
2.2 Merging on Index: merge & join
Method 1: use merge's left_index / right_index arguments to merge two DataFrames on an index. left_index=True or right_index=True indicates that the index on that side should be used as the merge key.
Example: merge's left_index / right_index parameters
left1=pd.DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
right1=pd.DataFrame({'group_val':[3.5,6]},index=['a','b'])
left1
Out[81]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
right1
Out[82]:
group_val
a 3.5
b 6.0
pd.merge(left1,right1,left_on='key',right_index=True)
Out[83]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 6.0
4 b 4 6.0
pd.merge(left1,right1,left_on='key',right_index=True,how='outer')
Out[84]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 6.0
4 b 4 6.0
5 c 5 NaN
Example: merge's left_index / right_index parameters with a hierarchical index
lefth=pd.DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada','Nevada'],'key2':[2000,2001,2002,2001,2002],'data':np.arange(5.)})
righth=pd.DataFrame(np.arange(12).reshape(6,2),index=[['Nevada','Nevada','Ohio','Ohio','Ohio','Ohio'],[2001,2000,2000,2000,2001,2002]],columns=['event1','event2'])
lefth
Out[88]:
key1 key2 data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
righth
Out[89]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)
Out[90]:
key1 key2 data event1 event2
0 Ohio 2000 0.0 4 5
0 Ohio 2000 0.0 6 7
1 Ohio 2001 1.0 8 9
2 Ohio 2002 2.0 10 11
3 Nevada 2001 3.0 0 1
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
Out[91]:
key1 key2 data event1 event2
0 Ohio 2000 0.0 4.0 5.0
0 Ohio 2000 0.0 6.0 7.0
1 Ohio 2001 1.0 8.0 9.0
2 Ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 0.0 1.0
4 Nevada 2002 4.0 NaN NaN
4 Nevada 2000 NaN 2.0 3.0
Example: merge using the indexes on both sides
left=pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]],index=['a','c','e'],columns=['Ohio','Nevada'])
right=pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],index=['d','c','d','e'],columns=['Missouri','Alabama'])
left
Out[95]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
right
Out[97]:
Missouri Alabama
d 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
pd.merge(left,right,left_index=True,right_index=True,how='outer')
Out[98]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
c 3.0 4.0 9.0 10.0
d NaN NaN 7.0 8.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
Method 2: use join to merge two DataFrames on their indexes.
By default the merge is performed on the index.
left
Out[5]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
right
Out[6]:
Missouri Alabama
d 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
left.join(right,how='outer')
Out[7]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
c 3.0 4.0 9.0 10.0
d NaN NaN 7.0 8.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
join also supports joining on a column:
left1
Out[9]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
right1
Out[10]:
group_val
a 3.5
b 6.0
left1.join(right1,on='key')
Out[11]:
key value group_val
0 a 0 3.5
1 b 1 6.0
2 a 2 3.5
3 a 3 3.5
4 b 4 6.0
5 c 5 NaN
For simple index-on-index merges, join can also combine more than two DataFrames by passing a list:
left
Out[12]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
right
Out[13]:
Missouri Alabama
d 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
another=pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[16.,17.]],index=['a','c','e','f'],columns=['New York','Oregon'])
another
Out[15]:
New York Oregon
a 7.0 8.0
c 9.0 10.0
e 11.0 12.0
f 16.0 17.0
left.join([right,another])
Out[17]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
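The list form above kept only left's rows ('d' and 'f' are missing) because join defaults to how='left'; passing how='outer' keeps the union of all the indexes. A sketch:

```python
import pandas as pd

left = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                    index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                     index=['d', 'c', 'd', 'e'],
                     columns=['Missouri', 'Alabama'])
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])

# join defaults to how='left', which is why rows 'd' and 'f' were
# dropped above; how='outer' keeps the union of all three indexes
res = left.join([right, another], how='outer')
print(res)
```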
2.3 Concatenating Along an Axis: concat
1. For NumPy arrays: concatenate
arr=np.arange(12).reshape(3,4)
arr
Out[20]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.concatenate([arr,arr],axis=1)
Out[22]:
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
np.concatenate([arr,arr])
Out[23]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
2. For Series or DataFrame: concat()
concat parameters
Parameter | Description |
---|---|
objs | List or dict of pandas objects to concatenate |
axis | Axis to concatenate along |
join | 'inner' or 'outer' |
join_axes | Specific indexes to use for the other axes instead of inner/outer join logic; this parameter has been removed in recent pandas versions, and reindex can be used to achieve the same result |
keys | Values to associate with the objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multi-level arrays are passed to the levels argument) |
levels | Specific levels to use for the hierarchical index if keys are passed |
names | Names for the levels of the hierarchical index if keys and/or levels are passed |
verify_integrity | Check the new axis of the concatenated object for duplicates and raise an exception if any are found; defaults to False, which allows duplicates |
ignore_index | Do not preserve the indexes along the concatenation axis; instead produce a new range index of length total_length |
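Two of these parameters deserve a quick sketch: join_axes is gone in recent pandas, but reindex reproduces it, and verify_integrity turns silently duplicated labels into an error.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s4 = pd.Series([0, 1, 5, 6], index=['a', 'b', 'f', 'g'])

# Emulating the removed join_axes argument: concatenate with the
# default outer join, then restrict the result to the desired index
result = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(result)

# verify_integrity=True raises instead of allowing duplicate labels
# along the concatenation axis
try:
    pd.concat([s1, s1], verify_integrity=True)
except ValueError as exc:
    print('duplicates detected:', exc)
```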
# For Series
s1=pd.Series([0,1],index=['a','b'])
s2=pd.Series([2,3,4],index=['c','d','e'])
s3=pd.Series([5,6],index=['f','g'])
pd.concat([s1,s2,s3])
Out[7]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
# Concatenating along axis=1
pd.concat([s1,s2,s3],axis=1)
Out[8]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
s4=pd.concat([s1,s3])
s4
Out[10]:
a 0
b 1
f 5
g 6
dtype: int64
pd.concat([s1,s4],axis=1)
Out[11]:
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
# Using the join parameter
pd.concat([s1,s4],axis=1,join='inner')
Out[12]:
0 1
a 0 0
b 1 1
# Using the keys parameter
result=pd.concat([s1,s2,s3],keys=['one','two','three'])
result
Out[15]:
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
result.unstack()
Out[16]:
a b c d e f g
one 0.0 1.0 NaN NaN NaN NaN NaN
two NaN NaN 2.0 3.0 4.0 NaN NaN
three NaN NaN NaN NaN NaN 5.0 6.0
pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'])
Out[17]:
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
# For DataFrames
df1=pd.DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
df2=pd.DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
df1
Out[20]:
one two
a 0 1
b 2 3
c 4 5
df2
Out[21]:
three four
a 5 6
c 7 8
pd.concat([df1,df2],axis=1,keys=['level1','level2'])
Out[22]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
pd.concat({'level1':df1,'level2':df2},axis=1)
Out[25]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
pd.concat([df1,df2],axis=1,keys=['level1','level2'],names=['upper','lower'])
Out[26]:
upper level1 level2
lower one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
df1=pd.DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
df2=pd.DataFrame(np.random.randn(2,3),columns=['b','d','e'])
df1
Out[29]:
a b c d
0 1.267888 1.449852 1.116767 -0.598256
1 0.468823 0.159083 -0.834131 -0.138603
2 1.068021 -0.428810 -0.068083 -1.146126
df2
Out[30]:
b d e
0 -0.702756 0.544601 -0.499210
1 0.685556 -0.924991 -0.863705
pd.concat([df1,df2])
Out[31]:
a b c d e
0 1.267888 1.449852 1.116767 -0.598256 NaN
1 0.468823 0.159083 -0.834131 -0.138603 NaN
2 1.068021 -0.428810 -0.068083 -1.146126 NaN
0 NaN -0.702756 NaN 0.544601 -0.499210
1 NaN 0.685556 NaN -0.924991 -0.863705
# Produce a fresh integer index instead
pd.concat([df1,df2],ignore_index=True)
Out[32]:
a b c d e
0 1.267888 1.449852 1.116767 -0.598256 NaN
1 0.468823 0.159083 -0.834131 -0.138603 NaN
2 1.068021 -0.428810 -0.068083 -1.146126 NaN
3 NaN -0.702756 NaN 0.544601 -0.499210
4 NaN 0.685556 NaN -0.924991 -0.863705
2.4 Combining Overlapping Data: combine_first
b.combine_first(a)
Combining overlapping data is like laying two tables on top of each other: wherever one table has a missing value, the value at the corresponding position in the other table shows through.
If both tables have a value in the same cell, the value from the calling object (b) wins.
Note: the alignment happens by index.
a=pd.Series([np.nan,2.5,0.0,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b=pd.Series([0.,np.nan,2.,np.nan,np.nan,5.],index=['a','b','c','d','e','f'])
a
Out[36]:
f NaN
e 2.5
d 0.0
c 3.5
b 4.5
a NaN
dtype: float64
b
Out[37]:
a 0.0
b NaN
c 2.0
d NaN
e NaN
f 5.0
dtype: float64
np.where(pd.isnull(a),b,a)
Out[38]: array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
# aligned by index
b.combine_first(a)
Out[39]:
a 0.0
b 4.5
c 2.0
d 0.0
e 2.5
f 5.0
dtype: float64
df1=pd.DataFrame({'a':[1.,np.nan,5.,np.nan],
'b':[np.nan,2.,np.nan,6.],
'c':range(2,18,4)})
df2=pd.DataFrame({'a':[5.,4.,np.nan,3.,7.],'b':[np.nan,3.,4.,6.,8.]})
df1
Out[43]:
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2
Out[44]:
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2)
Out[45]:
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
3 Reshaping and Pivoting
Rearranging tabular data
3.1 Reshaping with Hierarchical Indexing: stack() / unstack()
This is just a change in the shape of the data, i.e., stacking and unstacking;
the data itself does not change.
data=pd.DataFrame(np.arange(6).reshape(2,3),index=pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))
data
Out[47]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
result=data.stack()
result
Out[49]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
result.index
Out[51]:
MultiIndex([( 'Ohio', 'one'),
( 'Ohio', 'two'),
( 'Ohio', 'three'),
('Colorado', 'one'),
('Colorado', 'two'),
('Colorado', 'three')],
names=['state', 'number'])
result.unstack()
Out[52]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
result.unstack(0)
Out[53]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
result.unstack('state')
Out[54]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
s1=pd.Series([0,1,2,3],index=['a','b','c','d'])
s2=pd.Series([4,5,6],index=['c','d','e'])
data2=pd.concat([s1,s2],keys=['one','two'])
data2
Out[58]:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data2.unstack()
Out[59]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
data2.unstack().stack()
Out[60]:
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
# Do not drop missing values when stacking
data2.unstack().stack(dropna=False)
Out[62]:
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
result
Out[65]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
df=pd.DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
df
Out[67]:
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
df.unstack('state')
Out[68]:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
df.unstack('state').stack('side')
Out[70]:
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7
3.2 Pivoting "Long" to "Wide": pivot()
First, understand what "pivoting long to wide" means here:
A "long" table is a DataFrame shaped like this:
col1_id | col2_category | col3_value |
---|---|---|
01 | A | 1 |
01 | B | 2 |
02 | A | 3 |
02 | B | 4 |
Note col2: every entry in that column is a category label.
A "wide" table pulls out all the categories of that column and makes them the column index, like this:
col1_id | A | B |
---|---|---|
01 | 1 | 2 |
02 | 3 | 4 |
This lets you focus quickly on a single metric (one column), which is much clearer.
It can be done with pd.pivot().
The index parameter specifies the column whose categories become the row index, the columns parameter specifies the column whose categories become the column index, and values specifies the column that supplies the values.
pd.pivot?
Signature:
pd.pivot(
data: 'DataFrame',
index: Union[Hashable, NoneType, Sequence[Union[Hashable, NoneType]]] = None,
columns: Union[Hashable, NoneType, Sequence[Union[Hashable, NoneType]]] = None,
values: Union[Hashable, NoneType, Sequence[Union[Hashable, NoneType]]] = None,
) -> 'DataFrame'
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
'two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df
foo bar baz zoo
0 one A 1 x
1 one B 2 y
2 one C 3 z
3 two A 4 q
4 two B 5 w
5 two C 6 t
df.pivot(index='foo', columns='bar', values='baz')
bar A B C
foo
one 1 2 3
two 4 5 6
df.pivot(index='foo', columns='bar')['baz']
bar A B C
foo
one 1 2 3
two 4 5 6
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
baz zoo
bar A B C A B C
foo
one 1 2 3 x y z
two 4 5 6 q w t
You could also assign a list of column names or a list of index names.
df = pd.DataFrame({
"lev1": [1, 1, 1, 2, 2, 2],
"lev2": [1, 1, 2, 1, 1, 2],
"lev3": [1, 2, 1, 2, 1, 2],
"lev4": [1, 2, 3, 4, 5, 6],
"values": [0, 1, 2, 3, 4, 5]})
df
lev1 lev2 lev3 lev4 values
0 1 1 1 1 0
1 1 1 2 2 1
2 1 2 1 3 2
3 2 1 2 4 3
4 2 1 1 5 4
5 2 2 2 6 5
df.pivot(index="lev1", columns=["lev2", "lev3"],values="values")
lev2 1 2
lev3 1 2 1 2
lev1
1 0.0 1.0 2.0 NaN
2 4.0 3.0 NaN 5.0
df.pivot(index=["lev1", "lev2"], columns=["lev3"],values="values")
lev3 1 2
lev1 lev2
1 1 0.0 1.0
2 2.0 NaN
2 1 4.0 3.0
2 NaN 5.0
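One caveat worth adding: pivot raises a ValueError when an (index, columns) pair occurs more than once, because there is no single cell to put both values in; pivot_table handles duplicates by aggregating them (mean by default). A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'two'],
                   'bar': ['A', 'A', 'B'],
                   'baz': [1, 2, 3]})

# ('one', 'A') appears twice, so pivot cannot place both values
try:
    df.pivot(index='foo', columns='bar', values='baz')
except ValueError as exc:
    print('pivot failed:', exc)

# pivot_table aggregates the duplicates instead (mean by default)
print(df.pivot_table(index='foo', columns='bar', values='baz'))
```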
3.3 Pivoting "Wide" to "Long": melt()
This is the inverse of pivoting "long" to "wide", implemented with pd.melt().
id_vars: the column(s) used as identifier variables
value_vars: the column(s) to unpivot into long format
Parameters
----------
id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables.
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that
are not set as `id_vars`.
var_name : scalar
Name to use for the 'variable' column. If None it uses
``frame.columns.name`` or 'variable'.
value_name : scalar, default 'value'
Name to use for the 'value' column.
col_level : int or str, optional
If columns are a MultiIndex then use this level to melt.
ignore_index : bool, default True
If True, original index is ignored. If False, the original index is retained.
Index labels will be repeated as necessary.
.. versionadded:: 1.1.0
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
df
A B C
0 a 1 2
1 b 3 4
2 c 5 6
pd.melt(df, id_vars=['A'], value_vars=['B'])
A variable value
0 a B 1
1 b B 3
2 c B 5
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
The names of 'variable' and 'value' columns can be customized:
pd.melt(df, id_vars=['A'], value_vars=['B'],
var_name='myVarname', value_name='myValname')
A myVarname myValname
0 a B 1
1 b B 3
2 c B 5
Original index values can be kept around:
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'], ignore_index=False)
A variable value
0 a B 1
1 b B 3
2 c B 5
0 a C 2
1 b C 4
2 c C 6
If you have multi-index columns:
df.columns = [list('ABC'), list('DEF')]
df
A B C
D E F
0 a 1 2
1 b 3 4
2 c 5 6
pd.melt(df, col_level=0, id_vars=['A'], value_vars=['B'])
A variable value
0 a B 1
1 b B 3
2 c B 5
pd.melt(df, id_vars=[('A', 'D')], value_vars=[('B', 'E')])
(A, D) variable_0 variable_1 value
0 a B E 1
1 b B E 3
2 c B E 5
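Since melt and pivot are inverses, a melted frame can be pivoted back to wide form. A sketch using the simple df from the start of this subsection (before its columns were replaced with a MultiIndex):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'],
                   'B': [1, 3, 5],
                   'C': [2, 4, 6]})

melted = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])

# pivot back to wide: 'A' becomes the index, 'variable' the columns
wide = melted.pivot(index='A', columns='variable', values='value')
wide = wide.reset_index().rename_axis(columns=None)
print(wide)
```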