啃书 《利用python进行数据分析》 第八章数据规整:多级索引、数据集合并merge\concat\join\combine_first、透视stack\unstack\pivot\melt

8 数据规整:连接、联合与重塑-多重索引、数据合并

1.分层索引


基础结构

  • Series的多层索引以及利用多层索引进行部分索引
data=pd.Series(np.random.randn(9),index=[['a','a','a','b','b','c','c','d','d'],[1,2,3,1,3,1,2,2,3]])

data
Out[11]: 
a  1    1.330320
   2    0.168819
   3   -0.778886
b  1   -0.684871
   3    1.255565
c  1   -0.064847
   2    1.329889
d  2   -1.139215
   3    0.247478
dtype: float64

data.index
Out[12]: 
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

data['b']
Out[13]: 
1   -0.684871
3    1.255565
dtype: float64

data['b':'c']
Out[14]: 
b  1   -0.684871
   3    1.255565
c  1   -0.064847
   2    1.329889
dtype: float64

data[['b','d']]
Out[15]: 
b  1   -0.684871
   3    1.255565
d  2   -1.139215
   3    0.247478
dtype: float64

data.loc[:,2]
Out[16]: 
a    0.168819
c    1.329889
d   -1.139215
dtype: float64
  • unstack()方法与stack()方法
data.unstack()
Out[17]: 
          1         2         3
a  1.330320  0.168819 -0.778886
b -0.684871       NaN  1.255565
c -0.064847  1.329889       NaN
d       NaN -1.139215  0.247478

data.unstack().stack()
Out[18]: 
a  1    1.330320
   2    0.168819
   3   -0.778886
b  1   -0.684871
   3    1.255565
c  1   -0.064847
   2    1.329889
d  2   -1.139215
   3    0.247478
dtype: float64

type(data.unstack())
Out[19]: pandas.core.frame.DataFrame
  • Dataframe的多层索引
frame=pd.DataFrame(np.arange(12).reshape(4,3),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])

frame
Out[22]: 
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

frame.index.names=['key1','key2']


frame.columns.names=['state','color']

frame
Out[27]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame['Ohio']
Out[28]: 
color      Green  Red
key1 key2            
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
  • 创建一个多层索引:pd.MultiIndex.from_arrays

pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names=['state','color'])
Out[30]: 
MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])

1.1重排序和层级排序

  • swaplevel方法:接收两个层级序号或层级名称,进行层级变更
  • sort_index()方法:在单一层级上对数据进行排序
frame
Out[31]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame.swaplevel('key1','key2')
Out[32]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1                   
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

frame.sort_index(level=1)
Out[33]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

frame.swaplevel(0,1).sort_index(level=0)
Out[34]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1                   
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11

1.2按层级进行汇总统计

很多Series和Dataframe的描述性和汇总性统计有一个level选项,通过level选项可以指定想要在某个特定的轴上进行聚合。

frame.sum(level='key2')
Out[35]: 
state  Ohio     Colorado
color Green Red    Green
key2                    
1         6   8       10
2        12  14       16

frame.sum(level='color',axis=1)
Out[36]: 
color      Green  Red
key1 key2            
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10

1.3使用DataFrame的列进行索引

这部分内容用大白话说就是:将dataframe的某一列设为索引、将dataframe的某一列取消索引

  • set_index函数设置dataframe的一列或多列为索引。参数drop=False,可以将索引列仍然保留在列中
  • reset_index就是set_index的反操作,分层索引的索引层级会被移动到列中
frame=pd.DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})

frame
Out[38]: 
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3



frame.set_index(['c','d'],drop=False)
Out[40]: 
       a  b    c  d
c   d              
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3

frame2=frame.set_index(['c','d'])

frame2
Out[42]: 
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

frame2.reset_index()
Out[43]: 
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1

2.联合与合并数据集merge\join\concat

pd.merge根据一个键或多个键进行连接。依据是某(些)列

pd.concat使对象在轴上进行粘合。依据是

pd.combine_first:用一个对象中的值填充另一个对象的缺失值

2.1数据库风格merge

merge参数

参数描述
left合并操作时左边的DataFrame
right合并操作时右边的DataFrame
how连接类型’inner’(默认) 交集 ‘outer’:并集 ‘left’:左连接,包含所有左表内容 ‘right’:右连接,包含所有右表内容
on需要连接的列名。必须在左右表中都有
left_on左表中起连接键作用的列
right_on右表中起连接键作用的列
left_index使用左表的行索引作为它的连接键
right_index使用右表的行索引作为它的连接键
sort通过连接键按字母顺序进行排序,默认为True
suffixes左右两表中有相同的列名,且不作为连接键时,用后缀加以分别。默认是(’ _x’,’ _y’)
copy如果是False,则在某些情况下避免将数据复制到结果数据结构中。默认情况下总是复制
indicator添加一个特殊的列_merge,指示每一行的来源。‘left_only’ , ‘right_only’ , ‘both’

on 参数 使用示例

df1=pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})

df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})

df1
Out[46]: 
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6

df2
Out[47]: 
  key  data2
0   a      0
1   b      1
2   d      2

pd.merge(df1,df2)
Out[48]: 
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0

pd.merge(df1,df2,on='key')
Out[49]: 
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0

left_on \ right_on 参数 使用示例

df3=pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})

df4=pd.DataFrame({'rkey':['a','b','d'],'data2':range(3)})

pd.merge(df3,df4,left_on='lkey',right_on='rkey')
Out[54]: 
  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0

how 参数 使用示例

df1
Out[57]: 
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6

df2
Out[58]: 
  key  data2
0   a      0
1   b      1
2   d      2

pd.merge(df1,df2,how='outer')
Out[59]: 
  key  data1  data2
0   b    0.0    1.0
1   b    1.0    1.0
2   b    6.0    1.0
3   a    2.0    0.0
4   a    4.0    0.0
5   a    5.0    0.0
6   c    3.0    NaN
7   d    NaN    2.0

df1=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})

df2=pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})

df1
Out[62]: 
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

df2
Out[63]: 
  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4

pd.merge(df1,df2,on='key',how='outer')
Out[64]: 
   key  data1  data2
0    b    0.0    1.0
1    b    0.0    3.0
2    b    1.0    1.0
3    b    1.0    3.0
4    b    5.0    1.0
5    b    5.0    3.0
6    a    2.0    0.0
7    a    2.0    2.0
8    a    4.0    0.0
9    a    4.0    2.0
10   c    3.0    NaN
11   d    NaN    4.0

pd.merge(df1,df2,on='key',how='inner')
Out[65]: 
  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   b      5      1
5   b      5      3
6   a      2      0
7   a      2      2
8   a      4      0
9   a      4      2

多键合并 示例

left=pd.DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one'],'lval':[1,2,3]})

right=pd.DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one','one','two'],'lval':[4,5,6,7]})

left
Out[68]: 
  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3

right
Out[69]: 
  key1 key2  lval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7

pd.merge(left,right,on=['key1','key2'],how='outer')
Out[70]: 
  key1 key2  lval_x  lval_y
0  foo  one     1.0     4.0
1  foo  one     1.0     5.0
2  foo  two     2.0     NaN
3  bar  one     3.0     6.0
4  bar  two     NaN     7.0

suffixes参数 使用示例


left
Out[72]: 
  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3

right
Out[73]: 
  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7

pd.merge(left,right,on='key1')
Out[74]: 
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

pd.merge(left,right,on='key1',suffixes=('_left','_right'))
Out[75]: 
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

2.2根据索引合并 merge&join

方法1.利用merge的left_index\right_index来根据索引合并两个dataframe。left_index=True或right_index=True表示索引需要用来作为合并的键

merge的left_index\right_index参数示例

left1=pd.DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})

right1=pd.DataFrame({'group_val':[3.5,6]},index=['a','b'])

left1
Out[81]: 
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

right1
Out[82]: 
   group_val
a        3.5
b        6.0

pd.merge(left1,right1,left_on='key',right_index=True)
Out[83]: 
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        6.0
4   b      4        6.0

pd.merge(left1,right1,left_on='key',right_index=True,how='outer')
Out[84]: 
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        6.0
4   b      4        6.0
5   c      5        NaN

merge的left_index\right_index参数(多层索引情况)示例

lefth=pd.DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada','Nevada'],'key2':[2000,2001,2002,2001,2002],'data':np.arange(5.)})

righth=pd.DataFrame(np.arange(12).reshape(6,2),index=[['Nevada','Nevada','Ohio','Ohio','Ohio','Ohio'],[2001,2000,2000,2000,2001,2002]],columns=['event1','event2'])

lefth
Out[88]: 
     key1  key2  data
0    Ohio  2000   0.0
1    Ohio  2001   1.0
2    Ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0

righth
Out[89]: 
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11

pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)
Out[90]: 
     key1  key2  data  event1  event2
0    Ohio  2000   0.0       4       5
0    Ohio  2000   0.0       6       7
1    Ohio  2001   1.0       8       9
2    Ohio  2002   2.0      10      11
3  Nevada  2001   3.0       0       1

pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
Out[91]: 
     key1  key2  data  event1  event2
0    Ohio  2000   0.0     4.0     5.0
0    Ohio  2000   0.0     6.0     7.0
1    Ohio  2001   1.0     8.0     9.0
2    Ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     0.0     1.0
4  Nevada  2002   4.0     NaN     NaN
4  Nevada  2000   NaN     2.0     3.0

merge使用两边的索引合并示例


left=pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]],index=['a','c','e'],columns=['Ohio','Nevada'])

right=pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],index=['d','c','d','e'],columns=['Missouri','Alabama'])

left
Out[95]: 
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0

right
Out[97]: 
   Missouri  Alabama
d       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0

pd.merge(left,right,left_index=True,right_index=True,how='outer')
Out[98]: 
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
c   3.0     4.0       9.0     10.0
d   NaN     NaN       7.0      8.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0

方法2.利用join来根据索引合并两个dataframe

默认是根据索引来合并的

left
Out[5]: 
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0

right
Out[6]: 
   Missouri  Alabama
d       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0

left.join(right,how='outer')
Out[7]: 
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
c   3.0     4.0       9.0     10.0
d   NaN     NaN       7.0      8.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0

也支持根据列来合并啦


left1
Out[9]: 
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

right1
Out[10]: 
   group_val
a        3.5
b        6.0

left1.join(right1,on='key')
Out[11]: 
  key  value  group_val
0   a      0        3.5
1   b      1        6.0
2   a      2        3.5
3   a      3        3.5
4   b      4        6.0
5   c      5        NaN

对于简单的索引合并,可以合并多余2个的数据框

left
Out[12]: 
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0

right
Out[13]: 
   Missouri  Alabama
d       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0

another=pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[16.,17.]],index=['a','c','e','f'],columns=['New York','Oregon'])

another
Out[15]: 
   New York  Oregon
a       7.0     8.0
c       9.0    10.0
e      11.0    12.0
f      16.0    17.0

left.join([right,another])
Out[17]: 
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
c   3.0     4.0       9.0     10.0       9.0    10.0
e   5.0     6.0      13.0     14.0      11.0    12.0

2.3沿轴向连接concat

1.对Numpy数组 concatenate

arr=np.arange(12).reshape(3,4)

arr
Out[20]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

np.concatenate([arr,arr],axis=1)
Out[22]: 
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

np.concatenate([arr,arr])
Out[23]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

2.对Series或DataFrame:concat()

concat函数的参数

参数描述
objs需要连接的pandas对象列表或字典
axis连接的轴向
join‘inner’或’outer’
join_axes用于指定其轴的特定索引,可以替代内/外连接;由于版本原因,这个参数已经无法使用了,可以用reindex函数执行他的任务
keys与要连接的对象关联的值,沿着连接轴形成分层索引;可以是任意值的列表或数组,也可以是元组的数组,也可以是数组的列表(如果向levels参数传入多层索引)
leels在键值传递时,该参数用于指定多层索引的层级
names如果传入了keys/levels参数,该参数用于多层索引的层级名称
verify_integrity检查连接对象的忻州是否有有重复,如果是,则引发异常;默认False,允许重复
ignore_index不沿着连接轴保留索引,而产生一段新的(长度为total_length)索引
#对Series
s1=pd.Series([0,1],index=['a','b'])

s2=pd.Series([2,3,4],index=['c','d','e'])

s3=pd.Series([5,6],index=['f','g'])

pd.concat([s1,s2,s3])
Out[7]: 
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

#在axis=1轴上连接
pd.concat([s1,s2,s3],axis=1)
Out[8]: 
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

s4=pd.concat([s1,s3])

s4
Out[10]: 
a    0
b    1
f    5
g    6
dtype: int64

pd.concat([s1,s4],axis=1)
Out[11]: 
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6


#join参数的使用
pd.concat([s1,s4],axis=1,join='inner')
Out[12]: 
   0  1
a  0  0
b  1  1

#keys参数的使用
result=pd.concat([s1,s2,s3],keys=['one','two','three'])

result
Out[15]: 
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

result.unstack()
Out[16]: 
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0

pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'])
Out[17]: 
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0

#对DataFrame
df1=pd.DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])

df2=pd.DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])

df1
Out[20]: 
   one  two
a    0    1
b    2    3
c    4    5

df2
Out[21]: 
   three  four
a      5     6
c      7     8

pd.concat([df1,df2],axis=1,keys=['level1','level2'])
Out[22]: 
  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0


pd.concat({'level1':df1,'levle2':df2},axis=1)
Out[25]: 
  level1     levle2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0

pd.concat([df1,df2],axis=1,keys=['level1','level2'],names=['upper','lower'])
Out[26]: 
upper level1     level2     
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0



df1=pd.DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])

df2=pd.DataFrame(np.random.randn(2,3),columns=['b','d','e'])

df1
Out[29]: 
          a         b         c         d
0  1.267888  1.449852  1.116767 -0.598256
1  0.468823  0.159083 -0.834131 -0.138603
2  1.068021 -0.428810 -0.068083 -1.146126

df2
Out[30]: 
          b         d         e
0 -0.702756  0.544601 -0.499210
1  0.685556 -0.924991 -0.863705

pd.concat([df1,df2])
Out[31]: 
          a         b         c         d         e
0  1.267888  1.449852  1.116767 -0.598256       NaN
1  0.468823  0.159083 -0.834131 -0.138603       NaN
2  1.068021 -0.428810 -0.068083 -1.146126       NaN
0       NaN -0.702756       NaN  0.544601 -0.499210
1       NaN  0.685556       NaN -0.924991 -0.863705


#对索引重排列
pd.concat([df1,df2],ignore_index=True)
Out[32]: 
          a         b         c         d         e
0  1.267888  1.449852  1.116767 -0.598256       NaN
1  0.468823  0.159083 -0.834131 -0.138603       NaN
2  1.068021 -0.428810 -0.068083 -1.146126       NaN
3       NaN -0.702756       NaN  0.544601 -0.499210
4       NaN  0.685556       NaN -0.924991 -0.863705

2.4联合重叠数据combine_first

b.combine_first(a)

所谓联合重叠,就好像是把两张表叠放在一起一样。如果某个表存在空值,空值的部分会被另一个表位置对应的数值呈现出来。

如果一个格子上的两张表都有数值,则用主动调用函数的表上的数值。(b)

注意:这个做法也是会和索引对应的

a=pd.Series([np.nan,2.5,0.0,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])

b=pd.Series([0.,np.nan,2.,np.nan,np.nan,5.],index=['a','b','c','d','e','f'])

a
Out[36]: 
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64

b
Out[37]: 
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64

np.where(pd.isnull(a),b,a)
Out[38]: array([0. , 2.5, 0. , 3.5, 4.5, 5. ])

    #索引对应
b.combine_first(a)
Out[39]: 
a    0.0
b    4.5
c    2.0
d    0.0
e    2.5
f    5.0
dtype: float64



df1=pd.DataFrame({'a':[1.,np.nan,5.,np.nan],
'b':[np.nan,2.,np.nan,6.],
'c':range(2,18,4)})


df2=pd.DataFrame({'a':[5.,4.,np.nan,3.,7.],'b':[np.nan,3.,4.,6.,8.]})

df1
Out[43]: 
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

df2
Out[44]: 
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

df1.combine_first(df2)
Out[45]: 
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN

3 重塑或透视

重排列表格型数据

3.1使用多层索引进行重塑stack()\unstack()

其实就是数据形态上的转变,即堆叠与展开

数据并没有发生变化。

data=pd.DataFrame(np.arange(6).reshape(2,3),index=pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))

data
Out[47]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

result=data.stack()


result
Out[49]: 
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32



result.index
Out[51]: 
MultiIndex([(    'Ohio',   'one'),
            (    'Ohio',   'two'),
            (    'Ohio', 'three'),
            ('Colorado',   'one'),
            ('Colorado',   'two'),
            ('Colorado', 'three')],
           names=['state', 'number'])

result.unstack()
Out[52]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

result.unstack(0)
Out[53]: 
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

result.unstack('state')
Out[54]: 
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5


s1=pd.Series([0,1,2,3],index=['a','b','c','d'])

s2=pd.Series([4,5,6],index=['c','d','e'])

data2=pd.concat([s1,s2],keys=['one','two'])

data2
Out[58]: 
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data2.unstack()
Out[59]: 
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

data2.unstack().stack()
Out[60]: 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64


#堆叠的时候不过滤缺失值
data2.unstack().stack(dropna=False)
Out[62]: 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64



result
Out[65]: 
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

df=pd.DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))

df
Out[67]: 
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

df.unstack('state')
Out[68]: 
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10



df.unstack('state').stack('side')
Out[70]: 
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7

3.2将“长”透视为“宽” pivot()

首先要理解这里“将长透视为宽”是个什么含义:

所谓"长":就是形如以下的Dataframe

列1_标记列2_类型变量列3_值
01A1
01B2
02A3
02B4

重点看这个列2,这一列都是类别变量。

所谓“宽”:就把类别变量的所有类别提出来,成为列索引,形如下面:

列1_标记AB
0112
0234

这样做可以快速聚焦于一个指标(一列),比较清晰。

可以用pd.pivot()来实现

参数index用来指定作为行索引的那一列(也就是某一列的所有类别作为行索引),参数columns用来指定作为列索引的那一列(也就是某一列的所有类别作为列索引),参数values表示对应的值

pd.pivot?
Signature:
pd.pivot(
    data: 'DataFrame',
    index: Union[Hashable, NoneType, Sequence[Union[Hashable, NoneType]]] = None,
    columns: Union[Hashable, NoneType, Sequence[Union[Hashable, NoneType]]] = None,
    values: Union[Hashable, NoneType, Sequence[Union[Hashable, NoneType]]] = None,
) -> 'DataFrame'


df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
                           'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df
    foo   bar  baz  zoo
0   one   A    1    x
1   one   B    2    y
2   one   C    3    z
3   two   A    4    q
4   two   B    5    w
5   two   C    6    t

df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6

df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6

df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
      baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t

You could also assign a list of column names or a list of index names.

df = pd.DataFrame({
       "lev1": [1, 1, 1, 2, 2, 2],
       "lev2": [1, 1, 2, 1, 1, 2],
       "lev3": [1, 2, 1, 2, 1, 2],
       "lev4": [1, 2, 3, 4, 5, 6],
       "values": [0, 1, 2, 3, 4, 5]})
df
    lev1 lev2 lev3 lev4 values
0   1    1    1    1    0
1   1    1    2    2    1
2   1    2    1    3    2
3   2    1    2    4    3
4   2    1    1    5    4
5   2    2    2    6    5

df.pivot(index="lev1", columns=["lev2", "lev3"],values="values")
lev2    1         2
lev3    1    2    1    2
lev1
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0

df.pivot(index=["lev1", "lev2"], columns=["lev3"],values="values")
      lev3    1    2
lev1  lev2
   1     1  0.0  1.0
         2  2.0  NaN
   2     1  4.0  3.0
         2  NaN  5.0


3.3将“宽”透视为“长” melt()

就是与“长”透视为“宽相反”,利用pd.melt()来实现

id_vars:用来做标识的变量

value_vars:透视为长的那(些)列

Parameters
----------
id_vars : tuple, list, or ndarray, optional
    Column(s) to use as identifier variables.
value_vars : tuple, list, or ndarray, optional
    Column(s) to unpivot. If not specified, uses all columns that
    are not set as `id_vars`.
var_name : scalar
    Name to use for the 'variable' column. If None it uses
    ``frame.columns.name`` or 'variable'.
value_name : scalar, default 'value'
    Name to use for the 'value' column.
col_level : int or str, optional
    If columns are a MultiIndex then use this level to melt.
ignore_index : bool, default True
    If True, original index is ignored. If False, the original index is retained.
    Index labels will be repeated as necessary.

    .. versionadded:: 1.1.0
            
            
            
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6

pd.melt(df, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5

pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6

The names of 'variable' and 'value' columns can be customized:

pd.melt(df, id_vars=['A'], value_vars=['B'],
        var_name='myVarname', value_name='myValname')
   A myVarname  myValname
0  a         B          1
1  b         B          3
2  c         B          5

Original index values can be kept around:

pd.melt(df, id_vars=['A'], value_vars=['B', 'C'], ignore_index=False)
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
0  a        C      2
1  b        C      4
2  c        C      6

If you have multi-index columns:

df.columns = [list('ABC'), list('DEF')]
df
   A  B  C
   D  E  F
0  a  1  2
1  b  3  4
2  c  5  6

pd.melt(df, col_level=0, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5

pd.melt(df, id_vars=[('A', 'D')], value_vars=[('B', 'E')])
  (A, D) variable_0 variable_1  value
0      a          B          E      1
1      b          B          E      3
2      c          B          E      5
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值