3.8 高级处理—合并
3.8.1 pd.concat()实现合并——按方向 3.8.2 pd.merge()实现合并——按索引
如果数据由多张表组成,那么有时需要将不同的内容合并在一起分析
3.8.1 pd.concat()实现合并(按方向)
pd.concat([data1, data2], axis=0)
按行或列索引进行合并,axis=1为水平拼接(按列索引的方向),axis默认为0(按行索引的方向,竖直拼接)
比如将刚才处理好的one-hot编码与原数据合并
stock. head( )
trade_date close open high low pre_close change pct_chg vol amount 0 20200313 2887.4265 2804.2322 2910.8812 2799.9841 2923.4856 -36.0591 -1.2334 366450436.0 393019665.2 1 20200312 2923.4856 2936.0163 2944.4651 2906.2838 2968.5174 -45.0318 -1.5170 307778457.0 328209202.4 2 20200311 2968.5174 3001.7616 3010.0286 2968.5174 2996.7618 -28.2444 -0.9425 352470970.0 378766619.0 3 20200310 2996.7618 2918.9347 3000.2963 2904.7989 2943.2907 53.4711 1.8167 393296648.0 425017184.8 4 20200309 2943.2907 2987.1805 2989.2051 2940.7138 3034.5113 -91.2206 -3.0061 414560736.0 438143854.6
stock_change. head( )
涨跌幅_(-600, -300] 涨跌幅_(-300, 0] 涨跌幅_(0, 300] 涨跌幅_(300, 600] 涨跌幅_(600, 900] 0 0 1 0 0 0 1 0 1 0 0 0 2 0 1 0 0 0 3 0 0 1 0 0 4 0 1 0 0 0
pd. concat( [ stock_change, stock] , axis= 1 )
涨跌幅_(-600, -300] 涨跌幅_(-300, 0] 涨跌幅_(0, 300] 涨跌幅_(300, 600] 涨跌幅_(600, 900] trade_date close open high low pre_close change pct_chg vol amount 0 0 1 0 0 0 20200313 2887.4265 2804.2322 2910.8812 2799.9841 2923.4856 -36.0591 -1.2334 366450436.0 3.930197e+08 1 0 1 0 0 0 20200312 2923.4856 2936.0163 2944.4651 2906.2838 2968.5174 -45.0318 -1.5170 307778457.0 3.282092e+08 2 0 1 0 0 0 20200311 2968.5174 3001.7616 3010.0286 2968.5174 2996.7618 -28.2444 -0.9425 352470970.0 3.787666e+08 3 0 0 1 0 0 20200310 2996.7618 2918.9347 3000.2963 2904.7989 2943.2907 53.4711 1.8167 393296648.0 4.250172e+08 4 0 1 0 0 0 20200309 2943.2907 2987.1805 2989.2051 2940.7138 3034.5113 -91.2206 -3.0061 414560736.0 4.381439e+08 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 6997 0 1 0 0 0 19910719 136.7000 137.6600 138.5400 136.6600 137.1700 -0.4700 -0.3426 10823.0 5.242826e+03 6998 0 0 1 0 0 19910718 137.1700 137.1700 137.1700 135.8100 135.8100 1.3600 1.0014 847.0 4.644160e+02 6999 0 0 1 0 0 19910717 135.8100 135.8100 135.8100 135.3900 134.4700 1.3400 0.9965 660.0 3.975240e+02 7000 0 0 1 0 0 19910716 134.4700 134.3900 134.4700 133.1400 133.1400 1.3300 0.9989 2796.0 1.328502e+03 7001 0 0 1 0 0 19910715 133.1400 133.9000 134.1000 131.8700 132.8000 0.3400 0.2560 11938.0 5.534900e+03
7002 rows × 15 columns
pd. concat( [ stock, stock_change] , axis= 0 )
trade_date close open high low pre_close change pct_chg vol amount 涨跌幅_(-600, -300] 涨跌幅_(-300, 0] 涨跌幅_(0, 300] 涨跌幅_(300, 600] 涨跌幅_(600, 900] 0 20200313.0 2887.4265 2804.2322 2910.8812 2799.9841 2923.4856 -36.0591 -1.2334 366450436.0 393019665.2 NaN NaN NaN NaN NaN 1 20200312.0 2923.4856 2936.0163 2944.4651 2906.2838 2968.5174 -45.0318 -1.5170 307778457.0 328209202.4 NaN NaN NaN NaN NaN 2 20200311.0 2968.5174 3001.7616 3010.0286 2968.5174 2996.7618 -28.2444 -0.9425 352470970.0 378766619.0 NaN NaN NaN NaN NaN 3 20200310.0 2996.7618 2918.9347 3000.2963 2904.7989 2943.2907 53.4711 1.8167 393296648.0 425017184.8 NaN NaN NaN NaN NaN 4 20200309.0 2943.2907 2987.1805 2989.2051 2940.7138 3034.5113 -91.2206 -3.0061 414560736.0 438143854.6 NaN NaN NaN NaN NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 6997 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 1.0 0.0 0.0 0.0 6998 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0 6999 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0 7000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0 7001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0
14004 rows × 15 columns
上例中,不同的字段在合并后的表中都出现,不存在值的字段处得值为NaN
3.8.2 pd.merge()实现合并(按索引)
pd.merge(left, right, how=‘inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=(’_x’,‘y’), copy=True, indicator=False, validate=None)
可以指定按照两组数据的共同键值对合并或左右各自 left:A DataFrame object right: Another DataFrame object how: 默认为inner on: [索引]——指定依据哪个索引合并 left_on=None, right_on=None: 指定左右键
how = ‘inner’:内连接,只保留键的值相同的行 [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-vRq588Ni-1586742394676)(attachment:image.png)]
how = ‘left’: 以左边的表为主连接 [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-5VnX61g8-1586742394678)(attachment:image.png)]
how = ‘outer’: 外连接,只要键的值存在,相应的行就会被保留 [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-w94Lif3l-1586742394678)(attachment:image.png)]
left = pd. DataFrame( { 'key1' : [ 'K0' , 'K0' , 'K1' , 'K2' ] ,
'key2' : [ 'K0' , 'K1' , 'K0' , 'K1' ] ,
'A' : [ 'A0' , 'A1' , 'A2' , 'A3' ] ,
'B' : [ 'B0' , 'B1' , 'B2' , 'B3' ] } )
right = pd. DataFrame( { 'key1' : [ 'K0' , 'K1' , 'K1' , 'K2' ] ,
'key2' : [ 'K0' , 'K0' , 'K0' , 'K0' ] ,
'C' : [ 'C0' , 'C1' , 'C2' , 'C3' ] ,
'D' : [ 'D0' , 'D1' , 'D2' , 'D3' ] } )
left
key1 key2 A B 0 K0 K0 A0 B0 1 K0 K1 A1 B1 2 K1 K0 A2 B2 3 K2 K1 A3 B3
right
key1 key2 C D 0 K0 K0 C0 D0 1 K1 K0 C1 D1 2 K1 K0 C2 D2 3 K2 K0 C3 D3
pd. merge( left, right, how= 'inner' , on= [ 'key1' , 'key2' ] )
key1 key2 A B C D 0 K0 K0 A0 B0 C0 D0 1 K1 K0 A2 B2 C1 D1 2 K1 K0 A2 B2 C2 D2
pd. merge( left, right, how= 'left' , on= [ 'key1' , 'key2' ] )
key1 key2 A B C D 0 K0 K0 A0 B0 C0 D0 1 K0 K1 A1 B1 NaN NaN 2 K1 K0 A2 B2 C1 D1 3 K1 K0 A2 B2 C2 D2 4 K2 K1 A3 B3 NaN NaN
pd. merge( left, right, how= 'outer' , on= [ 'key1' , 'key2' ] )
key1 key2 A B C D 0 K0 K0 A0 B0 C0 D0 1 K0 K1 A1 B1 NaN NaN 2 K1 K0 A2 B2 C1 D1 3 K1 K0 A2 B2 C2 D2 4 K2 K1 A3 B3 NaN NaN 5 K2 K0 NaN NaN C3 D3