I have the following codes:
import pandas as pd
rep1 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'z']), ('Gene', ['foo', 'bar', 'qux']), ('RP1',[1.00,23.22,11.12]),('RP1',["A","B","C"]) ], orient='columns')
rep2 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'z']), ('Gene', ['foo', 'bar', 'qux']), ('RP2',[3.33,77.22,18.12]),('RP2',["G","I","K"]) ], orient='columns')
rep3 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'k']), ('Gene', ['foo', 'bar', 'kux']), ('RP3',[99.99,98.29,8.10]),('RP2',["M","P","J"]) ], orient='columns')
tmp = []
tmp.append(rep1)
tmp.append(rep2)
tmp.append(rep3)
Which produces the following list of data frames.
In [56]: tmp
Out[56]:
[ Probe Gene RP1 RP1
0 x foo 1.00 A
1 y bar 23.22 B
2 z qux 11.12 C, Probe Gene RP2 RP2
0 x foo 3.33 G
1 y bar 77.22 I
2 z qux 18.12 K, Probe Gene RP3 RP2
0 x foo 99.99 M
1 y bar 98.29 P
2 k kux 8.10 J]
Each of the data frames above has the following characteristics:
Always contain 4 columns
Second and last column has identical names
First two columns always named Probe and Gene
Content of Probe and Gene always appear consistently, ie.
'x' always goes with 'foo'.
I'm trying to merge those DFs in the list so that it produces this:
Probe Gene RP1 RP2 RP3 RP1 RP2 RP3
0 x foo 1.00 3.33 99.99 A G M
1 y bar 23.22 77.22 98.29 B I P
2 z qux 11.12 18.12 NA C K NA
3 k kux NA NA 8.10 NA NA J
I tried this code but failed:
In [67]: reduce(pd.merge,tmp)
MergeError: Left data columns not unique: Index([u'Probe', u'Gene', u'RP1', u'RP1'], dtype='object')
What's the right way to do it?
解决方案
You could dedupe the column names. Here's a kind of hacky way:
In [11]: list(rep1.columns[0:2]) + [rep1.columns[2] + "_value"] + [rep1.columns[2] + "_letter"]
Out[11]: ['Probe', 'Gene', 'RP1_value', 'RP1_letter']
In [12]: for rep in tmp:
.....: rep.columns = list(rep.columns[0:2]) + [rep.columns[2] + "_value"] + [rep.columns[2] + "_letter"]
In [13]: reduce(pd.merge,tmp)
Out[13]:
Probe Gene RP1_value RP1_letter RP2_value RP2_letter RP3_value RP3_letter
0 x foo 1.00 A 3.33 G 99.99 M
1 y bar 23.22 B 77.22 I 98.29 P
You also need to specify it as an outer merge (to get the NaN rows):
In [21]: reduce(lambda x, y: pd.merge(x, y, how='outer'),tmp)
Out[21]:
Probe Gene RP1_value RP1_letter RP2_value RP2_letter RP3_value RP3_letter
0 x foo 1.00 A 3.33 G 99.99 M
1 y bar 23.22 B 77.22 I 98.29 P
2 z qux 11.12 C 18.12 K NaN NaN
3 k kux NaN NaN NaN NaN 8.10 J