I'm wondering what the best way is to reshape long-form data into wide form in Python. I've previously been doing this sort of task in R, but it really is taking too long, as my files can be upwards of 1 GB. Here is some dummy data:
Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
But I'd like to have it in wide form, with the scores summed over the strands at each position. Here is the output I'm hoping for:
Sequence 0 1
Gene1 2 0.75
Gene2 0 0.6
Any help on how to attack such a problem conceptually would be really helpful.
Solution
Both of these solutions seem like overkill when you can do it with pandas in a one-liner:
In [7]: df
Out[7]:
Sequence Position Strand Score
0 Gene1 0 + 1.00
1 Gene1 1 + 0.25
2 Gene1 0 - 1.00
3 Gene1 1 - 0.50
4 Gene2 0 + 0.00
5 Gene2 1 + 0.10
6 Gene2 0 - 0.00
7 Gene2 1 - 0.50
In [8]: df.groupby(['Sequence', 'Position']).Score.sum().unstack('Position')
Out[8]:
Position 0 1
Sequence
Gene1 2 0.75
Gene2 0 0.60
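For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same one-liner. It assumes the data is whitespace-delimited as in the question; the inline `raw` string just stands in for the real file, and the `pivot_table` call is an equivalent alternative, not something the session above used.

```python
import io

import pandas as pd

# Dummy data from the question; in practice, pass a file path to read_csv instead.
raw = """Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
"""

# Assumption: columns are separated by arbitrary whitespace.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Sum Score over strands for each (Sequence, Position), then move
# Position from the row index into the columns.
wide = df.groupby(["Sequence", "Position"])["Score"].sum().unstack("Position")

# Equivalent alternative using pivot_table.
wide_alt = df.pivot_table(
    index="Sequence", columns="Position", values="Score", aggfunc="sum"
)

print(wide)
```

If some Sequence has no rows for a given Position, `unstack` leaves a NaN in that cell; passing `fill_value=0` to `unstack` would fill those cells with zero instead.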
If you cannot load the whole file into memory, then one of the out-of-core solutions in the other answers will work too.
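If you would rather stay in pandas for the large-file case, one option is to read the file in chunks and combine partial sums. This is only a sketch, not the approach from the other answers: the function name `wide_sums_chunked` and the default `chunksize` are made up for illustration, and it assumes a whitespace-delimited file whose per-(Sequence, Position) totals fit in memory even if the raw rows do not.

```python
import io

import pandas as pd


def wide_sums_chunked(source, chunksize=100_000):
    """Sum Score per (Sequence, Position) chunk by chunk, then go wide.

    `source` is a path or file-like object; hypothetical helper for illustration.
    """
    partials = []
    for chunk in pd.read_csv(source, sep=r"\s+", chunksize=chunksize):
        # Partial sums for this chunk only; much smaller than the raw rows.
        partials.append(chunk.groupby(["Sequence", "Position"])["Score"].sum())

    # A (Sequence, Position) pair can appear in several chunks,
    # so sum the partial sums once more before reshaping.
    total = pd.concat(partials).groupby(level=["Sequence", "Position"]).sum()
    return total.unstack("Position")


# Tiny demo on the dummy data, with a deliberately small chunk size
# so that groups are split across chunks.
raw = """Sequence Position Strand Score
Gene1 0 + 1
Gene1 1 + 0.25
Gene1 0 - 1
Gene1 1 - 0.5
Gene2 0 + 0
Gene2 1 + 0.1
Gene2 0 - 0
Gene2 1 - 0.5
"""
result = wide_sums_chunked(io.StringIO(raw), chunksize=3)
print(result)
```

Because summation is associative, summing per chunk and then summing the partial results gives the same totals as a single groupby over the whole file, up to floating-point rounding.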