I have a file called snp.txt that looks like this:
chrom chromStart chromEnd name strand observed
chr1 259 260 rs72477211 + A/G single
chr1 433 433 rs56289060 + -/C insertion
chr1 491 492 rs55998931 + C/T single
chr1 518 519 rs62636508 + C/G single
chr1 582 583 rs58108140 + A/G single
I have a second file gene.txt
chrom chromStart chromEnd tf_title tf_score
chr1 200 270 NFKB1 123
chr1 420 440 IRF4 234
chr1 488 550 BCL3 231
chr1 513 579 TCF12 12
chr1 582 583 BAD170 89
The final output I want is: output.txt
chrom chromStart chromEnd name strand observed tf_title tf_score
chr1 259 260 rs72477211 + A/G NFKB1 123
chr1 433 433 rs56289060 + -/C IRF4 234
chr1 491 492 rs55998931 + C/T BCL3 231
chr1 518 519 rs62636508 + C/G TCF12 12
chr1 582 583 rs58108140 + A/G BAD170 89
The key thing I want to be able to do is to look at gene.txt and check if the rsnumber in the name column of snp.txt is in the same region established by chrom, chromStart and chromEnd.
For example:
In the first row of snp.txt
the rsid rs72477211 is on chr1 between positions 259 and 260.
Now in gene.txt, NFKB1 is also on chr1 but between positions 200 and 270,
this means that rsid rs72477211 is located the NFKB1 region, so this is noted in output txt.
I am unable to do this in using pandas merge function and I'm not sure where to even start.
the files are extremely large so a loop would be highly inefficient.
Can someone please help? Thanks!
解决方案
If it fits in memory, you can merge the two dataframes with an outer method base only on chrom column, then filter your result by doing the range inclusion math:
df = snp.merge(gene, how='outer', on='chrom')
df = df[(df.chromStart_x>=df.chromStart_y) & (df.chromEnd_x<=df.chromEnd_y)]
You can eventually remove the duplicate columns:
del test['chromStart_y']
del test['chromEnd_y']