FASTA sequence:
>ORF type:5prime_partial len:117 (+) 2T_between1k2k_c100001_f2p26_1105:1-351(+)
CCCCAGGACATGAAGGGTGCCTCTCGAAGCCCCGAAGACAGCAGTCCGGATGCCGCCCGC
ATCCGAGTCAAGCGCTACCGCCAGAGCATGAACAACTTCCAGGGCCTCCGGAGCTTTGGC
TGCCGCTTCGGGACGTGCACGGTGCAGAAGCTGGCACACCAGATCTACCAGTTCACAGAT
AAGGACAAGGACAACGTCGCCCCCAGGAGCAAGATCAGCCCCCAGGGCTACGGCCGCCGG
CGCCGGCGCTCCCTGCCCGAGGCCGGCCCGGGTCGGACTCTGGTGTCTTCTAAGCCACAA
GCACACGGGGCTCCAGCCCCCCCGAGTGGAAGTGCTCCCCACTTTCTTTAG
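The transcript ID sits inside the header line, after the ORF type, length, and strand fields. A minimal sketch of pulling it out, assuming every header has the same layout as the record above (the helper name `transcript_id` is hypothetical, introduced only for illustration):

```python
import re

def transcript_id(header):
    """Return the transcript ID from a header shaped like the record above."""
    # Split on ':' or ' '; in this layout the ID is the seventh field (index 6).
    fields = re.split(r':| ', header.strip())
    return fields[6]

header = ">ORF type:5prime_partial len:117 (+) 2T_between1k2k_c100001_f2p26_1105:1-351(+)"
print(transcript_id(header))  # 2T_between1k2k_c100001_f2p26_1105
```

If other headers in the file carry extra or missing fields, index 6 will point at the wrong token, so it is worth spot-checking a few headers first.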
tmp3.txt:
PBfusion.1 2T_between5k10k_c30267_f1p4_5060
PBfusion.1 2T_between5k10k_c123520_f1p4_5838
PBfusion.2 2T_between5k10k_c163523_f1p1_4968
PBfusion.3 2T_between5k10k_c153833_f1p86_5871
PBfusion.4 2T_between3k6k_c71197_f1p41_3951
PBfusion.5 2T_between5k10k_c121317_f1p276_6597
PBfusion.6 2T_between3k6k_c38282_f1p39_5894
PBfusion.7 2T_between1k2k_c73839_f1p13_1888
The sequences of the genes listed in column 2 of tmp3.txt need to be extracted, with the ID from column 1 attached to each:
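import re

# Map each transcript ID (column 2 of tmp3.txt) to its fusion ID (column 1).
AI_DICT = {}
with open('tmp3.txt', 'r') as f2:
    for line in f2:
        # split() handles tabs or spaces and drops the trailing newline
        fields = line.split()
        if len(fields) >= 2:
            AI_DICT[fields[1]] = fields[0]

# Copy only the FASTA records whose transcript ID appears in AI_DICT.
skip = True
with open('2T_transcripts.cds.fa', 'r') as f1, open('fasta_parsed.txt', 'w') as f3:
    for line in f1:
        if line[0] == '>':
            # Header fields are separated by ':' or ' '; the transcript ID is at index 6.
            _splitline = re.split(':| ', line)
            accessorID = _splitline[6]
            if accessorID in AI_DICT:
                # Write the fusion ID on its own line, then the original header.
                f3.write(AI_DICT[accessorID] + "\n")
                f3.write(line)
                skip = False
            else:
                skip = True
        elif not skip:
            # Sequence line belonging to a record we are keeping.
            f3.write(line)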
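Some IDs in tmp3.txt may not match any header in the FASTA file, and the script above drops them silently. The sketch below performs the same extraction on in-memory lines and also returns the IDs that were never found. The function name `extract_records` and the two toy FASTA records are made up for illustration; only the tmp3-style lines are taken from the sample above.

```python
import re

def extract_records(pairs_lines, fasta_lines):
    """Return (output_lines, missing_ids) for tmp3-style lines and FASTA lines."""
    # Transcript ID (column 2) -> fusion ID (column 1).
    id_map = {}
    for line in pairs_lines:
        fields = line.split()
        if len(fields) >= 2:
            id_map[fields[1]] = fields[0]

    out, found, keep = [], set(), False
    for line in fasta_lines:
        if line.startswith('>'):
            fields = re.split(r':| ', line.strip())
            # Assumes the transcript ID is at index 6, as in the sample header.
            acc = fields[6] if len(fields) > 6 else None
            keep = acc in id_map
            if keep:
                found.add(acc)
                out.append(id_map[acc])
                out.append(line.rstrip('\n'))
        elif keep:
            out.append(line.rstrip('\n'))
    # IDs requested in tmp3.txt that never appeared in a FASTA header.
    return out, sorted(set(id_map) - found)

pairs = ["PBfusion.7 2T_between1k2k_c73839_f1p13_1888\n",
         "PBfusion.4 2T_between3k6k_c71197_f1p41_3951\n"]
# Toy records, not real data.
fasta = [">ORF type:complete len:3 (+) 2T_between1k2k_c73839_f1p13_1888:1-9(+)\n",
         "ATGAAATAG\n",
         ">ORF type:complete len:3 (+) 2T_between1k2k_c1_f1p1_100:1-9(+)\n",
         "ATGCCCTAG\n"]
out, missing = extract_records(pairs, fasta)
print(out)
print(missing)
```

Passing open file handles instead of lists works the same way, since both are iterated line by line.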