我在做数千条序列系统发育时,可能有很多seq虽然名字不同但是序列一样,或者做某些分析时候需要做到没有简并碱基(datamonkey选择压力),那么需要筛选。
以下为我写的python代码:
#!/usr/bin/env python
fasta_file=open('pickcat.fas','r')
out_file = open('delambigous.derepeat.pickcat.fas','w')
seq=''
header = ''
UniqueSeq=[]
nucleotide={'A','G','T','C','a','t','c','g'}
#UniqueHeader=[]
for line in fasta_file:
line = line.strip()
if line[0] ==
'>' and seq =='':
# process the first line of the input file
header = line
elif line [0] !=
'>' and nucleotide|set(line) == nucleotide:
#del set(line) not in
ATGCagtc
# join the lines with sequence
seq = line
elif line [0] !=
'>' and nucleotide|set(line) != nucleotide:
header,seq= '',''
elif line[0] ==
'>' and seq != '':
# in subsequent lines starting with
'>',
# write the previous header and sequence
# to the output file. Then re-initialize
# the header and seq variables for the next
record
if seq not in UniqueSeq: #
and header not in UniqueHeader:
UniqueSeq.append(seq) # and
UniqueHeader.append(header)
out_file.write(header+'\n'
+ seq+'\n')
seq,header= '',line
#'header = line' this BUG find me how
hard!!!!
fasta_file.close()
out_file.close()
这个脚本同时去除了简并碱基和序列名一致或者序列信息一致的
用的方法
去除一致序列
1.列表.append建立seq信息库;
2.查询读入的seq在不在信息库里面,不在就说明是新seq,立马append;如果在信息库,说明有重复,则不输出序列
去除简并碱基
把set(seq)和{'A','G','T','C','a','t','c','g'}并集,看结果是否=={'A','G','T','C','a','t','c','g'},从而判断是否有简并碱基,有则该seq和header不输出