Filtering identical items in Python: how to filter out unwanted sequences

Latest recommended article published on 2023-06-08 20:17:29

小林手



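Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. Reposts must include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_36324695/article/details/113674057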


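When I build a phylogeny from several thousand sequences, many of them may be identical even though their names differ. Some analyses, such as the selection-pressure analysis on Datamonkey, also require that no sequence contains degenerate (ambiguous) bases. In both cases the sequences have to be filtered first.

Below is the Python code I wrote: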

#!/usr/bin/env python3
# Drop FASTA records that contain degenerate (ambiguous) bases or whose
# sequence duplicates one that has already been written.

# Any other character (N, R, Y, '-', ...) rejects the whole record.
NUCLEOTIDES = set('ACGTacgt')


def flush(header, chunks, seen, out_file):
    """Write one record if it is unambiguous and its sequence is new."""
    seq = ''.join(chunks)
    if header and seq and set(seq) <= NUCLEOTIDES and seq not in seen:
        seen.add(seq)
        out_file.write(header + '\n' + seq + '\n')


seen = set()             # sequences already written (case-sensitive)
header, chunks = '', []  # the record currently being read

with open('pickcat.fas') as fasta_file, \
        open('delambigous.derepeat.pickcat.fas', 'w') as out_file:
    for line in fasta_file:
        line = line.strip()
        if not line:
            continue                   # blank line: line[0] would raise IndexError
        if line.startswith('>'):
            # A new header ends the previous record, so write that one out,
            # then start the next record with this line as its header.
            flush(header, chunks, seen, out_file)
            header, chunks = line, []
        else:
            chunks.append(line)        # a sequence may span several lines
    # The last record has no '>' after it, so write it explicitly.
    flush(header, chunks, seen, out_file)

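In a single pass, this script removes records that contain degenerate bases and records whose sequence is identical to one already written. Duplicate sequence names are not checked.

Methods used

Removing identical sequences

1. Build a store of the sequences seen so far by appending each new one to a list.

2. For every sequence read, check whether it is already in the store. If it is not, it is a new sequence: append it at once and write it out. If it is, the sequence is a duplicate and is not written.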
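The two steps above can be sketched as a small generator. This is only an illustration, not the script itself: `dedupe_records` and the sample records are hypothetical names introduced here, and a `set` is swapped in for the list because a membership test on a set does not slow down as the store grows.

```python
def dedupe_records(records):
    """Yield (header, seq) pairs, keeping the first record of each distinct sequence."""
    seen = set()                      # the "store" of sequences seen so far
    for header, seq in records:
        if seq not in seen:           # new sequence: remember it and keep the record
            seen.add(seq)
            yield header, seq
        # otherwise the sequence is a duplicate and the record is dropped


# Hypothetical sample data: '>a' and '>b' carry the same sequence.
records = [('>a', 'ATGC'), ('>b', 'ATGC'), ('>c', 'ATGG')]
unique = list(dedupe_records(records))
print(unique)  # [('>a', 'ATGC'), ('>c', 'ATGG')]
```

The comparison is case-sensitive, so 'atgc' and 'ATGC' count as different sequences; calling `seq.upper()` before the lookup would be one way to change that, if it is wanted.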

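Removing degenerate bases

Take the union of set(seq) and {'A','G','T','C','a','t','c','g'} and check whether the result still equals {'A','G','T','C','a','t','c','g'}. If it does not, the sequence contains at least one character outside that set, that is, a degenerate base, and neither the seq nor its header is written.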
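The union test can be tried on its own. In this minimal sketch, `is_unambiguous` is a hypothetical helper name. It uses the subset operator `<=`, which gives the same answer as comparing the union with the nucleotide set. It also treats an empty string as not usable, which is an assumption added here.

```python
NUCLEOTIDES = set('ACGTacgt')


def is_unambiguous(seq):
    """Return True if seq is non-empty and contains only A, C, G, T (either case)."""
    # set(seq) <= NUCLEOTIDES is equivalent to (NUCLEOTIDES | set(seq)) == NUCLEOTIDES
    return bool(seq) and set(seq) <= NUCLEOTIDES


print(is_unambiguous('ATGCatgc'))  # True
print(is_unambiguous('ATGNRY'))    # False: N, R and Y are degenerate codes
print(is_unambiguous('AT-GC'))     # False: the gap character is not in the set
```

Note that gap characters also fail the test, so an aligned FASTA file would lose every record that contains a gap unless '-' is added to the set.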
