最近开始做rosalind.info上的生物信息相关的一些python习题练手,辅助自己看python for bioinformatics这本书。前面的一些题目还比较基础,做起来也比较顺利,今天做到这么一题,一直答案做不对,最后看到了Question区有人的评论,恍然大悟,只在我的代码的基础上加上了5个字母,就做出了正确答案。
我把这道题的英文原题直接粘到这里:
Problem
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.
Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1,j represents the number of times that ‘A’ occurs in the jth position of one of the strings, P2,j represents the number of times that C occurs in the jth position, and so on (see below).
A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
A T C C A G C T | |
G G G C A A C T | |
A T G G A T C T | |
DNA Strings | A A G C A A C C |
T T G G A A C T | |
A T G C C A T T | |
A T G G C A C T | |
A 5 1 0 0 5 5 0 0 Profile C 0 0 1 4 2 0 6 1 G 1 1 6 3 0 1 0 0 T 1 5 0 0 0 1 1 6
Consensus A T G C A A C T
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
简单来说,就是给定一个fasta文件,其中含有若干条长度相等的序列,需要统计所有序列每个位置上ATCG的数量,并且选出每个位置上出现频率最高的碱基组成一条新的序列。注意题目中最后的一个括号中提示如果多个consensus string同时存在,可以任意输出一条序列。
以下是我最初写的代码:
'''
Rosalind Problems: [CONS] Consensus and Profile
'''
def consensus(fasta_file):
counter = {'A': [], 'C': [], 'G': [], 'T': []}
with open(fasta_file) as file:
data = file.read().splitlines()
#print(type(data)) <list>
for line in data:
if '>' in line:
data.pop(data.index(line))
#print(data[0][0])
for i in range(len(data[0])):
for keys in counter:
counter[keys].append(0)
#print(counter)
for j in range(len(data)):
if data[j][i] == 'A':
counter['A'][i] += 1
elif data[j][i] == 'C':
counter['C'][i] += 1
elif data[j][i] == 'G':
counter['G'][i] += 1
else:
counter['T'][i] += 1
print(counter)
length = len(counter['A'])
#print(length) >>>8
consensus_list = []
for i in range(length):
most = max(counter['A'][i], counter['C'][i], counter['G'][i], counter['T'][i])
#print(most)
for j in counter.keys():
if counter[j][i] == most:
consensus_list.append(j)
break #this is to prevent output multiple nucleotides in one spot
consensus_string = ''.join(consensus_list)
print(consensus_string)
for nt in ['A', 'C', 'G', 'T']:
print(nt + ':', end = ' ')
for i in range(len(counter['A'])):
print(counter[nt][i], end = ' ')
print('\n')
consensus("E:/Denglab/学习/Rosalind Problems/output.fasta")
运行时,好像对于题目给的样式fasta文件没有问题。但是一旦download系统生成的序列较长的文件的时候,总是不能通过测试。在提交两次之后,点开Question界面看了看大家的讨论,很多人反映的问题是系统给的fasta文件的序列(偶数行)是分成多行的,这个我也利用另一个写的python脚本解决了,直到我看到了这一条评论:
Be careful to add a single nucleotide at each position of the consensus sequence. For example, for i th position, you might have all four letters representing equally, make sure to add only one of the letters in this position. Otherwise, your consensus will be longer than profile and original sequence. This took me 8 attempts and I hope it takes less for you. Keep up!
猛地我就回去改了改题目给的参考fasta,原本是这样的:
>Rosalind_1 ATCCAGCT >Rosalind_2 GGGCAACT >Rosalind_3 ATGGATCT >Rosalind_4 AAGCAACC >Rosalind_5 TTGGAACT >Rosalind_6 ATGCCATT >Rosalind_7 ATGGCACT
看到了吗,每一个位置出现频率最高的碱基都只有一个,我把第一条和第三条的A都改成了G,这样就造成了第一个位置上有两个出现频率最高的碱基,我再运行我的代码,果然得到的consensus string长度不是8而是9。我又回去看了系统给的数据和我的结果,系统给的fasta文件序列长度为905,而我跑出来的consensus string长度达到了1200+。
懂了!再看我的代码的第34-36行,每当检测到计数字典counter中的ATCG任意一个计数等于当前位置最大计数的时候,就往生成的consensus序列上添加对应的碱基,但是一旦出现上面的情况,该循环会将所有符合if条件的碱基全部append到后面,这时就是一个合适的使用break语句的时间,只要if被执行过一次,直接break中止当前循环,进行下一位碱基的添加,这也是符合刚才提到的题目最后一句要求的,即任意输出一条符合要求的consensus序列即可。
即在代码空出来的第37行加上:
break #this is to prevent output multiple nucleotides in one spot
其实python中的break和continue语句有点像c语言里的用法,在合适的地方使用才能避免出错!
本文于2020年9月21日发布在https://emmettoncloud.cn/archives/33(本人博客上),现搬至CSDN博客。