问题描述
Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.
In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
Rosalind_0808
60.919540
题解
Python处理
字符串处理,注意处理行末换行符即可
BioPython
相当于是一个fasta文件的读入,之后获取每个序列的GC丰度
fasta文件读入需要用到Bio.SeqIO,对fasta文件使用Bio.SeqIO.parse()解析,返回一个SeqRecord迭代器
对每个SeqRecord对象,获取SeqRecord.description获取label信息(description是获取除’>'以外的整行,id、name是第一个单词),利用Bio.SeqUtils模块的Bio.SeqUtils.GC()方法获取GC丰度
参考代码
Python
with open("rosalind_gc.txt", "r") as f:
seq_name = f.readline().replace("\n", "")
seq_content = ""
ans_gc = 0.0
ans_name = ""
while True:
line = f.readline().replace("\n", "")
if (line == "" or line[0] == '>'):
now_gc = (seq_content.count('G') + seq_content.count('C'))* 1.0 / len(seq_content)
if (now_gc > ans_gc):
ans_gc = now_gc
ans_name = seq_name
if (line == ""):
break
seq_name = line[1:]
seq_content = ""
continue
seq_content = seq_content + line
with open("out.txt", "w") as fo:
fo.write("%s\n%.6f" % (ans_name, ans_gc * 100.0))
fo.close()
f.close()
BioPython
from Bio import SeqIO
from Bio import Seq
from Bio.SeqUtils import GC
ans_gc = 0.0
ans_name = ""
for seq_record in SeqIO.parse("rosalind_gc.txt", "fasta"):
now_gc = GC(seq_record.seq)
if (now_gc > ans_gc):
ans_gc = now_gc
ans_name = seq_record.description
print("%s\n%.6f" % (ans_name, ans_gc))