【Rosalind】Computing GC Content

最新推荐文章于 2023-03-21 19:33:31 发布

Sprjdfn

最新推荐文章于 2023-03-21 19:33:31 发布

阅读量220

点赞数

分类专栏： Rosalind 文章标签：生物信息学

本文链接：https://blog.csdn.net/casber/article/details/123200901

版权

Rosalind 专栏收录该内容

10 篇文章 1 订阅

订阅专栏

问题描述

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.

In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

题解

Python处理

字符串处理，注意处理行末换行符即可

BioPython

相当于是一个fasta文件的读入，之后获取每个序列的GC丰度
fasta文件读入需要用到Bio.SeqIO，对fasta文件使用Bio.SeqIO.parse()解析，返回一个SeqRecord迭代器
对每个SeqRecord对象，获取SeqRecord.description获取label信息（description是获取除’>'以外的整行，id、name是第一个单词），利用Bio.SeqUtils模块的Bio.SeqUtils.GC()方法获取GC丰度

参考代码

Python

with open("rosalind_gc.txt", "r") as f:
	seq_name = f.readline().replace("\n", "")
	seq_content = ""
	ans_gc = 0.0
	ans_name = ""
	while True:
		line = f.readline().replace("\n", "")
		if (line == "" or line[0] == '>'):
			now_gc = (seq_content.count('G') + seq_content.count('C'))* 1.0 / len(seq_content)
			if (now_gc > ans_gc):
				ans_gc = now_gc
				ans_name = seq_name
			if (line == ""):
				break
			seq_name = line[1:]
			seq_content = ""
			continue
		seq_content = seq_content + line
	with open("out.txt", "w") as fo:
		fo.write("%s\n%.6f" % (ans_name, ans_gc * 100.0))
		fo.close()
	f.close()

BioPython

from Bio import SeqIO
from Bio import Seq
from Bio.SeqUtils import GC

ans_gc = 0.0
ans_name = ""

for seq_record in SeqIO.parse("rosalind_gc.txt", "fasta"):
	now_gc = GC(seq_record.seq)
	if (now_gc > ans_gc):
		ans_gc = now_gc
		ans_name = seq_record.description
print("%s\n%.6f" % (ans_name, ans_gc))