【Rosalind】Computing GC Content

问题描述

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.

In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

题解

Python处理

字符串处理,注意处理行末换行符即可

BioPython

相当于是一个fasta文件的读入,之后获取每个序列的GC丰度
fasta文件读入需要用到Bio.SeqIO,对fasta文件使用Bio.SeqIO.parse()解析,返回一个SeqRecord迭代器
对每个SeqRecord对象,获取SeqRecord.description获取label信息(description是获取除’>'以外的整行,id、name是第一个单词),利用Bio.SeqUtils模块的Bio.SeqUtils.GC()方法获取GC丰度

参考代码

Python

with open("rosalind_gc.txt", "r") as f:
	seq_name = f.readline().replace("\n", "")
	seq_content = ""
	ans_gc = 0.0
	ans_name = ""
	while True:
		line = f.readline().replace("\n", "")
		if (line == "" or line[0] == '>'):
			now_gc = (seq_content.count('G') + seq_content.count('C'))* 1.0 / len(seq_content)
			if (now_gc > ans_gc):
				ans_gc = now_gc
				ans_name = seq_name
			if (line == ""):
				break
			seq_name = line[1:]
			seq_content = ""
			continue
		seq_content = seq_content + line
	with open("out.txt", "w") as fo:
		fo.write("%s\n%.6f" % (ans_name, ans_gc * 100.0))
		fo.close()
	f.close()

BioPython

from Bio import SeqIO
from Bio import Seq
from Bio.SeqUtils import GC

ans_gc = 0.0
ans_name = ""

for seq_record in SeqIO.parse("rosalind_gc.txt", "fasta"):
	now_gc = GC(seq_record.seq)
	if (now_gc > ans_gc):
		ans_gc = now_gc
		ans_name = seq_record.description
print("%s\n%.6f" % (ans_name, ans_gc))
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值