Rosalind Python|Finding a Protein Motif

A Rosalind programming problem: finding a motif in protein sequences retrieved from an online database.

Finding a Protein Motif

Problem:
To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means “either X or Y” and {X} means “any amino acid except X.” For example, the N-glycosylation motif is written as N{P}[ST]{P}.

You can see the complete description and features of a particular protein by its access ID “uniprot_id” in the UniProt database, by inserting the ID number into http://www.uniprot.org/uniprot/uniprot_id

Given: At most 15 UniProt Protein Database access IDs.
Sample input

A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST

Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.
Sample output

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614


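This problem asks us to retrieve protein sequences from the online protein database UniProt and then search each retrieved sequence for a motif. Because the sequences have to be fetched over HTTP, I solved it in Python. The format of the motif deserves a note. The motif to search for is the N-glycosylation motif, written as N{P}[ST]{P}: N means the position must be asparagine (N), {P} means any amino acid except proline (P), and [ST] means either serine (S) or threonine (T).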
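The shorthand maps almost directly onto regular-expression character classes. The following is a minimal sketch of that translation; the helper name `motif_to_regex` is hypothetical and not part of the original solution, and it assumes the motif contains only uppercase residue letters, `[...]`, and `{...}`.

```python
import re


def motif_to_regex(motif: str) -> str:
    """Translate motif shorthand into a regex pattern string.

    {X} ("anything except X") becomes the negated class [^X];
    [XY] ("either X or Y") is already valid regex and is left as-is.
    """
    return re.sub(r"\{([A-Z]+)\}", r"[^\1]", motif)


pattern = motif_to_regex("N{P}[ST]{P}")
print(pattern)  # N[^P][ST][^P]
```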

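The problem itself is not hard: fetch the protein sequences with the requests package, then match the motif, either with a regular expression or, as in the code below, by comparing the four residues directly.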
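If a regular expression is used, one detail matters: matches may overlap (the sample output lists both 115 and 116 for P07204_TRBM_HUMAN), and a plain `re.finditer` scan skips overlapping hits. A common workaround is to wrap the pattern in a zero-width lookahead. The sketch below assumes that approach; the name `find_motif` and the short test string are made up for illustration.

```python
import re

# Lookahead (?=...) matches at a position without consuming characters,
# so every start position is tested, including overlapping ones.
N_GLY = re.compile(r"(?=N[^P][ST][^P])")


def find_motif(sequence):
    """Return the 1-based start positions of the N-glycosylation motif."""
    return [m.start() + 1 for m in N_GLY.finditer(sequence)]


# "NNTS" matches at position 1 and "NTSA" at position 2.
print(find_motif("NNTSA"))  # [1, 2]
```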

The implementation is as follows:

import requests


def N_gly_motif(sequence):
    """Return the 1-based start positions of N{P}[ST]{P} in sequence."""
    result = []
    # Slide a window of four residues along the sequence.
    for i in range(len(sequence) - 3):
        seq = sequence[i:i + 4]
        # N, then anything but P, then S or T, then anything but P.
        if seq[0] == "N" and seq[1] != "P" and seq[2] in "ST" and seq[3] != "P":
            result.append(i + 1)
    return result


if __name__ == '__main__':
    # 1. Read the protein IDs from the input file.
    url = 'https://www.uniprot.org/uniprot/'
    with open("C:/Users/Administrator/Desktop/rosalind_mprt.txt") as file:
        seqIDs = file.read().split()

    # 2. Fetch each sequence in FASTA format over HTTP.
    sequences = {}
    for proID in seqIDs:
        # Assumption: only the accession before the first "_" belongs in the
        # URL (e.g. P07204 for P07204_TRBM_HUMAN).
        goToURL = url + proID.split("_")[0] + ".fasta"
        response = requests.get(goToURL)
        response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
        # Drop the FASTA header line and join the remaining sequence lines.
        sequences[proID] = "".join(response.text.split("\n")[1:])

    # 3. Print each ID that has at least one match, followed by its positions.
    for key, value in sequences.items():
        result = N_gly_motif(value)
        if result:
            print(key)
            print(*result)  # unpack the list so the positions are space-separated
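The ID handling and FASTA parsing in step 2 can be checked without any network access. The sketch below is one way to do that; the helper names are hypothetical, the FASTA record is made up for illustration (it is not a real UniProt entry), and the rule that the accession is the text before the first underscore is an assumption based on the sample IDs.

```python
def accession_of(entry_id):
    """Assumption: the accession is the text before the first underscore."""
    return entry_id.split("_")[0]


def fasta_to_sequence(fasta_text):
    """Join the sequence lines of a single FASTA record, skipping the header."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


# Made-up record for illustration only.
example = ">sp|X00000|DEMO_EXAMPLE hypothetical record\nMKNNTS\nAANPSA\n"

print(accession_of("P07204_TRBM_HUMAN"))  # P07204
print(accession_of("B5ZC00"))             # B5ZC00
print(fasta_to_sequence(example))         # MKNNTSAANPSA
```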

Some of the code optimizations follow ideas from programmers abroad; many thanks to them!
