【Rosalind】Finding a Protein Motif – 正则表达式的使用

Rosalind习题Finding a Protein Motif

先祝大家中秋和国庆双节快乐!
在这个日历我还在努力地学习代码(指只有晚上写了2h白天都在睡觉)zzzz
最近在做Rosalind上的这道习题,其中用到了Python中的正则表达式,也就是调用了re模块。之前在学perl(没有学会的那种学)的时候接触过正则表达式,但是自己用的并不是很多,这次正好找到个机会使用一下,还不是很熟练。

Problem

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means “either X or Y” and {X} means “any amino acid except X.” For example, the N-glycosylation motif is written as N{P}[ST]{P}.

You can see the complete description and features of a particular protein by its access ID “uniprot_id” in the UniProt database, by inserting the ID number into

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain a protein sequence in FASTA format by following

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.

Given: At most 15 UniProt Protein Database access IDs.

Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.

Sample Dataset
A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST
Sample Output
B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614

我的解法

'''
Rosalind Problems: [MRPT] Finding a Protein Motif
'''


from urllib.request import urlretrieve
import re

def get_accession(inputfile):
    with open(inputfile, "r", encoding="utf-8") as f:
        accession = f.read().splitlines()
    #print(accession)
    return(accession)

def download_fasta(accession):
    print("Retrieving download information for " + accession)
    url_address = "http://www.uniprot.org/uniprot/" + accession + ".fasta"
    url_filename = accession + ".fasta"
    urlretrieve(url_address, url_filename)
    print(accession +" downloaded successfully!")
    #downloaded files are multiline fasta files. Should be converted to singleline fasta files first.
    singleline(url_filename)
    print(accession +" fasta file standardized!")

def find_motif(fasta_file):
    fasta_file2 = fasta_file + '.fasta'
    with open(fasta_file2, "r", encoding="utf-8") as f:
        data = f.read().splitlines()
        name = fasta_file2.split('.')[0]
        ids, seqs = data[0], data[1]
        regex =r'N[^P][ST][^P]'
        index = [0]
        new_index = 0
        while(re.search(regex, seqs[index[-1]:]) != None):
            reg = re.search(regex, seqs[index[-1]:]).span()
            new_index = new_index + reg[0] + 1
            #print(new_index)
            index.append(new_index)
        index.pop(0)
        #print(index)
        if index != []:
            print(fasta_file)
            for i in index:
                if i != index[-1]:
                    print(i, end = ' ')
                else:
                    print(i, end = '\n')

def singleline(fasta_file):
    with open(fasta_file, 'r') as file:
        new_fasta = {}
        for line in file:
            if line.startswith('>'):
                name = line.split()[0]
                new_fasta[name]=''
            else:
                new_fasta[name] = new_fasta[name] + line.replace('\n', '')
        #print(new_fasta)
    with open(fasta_file, 'w+') as out:
        key_list = list(new_fasta.keys())
        for i in new_fasta.keys():
            if i != key_list[-1]:
                out.write(i)
                out.write('\n')
                out.write(new_fasta[i])
                out.write('\n')
            else:
                out.write(i)
                out.write('\n')
                out.write(new_fasta[i])

#accession_list = get_accession("E:/Denglab/学习/Rosalind Problems/accessionID.txt")
#for i in accession_list:

#  download_fasta(i)

if __name__ == '__main__':
    accession_list = get_accession('accessionID.txt')
    for i in accession_list:
        download_fasta(i)
    for i in accession_list:
        find_motif(i)

输出

我用的测试文件内容是:
P02748_CO9_HUMAN
P01217_GLHA_BOVIN
Q7TMB8
Q924A4
Q28409
P10643_CO7_HUMAN
P05783_K1CR_HUMAN
A6LJ74
P80195_MPP3_BOVIN
P01045_KNH2_BOVIN
P01044_KNH1_BOVIN
P13838_LEUK_RAT
P40225_TPO_HUMAN

我的输出:

Retrieving download information for P02748_CO9_HUMAN
P02748_CO9_HUMAN downloaded successfully!
P02748_CO9_HUMAN fasta file standardized!
Retrieving download information for P01217_GLHA_BOVIN
P01217_GLHA_BOVIN downloaded successfully!
P01217_GLHA_BOVIN fasta file standardized!
Retrieving download information for Q7TMB8
Q7TMB8 downloaded successfully!
Q7TMB8 fasta file standardized!
Retrieving download information for Q924A4
Q924A4 downloaded successfully!
Q924A4 fasta file standardized!
Retrieving download information for Q28409
Q28409 downloaded successfully!
Q28409 fasta file standardized!
Retrieving download information for P10643_CO7_HUMAN
P10643_CO7_HUMAN downloaded successfully!
P10643_CO7_HUMAN fasta file standardized!
Retrieving download information for P05783_K1CR_HUMAN
P05783_K1CR_HUMAN downloaded successfully!
P05783_K1CR_HUMAN fasta file standardized!
Retrieving download information for A6LJ74
A6LJ74 downloaded successfully!
A6LJ74 fasta file standardized!
Retrieving download information for P80195_MPP3_BOVIN
P80195_MPP3_BOVIN downloaded successfully!
P80195_MPP3_BOVIN fasta file standardized!
Retrieving download information for P01045_KNH2_BOVIN
P01045_KNH2_BOVIN downloaded successfully!
P01045_KNH2_BOVIN fasta file standardized!
Retrieving download information for P01044_KNH1_BOVIN
P01044_KNH1_BOVIN downloaded successfully!
P01044_KNH1_BOVIN fasta file standardized!
Retrieving download information for P13838_LEUK_RAT
P13838_LEUK_RAT downloaded successfully!
P13838_LEUK_RAT fasta file standardized!
Retrieving download information for P40225_TPO_HUMAN
P40225_TPO_HUMAN downloaded successfully!
P40225_TPO_HUMAN fasta file standardized!
P02748_CO9_HUMAN
277 415
P01217_GLHA_BOVIN
80 106
Q7TMB8
209 291 328 442 607 672 831 858
Q924A4
74
P10643_CO7_HUMAN
202 754
P05783_K1CR_HUMAN
193 423
P80195_MPP3_BOVIN
95
P01045_KNH2_BOVIN
47 87 168 169 197 204 280
P01044_KNH1_BOVIN
47 87 168 169 197 204
P13838_LEUK_RAT
274 300
P40225_TPO_HUMAN
197 206 234 255 340 348
[Finished in 45.0s]

下载信息去掉不看就是想要的答案,时间是sublime text3计算出来的,这个应该是包含了下载文件的时间(挣扎XD)
我的代码最主要的内容就是在find_motif()函数里。

大佬的答案

但是我看了看讨论区的答案,原来大佬只用了20行就解决了我差不多80行的事情。在这里也贴一个大佬的代码学习一下:(来源rosalind用户hascool

#!/usr/bin/env python
from re import finditer
from sys import argv
from urllib.request import urlopen


uniprot = "http://www.uniprot.org/uniprot/%s.fasta"

for protein in open(argv[1], 'r').read().strip().splitlines():

    # Fetch the protein's fasta file and get rid of newlines.'
    f = urlopen(uniprot % protein).read().decode('utf-8')
    f = ''.join(f.splitlines()[1:])
    
    # Find all the positions of the N-glycosylation motif.
    locs = [g.start()+1 for g in finditer(r'(?=N[^P][ST][^P])', f)]
    
    # Print them out, if any.
    if locs != []:
        print(protein)
        print(' '.join(map(str, locs)))

世界著名的程序员Jamie Zawinski曾经说过(我好像也没调查过他是不是真的说过XD):

“有些人面临一个问题时会想:‘我知道,可以用正则表达式来解决这个问题。’于是现在他们就有两个问题了”——Jamie Zawinski

btw正则表达式其实是在生物信息学中应用很广的一个方法,感觉之后应该多多熟悉正则表达式的使用,对于序列这种特定文本的匹配非常有帮助。之后慢慢整理一下,也想写一个正则表达式学习的归纳整理,先把flag立在这,之后一定还会回来~

本文于2020年10月1日发布在https://emmettoncloud.cn/archives/46(本人博客上),现搬至CSDN博客。

©️2020 CSDN 皮肤主题: 1024 设计师:上身试试 返回首页