Problem
A matrix is a rectangular table of values divided into rows and columns. An $m \times n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$, we write $A_{i,j}$ to indicate the value found at the intersection of row $i$ and column $j$.
Say that we have a collection of DNA strings, all having the same length $n$. Their profile matrix is a $4 \times n$ matrix $P$ in which $P_{1,j}$ represents the number of times that 'A' occurs in the $j$th position of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the $j$th position, and so on (see below).
A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
                A T C C A G C T
                G G G C A A C T
                A T G G A T C T
DNA Strings     A A G C A A C C
                T T G G A A C T
                A T G C C A T T
                A T G G C A C T

            A   5 1 0 0 5 5 0 0
Profile     C   0 0 1 4 2 0 6 1
            G   1 1 6 3 0 1 0 0
            T   1 5 0 0 0 1 1 6

Consensus       A T G C A A C T
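The definitions above can be sketched in a few lines of Python. This is only a minimal illustration, not the reference solution: the function names `profile_matrix` and `consensus` are hypothetical, and ties are broken in favor of the first symbol in the order A, C, G, T, which is an arbitrary choice.

```python
def profile_matrix(strings, alphabet="ACGT"):
    """Return a dict mapping each symbol to its per-position counts."""
    # zip(*strings) yields the columns of the collection, one tuple per position.
    columns = list(zip(*strings))
    return {sym: [col.count(sym) for col in columns] for sym in alphabet}


def consensus(profile, alphabet="ACGT"):
    """Pick, for each position, a symbol with the maximum count.

    Ties go to whichever symbol comes first in `alphabet` (an assumption).
    """
    length = len(profile[alphabet[0]])
    return "".join(
        max(alphabet, key=lambda sym: profile[sym][j]) for j in range(length)
    )


# The seven strings from the example table.
dna = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
       "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile = profile_matrix(dna)
print(consensus(profile))
for sym in "ACGT":
    print(sym + ":", *profile[sym])
```

Run on the example collection, this should reproduce the consensus string and the four profile rows shown in the table.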
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
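Since any most-common symbol is acceptable at a tied position, it can be useful to see which positions are ambiguous. The sketch below is an illustration only: `tied_symbols` is a hypothetical helper name, and it is applied to just the first two strings of the example collection because the full collection has no ties.

```python
from collections import Counter
from math import prod


def tied_symbols(strings):
    """For each position, list every symbol that attains the maximum count."""
    result = []
    for column in zip(*strings):
        counts = Counter(column)
        best = max(counts.values())
        result.append(sorted(sym for sym, n in counts.items() if n == best))
    return result


# The first two example strings disagree at several positions.
ties = tied_symbols(["ATCCAGCT", "GGGCAACT"])
print(ties)
# Number of distinct valid consensus strings for this pair.
print(prod(len(options) for options in ties))
```

Every position with more than one listed symbol multiplies the number of valid consensus strings, and returning any one of them satisfies the problem.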
Sample Dataset
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
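FASTA files may wrap a single sequence across several lines, so a parser should join every line found between two '>' headers. A minimal sketch, assuming the input is available as text; `parse_fasta` is a hypothetical helper and not part of any library, and the second record is split over two lines here purely to illustrate wrapping.

```python
def parse_fasta(text):
    """Return a dict mapping each record label to its full sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]
            records[label] = ""
        elif label is not None:
            records[label] += line  # a sequence may span several lines
    return records


# Second record deliberately wrapped over two lines (illustrative only).
sample = """>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGC
AACT
"""
print(parse_fasta(sample))
```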
Sample Output
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
Python solution
#%%
from collections import Counter

import numpy as np

# Read the FASTA file. A sequence may be wrapped over several lines,
# so keep appending to the current record until the next '>' header.
seqs = []
with open('./10.txt') as fhand:
    for line in fhand:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            seqs.append('')
        else:
            seqs[-1] += line

a = np.array([list(s) for s in seqs])  # build a 2-D array, one row per string

counts = {base: [] for base in 'ACGT'}
consensus = ''
for i in range(a.shape[1]):
    col = [x[i] for x in a]  # pull out the i-th column of the 2-D array
    for base in 'ACGT':
        counts[base].append(col.count(base))
    consensus += Counter(col).most_common(1)[0][0]

print(consensus)
# Print the profile rows in the required order: A, C, G, T.
for base in 'ACGT':
    print(base + ':', *counts[base])