10 Consensus and Profile

最新推荐文章于 2024-09-14 19:55:48 发布

acoikw2620

最新推荐文章于 2024-09-14 19:55:48 发布

阅读量277

点赞数

文章标签： python

原文链接：http://www.cnblogs.com/think-and-do/p/7273364.html

版权

Problem

A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1,j represents the number of times that 'A' occurs in the jth position of one of the strings, P2,j represents the number of times that C occurs in the jth position, and so on (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

	A T C C A G C T
	G G G C A A C T
	A T G G A T C T
DNA Strings	A A G C A A C C
	T T G G A A C T
	A T G C C A T T
	A T G G C A C T

	A 5 1 0 0 5 5 0 0
Profile	C 0 0 1 4 2 0 6 1
	G 1 1 6 3 0 1 0 0
	T 1 5 0 0 0 1 1 6

Consensus	A T G C A A C T

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

方法一：

#-*-coding:UTF-8-*-
### 10. Consensus and Profile ###

from collections import Counter
from collections import OrderedDict

fh = open('consesus_and_profile_output2.txt', 'wt')

rosalind = OrderedDict()
seqLength = 0
with open('Sample Dataset.txt') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            rosalindName = line
            rosalind[rosalindName] = ''
            continue
        rosalind[rosalindName] += line
    seqLength = len(rosalind[rosalindName])   #len(ATCCAGCT)

A, C, G, T = [], [], [], []
consensusSeq = ''
for i in range(seqLength):
    seq = ''
    for k in rosalind.keys():      # rosalind.keys = Rosalind_1...Rosalind_7
        seq += rosalind[k][i]       # 把 Rosalind_1...Rosalind_7 相同顺序的碱基加起来
    A.append(seq.count('A'))
    C.append(seq.count('C'))
    G.append(seq.count('G'))
    T.append(seq.count('T'))
    counts = Counter(seq)           # 为seq计数
    consensusSeq += counts.most_common()[0][0]  #从多到少返回,是个list啊，只返回第一个

fh.write(consensusSeq + '\n')
fh.write('\n'.join(['A:\t' + '\t'.join(map(str, A)), 'C:\t' + '\t'.join(map(str, C)),
                    'G:\t' + '\t'.join(map(str, G)), 'T:\t' + '\t'.join(map(str, T))]))
                    #.join(map(str,A))  把 A=[5, 1, 0, 0, 5, 5, 0, 0] 格式转化成 A：5 1 0 0 5 5 0 0
fh.close()

　　方法二：

# coding=utf-8
import numpy as np
import os
from collections import Counter

fhand = open('Sample Dataset.txt')
t = []
for line in fhand:
    if line.startswith('>'):
        continue
    else:
        line = line.rstrip()
    line_list = list(line)   #变成一个list
    t.append(line_list)      # 再把list 放入list
a = np.array(t)  # 创建一个二维数组


L1, L2, L3, L4 = [], [], [], []
comsquence = ''

for i in range(a.shape[1]): # shape[0],是7 行数，shape[1]是8 项目数
    l = [x[i] for x in a]  # 调出二维数组的每一列
    L1.append(l.count('A'))
    L2.append(l.count('C'))
    L3.append(l.count('T'))
    L4.append(l.count('G'))
    comsquence = comsquence + Counter(l).most_common()[0][0]
print comsquence
print 'A:', L1, '\n', 'T:', L2, '\n', 'C:', L3, '\n', 'G:', L4

转载于:https://www.cnblogs.com/think-and-do/p/7273364.html