这是基于动态规划的一种文本比对算法,常用于基因组序列的比对。
比如基因组上有一段序列为ATCGATCTGT,我需要比对的序列为ATCCATCAG,那么首先需要构建一个初始的打分矩阵:
A. T. C. G. A. T. C. T. G. T.
[[ 0. -1. -2. -3. -4. -5. -6. -7. -8. -9. -10.]
A. [ -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
T. [ -2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
C. [ -3. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
C. [ -4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
A. [ -5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
T. [ -6. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
C. [ -7. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
A. [ -8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
G. [ -9. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
然后从第二列第二行开始遍历矩阵,分别从左边,上面,斜上角往该格子进行计算。match则加1,mismatch则减1,左边和上边过来的数都为gap减1,然后这个格子取三个方向的最大值 max(1,0,0),于是第二行第二列填上1。
同理,遍历完所有格子后得到如下矩阵:
A. T. C. G. A. T. C. T. G. T.
[[ 0. -1. -2. -3. -4. -5. -6. -7. -8. -9. -10.]
A. [ -1. 1. 1. -1. -3. -3. -3. -5. -7. -9. -10.]
T. [ -2. 1. 2. 2. 0. -2. -2. -2. -4. -4. -6.]
C. [ -3. -1. 2. 3. 3. 1. -1. -1. -1. -3. -5.]
C. [ -4. -3. 0. 3. 3. 2. 0. 0. 0. -2. -4.]
A. [ -5. -3. -2. 3. 2. 4. 4. 2. 0. -1. -3.]
T. [ -6. -3. -2. 1. 2. 4. 5. 5. 3. 3. 1.]
C. [ -7. -5. -2. -1. 0. 2. 5. 6. 6. 4. 2.]
A. [ -8. -6. -4. -1. -2. 1. 3. 6. 5. 5. 3.]
G. [ -9. -6. -6. -3. 0. 1. 1. 4. 5. 6. 6.]]
最后就需要从右下角往左上角回溯,回溯时每次都取左边,上面,斜上角的最大值,如果左边的数最大则放到左侧的序列引入一个gap,如果上面数最大则上面的序列引入一个gap。
贴上代码:
#!/usr/local/bin/python3
# -*- coding: UTF-8 -*-
import numpy as np
class NeedlemanWunsch:
def __init__(self, subject, query, match=1, mismatch=-1, gap=-1):
self.subject = subject.upper()
self.query = query.upper()
self.match = match
self.mismatch = mismatch
self.gap = gap
def init_matrix(self):
score_matrix = np.zeros((len(self.query)+1, len(self.subject)+1))
sub_init = [i-len(self.subject) for i in range(0, len(self.subject)+1)]
sub_init.reverse()
score_matrix[0] = sub_init
que_init = [i-len(self.query) for i in range(0, len(self.query)+1)]
que_init.reverse()
score_matrix[:, 0] = que_init
return score_matrix
def check_next_pos(self, pos, gap=False):
init_score = 0
if self.query[pos[0]-1] == self.subject[pos[1]-1]:
init_score += 1
else:
init_score += self.mismatch
if gap:
init_score += self.gap
return init_score
def next_score(self, score_matrix, pos):
right_pos = (pos[0], pos[1]+1)
down_pos = (pos[0]+1, pos[1])
right_down_pos = (pos[0]+1, pos[1]+1)
right_score = self.check_next_pos(right_pos, gap=True)+score_matrix[right_pos]
down_score = self.check_next_pos(down_pos, gap=True)+score_matrix[down_pos]
right_down_score = self.check_next_pos(right_down_pos, gap=False)+score_matrix[pos]
score_matrix[right_down_pos] = np.max([right_score, down_score, right_down_score])
return score_matrix
def run_match(self):
score_matrics = self.init_matrix()
for i in range(0, len(self.query)):
for j in range(0, len(self.subject)):
score_matrics = self.next_score(score_matrics, (i, j))
return score_matrics
def get_match(self, score_matrix):
out_query = []
out_subject = []
que_index = len(self.query)
sub_index = len(self.subject)
while que_index > 0:
up_score = score_matrix[que_index-1, sub_index]
left_score = score_matrix[que_index, sub_index-1]
up_left_score = score_matrix[que_index-1, sub_index-1]
if up_left_score >= up_score and up_left_score >= left_score:
out_query.append(self.query[que_index - 1])
out_subject.append(self.subject[sub_index - 1])
sub_index -= 1
que_index -= 1
elif left_score >= up_score:
out_query.append("-")
out_subject.append(self.subject[sub_index - 1])
sub_index -= 1
else:
out_subject.append("-")
out_query.append(self.query[que_index - 1])
que_index -= 1
out_query.reverse()
out_subject.reverse()
out_subject = "".join(out_subject)
out_query = "".join(out_query)
return out_subject, out_query
@staticmethod
def print_seq(out_subject, out_query):
print(out_subject)
print("".join(["|" if out_subject[i] == out_query[i] else " "
for i in range(0, len(out_query))]))
print(out_query)
def main(self):
result = self.run_match()
print(result)
out_subject, out_query = self.get_match(result)
self.print_seq(out_subject, out_query)
if __name__ == "__main__":
map_cls = NeedlemanWunsch("ATCGATCTGT", "ATCCATCAG", match=1, mismatch=-1, gap=-1)
map_cls.main()
最终打印结果示例:
ATCGATCTGT
||| ||| |
ATCCATCAG-
把mismatch和gap值调整成0,初始矩阵都填充为0,实际上就是SmithWaterman算法了。