基于numpy的基因序列比对算法NeedlemanWunsch

最新推荐文章于 2024-08-17 21:30:16 发布

Q.1

最新推荐文章于 2024-08-17 21:30:16 发布

阅读量1.1k

点赞数 3

分类专栏：生物信息

本文链接：https://blog.csdn.net/yuqiuwang929/article/details/103910522

版权

生物信息专栏收录该内容

11 篇文章 1 订阅

订阅专栏

这是基于动态规划的一种文本比对算法，常用于基因组序列的比对。

比如基因组上有一段序列为ATCGATCTGT，我需要比对的序列为ATCCATCAG，那么首先需要构建一个初始的打分矩阵：

               A.   T.   C.   G.   A.   T.   C.   T.   G.   T.   
      [[  0.  -1.  -2.  -3.  -4.  -5.  -6.  -7.  -8.  -9. -10.]
A.     [ -1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
T.     [ -2.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
C.     [ -3.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
C.     [ -4.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
A.     [ -5.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
T.     [ -6.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
C.     [ -7.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
A.     [ -8.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
G.     [ -9.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]]

然后从第二列第二行开始遍历矩阵，分别从左边，上面，斜上角往该格子进行计算。match则加1，mismatch则减1，左边和上边过来的数都为gap减1，然后这个格子取三个方向的最大值 max(1,0,0)，于是第二行第二列填上1。

同理，遍历完所有格子后得到如下矩阵：

               A.   T.   C.   G.   A.   T.   C.   T.   G.   T.   
      [[  0.  -1.  -2.  -3.  -4.  -5.  -6.  -7.  -8.  -9. -10.]
A.     [ -1.   1.   1.  -1.  -3.  -3.  -3.  -5.  -7.  -9. -10.]
T.     [ -2.   1.   2.   2.   0.  -2.  -2.  -2.  -4.  -4.  -6.]
C.     [ -3.  -1.   2.   3.   3.   1.  -1.  -1.  -1.  -3.  -5.]
C.     [ -4.  -3.   0.   3.   3.   2.   0.   0.   0.  -2.  -4.]
A.     [ -5.  -3.  -2.   3.   2.   4.   4.   2.   0.  -1.  -3.]
T.     [ -6.  -3.  -2.   1.   2.   4.   5.   5.   3.   3.   1.]
C.     [ -7.  -5.  -2.  -1.   0.   2.   5.   6.   6.   4.   2.]
A.     [ -8.  -6.  -4.  -1.  -2.   1.   3.   6.   5.   5.   3.]
G.     [ -9.  -6.  -6.  -3.   0.   1.   1.   4.   5.   6.   6.]]

最后就需要从右下角往左上角回溯，回溯时每次都取左边，上面，斜上角的最大值，如果左边的数最大则放到左侧的序列引入一个gap，如果上面数最大则上面的序列引入一个gap。

贴上代码：

#!/usr/local/bin/python3
# -*- coding: UTF-8 -*-

import numpy as np


class NeedlemanWunsch:

    def __init__(self, subject, query, match=1, mismatch=-1, gap=-1):
        self.subject = subject.upper()
        self.query = query.upper()
        self.match = match
        self.mismatch = mismatch
        self.gap = gap

    def init_matrix(self):
        score_matrix = np.zeros((len(self.query)+1, len(self.subject)+1))

        sub_init = [i-len(self.subject) for i in range(0, len(self.subject)+1)]
        sub_init.reverse()
        score_matrix[0] = sub_init

        que_init = [i-len(self.query) for i in range(0, len(self.query)+1)]
        que_init.reverse()
        score_matrix[:, 0] = que_init
        return score_matrix

    def check_next_pos(self, pos, gap=False):
        init_score = 0
        if self.query[pos[0]-1] == self.subject[pos[1]-1]:
            init_score += 1
        else:
            init_score += self.mismatch
        if gap:
            init_score += self.gap
        return init_score

    def next_score(self, score_matrix, pos):
        right_pos = (pos[0], pos[1]+1)
        down_pos = (pos[0]+1, pos[1])
        right_down_pos = (pos[0]+1, pos[1]+1)
        right_score = self.check_next_pos(right_pos, gap=True)+score_matrix[right_pos]
        down_score = self.check_next_pos(down_pos, gap=True)+score_matrix[down_pos]
        right_down_score = self.check_next_pos(right_down_pos, gap=False)+score_matrix[pos]
        score_matrix[right_down_pos] = np.max([right_score, down_score, right_down_score])
        return score_matrix

    def run_match(self):
        score_matrics = self.init_matrix()
        for i in range(0, len(self.query)):
            for j in range(0, len(self.subject)):
                score_matrics = self.next_score(score_matrics, (i, j))
        return score_matrics

    def get_match(self, score_matrix):
        out_query = []
        out_subject = []
        que_index = len(self.query)
        sub_index = len(self.subject)
        while que_index > 0:
            up_score = score_matrix[que_index-1, sub_index]
            left_score = score_matrix[que_index, sub_index-1]
            up_left_score = score_matrix[que_index-1, sub_index-1]
            if up_left_score >= up_score and up_left_score >= left_score:
                out_query.append(self.query[que_index - 1])
                out_subject.append(self.subject[sub_index - 1])
                sub_index -= 1
                que_index -= 1
            elif left_score >= up_score:
                out_query.append("-")
                out_subject.append(self.subject[sub_index - 1])
                sub_index -= 1
            else:
                out_subject.append("-")
                out_query.append(self.query[que_index - 1])
                que_index -= 1

        out_query.reverse()
        out_subject.reverse()
        out_subject = "".join(out_subject)
        out_query = "".join(out_query)
        return out_subject, out_query

    @staticmethod
    def print_seq(out_subject, out_query):
        print(out_subject)
        print("".join(["|" if out_subject[i] == out_query[i] else " "
                       for i in range(0, len(out_query))]))

        print(out_query)

    def main(self):
        result = self.run_match()
        print(result)
        out_subject, out_query = self.get_match(result)
        self.print_seq(out_subject, out_query)


if __name__ == "__main__":
    map_cls = NeedlemanWunsch("ATCGATCTGT", "ATCCATCAG", match=1, mismatch=-1, gap=-1)
    map_cls.main()

最终打印结果示例：