用Python实现生信分析——序列比对（Sequence Alignment）详细讲解

最新推荐文章于 2025-04-22 13:39:47 发布

写代码的M教授

最新推荐文章于 2025-04-22 13:39:47 发布

阅读量2.4k

点赞数 27

分类专栏：生信分析文章标签： python 开发语言

本文链接：https://blog.csdn.net/qq_41698317/article/details/141286887

版权

生信分析专栏收录该内容

8 篇文章

订阅专栏

1. 序列比对的背景和应用

生物序列（如DNA、RNA和蛋白质）是生物体内的遗传信息载体。通过比较不同生物之间的这些序列，我们可以理解它们之间的进化关系，找到具有相似功能的基因，甚至预测基因的功能。

应用场景：
- 进化分析：通过比对不同物种的基因序列，可以构建进化树，分析物种之间的亲缘关系。
- 基因注释：通过比对新测序的基因组与已知基因组，可以推测新基因的功能。
- 功能预测：通过比对蛋白质序列，可以预测未知蛋白质的结构和功能。

2. 全局比对（Needleman-Wunsch算法）详细解析

Needleman-Wunsch算法是序列比对中的经典算法，用于找到两条序列的最优全局比对。算法的核心思想是通过动态规划，逐步构建一个得分矩阵，最终通过回溯找到最优比对路径。

2.1 动态规划的核心思想

动态规划是一种将复杂问题分解为更小子问题的方法。对于序列比对，动态规划通过以下步骤实现：

（1）构建得分矩阵：

矩阵的每个元素表示比对序列中某一位置的得分。
行和列分别对应两条待比对的序列。
矩阵元素的值取决于序列字符的匹配情况、插入和删除的代价。

（2）初始化：

第一行和第一列初始化为累加的gap代价（插入或删除的代价），表示一种极端情况，即完全匹配一条序列，而另一条序列有多个插入或删除。

（3）递推关系：

对于每个矩阵元素，通过比较三个可能的得分来源（左边、上边、对角线）来选择最优值：
- 对角线：匹配或不匹配得分。
- 左边：表示插入。
- 上边：表示删除。

（4）回溯路径：

从矩阵的最后一个元素（对应于两个序列的末端）开始，回溯到第一个元素（对应于序列的起点），得到最优比对。

2.2 Python实现详解

我们以两个DNA序列为例：seq1 = "GATTACA" 和 seq2 = "GCATGCU"，并使用Needleman-Wunsch算法对其进行全局比对。

代码结构：

import numpy as np

def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    len1, len2 = len(seq1), len(seq2)
    score_matrix = np.zeros((len1 + 1, len2 + 1), dtype=int)
    
    # Initialize the score matrix
    for i in range(len1 + 1):
        score_matrix[i][0] = gap * i
    for j in range(len2 + 1):
        score_matrix[0][j] = gap * j

    # Fill the score matrix
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            if seq1[i-1] == seq2[j-1]:
                score = match
            else:
                score = mismatch
            score_matrix[i][j] = max(
                score_matrix[i-1][j-1] + score,  # Diagonal, match/mismatch
                score_matrix[i-1][j] + gap,      # Up, gap in seq2
                score_matrix[i][j-1] + gap       # Left, gap in seq1
            )
    
    # Traceback to get the aligned sequences
    align1, align2 = '', ''
    i, j = len1, len2
    while i > 0 and j > 0:
        score_current = score_matrix[i][j]
        score_diagonal = score_matrix[i-1][j-1]
        score_up = score_matrix[i][j-1]
        score_left = score_matrix[i-1][j]
        
        if score_current == score_diagonal + (match if seq1[i-1] == seq2[j-1] else mismatch):
            align1 += seq1[i-1]
            align2 += seq2[j-1]
            i -= 1
            j -= 1
        elif score_current == score_left + gap:
            align1 += seq1[i-1]
            align2 += '-'
            i -= 1
        elif score_current == score_up + gap:
            align1 += '-'
            align2 += seq2[j-1]
            j -= 1
    
    # Fill in the remaining sequence if any
    while i > 0:
        align1 += seq1[i-1]
        align2 += '-'
        i -= 1
    while j > 0:
        align1 += '-'
        align2 += seq2[j-1]
        j -= 1
    
    # Reverse the sequences as we built them in reverse order
    align1 = align1[::-1]
    align2 = align2[::-1]
    
    return align1, align2, score_matrix

# Test sequences
seq1 = "GATTACA"
seq2 = "GCATGCU"

# Run the Needleman-Wunsch algorithm
align1, align2, score_matrix = needleman_wunsch(seq1, seq2)

align1, align2, score_matrix

关键步骤详细解释：

（1）初始化矩阵：

将第一行和第一列的值设置为序列长度乘以gap的代价。例如，序列GATTACA的第一行初始化为[0, -1, -2, -3, -4, -5, -6, -7]，表示不断插入gap的代价。

（2）填充矩阵：

矩阵的填充基于字符匹配的得分和gap的代价。例如，当比对G和G时，得分为match（1）；比对A和C时，得分为mismatch（-1）。

（3）回溯过程：

回溯的路径通过得分矩阵的值来决定，选择从左、上、或对角线方向得分最高的路径。例如，若GATTACA与GCATGCU比对，在回溯时，选择匹配或插入gap。

2.3 运行结果和分析

运行结果：

('G-ATTACA',
 'GCA-TGCU',
 array([[ 0, -1, -2, -3, -4, -5, -6, -7],
        [-1,  1,  0, -1, -2, -3, -4, -5],
        [-2,  0,  0,  1,  0, -1, -2, -3],
        [-3, -1, -1,  0,  2,  1,  0, -1],
        [-4, -2, -2, -1,  1,  1,  0, -1],
        [-5, -3, -3, -1,  0,  0,  0, -1],
        [-6, -4, -2, -2, -1, -1,  1,  0],
        [-7, -5, -3, -1, -2, -2,  0,  0]]))

比对结果：

序列1：G-ATTACA
序列2：GCA-TGCU

得分矩阵：

[[ 0, -1, -2, -3, -4, -5, -6, -7],
 [-1,  1,  0, -1, -2, -3, -4, -5],
 [-2,  0,  0,  1,  0, -1, -2, -3],
 [-3, -1, -1,  0,  2,  1,  0, -1],
 [-4, -2, -2, -1,  1,  1,  0, -1],
 [-5, -3, -3, -1,  0,  0,  0, -1],
 [-6, -4, -2, -2, -1, -1,  1,  0],
 [-7, -5, -3, -1, -2, -2,  0,  0]]

结果分析

（1）比对结果：

序列1 (G-ATTACA) 和序列2 (GCA-TGCU) 的比对结果显示了最优全局比对路径。在这个比对中，存在一些插入/删除操作（gap），例如在序列1的第二个位置和序列2的第四个位置，都插入了gap。

（2）得分矩阵：

这个矩阵表示比对过程中的每一步的得分。矩阵的值从左上角累积到右下角，最终得出全局比对的最优路径。回溯时，通过比较对角线、左边和上方的得分，决定比对路径。

3. 局部比对（Smith-Waterman算法）详细解析

3.1 什么是局部比对？

局部比对（Local Alignment）是指在两条序列中寻找相似的局部区域进行比对。它允许我们在长度不相等或部分相似的序列中找到最相似的片段。局部比对尤其适用于在大多数序列差异较大但在某些部分具有高度相似性的情况下。

3.2 Smith-Waterman算法的核心思想

Smith-Waterman算法是用于局部比对的经典算法。该算法的核心是动态规划，它通过构建一个得分矩阵，找出两个序列中最相似的子序列，并避免负得分。最终的比对结果只包括得分最高的局部区域。

关键特性：

得分矩阵初始化为0：所有矩阵单元初始化为0，避免负得分的累积。
回溯路径从最大得分开始：最终的比对路径从矩阵中的最大得分位置开始回溯。

3.3 Smith-Waterman算法步骤

（1）初始化得分矩阵：

与全局比对不同，得分矩阵的第一行和第一列全部初始化为0。

（2）递推计算：

对于每个矩阵元素，计算从左边、上边和对角线位置的得分，并取这些值的最大值。如果得分为负数，则将其置为0，确保得分矩阵中的值始终为非负。

（3）回溯路径：

从矩阵中的最大值开始回溯，直到遇到得分为0的位置。回溯路径表示局部比对的最优子序列。

3.4 Python实现：局部比对（Smith-Waterman算法）

以下是Python代码实现的Smith-Waterman算法，使用两条DNA序列进行局部比对。

import numpy as np

def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    len1, len2 = len(seq1), len(seq2)
    score_matrix = np.zeros((len1 + 1, len2 + 1), dtype=int)
    
    # Track the maximum score and its position for traceback
    max_score = 0
    max_pos = None
    
    # Fill the score matrix
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            if seq1[i-1] == seq2[j-1]:
                score = match
            else:
                score = mismatch
            # Compute the score for the current cell
            score_matrix[i][j] = max(
                0,  # No negative scores allowed
                score_matrix[i-1][j-1] + score,  # Diagonal move (match/mismatch)
                score_matrix[i-1][j] + gap,      # Up move (gap in seq2)
                score_matrix[i][j-1] + gap       # Left move (gap in seq1)
            )
            # Update the maximum score found
            if score_matrix[i][j] > max_score:
                max_score = score_matrix[i][j]
                max_pos = (i, j)
    
    # Traceback from the max score position
    align1, align2 = '', ''
    i, j = max_pos
    
    while score_matrix[i][j] != 0:
        score_current = score_matrix[i][j]
        score_diagonal = score_matrix[i-1][j-1]
        score_up = score_matrix[i][j-1]
        score_left = score_matrix[i-1][j]
        
        if score_current == score_diagonal + (match if seq1[i-1] == seq2[j-1] else mismatch):
            align1 += seq1[i-1]
            align2 += seq2[j-1]
            i -= 1
            j -= 1
        elif score_current == score_left + gap:
            align1 += seq1[i-1]
            align2 += '-'
            i -= 1
        elif score_current == score_up + gap:
            align1 += '-'
            align2 += seq2[j-1]
            j -= 1
    
    # Reverse the alignments as we constructed them in reverse order
    align1 = align1[::-1]
    align2 = align2[::-1]
    
    return align1, align2, score_matrix

# Test sequences
seq1 = "GATTACA"
seq2 = "GCATGCU"

# Run the Smith-Waterman algorithm
align1, align2, score_matrix = smith_waterman(seq1, seq2)

# Output the results
print("Sequence 1 alignment:", align1)
print("Sequence 2 alignment:", align2)
print("Score matrix:\n", score_matrix)

3.5 运行结果和分析

运行结果：

Sequence 1 alignment: AT
Sequence 2 alignment: AT
Score matrix:
 [[0 0 0 0 0 0 0 0]
  [0 1 0 0 0 1 0 0]
  [0 0 0 1 0 0 0 0]
  [0 0 0 0 2 1 0 0]
  [0 0 0 0 1 1 0 0]
  [0 0 0 1 0 0 0 0]
  [0 0 1 0 0 0 1 0]
  [0 0 0 2 1 0 0 0]]

比对结果：