Original: hxj7
(WeChat official account: 生信了)
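Before I knew it, I had written seven articles on sequence alignment:
"Sequence Alignment (1): The Needleman-Wunsch Algorithm for Global Alignment"
"Sequence Alignment (2): Needleman-Wunsch with Affine Gap Penalties"
"Sequence Alignment (3): The Smith-Waterman Algorithm for Local Alignment"
"Sequence Alignment (4): Smith-Waterman with Affine Gap Penalties"
"Sequence Matching (5): A Dynamic Programming Algorithm for the Repeated-Match Problem"
"Sequence Alignment (6): The Overlap-Match Problem"
and "Sequence Alignment (7): Linear-Space Algorithms for Sequence Alignment"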
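These seven articles cover most of the common problems in the basics of sequence alignment, each with an implementation. Looking back, the core of every one of them is dynamic programming. In hindsight, dynamic programming fits best the kind of problem where the final result of a complex system is built up over many steps, and the state at each step depends on the previous step (or the previous few steps).
Concretely, applying dynamic programming to sequence alignment comes down to three key parts:
1. Setting the initial conditions. For example, F(i, 0) and F(0, j) control whether gaps at the head of the sequences are penalized. Global alignment penalizes them; overlap matching does not.
2. Building the recurrence. This is the most important step: choosing recurrence quantities that are clear, effective, and easy to implement takes careful thought.
3. Fixing the termination condition. In terms of the score matrix, traceback in global alignment stops at F(0, 0); in local alignment it stops at a cell whose value is 0; in overlap matching it stops on reaching the left column F(i, 0) or the top row F(0, j).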
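The first and third points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `init_boundary` and `traceback_finished` and the `mode` strings are hypothetical, and `d` is taken to be a negative number (the score of one gap), which matches the convention F(i, 0) = i × d used in the formulas of this article.

```python
def init_boundary(m, n, d, mode):
    """Build an (m+1) x (n+1) score matrix with only the boundary filled in.

    mode is one of "global", "local", "overlap" (hypothetical labels).
    d is the score of a single gap and is assumed to be negative.
    """
    if mode not in ("global", "local", "overlap"):
        raise ValueError("unknown mode: " + mode)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    if mode == "global":
        # Only global alignment penalizes gaps at the head of a sequence.
        for i in range(1, m + 1):
            F[i][0] = i * d
        for j in range(1, n + 1):
            F[0][j] = j * d
    return F


def traceback_finished(F, i, j, mode):
    """Return True when traceback should stop at cell (i, j)."""
    if mode == "global":
        return i == 0 and j == 0   # must reach the top-left corner
    if mode == "local":
        return F[i][j] == 0        # stop at the first zero-valued cell
    if mode == "overlap":
        return i == 0 or j == 0    # stop at the left column or top row
    raise ValueError("unknown mode: " + mode)
```

The recurrence (point 2) is deliberately left out here, since it differs for each alignment type.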
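For each alignment type specifically:
Global alignment
Global alignment has these characteristics:
- Every symbol of both sequences takes part in the alignment.
- A symbol at the head of either sequence that is aligned to a gap is penalized.
The formula is: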
$$
\begin{aligned}
& \text{$F(i,j)$ is the maximum score of alignments between $x_{1...i}$ and $y_{1...j}$.} \\
& F(0,0) = 0 \\
& F(i, 0) = i \times d, \ \ \ \ \ i=1,2,...,m \\
& F(0, j) = j \times d, \ \ \ \ \ j=1,2,...,n \\
& F(i, j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j), & \text{$x_i$ aligned to $y_j$} \\
F(i-1,j) + d, & \text{$x_i$ aligned to a gap} \\
F(i, j-1) + d, & \text{$y_j$ aligned to a gap} \\
\end{cases}
\end{aligned}
$$
The best alignment score is $F(m, n)$. Traceback starts at the bottom-right corner of the matrix, $F(m, n)$, and continues until it reaches the top-left corner, $F(0, 0)$.
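To make the recurrence and the traceback concrete, here is a minimal Python sketch of global alignment with a linear gap score. It is not the code from the earlier articles; the function name `global_align`, the simple match/mismatch scoring used for $s(x_i, y_j)$, and the default values are all assumptions chosen for illustration, with `d` negative.

```python
def global_align(x, y, match=1, mismatch=-1, d=-2):
    """Global alignment with a linear gap score d (assumed negative).

    Returns (score, aligned_x, aligned_y); '-' marks a gap.
    """
    m, n = len(x), len(y)

    def s(a, b):
        # Assumed substitution score: a plain match/mismatch scheme.
        return match if a == b else mismatch

    # Initial conditions: leading gaps are penalized.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * d
    for j in range(1, n + 1):
        F[0][j] = j * d

    # Recurrence: best of diagonal, up (x_i vs gap), left (y_j vs gap).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)

    # Traceback from F(m, n) to F(0, 0), following one optimal path.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append('-')
            i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1])
            j -= 1
    return F[m][n], ''.join(reversed(ax)), ''.join(reversed(ay))
```

When several paths tie, this sketch prefers the diagonal move, so it reports only one of the possibly many optimal alignments.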
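Local alignment
Local alignment has these characteristics:
- Not every symbol of the two sequences has to take part in the alignment.
- Symbols at either end of the two sequences that are paired with gaps (that is, left out of the alignment) are not penalized.
The formula is: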
$$
\begin{aligned}
& \text{$F(i,j)$ is the maximum score of alignments between $x_{1...i}$ and $y_{1...j}$.} \\
& F(0,0) = 0 \\
& F(i, 0) = 0, \ \ \ \ \ i=1,2,...,m \\
& F(0, j) = 0, \ \ \ \ \ j=1,2,...,n \\
& F(i, j) = \max \begin{cases}
0, \\
F(i-1, j-1) + s(x_i, y_j), & \text{$x_i$ aligned to $y_j$} \\
F(i-1,j) + d, & \text{$x_i$ aligned to a gap} \\
F(i, j-1) + d, & \text{$y_j$ aligned to a gap} \\
\end{cases}
\end{aligned}
$$
Traceback first finds the maximum value in the score matrix, $F(i_{loc}, j_{loc})$, and then moves from $F(i_{loc}, j_{loc})$ toward the top-left until it reaches a cell whose value is $0$.
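The two differences from global alignment (the extra $0$ in the recurrence and the traceback that starts at the matrix maximum) can be seen in the following Python sketch. As before, this is an illustration rather than the original implementation: `local_align`, the match/mismatch scoring, and the default values are assumptions, and `d` is negative.

```python
def local_align(x, y, match=2, mismatch=-1, d=-2):
    """Local alignment with a linear gap score d (assumed negative).

    Returns (score, aligned_x, aligned_y) for one best local alignment.
    """
    m, n = len(x), len(y)

    def s(a, b):
        # Assumed substitution score: a plain match/mismatch scheme.
        return match if a == b else mismatch

    # Boundary cells stay 0: unaligned ends are free.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
            if F[i][j] > best:
                # Remember the first cell holding the maximum score.
                best, bi, bj = F[i][j], i, j

    # Traceback from the maximum cell until a zero-valued cell is reached.
    ax, ay = [], []
    i, j = bi, bj
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append('-')
            i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1])
            j -= 1
    return best, ''.join(reversed(ax)), ''.join(reversed(ay))
```

Because the boundary cells are all 0, a positive cell always has `i > 0` and `j > 0`, so the loop never indexes outside the matrix.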
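Affine gap penalties applied to global alignment
All the gap penalties above are linear: a run of $g$ consecutive gaps scores $g \times d$. An affine gap penalty instead scores a run of $g$ consecutive gaps as $d + (g - 1) \times e$, where $d$ is the score for opening a gap and $e$ the score for extending it. The advantage is that, as long as extending a gap costs less than opening one ($|e| < |d|$), the alignment tends to pull nearby gaps together into one contiguous run.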
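A small numerical comparison shows why. The sketch below assumes illustrative values only ($d = -12$, $e = -2$, both negative in line with the sign convention of the formulas here); the function names are hypothetical.

```python
def linear_gap_score(g, d):
    """Score of g consecutive gaps under a linear penalty."""
    return g * d


def affine_gap_score(g, d, e):
    """Score of g consecutive gaps under an affine penalty.

    d is the gap-open score, e the gap-extend score (both assumed negative).
    A run of length 0 is assumed to score 0.
    """
    if g == 0:
        return 0
    return d + (g - 1) * e


# Three gaps placed as one run versus three separate single gaps.
one_run_linear = linear_gap_score(3, -12)
three_runs_linear = 3 * linear_gap_score(1, -12)
one_run_affine = affine_gap_score(3, -12, -2)
three_runs_affine = 3 * affine_gap_score(1, -12, -2)
```

With a linear penalty the two layouts score the same (`-36` each), so the algorithm has no reason to prefer either. With the affine penalty the single run scores `-16` against `-36` for three separate gaps, so the contiguous layout wins.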
The global alignment formula for the linear penalty then becomes:
$$
\begin{aligned}
& \text{$M(i,j)$ is the maximum score of alignments between $x_{1...i}$ and $y_{1...j}$,} \\
& \text{\ \ \ given $x_i$ aligned to $y_j$.} \\
& \text{$X(i,j)$ is the maximum score of alignments between $x_{1...i}$ and $y_{1...j}$,} \\
& \text{\ \ \ given $x_i$ aligned to a gap.} \\
& \text{$Y(i,j)$ is the maximum score of alignments between $x_{1...i}$ and $y_{1...j}$,} \\
& \text{\ \ \ given $y_j$ aligned to a gap.} \\
& \text{$F(i,j)$ is the maximum score of alignments between $x_{1...i}$ and $y_{1...j}$.} \\
& M(i, j) = \max \begin{cases}
M(i-1, j-1) + s(x_i, y_j) \\
X(i-1, j-1) + s(x_i, y_j) \\
Y(i-1, j-1) + s(x_i, y_j) \\
\end{cases} \\
& X(i,j) = \max \begin{cases}
M(i-1, j) + d \\
X(i-1, j) + e \\
\end{cases}
\ \ \ \ \ \
Y(i,j) = \max \begin{cases}
M(i, j-1) + d \\
Y(i, j-1) + e \\
\end{cases} \\
& F(i, j) = \max \begin{cases}
M(i, j) \\
X(i, j) \\
Y(i, j)
\end{cases}
\end{aligned}
$$