生物信息学序列比对算法——动态规划
前言
序列比对是生物信息学中非常重要的一个概念,对分析生物数据具有不可或缺的作用。目前绝大多数的序列比对工具均包含了基于动态规划的序列比对的步骤。序列比对问题类似于最长公共子序列问题 (Longest Common Subsequence problem, LCS) ,进而衍生出了全局比对算法尼德曼-翁施算法 (Needleman-Wunsch Algorithm) 与局部比对算法史密斯-沃特曼算法 (Smith-Waterman algorithm) 。
一、LCS问题
1. 子序列
如果给出一条序列 X = { x 1 , x 2 , . . . , x m } X = \{x_1,x_2,...,x_m\} X={x1,x2,...,xm},并且有另外一条序列 Z = { z 1 , z 2 , . . . , z k } Z=\{z_1,z_2,...,z_k\} Z={z1,z2,...,zk}, 存在严格递增的下标序列 { i 1 , i 2 , . . . , i k } \{i_1,i_2,...,i_k\} {i1,i2,...,ik},并且对于 j = 1 , 2 , . . . , k j = 1, 2, ..., k j=1,2,...,k有 z j = x i j z_j =x_{i_j} zj=xij,那么称序列 Z Z Z为序列 X X X的子序列。如序列 Z = { B , C , D , B } Z = \{B,C,D,B\} Z={B,C,D,B}为序列 X = { A , B , C , B , D , A , B } X=\{A,B,C,B,D,A,B\} X={A,B,C,B,D,A,B}的子序列,并且存在对应的递增下标序列 { 2 , 3 , 5 , 7 } \{2,3,5,7\} {2,3,5,7}。
2. 公共子序列
给出两条序列
X
X
X和
Y
Y
Y,如果序列
Z
Z
Z是序列
X
X
X和
Y
Y
Y的共同的子序列,则称
Z
Z
Z为序列
X
X
X和
Y
Y
Y的公共子序列。
设
Z
=
{
z
1
,
z
2
,
.
.
.
,
z
k
}
Z=\{z_1,z_2,...,z_k\}
Z={z1,z2,...,zk}是序列
X
=
{
x
1
,
x
2
,
.
.
.
,
x
m
}
X=\{x_1,x_2,...,x_m\}
X={x1,x2,...,xm}与序列
Y
=
{
y
1
,
y
2
,
.
.
.
,
y
n
}
Y=\{y_1,y_2,...,y_n\}
Y={y1,y2,...,yn}的最长公共子序列,则有:
(1)如果
x
m
=
y
n
x_m=y_n
xm=yn,则有
z
k
=
x
m
=
y
n
z_k=x_m=y_n
zk=xm=yn,并且
Z
k
−
1
Z_{k-1}
Zk−1是序列
X
m
−
1
X_{m-1}
Xm−1与
Y
n
−
1
Y_{n-1}
Yn−1的最长公共子序列;
(2)如果
x
m
≠
y
n
x_m \neq y_n
xm=yn,且
z
k
≠
x
m
z_k \neq x_m
zk=xm,则序列
Z
Z
Z是序列
X
m
−
1
X_{m-1}
Xm−1和序列
Y
Y
Y的最长公共子序列;
(3)如果
x
m
≠
y
n
x_m \neq y_n
xm=yn,且
z
k
≠
y
n
z_k \neq y_n
zk=yn,则序列
Z
Z
Z是序列
X
X
X和序列
Y
n
−
1
Y_{n-1}
Yn−1的最长公共子序列;
由此可见,两个序列的最长公共子序列包含两个序列前缀的最长公共子序列。因此,最长公共子序列问题具有最优子结构性质。
根据最长公共子序列问题的最优子结构性质,可以建立子问题最优值的递归关系。使用
c
[
i
]
[
j
]
c[i][j]
c[i][j]表示序列
X
i
=
{
x
1
,
x
2
,
.
.
.
,
x
i
}
X_i=\{x_1,x_2,...,x_i\}
Xi={x1,x2,...,xi}与序列
Y
j
=
{
y
1
,
y
2
,
.
.
.
,
y
j
}
Y_j=\{y_1,y_2,...,y_j\}
Yj={y1,y2,...,yj}的最长公共子序列的长度。当
i
=
0
i=0
i=0或者
j
=
0
j=0
j=0时,显然最长公共子序列是空的,此时
c
[
i
]
[
j
]
=
0
c[i][j]=0
c[i][j]=0。进而可以根据最优子结构特性构建递归关系,如下所示:
c
[
i
]
[
j
]
=
{
0
,
i
=
0
∨
j
=
0
c
[
i
−
1
]
[
j
−
1
]
+
1
,
i
,
j
>
0
∧
x
i
=
y
j
max
{
c
[
i
]
[
j
−
1
]
,
c
[
i
−
1
]
[
j
]
}
,
i
,
j
>
0
∧
x
i
≠
y
j
\begin{equation} c[i][j]= \begin{cases} 0, & i=0 \vee j=0 \\ c[i-1][j-1]+1, & i,j > 0 \land x_i=y_j \\ \max\{c[i][j-1], c[i-1][j]\}, & i,j>0 \land x_i\neq y_j \end{cases} \end{equation}
c[i][j]=⎩
⎨
⎧0,c[i−1][j−1]+1,max{c[i][j−1],c[i−1][j]},i=0∨j=0i,j>0∧xi=yji,j>0∧xi=yj
如果需要构造最长公共子序列的话,我们需要一个辅助矩阵
b
b
b来记录构造信息。每一条记录
b
[
i
]
[
j
]
b[i][j]
b[i][j]指示到达
c
[
i
]
[
j
]
c[i][j]
c[i][j]值的子问题。
在构造最长公共子序列时,我们首先从
b
[
m
]
[
n
]
b[m][n]
b[m][n]开始,沿着箭头指示的方向搜索矩阵
b
b
b。
(1)当
b
[
i
]
[
j
]
=
“
↖
”
b[i][j]=“↖”
b[i][j]=“↖”时(这意味着元素
x
i
=
y
j
x_i=y_j
xi=yj LCS的一部分),表示序列
X
i
X_i
Xi与序列
Y
j
Y_j
Yj中的公共子序列为序列
X
i
−
1
X_{i-1}
Xi−1与序列
Y
j
−
1
Y_{j-1}
Yj−1的公共子序列在尾部增加元素
x
i
x_i
xi或者
y
j
y_j
yj;
(2)当
b
[
i
]
[
j
]
=
“
↑
”
b[i][j]=“↑”
b[i][j]=“↑”时,这表明此时序列
X
i
X_i
Xi与
Y
j
Y_j
Yj的最长公共子序列和序列
X
i
−
1
X_{i-1}
Xi−1与序列
Y
j
Y_j
Yj的最长公共子序列相同;
(3)当
b
[
i
]
[
j
]
=
“
←
”
b[i][j]=“←”
b[i][j]=“←”时,这表明此时序列
X
i
X_i
Xi与
Y
j
Y_j
Yj的最长公共子序列和序列
X
i
X_i
Xi与序列
Y
j
−
1
Y_{j-1}
Yj−1的最长公共子序列相同。
构造最长公共子序列的矩阵
c
c
c的时间复杂度为
O
(
m
n
)
O(mn)
O(mn),构造最长公共子序列的时间复杂度为
O
(
m
+
n
)
O(m+n)
O(m+n)。
二、Needleman Wunsch
Needeman-Wunsch也采用动态规划算法,与LCS算法类似,但构造矩阵
c
c
c不同。LCS中长度的概念不再用于构造矩阵
c
c
c,确认代之的是使用最优比对结果(使用得分表示)用于构造递归关系。与此同时,会引入罚分的概念。
为了与LCS算法表示区别,使用函数
P
(
s
t
,
t
j
)
P(s_t,t_j)
P(st,tj)表示打分函数,使用矩阵
d
d
d表示得分情况。在这里,我们设置如下的打分函数(单位阵):
P
(
s
t
,
t
j
)
=
{
1
,
s
i
=
t
j
0
,
s
i
≠
t
j
−
1
,
s
i
=
g
a
p
∨
t
j
=
g
a
p
\begin{equation} P(s_t,t_j)= \begin{cases} 1, & s_i=t_j \\ 0, & s_i \neq t_j \\ -1, &s_i=gap \vee t_j=gap \end{cases} \end{equation}
P(st,tj)=⎩
⎨
⎧1,0,−1,si=tjsi=tjsi=gap∨tj=gap
d
[
i
]
[
j
]
=
max
{
d
[
i
−
1
]
[
j
−
1
]
+
P
(
s
i
,
t
j
)
,
d
[
i
−
1
]
[
j
]
+
P
(
s
i
,
g
a
p
)
,
d
[
i
]
[
j
−
1
]
+
P
(
g
a
p
,
t
j
)
\begin{equation} d[i][j]=\max \begin{cases} d[i-1][j-1] + P(s_i,t_j),\\ d[i-1][j]+P(s_i,gap),\\ d[i][j-1]+P(gap,t_j) \end{cases} \end{equation}
d[i][j]=max⎩
⎨
⎧d[i−1][j−1]+P(si,tj),d[i−1][j]+P(si,gap),d[i][j−1]+P(gap,tj)
辅组矩阵
b
[
m
]
[
n
]
b[m][n]
b[m][n]同样用于构建序列比对结果,但使用“-”表示间隙。
三、Smith Waterman算法
该算法主要用于局部比对即寻找局部最优解。与Needleman Wunsch算法相比,构造得分矩阵
d
[
m
]
[
n
]
d[m][n]
d[m][n]与辅组矩阵
b
[
m
]
[
n
]
b[m][n]
b[m][n]的步骤可以是一模一样的。在构造矩阵
d
d
d时,
d
[
m
]
[
n
]
d[m][n]
d[m][n]需要于0比较,当值小于0时应置0。与Needleman Wunsch算法相比,不同之处在于构造结果是从得分最大值处开始,结束于得分为0的位置(即一条全程大于0的路线)。Needleman Wunsch算法则是从左下角向右上角搜索,贯穿整个矩阵,整个路径中可能存在得分小于0的情况。即,我们可以得到如下的打分函数:
P
(
s
t
,
t
j
)
=
{
1
,
s
i
=
t
j
0
,
s
i
≠
t
j
−
1
,
s
i
=
g
a
p
∨
t
j
=
g
a
p
\begin{equation} P(s_t,t_j)= \begin{cases} 1, & s_i=t_j \\ 0, & s_i \neq t_j \\ -1, &s_i=gap \vee t_j=gap \end{cases} \end{equation}
P(st,tj)=⎩
⎨
⎧1,0,−1,si=tjsi=tjsi=gap∨tj=gap
d
[
i
]
[
j
]
=
max
{
d
[
i
−
1
]
[
j
−
1
]
+
P
(
s
i
,
t
j
)
,
d
[
i
−
1
]
[
j
]
+
P
(
s
i
,
g
a
p
)
,
d
[
i
]
[
j
−
1
]
+
P
(
g
a
p
,
t
j
)
,
0
\begin{equation} d[i][j]=\max \begin{cases} d[i-1][j-1] + P(s_i,t_j),\\ d[i-1][j]+P(s_i,gap),\\ d[i][j-1]+P(gap,t_j),\\ 0 \end{cases} \end{equation}
d[i][j]=max⎩
⎨
⎧d[i−1][j−1]+P(si,tj),d[i−1][j]+P(si,gap),d[i][j−1]+P(gap,tj),0
四、算法实现(函数式)
// Alignment.cpp : 此文件包含 "main" 函数。程序执行将在此处开始并结束。
//
#include <iostream>
#include <cstring>
#include <cstdlib>
int BLOSUM_62[21][21] =
{
{-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10},
{-10, 4,-1,-2,-2, 0,-1,-1, 0,-2,-1,-1,-1,-1,-2,-1, 1, 0,-3,-2, 0},
{-10,-1, 5, 0,-2,-3, 1, 0,-2, 0,-3,-2, 2,-1,-3,-2,-1,-1,-3,-2,-3},
{-10,-2, 0, 6, 1,-3, 0, 0, 0, 1,-3,-3, 0,-2,-3,-2, 1, 0,-4,-2,-3},
{-10,-2,-2, 1, 6,-3, 0, 2,-1,-1,-3,-4,-1,-3,-3,-1, 0,-1,-4,-3,-3},
{-10, 0,-3,-3,-3, 9,-3,-4,-3,-3,-1,-1,-3,-1,-2,-3,-1,-1,-2,-2,-1},
{-10,-1, 1, 0, 0,-3, 5, 2,-2, 0,-3,-2, 1, 0,-3,-1, 0,-1,-2,-1,-2},
{-10,-1, 0, 0, 2,-4, 2, 5,-2, 0,-3,-3, 1,-2,-3,-1, 0,-1,-3,-2,-2},
{-10, 0,-2, 0,-1,-3,-2,-2, 6,-2,-4,-4,-2,-3,-3,-2, 0,-2,-2,-3,-3},
{-10,-2, 0, 1,-1,-3, 0, 0,-2, 8,-3,-3,-1,-2,-1,-2,-1,-1,-1, 2,-3},
{-10,-1,-3,-3,-3,-1,-3,-3,-4,-3, 4, 2,-3, 1, 0,-3,-2,-1,-3,-1, 3},
{-10,-1,-2,-3,-4,-1,-2,-3,-4,-3, 2, 4,-2, 2, 0,-3,-2,-1,-2,-1, 1},
{-10,-1, 2, 0,-1,-3, 1, 1,-2,-1,-3,-2, 5,-1,-3,-1, 0,-1,-3,-2,-2},
{-10,-1,-1,-2,-3,-1, 0,-2,-3,-2, 1, 2,-1, 5, 0,-2,-1,-1,-1,-1, 1},
{-10,-2,-3,-3,-3,-2,-3,-3,-3,-1, 0, 0,-3, 0, 6,-4,-2,-2, 1, 3,-1},
{-10,-1,-2,-2,-1,-3,-1,-1,-2,-2,-3,-3,-1,-2,-4, 7,-1,-1,-4,-3,-2},
{-10, 1,-1, 1, 0,-1, 0, 0, 0,-1,-2,-2, 0,-1,-2,-1, 4, 1,-3,-2,-2},
{-10, 0,-1, 0,-1,-1,-1,-1,-2,-2,-1,-1,-1,-1,-2,-1, 1, 5,-2,-2, 0},
{-10,-3,-3,-4,-4,-2,-2,-3,-2,-2,-3,-2,-3,-1, 1,-4,-3,-2,11, 2,-3},
{-10,-2,-2,-2,-3,-2,-1,-2,-3, 2,-1,-1,-2,-1, 3,-3,-2,-2, 2, 7,-1},
{-10, 0,-3,-3,-3,-1,-2,-2,-3,-3, 3, 1,-2, 1,-1,-2,-2, 0,-3,-1, 4}
};
int Unitary[5][5] =
{
{-1,-1,-1,-1,-1},
{-1, 1, 0, 0, 0},
{-1, 0, 1, 0, 0},
{-1, 0, 0, 1, 0},
{-1, 0, 0, 0, 1},
};
int Transition_transversion[5][5] =
{
{-5,-5,-5,-5,-5},
{-5, 1,-5,-5,-1},
{-5,-5, 1,-1,-5},
{-5,-5,-1, 1,-5},
{-5,-1,-5,-5, 1},
};
int Blast[5][5] =
{
{-4,-4,-4,-4,-4},
{-4, 5,-4,-4,-4},
{-4,-4, 5,-4,-4},
{-4,-4,-4, 5,-4},
{-4,-4,-4,-4, 5},
};
int Penalty;
int Score(char a, char b, int type) {
int i, j;
if (type == 1 || type == 3 || type == 4) {
switch (a) {
case 'A':i = 1; break;
case 'T':i = 2; break;
case 'C':i = 3; break;
case 'G':i = 4; break;
default:i = 0;
}
switch (b) {
case 'A':j = 1; break;
case 'T':j = 2; break;
case 'C':j = 3; break;
case 'G':j = 4; break;
default:j = 0;
}
//return Unitary[i][j];
}
else {
switch (a) {
case 'A':i = 1; break;
case 'R':i = 2; break;
case 'N':i = 3; break;
case 'D':i = 4; break;
case 'C':i = 5; break;
case 'Q':i = 6; break;
case 'E':i = 7; break;
case 'G':i = 8; break;
case 'H':i = 9; break;
case 'I':i = 10; break;
case 'L':i = 11; break;
case 'K':i = 12; break;
case 'M':i = 13; break;
case 'F':i = 14; break;
case 'P':i = 15; break;
case 'S':i = 16; break;
case 'T':i = 17; break;
case 'W':i = 18; break;
case 'Y':i = 19; break;
case 'V':i = 20; break;
default:i = 0;
}
switch (b) {
case 'A':j = 1; break;
case 'R':j = 2; break;
case 'N':j = 3; break;
case 'D':j = 4; break;
case 'C':j = 5; break;
case 'Q':j = 6; break;
case 'E':j = 7; break;
case 'G':j = 8; break;
case 'H':j = 9; break;
case 'I':j = 10; break;
case 'L':j = 11; break;
case 'K':j = 12; break;
case 'M':j = 13; break;
case 'F':j = 14; break;
case 'P':j = 15; break;
case 'S':j = 16; break;
case 'T':j = 17; break;
case 'W':j = 18; break;
case 'Y':j = 19; break;
case 'V':j = 20; break;
default:j = 0;
}
}
switch (type)
{
case 1:return Unitary[i][j];
case 2:return BLOSUM_62[i][j];
case 3:return Transition_transversion[i][j];
case 4:return Blast[i][j];
}
}
int max(int a, int b, int c) {
if (a > b) {
if (a > c)
return a;
else
return c;
}
else {
if (b > c)
return b;
else
return c;
}
}
//动态规划算法
int DyAlignment(std::string s1, std::string s2, int** a, int** b, int m, int n, int& x, int& y, int type, bool optimal) {
using namespace std;
int score = 0;
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
// 构造得分矩阵
int c = a[i - 1][j - 1] + Score(s1[i], s2[j], type);
int d = a[i - 1][j] + Score(' ', s2[j], type);
int e = a[i][j - 1] + Score(s1[i], ' ', type);
a[i][j] = max(c, d, e);
// SW算法:
if(optimal){
a[i][j] = a[i][j] > 0 ? a[i][j] : 0;
}
if (a[i][j] >= score) {
score = a[i][j];
x = i;
y = j;
}
if (a[i][j] == d)
b[i][j] = 2;
else if (a[i][j] == e)
b[i][j] = 3;
else
b[i][j] = 1;
}
}
return score;
}
//打点法
void DotAlignment(std::string s1, std::string s2) {
using namespace std;
cout << " ";
for (unsigned int i = 0; i < s2.length(); i++) {
cout << s2[i] << " ";
}
cout << endl;
for (unsigned int i = 0; i < s1.length(); i++) {
cout << s1[i] << " ";
for (unsigned int k = 0; k < s2.length(); k++) {
if (s1[i] == s2[k]) {
cout << "* ";
}
else {
cout << " ";
}
}
cout << endl;
}
}
// 统计字符数
int countchar(std::string s, char c) {
int count = 0;
for (int i = 0; i < s.length(); i++) {
if (s[i] == c)
count++;
}
return count;
}
// 根据参数比对
void Aligment(std::string s1, std::string s2, int type, bool optimal) {
int gapcount = 0;
using namespace std;
s1 = " " + s1;
s2 = " " + s2;
int** a, ** b;
int score = 0;
int m = s1.length();
int n = s2.length();
a = (int**)malloc(sizeof(int*) * m);
b = (int**)malloc(sizeof(int*) * m);
for (int i = 0; i < m; i++) {
a[i] = (int*)malloc(sizeof(int) * n);
b[i] = (int*)malloc(sizeof(int) * n);
}
for (int i = 0; i < m; i++) {
a[i][0] = -i * Penalty;
b[i][0] = 2;
}
for (int j = 0; j < n; j++) {
a[0][j] = -j * Penalty;
b[0][j] = 3;
}
for (int i = 1; i < m; i++) {
for (int j = 1; j < n; j++) {
a[i][j] = 0;
}
}
int x, y;
score = DyAlignment(s1, s2, a, b, m - 1, n - 1, x, y, type, optimal);
if (m <= 21 && n <= 21) {
cout << " ";
for (unsigned int i = 0; i < s2.length(); i++) {
cout << s2[i] << " ";
}
cout << endl;
for (int i = 0; i < m; i++) {
cout << s1[i] << " ";
for (int j = 0; j < n; j++) {
printf("%3d", a[i][j]);
}
cout << endl;
}
cout << "\n\n ";
for (unsigned int i = 0; i < s2.length(); i++) {
cout << s2[i] << " ";
}
cout << endl;
for (int i = 0; i < m; i++) {
cout << s1[i] << " ";
for (int j = 0; j < n; j++) {
if (b[i][j] == 1) {
cout << " ↖";
}
else if (b[i][j] == 2) {
cout << " ↑";
}
else {
cout << " ←";
}
}
cout << endl;
}
}
cout << "\n\n";
string out1, out2, out3;
int identity = 0, similarity = 0;
if (!optimal) {
x = m - 1;
y = n - 1;
}
while ((x >= 0 || y >= 0) && (a[x][y] >= 0 || !optimal)) {
if (x <= 0 && y <= 0)
break;
else if (x < 0)
x = 0;
else if (y < 0)
y = 0;
if (b[x][y] == 1) {
out1 = s1[x] + out1;
out2 = s2[y] + out2;
if (type == 1) {
if (Score(s1[x], s2[y], 3) == 1) {
identity++;
similarity++;
out3 = "|" + out3;
}
else if (Score(s1[x], s2[y], 3) == -1) {
similarity++;
out3 = ':' + out3;
}
else {
out3 = '.' + out3;
}
}
else {
if (s1[x] == s2[y]) {
identity++;
similarity++;
out3 = "|" + out3;
}
else if (Score(s1[x], s2[y], 2) > 0) {
similarity++;
out3 = ':' + out3;
}
else {
out3 = '.' + out3;
}
}
x--;
y--;
}
else if (b[x][y] == 2)
{
out1 = s1[x] + out1;
out2 = '-' + out2;
out3 = " " + out3;
x--;
gapcount++;
}
else
{
out1 = '-' + out1;
out2 = s2[y] + out2;
out3 = " " + out3;
y--;
gapcount++;
}
}
size_t length = out1.length();
cout << "\t\tIdentity:\t" << "(" << identity << "/" << length << ")" << 100.0 * identity / length << "%" << endl;
cout << "\t\tSimilarity:\t" << "(" << similarity << "/" << length << ")" << 100.0 * similarity / length << "%" << endl;
cout << "\t\tGap:\t" << "(" << gapcount << "/" << length << ")" << 100.0 * gapcount / length << "%" << endl;
cout << "\t\tScore:" << score << endl;
int count1 = 0, count2 = 0;
for (int i = 0; i <= length / 50; i++) {
cout << "seq1:\t\t" << count1 + 1 << "\t";
count1 = count1 + 50 - countchar(out1.substr(i * 50, 50), '-');
if (count1 < 0)
count1 = 0;
if (count1 > m)
count1 = m;
cout << out1.substr(i * 50, 50) << "\t" << count1 << endl;
cout << " \t\t\t";
cout << out3.substr(i * 50, 50) << endl;
cout << "seq2:\t\t" << count2 + 1 << "\t";
count2 = count2 + 50 - countchar(out2.substr(i * 50, 50), '-');
if (count2 < 0)
count2 = 0;
if (count2 > n)
count2 = n;
cout << out2.substr(i * 50, 50) << "\t" << count2 << endl << endl;
}
for (int i = 0; i < m; i++) {
free(a[i]);
free(b[i]);
}
free(a);
free(b);
}
std::string reverse(std::string s1) {
using namespace std;
string s;
for (int i = s1.length() - 1; i >= 0; i--) {
if (s1[i] == 'T')
s += 'A';
else if (s1[i] == 'A')
s += 'T';
else if (s1[i] == 'G')
s += 'C';
else if (s1[i] == 'C')
s += 'G';
}
return s;
}
int main()
{
using namespace std;
unsigned command;
unsigned type;
cout << "[1]打点法\t\t[2]动态规划" << endl;
cout << "请选择您要使用的序列比对方法:";
cin >> command;
string s1, s2;
cout << "请输入序列1:";
cin >> s1;
cout << "请输入序列2:";
cin >> s2;
if (command == 2) {
cout << "[1]Unitary\t\t[2]BLOSUM_62\t\t[3]Transition_transversion\t\t[4]Blast" << endl;
cout << "请选择比对的矩阵类型:";
cin >> type;
//cout << "[0]返回全局结果\t\t[1]仅最优(不包括前端得分小于0的序列与后端得分非最高序列)";
cout << "[0]返回全局结果\t\t[1]返回局部结果(不包括前端得分小于0的序列与后端得分非最高序列)";
cout << "是否返回最优结果:" << endl;
bool optimal;
cin >> optimal;
cout << "空位罚分(BLOSUM_62:10;Unitary:1;Transition_transversion:5;Blast:4):";
cin >> Penalty;
if (type == 2) {
for (int i = 0; i < 21; i++) {
BLOSUM_62[i][0] = -Penalty;
BLOSUM_62[0][i] = -Penalty;
}
}
else {
for (int i = 0; i < 5; i++) {
Blast[i][0] = -Penalty;
Unitary[i][0] = -Penalty;
Transition_transversion[i][0] = -Penalty;
Blast[0][i] = -Penalty;
Unitary[0][i] = -Penalty;
Transition_transversion[0][i] = -Penalty;
}
}
cout << "\n\t\t#打分矩阵:";
switch (type)
{
case 1:cout << "Unitary" << endl;
break;
case 2:cout << "BLOSUM_62" << endl;
break;
case 3:cout << "Transition Transversion" << endl;
break;
case 4:cout << "Blast" << endl;
default:
break;
}
cout << "\t\t#Gap_penalty:" << Penalty << endl;
if (command == 2) {
if (type == 1 || type == 3 || type == 4 || type == 5) {
cout << endl << "\t\tStrand=Plus/Plus" << endl;
Aligment(s1, s2, type, optimal);
cout << endl << "\t\tStrand=Plus/Minus" << endl;
Aligment(s1, reverse(s2), type, optimal);
/*cout << endl << "\t\tStrand=Minus/Plus" << endl;
Aligment(reverse(s1), s2, type, optimal);
cout << endl << "\t\tStrand=Minus/Minus" << endl;
Aligment(reverse(s1), reverse(s2), type, optimal);*/
}
else {
Aligment(s1, s2, type, optimal);
}
}
}
else {
DotAlignment(s1, s2);
}
system("pause");
getchar();
return 0;
}
// 运行程序: Ctrl + F5 或调试 >“开始执行(不调试)”菜单
// 调试程序: F5 或调试 >“开始调试”菜单
// 入门使用技巧:
// 1. 使用解决方案资源管理器窗口添加/管理文件
// 2. 使用团队资源管理器窗口连接到源代码管理
// 3. 使用输出窗口查看生成输出和其他消息
// 4. 使用错误列表窗口查看错误
// 5. 转到“项目”>“添加新项”以创建新的代码文件,或转到“项目”>“添加现有项”以将现有代码文件添加到项目
// 6. 将来,若要再次打开此项目,请转到“文件”>“打开”>“项目”并选择 .sln 文件
五 算法实现(面向对象)
aligner.h
#pragma once
#pragma warning(disable:4267)
#pragma warning(disable:4244)
#include<string>
#include<iostream>
#include<sstream>
#include<iomanip>
#include<functional>
#include<tuple>
#include<vector>
#include<algorithm>
namespace aligner {
using std::string;
using matrix = std::vector<std::vector<int>>;
static int BLOSUM_62[21][21] =
{
{-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10,-10},
{-10, 4,-1,-2,-2, 0,-1,-1, 0,-2,-1,-1,-1,-1,-2,-1, 1, 0,-3,-2, 0},
{-10,-1, 5, 0,-2,-3, 1, 0,-2, 0,-3,-2, 2,-1,-3,-2,-1,-1,-3,-2,-3},
{-10,-2, 0, 6, 1,-3, 0, 0, 0, 1,-3,-3, 0,-2,-3,-2, 1, 0,-4,-2,-3},
{-10,-2,-2, 1, 6,-3, 0, 2,-1,-1,-3,-4,-1,-3,-3,-1, 0,-1,-4,-3,-3},
{-10, 0,-3,-3,-3, 9,-3,-4,-3,-3,-1,-1,-3,-1,-2,-3,-1,-1,-2,-2,-1},
{-10,-1, 1, 0, 0,-3, 5, 2,-2, 0,-3,-2, 1, 0,-3,-1, 0,-1,-2,-1,-2},
{-10,-1, 0, 0, 2,-4, 2, 5,-2, 0,-3,-3, 1,-2,-3,-1, 0,-1,-3,-2,-2},
{-10, 0,-2, 0,-1,-3,-2,-2, 6,-2,-4,-4,-2,-3,-3,-2, 0,-2,-2,-3,-3},
{-10,-2, 0, 1,-1,-3, 0, 0,-2, 8,-3,-3,-1,-2,-1,-2,-1,-1,-1, 2,-3},
{-10,-1,-3,-3,-3,-1,-3,-3,-4,-3, 4, 2,-3, 1, 0,-3,-2,-1,-3,-1, 3},
{-10,-1,-2,-3,-4,-1,-2,-3,-4,-3, 2, 4,-2, 2, 0,-3,-2,-1,-2,-1, 1},
{-10,-1, 2, 0,-1,-3, 1, 1,-2,-1,-3,-2, 5,-1,-3,-1, 0,-1,-3,-2,-2},
{-10,-1,-1,-2,-3,-1, 0,-2,-3,-2, 1, 2,-1, 5, 0,-2,-1,-1,-1,-1, 1},
{-10,-2,-3,-3,-3,-2,-3,-3,-3,-1, 0, 0,-3, 0, 6,-4,-2,-2, 1, 3,-1},
{-10,-1,-2,-2,-1,-3,-1,-1,-2,-2,-3,-3,-1,-2,-4, 7,-1,-1,-4,-3,-2},
{-10, 1,-1, 1, 0,-1, 0, 0, 0,-1,-2,-2, 0,-1,-2,-1, 4, 1,-3,-2,-2},
{-10, 0,-1, 0,-1,-1,-1,-1,-2,-2,-1,-1,-1,-1,-2,-1, 1, 5,-2,-2, 0},
{-10,-3,-3,-4,-4,-2,-2,-3,-2,-2,-3,-2,-3,-1, 1,-4,-3,-2,11, 2,-3},
{-10,-2,-2,-2,-3,-2,-1,-2,-3, 2,-1,-1,-2,-1, 3,-3,-2,-2, 2, 7,-1},
{-10, 0,-3,-3,-3,-1,-2,-2,-3,-3, 3, 1,-2, 1,-1,-2,-2, 0,-3,-1, 4}
};
static int Unitary[5][5] =
{
{-1,-1,-1,-1,-1},
{-1, 1, 0, 0, 0},
{-1, 0, 1, 0, 0},
{-1, 0, 0, 1, 0},
{-1, 0, 0, 0, 1},
};
static int Transition_transversion[5][5] =
{
{-5,-5,-5,-5,-5},
{-5, 1,-5,-5,-1},
{-5,-5, 1,-1,-5},
{-5,-5,-1, 1,-5},
{-5,-1,-5,-5, 1},
};
static int Blast[5][5] =
{
{-4,-4,-4,-4,-4},
{-4, 5,-4,-4,-4},
{-4,-4, 5,-4,-4},
{-4,-4,-4, 5,-4},
{-4,-4,-4,-4, 5},
};
class aligner_result {
public:
aligner_result(const string& n1, const string& n2, int sc,
const string& s1, const string& s2, const string& mi,
int ga, int id, int si,
int s1s, int s1e, int s2s,
int s2e, bool st
) : name1(n1), name2(n2), score(sc),
seq1(s1), seq2(s2), midline(mi),
gap_number(ga), identity_number(id), similarity_number(si),
seq1_start(s1s), seq1_end(s1e), seq2_start(s2s),
seq2_end(s2e), strand(st) {}
const string& get_name1() const {
return name1;
}
const string& get_name2() const {
return name2;
}
const string& get_seq1() const {
return seq1;
}
const string& get_seq2() const {
return seq2;
}
const string& get_midline() const {
return midline;
}
const int get_score() const {
return score;
}
const int get_start1() const {
return seq1_start;
}
const int get_start2() const {
return seq2_start;
}
const int get_end1() const {
return seq1_end;
}
const int get_end2() const {
return seq2_end;
}
const int get_length1() const {
return seq1_end - seq1_start + 1;
}
const int get_length2() const {
return seq2_end - seq2_start + 1;
}
const int get_gap() const {
return gap_number;
}
const int get_identity() const {
return identity_number;
}
const int get_similarity() const {
return similarity_number;
}
string get_strand() const {
if (strand) {
return "plus/plus";
}
else {
return "plus/minus";
}
}
void print(std::ostream& out, int format = 1);
private:
string name1;
string name2;
int score;
string seq1;
string seq2;
string midline;
int gap_number;
int identity_number;
int similarity_number;
int seq1_start;
int seq2_start;
int seq1_end;
int seq2_end;
bool strand;
string to_json();
string to_pair();
};
struct pos {
pos() { x = 0; y = 0; };
pos(int x, int y) : x(x), y(y) {}
int x;
int y;
};
using dy_res = std::tuple<string, string, string, int, int, int, int, int, int>;
using align_function = std::function<dy_res(const string& s1, const string& s2)>;
class base_dynamic {
public:
base_dynamic(int matrix_type = 2, int penalty = 10, int extend_penalty = 1);
protected:
int align(const string& s1, const string& s2, matrix& score, matrix& traceback, pos& max_score_pos);
int align2(const string& s1, const string& s2, matrix& score, matrix& traceback, pos& max_score_pos);
bool isSimilarity(char a, char b);
private:
int Score(char a, char b);
std::vector<int*> score_matrix;
int matrix_type;
int penalty;
int extend_penalty;
};
class needleman_wunsch : public base_dynamic {
public:
needleman_wunsch(int matrix_type = 2, int penalty = 10, int extend_penalty = 1);
dy_res operator()(const string& s1, const string& s2);
};
class smith_waterman : public base_dynamic {
public:
smith_waterman(int matrix_type = 2, int penalty = 10, int extend_penalty = 1);
dy_res operator()(const string& s1, const string& s2);
};
class aligner {
public:
aligner(const align_function& f, bool is_auto_reverse = false);
aligner(const string& algorithm, const string& matrix_name = "Unitary", int penalty = 10, int extend_penalty = 1, bool is_auto_reverse = false);
aligner_result operator()(const string& s1, const string& s2, const string& name_1 = "seq_1", const string& name_2 = "seq_2", int start_1 = 0, int start_2 = 0);
private:
align_function func;
bool is_auto_reverse;
string reverse_complement(const string& s);
};
}
aligner.cpp
#include "aligner.h"
using std::string;
using aligner::aligner_result;
using aligner::dy_res;
inline int countCharNumber(const string& s, char c) {
int count = 0;
for (auto ch : s) {
if (ch == c) {
count++;
}
}
return count;
}
void aligner::aligner_result::print(std::ostream& out, int format)
{
switch (format)
{
case 1:
out << to_pair();
break;
case 2:
out << to_json();
break;
default:
out << to_pair();
break;
}
}
string aligner::aligner_result::to_json()
{
std::ostringstream oss;
oss << "{";
oss << "\"name1\":\"" << name1 << "\",";
oss << "\"name2\":\"" << name2 << "\",";
oss << "\"score\":" << score << ",";
oss << "\"seq1\":\"" << seq1 << "\",";
oss << "\"seq2\":\"" << seq2 << "\",";
oss << "\"midline\":\"" << midline << "\",";
oss << "\"gap_number\":" << gap_number << ",";
oss << "\"identity_number\":" << identity_number << ",";
oss << "\"similarity_number\":" << similarity_number << ",";
oss << "\"seq1_start\":" << seq1_start << ",";
oss << "\"seq2_start\":" << seq2_start << ",";
oss << "\"seq1_end\":" << seq1_end << ",";
oss << "\"seq2_end\":" << seq2_end << ",";
oss << "\"strand\":" << strand;
oss << "}";
return oss.str();
}
string aligner::aligner_result::to_pair()
{
std::ostringstream oss;
oss << "1: " << "\t" << name1 << "\n";
oss << "2: " << "\t" << name2 << "\n\n";
auto length = seq1.size();
//保留两位小数
oss << "Identity: " << "\t" <<
identity_number << "/" << length <<
"(" << std::fixed << std::setprecision(2) << (1.0 * identity_number / length * 100) << "%)" << "\n";
oss << "Similarity: " << "\t" <<
similarity_number << "/" << length <<
"(" << std::fixed << std::setprecision(2) << (1.0 * similarity_number / length * 100) << "%)" << "\n";
oss << "Gap: " << "\t" <<
gap_number << "/" << length <<
"(" << std::fixed << std::setprecision(2) << (1.0 * gap_number / length * 100) << "%)" << "\n";
oss << "Score: " << "\t" << score << "\n";
if (strand) {
oss << "Strand: " << "\t" << "plus/plus" << "\n";
}
else {
oss << "Strand: " << "\t" << "plus/minus" << "\n";
}
int count1 = seq1_start, count2 = seq2_start, size = length / 51;
auto m = seq1_end - seq1_start;
auto n = seq2_end - seq2_start;
for (int i = 0; i <= size; i++) {
oss << "seq1:\t\t" << count1 + 1 << "\t";
auto&& sub_1 = seq1.substr(i * 50, 50);
count1 = count1 + 50 - countCharNumber(sub_1, '-');
oss << sub_1 << "\t" << count1 << "\n";
oss << " \t\t\t";
oss << midline.substr(i * 50, 50) << "\n";
oss << "seq2:\t\t" << count2 + 1 << "\t";
auto&& sub_2 = seq2.substr(i * 50, 50);
count2 = count2 + 50 - countCharNumber(sub_2, '-');
oss << sub_2 << "\t" << count2 << "\n";
oss << "\n";
}
return oss.str();
}
aligner::base_dynamic::base_dynamic(int matrix_type, int penalty, int extend_penalty)
:penalty(penalty), extend_penalty(extend_penalty), matrix_type(matrix_type)
{
if (matrix_type == 1) {
for (int i = 0; i < 21; i++) {
this->score_matrix.push_back(BLOSUM_62[i]);
}
for (int i = 0; i < 21; i++) {
this->score_matrix[0][i] = -penalty;
}
for (int i = 0; i < 21; i++) {
this->score_matrix[i][0] = -penalty;
}
}
else if (matrix_type == 2) {
for (int i = 0; i < 5; i++) {
this->score_matrix.push_back(Unitary[i]);
}
for (int i = 0; i < 5; i++) {
this->score_matrix[0][i] = -penalty;
}
for (int i = 0; i < 5; i++) {
this->score_matrix[i][0] = -penalty;
}
}
else if (matrix_type == 3) {
for (int i = 0; i < 5; i++) {
this->score_matrix.push_back(Transition_transversion[i]);
}
for (int i = 0; i < 5; i++) {
this->score_matrix[0][i] = -penalty;
}
for (int i = 0; i < 5; i++) {
this->score_matrix[i][0] = -penalty;
}
}
else if (matrix_type == 4) {
for (int i = 0; i < 5; i++) {
this->score_matrix.push_back(Blast[i]);
}
for (int i = 0; i < 5; i++) {
this->score_matrix[0][i] = -penalty;
}
for (int i = 0; i < 5; i++) {
this->score_matrix[i][0] = -penalty;
}
}
else {
throw std::runtime_error("得分矩阵不正确");
}
}
int aligner::base_dynamic::align(const string& s1, const string& s2, matrix& score, matrix& traceback, pos& max_score_pos)
{
int m = s1.length() + 1;
int n = s2.length() + 1;
score.resize(m);
traceback.resize(m);
for (int i = 0; i < m; i++) {
score[i].resize(n, 0);
traceback[i].resize(n, 0);
}
for (int i = 0; i < m; i++) {
score[i][0] = -penalty * i;
traceback[i][0] = 2;
}
for (int j = 0; j < n; j++) {
score[0][j] = -penalty * j;
traceback[0][j] = 3;
}
int score_res = 0;
for (int i = 1; i < m; i++) {
for (int j = 1; j < n; j++) {
int score_diag = score[i - 1][j - 1] + Score(s1[i - 1], s2[j - 1]);
int score_up = score[i - 1][j] + Score(' ', s2[j - 1]);
int score_left = score[i][j - 1] + Score(s1[i - 1], ' ');
int max = std::max({ score_diag, score_up, score_left });
score[i][j] = max;
if (score_diag == max) {
traceback[i][j] = 1;
}
else if (score_up == max) {
traceback[i][j] = 2;
if (traceback[i - 1][j] == 2 || traceback[i - 1][j] == 3) {
score[i][j] -= extend_penalty;
}
}
else if (score_left == max) {
traceback[i][j] = 3;
if (traceback[i][j - 1] == 3 || traceback[i][j - 1] == 2) {
score[i][j] -= extend_penalty;
}
}
if (score[i][j] > score_res) {
score_res = score[i][j];
max_score_pos.x = i;
max_score_pos.y = j;
}
}
}
return score_res;
}
int aligner::base_dynamic::align2(const string& s1, const string& s2, matrix& score, matrix& traceback, pos& max_score_pos)
{
int m = s1.length() + 1;
int n = s2.length() + 1;
score.resize(m);
traceback.resize(m);
for (int i = 0; i < m; i++) {
score[i].resize(n, 0);
traceback[i].resize(n, 0);
}
for (int i = 0; i < m; i++) {
score[i][0] = -penalty * i;
traceback[i][0] = 2;
}
for (int j = 0; j < n; j++) {
score[0][j] = -penalty * j;
traceback[0][j] = 3;
}
int score_res = 0;
for (int i = 1; i < m; i++) {
for (int j = 1; j < n; j++) {
int score_diag = score[i - 1][j - 1] + Score(s1[i - 1], s2[j - 1]);
int score_up = score[i - 1][j] + Score(' ', s2[j - 1]);
int score_left = score[i][j - 1] + Score(s1[i - 1], ' ');
int max = std::max({ score_diag, score_up, score_left, 0 });
score[i][j] = max;
if (score_diag == max) {
traceback[i][j] = 1;
}
else if (score_up == max) {
traceback[i][j] = 2;
if (traceback[i - 1][j] == 2 || traceback[i - 1][j] == 3) {
score[i][j] -= extend_penalty;
}
}
else if (score_left == max) {
traceback[i][j] = 3;
if (traceback[i][j - 1] == 3 || traceback[i][j - 1] == 2) {
score[i][j] -= extend_penalty;
}
}
if (score[i][j] > score_res) {
score_res = score[i][j];
max_score_pos.x = i;
max_score_pos.y = j;
}
}
}
return score_res;
}
bool aligner::base_dynamic::isSimilarity(char a, char b)
{
int i, j;
if (matrix_type == 1) {
static auto get_index = [](char c) {
int i;
switch (c) {
case 'A':i = 1; break;
case 'R':i = 2; break;
case 'N':i = 3; break;
case 'D':i = 4; break;
case 'C':i = 5; break;
case 'Q':i = 6; break;
case 'E':i = 7; break;
case 'G':i = 8; break;
case 'H':i = 9; break;
case 'I':i = 10; break;
case 'L':i = 11; break;
case 'K':i = 12; break;
case 'M':i = 13; break;
case 'F':i = 14; break;
case 'P':i = 15; break;
case 'S':i = 16; break;
case 'T':i = 17; break;
case 'W':i = 18; break;
case 'Y':i = 19; break;
case 'V':i = 20; break;
default:i = 0;
}
return i;
};
i = get_index(a);
j = get_index(b);
return BLOSUM_62[i][j] > 0;
}
else {
static auto get_index = [](char c) {
int i;
switch (c) {
case 'A':i = 1; break;
case 'C':i = 2; break;
case 'G':i = 3; break;
case 'T':i = 4; break;
default:i = 0;
}
return i;
};
i = get_index(a);
j = get_index(b);
return Transition_transversion[i][j] == -1;
}
}
int aligner::base_dynamic::Score(char a, char b)
{
if (matrix_type == 1) {
static auto get_index = [](char c) {
int i;
switch (c) {
case 'A':i = 1; break;
case 'R':i = 2; break;
case 'N':i = 3; break;
case 'D':i = 4; break;
case 'C':i = 5; break;
case 'Q':i = 6; break;
case 'E':i = 7; break;
case 'G':i = 8; break;
case 'H':i = 9; break;
case 'I':i = 10; break;
case 'L':i = 11; break;
case 'K':i = 12; break;
case 'M':i = 13; break;
case 'F':i = 14; break;
case 'P':i = 15; break;
case 'S':i = 16; break;
case 'T':i = 17; break;
case 'W':i = 18; break;
case 'Y':i = 19; break;
case 'V':i = 20; break;
default:i = 0;
}
return i;
};
return score_matrix[get_index(a)][get_index(b)];
}
static auto get_index = [](char c) {
int i;
switch (c) {
case 'A':i = 1; break;
case 'C':i = 2; break;
case 'G':i = 3; break;
case 'T':i = 4; break;
default:i = 0;
}
return i;
};
return score_matrix[get_index(a)][get_index(b)];
}
aligner::needleman_wunsch::needleman_wunsch(int matrix_type, int penalty, int extend_penalty)
:base_dynamic(matrix_type, penalty, extend_penalty)
{
}
dy_res aligner::needleman_wunsch::operator()(const string& s1, const string& s2)
{
matrix score;
matrix traceback;
pos max_score_pos;
int score_res = align(s1, s2, score, traceback, max_score_pos);
string s1_res;
string s2_res;
string midline;
s1_res.reserve((s1.size() + s2.size()) * 0.7);
s2_res.reserve((s1.size() + s2.size()) * 0.7);
int gap_number = 0, identity = 0, similarity = 0;
int x = s1.size(), y = s2.size();
while (x >= 0 || y >= 0) {
if (x <= 0 && y <= 0)
break;
if (traceback[x][y] == 1) {
x--;
y--;
s1_res.push_back(s1[x]);
s2_res.push_back(s2[y]);
if (s1[x] == s2[y]) {
identity++;
similarity++;
midline.push_back('|');
}
else if (isSimilarity(s1[x], s2[y])) {
similarity++;
midline.push_back(':');
}
else {
midline.push_back('.');
}
}
else if (traceback[x][y] == 2) {
x--;
s1_res.push_back(s1[x]);
s2_res.push_back('-');
midline.push_back(' ');
gap_number++;
}
else {
y--;
s1_res.push_back('-');
s2_res.push_back(s2[y]);
midline.push_back(' ');
gap_number++;
}
}
std::reverse(s1_res.begin(), s1_res.end());
std::reverse(s2_res.begin(), s2_res.end());
std::reverse(midline.begin(), midline.end());
return dy_res(s1_res, s2_res, midline, score_res, identity, similarity, gap_number, x, y);
}
aligner::smith_waterman::smith_waterman(int matrix_type, int penalty, int extend_penalty)
:base_dynamic(matrix_type, penalty, extend_penalty)
{
}
dy_res aligner::smith_waterman::operator()(const string& s1, const string& s2)
{
matrix score;
matrix traceback;
pos max_score_pos;
int score_res = align2(s1, s2, score, traceback, max_score_pos);
string s1_res;
string s2_res;
string midline;
s1_res.reserve(s1.size() + s2.size() * 0.7);
s2_res.reserve(s1.size() + s2.size() * 0.7);
int gap_number = 0, identity = 0, similarity = 0;
int x = max_score_pos.x, y = max_score_pos.y;
while ((x >= 0 || y >= 0) && score[x][y] > 0) {
if (x <= 0 && y <= 0)
break;
if (traceback[x][y] == 1) {
x--;
y--;
s1_res.push_back(s1[x]);
s2_res.push_back(s2[y]);
if (s1[x] == s2[y]) {
identity++;
similarity++;
midline.push_back('|');
}
else if (isSimilarity(s1[x], s2[y])) {
similarity++;
midline.push_back(':');
}
else {
midline.push_back('.');
}
}
else if (traceback[x][y] == 2) {
x--;
s1_res.push_back(s1[x]);
s2_res.push_back('-');
midline.push_back(' ');
gap_number++;
}
else {
y--;
s1_res.push_back('-');
s2_res.push_back(s2[y]);
midline.push_back(' ');
gap_number++;
}
}
std::reverse(s1_res.begin(), s1_res.end());
std::reverse(s2_res.begin(), s2_res.end());
std::reverse(midline.begin(), midline.end());
return dy_res(s1_res, s2_res, midline, score_res, identity, similarity, gap_number, x, y);
}
aligner::aligner::aligner(const align_function& f, bool is_auto_reverse)
:func(f), is_auto_reverse(is_auto_reverse)
{
}
aligner::aligner::aligner(const string& algorithm, const string& matrix_name, int penalty, int extend_penalty, bool is_auto_reverse)
:is_auto_reverse(is_auto_reverse)
{
int matrix_type = 2;
if (matrix_name == "BLOSUM_62") {
matrix_type = 1;
}
else if (matrix_name == "Transition_transversion") {
matrix_type = 3;
}
else if (matrix_name == "Blast") {
matrix_type = 4;
}
if (algorithm == "Needleman-Wunsch") {
needleman_wunsch f(matrix_type, penalty, extend_penalty);
func = f;
}
else if (algorithm == "Smith-Waterman") {
smith_waterman f(matrix_type, penalty, extend_penalty);
func = f;
}
}
aligner_result aligner::aligner::operator()(const string& s1, const string& s2, const string& name_1, const string& name_2, int start_1, int start_2)
{
bool strand = true;
auto&& res_1 = func(s1, s2);
auto score = std::get<3>(res_1);
if (is_auto_reverse) {
auto&& res_2 = func(s1, reverse_complement(s2));
auto score_2 = std::get<3>(res_2);
if (score > score_2)
goto NEXT;
strand = false;
auto& seq1 = std::get<0>(res_2);
auto& seq2 = std::get<1>(res_2);
auto& midline = std::get<2>(res_2);
auto identity = std::get<4>(res_2);
auto similarity = std::get<5>(res_2);
auto gap_number = std::get<6>(res_2);
auto x = std::get<7>(res_2);
auto y = std::get<8>(res_2);
return aligner_result(name_1, name_2, score_2, seq1, seq2, midline, gap_number, identity, similarity,
start_1 + x, start_1 + s1.size(), start_2 + y, start_2 + s2.size(), strand);
}
NEXT:
auto& seq1 = std::get<0>(res_1);
auto& seq2 = std::get<1>(res_1);
auto& midline = std::get<2>(res_1);
auto identity = std::get<4>(res_1);
auto similarity = std::get<5>(res_1);
auto gap_number = std::get<6>(res_1);
auto x = std::get<7>(res_1);
auto y = std::get<8>(res_1);
return aligner_result(name_1, name_2, score, seq1, seq2, midline, gap_number, identity, similarity,
start_1 + x, start_1 + s1.size(), start_2 + y, start_2 + s2.size(), strand);
}
string aligner::aligner::reverse_complement(const string& s)
{
string res;
res.reserve(s.size());
for (auto itr = s.rbegin(); itr != s.rend(); itr++) {
auto ch = *itr;
switch (ch)
{
case 'A':
res.push_back('T');
break;
case 'T':
res.push_back('A');
break;
case 'C':
res.push_back('G');
break;
case 'G':
res.push_back('C');
break;
default:
break;
}
}
return res;
}
main.cpp
#include "aligner.h"
using namespace tree_search;
int main()
{
aligner::aligner smith_waterman_aligner("Smith-Waterman", "Blast", 10, 1, false);
auto&& res = smith_waterman_aligner("AACTGACTACGTATCGACTGAGAGTCACTATCTCTTACGATCGATGCATGTACGACGACATCCACTACGGACGATGCATACGTAGCTACGATCGACATCGACTACGTACGACTGACTGACTGACTGACTCAGTACGAGACTACGATCGACTCAGATCGACTCTGACTAGCTACGACTGCAT", "ATAAACTGACTACGTACGATCGAGTCACTTTTTTTTACGATCGATGCATGTACGACGACATCGACTACGACGATGCATACGTAGCTACGACATCGACTACGTACGACTGACTGACTGACTGACTCAGTACGAGACTACGATCGACTCAGATCGACTCTGACTAGCTACGACTGCATACGATCTTA");
res.print(std::cout, 1);
return 0;
}
总结
本文从LCS算法开始,渐进式介绍NW算法、SW算法的基本原理并给出了相关算法的具体实现(CPP)。实质上三种算法的本质均为填写一张结果表,该表的填法由最优子性质决定。本质上动态规划也是一种递归算法,只不过使用一张表将每一次递归的结果记录下来避免二次计算,是一种以空间换时间的算法。