Description
It is well known that a human gene can be considered as a sequence, consisting of four nucleotides, which are simply denoted by four letters, A, C, G, and T. Biologists have been interested in identifying human genes and determining their functions, because these can be used to diagnose human diseases and to design new drugs for them.
A human gene can be identified through a series of time-consuming biological experiments, often with the help of computer programs. Once a sequence of a gene is obtained, the next job is to determine its function.
One of the methods for biologists to use in determining the function of a new gene sequence that they have just identified is to search a database with the new gene as a query. The database to be searched stores many gene sequences and their functions – many researchers have been submitting their genes and functions to the database and the database is freely accessible through the Internet.
A database search will return a list of gene sequences from the database that are similar to the query gene.
Biologists assume that sequence similarity often implies functional similarity. So, the function of the new gene might be one of the functions that the genes from the list have. To determine exactly which one is the right one, another series of biological experiments will be needed.
Your job is to make a program that compares two genes and determines their similarity as explained below. Your program may be used as a part of the database search if you can provide an efficient one.
Given two genes AGTGATG and GTTAG, how similar are they? One of the methods to measure the similarity of two genes is called alignment. In an alignment, spaces are inserted, if necessary, in appropriate positions of the genes to make them equally long, and the resulting genes are scored according to a scoring matrix.
For example, one space is inserted into AGTGATG to result in AGTGAT-G, and three spaces are inserted into GTTAG to result in -GT--TAG. A space is denoted by a minus sign (-). The two genes are now of equal length. These two strings are aligned:
AGTGAT-G
-GT--TAG
In this alignment, there are four matches, namely, G in the second position, T in the third, T in the sixth, and G in the eighth. Each pair of aligned characters is assigned a score according to the following scoring matrix.
      A   C   G   T   -
  A   5  -1  -2  -1  -3
  C  -1   5  -3  -2  -4
  G  -2  -3   5  -2  -2
  T  -1  -2  -2   5  -1
  -  -3  -4  -2  -1   *
* denotes that a space-space match is not allowed. The score of the alignment above is (-3)+5+5+(-2)+(-3)+5+(-3)+5=9.
Of course, many other alignments are possible. One is shown below (a different number of spaces are inserted into different positions):
AGTGATG
-GTTA-G
This alignment gives a score of (-3)+5+5+(-2)+5+(-1)+5=14. So, this one is better than the previous one. As a matter of fact, this one is optimal since no other alignment can have a higher score. So, it is said that the similarity of the two genes is 14.
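The arithmetic for the two alignments above can be checked mechanically. The following is a minimal sketch, assuming the scoring values used by the `ans` table in the solution code further down; the names `kScore`, `symbolIndex`, and `alignmentScore` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scoring matrix, rows and columns in the order A, C, G, T, '-'.
// The '-' against '-' cell is never read: a space-space pair is not allowed.
static const int kScore[5][5] = {
    { 5, -1, -2, -1, -3},
    {-1,  5, -3, -2, -4},
    {-2, -3,  5, -2, -2},
    {-1, -2, -2,  5, -1},
    {-3, -4, -2, -1,  0},
};

// Hypothetical helper: maps a symbol to its row/column in kScore.
static int symbolIndex(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 4;  // anything else is treated as the space '-'
    }
}

// Sums the pair scores of two already-aligned strings of equal length.
int alignmentScore(const std::string& x, const std::string& y) {
    assert(x.size() == y.size());
    int total = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(!(x[i] == '-' && y[i] == '-'));  // space-space is not allowed
        total += kScore[symbolIndex(x[i])][symbolIndex(y[i])];
    }
    return total;
}
```

Calling `alignmentScore("AGTGAT-G", "-GT--TAG")` gives 9 and `alignmentScore("AGTGATG", "-GTTA-G")` gives 14, matching the sums worked out by hand.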
Input
The input consists of T test cases. The number of test cases (T) is given in the first line of the input file. Each test case consists of two lines: each line contains an integer, the length of a gene, followed by a gene sequence. The length of each gene sequence is at least one and does not exceed 100.
Output
The output should print the similarity of each test case, one per line.
Sample Input
2
7 AGTGATG
5 GTTAG
7 AGCTATT
9 AGCTTTAAA
Sample Output
14
21
Source
Taejon 2001
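Analysis
This is essentially a variant of the LCS (longest common subsequence) problem. Let s1 and s2 be the input strings, let a be the scoring matrix above, and let dp[i][j] be the best score for aligning the first i characters of s1 with the first j characters of s2. At any cell dp[i][j] there are only three cases:
Match s1[i] directly with s2[j]. The value is dp[i][j]=dp[i-1][j-1]+a[s1[i]][s2[j]].
Pad s2 with a -. The value is dp[i][j]=dp[i-1][j]+a[s1[i]][-]. Here dp[i-1][j] has already aligned the first i-1 characters of s1 with the first j characters of s2, so all j characters of s2 are used up by the time we reach dp[i][j], and s1[i] must be paired with -.
Pad s1 with a -. The value is dp[i][j]=dp[i][j-1]+a[-][s2[j]].
There is no case that pairs - with -. The problem disallows it, and such a pair would be superfluous anyway: two spaces only add cost.
Take the maximum of the three cases and keep filling the table row by row.
Note that the first row and the first column must be initialized, treating every character in them as paired with -.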
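Written out in full, the recurrence described above can be stated as follows. This is a restatement in LaTeX, assuming 1-based indexing of s1 and s2 and writing n1 and n2 for their lengths; the answer is dp[n1][n2].

```latex
\begin{aligned}
dp[0][0] &= 0,\\
dp[i][0] &= dp[i-1][0] + a[s_1[i]][\text{-}], && 1 \le i \le n_1,\\
dp[0][j] &= dp[0][j-1] + a[\text{-}][s_2[j]], && 1 \le j \le n_2,\\
dp[i][j] &= \max\bigl\{\, dp[i-1][j-1] + a[s_1[i]][s_2[j]],\;
                         dp[i-1][j] + a[s_1[i]][\text{-}],\;
                         dp[i][j-1] + a[\text{-}][s_2[j]] \,\bigr\},
          && 1 \le i \le n_1,\ 1 \le j \le n_2.
\end{aligned}
```

The table has (n1+1)(n2+1) cells and each is filled in constant time, so with genes of length at most 100 a test case needs roughly ten thousand cell updates.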
Code
#include <stdio.h>
#include <algorithm>
using namespace std;

char a[110];       // first gene, stored from index 1
char b[110];       // second gene, stored from index 1
int dp[110][110];  // dp[i][j]: best score for a[1..i] aligned with b[1..j]

// Largest of three values.
int maxx(int a, int b, int c)
{
    return max(a, max(b, c));
}

// Scoring matrix, rows and columns in the order A, C, G, T, '-'.
// The last entry ('-' against '-') is a placeholder that is never read.
int ans[5][5] = {
    { 5, -1, -2, -1, -3},
    {-1,  5, -3, -2, -4},
    {-2, -3,  5, -2, -2},
    {-1, -2, -2,  5, -1},
    {-3, -4, -2, -1, 100000}};

// Maps a symbol to its row/column in ans.
int find(char c)
{
    if (c == 'A') return 0;
    if (c == 'C') return 1;
    if (c == 'G') return 2;
    if (c == 'T') return 3;
    return 4;  // '-'
}

int main()
{
    int t, n1, n2;
    scanf("%d", &t);
    while (t--)
    {
        // Each line holds a length followed by the gene itself.
        scanf("%d %108s", &n1, a + 1);
        scanf("%d %108s", &n2, b + 1);

        // First column and first row: every character is paired with '-'.
        // The two loops are separate because n1 and n2 may differ.
        dp[0][0] = 0;
        for (int i = 1; i <= n1; i++)
            dp[i][0] = dp[i - 1][0] + ans[find(a[i])][find('-')];
        for (int j = 1; j <= n2; j++)
            dp[0][j] = dp[0][j - 1] + ans[find('-')][find(b[j])];

        for (int i = 1; i <= n1; i++)
        {
            for (int j = 1; j <= n2; j++)
            {
                dp[i][j] = maxx(
                    dp[i - 1][j - 1] + ans[find(a[i])][find(b[j])],  // pair a[i] with b[j]
                    dp[i - 1][j] + ans[find(a[i])][find('-')],       // pair a[i] with '-'
                    dp[i][j - 1] + ans[find('-')][find(b[j])]);      // pair '-' with b[j]
            }
        }
        printf("%d\n", dp[n1][n2]);
    }
    return 0;
}
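The problem only asks for the score, but the same table can also be walked backwards to recover one optimal alignment, which is handy for checking the DP by eye. The following is a sketch of that idea and not part of the submitted solution; `kPairScore`, `indexOf`, and `bestAlignment` are hypothetical names, the strings are 0-indexed here, and when several alignments tie for the best score it simply returns one of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Same scoring values as the ans table, order A, C, G, T, '-'.
static const int kPairScore[5][5] = {
    { 5, -1, -2, -1, -3},
    {-1,  5, -3, -2, -4},
    {-2, -3,  5, -2, -2},
    {-1, -2, -2,  5, -1},
    {-3, -4, -2, -1,  0},  // last cell unused: '-' is never paired with '-'
};

static int indexOf(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 4;  // '-'
    }
}

// Returns the two gapped strings of one optimal alignment of s1 and s2.
std::pair<std::string, std::string> bestAlignment(const std::string& s1,
                                                  const std::string& s2) {
    const int n = static_cast<int>(s1.size());
    const int m = static_cast<int>(s2.size());
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1, 0));

    // Fill the table exactly as in the recurrence.
    for (int i = 1; i <= n; ++i)
        dp[i][0] = dp[i - 1][0] + kPairScore[indexOf(s1[i - 1])][4];
    for (int j = 1; j <= m; ++j)
        dp[0][j] = dp[0][j - 1] + kPairScore[4][indexOf(s2[j - 1])];
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            dp[i][j] = std::max({
                dp[i - 1][j - 1] + kPairScore[indexOf(s1[i - 1])][indexOf(s2[j - 1])],
                dp[i - 1][j] + kPairScore[indexOf(s1[i - 1])][4],
                dp[i][j - 1] + kPairScore[4][indexOf(s2[j - 1])]});

    // Walk back from (n, m), at each cell following a case that produced it.
    std::string x, y;
    int i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            dp[i][j] == dp[i - 1][j - 1] +
                            kPairScore[indexOf(s1[i - 1])][indexOf(s2[j - 1])]) {
            x += s1[i - 1]; y += s2[j - 1]; --i; --j;
        } else if (i > 0 &&
                   dp[i][j] == dp[i - 1][j] + kPairScore[indexOf(s1[i - 1])][4]) {
            x += s1[i - 1]; y += '-'; --i;
        } else {
            x += '-'; y += s2[j - 1]; --j;
        }
    }
    // The strings were built from the end, so reverse them.
    std::reverse(x.begin(), x.end());
    std::reverse(y.begin(), y.end());
    return {x, y};
}
```

For AGTGATG and GTTAG the returned pair has equal lengths, contains the original genes once the - characters are removed, and scores 14 under the matrix.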