动态规划之最长公共子序列

最新推荐文章于 2024-08-23 20:07:50 发布

qeesung

最新推荐文章于 2024-08-23 20:07:50 发布

阅读量998

点赞数 1

分类专栏： C++ 算法算法鸣泣之时文章标签：动态规划算法 LCS

本文链接：https://blog.csdn.net/ii1245712564/article/details/45056045

版权

C++ 同时被 3 个专栏收录

40 篇文章 2 订阅

订阅专栏

算法

30 篇文章 1 订阅

订阅专栏

算法鸣泣之时

30 篇文章 6 订阅

订阅专栏

我们之前提到过过动态规划的几个经典问题：
动态规划原理：http://blog.csdn.net/ii1245712564/article/details/45040037
钢条切割问题：http://blog.csdn.net/ii1245712564/article/details/44464689
矩阵链乘法问题：http://blog.csdn.net/ii1245712564/article/details/44464689
今天我们来看一下动态规划的另外一个经典问题：最长公共子序列（longest-common-subsequence）

问题背景

在生物学中，经常需要比较两个不同生物的DNA，一个DNA串由由一串称为碱基的的分子组成，碱基有鸟嘌呤，腺嘌呤，胞嘧啶，胸腺嘧啶四中，我们用英文字母的首字母表示四种碱基，那么DNA就是在有限集{A,C,G,T}上的一个字符串。例如某种生物的DNA序列为：S1=ACCGGTCACGGTCGGAGTGGGAAC…，S2=GTAAAGTCACAAAACGTGT…，我们比较两个DNA串的原因就是希望确定他们的相似度，作为衡量两个物种相似度的标准。如果一个串是另外一个串的子串，那么它们就是相似的，但是这里彼此都不是彼此的子串。于是我们就有了新的衡量标准：寻找一个新串S3，使得S3中的元素都在S1，S2中出现过，且不要求连续，但是碱基在三个串中出现的顺序一定要相同。可以找到的S3的长度越长，表明两个物种越相似！

问题描述

给定两个字符串X[x1,x2,x3,x4,…,xn]和Y[y1,y2,y3,y4,…,ym],求X和Y的最长公共子序列LCS！

问题分析

首先我们尝试一下蛮力法，看看将其中一个序列X[x1,x2,…,xn]里面的所有子串xi提取出来，并与另外一个序列Y[y1,y2,…,ym]进行比较，检测子串xi中的元素是否在Y中按照相同的顺序出现。这里X的子序列一共有2^n个（每一个元素就是加入子串或者不加入子串两种选择，所以共有2*2*2*…2=2^n个），那么在和Y串进行对比的时候，每次都是O(m)。蛮力法运行时间是在指数级别的，要是两个串的长度长一点的话，比较一次都够你睡个午觉了。
那有没有更快的方法呢？动态规划上场。。。

问题解决

首先我们声明定义两种记法，方便后面的书写：我们记X[x1,x2,…,xn]为Xn,比如X2=[x1,x2]了

1.刻画最长公共子序列的特征

令X=[x1,x2,…,xn]和Y=[y1,y2,…,ym]，并且他们的最长公共子序列为Z=[z1,z2,…,zk],那么我们有一下结论：

如果xn == ym, 则xn == ym == zk 且 Zk-1就是Xn-1和Ym-1的最长公共子序列。
如果xn != ym , 那么当xn != zk 时，Zk是Xn-1和Ym的最长公共子序列。
如果xn != ym , 那么当ym !=zk时，Zk是Xn和Ym-1的最长公共子序列。

证明：

在xn == ym时，如果zk != xn 或者zk !=ym，那么X和Y的最长公共子序列为Z+1，与原问题矛盾，所以在xn == ym时， xn == ym == zk成立。如果在xn == ym,Xn-1和Ym-1存在一个比k-1更长的公共子序列W,又因为xm == ym，W+1>Z,与原问题矛盾，不成立！
如果xn != zk，那么Zk是Xn-1和Ym的最长公共子序列,，如果存在一个长度大于k的序列W是Xn-1和Ym的最长公共子序列，那么W同样也是Xn和Ym的最长公共子序列，W>Z，与原问题矛盾!
同上可证！

我们看出，LCS问题是具有最优子结构的！

2.一个递归解

由上面得到的最长公共子序列的特征，我们知道，当xn == ym时，最长公共子序列长度就是在子问题算出的最长公共子序列长度上面+1即可，如果 xn != ym，那么最长公共子序列就是[Xn-1,Ym]和[Xn, Ym-1]中最长的哪一个！于是我们等到下面的递归解：

c [i, j] = ⎧ ⎩ ⎨ 0, c [i - 1, j - 1], m a x {c [i - 1, j], c [i, j - 1]} i f i = = 0 或 者 j = = 0 i f x n = = y m i f x n! = y m

$c[i,j]= \begin{cases} 0, &if\ i==0 或者j==0 \\ c[i-1,j-1] , &if\ xn\ ==\ ym \\ max\lbrace c[i-1, j],c[i,j-1]\rbrace &if\ xn\ !=\ ym \end{cases}$

3.计算LCS长度

通过上面的递归式，我们现在采用两种方法来求解LCS的长度

/** 这里采用自顶向下的算法 */
#include <iostream>
#include <vector>
#include <climits>

using namespace std;

unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData);
/**
 * 计算最长公共子序列长度
 * @param  X X序列
 * @param  Y Y序列
 * @return   最大长度
 */
size_t LCS(std::vector<char> &X , std::vector<char> &Y)
{
    if(X.size() == 0 ||  Y.size() == 0)
        return 0;
    /** 创建一个数组来保存临时数据 */
    std::vector<std::vector<unsigned int> > tempData(X.size()+1 , std::vector<unsigned>());
    for (int i = 0; i <= X.size(); ++i)
    {
        for (int j = 0; j <= Y.size(); ++j)
        {
            if(i == 0 || j == 0)
                tempData[i].push_back(0);
            else
                tempData[i].push_back(UINT_MAX);
        }
    }
    dealLCS(X , Y , tempData);//开始进行计算
    return tempData[X.size()][Y.size()];
}

/**
 * 实际计算LCS的最大长度
 * @param X        X序列
 * @param Y        Y序列
 * @param tempData 缓存数据
 */
unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData)
{
    if(X.size() == 0 || Y.size() == 0 )
        return 0;
    //首先看看这个问题有没有被计算过
    if(tempData[X.size()][Y.size()] != UINT_MAX)
        return tempData[X.size()][Y.size()];
    //表示没有计算过，下面开始计算
    unsigned int maxLength = 0;
    if(X[X.size()-1] == Y[Y.size()-1])// 最后两个元素相等的情况
    {
        maxLength = dealLCS(std::vector<char>(X.begin() , X.end()-1),\
                            std::vector<char>(Y.begin() , Y.end()-1),\
                            tempData)+1;
    }
    else//最后两个元素不相等的情况
    {
        unsigned int temp1 = dealLCS(std::vector<char>(X.begin() , X.end()-1),\
                                     std::vector<char>(Y.begin() , Y.end()) , tempData);
        unsigned int temp2 = dealLCS(std::vector<char>(X.begin() , X.end()),\
                                     std::vector<char>(Y.begin() , Y.end()-1), tempData);
        maxLength = temp1 > temp2 ? temp1 : temp2;
    }
    tempData[X.size()][Y.size()] = maxLength;
    return maxLength;
}  


int main(int argc, char const *argv[])
{
    char array1[]={'A','B','C','B','D','A','B'};
    char array2[]={'B','D','C','A','B','A'};
    std::vector<char> X(array1 , array1+sizeof(array1));
    std::vector<char> Y(array2 , array2+sizeof(array2));
    cout<<"The Lcs length is :"<<LCS(X,Y)<<endl;
    return 0;
}

下面这个是自底向上的方法：

/** 这里采用自底向上的算法实现 */

#include <iostream>
#include <vector>
#include <climits>

using namespace std;

unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData);
/**
 * 计算最长公共子序列长度
 * @param  X X序列
 * @param  Y Y序列
 * @return   最大长度
 */
size_t LCS(std::vector<char> &X , std::vector<char> &Y)
{
    if(X.size() == 0 ||  Y.size() == 0)
        return 0;
    /** 创建一个数组来保存临时数据 */
    std::vector<std::vector<unsigned int> > tempData(X.size()+1 , std::vector<unsigned>());
    for (int i = 0; i <= X.size(); ++i)
    {
        for (int j = 0; j <= Y.size(); ++j)
        {
            if(i == 0 || j == 0)
                tempData[i].push_back(0);
            else
                tempData[i].push_back(UINT_MAX);
        }
    }
    dealLCS(X , Y , tempData);//开始进行计算
    return tempData[X.size()][Y.size()];
}

/**
 * 实际计算LCS的最大长度
 * @param X        X序列
 * @param Y        Y序列
 * @param tempData 缓存数据
 */
unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData)
{
    if(X.size() == 0 || Y.size() == 0)
        return 0;
    for (unsigned int i = 0; i < X.size(); ++i)
    {
        for (unsigned int j = 0; j < Y.size(); ++j)
        {
            if(X[i] == Y[j])
            {
                tempData[i+1][j+1] = tempData[i][j]+1;
            }
            else
            {
                unsigned int temp1 = tempData[i+1][j];
                unsigned int temp2 = tempData[i][j+1];
                tempData[i+1][j+1] = temp1 > temp2 ? temp1 : temp2;
            }

        }
    }
    return tempData[X.size()][Y.size()];
}

int main(int argc, char const *argv[])
{
    char array1[]={'A','B','C','B','D','A','B'};
    char array2[]={'B','D','C','A','B','A'};
    std::vector<char> X(array1 , array1+sizeof(array1));
    std::vector<char> Y(array2 , array2+sizeof(array2));
    cout<<"The Lcs length is :"<<LCS(X,Y)<<endl;
    while(1);
    return 0;
}

如果要了解自上而下和自下而上两种算法的区别，请见：
重叠子问题：http://blog.csdn.net/ii1245712564/article/details/45040037

上面两个算法里面都用到了tempData数组来保存零时数据,tempData[i][j]表示从 $Xi和Yj$ 的最长公共子串的长度，于是最后我们得到自下而上的tempData矩阵：

t e m p D a t a [] [] = ⎛ ⎝ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ 00000000001111110011122200122222011222330122333401223344 ⎞ ⎠ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟

$tempData[\ ][\ ]= \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 & 2 & 2 \\ 0 & 1 & 1 & 2 & 2 & 2 & 2 \\ 0 & 1 & 1 & 2 & 2 & 3 & 3 \\ 0 & 1 & 2 & 2 & 2 & 3 & 3 \\ 0 & 1 & 2 & 2 & 3 & 3 & 4\\ 0 & 1 & 2 & 2 & 3 & 4 & 4 \\ \end{pmatrix}$
最终右下角的tempData[8][7]就是X8和Y7保存的最长公共子串的长度！

4.构造LCS

下面我们以上面的tempData构成的矩阵为基础，从右下角开始，寻找最长的子序列！

自上而下构造LCS

/** 自顶向下带有解决方案 */
#include <iostream>
#include <vector>
#include <climits>

using namespace std;

unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData);
void printSolution(std::vector<char> X ,\
                   std::vector<std::vector<unsigned int> > & solution ,\
                   size_t i , size_t j)
{
    if(i == 0 || j ==0)
        return;
    if(solution[i][j] == solution[i-1][j])
        printSolution(X,solution,i-1,j);
    else if(solution[i][j] == solution[i][j-1])
        printSolution(X,solution,i,j-1);
    else if(solution[i][j]-1 == solution[i-1][j-1])
    {
        cout<<i<<" "<<j<<" "<<X[i-1]<<"\n";
        printSolution(X,solution , i-1 ,j-1);
    }
}
/**
 * 计算最长公共子序列长度
 * @param  X X序列
 * @param  Y Y序列
 * @return   最大长度
 */
size_t LCS(std::vector<char> &X , std::vector<char> &Y)
{
    if(X.size() == 0 ||  Y.size() == 0)
        return 0;
    /** 创建一个数组来保存临时数据 */
    std::vector<std::vector<unsigned int> > tempData(X.size()+1 , std::vector<unsigned>());
    for (int i = 0; i <= X.size(); ++i)
    {
        for (int j = 0; j <= Y.size(); ++j)
        {
            if(i == 0 || j == 0)
                tempData[i].push_back(0);
            else
                tempData[i].push_back(UINT_MAX);
        }
    }
    dealLCS(X , Y , tempData);//开始进行计算
    printSolution(X,tempData,X.size(),Y.size());
    return tempData[X.size()][Y.size()];
}

/**
 * 实际计算LCS的最大长度
 * @param X        X序列
 * @param Y        Y序列
 * @param tempData 缓存数据
 */
unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData)
{
    if(X.size() == 0 || Y.size() == 0 )
        return 0;
    //首先看看这个问题有没有被计算过
    if(tempData[X.size()][Y.size()] != UINT_MAX)
        return tempData[X.size()][Y.size()];
    //表示没有计算过，下面开始计算
    unsigned int maxLength = 0;
    if(X[X.size()-1] == Y[Y.size()-1])// 最后两个元素相等的情况
    {
        maxLength = dealLCS(std::vector<char>(X.begin() , X.end()-1),\
                            std::vector<char>(Y.begin() , Y.end()-1),\
                            tempData)+1;
    }
    else//最后两个元素不相等的情况
    {
        unsigned int temp1 = dealLCS(std::vector<char>(X.begin() , X.end()-1),\
                                     std::vector<char>(Y.begin() , Y.end()) , tempData);
        unsigned int temp2 = dealLCS(std::vector<char>(X.begin() , X.end()),\
                                     std::vector<char>(Y.begin() , Y.end()-1), tempData);
        maxLength = temp1 > temp2 ? temp1 : temp2;
    }
    tempData[X.size()][Y.size()] = maxLength;
    return maxLength;
}  


int main(int argc, char const *argv[])
{
    char array1[]={'A','B','C','B','D','A','B'};
    char array2[]={'B','D','C','A','B','A'};
    std::vector<char> X(array1 , array1+sizeof(array1));
    std::vector<char> Y(array2 , array2+sizeof(array2));
    cout<<"The Lcs length is :"<<LCS(X,Y)<<endl;
    while(1);
    return 0;
}

自下而上的构造LCS

/** 这里采用自底向上带有绝决方案的算法实现 */
#include <iostream>
#include <vector>
#include <climits>

using namespace std;

unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData);

void printSolution(std::vector<char> X ,\
                   std::vector<std::vector<unsigned int> > & solution ,\
                   size_t i , size_t j)
{
    if(i == 0 || j ==0)
        return;
    if(solution[i][j] == solution[i-1][j])
        printSolution(X,solution,i-1,j);
    else if(solution[i][j] == solution[i][j-1])
        printSolution(X,solution,i,j-1);
    else if(solution[i][j]-1 == solution[i-1][j-1])
    {
        cout<<i<<" "<<j<<" "<<X[i-1]<<"\n";
        printSolution(X,solution , i-1 ,j-1);
    }
}
/**
 * 计算最长公共子序列长度
 * @param  X X序列
 * @param  Y Y序列
 * @return   最大长度
 */
size_t LCS(std::vector<char> &X , std::vector<char> &Y)
{
    if(X.size() == 0 ||  Y.size() == 0)
        return 0;
    /** 创建一个数组来保存临时数据 */
    std::vector<std::vector<unsigned int> > tempData(X.size()+1 , std::vector<unsigned>());
    for (int i = 0; i <= X.size(); ++i)
    {
        for (int j = 0; j <= Y.size(); ++j)
        {
            if(i == 0 || j == 0)
                tempData[i].push_back(0);
            else
                tempData[i].push_back(UINT_MAX);
        }
    }
    dealLCS(X , Y , tempData);//开始进行计算
    printSolution(X,tempData,X.size(),Y.size());
    return tempData[X.size()][Y.size()];
}

/**
 * 实际计算LCS的最大长度
 * @param X        X序列
 * @param Y        Y序列
 * @param tempData 缓存数据
 */
unsigned int dealLCS(std::vector<char>  X , std::vector<char>  Y ,
             std::vector<std::vector<unsigned int> > & tempData)
{
    if(X.size() == 0 || Y.size() == 0)
        return 0;
    for (unsigned int i = 0; i < X.size(); ++i)
    {
        for (unsigned int j = 0; j < Y.size(); ++j)
        {
            if(X[i] == Y[j])
            {
                tempData[i+1][j+1] = tempData[i][j]+1;
            }
            else
            {
                unsigned int temp1 = tempData[i+1][j];
                unsigned int temp2 = tempData[i][j+1];
                tempData[i+1][j+1] = temp1 > temp2 ? temp1 : temp2;
            }

        }
    }
    return tempData[X.size()][Y.size()];
}

int main(int argc, char const *argv[])
{
    char array1[]={'A','B','C','B','D','A','B'};
    char array2[]={'B','D','C','A','B','A'};
    std::vector<char> X(array1 , array1+sizeof(array1));
    std::vector<char> Y(array2 , array2+sizeof(array2));
    cout<<"The Lcs length is :"<<LCS(X,Y)<<endl;
    while(1);
    return 0;
}