Biological Problem:
Rearrangements (reversals, or flips, of segments of genomic sequence) occurred often in mammalian evolutionary history. For example, the human X chromosome can be viewed as a rearranged version of the mouse X chromosome. Biologists are interested in the most parsimonious evolutionary scenario, that is, the one involving the smallest number of reversals.
To Describe the Problem: a reversal ∂(i, j) flips the segment of π between positions i and j.
π = π1*π2*...*πi-1*(πi*πi+1*...*πj-1*πj)*πj+1*...*πn
π*∂(i, j) = π1*π2*...*πi-1*(πj*πj-1*...*πi+1*πi)*πj+1*...*πn
Reversal Distance Problem: given permutations π and σ, output a series of reversals ∂ (t in total) that transforms π into σ, such that t is minimum.
When we fix σ as the standard order 1, 2, 3, ..., n, the "Reversal Distance Problem" becomes the "Sorting by Reversal" Problem: given a permutation π, output a series of reversals ∂ (t in total) that transforms π into the identity permutation, such that t is minimum.
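To make the problem statement concrete, here is a minimal sketch that computes the minimum number of reversals for a small permutation by breadth-first search over all permutations reachable by reversals. The function names reverse and reversal_distance are illustrative choices, not part of the notes, and the search is exponential in n, so it is only usable for short permutations.

```python
from collections import deque


def reverse(perm, i, j):
    """Apply the reversal on positions i..j (1-based, inclusive) to a tuple."""
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


def reversal_distance(perm):
    """Exact minimum number of reversals needed to sort perm (brute force)."""
    start = tuple(perm)
    n = len(start)
    target = tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        # Try every reversal of a segment with at least two elements.
        for i in range(1, n):
            for j in range(i + 1, n + 1):
                nxt = reverse(current, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
    return None  # not reached when perm is a permutation of 1..n


print(reversal_distance([6, 1, 2, 3, 4, 5]))
```

Because breadth-first search visits permutations in order of increasing number of reversals, the first time the identity is reached gives the minimum t.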
To Solve the Problem: (Greedy Algorithm)
SIMPLE REVERSAL SORT (π): (however, this method is quite short-sighted)
for i <- 1 to n-1
    j <- position of element i in π (that is, πj = i)
    if j ≠ i
        π <- π*∂(i, j)
    if π is the identity permutation
        return
If we define prefix(π) to be the number of already-sorted elements of π, then the strategy is to increase prefix(π) at every step.
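The greedy procedure above can be sketched in Python as follows. This is a minimal illustration, assuming the permutation is given as a list of the integers 1..n; the name simple_reversal_sort and the choice to return the list of reversals are assumptions made for the example.

```python
def simple_reversal_sort(perm):
    """Sort perm by moving element i to position i, for i = 1..n-1.

    Returns the list of reversals (i, j), 1-based and inclusive,
    together with the sorted permutation.
    """
    perm = list(perm)
    n = len(perm)
    identity = list(range(1, n + 1))
    steps = []
    for i in range(1, n):
        j = perm.index(i) + 1  # 1-based position of element i
        if j != i:
            # Reversing positions i..j brings element i to position i.
            perm[i - 1:j] = perm[i - 1:j][::-1]
            steps.append((i, j))
        if perm == identity:
            break
    return steps, perm


steps, result = simple_reversal_sort([6, 1, 2, 3, 4, 5])
print(len(steps), result)
```

On 6 1 2 3 4 5 this sketch performs five reversals, each of which extends the sorted prefix by exactly one element.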
-> Before this, computer scientists studied the "Prefix Reversal Problem", also known as the "Pancake Flipping Problem". It is similar to the "Sorting by Reversal" problem, except that every reversal ∂(i, j) has i = 1.
-> Approximation Algorithm: when no efficient optimal algorithm is known, we often use an approximation algorithm, which returns an approximate solution. On input π its approximation ratio is A(π) / OPT(π), where A(π) is the value of the solution produced by algorithm A and OPT(π) is the value of an optimal solution. A ratio of 1 means the algorithm is optimal.
The performance guarantee of an algorithm is the worst case of this ratio over all inputs of a given size: max(A(π)/OPT(π)) for a minimization problem and min(A(π)/OPT(π)) for a maximization problem. For example, Sorting by Reversal asks for the smallest number of reversals, so it is a minimization problem and the guarantee of an algorithm for it is max(A(π)/OPT(π)).
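The approximation ratio of SIMPLE REVERSAL SORT is at least (n-1)/2. For example, on π = n 1 2 3 ... (n-1) it performs n-1 reversals, even though the reversal distance is d(π) = 2: ∂(1, n) followed by ∂(1, n-1) sorts π.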
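As a quick check of this worst case, the sketch below builds π = n 1 2 ... (n-1) and applies the two reversals ∂(1, n) and ∂(1, n-1). The helper names reverse, worst_case and sort_in_two are hypothetical and exist only for this illustration.

```python
def reverse(perm, i, j):
    """Apply the reversal on positions i..j (1-based, inclusive)."""
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


def worst_case(n):
    """The permutation n 1 2 ... (n-1)."""
    return [n] + list(range(1, n))


def sort_in_two(n):
    """Sort the worst-case permutation with exactly two reversals."""
    perm = worst_case(n)
    perm = reverse(perm, 1, n)      # (n-1) (n-2) ... 1 n
    perm = reverse(perm, 1, n - 1)  # 1 2 ... (n-1) n
    return perm


for n in (4, 6, 10):
    # The greedy algorithm needs n-1 reversals here, against 2 shown above.
    print(n, sort_in_two(n), (n - 1) / 2)
```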
BREAKPOINTS:
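The problem with SIMPLE REVERSAL SORT is that prefix(π) is a naive measure of our progress toward the identity permutation. So we introduce a new concept: the breakpoint.
If πi and πi+1 are not consecutive numbers (that is, |πi - πi+1| ≠ 1), then there is a breakpoint between πi and πi+1. The permutation is extended with π0 = 0 and πn+1 = n+1 so that breakpoints at both ends are counted, and b(π) denotes the number of breakpoints. A strip is an interval between two consecutive breakpoints, and every strip is either increasing or decreasing. Here we introduce another way to solve the problem: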
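The definitions of breakpoint and strip can be sketched as below. The sketch assumes the usual convention of extending the permutation with 0 in front and n+1 at the end, so that a misplaced first or last element is also counted; the function names extended, breakpoints and strips are illustrative.

```python
def extended(perm):
    """Return perm framed by 0 and n+1 (an assumed convention)."""
    return [0] + list(perm) + [len(perm) + 1]


def breakpoints(perm):
    """Count adjacent pairs of the extended permutation that are not consecutive."""
    ext = extended(perm)
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def strips(perm):
    """Split the extended permutation into maximal runs without breakpoints."""
    ext = extended(perm)
    result, current = [], [ext[0]]
    for value in ext[1:]:
        if abs(value - current[-1]) == 1:
            current.append(value)   # still inside the same strip
        else:
            result.append(current)  # a breakpoint closes the strip
            current = [value]
    result.append(current)
    return result


perm = [2, 1, 3, 4, 5, 8, 7, 6]
print(breakpoints(perm))
print(strips(perm))
```

For 2 1 3 4 5 8 7 6 the sketch finds four breakpoints; the strips 2 1 and 8 7 6 are decreasing and 3 4 5 is increasing.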
BREAKPOINT REVERSAL SORT (π):
while b(π) > 0
    among all reversals, choose a reversal ∂ that minimizes b(π*∂)
    π <- π*∂
    output π
return
However, this method raises two questions: 1. Does the algorithm terminate? 2. Is it a better approximation algorithm than SIMPLE REVERSAL SORT?
These two questions are answered by two theorems.
Theorem 5.1: If a permutation π contains a decreasing strip, then there is a reversal ∂ that decreases the number of breakpoints in π, that is, b(π*∂) < b(π). So, as long as a decreasing strip exists, each step removes at least one breakpoint and the algorithm eventually terminates. If there is no decreasing strip, flip one increasing strip to make it decreasing, then continue.
Theorem 5.2: With this modification, the algorithm is an approximation algorithm with a performance guarantee of at most 4. In the worst case every step removes only one breakpoint and every such step must be preceded by flipping an increasing strip into a decreasing one, so the algorithm uses at most 2b(π) reversals and its ratio is at most 2b(π)/d(π). Since a single reversal removes at most two breakpoints, d(π) >= b(π)/2, and therefore 2b(π)/d(π) <= 4.
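The breakpoint-based strategy, including the rule of flipping an increasing strip when no decreasing strip exists, might be sketched as follows. Several things here are assumptions rather than part of the notes: the permutation is extended with 0 and n+1, a single-element strip counts as decreasing unless it holds 0 or n+1, ties between equally good reversals are broken by taking the first one found, and all function names are illustrative. The reversal is chosen by trying every pair (i, j), which is slow but keeps the sketch short.

```python
def reverse(perm, i, j):
    """Apply the reversal on positions i..j (1-based, inclusive)."""
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


def breakpoints(perm):
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def strips(perm):
    """Return (start, end, is_decreasing) per strip, indexed in the extended permutation."""
    ext = [0] + list(perm) + [len(perm) + 1]
    found, start = [], 0
    for k in range(1, len(ext) + 1):
        if k == len(ext) or abs(ext[k] - ext[k - 1]) != 1:
            end = k - 1
            if start == end:
                # Single element: decreasing unless it is the framing 0 or n+1.
                decreasing = 0 < start < len(ext) - 1
            else:
                decreasing = ext[start] > ext[end]
            found.append((start, end, decreasing))
            start = k
    return found


def improved_breakpoint_reversal_sort(perm):
    perm = list(perm)
    n = len(perm)
    steps = []
    while breakpoints(perm) > 0:
        current = strips(perm)
        if any(decreasing for _, _, decreasing in current):
            # Greedy step: the reversal that leaves the fewest breakpoints.
            candidates = ((i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1))
            i, j = min(candidates, key=lambda ij: breakpoints(reverse(perm, *ij)))
        else:
            # No decreasing strip: flip the first increasing strip inside 1..n.
            i, j = next((s, e) for s, e, _ in current if s >= 1 and e <= n)
        perm = reverse(perm, i, j)
        steps.append((i, j))
    return steps, perm


print(improved_breakpoint_reversal_sort([2, 1, 3, 4, 5, 8, 7, 6]))
```

Positions in the extended permutation coincide with 1-based positions in π, which is why the strip boundaries can be used directly as the reversal (i, j).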
In addition, there is a greedy algorithm for the Motif Finding Problem, where t is the number of sequences and n is their length. It first finds the two closest l-mers in sequences 1 and 2 and forms a 2 x l seed matrix, which requires l*(n-l+1)**2 operations. Then, for each of the remaining t-2 sequences, it searches for the l-mer that maximizes Score(s, i) and adds it as a new row to the seed matrix. This step requires l*(n-l+1) operations in each iteration. Thus the running time of this algorithm is O(ln**2 + lnt), which is vastly better than the O(ln**t) of exhaustive motif search or the O(nt*4**l) of median string search.
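A minimal sketch of this greedy motif search is given below. It assumes that the score of a set of l-mers is the consensus score (for each column, the count of the most frequent letter, summed over columns), that start positions are 0-based, and that ties are broken by taking the first best candidate; the function names are illustrative.

```python
from collections import Counter
from itertools import product


def score(lmers):
    """Consensus score: per column, the count of the most frequent letter."""
    return sum(max(Counter(column).values()) for column in zip(*lmers))


def greedy_motif_search(dna, l):
    """Return 0-based start positions and the chosen l-mers, one per sequence."""
    def lmer(seq, start):
        return seq[start:start + l]

    def starts(seq):
        return range(len(seq) - l + 1)

    # Seed: the best-scoring pair of l-mers from the first two sequences.
    s1, s2 = max(
        product(starts(dna[0]), starts(dna[1])),
        key=lambda p: score([lmer(dna[0], p[0]), lmer(dna[1], p[1])]),
    )
    positions = [s1, s2]
    motifs = [lmer(dna[0], s1), lmer(dna[1], s2)]

    # Extend the seed one sequence at a time, keeping earlier choices fixed.
    for seq in dna[2:]:
        best = max(starts(seq), key=lambda i: score(motifs + [lmer(seq, i)]))
        positions.append(best)
        motifs.append(lmer(seq, best))
    return positions, motifs


print(greedy_motif_search(["TTACGTTT", "GGGACGTG", "CACGTCCC"], 4))
```

Because earlier rows are never revisited, a poor seed cannot be corrected later, which is the price paid for the lower running time.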