An Introduction to Bioinformatics Algorithms - II - pages 79-114

Coraline_third_year

Category: Book Notes. Tags: algorithm

Article link: https://blog.csdn.net/Coraline_second_year/article/details/37772141

Finally stepping into the world of algorithms...

07/14

The Restriction Mapping problem is introduced as an example of exhaustive search. Restriction Mapping: to locate the restriction enzyme sites along a sequence, biologists use the enzyme to partially digest the sequence, which yields a collection of shorter fragments. From the lengths of those fragments they infer the positions of the restriction sites. (This is the Partial Digest Problem, PDP, also called the Turnpike problem.) Note that the restriction map inferred from the length information is not unique: a map and its mirror image, for example, produce exactly the same fragment lengths.

Impractical Restriction Mapping Algorithms: 1. BruteForcePDP: given the multiset L of fragment lengths, find a set X of n integers whose pairwise distances reproduce L. Take the largest length M in L as the width of the map, so X runs from 0 to M. Then try every set of n - 2 integers between 0 and M as the interior points and check whether ∆X, the multiset of all pairwise distances in X, equals L. This takes O(M**(n-2)) time. 2. A smarter variant does not scan every integer between 0 and M, but picks interior points only from the values in L, since every point of X is itself a distance from 0 and must therefore appear in L. This takes O(n**(2n-4)) time.
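A minimal Python sketch of the second brute-force variant may make the search space concrete. The function names `delta` and `another_brute_force_pdp` are my own, not from the book, and the sketch assumes L is a complete multiset of n(n-1)/2 positive lengths.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of all pairwise distances between the given points."""
    return Counter(b - a for a, b in combinations(sorted(points), 2))


def another_brute_force_pdp(lengths):
    """Return every point set X (containing 0 and M) whose delta equals lengths."""
    target = Counter(lengths)
    width = max(lengths)  # M, the largest fragment length
    # n points produce n*(n-1)/2 distances, so recover n from len(lengths).
    n = next(k for k in range(2, len(lengths) + 2)
             if k * (k - 1) // 2 == len(lengths))
    # Interior points are chosen only from values that occur in L.
    candidates = sorted(set(lengths) - {width})
    solutions = []
    for inner in combinations(candidates, n - 2):
        points = (0,) + inner + (width,)
        if delta(points) == target:
            solutions.append(points)
    return solutions


L = [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
solutions = another_brute_force_pdp(L)
```

On this small input the result contains both a map and its mirror image, which illustrates why the restriction map is not unique.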

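Practical Restriction Mapping Algorithm (developed in 1990): at each step, take the largest length y still left in L and try to place a new point either at y (measured from the left end) or at M - y (measured from the right end). A placement is valid if the distances from the new point to all points already placed are still present in L; those distances are then deleted from L, and the process repeats until L is empty. If a later step cannot be completed, the algorithm backtracks and tries the other alternative. However, if both the "left" and "right" alternatives are valid at a step, and this keeps happening in later steps, the running time becomes exponential. A polynomial algorithm was designed only more recently.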
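The backtracking procedure described above can be sketched in Python roughly as follows. This is my own reading of the idea, not the book's pseudocode; the names `partial_digest`, `fits`, and `place` are made up for illustration.

```python
from collections import Counter


def partial_digest(lengths):
    """Backtracking search for all point sets whose pairwise distances equal lengths."""
    remaining = Counter(lengths)
    width = max(remaining)       # M, the total length of the map
    remaining[width] -= 1
    points = [0, width]
    solutions = []

    def fits(y):
        """Distances from y to all placed points, or None if some are not left in L."""
        needed = Counter(abs(y - x) for x in points)
        if all(remaining[d] >= c for d, c in needed.items()):
            return needed
        return None

    def place():
        if sum(remaining.values()) == 0:
            solutions.append(sorted(points))
            return
        y = max(d for d, c in remaining.items() if c > 0)
        # Try the point at distance y from the left end, then from the right end.
        for candidate in dict.fromkeys((y, width - y)):
            needed = fits(candidate)
            if needed is not None:
                remaining.subtract(needed)
                points.append(candidate)
                place()
                points.pop()               # backtrack
                remaining.update(needed)

    place()
    return solutions


solutions = partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10])
```

Each recursive call branches at most twice, so the search is fast when only one alternative fits and exponential when both keep fitting.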

07/15-16

1. Describe the Problem:

Motif finding problem: a motif is assumed to be a pattern that appears unusually often in DNA, so the problem is: given the motif length l, find the l-mer that occurs most frequently within a long DNA sequence. To simplify the question, we are given several DNA sequences and need to find the starting positions s, one per sequence, that give the most conserved profile. Using Score(s, DNA) to denote the consensus score, the motif finding problem becomes: given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score.
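As a small illustration, the consensus score and an exhaustive search over starting positions could look like this in Python. The names `score` and `brute_force_motif_search` are my own, positions are 0-based here, and every sequence is assumed to be at least l long.

```python
from collections import Counter
from itertools import product


def score(starts, dna, l):
    """Consensus score: sum over columns of the count of the most common letter."""
    rows = [seq[s:s + l] for s, seq in zip(starts, dna)]
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*rows))


def brute_force_motif_search(dna, l):
    """Try every combination of starting positions and keep the best-scoring one."""
    ranges = [range(len(seq) - l + 1) for seq in dna]
    return max(product(*ranges), key=lambda s: score(s, dna, l))


dna = ["AAACGTAA", "CCACGTCC", "TACGTTTT"]
best = brute_force_motif_search(dna, 4)
```

Here the three chosen 4-mers are identical, so the score reaches its maximum of l times the number of sequences.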

Another view of this problem is to find a median string. The Hamming distance measures the difference between two strings of equal length, so the motif finding problem can also be viewed as finding a string v of length l with the minimum total Hamming distance to the DNA sequences. Notice that this is a double minimization: we are looking for a string v that minimizes TotalDistance(v, DNA), and TotalDistance(v, DNA) is in turn the smallest total distance between v and the l-mers over all choices of starting positions in the DNA sequences.
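The double minimization can be written out directly. The following Python sketch uses hypothetical names (`hamming`, `total_distance`, `median_string`) and assumes the alphabet is ACGT and every sequence is at least l long.

```python
from itertools import product


def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(v, dna):
    """Inner minimization: for each sequence, the closest l-mer to v; then sum."""
    l = len(v)
    return sum(
        min(hamming(v, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )


def median_string(dna, l):
    """Outer minimization: scan all 4**l candidate strings."""
    words = ("".join(p) for p in product("ACGT", repeat=l))
    return min(words, key=lambda v: total_distance(v, dna))


dna = ["AAACGTAA", "CCACGTCC", "TACGTTTT"]
median = median_string(dna, 4)
```

Because the best position in each sequence can be chosen independently once v is fixed, the inner minimum is computed per sequence rather than over all combinations of positions.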

2. Basic Algorithm:

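In both the Motif Finding Problem and the Median String Problem we need to sift through a large number of candidates (sets of starting positions or strings). The NEXTLEAF algorithm answers the question of how to enumerate them one by one: it treats each candidate as a leaf of a search tree and steps from one leaf to the next.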
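A possible Python rendering of the leaf-by-leaf enumeration is shown below. It treats a candidate as an array of values in 1..k and advances it like an odometer; the helper `all_leaves` is my own addition for illustration.

```python
def next_leaf(a, k):
    """Advance a (values in 1..k) to the next leaf; wraps to all 1s after the last."""
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1
            return a
        a[i] = 1  # this digit overflowed, carry into the one to its left
    return a


def all_leaves(length, k):
    """Yield every leaf once, starting from (1, ..., 1)."""
    a = [1] * length
    while True:
        yield tuple(a)
        a = next_leaf(a, k)
        if a == [1] * length:
            return


leaves = list(all_leaves(2, 2))
```

With k = 4 the values can stand for A, C, G, T, so the same loop walks through all 4**l strings of length l.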

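To scan the entire tree, including its internal vertices, we can use NEXTVERTEX. Visiting internal vertices is what makes the branch-and-bound approach possible, because a whole subtree can be skipped once its root is known to be unpromising.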
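One way to sketch the tree traversal in Python is shown below. A vertex is represented by an array `a` plus a depth, where only the first `depth` entries are meaningful. The `bypass` function, which skips a subtree, and the `preorder` helper are included as my own illustration of how pruning would hook in.

```python
def next_vertex(a, depth, k):
    """Move to the next vertex in preorder; depth 0 on return means traversal is done."""
    a = list(a)
    if depth < len(a):
        a[depth] = 1               # descend to the first child
        return a, depth + 1
    for j in range(len(a), 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1          # move to the next sibling at level j
            return a, j
    return a, 0


def bypass(a, depth, k):
    """Skip the subtree below the current vertex and jump to the next one outside it."""
    a = list(a)
    for j in range(depth, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def preorder(length, k):
    """Yield every vertex (as a prefix tuple), starting from the root."""
    a, depth = [1] * length, 0
    while True:
        yield tuple(a[:depth])
        a, depth = next_vertex(a, depth, k)
        if depth == 0:
            return


order = list(preorder(2, 2))
```

A branch-and-bound search would call `bypass` instead of `next_vertex` whenever the bound at the current vertex shows that no leaf below it can beat the best solution found so far.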

3. If we solve the Motif Finding problem directly, for t sequences of length n, we can use the brute force approach ( O(l * n**t) ) or the branch-and-bound approach, which can take less time in practice.

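If we solve the Median String problem instead, we can again use either the brute force approach ( O(4**l * n * t) ) or the branch-and-bound approach. This is more favorable than the Motif Finding formulation, because the motif length l is usually small, so 4**l grows far more slowly than n**t.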
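To show how branch-and-bound prunes the median string search, here is a recursive Python sketch. It is my own simplification rather than the book's pseudocode: it relies on the fact that the total distance of a prefix is a lower bound on the total distance of any string that extends it. The names are hypothetical, and every sequence is assumed to be at least l long.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(v, dna):
    """Sum over sequences of the distance from v to its closest substring."""
    l = len(v)
    return sum(
        min(hamming(v, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )


def branch_and_bound_median(dna, l):
    """Depth-first search over prefixes, skipping subtrees that cannot improve."""
    best_word, best_dist = None, float("inf")

    def extend(prefix):
        nonlocal best_word, best_dist
        dist = total_distance(prefix, dna)
        if dist >= best_dist:
            return                 # bound: no extension of prefix can do better
        if len(prefix) == l:
            best_word, best_dist = prefix, dist
            return
        for base in "ACGT":
            extend(prefix + base)

    extend("")
    return best_word, best_dist


dna = ["AAACGTAA", "CCACGTCC", "TACGTTTT"]
result = branch_and_bound_median(dna, 4)
```

The worst case is still 4**l candidates, but once a good string is found most prefixes are cut off early.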