Advance articles alert 08 July 2019

 

1. Look4TRs: A de-novo tool for detecting simple tandem repeats using self-supervised hidden Markov models

Alfredo Velasco II, Benjamin T. James, Vincent D. Wells, Hani Z. Girgis

Bioinformatics, btz551, https://doi.org/10.1093/bioinformatics/btz551

Abstract

Motivation

Simple tandem repeats (STRs), microsatellites in particular, have regulatory functions, links to several diseases, and applications in biotechnology. There is an immediate need for an accurate tool for detecting microsatellites in newly sequenced genomes. The currently available tools are either sensitive or specific, but not both; some tools require adjusting parameters manually.

Results

We propose Look4TRs, the first application of self-supervised hidden Markov models to discovering microsatellites. Look4TRs adapts itself to the input genome and auto-calibrates, balancing high sensitivity with a low false positive rate. We evaluated Look4TRs on 26 eukaryotic genomes. Based on the F measure, which combines sensitivity and false positive rate, Look4TRs outperformed TRF and MISA (the most widely used tools) by 78% and 84%, and outperformed the second- and third-best tools, MsDetector and Tantan, by 17% and 34%. On eight bacterial genomes, Look4TRs outperformed the second- and third-best tools by 27% and 137%.

Availability

https://github.com/TulsaBioinformaticsToolsmith/Look4TRs

Supplementary information

Supplementary data are available at Bioinformatics online and on https://drive.google.com/open?id=1cIcS7Gvj0wj1B81-rnTU_OAG3IiNH54Y.

Through Look4TRs, we make three main contributions to the methodology of microsatellite (MS) discovery. First, Look4TRs takes into account the nucleotide composition of the input genome. Second, Look4TRs is the first self-supervised system for annotating MS. Third, Look4TRs is the first auto-calibrating system for locating MS. Supervised learning algorithms require labeled examples to be available, whereas a self-supervised learning algorithm can generate its own labeled training examples.

Look4TRs generates a random chromosome based on a real chromosome of the input genome, then inserts semi-synthetic MS into the random chromosome. Finally, the HMM is trained and calibrated on these semi-synthetic MS. Three parameters affect the training of Look4TRs's HMM; Look4TRs automatically trains multiple HMMs using all valid combinations of these three parameters, and the HMM that gives the best balance between high sensitivity and low false positive rate is selected automatically.
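
A minimal, self-contained sketch of this auto-calibration idea is shown below. It builds a semi-synthetic chromosome with known repeat positions, tries several settings of a toy period-matching detector, and keeps the setting with the best F measure. The detector and every function here are illustrative stand-ins for exposition only; they are not the HMMs or the code that Look4TRs actually uses.

```python
import random

def make_semi_synthetic(real_seq, n_repeats=20, unit_len=3, copies=10):
    """Shuffle the real sequence (preserving its base composition) and insert
    tandem repeats at random positions; return the sequence and truth labels."""
    seq = list(real_seq)
    random.shuffle(seq)
    truth = [0] * len(seq)
    for _ in range(n_repeats):
        unit = "".join(random.choice("ACGT") for _ in range(unit_len))
        start = random.randrange(0, len(seq) - unit_len * copies)
        for i, ch in enumerate(unit * copies):
            seq[start + i] = ch
            truth[start + i] = 1
    return "".join(seq), truth

def detect(seq, period, min_run):
    """Toy detector: mark a stretch as repeat when seq[i] == seq[i - period]
    for at least min_run consecutive positions."""
    labels, run = [0] * len(seq), 0
    for i in range(period, len(seq)):
        run = run + 1 if seq[i] == seq[i - period] else 0
        if run >= min_run:
            for j in range(i - run - period + 1, i + 1):
                labels[j] = 1
    return labels

def f_measure(pred, truth):
    """Harmonic mean of precision and sensitivity, used for model selection."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    if tp == 0:
        return 0.0
    precision, sensitivity = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * sensitivity / (precision + sensitivity)

random.seed(1)
real_chromosome = "".join(random.choice("AACGTT") for _ in range(20000))  # A/T-rich toy input
synthetic, truth = make_semi_synthetic(real_chromosome)
grid = [(p, r) for p in (2, 3, 4) for r in (6, 9, 12)]                    # all valid settings
best = max(grid, key=lambda pr: f_measure(detect(synthetic, *pr), truth))
print("selected (period, min_run):", best)
```

The point of the exercise is the selection loop at the end: the model that scores best on the labeled semi-synthetic chromosome is the one kept for annotating the real genome.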


What is a hidden Markov model?

Nature Biotechnology, volume 22, pages 1315–1316 (2004)

Statistical models called hidden Markov models are a recurring theme in computational biology. What are hidden Markov models, and why are they so useful for so many different problems?

Often, biological sequence analysis is just a matter of putting the right label on each residue. In gene identification, we want to label nucleotides as exons, introns, or intergenic sequence. In sequence alignment, we want to associate residues in a query sequence with homologous residues in a target database sequence. We can always write an ad hoc program for any given problem, but the same frustrating issues will always recur. One is that we want to incorporate heterogeneous sources of information. A genefinder, for instance, ought to combine splice-site consensus, codon bias, exon/intron length preferences and open reading frame analysis into one scoring system. How should these parameters be set? How should different kinds of information be weighted? A second issue is to interpret results probabilistically. Finding a best scoring answer is one thing, but what does the score mean, and how confident are we that the best scoring answer is correct? A third issue is extensibility. The moment we perfect our ad hoc genefinder, we wish we had also modeled translational initiation consensus, alternative splicing and a polyadenylation signal. Too often, piling more reality onto a fragile ad hoc program makes it collapse under its own weight.

Hidden Markov models (HMMs) are a formal foundation for making probabilistic models of linear sequence 'labeling' problems [1,2]. They provide a conceptual toolkit for building complex models just by drawing an intuitive picture. They are at the heart of a diverse range of programs, including genefinding, profile searches, multiple sequence alignment and regulatory site identification. HMMs are the Legos of computational sequence analysis.

A toy HMM: 5′ splice site recognition

As a simple example, imagine the following caricature of a 5′ splice-site recognition problem. Assume we are given a DNA sequence that begins in an exon, contains one 5′ splice site and ends in an intron. The problem is to identify where the switch from exon to intron occurred—where the 5′ splice site (5′SS) is.

For us to guess intelligently, the sequences of exons, splice sites and introns must have different statistical properties. Let's imagine some simple differences: say that exons have a uniform base composition on average (25% each base), introns are A/T rich (say, 40% each for A/T, 10% each for C/G), and the 5′SS consensus nucleotide is almost always a G (say, 95% G and 5% A).

Starting from this information, we can draw an HMM (Fig. 1). The HMM invokes three states, one for each of the three labels we might assign to a nucleotide: E (exon), 5 (5′SS) and I (intron). Each state has its own emission probabilities (shown above the states), which model the base composition of exons, introns and the consensus G at the 5′SS. Each state also has transition probabilities (arrows), the probabilities of moving from this state to a new state. The transition probabilities describe the linear order in which we expect the states to occur: one or more Es, one 5, one or more Is.
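
The toy model can be written down directly as data. The sketch below encodes the emission probabilities stated above; the transition probabilities are illustrative placeholders chosen to match "one or more Es, one 5, one or more Is", since the exact numbers from Figure 1 are not reproduced in this text.

```python
# States: E (exon), 5 (5'SS), I (intron).
emissions = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # uniform exon composition
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},   # 5'SS consensus: almost always G
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},   # A/T-rich intron
}
transitions = {                                          # placeholder values
    "start": {"E": 1.0},                                 # sequences begin in an exon
    "E": {"E": 0.9, "5": 0.1},                           # stay in the exon or hit the 5'SS
    "5": {"I": 1.0},                                     # the 5'SS is always followed by intron
    "I": {"I": 0.9, "end": 0.1},                         # stay in the intron or end
}
```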

Figure 1: A toy HMM for 5′ splice site recognition.


See text for explanation.

 

So, what's hidden?

It's useful to imagine an HMM generating a sequence. When we visit a state, we emit a residue from the state's emission probability distribution. Then, we choose which state to visit next according to the state's transition probability distribution. The model thus generates two strings of information. One is the underlying state path (the labels), as we transition from state to state. The other is the observed sequence (the DNA), each residue being emitted from one state in the state path.

The state path is a Markov chain, meaning that what state we go to next depends only on what state we're in. Since we're only given the observed sequence, this underlying state path is hidden—these are the residue labels that we'd like to infer. The state path is a hidden Markov chain.

The probability P(S,π|HMM,θ) that an HMM with parameters θ generates a state path π and an observed sequence S is the product of all the emission probabilities and transition probabilities that were used. For example, consider the 26-nucleotide sequence and state path in the middle of Figure 1, where there are 27 transitions and 26 emissions to tote up. Multiply all 53 probabilities together (and take the log, since these are small numbers) and you'll calculate log P(S,π|HMM,θ) = −41.22.
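
As a sketch of this calculation, the snippet below sums the logs of every emission and transition along one candidate state path for a 26-nucleotide sequence (27 transitions plus 26 emissions, i.e. 53 factors). The example sequence, the path, and the transition probabilities are assumptions not given in this excerpt, so treat the printed score as illustrative rather than as the −41.22 quoted above.

```python
from math import log

emissions = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
transitions = {
    "start": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}

def joint_log_prob(seq, path):
    """log P(S, path | HMM): start transition + one emission per residue
    + one transition per step + the final transition into 'end'."""
    lp = log(transitions["start"][path[0]])
    for i, (residue, state) in enumerate(zip(seq, path)):
        lp += log(emissions[state][residue])
        if i + 1 < len(path):
            lp += log(transitions[state][path[i + 1]])
    return lp + log(transitions[path[-1]]["end"])

seq = "CTTCATGTGAAAGCAGACGTAAGTCA"         # hypothetical 26-nt example
path = "E" * 18 + "5" + "I" * 7            # exons, one 5'SS (on a G), then intron
print(round(joint_log_prob(seq, path), 2))
```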

An HMM is a full probabilistic model—the model parameters and the overall sequence 'scores' are all probabilities. Therefore, we can use Bayesian probability theory to manipulate these numbers in standard, powerful ways, including optimizing parameters and interpreting the significance of scores.

Finding the best state path

In an analysis problem, we're given a sequence, and we want to infer the hidden state path. There are potentially many state paths that could generate the same sequence. We want to find the one with the highest probability.

For example, if we were given the HMM and the 26-nucleotide sequence in Figure 1, there are 14 possible paths that have non-zero probability, since the 5′SS must fall on one of 14 internal As or Gs. Figure 1 enumerates the six highest-scoring paths (those with G at the 5′SS). The best one has a log probability of −41.22, from which we infer that the most likely 5′SS position is the fifth G.

For most problems, there are so many possible state sequences that we could not afford to enumerate them. The efficient Viterbi algorithm is guaranteed to find the most probable state path given a sequence and an HMM. The Viterbi algorithm is a dynamic programming algorithm quite similar to those used for standard sequence alignment.
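
A compact Viterbi for the toy model might look like the sketch below: a dynamic-programming table over (position, state) with backpointers, followed by a traceback. As before, the transition values are placeholders, and this is an illustration rather than an excerpt from any particular tool.

```python
from math import log, inf

emissions = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
transitions = {
    "start": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}
states = list(emissions)

def safe_log(p):
    return log(p) if p > 0 else -inf

def viterbi(seq):
    """Return (best log probability, best state path) for seq under the toy HMM."""
    # V[i][s] = (best log prob of any path ending in state s at position i, backpointer)
    V = [{s: (safe_log(transitions["start"].get(s, 0.0))
              + safe_log(emissions[s][seq[0]]), None) for s in states}]
    for residue in seq[1:]:
        prev, column = V[-1], {}
        for s in states:
            best_p = max(states, key=lambda p: prev[p][0]
                         + safe_log(transitions[p].get(s, 0.0)))
            column[s] = (prev[best_p][0]
                         + safe_log(transitions[best_p].get(s, 0.0))
                         + safe_log(emissions[s][residue]), best_p)
        V.append(column)
    # The path must end with a transition into the implicit 'end' state.
    last = max(states, key=lambda s: V[-1][s][0]
               + safe_log(transitions[s].get("end", 0.0)))
    score = V[-1][last][0] + safe_log(transitions[last].get("end", 0.0))
    path = [last]
    for column in reversed(V[1:]):
        path.append(column[path[-1]][1])
    return score, "".join(reversed(path))

score, path = viterbi("CTTCATGTGAAAGCAGACGTAAGTCA")   # hypothetical 26-nt example
print(round(score, 2), path)
```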

Beyond best scoring alignments

Figure 1 shows that one alternative state path differs only slightly in score from putting the 5′SS at the fifth G (log probabilities of −41.71 versus −41.22). How confident are we that the fifth G is the right choice?

This is an example of an advantage of probabilistic modeling: we can calculate our confidence directly. The probability that residue i was emitted by state k is the sum of the probabilities of all the state paths that use state k to generate residue i (that is, πi = k in the state path π), normalized by the sum over all possible state paths. In our toy model, this is just one state path in the numerator and a sum over 14 state paths in the denominator. We get a probability of 46% that the best-scoring fifth G is correct and 28% that the sixth G position is correct (Fig. 1, bottom). This is called posterior decoding. For larger problems, posterior decoding uses two dynamic programming algorithms called Forward and Backward, which are essentially like Viterbi, but they sum over possible paths instead of choosing the best.
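
Because every valid path in the toy model has the form E...E 5 I...I, posterior decoding can be sketched by enumerating the candidate 5′SS positions directly and normalizing their joint probabilities; for larger models this enumeration is replaced by the Forward and Backward algorithms. The sequence and the transition values below are the same hypothetical stand-ins used in the earlier sketches.

```python
emissions = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
t_EE, t_E5, t_5I, t_II, t_Iend = 0.9, 0.1, 1.0, 0.9, 0.1    # placeholder transitions

def path_prob(seq, ss):
    """Joint probability of seq with the 5'SS at 0-based position ss:
    everything before ss is exon, everything after it is intron."""
    p = 1.0                                                 # start -> E has probability 1
    for i, residue in enumerate(seq):
        state = "E" if i < ss else ("5" if i == ss else "I")
        p *= emissions[state][residue]
    n_exon, n_intron = ss, len(seq) - ss - 1
    return p * t_EE ** (n_exon - 1) * t_E5 * t_5I * t_II ** (n_intron - 1) * t_Iend

seq = "CTTCATGTGAAAGCAGACGTAAGTCA"                          # hypothetical 26-nt example
# The 5'SS must fall on an internal A or G (exon and intron both non-empty).
candidates = [i for i in range(1, len(seq) - 1) if seq[i] in "AG"]
total = sum(path_prob(seq, i) for i in candidates)          # sum over all allowed paths
posterior = {i: path_prob(seq, i) / total for i in candidates}
best = max(posterior, key=posterior.get)
print("best 5'SS position:", best, " posterior:", round(posterior[best], 2))
```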

Making more realistic models

Making an HMM means specifying four things: (i) the symbol alphabet, K different symbols (e.g., ACGT, K = 4); (ii) the number of states in the model, M; (iii) emission probabilities e_i(x) for each state i, which sum to one over the K symbols x (Σ_x e_i(x) = 1); and (iv) transition probabilities t_i(j) for each state i going to any other state j (including itself), which sum to one over the M states j (Σ_j t_i(j) = 1). Any model that has these properties is an HMM.
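
These four requirements translate directly into a sanity check: every emission row must sum to one over the K symbols, and every transition row must sum to one over the states it can move to (here including an explicit 'end'). The sketch below applies that check to the placeholder numbers used in the earlier sketches.

```python
from math import isclose

alphabet = "ACGT"                                            # K = 4 symbols
emissions = {                                                # M = 3 states
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
transitions = {
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}

def is_hmm(emissions, transitions, alphabet):
    """Check that every emission and transition distribution sums to one."""
    emissions_ok = all(isclose(sum(e[x] for x in alphabet), 1.0)
                       for e in emissions.values())
    transitions_ok = all(isclose(sum(t.values()), 1.0)
                         for t in transitions.values())
    return emissions_ok and transitions_ok

print(is_hmm(emissions, transitions, alphabet))              # True
```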

This means that one can make a new HMM just by drawing a picture corresponding to the problem at hand, like Figure 1. This graphical simplicity lets one focus clearly on the biological definition of a problem.

For example, in our toy splice-site model, maybe we're not happy with our discrimination power; maybe we want to add a more realistic six-nucleotide consensus GTRAGT at the 5′ splice site. We can put a row of six HMM states in place of the '5' state to model a six-base ungapped consensus motif, parameterizing the emission probabilities on known 5′ splice sites. And maybe we want to model a complete intron, including a 3′ splice site; we just add a row of states for the 3′SS consensus, and add a 3′ exon state to let the observed sequence end in an exon instead of an intron. Then maybe we want to build a complete gene model... whatever we add, it's just a matter of drawing what we want.
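
As a sketch of how such a motif row might be parameterized, the snippet below estimates one emission distribution per position of a six-base consensus from a handful of made-up example splice sites; in practice these parameters would be estimated from annotated 5′ splice sites, and the resulting states would be chained E → m1 → ... → m6 → I in place of the single '5' state.

```python
from collections import Counter

# Hypothetical training hexamers roughly matching the GTRAGT consensus.
known_5ss = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT", "GTGAGT"]

def motif_states(examples, pseudocount=1.0):
    """One emission distribution per motif position, with pseudocounts."""
    rows = []
    for pos in range(len(examples[0])):
        counts = Counter(seq[pos] for seq in examples)
        total = len(examples) + 4 * pseudocount
        rows.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return rows

for i, row in enumerate(motif_states(known_5ss), 1):
    print(f"m{i}", {b: round(p, 2) for b, p in row.items()})
```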

The catch

HMMs don't deal well with correlations between residues, because they assume that each residue depends only on one underlying state. An example where HMMs are usually inappropriate is RNA secondary structure analysis. Conserved RNA base pairs induce long-range pairwise correlations; one position might be any residue, but the base-paired partner must be complementary. An HMM state path has no way of 'remembering' what a distant state generated.

Sometimes, one can bend the rules of HMMs without breaking the algorithms. For instance, in genefinding, one wants to emit a correlated triplet codon instead of three independent residues; HMM algorithms can readily be extended to triplet-emitting states. However, the basic HMM toolkit can only be stretched so far. Beyond HMMs, there are more powerful (though less efficient) classes of probabilistic models for sequence analysis.
