Chapter 4 Markov Models
4.4 Student Presentation 1
- Example 1: Was she happy? A very entertaining example. hidden_states = (Happy, Unhappy); observations = (Kiss, Beat, Do nothing)
- Viterbi algorithm
- Example 2: 5' splice site recognition. hidden_states = (E, 5, I); observations = (A, C, G, T)
- Example 3: Coding region. hidden_states = (Sta1, Sta2, Sta3, Cod1, Cod2, Cod3, Sto1, Sto2, Sto3); observations = (A, C, G, T)
- Example 4: Prokaryotic gene. hidden_states = (intergenic, Sta‐8, Sta‐7, Sta‐6, Sta‐5, Sta‐4, Sta‐3, Sta‐2, Sta‐1, Sta1, Sta2, Sta3, Sta4, Sta5, Sta6, Cod1, Cod2, Cod3, Sto‐3, Sto‐2, Sto‐1, Sto1, Sto2, Sto3, Sto4, Sto5, Sto6, Sto7, Sto8, Sto9, Sto10, Sto11); observations = (A, C, G, T)
- Example 5: Eukaryotic gene 1
- Example 6: Eukaryotic gene 2
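To make Example 1 concrete, here is a minimal Viterbi sketch over the hidden states (Happy, Unhappy) and the observations (Kiss, Beat, Do nothing). The report gives no probabilities, so every number in `start_p`, `trans_p` and `emit_p` is an illustrative assumption, and the function name `viterbi` is our own.

```python
import math

# All probabilities are illustrative assumptions, not values from the report.
states = ("Happy", "Unhappy")
start_p = {"Happy": 0.6, "Unhappy": 0.4}
trans_p = {
    "Happy": {"Happy": 0.7, "Unhappy": 0.3},
    "Unhappy": {"Happy": 0.4, "Unhappy": 0.6},
}
emit_p = {
    "Happy": {"Kiss": 0.6, "Beat": 0.1, "Do nothing": 0.3},
    "Unhappy": {"Kiss": 0.1, "Beat": 0.5, "Do nothing": 0.4},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # score[s]: best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of s at this step
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][o]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


path, logp = viterbi(("Kiss", "Beat", "Do nothing"), states, start_p, trans_p, emit_p)
print(path, math.exp(logp))
```

With these assumed numbers the decoded path is Happy, Unhappy, Unhappy. Working in log space avoids numerical underflow on long sequences, which matters for the gene models in Examples 2 to 6.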
4.5 Student Presentation 2
- What is Pfam? Pfam is a database of protein domains and families; each entry includes annotations and multiple sequence alignments generated using hidden Markov models.
- Pfam entries are classified in one of four ways:
- Family--A collection of related protein regions
- Domain--A structural unit
- Repeat--A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
- Motif--A short unit found outside globular domains
- Pfam includes Pfam-A and Pfam-B
- Pfam-A: Pfam-A is the manually curated portion of the database and contains over 10,000 entries. For each entry, a protein sequence alignment and a hidden Markov model are stored.
- Pfam-B: Because the entries in Pfam-A do not cover all known proteins, an automatically generated supplement is provided called Pfam-B. Pfam-B contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.
- When people refer to the Pfam database, they generally mean Pfam-A.
- Functions of Pfam: for each family in Pfam, one can:
- Look at multiple alignments
- View protein domain architectures
- Examine species distribution
- Follow links to other databases
- View known protein structures
- pHMM Generation
- Principles of a pHMM: tokens: amino acid sequence; states: insertion, deletion, match; columns: probability of each residue at each site
- pHMM parameters are derived from a training set
- Parameters of pHMM: transition probability & emission probability
- Training set: curated, highly representative, relatively conserved sequences of a family
- pHMM parameters are adjusted to include all the members of the family.
- They can be estimated directly from a multiple alignment,
- or by an expectation-maximization procedure from unaligned sequences.
- Finally, a pHMM logo is generated from the pHMM.
- Why is Pfam reliable?
- When the pHMMs are generated, all the data are manually curated.
- Using a pHMM to identify family members is a reliable approach.
- Pfam only reports results that are highly significant.
- http://pfam.xpfam.org/
- Limitations
- Pfam doesn't give the 3D-structure of the target protein
- Pfam only gives the function of specific domains, but doesn't describe the function of the whole protein
- Pfam doesn't give the basic properties of the target, such as isoelectric point (pI), solubility, etc.
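As an illustration of the "estimated directly from a multiple alignment" route mentioned under pHMM Generation, the sketch below derives match-state emission probabilities from a toy alignment. The alignment, the 50% gap threshold for choosing match columns, the Laplace pseudocount, and the function names `match_columns` and `match_emissions` are all assumptions made for this example; this is not Pfam's actual build procedure.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match states (assumed heuristic:
    a column is a match column if at most half of its entries are gaps)."""
    n = len(alignment)
    return [
        j for j in range(len(alignment[0]))
        if sum(seq[j] == "-" for seq in alignment) / n <= max_gap_fraction
    ]


def match_emissions(alignment, pseudocount=1.0):
    """Emission probabilities per match column, smoothed with pseudocounts."""
    profile = []
    for j in match_columns(alignment):
        counts = Counter(seq[j] for seq in alignment if seq[j] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}
        )
    return profile


# Hypothetical toy alignment: column 3 is mostly gaps, so it is not a match column.
toy_alignment = ["ACD-E", "ACDKE", "GCD-E", "AC--E"]
profile = match_emissions(toy_alignment)
print(match_columns(toy_alignment))
print(round(profile[0]["A"], 3), round(profile[0]["G"], 3))
```

The pseudocount keeps residues that never appear in a column from receiving probability zero, which would otherwise make the model reject any family member carrying that residue. Transition probabilities between match, insertion and deletion states could be counted in the same way.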
4.6 Student Presentation 3
- From Markov Model to HMM
- HMM学习最佳范例 (the best worked examples for learning HMMs)
- How to build a model
- Three fundamental problems, given a model M = M(w):
- Evaluation: for one observed sequence O = O1O2..., calculate P(O|w)
- Decoding: for observed sequences Oa, Ob, ..., choose the state sequence S = q1q2... that best explains each observed sequence O
- Learning: adjust the parameters w to maximize P(O|w), using observed sequences to train the model
- Feasibility: Biological Meaning
- Self-adaptive to the target sequence; does not rely on prior experience.
- Performs better when combined with other biological methods: easily revised using structure data.
- Functions as a flexible method under different conditions: adjustable to meet varying requirements.
- Shortcomings:
- The algorithm does not guarantee a globally optimal solution.
- The training process is limited by sample size.
- The choice of match states is arbitrary.
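The evaluation problem listed above, calculating P(O|w), is usually solved with the forward algorithm. The sketch below is a minimal version; the two-state model (`S1`, `S2`), the observation symbols `x` and `y`, and all probabilities are hypothetical values chosen only so the arithmetic can be checked by hand.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(O | w), summing over all hidden state paths."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[p] * trans_p[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model w = (start_p, trans_p, emit_p)
states = ("S1", "S2")
start_p = {"S1": 0.5, "S2": 0.5}
trans_p = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.2, "S2": 0.8}}
emit_p = {"S1": {"x": 0.7, "y": 0.3}, "S2": {"x": 0.1, "y": 0.9}}

p_xy = forward("xy", states, start_p, trans_p, emit_p)
print(p_xy)
```

For this toy model P("xy"|w) works out to 0.165, and the probabilities of all four length-2 sequences sum to 1. For long sequences the raw products underflow, so a practical version would rescale `alpha` at each step or work in log space.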
4.7 Student Practice
- Finding CNVs with HMM
- What is copy-number variation (CNV)?
- CNVs: copies of large segments that occur at the chromosome scale.
- Problem definition: identifying repeated sequences (CNVs) in a long DNA sequence.
- Step1: Hidden states: Is a CNV; Is not a CNV
- Step2: Matrices: transition matrix; emission matrix
- Step3: Training Set
- Step4: Dynamic Programming
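Steps 2 and 3 can be sketched as counting over a labeled training set. The notes do not say what the observations are, so the observation symbols here (`lo`/`hi`, imagined as a binned signal along the sequence), the state names, the training data and the function name `estimate_matrices` are all assumptions for illustration.

```python
from collections import Counter

STATES = ("CNV", "Normal")   # Step 1: hidden states (is / is not a CNV)
SYMBOLS = ("lo", "hi")       # hypothetical observation alphabet


def estimate_matrices(training_set, pseudocount=1.0):
    """Estimate (transition, emission) tables from labeled sequences."""
    trans, emit = Counter(), Counter()
    for observations, labels in training_set:
        for obs, state in zip(observations, labels):
            emit[(state, obs)] += 1
        for prev, curr in zip(labels, labels[1:]):
            trans[(prev, curr)] += 1

    def normalise(counts, rows, cols):
        # Turn smoothed counts into row-wise probability distributions
        table = {}
        for r in rows:
            total = sum(counts[(r, c)] + pseudocount for c in cols)
            table[r] = {c: (counts[(r, c)] + pseudocount) / total for c in cols}
        return table

    return normalise(trans, STATES, STATES), normalise(emit, STATES, SYMBOLS)


# Hypothetical training set: one sequence of observations with its state labels
training_set = [
    (["lo", "lo", "hi", "hi", "lo"],
     ["Normal", "Normal", "CNV", "CNV", "Normal"]),
]
trans_p, emit_p = estimate_matrices(training_set)
print(trans_p)
print(emit_p)
```

The two tables are exactly what the dynamic programming in Step 4 needs as input: a decoder such as the Viterbi algorithm would use them to label each position of a new sequence as CNV or Normal.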