"Bioinformatics: Introduction and Methods"----Markov Models----Lecture Notes (7)

Chapter 4  Markov Models

4.4 Student Presentation 1

  • Example1: Was she happy? (a very entertaining example)-----hidden_states = (Happy, Unhappy)
    observations = (Kiss, Beat, Do nothing)
  • Viterbi algorithm
  • Example2: 5’ splice site recognition-----hidden_states = (E, 5, I)  observations = (A, C, G, T)
  • Example3: Coding region------hidden_states = (Sta1, Sta2, Sta3, Cod1, Cod2, Cod3, Sto1, Sto2, Sto3)
    observations = (A, C, G, T)
  • Example4: Prokaryotic gene-----hidden_states = (intergenic,
    Sta‐8, Sta‐7, Sta‐6, Sta‐5, Sta‐4, Sta‐3, Sta‐2, Sta‐1, Sta1, Sta2, Sta3, Sta4, Sta5, Sta6,
    Cod1, Cod2, Cod3,
    Sto‐3, Sto‐2, Sto‐1, Sto1, Sto2, Sto3, Sto4, Sto5, Sto6, Sto7, Sto8, Sto9, Sto10, Sto11)
    observations = (A, C, G, T)
  • Example5: Eukaryotic gene 1
  • Example6: Eukaryotic gene 2
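
The Viterbi algorithm listed above can be sketched on Example1. The notes give only the hidden states and observations, so every probability below is a made-up illustrative value and not a number from the presentation. The function works in log space and returns the most probable hidden-state path for an observation sequence.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability)."""
    # score[s]: best log probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # choose the predecessor that maximizes the path score into s
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][obs]))
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)
    # trace back from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, score[last]

states = ("Happy", "Unhappy")
# Hypothetical parameters, chosen only for illustration.
start_p = {"Happy": 0.5, "Unhappy": 0.5}
trans_p = {"Happy": {"Happy": 0.8, "Unhappy": 0.2},
           "Unhappy": {"Happy": 0.3, "Unhappy": 0.7}}
emit_p = {"Happy": {"Kiss": 0.6, "Beat": 0.1, "Do nothing": 0.3},
          "Unhappy": {"Kiss": 0.1, "Beat": 0.5, "Do nothing": 0.4}}

path, logp = viterbi(("Kiss", "Do nothing", "Beat"),
                     states, start_p, trans_p, emit_p)
print(path, math.exp(logp))
```

With these assumed numbers the decoded path is Happy, Unhappy, Unhappy. The same function applies unchanged to the nucleotide examples once their transition and emission tables are supplied.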

4.5 Student Presentation 2

  • What is Pfam? Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. It contains information about protein domains and families.
  • Pfam entries are classified in one of four ways:
  1. Family--A collection of related protein regions
  2. Domain--A structural unit
  3. Repeat--A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
  4. Motif--A short unit found outside globular domains
  • Pfam includes Pfam-A and Pfam-B
  • Pfam-A: Pfam-A is the manually curated portion of the database and contains over 10,000 entries. For each entry, a protein sequence alignment and a hidden Markov model are stored.
  • Pfam-B: Because the entries in Pfam-A do not cover all known proteins, an automatically generated supplement is provided called Pfam-B. Pfam-B contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.
  • In general, "the Pfam database" refers to Pfam-A
  • Functions of Pfam: for each family in Pfam, one can:
  1. Look at multiple alignments
  2. View protein domain architectures
  3. Examine species distribution
  4. Follow links to other databases
  5. View known protein structures
  • pHMM Generation
  • Principles of pHMM: Tokens: amino acid sequence; States: insertion, deletion, match; Columns: probability of residues at each site
  • pHMM parameters are derived from a training set
  • Parameters of pHMM: transition probability & emission probability
  • Training set: curated, highly representative, relatively conserved sequences of a family
  • pHMM parameters are adjusted to include all the members of the family.
  • They can be estimated directly from a multiple alignment.
  • Alternatively, they can be estimated from unaligned sequences with an expectation-maximization procedure.
  • Finally, a pHMM logo is generated from the pHMM.
  • Why is Pfam reliable?
  1. When the pHMMs are generated, all the underlying data are manually curated.
  2. Identifying family members with a pHMM is itself a reliable approach.
  3. Pfam only shows results with high significance.
  • http://pfam.xpfam.org/
  • Limitations
  1. Pfam doesn't give the 3D-structure of the target protein
  2. Pfam only gives the function of specific domains, but doesn't describe the function of the whole protein
  3. Pfam doesn't give the basic properties of the target, such as pI (isoelectric point), solubility, etc.
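
The idea that emission probabilities can be estimated directly from a multiple alignment can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not in the notes: every alignment column is treated as a match state, gaps are ignored, transition probabilities are not estimated, and a flat pseudocount of 1 is added so that unseen residues keep a non-zero probability. The toy alignment is invented for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_emissions(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Estimate one emission distribution per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # count residues in this column; gaps and unknown symbols are skipped
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

# Hypothetical toy alignment ("-" marks a gap).
toy_alignment = ["MKV", "MRV", "M-L", "LKV"]
profile = column_emissions(toy_alignment)

for i, column in enumerate(profile):
    best = max(column, key=column.get)
    print(i, best, round(column[best], 3))
```

Production tools use more elaborate priors and also decide which columns become match or insert states, so this should be read only as an illustration of the counting step.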

4.6 Student Presentation 3

  • Three fundamental problems, given a model M=M(w):
  1. Evaluation: for one sequence 'O=O1O2...', calculate P(O|w)
  2. Decoding: for observed sequences 'Oa/Ob...', choose the state path S=q1q2... that best explains the observed sequence O
  3. Learning: adjust the parameters w to maximize P(O|w), using observed sequences to train the model
  • Feasibility: Biological Meaning
  1. Is self-adaptive to the target sequence. Does not rely on prior experience.
  2. Performs better when combined with other biological methods. ----Easily revised with structure data.
  3. Functions as a flexible method in different conditions. ----Adjustable to meet variable requirements.
  • Shortcomings:
  1. The algorithm does not guarantee a globally optimal solution.
  2. The training process is limited by sample size.
  3. The choice of match states is arbitrary.
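
The evaluation problem, calculating P(O|w), is commonly solved with the forward algorithm. The sketch below sums over all hidden-state paths instead of taking the single best one. The two-state model, its state names and its probabilities are hypothetical and serve only to make the example runnable.

```python
from itertools import product

def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(O | model), summed over all hidden-state paths."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit_p[s][obs] * sum(alpha[p] * trans_p[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical two-state model over the symbols "a" and "b".
states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3},
           "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.9, "b": 0.1},
          "S2": {"a": 0.2, "b": 0.8}}

p_ab = forward(("a", "b"), states, start_p, trans_p, emit_p)

# Sanity check: probabilities of all length-3 sequences should sum to 1.
total = sum(forward(seq, states, start_p, trans_p, emit_p)
            for seq in product("ab", repeat=3))
print(p_ab, total)
```

This version multiplies raw probabilities, so it would underflow on long sequences. A practical implementation would rescale alpha at each step or work in log space.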

4.7 Student Practice

  • Finding CNVs with HMM
  • What is copy-number variation (CNV)?
  • CNVs: copies of large segments that occur at the chromosomal scale.
  • Problem definition: identify repeated sequences (CNVs) in a long DNA sequence.
  • Step1: Hidden states: Is a CNV; Is not a CNV
  • Step2: Matrices: transition matrix; emission matrix
  • Step3: Training Set
  • Step4: Dynamic Programming
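
Steps 2 and 3 can be sketched as counting over a labeled training set. The notes do not say what the observations are, so the sketch assumes they are nucleotides (A, C, G, T) and that each training position is labeled with one of the two hidden states. The state names, the toy data and the pseudocount of 1 are likewise assumptions.

```python
STATES = ("CNV", "not_CNV")  # the two hidden states from Step1

def normalize(row):
    """Scale a dict of counts so that its values sum to 1."""
    total = sum(row.values())
    return {key: value / total for key, value in row.items()}

def estimate_matrices(labeled_sequences, symbols="ACGT", pseudocount=1.0):
    """Estimate transition and emission matrices from labeled data.

    labeled_sequences: iterable of (observations, state_labels) pairs,
    where both items have the same length.
    """
    trans = {s: {t: pseudocount for t in STATES} for s in STATES}
    emit = {s: {x: pseudocount for x in symbols} for s in STATES}
    for observations, labels in labeled_sequences:
        if len(observations) != len(labels):
            raise ValueError("observations and labels must have equal length")
        for obs, state in zip(observations, labels):
            emit[state][obs] += 1          # count symbol emitted in this state
        for prev, nxt in zip(labels, labels[1:]):
            trans[prev][nxt] += 1          # count state-to-state moves
    return ({s: normalize(trans[s]) for s in STATES},
            {s: normalize(emit[s]) for s in STATES})

# Hypothetical toy training example.
training = [("ACGTAC",
             ["not_CNV", "not_CNV", "CNV", "CNV", "CNV", "not_CNV"])]
trans, emit = estimate_matrices(training)
print(trans["CNV"], emit["not_CNV"])
```

The resulting matrices could then be passed to a dynamic-programming decoder such as Viterbi (Step4) to label each position of a new sequence as CNV or not CNV.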