BMS8110 Review (1): Lecture 1 - Introduction to Bioinformatics

Wiki definition:

  • Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data.
  • As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data.

National Institutes of Health (NIH) definition:

  • Bioinformatics is "research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, analyze, or visualize such data."

Fields of Bioinformatics:

  • Sequence analysis
  1. Genome annotation
  2. Computational evolutionary biology
  3. Comparative genomics
  4. Cancer genomics
  • Structural bioinformatics
  • Network and systems biology
  • High-throughput image analysis
  • Literature analysis / Text mining
  • Databases
  • Software and tools

The role and contribution of computational biology has often been misunderstood and undervalued!

All modern biology is computational biology:

  • Computational thinking and computational methods are so central to the quest of understanding life that today all biology is computational biology.
  • Computational biology:
  1. brings order into our understanding of life
  2. lets you see the big picture
  3. provides an atlas of life
  4. turns ideas into hypotheses

Computational Biology | Bioinformatics
Is science | Is engineering or CS
Aims to discover biological mechanisms | Aims to develop better models/algorithms
Treats bioinformatics as tools (just like pipettes, qPCR, western blots, etc., to wet-lab biologists) | Treats biological questions as case studies to demonstrate the "better performance"
Needs deeper biological/biomedical/clinical knowledge | Needs more CS/maths/stats background

Ronald Fisher: a biologist and a statistician (correlation does not imply causation)

National Center for Biotechnology Information (NCBI)

Major milestones of NCBI: 1990 BLAST; 1992 GenBank; 1996 OMIM; 1999 Human Genome; 1999 Suite of Genomic Resources; 2000 PubMed Central; 2000 GEO; 2002 WGS; 2003 Entrez Gene; 2004 PubChem; 2006 dbGaP; 2008 1000 Genomes Project; 2010 dbVar; 2013 ClinVar

European Bioinformatics Institute (EBI)

Improve the efficiency of your research with public data

  • Testing your hypothesis using published data before your own experiments
  1. Supportive -> Proceed with your own experiments
  2. Discouraging -> Adjust/Give up your hypothesis
  • Validating your results using one or multiple independent data set(s)
  1. Supportive -> Strengthen your manuscript
  2. Discouraging -> Repeat your experiments

GEO: Gene Expression Omnibus

ENCODE: Encyclopedia of DNA Elements

TCGA: The Cancer Genome Atlas

ICGC: International Cancer Genome Consortium

All models are wrong, some are useful.

We like the idea of using simple statistics to solve real, important problems.

We aren't fans of unnecessary complication -- that just leads to lies, damn lies and something else.

Basic Types of Analyses:

  • Summarize data
  • Test for difference between groups
  • Analyze rates and proportions
  • Test for trends

Types of data: Interval data; Nominal or categorical data; Ordinal data

Mean, median and mode

Variance, standard deviation, standard error of the mean
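As a quick illustration of these summary statistics, here is a minimal sketch using only Python's standard library. The `values` list is made-up data for illustration, not from the lecture, and the sample (n - 1) versions of variance and standard deviation are assumed.

```python
import math
import statistics

# Hypothetical measurements (e.g. expression of one gene in seven samples).
values = [2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 14.0]

mean = statistics.mean(values)          # arithmetic average
median = statistics.median(values)      # middle value of the sorted data
mode = statistics.mode(values)          # most frequent value
variance = statistics.variance(values)  # sample variance (divides by n - 1)
sd = statistics.stdev(values)           # sample standard deviation
sem = sd / math.sqrt(len(values))       # standard error of the mean

print(mean, median, mode, variance, sd, sem)
```

Note that the standard deviation describes the spread of the data, while the standard error of the mean describes how precisely the mean is estimated and shrinks as the sample size grows.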

Hypothesis Test

  • In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.
  • Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the actually observed one?
  • This probability is known as the p-value.
  • What counts as "more extreme" depends on how the hypothesis is tested (for example, one-sided versus two-sided).

I feel I need to dig out my probability textbook and work through this part properly...

The p-value is the probability, given that the null hypothesis is true, of obtaining the observed sample result or a more extreme one.

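The α value is a critical probability threshold. In statistical hypothesis testing, where the sample is used to draw conclusions about the population, α is the probability of wrongly rejecting the null hypothesis when it is actually true. The smaller α is, the less likely we are to make that wrong rejection.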

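Wrongly rejecting the null hypothesis is a Type I error (rejecting a true hypothesis). Its probability depends only on the rejection region and has an averaged, long-run meaning: it does not depend on the sample, so a single sample cannot tell us the probability that rejecting the null hypothesis is a mistake. The p-value, by contrast, is a statistic that depends on the sample. It describes the probability, assuming the null hypothesis is true, of a result equal to or more extreme than the one in the sample.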
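To make the difference between the p-value and α concrete, here is a minimal sketch of an exact one-sided binomial test. The scenario (15 heads in 20 flips, null hypothesis of a fair coin) and the function name `binomial_upper_p_value` are illustrative assumptions, not part of the lecture.

```python
from math import comb

def binomial_upper_p_value(successes, trials, p_null=0.5):
    """P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

alpha = 0.05  # significance level, fixed before looking at the data
p_value = binomial_upper_p_value(15, 20)  # computed from the observed sample
reject_null = p_value < alpha

print(p_value, reject_null)
```

Here α is chosen in advance and stays the same for every sample, while the p-value changes with the observed data. "More extreme" is taken as "at least this many heads"; a two-sided test would also count outcomes with equally few heads.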

How to test for differences between groups: Normal distribution; Student's t-distribution; Student's t-test
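A minimal sketch of the two-sample Student's t-test statistic, assuming equal variances in both groups. The `control` and `treated` values are made up, and `pooled_t_statistic` is a hypothetical helper name. Turning t into a p-value requires the t-distribution with the returned degrees of freedom, which the standard library does not provide; in practice a library routine such as `scipy.stats.ttest_ind` is commonly used for that step.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) for Student's two-sample t-test,
    assuming both groups share the same variance."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df

control = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical measurements
treated = [3.0, 4.0, 5.0, 6.0, 7.0]
t_stat, df = pooled_t_statistic(treated, control)

print(t_stat, df)
```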

How to test for associations: Fisher's Exact test for association
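A minimal sketch of Fisher's exact test for a 2x2 contingency table, written with the standard library only. The two-sided rule used here (sum the probabilities of all tables that are no more likely than the observed one) is one common convention and is an assumption on my part; the table values and the function name are illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )

# Hypothetical counts: rows = mutated / wild type, columns = responder / non-responder.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)
```

With counts this small the association is far from significant, which is exactly the situation where an exact test is preferred over a large-sample approximation.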

How to test for overrepresentation: Hypergeometric test for overrepresentation (the basis of Gene Ontology analysis); Gene Ontology analysis
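A minimal sketch of the hypergeometric overrepresentation test as it is typically applied to a Gene Ontology term: given a gene list drawn from a background, how surprising is the number of genes carrying the term? The parameter names and the toy numbers are assumptions for illustration.

```python
from math import comb

def hypergeometric_enrichment_p(population, annotated, drawn, overlap):
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(population, drawn)

# Toy example: 20 background genes, 5 annotated with the term,
# 5 genes in our list, 3 of which carry the term.
p_value = hypergeometric_enrichment_p(population=20, annotated=5, drawn=5, overlap=3)
print(p_value)
```

In a real analysis this test is repeated for many terms, so the resulting p-values usually need a multiple-testing correction.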

How to test for trends:

Pearson Product-Moment Correlation Coefficient. This method has some limitations:

  • Only represents linear relationships
  • Does not distinguish between slopes of linear relationships
  • Cannot reflect nonlinear relationships
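The limitations above can be seen in a minimal sketch. The function `pearson_r` is a plain implementation of the textbook formula, and the three data sets are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariance_sum / math.sqrt(ss_x * ss_y)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
shallow = [0.5 * v for v in x]   # linear, small slope
steep = [10.0 * v for v in x]    # linear, large slope
parabola = [v ** 2 for v in x]   # perfectly determined by x, but not linear

r_shallow = pearson_r(x, shallow)
r_steep = pearson_r(x, steep)
r_parabola = pearson_r(x, parabola)
print(r_shallow, r_steep, r_parabola)
```

Both straight lines give the same coefficient regardless of slope, while the parabola gives a coefficient of zero even though y is completely determined by x.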

Spearman Rank Correlation Coefficient:

  • Defined as the Pearson correlation coefficient between the ranked variables
  • Used when the two variables being compared are monotonically related, even if their relationship is not linear
  • Less sensitive than the Pearson correlation to strong outliers
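A minimal sketch of the Spearman coefficient computed exactly as defined above: rank both variables, then take the Pearson correlation of the ranks. The helper names and the data (a monotonic series with one strong outlier) are illustrative assumptions; ties receive the average of their rank positions.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariance_sum / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranked variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 100.0]  # monotonic, with one strong outlier

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(r, rho)
```

Because only the ordering matters after ranking, the outlier pulls the Pearson coefficient well below 1 but leaves the Spearman coefficient at 1.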

Data visualization is increasingly important for biological research