Paper reading (18): Machine learning applications in genetics and genomics

Title: Machine learning applications in genetics and genomics

Google Scholar citations: 528

Pages: 12

Published: May 2015

Venue: Nature Reviews Genetics

Authors: Maxwell W. Libbrecht and William Stafford Noble

Abstract:

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

Conclusions:

  • Genomics has generated vast amounts of data through several large projects, and machine learning is particularly strong at handling large, complex datasets, so applying machine learning to genomics is a natural and inevitable trend.
  • Naively throwing machine learning methods at genomics problems is unlikely to give good results. In general, using these algorithms well requires combining theory and practical experience from both the ML field and the relevant application domain.
  • both machine learning itself and scientists proficient in these applications are likely to become increasingly important to advancing genetics and genomics.

Introduction:

  • ML is very useful for the interpretation of large genomic datasets and has also been applied to annotate a wide variety of genomic sequence elements.
  • As well as learning to recognize patterns in DNA sequences, machine learning algorithms can use input data generated by other genomic assays.

  • Machine learning applications have also been extensively used to assign functional annotations to genes.

  • a wide variety of machine learning methods have been developed to help to understand the mechanisms underlying gene expression.

  • machine learning researchers have tended to focus on a subset of problems within statistics, emphasizing in particular the analysis of large heterogeneous data sets.

  • we begin by explaining several key distinctions in the main types of machine learning and then outlining some of the major challenges in applying machine learning methods to practical problems in genomics.

Paper outline:

1. Introduction

2. Stages of machine learning

3. Supervised versus unsupervised learning

4. Generative versus discriminative modelling

5. Incorporating prior knowledge

6. Handling heterogeneous data

7. Feature selection

8. Imbalanced class sizes

9. Handling missing data

10. Modelling dependence among examples

11. Conclusions

Excerpts from the main text:

2. Stages of machine learning

  • the design–learn–test process provides a principled way to test a hypothesis about machine learning

  • the algorithm itself can be used to generate hypotheses

  • key question: whether the model is interpretable, and how to interpret it

3. Supervised versus unsupervised learning

  • Unsupervised learning: the machine learning algorithm uses only the unlabelled data and the desired number of different labels to assign as input

  • Semi-supervised learning: a machine-learning method that requires labels but that also makes use of unlabelled examples.

  • The learning procedure begins by constructing an initial gene-finding model on the basis of the labelled subset of the training data alone. Next, the model is used to scan the genome, and tentative labels are assigned throughout the genome.

  • These tentative labels can then be used to improve the learned model, and the procedure iterates until no new genes are found.
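To make the iterative procedure above concrete, here is a minimal self-training sketch (my own toy example, not the paper's actual gene-finding model): a one-dimensional nearest-class-mean classifier is fit on the labelled points, confidently classified unlabelled points are promoted to tentative labels, and the loop repeats until no new labels are found.

```python
import statistics

def self_train(labelled, unlabelled, margin=2.0, max_iter=10):
    """Toy self-training: fit per-class means on labelled points, promote
    confidently classified unlabelled points, repeat until nothing changes."""
    labelled = dict(labelled)              # {feature_value: class_label}
    pool = set(unlabelled)
    for _ in range(max_iter):
        means = {cls: statistics.mean(x for x, c in labelled.items() if c == cls)
                 for cls in set(labelled.values())}
        newly = {}
        for x in pool:
            dists = sorted((abs(x - m), cls) for cls, m in means.items())
            # only promote points that sit clearly nearer one class mean
            if len(dists) == 1 or dists[1][0] - dists[0][0] >= margin:
                newly[x] = dists[0][1]
        if not newly:                      # no new confident labels: stop
            break
        labelled.update(newly)
        pool -= set(newly)
    return labelled
```

Seeding with {0.0: "A", 10.0: "B"} and pool [1.0, 2.0, 8.5, 9.0, 5.0] labels the points near each seed and leaves the ambiguous midpoint 5.0 unlabelled, mirroring how tentative labels accumulate only where the current model is confident.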

  • When no labels are available, unsupervised learning is the only option; when labels are available, supervised learning is not necessarily the best choice, because every supervised learning method rests on the implicit assumption that the distribution responsible for generating the training data set is the same as the distribution responsible for generating the test data set.

  • Semi-supervised learning requires making certain assumptions about the data set [22] and, in practice, assessing these assumptions can often be very difficult.

  • a good rule of thumb is to use semi-supervised learning only when there is a small amount of labelled data and a large amount of unlabelled data.

4. Generative versus discriminative modelling

  • ML is used for essentially one of two purposes: prediction or interpretation.
  • There are trade-offs between accomplishing these two goals — methods that optimize prediction accuracy often do so at the cost of interpretability. (Does higher accuracy really mean worse interpretability? I don't fully understand this yet.)

  • The paper's explanation of this point, which I only half understand: perhaps optimizing for accuracy discards some features and so sacrifices part of the picture? A researcher applying a machine learning method to this problem may either want to understand what properties of a sequence are the most important for determining whether a transcription factor will bind (that is, interpretation), or simply want to predict the locations of transcription factor binding as accurately as possible (that is, prediction).

  • A generative approach builds a full model of the distribution of features in each of the two classes and then compares how the two distributions differ; by contrast, a discriminative approach focuses on accurately modelling only the boundary between the two classes.

  • the generative description of the data implies that the model parameters have well-defined semantics relative to the generative process; generative models are frequently stated in terms of probabilities, and the probabilistic framework provides a principled way to handle problems like missing data.

  • generative approaches can sometimes perform better with limited training data.

  • when the amount of labelled training data is reasonably large, the discriminative approach will tend to find a better solution
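A minimal sketch of the generative side of this distinction (my toy illustration, not from the paper): fit one Gaussian per class and classify by comparing likelihoods. The learned parameters, each class's mean and standard deviation, describe the data directly, which is exactly the interpretability a generative model buys; a discriminative method would instead fit only the decision boundary.

```python
import math
import statistics

def fit_generative(pos, neg):
    """Fit one Gaussian per class; (mean, stdev) have clear semantics
    relative to the assumed generative process."""
    return {"pos": (statistics.mean(pos), statistics.stdev(pos)),
            "neg": (statistics.mean(neg), statistics.stdev(neg))}

def log_likelihood(x, mu, sigma):
    """Log-density of x under a Gaussian with the given mean and stdev."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

def classify(model, x):
    # compare full class-conditional likelihoods, not just a boundary
    lp = log_likelihood(x, *model["pos"])
    ln = log_likelihood(x, *model["neg"])
    return "pos" if lp > ln else "neg"
```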

5. Incorporating prior knowledge

  • the selection of an approach that matches the researcher’s prior knowledge about the problem is crucial to the success of the analysis.

  • Implicit prior knowledge: prior knowledge may be implicitly encoded in the learning algorithm itself, in which some types of solutions are preferred over others.

  • in general, the choice of input data sets, their representations and any pre-processing must be guided by prior knowledge about data and application.

  • Probabilistic priors: pseudocounts. Roughly, a pseudocount is added to the observed data so that an event known to be possible, but never actually observed, is assigned a negligibly small probability rather than zero.

  • A particularly successful example is the use of Dirichlet mixture priors in protein modelling.
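The pseudocount idea can be sketched in a few lines (my add-one smoothing toy; Dirichlet mixture priors generalize this by choosing context-dependent pseudocounts rather than one constant):

```python
def smoothed_frequencies(counts, alphabet, pseudocount=1.0):
    """Estimate symbol probabilities with a pseudocount added to every
    symbol, so unseen-but-possible symbols get a small non-zero probability."""
    total = sum(counts.get(s, 0) for s in alphabet) + pseudocount * len(alphabet)
    return {s: (counts.get(s, 0) + pseudocount) / total for s in alphabet}
```

For a DNA alignment column with counts {"A": 7, "C": 3}, the unobserved G and T each get probability 1/14 instead of 0.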

  • Prior information in non-probabilistic models. Kernel methods are algorithms that use a general class of mathematical functions called kernels in place of a simple similarity function (specifically, the cosine of the angle between two input vectors)

6. Handling heterogeneous data

  • The most straightforward way to solve this problem is to transform each type of data into vector format before processing

  • each type of data can be encoded using a kernel function, with one kernel for each data type.
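A minimal sketch of the one-kernel-per-data-type idea (my assumption of cosine kernels and toy tuples; a real genomics application might use a string kernel for sequence and a linear kernel for expression): a weighted sum of valid kernels is itself a valid kernel, so heterogeneous data types can be combined without first flattening everything into one vector.

```python
import math

def cosine_kernel(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def combined_kernel(kernels, weights=None):
    """Build one kernel from a list of per-data-type kernels by summation."""
    weights = weights or [1.0] * len(kernels)
    def k(x, y):
        # x and y are tuples holding one representation per data type
        return sum(w * kf(xi, yi)
                   for w, kf, xi, yi in zip(weights, kernels, x, y))
    return k
```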

  • probability models provide a very different method for handling heterogeneous data.

  • In practice, an alternative method for handling heterogeneous data in a probability model is to make use of the general probabilistic mechanism for handling prior knowledge by treating one type of data before another.

7. Feature selection

  • it is important to distinguish among three distinct motivations for carrying out feature selection: (1) identify a very small set of features that yield the best possible classifier; (2) use the classifier to understand the underlying biology; (3) train the most accurate possible classifier.
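One common way to pursue the first motivation is greedy forward selection; a minimal sketch (the additive scoring function below is a toy stand-in for something like cross-validated classifier accuracy):

```python
def forward_select(features, score, k):
    """Greedy forward selection: repeatedly add the feature whose addition
    most improves score(subset); stop when nothing helps or k is reached."""
    chosen = frozenset()
    while len(chosen) < k:
        candidates = [f for f in features if f not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda f: score(chosen | {f}))
        if score(chosen | {best}) <= score(chosen):
            break                  # no remaining feature improves the score
        chosen = chosen | {best}
    return chosen
```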

8. Imbalanced class sizes

  • The most straightforward solution to this problem is to select a random, smaller subset of the data.

  • it is more appropriate to separately evaluate sensitivity and precision. (Related: the MCC discussed in the previous paper note.)

  • the most appropriate performance measure depends on the intended application of the classifier.
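A quick sketch of why plain accuracy misleads with imbalanced classes: if 99% of examples are negative, a classifier that predicts "negative" everywhere scores 99% accuracy but has zero sensitivity. Evaluating sensitivity and precision separately exposes this:

```python
def sensitivity_precision(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision
```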

9. Handling missing data

  • The simplest way to deal with data that are missing at random is to impute the missing values.

  • Another method for dealing with missing data is to include in the model information about the ‘missingness’ of each data point.

  • probability models can explicitly model missing data by considering all the potential missing values.
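A minimal sketch combining two of the approaches above (mean imputation plus an explicit "missingness" indicator per column), assuming missing values are encoded as None:

```python
def impute_with_indicator(rows):
    """Mean-impute each column's missing values (None) and append one 0/1
    'missingness' indicator column per original column."""
    ncols = len(rows[0])
    means = []
    for j in range(ncols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    out = []
    for r in rows:
        filled = [means[j] if r[j] is None else r[j] for j in range(ncols)]
        flags = [1 if r[j] is None else 0 for j in range(ncols)]
        out.append(filled + flags)
    return out
```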

10. Modelling dependence among examples

  • The most straightforward way to infer the relationships among examples is to consider each pair independently.

  • methods that infer a network as a whole are more biologically interpretable because they remove these indirect correlations.
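A toy illustration of the indirect-correlation problem (my example, assuming a simple chain A -> B -> C): even though A influences C only through B, a naive pairwise analysis still reports a strong A-C correlation, which is exactly the kind of spurious edge that whole-network methods are designed to remove.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
a = [random.gauss(0, 1) for _ in range(2000)]    # upstream gene
b = [x + random.gauss(0, 0.3) for x in a]        # regulated by A
c = [x + random.gauss(0, 0.3) for x in b]        # regulated by B only
# pairwise analysis sees a strong A-C correlation despite no direct link
```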
