Paper reading (30): A review on machine learning principles for multi-view biological data integration

Paper title: A review on machine learning principles for multi-view biological data integration

Scholar citations: 69

Pages: 16

Published: 2016.12

Journal: Briefings in Bioinformatics

Authors: Yifeng Li, Fang-Xiang Wu, and Alioune Ngom

Abstract:

Keywords: data integration, multi-omics data, random forest, multiple kernel learning, network fusion, matrix factorization, deep learning

Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in a strong need of integrative machine learning models for better use of vast volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.

Key Points:

  • We provide a comprehensive review on biological data integration techniques from a machine learning perspective.
  • Bayesian models and decision trees are discussed for incorporating prior information and integrating data of mixed data types.
  • Tri-matrix factorizations and network-based methods are reviewed for two-relational and multi-relational association studies.
  • Multi-view matrix factorization models are investigated for detecting ubiquitous and view-specific components from multi-view omics data.
  • Multi-view deep learning approaches are discussed for simultaneous use of multiple data sets in supervised and unsupervised settings.

Conclusions:

  • particularly multi-view matrix factorizations and multi-modal deep learning
  • Problem: Multi-omics data are almost unavailable in diseases other than cancer
  • Proposed solution: researchers in non-cancer studies are eagerly suggested to switch their focus to integrative analysis and work in a coordinated manner to ensure the quality and completeness of multi-platform omics data.
  • globally examine the behavior of all features in multiple classes: BNs, feature clustering or multi-view matrix factorizations
  • We list open-source packages and tools for the seven categories of integrative models in Table 2
  • Python can serve as a promising platform to realize integrative models
  • Applications of deep learning to multi-view data are still unsatisfactory at present.

Introduction:

  • we are interested in four types of multi-view data:
  1. multi-view data with different groups of samples measured by the same feature set (or called multi-class data),
  2. multi-view data with the same set of objects (samples) but several distinct feature sets,
  3. multi-view data measuring the same set of objects by the same set of features in different conditions (can be represented by a three-way sample×feature×condition tensor),
  4. multi-view data with different features and different sample sets in the same phenomenon or system, which can be further transformed to multi-relational data.
  • Type-2 and type-4 multi-view data described above are often referred to as multi-omics data.
  • Single-omics data enumerated above have the following characteristics: (1) high dimensionality, (2) redundancy, (3) highly correlated features and (4) non-negativity.
  • multi-omics data have the following characteristics: (1) mutual complementarity, (2) causality and (3) heterogeneity.
  • there are five types of data-driven analyses where integrative machine learning methods are required:
  1. multi-class feature selection and classification problem
  2. integrating multi-omics data of the same set of labeled objects is expected to escalate prediction (classification or regression) power
  3. in the above setting but without class labels, the task becomes an unsupervised learning to discover novel groups of samples.
  4. given multiple heterogeneous feature sets observed for the same group of samples, the interactions among inter-view features could be crucial to understand the pathways of a phenotype
  5. given homogeneous and heterogeneous relations within and between multiple sets of biological entries from different molecular levels and clinical descriptions, inferring the relations between inter-set entries is named association study in a complex system. 
  • data integration is an urgent need in current and future bioinformatics
  • Table 1:An overview of integrative analyses that can be conducted by machine learning methods on four types of multi-view data. Details regarding these methods and applications are described in separate sections
  • The tasks in Table 1 fall into six main categories: classification, regression, feature selection, pathway analysis, clustering, association study
  • The machine learning methodologies are divided into seven main categories: feature concatenation, Bayesian models, tree-based ensemble methods, multiple kernel learning, network-based methods, matrix factorizations and deep neural networks.
  • We list open-source packages and tools for the seven categories of integrative models in Table 2. 
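
As a rough picture of the four types of multi-view data listed above, the following sketch builds toy NumPy arrays whose shapes mirror each type. All sizes, dictionary keys and variable names are made up for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5
n_samples = 10

# Type 1: different sample groups (classes) measured on the same feature set.
type1 = {"class_a": rng.random((8, n_features)),
         "class_b": rng.random((6, n_features))}

# Type 2: the same samples described by several distinct feature sets.
type2 = {"view_1": rng.random((n_samples, 20)),
         "view_2": rng.random((n_samples, 30))}

# Type 3: same samples and same features under several conditions,
# stored as a three-way sample x feature x condition tensor.
type3 = rng.random((n_samples, n_features, 3))

# Type 4: different entity sets linked by relations, stored as
# relational (adjacency) matrices between pairs of entity sets.
type4 = {("set_a", "set_b"): rng.integers(0, 2, (12, 4)),
         ("set_a", "set_c"): rng.integers(0, 2, (12, 7))}
```

Types 2 and 4 are the layouts that the notes refer to as multi-omics data.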

Organization of the main text:

1. Introduction

2. Feature concatenation

3. Bayesian methods to integrate prior knowledge

4. Bayesian methods for data of mixed types

5. Trees of mixed data types and ensemble learning

6. Kernel learning and metric learning

7. Network-based approaches to integrate multiple homogeneous networks

8. Network-based methods for fusing multiple relational data

9. Feature extractions and matrix factorizations for detecting shared and view-specific components

10. Multi-modal deep learning

11. Discussion and conclusion

Excerpts from the main text (formulas all over the page...):

2. Feature concatenation

  • The concatenated features require additional downstream processing that may lead to loss of key information. 
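  • First, the data may contain discrete features, continuous features and so on, which then need to be converted into a type the algorithm accepts.
  • Second, features from multiple views usually have different scales, so normalization may be needed.
  • Third, feature concatenation is often unworkable with modern data which possess a high dimensionality and rich structural information.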
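
The preprocessing steps above (type conversion, then normalization, then concatenation) can be sketched roughly as follows. This is a minimal illustration with NumPy, not the paper's procedure: the view names, sizes and the choice of one-hot encoding and z-scoring are assumptions made for the example.

```python
import numpy as np

def one_hot(codes, n_categories):
    """Turn integer category codes into a 0/1 indicator matrix."""
    out = np.zeros((len(codes), n_categories))
    out[np.arange(len(codes)), codes] = 1.0
    return out

def zscore(x):
    """Scale each column to zero mean and unit variance."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # keep constant columns at zero instead of dividing by 0
    return (x - x.mean(axis=0)) / std

def concatenate_views(continuous_views, discrete_views):
    """Normalize continuous views, one-hot encode discrete ones, then stack columns."""
    blocks = [zscore(v) for v in continuous_views]
    blocks += [one_hot(codes, k) for codes, k in discrete_views]
    return np.hstack(blocks)

rng = np.random.default_rng(0)
view_large_scale = rng.normal(50.0, 10.0, size=(6, 4))  # hypothetical continuous view
view_unit_scale = rng.random((6, 3))                    # hypothetical view on a 0-1 scale
view_discrete = np.array([0, 2, 1, 1, 0, 2])            # hypothetical codes, 3 categories

X = concatenate_views([view_large_scale, view_unit_scale], [(view_discrete, 3)])
print(X.shape)  # (6, 10)
```

Even in this toy case the concatenated matrix loses the information about which column came from which view, which hints at why the notes call concatenation unworkable for high-dimensional, structured data.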

3. Bayesian methods to integrate prior knowledge

  • Problem: it may be difficult to find useful information as prior features.

4. Bayesian methods for data of mixed types

  • Bayesian networks (BNs) can naturally model multi-view data with mixed distributions for classification and feature-interaction identification purposes.
  • three obstacles challenge us to apply BNs in data integration:
  1. searching for the optimal BN structure is an NP-complete problem
  2. the number of parameters may be much larger than the sample size.
  3. inference in a BN is intractable
  • Naïve Bayes classifier is famous as a slim and swift BN model because learning its structure is needless and the inference of class label is straightforward. 
  • Tree-augmented naïve Bayes classifier outperforms the naïve Bayes classifier but keeps the efficiency of model learning.
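
To make the point about the naïve Bayes classifier concrete, here is a minimal sketch of a naïve Bayes model that handles mixed data types: Gaussian likelihoods for continuous features and Laplace-smoothed categorical likelihoods for discrete ones. The class name `MixedNaiveBayes`, its interface and the toy data are assumptions for illustration; the paper does not prescribe this implementation.

```python
import numpy as np

class MixedNaiveBayes:
    """Naive Bayes with Gaussian continuous and categorical discrete features."""

    def fit(self, X_cont, X_disc, y, n_categories):
        self.classes = np.unique(y)
        self.log_prior, self.mean, self.var, self.log_cat = [], [], [], []
        for c in self.classes:
            mask = y == c
            self.log_prior.append(np.log(mask.mean()))
            self.mean.append(X_cont[mask].mean(axis=0))
            self.var.append(X_cont[mask].var(axis=0) + 1e-6)  # avoid zero variance
            # Laplace-smoothed category frequencies, one row per discrete feature.
            counts = np.ones((X_disc.shape[1], n_categories))
            for j in range(X_disc.shape[1]):
                counts[j] += np.bincount(X_disc[mask, j], minlength=n_categories)
            self.log_cat.append(np.log(counts / counts.sum(axis=1, keepdims=True)))
        return self

    def predict(self, X_cont, X_disc):
        cols = np.arange(X_disc.shape[1])
        scores = []
        for k in range(len(self.classes)):
            # Sum of per-feature Gaussian log-densities (features assumed independent).
            gauss = -0.5 * (np.log(2 * np.pi * self.var[k])
                            + (X_cont - self.mean[k]) ** 2 / self.var[k]).sum(axis=1)
            # Sum of per-feature categorical log-probabilities.
            cat = self.log_cat[k][cols, X_disc].sum(axis=1)
            scores.append(self.log_prior[k] + gauss + cat)
        return self.classes[np.argmax(np.array(scores), axis=0)]

# Toy data: two well-separated classes, two continuous and one discrete feature.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X_cont = rng.normal(loc=y[:, None] * 5.0, scale=1.0, size=(40, 2))
X_disc = (y * 2).reshape(-1, 1)  # category 0 for class 0, category 2 for class 1

model = MixedNaiveBayes().fit(X_cont, X_disc, y, n_categories=3)
pred = model.predict(X_cont, X_disc)
```

No structure search is needed here, since the class node is the only parent of every feature, and inference reduces to summing log-likelihoods per class.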

5. Trees of mixed data types and ensemble learning

  • Decision trees should be considered as integrative models because a mixture of discrete and continuous features can be simultaneously integrated without the need to normalize features.
  • The overfitting issue of decision trees can be overcome by collective intelligence, that is, ensemble learning: bagging and boosting
  • two challenges: a few trees learned in this manner may be highly correlated; learning a collection of decision trees for multi-view data with many features becomes much more unaffordable.
  • Random forest addresses these two challenges by randomly picking features in the construction of trees.
  • There are three ways to integrate data by ensemble learning:
  1. The first way is to use the concatenated features as input of random forest. 
  2. The second way is to build multiple trees for each data view, and then use all learned trees of all views to vote for the final decision 
  3. The third way is to obtain new meta-features from multi-view data instead of using the original features.
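
The second way listed above (learn trees per view, then let all trees vote) can be sketched as below. To keep the example short, depth-one trees (decision stumps) stand in for full decision trees, labels are assumed to be binary 0/1, and all function names and toy data are hypothetical rather than taken from the paper or any library.

```python
import numpy as np

def fit_stump(X, y):
    """Find the single-feature threshold split with the fewest training errors."""
    best = (np.inf, 0, 0.0, 0, 1)  # (errors, feature, threshold, left label, right label)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            for l_lab, r_lab in ((0, 1), (1, 0)):
                errors = np.sum(np.where(left, l_lab, r_lab) != y)
                if errors < best[0]:
                    best = (errors, j, t, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, X):
    j, t, l_lab, r_lab = stump
    return np.where(X[:, j] <= t, l_lab, r_lab)

def fit_view_ensemble(X, y, n_trees, rng):
    """Bagging on one view: each stump sees a bootstrap sample and a random feature subset."""
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))
    ensemble = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, n)                 # bootstrap sample of rows
        cols = rng.choice(d, size=k, replace=False)  # random feature subset
        j, t, l_lab, r_lab = fit_stump(X[rows][:, cols], y[rows])
        ensemble.append((cols[j], t, l_lab, r_lab))  # map back to the original column
    return ensemble

def vote(ensembles, views):
    """Pool the stumps of all views and take a majority vote over 0/1 labels."""
    votes = [predict_stump(s, X) for ens, X in zip(ensembles, views) for s in ens]
    return (np.mean(votes, axis=0) >= 0.5).astype(int)

# Toy data: two views of the same 60 samples, both shifted by the class label.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
view_a = rng.normal(size=(60, 4)) + 3.0 * y[:, None]
view_b = rng.normal(size=(60, 9)) + 3.0 * y[:, None]

ensembles = [fit_view_ensemble(v, y, 15, rng) for v in (view_a, view_b)]
pred = vote(ensembles, [view_a, view_b])
```

Because each view keeps its own ensemble, the views never need to be put on a common scale, which is the property the notes attribute to tree-based integration.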

6. Kernel learning and metric learning

  • Multiple kernel learning (MKL) is an intermediate integration technique that first computes kernel (or similarity) matrices separately for each data view, then combines these matrices to generate a final kernel matrix to be used in a kernel model.
  • Metric learning aims to learn a metric function from data such that within-class samples are closer together, whereas samples from different classes are farther apart.
  • MKL should not be considered when identifying feature interactions.
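
The MKL pipeline described above (one kernel per view, then a combined kernel fed to a kernel model) can be sketched as follows. Note the simplifying assumption: the kernel weights are fixed by hand here, whereas actual MKL methods learn them from data. The RBF kernel, the kernel ridge model, the function names and the toy data are all choices made for this example.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix between all pairs of rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def combine_kernels(kernels, weights):
    """Convex combination of per-view kernels; the result stays positive semidefinite."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

def kernel_ridge_fit_predict(K_train, y_train, K_test_train, lam=1.0):
    """Kernel ridge regression on a precomputed kernel matrix."""
    alpha = np.linalg.solve(K_train + lam * np.eye(len(K_train)), y_train)
    return K_test_train @ alpha

# Toy data: two views of the same 20 samples with different feature counts.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(20, 5))
view_b = rng.normal(size=(20, 8))
y = rng.normal(size=20)

K = combine_kernels([rbf_kernel(view_a, 0.1), rbf_kernel(view_b, 0.1)], [0.7, 0.3])
fitted = kernel_ridge_fit_predict(K, y, K)
```

The combined kernel lives entirely in sample space (20 by 20 here), so individual features are no longer visible to the downstream model, which is consistent with the remark that MKL is a poor fit for identifying feature interactions.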

7. Network-based approaches to integrate multiple homogeneous networks

  • Multi-view data of cohort samples can be integrated in the sample space by network fusion methods. 
  • Although these methods are essentially nonlinear MKL models, as they are mainly presented in the context of biological network mining, we discuss them in this separate section. 

8. Network-based methods for fusing multiple relational data

  • Association problems can be solved by either kernel (relational) matrix factorization methods or graphical methods, or even a mixture of both.
  • Based on the number of relationships, association studies can be categorized to two-relational or multi-relational problems.
  • In two-relational association studies, the challenge is how to integrate multiple relational (or adjacency) matrices of two sets of biological entries. 
  • Tri-matrix factorizations, combined with network methods, are found useful in association studies of multiple sets of biological entries, where pair-wise associations are represented in relational matrices.
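
As a minimal sketch of the tri-matrix factorization idea, the code below approximates a non-negative relational matrix R between two entity sets as G1 S G2ᵀ using plain multiplicative updates. This is a bare-bones variant: it omits the orthogonality constraints and network regularization that methods in this area typically add, and the function name, ranks and toy data are assumptions for illustration.

```python
import numpy as np

def tri_factorize(R, k1, k2, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix R (n1 x n2) as G1 @ S @ G2.T."""
    rng = np.random.default_rng(seed)
    n1, n2 = R.shape
    G1 = rng.random((n1, k1))  # low-rank profile of the first entity set
    S = rng.random((k1, k2))   # interactions between the two latent spaces
    G2 = rng.random((n2, k2))  # low-rank profile of the second entity set
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        G1 *= (R @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
    return G1, S, G2

def relative_error(R, G1, S, G2):
    return np.linalg.norm(R - G1 @ S @ G2.T) / np.linalg.norm(R)

# Toy relational matrix with an exact low-rank non-negative structure.
rng = np.random.default_rng(1)
R = rng.random((30, 3)) @ rng.random((3, 2)) @ rng.random((20, 2)).T

G1, S, G2 = tri_factorize(R, k1=3, k2=2)
err_after = relative_error(R, G1, S, G2)
err_before = relative_error(R, *tri_factorize(R, k1=3, k2=2, n_iter=0))
```

Entries of the reconstructed matrix G1 S G2ᵀ that are large where R is zero or unobserved are the kind of candidate associations such methods are used to rank.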

9. Feature extractions and matrix factorizations for detecting shared and view-specific components

  • There are several benefits of using feature extraction in data integration:
  1. the nature of heterogeneous data from multiple omics sources can be accounted for separately.
  2. the high-dimensionality is dramatically reduced so that the downstream analysis will be more efficient.
  3. extracting new features separately for each data view implements the principle of divide and conquer, thus computational complexity can be significantly reduced.
  4. relational data can be well incorporated by kernel feature extraction methods.
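
The divide-and-conquer idea in the list above (extract features per view, then merge) can be sketched with per-view PCA. This only illustrates separate extraction followed by concatenation; it does not separate shared from view-specific components as the multi-view matrix factorization models in this section do. Function names, sizes and data are hypothetical.

```python
import numpy as np

def pca_extract(X, n_components):
    """Project one view onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def extract_then_merge(views, n_components):
    """Reduce each view separately, then concatenate the low-dimensional features."""
    return np.hstack([pca_extract(X, n_components) for X in views])

# Toy data: 15 samples measured in two high-dimensional views.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(15, 200))
view_b = rng.normal(size=(15, 500))

Z = extract_then_merge([view_a, view_b], n_components=3)
print(Z.shape)  # (15, 6)
```

The 700 original columns shrink to 6, and each SVD only ever sees one view, which is where the reduction in computational cost comes from.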

10. Multi-modal deep learning

  • how to precisely catch and explicitly interpret inter-view feature interactions remains an open problem.