Paper reading (30): A review on machine learning principles for multi-view biological data integration

盲人骑瞎马5555

Category: Paper Reading. Tags: data integration, multi-omics data, machine learning

Article link: https://blog.csdn.net/wxw060709/article/details/102293818

Paper title: A review on machine learning principles for multi-view biological data integration

Scholar citations: 69

Pages: 16

Published: 2016.12

Journal: Briefings in Bioinformatics

Authors: Yifeng Li, Fang-Xiang Wu, and Alioune Ngom

Keywords: data integration, multi-omics data, random forest, multiple kernel learning, network fusion, matrix factorization, deep learning

Abstract:

Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in a strong need of integrative machine learning models for better use of vast volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.

Key Points：

We provide a comprehensive review on biological data integration techniques from a machine learning perspective.
Bayesian models and decision trees are discussed for incorporating prior information and integrating data of mixed data types.
Tri-matrix factorizations and network-based methods are reviewed for two-relational and multi-relational association studies.
Multi-view matrix factorization models are investigated for detecting ubiquitous and view-specific components from multi-view omics data.
Multi-view deep learning approaches are discussed for simultaneous use of multiple data sets in supervised and unsupervised settings.

Conclusion:

particularly multi-view matrix factorizations and multi-modal deep learning
Problem: Multi-omics data are almost unavailable in diseases other than cancer.
Proposed solution: researchers in non-cancer studies are eagerly suggested to switch their focus to integrative analysis and work in a coordinated manner to ensure the quality and completeness of multi-platform omics data.
To globally examine the behavior of all features in multiple classes: Bayesian networks (BNs), feature clustering or multi-view matrix factorizations.
We list open-source packages and tools for the seven categories of integrative models in Table 2.
Python can serve as a promising platform to realize integrative models.
The application of deep learning to multi-view data is still unsatisfactory.

Introduction：

we are interested in four types of multi-view data:

multi-view data with different groups of samples measured by the same feature set (or called multi-class data),
multi-view data with the same set of objects (samples) but several distinct feature sets,
multi-view data measuring the same set of objects by the same set of features in different conditions (can be represented by a three-way sample × feature × condition tensor),
multi-view data with different features and different sample sets in the same phenomenon or system, which can be further transformed to multi-relational data.
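
The four types above can be pictured by the shapes of the arrays that would hold them. The following is only a minimal sketch on synthetic data: the view names, entity types and all sizes are hypothetical and chosen for illustration, and the relation-matrix encoding of type 4 is one possible assumption about how such data could be stored.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data only; all sizes are hypothetical

# Type 1: different sample groups measured on the same feature set (multi-class data).
type1 = {
    "class_a": rng.random((4, 5)),  # 4 samples x 5 features
    "class_b": rng.random((6, 5)),  # 6 samples x the same 5 features
}

# Type 2: the same samples described by several distinct feature sets.
type2 = {
    "expression": rng.random((8, 5)),   # 8 samples x 5 features
    "methylation": rng.random((8, 3)),  # the same 8 samples x 3 other features
}

# Type 3: same samples and same features under different conditions,
# stored as a three-way sample x feature x condition tensor.
type3 = rng.random((8, 5, 2))

# Type 4: different sample sets and different feature sets, expressed here as
# binary relation matrices between entity types (multi-relational data).
type4 = {
    ("gene", "disease"): rng.integers(0, 2, size=(5, 3)),  # 5 genes x 3 diseases
    ("disease", "drug"): rng.integers(0, 2, size=(3, 4)),  # 3 diseases x 4 drugs
}

def shapes(views):
    """Return the array shape of every view in a dict of views."""
    return {name: view.shape for name, view in views.items()}

print(shapes(type1))
print(shapes(type2))
print(type3.shape)
print(shapes(type4))
```

In type 1 the views share the column dimension, in type 2 they share the row dimension, in type 3 they share both, and in type 4 neighboring relation matrices are linked only through a common entity type.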

The type-2 and type-4 multi-view data described above are often referred to as multi-omics data.
Single-omics data enumerated above have the following characteristics: (1) high dimensionality, (2) redundancy, (3) highly correlated features and (4) non-negativity.
Multi-omics data have the following characteristics: (1) mutual complementarity, (2) causality and (3) heterogeneity.
There are five types of data-driven analyses where integrative machine learning methods are required:

multi-class feature selection and classification problem
integrating multi-omics data of the same set of labeled objects is expected to escalate prediction (classification or regression) power
in the above setting but without class labels, the task becomes an unsupervised learning to discover novel groups of samples.
given multiple heterogeneous feature sets observed for the same group of samples, the interactions among inter-view features could be crucial to understand the pathways of a phenotype.
given homogeneous and heterogeneous relations within and between multiple sets of biological entities from different molecular levels and clinical descriptions, inferring the relations between inter-set entities is named association study in a complex system.

data integration is an urgent need in current and future bioinformatics
Table 1：An overview of integrative analyses that can be conducted by machine learning methods on four types of multi-view data. Details regarding these methods and applications are described in separate sections
The main tasks in Table 1 fall into six categories: classification, regression, feature selection, pathway analysis, clustering and association study.
The machine learning methodologies fall into seven categories: feature concatenation, Bayesian models, tree-based ensemble methods, multiple kernel learning, network-based methods, matrix factorizations and deep neural networks.
We list open-source packages and tools for the seven categories of integrative models in Table 2.
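
Of the seven categories, feature concatenation is the simplest to write down, and it is instructive to compare it with a kernel-based fusion. The sketch below is not the paper's implementation: it uses synthetic data, hypothetical view names, and an unweighted average of per-view linear kernels as a simple stand-in for multiple kernel learning (real multiple kernel learning would learn the kernel weights).

```python
import numpy as np

def zscore(view):
    """Standardize each feature (column) to zero mean and unit variance."""
    std = view.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (view - view.mean(axis=0)) / std

def concatenate_views(views):
    """Feature concatenation: stack the standardized views column-wise."""
    return np.hstack([zscore(v) for v in views])

def average_linear_kernel(views):
    """Unweighted mean of per-view linear kernels (n_samples x n_samples)."""
    kernels = [zscore(v) @ zscore(v).T for v in views]
    return np.mean(kernels, axis=0)

rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 4))   # hypothetical view 1: 6 samples x 4 features
methylation = rng.normal(size=(6, 3))  # hypothetical view 2: same 6 samples x 3 features

X = concatenate_views([expression, methylation])      # 6 x 7 feature matrix
K = average_linear_kernel([expression, methylation])  # 6 x 6 similarity matrix

print(X.shape, K.shape)
```

With linear kernels and equal weights, the averaged kernel equals `X @ X.T / 2`, so the two approaches carry the same information up to a scale factor. They start to differ once each view gets its own nonlinear kernel or a learned weight.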

Organization of the main text:

1. Introduction

2. Feature concatenation

3. Bayesian methods to integrate prior knowledge

4. Bayesian methods for data of mixed types

5. Trees of mixed data types and ensemble learning

6. Kernel learning and metric learning

7. Network-based approaches to integrate multiple homogeneous networks

8. Network-based methods for fusing multiple relational data

9. Feature extractions and matrix factorizations for detecting shared and view-specific components

10. Multi-modal deep learning

11. Discussion and conclusion

Excerpts from the main text (a screen full of formulas...):