Paper reading (23): Machine Learning for Detecting Gene-Gene Interactions: A Review

Paper title: Machine Learning for Detecting Gene-Gene Interactions: A Review

Scholar citations: 215

Pages: 21

Published: 2006.06

Journal: Applied Bioinformatics

Authors: Brett A. McKinney, David M. Reif, Marylyn D. Ritchie, and Jason H. Moore

Abstract:

Keywords: Hidden layer, Genetic programming, Multifactor dimensionality reduction, Traditional statistical method, Genome-wide association study

Complex interactions among genes and environmental factors are known to play a role in common human disease aetiology. There is a growing body of evidence to suggest that complex interactions are 'the norm' and, rather than amounting to a small perturbation to classical Mendelian genetics, interactions may be the predominant effect. Traditional statistical methods are not well suited for detecting such interactions, especially when the data are high dimensional (many attributes or independent variables) or when interactions occur between more than two polymorphisms. In this review, we discuss machine-learning models and algorithms for identifying and characterising susceptibility genes in common, complex, multifactorial human diseases. We focus on the following machine-learning methods that have been used to detect gene-gene interactions: neural networks, cellular automata, random forest, and multifactor dimensionality reduction. We conclude with some ideas about how these methods and others can be integrated into a comprehensive and flexible framework for data mining and knowledge discovery in human genetics.

Conclusion:

  • New methods are needed to analyse genetic data that not only address the usual challenges posed by real-world data, but
    that also recognise interactions as an important effect rather than a perturbation to independent main effects.
  • We discussed evolution-optimised NNs and CAs, as well as the MDR and RF machine-learning models that have been successfully used to detect gene-gene interactions. (These methods are worth a closer look. This is a 2006 paper, though, written before machine learning took off, so they may well have been superseded by newer methods. Still, it is interesting to see how people were thinking at the time.)
  • MDR (multifactor dimensionality reduction) is a deterministic and conceptually simple constructive induction method that exhaustively considers every possible combination of variables up to a given order.
  • For higher-order interactions, it would then be necessary to implement an RF approach or a stochastic optimisation method to attempt to traverse the vast search space.
  • Perhaps a similar underlying order waits to be discovered in genetics through the collaborative efforts of geneticists, epidemiologists, bioinformaticists, computer scientists, physicians and others. It feels as if every field now has big data and assumes that machine learning or deep learning can be applied to it, so experts from all kinds of cross-disciplinary areas will need to pull together.
  • The conclusion of this paper is very long and reads almost like the main text, which makes me wonder why it was written at such length. Is that good scientific writing?
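
To make the "vast search space" remark above concrete, here is a minimal sketch that counts how many attribute combinations an exhaustive method would have to score at each interaction order. The figure of 1000 SNPs and the helper name `n_combinations` are assumptions chosen for illustration; they do not come from the paper.

```python
from math import comb


def n_combinations(n_snps: int, max_order: int) -> dict:
    """Number of k-way SNP combinations to score, for k = 1..max_order."""
    return {k: comb(n_snps, k) for k in range(1, max_order + 1)}


# Assumed example size: 1000 SNPs, interactions up to order 4.
counts = n_combinations(1000, 4)
for order, count in counts.items():
    print(f"order {order}: {count} combinations")
```

The count grows roughly by a factor of n/k with each additional order, which is why exhaustive enumeration stops being practical beyond low orders.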

Introduction:

  • In fact, there are reasons to believe that the effect of gene-gene interactions, or epistasis, plays a more important role than the independent main effect of any one gene in the susceptibility to common human diseases.
  • embraces the complexity of genetic architecture
  • Traditional parametric statistical methods are limited in their ability to identify interacting susceptibility genes in small sample sizes because of the sparseness of the data in high dimensions.
  • Another drawback of traditional statistical methods for identifying interactions is the need to specify a model for the interaction.
  • One of the advantages of logistic regression is the simple physical interpretation of the model and its parameters as they relate genotypes to probability of disease. However, the advantage of interpretability is nullified if the method is unable to determine which variables interact.
  • Classic applications of machine learning include speech and handwriting recognition, game playing and data mining. In 2006 machine learning was still mainly applied in these areas; after 13 years of development, its applications have become far richer.
  • This review focuses on four models: neural networks (NNs), cellular automata (CAs), random forests (RFs) and multifactor dimensionality reduction (MDR).
  • Related post: A Brief Discussion of Principal Component Analysis and Factor Analysis

Structure of the main text:

1. Introduction

2. Optimisation and Evolution

3. Neural Networks

4. Cellular Automata

5. Random Forest

6. Multifactor Dimensionality Reduction

7. A Flexible Strategy for Data Mining and Knowledge Discovery

8. Conclusion

Excerpts from the main text:

2. Optimisation and Evolution

  • In keeping with the biological and genetic theme, we focus on evolutionary algorithms for optimisation.
  • One goal of an optimisation procedure is to find a set of parameters that allows the machine-learning model to most accurately predict class membership.
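
As a rough illustration of what an evolutionary optimiser does, the sketch below evolves bit strings with tournament selection, one-point crossover, bit-flip mutation and elitism. It is a generic toy genetic algorithm, not the procedure used in the paper; the function name `evolve`, its parameters and the toy fitness (the number of 1 bits) are all assumptions.

```python
import random


def evolve(fitness, n_bits=20, pop_size=30, generations=60,
           mutation_rate=0.02, seed=0):
    """Toy genetic algorithm over bit strings; returns the best individual."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def select():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the current best over unchanged
        while len(children) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_bits)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]               # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


# Toy fitness: count of 1 bits. In a real setting the bit string would encode
# model parameters and the fitness would be classification accuracy.
best = evolve(sum)
print(sum(best), best)
```

Because the best individual is copied unchanged into each new generation, the best fitness never decreases from one generation to the next.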

3. Neural Networks

  • NNs are often considered to be a mysterious black box. Although the term "deep learning" had not yet taken off back then, the "black box" label was already widely known.
  • The activation function is still sigmoid, not yet ReLU.
  • GPNN: genetic programming (GP), applied in the following section to NNs, is similar to genetic algorithms, except the evolutionary operators act on binary expression trees instead of binary arrays or chromosomes. Each GP binary expression tree represents an NN.
  • GPNN is able to detect the functional SNPs and model the interactions for the epistasis models described.
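
To give a feel for "a tree represents an NN" with sigmoid units, the sketch below evaluates a small network written as a nested tree. The encoding (a leaf is an input name, an internal node is a bias plus weighted children) is a simplified assumption made for this note and is not the GPNN encoding from the paper. The weights are set by hand, not evolved, so that the network computes XOR, a pattern in which neither input is informative on its own.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(node, inputs):
    """Evaluate a tree: a str leaf is an input; otherwise (bias, [(w, child), ...])."""
    if isinstance(node, str):
        return inputs[node]
    bias, weighted_children = node
    total = bias + sum(w * evaluate(child, inputs) for w, child in weighted_children)
    return sigmoid(total)


# Hand-set weights (assumed for illustration): XOR = OR and not AND.
OR_UNIT = (-10.0, [(20.0, "x1"), (20.0, "x2")])
AND_UNIT = (-30.0, [(20.0, "x1"), (20.0, "x2")])
XOR_NET = (-10.0, [(20.0, OR_UNIT), (-20.0, AND_UNIT)])


def predict(x1, x2):
    return round(evaluate(XOR_NET, {"x1": x1, "x2": x2}))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, predict(a, b))
```

In an evolutionary setting, crossover and mutation would swap or alter subtrees such as `OR_UNIT`, which changes both the weights and the architecture of the network.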

4. Cellular Automata

  • CAs are discrete, dynamical systems capable of performing computations on a lattice of cells.
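  • Cellular automata (singular: cellular automaton) provide a powerful way to simulate complex phenomena, including self-organising structures. The basic idea of the CA model is that many complex structures and processes in nature ultimately arise from simple interactions among a large number of basic units. CA research mainly studies theoretical models in which small computers or components are linked by neighbourhood connections into larger computers or components that work in parallel. CAs are divided into four types: fixed-value, periodic, chaotic and complex.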

  • It was shown that CAs have very good power for identifying gene-gene interactions even in the presence of real-world sources of noise such as genotyping error and phenocopy.
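
The "lattice of cells updated by a simple local rule" idea can be sketched with a one-dimensional, two-state (elementary) cellular automaton. This only illustrates the CA concept; it is not the classifier described in the paper, and the function names, the wrap-around boundary and the choice of rule 90 are assumptions.

```python
def step(cells, rule):
    """Apply an elementary CA rule (0-255) once, with wrap-around boundaries."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        idx = (left << 2) | (centre << 1) | right   # neighbourhood as a 3-bit number
        nxt.append((rule >> idx) & 1)               # look up the new state in the rule
    return nxt


def run(cells, rule, n_steps):
    """Return the list of lattice states from step 0 to step n_steps."""
    history = [cells]
    for _ in range(n_steps):
        history.append(step(history[-1], rule))
    return history


# Rule 90: each new cell is the XOR of its left and right neighbours.
history = run([0, 0, 0, 1, 0, 0, 0], rule=90, n_steps=3)
for row in history:
    print("".join("#" if c else "." for c in row))
```

The whole behaviour is determined by the 8-entry rule table, which is what makes such rules a natural target for evolutionary search.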

5. Random Forest

  • An RF is a collection of individual decision-tree classifiers, where each tree in the forest has been trained using a bootstrap sample of instances from the data, and each split attribute in the tree is chosen from among a random subset of attributes.
  • These models may uncover interactions among genes and/or environmental factors that do not exhibit strong marginal effects.
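
The two sources of randomness in the RF description above (a bootstrap sample per tree and a random attribute subset per split) can be sketched as follows. This shows only the sampling steps, not tree growing or voting; the helper names and the sizes (10 individuals, 20 attributes, 4 candidates per split) are assumptions.

```python
import random


def bootstrap_sample(instances, rng):
    """Draw len(instances) instances with replacement (one tree's training set)."""
    return [rng.choice(instances) for _ in instances]


def candidate_attributes(n_attributes, n_candidates, rng):
    """Pick the random subset of attribute indices considered at one split."""
    return sorted(rng.sample(range(n_attributes), n_candidates))


rng = random.Random(0)
instances = list(range(10))                 # stand-ins for 10 genotyped individuals
sample = bootstrap_sample(instances, rng)   # may contain repeats
out_of_bag = sorted(set(instances) - set(sample))  # individuals this tree never sees
split_candidates = candidate_attributes(n_attributes=20, n_candidates=4, rng=rng)

print("bootstrap sample:", sample)
print("out-of-bag:", out_of_bag)
print("attributes tried at this split:", split_candidates)
```

Because every split sees a different random subset, an attribute with a weak marginal effect still gets a chance to be chosen below a split on its interaction partner.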

6. Multifactor Dimensionality Reduction

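  • A few concepts that are easy to confuse:
  • Principal component analysis (PCA) transforms the original variables into linear combinations of them (the principal components), simplifying the data and reducing its dimensionality while retaining most of the information.
  • Factor analysis (FA) is a data-simplification technique that studies the internal dependencies among many variables to uncover the basic structure of the observed data, and represents the original data with a small number of hypothetical variables (factors).
  • Multifactor dimensionality reduction (MDR) is an analysis method developed in statistics in recent years. Here "factor" means a variable in the interaction study, and "dimensionality" refers to the number of multifactor combinations under study. It makes up for the limitations of logistic regression in handling higher-order interactions.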
  • MDR is a machine-learning method specifically designed to identify interacting combinations of genetic variations associated with increased risk of common, complex, multifactorial human diseases.
  • Application of MDR to case-control datasets has routinely yielded evidence of epistasis in
    the absence of main effects.
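
A minimal sketch of the MDR idea, under several simplifying assumptions: genotypes are coded 0/1 instead of the usual three SNP genotypes, a genotype cell is labelled high-risk when its cases outnumber its controls (ties count as low-risk), and accuracy is measured on the training data with no cross-validation. The function names and the toy dataset are hypothetical. Disease status is the XOR of loci 0 and 2, so neither locus shows a main effect on its own.

```python
from collections import Counter
from itertools import combinations


def mdr_accuracy(genotypes, status, loci, threshold=1.0):
    """Pool multilocus genotype cells into high/low risk and score the result."""
    cases, controls = Counter(), Counter()
    for row, is_case in zip(genotypes, status):
        cell = tuple(row[i] for i in loci)
        (cases if is_case else controls)[cell] += 1
    # A cell is high-risk when its case:control ratio exceeds the threshold.
    high_risk = {cell for cell in set(cases) | set(controls)
                 if cases[cell] > threshold * controls[cell]}
    correct = sum((tuple(row[i] for i in loci) in high_risk) == bool(is_case)
                  for row, is_case in zip(genotypes, status))
    return correct / len(status)


def best_model(genotypes, status, order):
    """Exhaustively score every combination of `order` loci; return the best."""
    n_loci = len(genotypes[0])
    return max(combinations(range(n_loci), order),
               key=lambda loci: mdr_accuracy(genotypes, status, loci))


# Toy data: 3 loci, status = locus 0 XOR locus 2, locus 1 is irrelevant.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
status = [a ^ c for a, b, c in genotypes]

print("locus 0 alone:", mdr_accuracy(genotypes, status, (0,)))
print("loci 0 and 2:", mdr_accuracy(genotypes, status, (0, 2)))
print("best pair:", best_model(genotypes, status, 2))
```

On this toy data every single-locus model is at chance level, while the pair (0, 2) separates cases from controls completely, which is the "epistasis in the absence of main effects" situation.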

7. A Flexible Strategy for Data Mining and Knowledge Discovery

  • a four-step framework for data mining and knowledge discovery that can integrate constructive induction algorithms such as MDR with other machine-learning methods such as NNs, CAs and RFs.