[Paper Notes] Deep Neural Networks for High Dimension, Low Sample Size Data

This paper examines the challenges of handling high dimension, low sample size (HDLSS) data in fields such as bioinformatics. To address them, the authors propose the Deep Neural Pursuit (DNP) model, which combines feature selection with a multiple-dropout technique to mitigate the overfitting caused by high dimensionality and the high-variance gradients caused by the small sample size. DNP selects features in a greedy, incremental manner while simultaneously training the classifier, enabling end-to-end learning. Experiments show that DNP performs well on both synthetic data and real-world biological datasets, outperforming methods such as Lasso, GBFS, and HSIC-Lasso.


Publication: IJCAI'17: Proceedings of the 26th International Joint Conference on Artificial Intelligence, August 2017

Code

GBFS code: http://www.cse.wustl.edu/˜xuzx/research/code/GBFS.zip (link no longer reachable)
HSIC-Lasso code: http://www.makotoyamada-ml.com/software.html (link on the page has expired)

Dataset

Biological datasets: http://featureselection.asu.edu/datasets.php

Introduction

In bioinformatics, gene expression data suffers from the growing challenges of high dimensionality and low sample size. This kind of high dimension, low sample size (HDLSS) data is also vital for scientific discoveries in other areas such as chemistry and financial engineering [Fan and Li, 2006]. When processing such data, severe overfitting and high-variance gradients are the major challenges for the majority of machine learning algorithms [Friedman et al., 2000].

Feature selection has been widely regarded as one of the most powerful tools for analyzing HDLSS data. However, selecting the optimal subset of features is known to be NP-hard [Amaldi and Kann, 1998], so a large body of compromise methods for feature selection has been proposed instead.

  1. Lasso [Tibshirani, 1996] pursues sparse linear models. However, sparse linear models ignore nonlinear input-output relations and interactions among features (a minimal example follows this list).
  2. Nonlinear feature selection via kernel methods [Li et al., 2005; Yamada et al., 2014] or gradient boosted trees addresses the curse of dimensionality only under the blessing of a large sample size.
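
To make item 1 concrete, below is a minimal sketch of Lasso-based selection with scikit-learn; the synthetic data, the value of `alpha`, and the use of nonzero coefficients as the selected set are illustrative assumptions, not settings from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic HDLSS data: n = 40 samples, d = 1000 features (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 1000))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(40)  # 2 informative features

# The l1 penalty drives most coefficients exactly to zero, yielding a sparse model.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of features with nonzero weight
print(selected)
```

Because the model is linear in X, a label that depends on an interaction such as X[:, 0] * X[:, 1] would be invisible to this selector; that limitation motivates the nonlinear methods in item 2 and DNP itself.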

Deep neural network (DNN) methods light the way to new scientific discoveries. DNNs have achieved breakthroughs in modeling nonlinearity across a wide range of applications: the deeper the architecture of a DNN is, the more complex the relations it can model. DNNs have also harvested initial successes in bioinformatics, e.g., for modeling splicing [Xiong et al., 2015] and sequence specificity [Alipanahi et al., 2015]. However, estimating the huge number of parameters of a DNN can suffer from severe overfitting even with abundant samples, not to mention in the HDLSS setting.

To address the challenges of HDLSS data, the authors propose an end-to-end DNN model called Deep Neural Pursuit (DNP). DNP simultaneously selects features and learns a classifier to alleviate the severe overfitting caused by high dimensionality. By averaging over multiple dropouts, DNP is robust to the high-variance gradients resulting from the small sample size. From the perspective of feature selection, DNP selects features greedily and incrementally, similar to matching pursuit [Pati et al., 1993]. More concretely, starting from a feature subset containing only the bias, DNP incrementally selects one feature at a time according to the magnitude of the backpropagated gradients. Meanwhile, each time a new feature is selected, the network is updated using backpropagation. A sketch of this selection loop follows.
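
Below is a minimal sketch of the greedy selection loop, assuming a single-hidden-layer network in PyTorch; the hyperparameters, the use of Adam (standing in for the paper's adaptive per-layer learning rates), and the stopping rule of a fixed number k of features are placeholder assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def dnp_select(X, y, k=5, n_dropouts=10, hidden=32, steps=200, lr=0.01):
    """Greedy DNP-style selection sketch: grow the active feature set one
    feature at a time, ranking candidates by the magnitude of their
    backpropagated input-layer gradients averaged over several dropout masks.
    X: (n, d) float tensor; y: (n, 1) float tensor of 0/1 labels."""
    n, d = X.shape
    selected = []                                          # chosen feature indices
    w_in = torch.zeros(d, hidden, requires_grad=True)      # input-layer weights
    b_h = (0.1 * torch.randn(hidden)).requires_grad_()     # hidden bias
    w_out = (0.1 * torch.randn(hidden, 1)).requires_grad_()
    b = torch.zeros(1, requires_grad=True)

    def forward(drop=False):
        h = torch.relu(X @ w_in + b_h)
        if drop:
            h = F.dropout(h, p=0.5, training=True)
        return h @ w_out + b

    for _ in range(k):
        # Average input-layer gradients over multiple dropout masks to tame
        # the high-variance gradients caused by the small sample size.
        grad = torch.zeros(d, hidden)
        for _ in range(n_dropouts):
            loss = F.binary_cross_entropy_with_logits(forward(drop=True), y)
            g, = torch.autograd.grad(loss, w_in)
            grad += g / n_dropouts
        scores = grad.norm(dim=1)            # one score per candidate feature
        if selected:
            scores[selected] = -1.0          # never re-pick a selected feature
        selected.append(int(scores.argmax()))

        # Retrain with backpropagation; rows of w_in outside the selected set
        # are clamped back to zero so the model stays sparse in its inputs.
        mask = torch.zeros(d, 1)
        mask[selected] = 1.0
        opt = torch.optim.Adam([w_in, b_h, w_out, b], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            F.binary_cross_entropy_with_logits(forward(), y).backward()
            opt.step()
            with torch.no_grad():
                w_in.mul_(mask)
    return selected
```

For instance, with X of shape (n, d) and 0/1 float labels y of shape (n, 1), dnp_select(X, y, k=5) returns the indices of five greedily chosen features.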

The main contribution of this paper is to tailor the DNN for the HDLSS setting using feature selection and multiple dropouts.

Related Work

We discuss feature selection methods used to analyze HDLSS data, including linear, nonlinear, and incremental methods.

  1. Sparsity-inducing regularizers are among the dominant feature selection methods for HDLSS data.
    Lasso [Tibshirani, 1996] minimizes an objective function penalized by the l_1 norm of the feature weights, leading to a sparse model. Unfortunately, Lasso ignores nonlinearity and interactions among features.
  2. (1) Kernel methods are often used for nonlinear feature selection.
    Feature Vector Machine (FVM) [Li et al., 2005];
    HSIC-Lasso [Yamada et al., 2014] improves on FVM by allowing different kernel functions for features and labels;
    LAND [Yamada et al., 2016] further accelerates HSIC-Lasso on data with a large sample size via kernel approximation and distributed computation.
    (2) Decision tree models can also capture nonlinear input-output relations (see the sketch after this list):
    random forests [Breiman, 2001];
    gradient boosted feature selection (GBFS) [Xu et al., 2014].
    The aforementioned nonlinear methods, including FVM, random forests, and GBFS, require training data with a large sample size.
    HSIC-Lasso and LAND fit the HDLSS setting. However, unlike the proposed end-to-end DNP model, they are two-stage algorithms that separate feature selection from classification.
  3. Besides DNP, there exist other greedy and incremental feature selection algorithms.
    **SpAM** sequentially selects individual features in an additive manner, thereby missing important interactions among features.
    The grafting method and the convex neural network only consider a single hidden layer and differ from DNP in motivation: grafting focuses on accelerating the algorithm, while the convex neural network focuses on the theoretical understanding of neural networks.
  4. Deep feature selection (DFS) selects features in the context of DNNs. However, according to the authors' experiments, DFS fails to achieve sparse connections when facing HDLSS data.
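
As a rough illustration of tree-based nonlinear selection (not the exact GBFS algorithm, which adds a per-feature selection cost to the boosting objective), a gradient boosted model's impurity-based importances can rank features; the synthetic data and the top-5 cutoff below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
# Nonlinear rule: the label depends on an interaction between features 0 and 1,
# which a sparse *linear* model such as Lasso cannot capture.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

gbt = GradientBoostingClassifier(n_estimators=100).fit(X, y)
ranking = np.argsort(gbt.feature_importances_)[::-1]
print(ranking[:5])   # features 0 and 1 should dominate the ranking
```

Note the sample size here (n = 200) is already far larger than a typical HDLSS regime, which is exactly the limitation of these tree-based methods noted above.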

DNP Model

Notation: $F \in \mathbb{R}^{d}$ is the $d$-dimensional input feature space; $X = (X_{1}, X_{2}, \dots, X_{n})$ are the $n$ training samples and $y = (y_{1}, y_{2}, \dots, y_{n})^{T}$ their labels.
