Deep Neural Networks for High Dimension, Low Sample Size Data
Publication: IJCAI’17: Proceedings of the 26th International Joint Conference on Artificial Intelligence, August 2017
Code
GBFS algorithm: http://www.cse.wustl.edu/˜xuzx/research/code/GBFS.zip (link no longer reachable)
HSIC-Lasso code: http://www.makotoyamada-ml.com/software.html (the page has expired)
Dataset
Biological datasets: http://featureselection.asu.edu/datasets.php
Introduction
In bioinformatics, gene expression data suffers from the growing challenges of high dimensionality and low sample size. This kind of high dimension, low sample size (HDLSS) data is also vital for scientific discoveries in other areas such as chemistry and financial engineering [Fan and Li, 2006]. When processing this kind of data, severe overfitting and high-variance gradients are the major challenges for the majority of machine learning algorithms [Friedman et al., 2000].
Feature selection has been widely regarded as one of the most powerful tools to analyze the HDLSS data. However, selecting the optimal subset of features is known to be NP-hard [Amaldi and Kann, 1998]. Instead, a large body of compromised methods for feature selection have been proposed.
- Lasso [Tibshirani, 1996] pursues sparse linear models: sparse linear models ignore the nonlinear input-output relations and the interactions among features.
- Nonlinear feature selection via kernel methods [Li et al., 2005; Yamada et al., 2014] or gradient boosted trees: these address the curse of dimensionality only under the blessing of a large sample size.
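To make the sparsity point above concrete, here is a minimal scikit-learn sketch (not the paper's code; the synthetic data and the `alpha` value are made-up assumptions) showing Lasso zeroing out most weights in an HDLSS regime:

```python
# Minimal sketch (assumed setup, not from the paper): Lasso on HDLSS data
# with d=100 features but only n=20 samples, where only 2 features matter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, d = 20, 100                      # low sample size, high dimension
X = rng.randn(n, d)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(n)  # 2 true features

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)   # indices of nonzero weights
print(len(selected))                     # sparse: far fewer than d=100
```

The l_1 penalty drives most of the 100 coefficients exactly to zero, which is why Lasso is usable here at all, but the model remains linear in the inputs.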
Deep neural network (DNN) methods light up new scientific discoveries. DNNs have achieved breakthroughs in modeling nonlinearity across a wide range of applications. The deeper the architecture of a DNN is, the more complex the relations it can model. DNNs have harvested initial successes in bioinformatics for modeling splicing [Xiong et al., 2015] and sequence specificity [Alipanahi et al., 2015]. However, estimating the huge number of parameters of a DNN may suffer from severe overfitting even with abundant samples, not to mention in the HDLSS setting.
To address the challenges of the HDLSS data, we propose an end-to-end DNN model called Deep Neural Pursuit (DNP). DNP simultaneously selects features and learns a classifier to alleviate the severe overfitting caused by high dimensionality. By averaging over multiple dropouts, DNP is robust against the high-variance gradients resulting from the small sample size. From the perspective of feature selection, the DNP model selects features greedily and incrementally, similar to matching pursuit [Pati et al., 1993]. More concretely, starting from an empty subset of features and a bias, the proposed DNP method incrementally selects an individual feature according to the backpropagated gradients. Meanwhile, whenever new features are selected, DNP is updated using the backpropagation algorithm.
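The greedy loop described above can be sketched as follows. This is a simplified, illustrative NumPy version, not the paper's implementation: a plain logistic model stands in for the deep network, and the dropout rate, learning rate, and feature counts are arbitrary assumed values. Gradients with respect to all input weights are averaged over multiple dropout masks, the unselected feature with the largest gradient magnitude is added, and the selected weights are then refit by gradient descent.

```python
# Simplified DNP-style greedy selection (illustrative sketch only; a logistic
# model replaces the deep network, hyperparameters are assumed).
import numpy as np

rng = np.random.RandomState(0)
n, d, k = 30, 200, 5               # HDLSS: n << d; select k features
X = rng.randn(n, d)
y = (X[:, [3, 50, 120]].sum(axis=1) > 0).astype(float)  # 3 relevant features

def grads(w, b, mask):
    """Logistic-loss gradients with an (inverted) dropout mask on the inputs."""
    Xd = X * mask
    z = np.clip(Xd @ w + b, -30.0, 30.0)   # clip logits for numerical safety
    err = 1.0 / (1.0 + np.exp(-z)) - y
    return Xd.T @ err / n, err.mean()

selected, w, b = [], np.zeros(d), 0.0      # start from an empty subset + bias
for _ in range(k):
    # 1) average input-weight gradients over multiple dropout draws
    g = np.zeros(d)
    for _ in range(10):
        mask = (rng.rand(d) < 0.5) / 0.5   # keep prob. 0.5, inverted scaling
        g += grads(w, b, mask)[0] / 10
    # 2) greedily add the unselected feature with the largest |gradient|
    g[np.asarray(selected, dtype=int)] = 0.0
    selected.append(int(np.argmax(np.abs(g))))
    # 3) refit the weights of the selected features by gradient descent
    for _ in range(200):
        gw, gb = grads(w, b, np.ones(d))
        w[selected] -= 0.5 * gw[selected]
        b -= 0.5 * gb
print(sorted(selected))
```

The key design choice mirrored here is that candidate features are ranked by their backpropagated gradient magnitudes (averaged over dropout draws to tame variance) while only the already-selected weights are ever updated.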
The main contribution of this paper is to tailor the DNN for the HDLSS setting using feature selection and multiple dropouts.
Related Work
We discuss feature selection methods used to analyze HDLSS data, including linear, nonlinear, and incremental methods.
- Sparsity-inducing regularizers are among the dominant feature selection methods for HDLSS data.
Lasso [Tibshirani, 1996] minimizes an objective function penalized by the l_1 norm of the feature weights, leading to a sparse model. Unfortunately, Lasso ignores nonlinearity and interactions among features.
- (1) Kernel methods are often used for nonlinear feature selection:
Feature Vector Machine (FVM) [Li et al., 2005];
HSIC-Lasso [Yamada et al., 2014] improves FVM by allowing different kernel functions for features and labels;
LAND [Yamada et al., 2016] further accelerates HSIC-Lasso for data with a large sample size via kernel approximation and distributed computation.
(2) Decision tree models are also qualified for modeling nonlinear input-output relations.
random forests [Breiman, 2001]
Gradient boosted feature selection (GBFS) [Xu et al., 2014]
The aforementioned nonlinear methods, including FVM, random forests, and GBFS, require training data with a large sample size.
HSIC-Lasso and LAND fit the HDLSS setting. However, compared to the proposed DNP model, which is end-to-end, HSIC-Lasso and LAND are two-stage algorithms that separate feature selection from classification.
- Besides the DNP method, there exist other greedy and incremental feature selection algorithms.
**SpAM:** sequentially selects an individual feature in an additive manner, thereby missing important interactions among features.
**Grafting & convex neural networks:** only consider a single hidden layer, and differ from DNP in motivation (Grafting focuses on accelerating algorithms, while convex neural networks focus on the theoretical understanding of neural networks).
- **Deep feature selection (DFS):** selects features in the context of DNNs. However, according to our experiments, DFS fails to achieve sparse connections when facing HDLSS data.
DNP Model
Introduce notations: $F \in \mathbb{R}^{d}$ — the d-dimensional input feature space;
$X = (X_{1}, X_{2}, \dots, X_{n})$, $y = (y_{1}, y_{2}, \dots, y_{n})^{T}$