Training set size for neural networks considering the curse of dimensionality

I'm learning the ropes of neural networks. Recently, I read stuff about the curse of dimensionality and how it might lead to overfitting (e.g. here).

If I understand correctly, the number of features (dimensions) d of a given dataset with n data points is very important when considering the size t of the training set.

QUESTIONS

(...not sure if all my questions are really connected to the curse of dimensionality)

  1. How do I choose the correct training size t considering d and n? Is t a function of d and n?
  2. Do I have to consider d for regularization?

 

One rule of thumb is to have at least 10× as many data points as dimensions. With some intelligent prior information (e.g. a good kernel in an SVM), you may even be able to learn a good machine with fewer data points than dimensions.
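A minimal sketch of that heuristic as a quick sanity check (the function name and the factor of 10 are just illustrative, not a hard rule):

```python
def recommended_min_samples(n_features: int, factor: int = 10) -> int:
    """Rule-of-thumb lower bound on training-set size: ~factor x the number of dimensions."""
    return factor * n_features

d, n = 50, 300                            # number of features and available data points
print(recommended_min_samples(d))         # 500
print(n >= recommended_min_samples(d))    # False: consider more data, or stronger priors/regularization
```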

The lecture on VC dimension by Yaser Abu-Mostafa [link] motivates this 10× rule with some nice charts. If you are not familiar with the concept, VC dimension measures the capacity of a learning model: the higher the VC dimension, the more complex the functions the model can fit, and the more data you need. For example, the classical perceptron on d-dimensional inputs has VC dimension d+1. Some hypothesis classes have infinite VC dimension; such problems are impossible to learn in this framework, since no finite training set guarantees generalization.
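In symbols, the heuristic and the perceptron example read roughly as follows (the factor of 10 is the rule of thumb from the lecture, not a theorem):

```latex
% Heuristic from the lecture: training-set size vs. VC dimension
\[ N \gtrsim 10 \cdot d_{\mathrm{VC}} \]
% Classical perceptron on d-dimensional inputs
\[ d_{\mathrm{VC}}(\text{perceptron in } \mathbb{R}^{d}) = d + 1
   \quad\Rightarrow\quad N \gtrsim 10\,(d + 1) \]
```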

A neural net is a linear model in derived variables. Take the regression case, because it is a little bit simpler:

$$\hat{y} = \beta_0 + \beta^{\top} h_L, \qquad h_{\ell} = \sigma\!\left(\Gamma_{\ell}\, h_{\ell-1} + \gamma_{\ell}\right), \qquad h_0 = X,$$

where $X$ is your data (i.e. your features), the $\Gamma_\ell$ are matrices of weights, the $\gamma_\ell$ are "biases", and $\beta$ holds the weights connecting the topmost hidden layer to the output. You can see that this is nothing more than a linear model, but in nonlinear functions of $X$.
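A minimal NumPy sketch of this view for a single hidden layer (the shapes, the tanh nonlinearity, and the random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, h = 200, 5, 16               # samples, input features, hidden units
X = rng.normal(size=(n, d))

Gamma = rng.normal(size=(d, h))    # weight matrix into the hidden layer
gamma = rng.normal(size=h)         # hidden-layer "biases"
beta = rng.normal(size=h)          # weights from the topmost hidden layer to the output
beta0 = 0.5

H = np.tanh(X @ Gamma + gamma)     # derived (nonlinear) variables
y_hat = beta0 + H @ beta           # a plain linear model in the derived variables H
```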

Just like in a linear model, you can overfit when you have too many parameters. A typical strategy for avoiding overfitting is regularization. Rather than solving

$$\min_{\beta} \; \lVert y - X\beta \rVert^2,$$

you solve

$$\min_{\beta} \; \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2$$

in ridge regression, for example. Selecting $\lambda$ by cross-validation, you're effectively letting the data tell you how much use to make of your many dimensions.
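With scikit-learn this looks roughly like the following sketch (the synthetic data and the candidate grid of penalties, called alphas there, are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, d = 100, 50                                 # few samples relative to the dimension
X = rng.normal(size=(n, d))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Cross-validation picks the penalty strength lambda (alpha in scikit-learn).
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(model.alpha_)                            # the lambda the data "chose"
```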

This generalizes directly to neural nets, except that there is no closed-form solution to the minimization problem, as there is in ridge regression. You'll overfit if you solve

$$\min_{\theta} \; \lVert y - f(X; \theta) \rVert^2$$

rather than the penalized problem

$$\min_{\theta} \; \lVert y - f(X; \theta) \rVert^2 + \lambda \lVert \theta \rVert^2,$$

where $f(X;\theta)$ is the network output defined above and $\theta$ is a concatenated vector of all of your weights.
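For instance, scikit-learn's MLPRegressor exposes this L2 penalty on the weights as its alpha parameter; a minimal sketch (the architecture and the synthetic data are illustrative assumptions, not a recommendation):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# alpha is the L2 penalty (the lambda above) applied to all of the network's weights theta.
net = MLPRegressor(hidden_layer_sizes=(32, 16), alpha=1e-2,
                   max_iter=5000, random_state=0).fit(X, y)
print(net.score(X, y))   # in-sample R^2; judge overfitting on held-out data, not on this
```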

Note that the quadratic penalty here isn't the only form of regularization. You could also do L1 regularization, or dropout regularization.
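For illustration, dropout amounts to randomly zeroing derived variables during training; a tiny NumPy sketch of an "inverted dropout" mask (the keep probability of 0.5 and the batch shape are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))             # hidden-layer activations for a mini-batch

p_keep = 0.5                            # probability of keeping each unit
mask = rng.random(H.shape) < p_keep     # which units survive this training step
H_dropped = H * mask / p_keep           # rescale so the expected activation is unchanged

# At test time no mask is applied; the scaling above keeps train/test magnitudes comparable.
```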

But the idea is the same: build a model that is flexible enough to overfit the data, and then find a regularization parameter (by some variant of cross-validation) that constrains the variability enough that you don't overfit.
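A minimal sketch of that recipe, using cross-validated grid search over the weight-decay parameter (the grid, the architecture, and the synthetic data are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.2, size=300)

# Deliberately flexible network; cross-validation decides how hard to constrain it.
search = GridSearchCV(
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0),
    param_grid={"alpha": np.logspace(-4, 1, 6)},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)   # the regularization strength chosen by the data
```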

 
