Introduction
When I used gradient descent to fit h(x) to data generated from ‘x^2 + 2*x + 1’, I ran into a problem: the learning rate alpha had to be tiny, around 0.000001, otherwise the parameters would not converge, and training became unbearably slow. So I used Feature Scaling.
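To see the problem concretely, here is a small Python/NumPy sketch (my own illustration, not the original code; the data and learning rates are made up): with the unscaled features [1, x, x^2], a modest learning rate makes the parameters blow up, while a tiny one keeps the loss falling, just very slowly.

```python
import numpy as np

# Toy data: y = x^2 + 2x + 1, with unscaled features [1, x, x^2].
x = np.arange(1.0, 100.0)
X = np.column_stack([np.ones_like(x), x, x ** 2])
y = x ** 2 + 2 * x + 1

def descend(X, y, alpha, steps):
    """Plain batch gradient descent on half mean-squared error."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / len(y)
        theta -= alpha * grad
    return 0.5 * np.mean((X @ theta - y) ** 2)

loss0 = 0.5 * np.mean(y ** 2)           # loss at theta = 0
loss_tiny = descend(X, y, 1e-9, 100)    # tiny alpha: stable but slow
with np.errstate(over="ignore", invalid="ignore"):
    loss_big = descend(X, y, 1e-3, 100)  # modest alpha: diverges

print(loss_tiny < loss0)                              # True: loss went down
print(not np.isfinite(loss_big) or loss_big > loss0)  # True: it blew up
```

Because the x^2 column reaches nearly 10000 while the bias column stays at 1, the safe learning rates for the different parameters differ by many orders of magnitude, and the smallest one wins.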
Concept
For example, there are two features.
The first feature ranges from [1,100], and the second feature ranges from [1,10000].
So if we draw the contour map, it may look like this (the image on the left):
If we narrow each feature to roughly the range [-1,1] (a little smaller or larger is fine, e.g. from about [-1/3,1/3] up to [-3,3]), the contours become much rounder, gradient descent behaves like the image on the right, and training takes much less time.
So, for each column we find the max and min, take their difference (the column’s range), and divide every entry in that column by it.
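As a sketch of this rule in Python/NumPy (the variable names are my own, not from the original code): dividing each feature column by its range shrinks every column’s spread to exactly 1.

```python
import numpy as np

# Hypothetical data matrix: bias column, x, and x^2, as in the example below.
x = np.array([90.0, 3.0, 68.0, 43.0, 4.0])
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Range (max - min) of each feature column; skip the constant bias column.
ranges = X[:, 1:].max(axis=0) - X[:, 1:].min(axis=0)
X_scaled = np.column_stack([X[:, 0], X[:, 1:] / ranges])

# After scaling, each feature column spans exactly 1.
spans = X_scaled[:, 1:].max(axis=0) - X_scaled[:, 1:].min(axis=0)
print(np.allclose(spans, 1.0))  # True
```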
For example, suppose we have a data matrix X like this:
X =
1 90 8100
1 3 9
1 68 4624
1 43 1849
1 4 16
1 88 7744
1 76 5776
1 21 441
1 12 144
1 60 3600
1 5 25
1 35 1225
1 24 576
1 5 25
1 90 8100
1 62 3844
1 6 36
1 82 6724
1 77 5929
1 15 225
1 38 1444
1 48 2304
1 46 2116
1 92 8464
1 21 441
1 45 2025
Applying feature scaling:
ranges = max(X,[],1) - min(X,[],1);
X = [ones(m,1), X(:,2) ./ ranges(2), X(:,3) ./ ranges(3)];
We get the scaled X:
X =
1.0000e+000 9.0909e-001 8.1008e-001
1.0000e+000 3.0303e-002 9.0009e-004
1.0000e+000 6.8687e-001 4.6245e-001
1.0000e+000 4.3434e-001 1.8492e-001
1.0000e+000 4.0404e-002 1.6002e-003
1.0000e+000 8.8889e-001 7.7448e-001
1.0000e+000 7.6768e-001 5.7766e-001
1.0000e+000 2.1212e-001 4.4104e-002
1.0000e+000 1.2121e-001 1.4401e-002
1.0000e+000 6.0606e-001 3.6004e-001
1.0000e+000 5.0505e-002 2.5003e-003
1.0000e+000 3.5354e-001 1.2251e-001
1.0000e+000 2.4242e-001 5.7606e-002
1.0000e+000 5.0505e-002 2.5003e-003
1.0000e+000 9.0909e-001 8.1008e-001
1.0000e+000 6.2626e-001 3.8444e-001
1.0000e+000 6.0606e-002 3.6004e-003
1.0000e+000 8.2828e-001 6.7247e-001
1.0000e+000 7.7778e-001 5.9296e-001
1.0000e+000 1.5152e-001 2.2502e-002
1.0000e+000 3.8384e-001 1.4441e-001
1.0000e+000 4.8485e-001 2.3042e-001
1.0000e+000 4.6465e-001 2.1162e-001
1.0000e+000 9.2929e-001 8.4648e-001
1.0000e+000 2.1212e-001 4.4104e-002
1.0000e+000 4.5455e-001 2.0252e-001
1.0000e+000 1.1111e-001 1.2101e-002
1.0000e+000 1.9192e-001 3.6104e-002
1.0000e+000 4.5455e-001 2.0252e-001
1.0000e+000 9.5960e-001 9.0259e-001
1.0000e+000 4.4444e-001 1.9362e-001
1.0000e+000 9.3939e-001 8.6499e-001
1.0000e+000 7.9798e-001 6.2416e-001
1.0000e+000 8.7879e-001 7.5698e-001
1.0000e+000 1.0000e+000 9.8020e-001
1.0000e+000 2.1212e-001 4.4104e-002
1.0000e+000 7.1717e-001 5.0415e-001
1.0000e+000 9.7980e-001 9.4099e-001
1.0000e+000 6.4646e-001 4.0964e-001
1.0000e+000 7.4747e-001 5.4765e-001
1.0000e+000 8.7879e-001 7.5698e-001
With scaled features, training takes only thousands of gradient-descent steps instead of millions.
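This can be checked with a rough experiment. The sketch below (Python/NumPy, with illustrative step counts and learning rates of my own choosing, not the original Octave code) runs the same number of gradient-descent steps on the raw and the range-scaled features; the scaled run ends with a much smaller loss because it tolerates a far larger learning rate.

```python
import numpy as np

x = np.arange(1.0, 100.0)
y = x ** 2 + 2 * x + 1
X_raw = np.column_stack([np.ones_like(x), x, x ** 2])

# Divide each feature column by its range, keeping the bias column as-is.
ranges = X_raw[:, 1:].max(axis=0) - X_raw[:, 1:].min(axis=0)
X_scaled = np.column_stack([X_raw[:, 0], X_raw[:, 1:] / ranges])

def final_loss(X, y, alpha, steps):
    """Half mean-squared error after `steps` batch gradient-descent updates."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= alpha * X.T @ (X @ theta - y) / len(y)
    return 0.5 * np.mean((X @ theta - y) ** 2)

# Raw features force a tiny alpha to avoid divergence;
# scaled features tolerate a large one.
loss_raw = final_loss(X_raw, y, 1e-9, 5000)
loss_scaled = final_loss(X_scaled, y, 0.3, 5000)
print(loss_scaled < loss_raw)  # True
```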
Extension
Besides feature scaling, there is another way to optimise gradient descent: mean normalization.
Some features may range over [0, 5000], while others range over [-200, 200].
A range centred on zero is better: [-1, 1] works much better than [0, 2].
So we shift every feature so that its values lie in a range of the form [-range, range].
Guided by this idea, we get the following expression, where mu_i is the mean of feature i:
x_i := (x_i - mu_i) / (max_i - min_i)
This will also make the training faster.
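Here is a Python/NumPy sketch of mean normalization (the data and variable names are hypothetical): subtracting each column’s mean before dividing by its range centres every feature on 0 while keeping its spread at 1.

```python
import numpy as np

# Two hypothetical features with very different, off-centre ranges.
X = np.array([[0.0, -200.0],
              [1000.0, -50.0],
              [2500.0, 20.0],
              [5000.0, 200.0]])

# Mean normalization: x' = (x - mean) / (max - min).
mu = X.mean(axis=0)
ranges = X.max(axis=0) - X.min(axis=0)
X_norm = (X - mu) / ranges

print(np.allclose(X_norm.mean(axis=0), 0.0))  # True: centred on 0
spans = X_norm.max(axis=0) - X_norm.min(axis=0)
print(np.allclose(spans, 1.0))                # True: spread of 1
```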