http://blog.csdn.net/pipisorry/article/details/43529845
Machine Learning - Andrew Ng Course Notes
Multivariate Linear Regression
Linear regression with multiple variables (multiple features).
Multiple Features (Variables)
How the hypothesis function is represented for multivariate linear regression.
Note: We store each example as a row in the X matrix in Octave.
For convenience, we add an intercept term (θ0), i.e., an extra first column of all 1's in X, and treat it as just another feature.
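A minimal Octave sketch of this setup (the variable name X_data is illustrative, not from the course scripts): build X with the all-ones intercept column and evaluate the hypothesis for all examples at once.
% X_data: m x n matrix, one training example per row (illustrative name)
m = size(X_data, 1);
X = [ones(m, 1), X_data];    % prepend the all-ones column for the intercept term theta_0
h = X * theta;               % m x 1 vector of predictions h_theta(x) = theta' * x, assuming theta is (n+1) x 1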
Gradient Descent for Multiple Variables
Use gradient descent to minimize the cost function and thereby solve for the parameters θ.
{In the lecture figure, the left side is the gradient descent algorithm for single-variable linear regression; the right side is the algorithm for multivariate linear regression.}
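The multivariate update rule shown in that figure can be written out as follows (standard course notation, reconstructed here because the original image is not included):
repeat until convergence:
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) x_j^{(i)}, \qquad j = 0, 1, \dots, n,
where h_\theta(x) = \theta^T x and x_0^{(i)} = 1.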
Related Octave code
Vectorized code to compute J(theta) after each iteration:
J = ( (X * theta - y)' * (X * theta - y)) / (2 * m);
%J = sum( (X * theta - y).^2 ) / (2 * m);
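Since the plotting code later calls computeCost, here is a minimal sketch of how that function could wrap the vectorized expression above (the signature matches the later call; the body mirrors the line above):
function J = computeCost(X, y, theta)
  % vectorized cost for linear regression: J = 1/(2m) * ||X*theta - y||^2
  m = length(y);
  J = ((X * theta - y)' * (X * theta - y)) / (2 * m);
end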
Vectorized code to update theta at each iteration:
theta -= alpha / m * (X' * (X * theta - y));

Gradient Descent in Practice I - Feature Scaling
{Feature scaling: getting the features onto similar ranges of values.
This makes gradient descent work well: it runs much faster and converges in far fewer iterations.}
Why:
If you make sure that the features are on a similar scale, then gradient descent can converge more quickly (right figure below).
Without feature scaling:
Gradient descent can oscillate back and forth and take a long time before it finally finds its way to the global minimum (left figure below).
That is, without scaling the descent direction does not point toward the optimum, so the solution is found quite slowly.
How to do feature scaling?
1. Divide by the maximum value, or by the range (max - min); a short sketch follows this item.
Divide each feature by its maximum value when performing feature scaling, so that each feature falls roughly within a range like [-1, 1].
If a feature winds up being between -2 and +0.5, this is close enough to [-1, +1] and that's fine. {x1, x2, x3 need not lie exactly in [-1, 1]; being reasonably close is enough.}
But if a feature, say x3, ranges over [-100, +100], or x4 takes values in [-0.0001, +0.0001], these are very different scales from [-1, +1], so they are poorly scaled features. {The ranges should not differ too much.}
A good rule of thumb for the feature range: if a feature takes values in roughly [-3, 3], that should be just fine.
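A minimal Octave sketch of this first approach, dividing each feature column by its maximum absolute value or by its range (apply it to the feature columns only, before adding the all-ones x0 column):
X_scaled = X ./ max(abs(X));           % divide each column by its maximum (Octave broadcasts the 1 x n row)
X_scaled = X ./ (max(X) - min(X));     % or divide each column by its range (max - min)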
2. Mean normalization
That is, x_i := (x_i - mu_i) / S_i, where mu_i is the mean of feature i over the training set and S_i is its range (max - min) or its standard deviation.
{Note: x1 or x2 can actually end up slightly larger than 0.5, but that is close enough; any value that gets the features into anything close to these sorts of ranges will do fine.
S_i is the range or the standard deviation. The standard deviation is a way of measuring how much variation there is in the values of a particular feature (most data points lie within 2 standard deviations of the mean); it is an alternative to using the range.
When scaling, the extra column of 1's corresponding to x0 = 1 has not yet been added to X; once the first column of X is all ones, it does not need to be normalized.
When normalizing the features, it is important to store the values used for normalization - the mean and the standard deviation. Given a new x value, we must first normalize it using the mean and standard deviation previously computed from the training set (see the sketch after the code below).
Code1:
mu = mean(X);                         % per-column means of the training features
sigma = std(X);                       % per-column standard deviations
X_norm = X;                           % initialize, then normalize row by row
for line = 1:size(X, 1)
    X_norm(line, :) = (X_norm(line, :) - mu) ./ sigma;
end
Code2:
%for i=1:size(X,2)
% mu(1,i) = mean(X(:,i));
% sigma(1,i) = std(X(:,i));
% X_norm(:,i) = (X(:,i)-mu(1,i))/sigma(1,i);
%end
}
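A minimal sketch of applying the stored mu and sigma to a new example before prediction, as the note above describes (the new-example values are illustrative):
x_new = [1650, 3];                   % illustrative raw feature values (e.g. size and number of bedrooms)
x_new_norm = (x_new - mu) ./ sigma;  % normalize with the mu and sigma computed from the training set
price = [1, x_new_norm] * theta;     % prepend x0 = 1 before applying the learned theta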
In short: the feature scaling does not have to be exact; it only has to get gradient descent to run quite a lot faster.
Gradient Descent in Practice II - Learning Rate α
How to make sure gradient descent is working correctly, i.e., that the iterations have converged:
1. Plotting (left figure): plot the value of J(θ) against the number of iterations; by looking at such plots you can tell whether gradient descent has converged.
J_history = zeros(num_iters, 1);      % record the cost at every iteration
for iter = 1:num_iters
    theta -= alpha / m * (X' * (X * theta - y));
    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);
end
figure;
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
%plot(1:num_iters, J_history);
2. Convergence test (right figure): declare convergence if J(θ) decreases by less than some small threshold ε in one iteration.
But choosing what this threshold should be is usually pretty difficult, so to check that gradient descent has converged people tend to look at plots like the figure on the left instead.
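A minimal sketch of such an automatic convergence test (epsilon, an illustrative threshold, is defined before the loop; the check goes inside the gradient descent loop shown above):
epsilon = 1e-3;                                               % illustrative convergence threshold
if iter > 1 && (J_history(iter-1) - J_history(iter)) < epsilon
    fprintf('Converged after %d iterations.\n', iter);
    break;                                                    % stop iterating once J(theta) barely decreases
end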
Features and Polynomial Regression
{About the choice of features and how it can give you a different learning algorithm.}
Polynomial regression allows you to use the machinery of linear regression to fit very complicated, even very non-linear, functions.
Defining new features: sometimes by defining new features you actually get a better model. Closely related to the idea of choosing your features is the idea called polynomial regression.
{Here the product of frontage (x1) and depth (x2) is redefined as a new feature, land area, which turns the multivariate model into a single-variable one and gives a better model.}
Polynomial regression modeling: re-model house price prediction, going from a linear model to a polynomial model.
1. Quadratic model: you may think the price is a quadratic function of size (blue line).
But then you may decide that the quadratic model doesn't make sense, because eventually the function comes back down, and we don't think housing prices should go down as the size grows large.
2. Cubic model: use a cubic function (green line), which now has a third-order term.
Note: if you choose features like x = size^n, then feature scaling becomes increasingly important (see the blue annotation in the figure).
3. Square-root model (fuchsia line): rather than going to a cubic model, you have other possible choices of features, and there are many, e.g. price = θ0 + θ1·(size) + θ2·√(size).
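A minimal Octave sketch of building such polynomial features from the size variable (the name size_ft is illustrative); as noted above, scaling becomes essential because size^2 and size^3 have very different ranges:
X_poly = [size_ft, size_ft.^2, size_ft.^3];     % cubic model; use sqrt(size_ft) instead for the square-root model
mu = mean(X_poly);  sigma = std(X_poly);
X_poly = (X_poly - mu) ./ sigma;                % feature scaling is essential for these features
X_poly = [ones(length(size_ft), 1), X_poly];    % add the intercept column x0 = 1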
Normal Equation
For some linear regression problems, the normal equation gives a better way to solve for the parameters θ analytically, rather than running an iterative gradient descent algorithm.
Set the partial derivatives to 0 and solve the resulting system of equations for θ. Note that this gives the least-squares solution only because the loss here is the squared loss; otherwise you could not solve it this way. Because the objective is quadratic in θ, linear regression actually has an analytic solution: differentiate, set the derivative to zero, and the optimum is exactly the least-squares solution.
Solving the system of equations
The system of equations can be written in matrix form and solved.
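In this notation the derivation can be reconstructed as follows (standard least-squares algebra; the original formula images are not included):
J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y), \qquad \nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0
\;\Longrightarrow\; X^T X \theta = X^T y \;\Longrightarrow\; \theta = (X^T X)^{-1} X^T y.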
Least-squares solution of linear regression
First construct the design matrix X.
Then compute θ from the formula θ = (X'X)⁻¹ X'y (this can be evaluated with Octave's built-in functions).
Note: feature scaling is not needed when using this method.
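In Octave this is a one-liner; a minimal sketch (pinv is used here for the reason discussed in the non-invertibility section below):
theta = pinv(X' * X) * X' * y;        % normal equation: theta = (X'X)^(-1) X'y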
Least squares
In other words, if the loss function of a supervised learning problem is the squared loss, the problem is exactly least squares. [Machine learning algorithms and their loss functions]
[Matrix Analysis and Applications, Zhang]
For the underdetermined case, adding an L2 regularization term turns the problem into the form below, where the matrix becomes directly invertible: θ = (X'X + λI)⁻¹ X'y.
X'X (a real symmetric matrix, the real-valued special case of a Hermitian matrix) is only guaranteed to be an n×n square matrix; it is not guaranteed to be invertible. When the number of samples m is smaller than the dimension n of each sample (an underdetermined system), X'X is not full rank (rank(X'X) = rank(X) ≤ min(m, n) = m, and X'X is n×n with rank ≤ m < n), so X'X becomes non-invertible and w* cannot be computed directly; more precisely, there are infinitely many solutions, because the number of equations is smaller than the number of unknowns. In other words, the data are not sufficient to determine a unique solution, and if we pick one of the feasible solutions at random it is very likely not a truly good one; in short, we overfit. The fix for non-invertibility is to add a regularization term. [Optimization methods: norms and regularization]
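A minimal sketch of this regularized (ridge) normal equation in Octave; lambda is an illustrative regularization strength, and following the usual convention the intercept term theta_0 is not regularized:
lambda = 1;                                  % illustrative regularization strength
n = size(X, 2);                              % X already includes the all-ones intercept column
L = eye(n);  L(1, 1) = 0;                    % do not regularize theta_0
theta = (X' * X + lambda * L) \ (X' * y);    % (X'X + lambda*L) is invertible for lambda > 0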
Normal Equation and Noninvertibility
{What to do if X'X is non-invertible.}
Octave has two functions for inverting matrices:
One is called pinv() (the pseudo-inverse) and the other is called inv() (the inverse). As long as you use the pinv() function, it will actually compute the value of theta that you want, even if X'X is non-invertible.
If X'X is non-invertible, there are usually two most common causes (the reasons X'X becomes singular, and how to fix them; proofs omitted):
Note:
1. Two different features should not be linearly dependent on each other; otherwise one of them is a redundant feature and can be deleted.
2. Too many features relative to the number of training examples (m ≤ n): delete some features, or use regularization, which lets you fit a lot of parameters using a lot of features even if you have a relatively small training set.
Pros and cons of gradient descent vs. the normal equation
Note: for almost all computed implementations, the cost of inverting the matrix grows roughly as the cube of its dimension, i.e., O(n^3).
So under what circumstances should each method be used?
If n is large, you would usually use gradient descent; and for more complex learning algorithms, e.g. classification algorithms such as logistic regression, the normal equation method does not work at all, so we have to resort to gradient descent for those algorithms.
But if n is relatively small, then the normal equation may give you a better way to solve for the parameters.
What do small and large mean?
It is usually around n = 10,000 that you might start to consider switching over to gradient descent, or perhaps some of the other algorithms discussed later.
Review
from:http://blog.csdn.net/pipisorry/article/details/43529845
ref: Multivariate regression - the data-analysis classic "Data Analysis for Politics and Policy", Chapter 4, "Multiple Regression".