ML Notes: Week 2 - Multivariate Linear Regression

CCrazyGuy

Category: ML学习笔记 (ML study notes). Tags: machine learning

Article link: https://blog.csdn.net/jty573894890/article/details/106321922

1. The basic theory of the multivariate linear regression

Hypothesis: $h_\theta(x)=\theta_0x_0+\theta_1x_1+\ldots+\theta_nx_n = \theta^Tx$, where $x_0=1$ by convention

Parameters: $\theta_0, \theta_1, \ldots, \theta_n$

Cost Function: $J(\theta_0, \theta_1, \ldots, \theta_n)=\frac{1}{2m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$

As in the univariate case, gradient descent can be used to find the $\theta$ that minimizes $J$.
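A minimal sketch of the cost function and a single gradient descent step is given here. The toy data, the learning rate `alpha = 0.1`, and the helper name `cost` are assumptions made for illustration only.

```matlab
% Hypothetical toy data; the first column of X is x_0 = 1.
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
m = length(y);

% J(theta) = 1/(2m) * sum of squared residuals
cost = @(X, y, theta) sum((X*theta - y).^2) / (2*length(y));

theta = [0; 0];
J_before = cost(X, y, theta);

% One gradient descent step: theta := theta - (alpha/m) * X' * (X*theta - y)
alpha = 0.1;                       % assumed learning rate
theta = theta - (alpha/m) * (X' * (X*theta - y));
J_after = cost(X, y, theta);

fprintf('J before: %.4f, J after one step: %.4f\n', J_before, J_after);
```

With a suitable learning rate, the cost after the step should be lower than before it.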

2. Feature scaling

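Method 1: $\frac{x_i}{\max-\min}$
Method 2 (mean normalization): $\frac{x_i-\mu}{\max-\min}$

Here $\mu$, $\max$, and $\min$ are the mean, maximum, and minimum of the feature over the training set. After scaling, each feature should fall roughly in $-1\le x_i\le1$, or in $-0.5\le x_i\le0.5$.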
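Mean normalization can be sketched as follows. The feature matrix is made-up example data, and the variable names `mu`, `span`, and `X_norm` are illustrative; the code assumes implicit expansion is available (Octave, or MATLAB R2016b and later).

```matlab
% Hypothetical feature matrix: one row per sample, one column per feature.
X = [2104 3; 1600 3; 2400 4; 1416 2];

mu = mean(X);                 % per-column mean
span = max(X) - min(X);       % per-column range (max - min)

% Mean normalization, applied column by column via implicit expansion.
X_norm = (X - mu) ./ span;

disp(X_norm);
```

Keeping `mu` and `span` around is useful, because new inputs have to be scaled with the same values before prediction.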

3. Learning rate

Too small: slow convergence
Too large: (a) $J(\theta)$ may not converge; (b) $J(\theta)$ may not decrease on every iteration; (c) convergence may be slow

Try a range of values, for example:
$\alpha = 0.0001, 0.01, 0.1, 1$
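One way to compare learning rates is to run gradient descent once per candidate and record $J(\theta)$ at every iteration. The sketch below does this on made-up data; the iteration count and the name `J_hist` are assumptions for illustration.

```matlab
% Hypothetical toy data: y = 1 + 2*x, with an intercept column of ones.
x = (0:0.1:1)';
X = [ones(size(x)) x];
y = 1 + 2*x;
m = length(y);

alphas = [0.0001 0.01 0.1 1];          % candidate learning rates
iters = 200;                           % assumed number of iterations
J_hist = zeros(iters, numel(alphas));  % one column of costs per learning rate

for k = 1:numel(alphas)
    theta = zeros(2, 1);
    for j = 1:iters
        theta = theta - (alphas(k)/m) * (X' * (X*theta - y));
        J_hist(j, k) = sum((X*theta - y).^2) / (2*m);
    end
end

disp(J_hist(end, :));   % final cost for each learning rate
```

Plotting each column of `J_hist` against the iteration number shows which learning rates converge quickly, slowly, or not at all on a given data set.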

4. Normal equation

The normal equation solves for $\theta$ directly, without iteration:
$\theta=(X^TX)^{-1}X^Ty$

Derivation of the formula:
Cost Function: $J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
so we can vectorize the cost function as follows:
$\begin{aligned} J(\theta) &=\frac{1}{2}\underbrace{(X\theta-y)^T}_{1*m} \underbrace{(X\theta-y)}_{m*1}\\ &=\frac{1}{2}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty) \end{aligned}$
*The factor $\frac{1}{m}$ can be ignored because it does not change the minimizing $\theta$.

The $\theta$ that satisfies $\frac{\partial J(\theta)}{\partial \theta} =0$ is the optimum, so
$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &=\frac{1}{2}(2X^TX\theta-X^Ty-(y^TX)^T-0)\\ &= \frac{1}{2}(2X^TX\theta-X^Ty-X^Ty-0)\\ &= X^TX\theta-X^Ty=0 \end{aligned}$
$X^TX\theta=X^Ty$
so, provided $X^TX$ is invertible, $\mathop \theta\limits_{n*1} =(\mathop {X^T} \limits_{n*m} \mathop X\limits_{m*n})^{-1} \mathop {X^T}\limits_{n*m} \mathop y\limits_{m*1}$

*(1) $\frac{\partial A\theta}{\partial\theta} = A^T$

*(2) $\frac{\partial \theta^T A\theta}{\partial\theta} = 2A\theta$ for symmetric $A$ (here $A=X^TX$)

%% ============= normal equation ==========
% pinv returns a solution even when X'*X is singular or badly conditioned
theta_normal = pinv(X' * X) * X' * y;
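As a quick check of the formula, the sketch below solves the normal equation on made-up data and confirms that the gradient $X^T(X\theta-y)$ vanishes at the solution. The data are hypothetical, and the backslash operator is used here in place of an explicit inverse.

```matlab
% Hypothetical toy data: four samples, an intercept column plus one feature.
X = [1 1; 1 2; 1 3; 1 4];
y = [2; 3; 5; 6];

theta = (X' * X) \ (X' * y);   % solves X'*X*theta = X'*y
grad = X' * (X*theta - y);     % gradient of J (up to a constant factor)

disp(theta');
disp(norm(grad));
```

Solving the linear system with `\` is generally preferred over forming the inverse when $X^TX$ is invertible, since it does less work and is numerically more stable.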

More information: Derivation of the Normal Equation for linear regression

5. Vectorization in univariate gradient descent

Vectorization

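% Vectorized gradient descent for \theta
% Assumes X (m x 2, first column all ones), y (m x 1), alpha, and m are already defined
itera = 3000;
theta_matrix = [0 0];          % theta stored as a 1 x 2 row vector
theta_itera = zeros(itera,2);  % record all the theta values during the process
for j = 1:itera
    theta_itera(j,:) = theta_matrix;
    hypothesis = X * theta_matrix';   % m x 1 vector of predictions
    theta_matrix = theta_matrix - (alpha/m) * ((hypothesis - y)'* X);   % update both parameters at once
end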

“for” Loop

% "for" loop to calculate the \theta
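itera = 3000;
theta = zeros(2,1);              % theta stored as a 2 x 1 column vector
theta_length = length(theta);
theta_itera = zeros(itera,2);    % one row per iteration
for j = 1:itera
    theta_itera(j,:) = theta';  % record all the theta values during the process
    hypothesis = X * theta;     % computed once per iteration so the update is simultaneous
    for i = 1:theta_length
        theta(i) = theta(i) - (alpha/m) * ((hypothesis - y)'* X(:,i));
    end
end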
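The two versions can be compared on a small data set. The following self-contained sketch uses made-up data and an assumed learning rate, stores $\theta$ as a column vector in both variants, and checks the results against the closed-form solution.

```matlab
% Hypothetical toy data: y = 1 + 2*x exactly.
x = (0:0.1:1)';
X = [ones(size(x)) x];
y = 1 + 2*x;
m = length(y);
alpha = 0.5;      % assumed learning rate for this toy problem
itera = 3000;

% Vectorized update.
theta_vec = zeros(2, 1);
for j = 1:itera
    theta_vec = theta_vec - (alpha/m) * (X' * (X*theta_vec - y));
end

% Element-wise update with an explicit inner loop.
theta_loop = zeros(2, 1);
for j = 1:itera
    err = X*theta_loop - y;    % computed once so all theta(i) update simultaneously
    for i = 1:numel(theta_loop)
        theta_loop(i) = theta_loop(i) - (alpha/m) * (err' * X(:, i));
    end
end

% Closed-form reference from the normal equation.
theta_ne = (X' * X) \ (X' * y);
disp([theta_vec theta_loop theta_ne]);
```

Both loops perform the same arithmetic, so the columns should agree up to rounding; the vectorized form is shorter and usually faster.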

**** What if $X^TX$ is non-invertible?

(1) Delete the linearly dependent features (e.g. $x_2 = 2x_1$);
(2) If there are too many features ($m \le n$, where $m$ = # samples and $n$ = # features), delete some features;
(3) Use regularization.
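Options (1) and (3) can be illustrated on made-up data in which one feature is exactly twice another. The regularized form used below, $\theta=(X^TX+\lambda L)^{-1}X^Ty$ with $L$ the identity matrix except for a zero in the intercept position, is the common ridge-style variant and is an assumption here, as is the value of `lambda`.

```matlab
% Hypothetical toy data where the third column is exactly twice the second.
x1 = [1; 2; 3; 4; 5];
X = [ones(5, 1) x1 2*x1];
y = [3; 5; 7; 9; 11];

A = X' * X;
fprintf('rank of X''X: %d of %d\n', rank(A), size(A, 1));

% Option (1): drop the linearly dependent column.
X_reduced = X(:, 1:2);
theta_reduced = (X_reduced' * X_reduced) \ (X_reduced' * y);

% Option (3): ridge-style regularization; the intercept is left unpenalized.
lambda = 0.1;                  % assumed regularization strength
L = eye(3);
L(1, 1) = 0;
theta_reg = (A + lambda * L) \ (X' * y);

disp(theta_reduced');
disp(theta_reg');
```

In this sketch the regularized system is full rank even though $X^TX$ is not, so it can be solved directly.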