[Machine Learning] 1 Linear Regression

1 Linear Regression with One Variable / Univariate Linear Regression

1.1 model

  • Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$
  • Parameters: $\theta_0,\theta_1$
  • Cost Function (squared error cost function): $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
  • Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$
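As a concrete illustration of the cost function above, here is a minimal NumPy sketch that evaluates $J(\theta_0,\theta_1)$ on a training set; the names compute_cost, x and y are illustrative, not from the original notes.

import numpy as np

def compute_cost(theta0, theta1, x, y):
    # hypothesis h_theta(x) = theta0 + theta1 * x, evaluated on all m examples
    m = len(y)
    h = theta0 + theta1 * x
    # squared error cost J = 1/(2m) * sum((h - y)^2)
    return np.sum((h - y) ** 2) / (2 * m)

# tiny usage example with made-up data
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))  # 0.0, since y = 2x fits exactly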

1.2 ‘Batch’ Gradient Descent Algorithm

to solve $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$

1.2.1 algorithm

  • repeat until convergence {
    $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ (for $j=0$ and $j=1$)
    $\alpha$: learning rate
    }
  • Correct: $\begin{aligned} {temp}_0&:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)\\ {temp}_1&:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)\\ \theta_0&:={temp}_0\\ \theta_1&:={temp}_1 \end{aligned}$
  • Notice: need to simultaneously update $\theta_0$ and $\theta_1$
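A minimal sketch of one simultaneous-update step, assuming the two partial derivatives have already been evaluated into grad0 and grad1 (all names are illustrative):

def gradient_step(theta0, theta1, grad0, grad1, alpha):
    # compute both updates from the *old* parameter values ...
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    # ... and only then overwrite them: a simultaneous update
    return temp0, temp1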

1.2.2 use for univariate linear regression

  • repeat {
    $\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}) \end{aligned}$
    }
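The update rule above translates directly into code; a minimal NumPy sketch of the batch gradient descent loop, using a fixed iteration count instead of a formal convergence test (names are illustrative):

import numpy as np

def univariate_gradient_descent(x, y, alpha=0.01, num_iters=1500):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y      # h_theta(x^(i)) - y^(i) for every i
        grad0 = np.sum(error) / m              # partial derivative w.r.t. theta0
        grad1 = np.sum(error * x) / m          # partial derivative w.r.t. theta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1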

1.2.3 characteristics

  • Every step of gradient descent uses all of the training examples (hence the name 'batch')

2 Linear Regression with Multiple Variables / Multivariate Linear Regression

2.1 model

  • Hypothesis (with $x_0=1$): $h_\theta(x)=\theta^Tx=\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n$
  • Parameters: the $(n+1)$-dimensional vector $\theta=(\theta_0,\theta_1,\cdots,\theta_n)$
  • Cost Function (squared error cost function): $J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
  • Goal (Objective Function): $\mathop{\text{minimize}}\limits_{\theta_0,\theta_1,\cdots,\theta_n} J(\theta_0,\theta_1,\cdots,\theta_n)$
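With $x_0=1$ folded into the data, the hypothesis and cost above vectorize naturally; a minimal sketch assuming X already contains the column of ones (names are illustrative):

import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = theta^T x for every row of X, i.e. X @ theta
    return X @ theta

def cost(theta, X, y):
    m = len(y)
    residual = hypothesis(theta, X) - y
    return residual @ residual / (2 * m)   # 1/(2m) * sum of squared errors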

2.2 Gradient Descent Algorithm

2.2.1 algorithm

  • repeat until convergence {
    $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1,\cdots,\theta_n)$ (simultaneously update $\theta_j$ for $j=0,1,\cdots,n$)
    }

2.2.2 use for multiple linear regression

  • repeat {
    (with $x_0^{(i)}=1$)
    $\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x_0^{(i)})\\ \theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x_1^{(i)})\\ &\cdots\\ \theta_n&:=\theta_n-\alpha\frac{1}{m}\sum_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})\cdot x_n^{(i)}) \end{aligned}$
    }
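The per-parameter updates above collapse into the single vectorized update $\theta:=\theta-\frac{\alpha}{m}X^T(X\theta-y)$; a minimal NumPy sketch (illustrative names, fixed iteration count):

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    m, n_plus_1 = X.shape                       # X includes the x_0 = 1 column
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m    # all partial derivatives at once
        theta = theta - alpha * gradient        # simultaneous update of every theta_j
    return theta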

3 Gradient Descent in Practice

3.1 Feature Scaling

  • Goal: make sure features are on a similar scale
  • Advantages:
    (1) makes gradient descent run much faster
    (2) converges in far fewer iterations
  • Methods:
    (1) dividing by the maximum value
    (2) mean normalization

3.1.1 mean normalization

  • Theory: replace $x_i$ with $x_i-\mu_i$ so that features have approximately zero mean
    (do not apply this to $x_0=1$)
    $x_i=\frac{x_i-\mu_i}{s_i}$
  • $\mu_i$: average value of $x_i$ over the training set
  • $s_i$: the range of values of that feature, either
    (1) $\text{maximum value}-\text{minimum value}$
    (2) the standard deviation of the feature
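A minimal NumPy sketch of mean normalization over the columns of a feature matrix, using the standard deviation as $s_i$ (illustrative names; the $x_0=1$ column is assumed to be added afterwards):

import numpy as np

def mean_normalize(X):
    mu = X.mean(axis=0)         # mu_i: mean of each feature
    sigma = X.std(axis=0)       # s_i: here the standard deviation (max - min would also work)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma    # keep mu and sigma to scale new examples the same way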

3.2 Learning Rate

  • Debugging: how to make sure gradient descent is working correctly

3.2.1 good method: plot

  • plot the value of the cost function $J(\theta)$ against the number of iterations
  • Advantages:
    (1) shows whether gradient descent is working correctly: $J(\theta)$ should decrease after every iteration
    (2) makes it possible to judge whether or not gradient descent has converged
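A minimal sketch of this diagnostic plot, assuming a list cost_history was collected during gradient descent (matplotlib is used for illustration; all names are hypothetical):

import matplotlib.pyplot as plt

def plot_cost_history(cost_history):
    # cost_history[k] holds J(theta) after iteration k
    plt.plot(range(len(cost_history)), cost_history)
    plt.xlabel('number of iterations')
    plt.ylabel('J(theta)')
    plt.title('J(theta) should decrease on every iteration')
    plt.show()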

3.2.2 other method

  • declare convergence if $J(\theta)$ decreases by less than some small threshold $\varepsilon$ in one iteration
  • Advantage: convergence is judged automatically
  • Disadvantage: choosing $\varepsilon$ can be difficult
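A minimal sketch of this automatic convergence test wrapped around a gradient descent step; epsilon and the helper names are illustrative assumptions:

def run_until_converged(step, compute_cost, theta, epsilon=1e-3, max_iters=10000):
    # step(theta) performs one gradient descent update; compute_cost(theta) returns J(theta)
    prev_cost = compute_cost(theta)
    for _ in range(max_iters):
        theta = step(theta)
        cost = compute_cost(theta)
        if prev_cost - cost < epsilon:   # J decreased by less than epsilon: declare convergence
            break
        prev_cost = cost
    return theta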

3.2.3 choose $\alpha$

  • consider: $\alpha=0.01, 0.03, 0.1, 0.3, 1, 3, 10, \cdots$

3.2.4 plot problem

  • Problem: $J(\theta)$ does not decrease after every iteration
  • Cause: $\alpha$ is too large, so $J(\theta)$ may not decrease on every iteration; gradient descent may fail to converge, and slow convergence is also possible
  • Solution: choose a sufficiently smaller $\alpha$
  • Problem caused by the solution above: if $\alpha$ is too small, gradient descent can be slow to converge

3.3 Features and Polynomial Regression

  • linear regression does not fit every dataset
  • polynomial regression can be turned into linear regression by defining new features (e.g. $x_1=x$, $x_2=x^2$, $x_3=x^3$)
  • we need to look at the training set in order to choose an appropriate model
  • Notice: feature scaling is necessary when using polynomial regression, since the powers of a feature take very different ranges of values
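A minimal sketch of this idea: build polynomial features from a single input and scale them, after which ordinary multivariate linear regression applies (names and data are illustrative):

import numpy as np

def polynomial_features(x, degree):
    # columns x, x^2, ..., x^degree; their ranges differ wildly,
    # which is why feature scaling matters here
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = polynomial_features(x, 3)
X_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)   # mean normalization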

4 Normal Equation

A method to solve for $\theta$ analytically, obtaining the optimal value in a single step rather than by iterating.

4.1 model

  • $m$ examples $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features
  • each $x^{(i)}$ is an $(n+1)$-dimensional vector: $x^{(i)}=\left[\begin{matrix} x_0^{(i)}\\ \vdots\\ x_n^{(i)} \end{matrix}\right]$
  • design matrix ($m\times(n+1)$): $X=\left[\begin{matrix} (x^{(1)})^T\\ \vdots\\ (x^{(m)})^T \end{matrix}\right]$
  • $y=\left[\begin{matrix} y^{(1)}\\ \vdots\\ y^{(m)} \end{matrix}\right]$
  • $\theta=(X^TX)^{-1}X^Ty$
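A brief sketch of where this closed form comes from (a standard derivation, not spelled out in the original notes): writing the cost in matrix form as $J(\theta)=\frac{1}{2m}\|X\theta-y\|^2$ and setting its gradient to zero gives
$\nabla_\theta J=\frac{1}{m}X^T(X\theta-y)=0 \;\Rightarrow\; X^TX\theta=X^Ty \;\Rightarrow\; \theta=(X^TX)^{-1}X^Ty$.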

4.2 implementation

  • Octave
 pinv(X'*X)*X'*y
  • Python
import numpy as np
def normalEqn(X, y):
    # X.T @ X is equivalent to X.T.dot(X)
    theta = np.linalg.inv(X.T @ X) @ X.T @ y
    return theta
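A small usage sketch, assuming X already contains the leading column of ones (data values are made up; np.linalg.pinv could replace inv to tolerate a singular $X^TX$, as discussed below):

import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])       # first column is x_0 = 1
y = np.array([2.0, 4.0, 6.0])
theta = normalEqn(X, y)
print(theta)                     # approximately [0, 2], i.e. y = 2x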

Problem: what if $X^TX$ is non-invertible (singular or degenerate)?

  • causes:
    (1) redundant features (linearly dependent)
    (2) too many features: delete some features or use regularization
  • A non-invertible $X^TX$ rarely occurs in practice
  • pseudo-inverse: pinv() (still produces a usable $\theta$ even when $X^TX$ is singular)
  • inverse: inv()

5 Gradient Descent vs. Normal Equation

| Gradient Descent | Normal Equation |
| --- | --- |
| needs to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no iteration needed |
| works well even when $n$ is large | slow if $n$ is very large, because $(X^TX)^{-1}$ must be computed |
| applies to many types of models | only applies to linear regression, not to logistic regression |

6 References

Andrew Ng, Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning notes
