1 Linear Model
$$y_i=\theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_{n_p} x_{i n_p} + \varepsilon_i,\qquad i=1,2,\dots,n_d.$$
or, in matrix form,

$$y = X\theta + \varepsilon$$
where $\varepsilon$ is the noise vector and $\theta$ is the vector of unknown parameters.
- The linear model is parametric with $n_p$ parameters.
- If an intercept $\theta_0$ is included, it corresponds to a column of ones in the design matrix $X$.
- The design matrix has dimension $X:\ n_d \times n_p$, or $n_d \times (n_p + 1)$ when the intercept $\theta_0$ is added; we require $n_d > n_p$.
- The traditional assumption on the noise $\varepsilon$ is that it is i.i.d. Gaussian: $\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\ i=1,2,\dots,n_d$.
1.1 Parameter Estimation
For parameter estimation in the linear model there are two routes: minimizing a cost function or applying the maximum likelihood principle. These two approaches lead to the same optimal solution for $\theta$; the difference is the viewpoint: the cost function reduces the discrepancy between predictions and data, while maximum likelihood maximizes the likelihood of observing the data given the parameters.
1.2 Cost Function
The goal: reduce the discrepancy between predictions and data $\Longrightarrow$ minimize the MSE of the predictions
$$MSE(\hat{y}) = \frac{1}{n_d}||y-\hat{y}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - \hat{y}_i)^2 = \frac{1}{n_d}||y-X\hat{\theta}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\hat{\theta})^2$$
Let $J(\theta)$ be the cost function; find the estimate $\hat{\theta}$ of $\theta$ that minimizes $J(\theta)$, where

$$J(\theta) = ||y-X\theta||^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
so

$$\hat\theta=\underset{\theta}{\mathrm{argmin}}\, J(\theta).$$

This is the same as the ordinary least squares (OLS) estimator, because OLS minimizes the squared norm of the noise,

$$||\varepsilon(\theta)||^2=\sum^{n_d}_{i=1}\varepsilon_i^2=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2,$$

so the OLS estimator of $\theta$ is

$$\hat\theta = \underset{\theta}{\mathrm{argmin}}\,||\varepsilon(\theta)||^2 = \underset{\theta}{\mathrm{argmin}}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
Cost function:

$$J(\theta)=\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2,$$

where $h_\theta(x)$ is called the hypothesis and $h_\theta(x_i)=x_i^\mathsf{T}\theta$ for linear models. So the estimator obtained by minimizing the cost function with $h_\theta(x_i)=x_i^\mathsf{T}\theta$ and the OLS estimator obtained by minimizing the squared noise norm $||\varepsilon(\theta)||^2$ coincide.
Get $\hat\theta$:

$$\left\{ \begin{array}{ll} \mathrm{gradient\ descent\ approximates\ } \hat\theta, & J(\theta)\ \mathrm{is\ convex} \\ \mathrm{other\ numerical\ optimization\ schemes}, & J(\theta)\ \mathrm{is\ not\ convex} \end{array} \right.$$
1.3 Normal equation to get $\hat\theta$
For invertible $X^\mathsf{T}X$,

$$\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$$
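A minimal NumPy sketch of computing $\hat\theta$ from the normal equation (the data, the seed, and the helper name `fit_normal_equation` below are illustrative assumptions, not part of the notes):

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve X^T X theta = X^T y (assumes X^T X is invertible)."""
    # Solving the linear system is numerically preferable to forming (X^T X)^{-1} explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative data: n_d = 100 samples, n_p = 3 covariates plus an intercept column of ones
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
theta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ theta_true + rng.normal(scale=0.1, size=100)

theta_hat = fit_normal_equation(X, y)
print(theta_hat)  # should be close to theta_true
```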
Gauss-Markov assumptions
- $E(\varepsilon_i) = 0$
- $Var(\varepsilon_i)=\sigma^2$
- $Cov(\varepsilon_i,\varepsilon_j)=0$ for $i \neq j$
Under these assumptions:
- $X^\mathsf{T}X$ must be non-singular, i.e. the columns of $X$ must be linearly independent.
- $E(\hat\theta)=\theta$
- $Var(\hat\theta)=\sigma^2(X^\mathsf{T}X)^{-1}$
- $\hat\theta \sim \mathcal{N}\left(\theta,\ \sigma^2(X^\mathsf{T}X)^{-1}\right)$
(under the assumption $\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\ i=1,2,\dots,n_d$)
- The likelihood is
$$\mathcal{L}(\theta,\sigma^2 \mid y, X)=(2\pi\sigma^2)^{-n_d/2} \exp\left(-\frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2\right)$$
- The associated maximum likelihood estimate (MLE) is
$$\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y,$$
which coincides with the OLS estimate for normal noise, i.e. it equals $\underset{\theta}{\mathrm{argmin}} \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$.
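To see why the MLE coincides with OLS here, take the log of the likelihood above:

$$\log \mathcal{L}(\theta,\sigma^2 \mid y, X) = -\frac{n_d}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2.$$

The first term does not depend on $\theta$, so maximizing the log-likelihood over $\theta$ is the same as minimizing $\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2$, i.e. the OLS criterion; setting its gradient $-2X^\mathsf{T}(y - X\theta)$ to zero gives $X^\mathsf{T}X\theta = X^\mathsf{T}y$ and hence $\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$.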
Limitations of the normal equation for large $n_p$
- the computational cost of $(X^\mathsf{T}X)^{-1}$
- singularity: with large $n_p$, the columns of $X$ may become highly correlated
1.4 Gradient Descent Process to get $\hat\theta$
The logic: start with some $\theta = (\theta_0,\theta_1)$, then keep changing $(\theta_0,\theta_1)$ to reduce $J(\theta_0,\theta_1)$ until reaching $\underset{\theta_0,\theta_1}{\min} J(\theta_0,\theta_1)$.
1.4.1 Algorithm
1. Initialize $(\theta_1,\theta_2,\dots,\theta_n)$
2. Set $(\tilde\theta_1,\tilde\theta_2,\dots,\tilde\theta_n)=(\theta_1,\theta_2,\dots,\theta_n)$
3. Update each $\theta_i$ with $\theta_i=\tilde\theta_i - a \cdot \frac{\partial}{\partial\theta_i}J(\tilde\theta_1,\tilde\theta_2,\dots,\tilde\theta_n)$, where $a$ is the learning rate
4. Repeat steps 2 and 3 until the $\theta_i$ converge
Gradient descent algorithm for linear regression
Cost function for the multiple linear regression model:

$$J(\theta)=\frac{1}{2n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T} \theta - y_i)^2$$
The update step of gradient descent for linear regression with this cost function is

$$\theta_j^{(k+1)}=\theta_j^{(k)}-a\,\frac{1}{n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T}\theta^{(k)}-y_i)\,x_{ij}$$
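A minimal NumPy sketch of batch gradient descent with this update rule (the learning rate, iteration count, and synthetic data are illustrative assumptions):

```python
import numpy as np

def gradient_descent(X, y, a=0.1, n_iter=1000):
    """Batch gradient descent for linear regression with J = (1/2n_d) * sum of squared residuals."""
    n_d, n_p = X.shape
    theta = np.zeros(n_p)                      # initialize theta
    for _ in range(n_iter):
        residuals = X @ theta - y              # (x_i^T theta - y_i) for all i
        gradient = (X.T @ residuals) / n_d     # (1/n_d) * sum_i (x_i^T theta - y_i) x_ij
        theta = theta - a * gradient           # simultaneous update of all theta_j
    return theta

# Illustrative data with an intercept column of ones
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))                  # close to [0.5, 1.5, -2.0]
```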
1.4.2 Learning Rate
The learning rate $a$ used in the algorithm should be a positive constant ($a > 0$).
- The learning rate affects both convergence and the speed of convergence: too small a learning rate may lead to slow convergence, while too large a learning rate may lead to non-convergence or divergence.
- The learning rate can be selected using a validation set or cross-validation.
1.4.3 Stopping Criterion
Stop when the change in the cost is close to 0 (see the sketch below):
- Absolute error tolerance: $\varepsilon_{abs}=\left|J(\theta^{(k+1)})-J(\theta^{(k)})\right|$
- Relative error tolerance: $\varepsilon_{rel}=\left|\frac{J(\theta^{(k+1)})-J(\theta^{(k)})}{J(\theta^{(k+1)})}\right|$
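A variant of the same gradient descent loop that stops on the relative error tolerance $\varepsilon_{rel}$ (the tolerance value and iteration cap are illustrative assumptions):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2n_d) * sum of squared residuals."""
    return np.sum((X @ theta - y) ** 2) / (2 * X.shape[0])

def gradient_descent_tol(X, y, a=0.1, tol_rel=1e-8, max_iter=100_000):
    theta = np.zeros(X.shape[1])
    J_old = cost(X, y, theta)
    for _ in range(max_iter):
        theta = theta - a * (X.T @ (X @ theta - y)) / X.shape[0]
        J_new = cost(X, y, theta)
        if abs(J_new - J_old) / abs(J_new) < tol_rel:   # relative error tolerance eps_rel
            break
        J_old = J_new
    return theta
```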
2 Logistic Regression
Output: $\{0,1\}$
Decision boundary: $x_i^{\mathsf{T}}\theta = 0$
Hypothesis:

$$h_\theta(x_i)=g(x_i^\mathsf{T}\theta) =\frac{1}{1+\exp(-x_i^\mathsf{T}\theta)}$$
Cost function (for a single observation):

$$J(\theta_0,\theta_1)=\left\{ \begin{array}{ll} -\log\left(h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y=1 \\ -\log\left(1-h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y=0 \end{array} \right.$$
General form of the cost function for $n$ samples:

$$J(\theta)=-\frac{1}{n}\sum^n_{i=1}\Big(y_i \log(h_\theta(x_i))+(1-y_i)\log(1-h_\theta(x_i))\Big)$$
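A minimal NumPy sketch of the hypothesis, cost function, and classification rule above (the clipping constant `eps` is an illustrative numerical safeguard, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta):
    """h_theta(x_i) = g(x_i^T theta) for every row x_i of X."""
    return sigmoid(X @ theta)

def logistic_cost(X, y, theta, eps=1e-12):
    """Cross-entropy cost J(theta) = -(1/n) * sum_i [ y_i log h + (1 - y_i) log(1 - h) ]."""
    h = np.clip(hypothesis(X, theta), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def predict(X, theta):
    """Classify: y_hat = 1 when x_i^T theta >= 0, i.e. h_theta(x_i) >= 0.5."""
    return (X @ theta >= 0).astype(int)
```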
Classification:
- $x_i^\mathsf{T}\theta \ge 0 \Longrightarrow h_\theta(x_i) \ge 0.5 \Longrightarrow \hat{y}=1$. Then if $y=1$, $J(\theta) \rightarrow 0$; if $y=0$, $J(\theta) \rightarrow \infty$.
- $x_i^\mathsf{T}\theta < 0 \Longrightarrow h_\theta(x_i) < 0.5 \Longrightarrow \hat{y}=0$. Then if $y=1$, $J(\theta) \rightarrow \infty$; if $y=0$, $J(\theta) \rightarrow 0$.
3 Avoid Overfitting
- more overfitting $\Longrightarrow$ higher variance of the predictions
- more underfitting $\Longrightarrow$ higher bias of the predictions
3.1 Penalizing parameters
$$\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2+\lambda r(\theta),$$

where $\lambda r(\theta)$ is the penalty term, $r(\theta)$ is the parameter penalty function, and $\lambda$ is the regularization parameter.
Regularization parameter $\lambda$:
- $\lambda \rightarrow 0$: the regularized regression estimates tend to the ordinary linear regression estimates
- $\lambda \rightarrow \infty$: the parameters are penalized more, and the model over-fits the data less
- select $\lambda$ via cross-validation
Parameter penalty functions $r(\theta)$

Some typical parameter penalty functions (cf. distance measures; a small sketch follows the list):
- $d$-variation function: $r(\theta)=\left(\sum^{n_p}_{i=1}|\theta_i|^d\right)^{1/d}$
- $L_1$ norm: $r(\theta)=\sum^{n_p}_{i=1}|\theta_i|$
- squared $L_2$ norm: $r(\theta)=\sum^{n_p}_{i=1}|\theta_i|^2$
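A small NumPy sketch of these penalty functions (the function names are illustrative):

```python
import numpy as np

def d_variation(theta, d):
    """r(theta) = ( sum_i |theta_i|^d )^(1/d)"""
    return np.sum(np.abs(theta) ** d) ** (1.0 / d)

def l1_penalty(theta):
    """r(theta) = sum_i |theta_i|   (L1 norm)"""
    return np.sum(np.abs(theta))

def squared_l2_penalty(theta):
    """r(theta) = sum_i |theta_i|^2   (squared L2 norm)"""
    return np.sum(theta ** 2)
```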
3.2 Regression model with penalty function
3.2.1 Lasso regression
Lasso regression uses the $L_1$ norm as penalty function. Lasso regression "zeroes out" coefficients, so it performs variable selection and, to some extent, parameter shrinkage.
- Cost function
$$J_L(\theta)=||y-X\theta||^2+\lambda||\theta||_1= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2+\lambda\sum^{n_p}_{j=1}|\theta_j|$$
- Gradient of the cost function: $\frac{\partial J_L(\theta)}{\partial \theta_1},\dots,\frac{\partial J_L(\theta)}{\partial \theta_{n_p}}$
Note:
$$\frac{\partial}{\partial \theta_1}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 = -2\sum^{n_d}_{i=1}x_{i1}\,(y_i - x_i^{\mathsf{T}}\theta)$$
- Lasso estimate
$$\hat\theta_L=\underset{\theta}{\mathrm{argmin}}\left(\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j| \right)$$
There is no closed-form solution (the $|\theta_j|$ terms are not differentiable at zero), but $J_L(\theta)$ is convex, so it can be solved numerically, e.g. with the least angle regression (LARS) algorithm; see the sketch below.
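A hedged usage sketch, assuming scikit-learn is available: its `LassoLars` estimator implements the least angle regression approach mentioned above, and `alpha` plays the role of the regularization parameter $\lambda$ (up to scaling); all values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoLars

# Illustrative data: only the first two covariates actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = LassoLars(alpha=0.1)      # alpha ~ regularization strength
lasso.fit(X, y)
print(lasso.coef_)                # many coefficients are exactly zero ("zeroed out")
```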
3.2.2 Ridge regression
Ridge regression uses the squared $L_2$ norm as penalty function. It does not "zero out" coefficients, i.e. it cannot perform variable selection, but rather shrinks the parameter values.
- Cost function
$$J_R(\theta)=||y-X\theta||^2+\lambda||\theta||_2^2= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2+\lambda\sum^{n_p}_{j=1}\theta_j^2$$
- Ridge estimate
$$\hat\theta_R = (X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}y$$
This is the closed-form solution; a NumPy sketch follows below.
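A minimal NumPy sketch of the closed-form ridge estimate (the data, the value of $\lambda$, and the choice of whether an intercept column is penalized are illustrative assumptions):

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """theta_R = (X^T X + lambda * I)^{-1} X^T y, computed via a linear solve."""
    n_p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_p), X.T @ y)

# Illustrative data with two highly correlated covariates (near multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=200), rng.normal(size=200)])
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(scale=0.1, size=200)

print(ridge_estimate(X, y, lam=1.0))
```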
The ridge estimate is biased because:
$$\begin{aligned} E(\hat\theta_R) &= E\left( (X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}y \right) \\ &=(X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T} E(y)\\ &=(X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}X \theta\\ &=(X^\mathsf{T} X + \lambda I)^{-1}\left((X^\mathsf{T}X)^{-1}\right)^{-1}\theta\\ &=\left[(X^\mathsf{T}X)^{-1}(X^\mathsf{T} X + \lambda I)\right]^{-1} \theta\\ &=\left[ I + \lambda(X^\mathsf{T}X)^{-1}\right]^{-1}\theta \end{aligned}$$
$X^\mathsf{T}X$ is positive definite and $\lambda > 0$ by definition, so $\left[I + \lambda(X^\mathsf{T}X)^{-1}\right]^{-1} \neq I$ and therefore $E(\hat\theta_R)\neq \theta$.
Tips:
Although ridge regression does not perform variable selection, it performs grouped selection: if one variable among a group of correlated ones is selected, ridge regression automatically includes the whole group. Ridge regression can also handle near multicollinearity.
3.2.3 Elastic net
The elastic net penalty is a compromise between Lasso and ridge.
- Cost function
$$J_E(\theta)=||y-X\theta||^2+\lambda_1||\theta||_1+\lambda_2||\theta||_2^2$$
If $X$ is an $n_d\times n_p$ design matrix,
$$J_E(\theta)=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2+\lambda_1\sum^{n_p}_{j=1}|\theta_j|+\lambda_2\sum^{n_p}_{j=1}\theta_j^2$$
- Elastic net estimate
$$\hat\theta_E =\underset{\theta}{\mathrm{argmin}}\left(||y-X\theta||^2+\lambda_1||\theta||_1+\lambda_2||\theta||_2^2\right)$$
There is no closed-form solution for $\hat\theta_E$, but $J_E(\theta)$ is convex, so it has a unique minimum; a usage sketch follows below.
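A hedged usage sketch, assuming scikit-learn is available: its `ElasticNet` estimator is parameterized by `alpha` (overall penalty strength) and `l1_ratio` (mix between the $L_1$ and $L_2$ penalties) rather than by $\lambda_1,\lambda_2$ directly; all values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha controls the overall penalty strength, l1_ratio the L1/L2 mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```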
4 Learning curve
Ways to increase prediction accuracy:
- variable selection: increase or reduce the number of covariates
- add polynomial features, e.g. $\{x^2, x_1x_2\}$
- regularized regression (Lasso, ridge regression and elastic nets)
- collect more data
Learning curve:
- A learning curve plots a metric of prediction accuracy, such as the cost function $J(\theta)$ or another error metric, as a function of a parameter that affects that metric (for example the training-set size); a sketch follows below.
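A minimal sketch of a learning curve, assuming the accuracy metric is the MSE and the varied parameter is the training-set size (all data below are synthetic and illustrative):

```python
import numpy as np

def mse(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)

# Hold out a validation set, then fit OLS on growing training subsets
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]
for n in [20, 50, 100, 200, 400]:
    theta = np.linalg.solve(X_train[:n].T @ X_train[:n], X_train[:n].T @ y_train[:n])
    # training error vs. validation error for each training-set size n
    print(n, mse(X_train[:n], y_train[:n], theta), mse(X_val, y_val, theta))
```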