Logistic Regression, L1/L2 Regularization, Gradient/Coordinate Descent in Detail


Generative model vs. Discriminative model:

Examples:

  • Generative model: Naive Bayes, HMM, VAE, GAN.
  • Discriminative model: Logistic Regression, CRF.

Objective function:

  • Generative model: $\max \; p(x, y)$
  • Discriminative model: $\max \; p(y \mid x)$

Difference:

  • Generative model: We first assume a distribution for the data, chosen based on computational efficiency and the characteristics of the data. Next, the model learns the parameters of that distribution. Then we can use the model to generate new data (e.g., sample new data from a fitted normal distribution).
  • Discriminative model: Its only purpose is to classify, that is, to tell the classes apart. As long as it finds a way to tell the difference, it does not need to learn anything else.

Relation:

  • Generative model: $p(x, y) = p(x \mid y)\,p(y)$; it has a prior term $p(y)$.
  • Discriminative model: $p(y \mid x)$
  • Both kinds of model can be used for classification, but a discriminative model can only do classification. For classification problems, a discriminative model usually performs better. On the other hand, with limited data a generative model might perform better, since its prior term $p(y)$ acts as a form of regularization.

Logistic regression

Formula:

$$\sigma(x) = \frac{1}{1 + e^{-w^{T}x}}$$

Derivative formula:

  • $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$

Logistic Regression does not have an analytic (closed-form) solution, so we need iterative optimization to find a solution.

Computing $e^{x}$ in floating-point arithmetic takes a lot of computational power. In many implementations, people pre-compute the function on a set of values and then approximate the result when a new $x$ arrives.
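
As a rough illustration of that idea (my sketch, not the author's implementation), one can precompute $\sigma$ on a fixed grid and interpolate for new inputs; the grid range and resolution below are arbitrary assumptions.

```python
import numpy as np

# Precompute the sigmoid on a fixed grid (range and step are arbitrary choices).
GRID = np.linspace(-10.0, 10.0, 2001)          # inputs from -10 to 10
SIGMOID_TABLE = 1.0 / (1.0 + np.exp(-GRID))    # exact sigmoid on the grid

def fast_sigmoid(x):
    """Approximate sigmoid by linear interpolation into the precomputed table."""
    # np.interp clamps to the endpoint values outside [-10, 10], which is
    # acceptable here because the sigmoid saturates in that region.
    return np.interp(x, GRID, SIGMOID_TABLE)

x = np.array([-3.0, 0.0, 2.5])
print(fast_sigmoid(x))              # approximate values
print(1.0 / (1.0 + np.exp(-x)))     # exact values for comparison
```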

Derivation of Logistic Regression

$$
\begin{aligned}
p(y \mid x, w, b) &= p(y=1 \mid x, w, b)^{y}\,\bigl[1 - p(y=1 \mid x, w, b)\bigr]^{1-y} \\
\hat{w}_{MLE}, \hat{b}_{MLE} &= \argmax_{w, b} \prod_{i=1}^{n} p(y_{i} \mid x_{i}, w, b) \\
&= \argmax_{w, b} \sum_{i=1}^{n} \log p(y_{i} \mid x_{i}, w, b) \\
&= \argmin_{w, b} -\sum_{i=1}^{n} \log p(y_{i} \mid x_{i}, w, b) \\
&= \argmin_{w, b} -\sum_{i=1}^{n} \log \Bigl[ p(y_{i}=1 \mid x_{i}, w, b)^{y_{i}} \bigl[1 - p(y_{i}=1 \mid x_{i}, w, b)\bigr]^{1-y_{i}} \Bigr] \\
&= \argmin_{w, b} -\sum_{i=1}^{n} y_{i} \log p(y_{i}=1 \mid x_{i}, w, b) + (1-y_{i}) \log \bigl[1 - p(y_{i}=1 \mid x_{i}, w, b)\bigr] \\
&= \argmin_{w, b} L \\
\frac{\partial L}{\partial b} &= -\sum_{i=1}^{n}\left( y_{i} \cdot \frac{\sigma(w^{T}x_{i}+b)\,[1-\sigma(w^{T}x_{i}+b)]}{\sigma(w^{T}x_{i}+b)} - (1-y_{i}) \cdot \frac{\sigma(w^{T}x_{i}+b)\,[1-\sigma(w^{T}x_{i}+b)]}{1-\sigma(w^{T}x_{i}+b)} \right) \\
&= -\sum_{i=1}^{n}\Bigl( y_{i}\,[1-\sigma(w^{T}x_{i}+b)] - (1-y_{i})\,\sigma(w^{T}x_{i}+b) \Bigr) \\
&= \sum_{i=1}^{n}\Bigl( \sigma(w^{T}x_{i}+b) - y_{i} \Bigr) \\
\frac{\partial L}{\partial w} &= \sum_{i=1}^{n}\Bigl( \sigma(w^{T}x_{i}+b) - y_{i} \Bigr) \cdot x_{i}
\end{aligned}
$$
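
To make the closed-form gradients concrete, here is a minimal sketch (not from the original post) that implements the negative log-likelihood $L$ and checks $\frac{\partial L}{\partial b}$ against a finite-difference estimate; the random data and all variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b, X, y):
    """Negative log-likelihood L of logistic regression."""
    p = sigmoid(X @ w + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grads(w, b, X, y):
    """Closed-form gradients derived above: sum_i (sigma(w^T x_i + b) - y_i) [* x_i]."""
    r = sigmoid(X @ w + b) - y
    return X.T @ r, np.sum(r)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.1

gw, gb = grads(w, b, X, y)

# Finite-difference check of dL/db; should match gb to several decimal places.
eps = 1e-6
num_gb = (nll(w, b + eps, X, y) - nll(w, b - eps, X, y)) / (2 * eps)
print(gb, num_gb)
```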

Gradient:

True gradient: the gradient with respect to the whole population (the true data distribution).

Empirical gradient: use the sample gradient to approximate the true gradient.

  • SGD: a poor per-step approximation, but it asymptotically converges to the true gradient.
  • GD: a good approximation.
  • Mini-batch GD: somewhere in between.

For smooth optimization problems, we can use gradient descent.

For non-smooth optimization problems (e.g., the L1 penalty), we can use coordinate descent.

Mini-Batch Gradient Descent for Logistic Regression
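
A minimal sketch of what a mini-batch gradient descent implementation for logistic regression could look like, using the gradient $\sum_i (\sigma(w^T x_i + b) - y_i)\,x_i$ derived above; the learning rate, batch size, epoch count, and synthetic data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for logistic regression (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        idx = rng.permutation(n)                   # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            r = sigmoid(Xb @ w + b) - yb           # residual: sigma(w^T x + b) - y
            w -= lr * (Xb.T @ r) / len(batch)      # averaged gradient step on w
            b -= lr * np.mean(r)                   # averaged gradient step on b
    return w, b

# Synthetic, roughly linearly separable data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 > 0).astype(float)
w, b = minibatch_gd(X, y)
print(w, b, np.mean((sigmoid(X @ w + b) > 0.5) == y))   # training accuracy
```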

Ways to prevent overfitting:

  • More data.
  • Regularization.
  • Ensemble models.
  • Less complicated models.
  • Fewer features.
  • Add noise (e.g., Dropout).

L1 regularization

L1 performs feature selection: it keeps a subset of the original features. PCA, by contrast, changes the features, replacing them with new combinations of the original ones.

Why prefer sparsity:

  1. Reduced dimensionality, hence less computation.
  2. Higher interpretability.

Problem of L1:

  • Group effect: if some features are collinear (they form a correlated group), L1 tends to keep only one feature from the group, chosen more or less arbitrarily, so the best feature in each group might not be the one selected (see the sketch below).
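
A small sketch (using scikit-learn's `Lasso`, which is an assumption on my part rather than code from the original post) illustrating the group effect: with two nearly identical columns, the L1 penalty tends to put the shared weight on just one of them, and which one it keeps is essentially arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Two nearly identical (collinear) features plus one independent feature.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)   # almost an exact copy of x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 3 * x2 + 2 * x3 + 0.1 * rng.normal(size=200)

# Lasso typically keeps only one of the two correlated columns
# (one coefficient of the pair ends up at or near zero).
print(Lasso(alpha=0.5).fit(X, y).coef_)
```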

Coordinate Descent for Lasso

  • Intuition: if we are at a point $x$ such that $f(x)$ is minimized along each coordinate axis, then we have found a global minimizer. (This holds for smooth convex $f$, and more generally for objectives like the Lasso that are a smooth convex term plus a separable convex term.)

  • Steps of coordinate descent: note that we do not need a learning rate here, since each coordinate update jumps directly to the exact minimizer along that coordinate; a sketch follows below.
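
Below is a minimal sketch of coordinate descent for the Lasso objective $\frac{1}{2}\lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1$ (this scaling convention is my assumption); each coordinate is updated with the closed-form soft-thresholding solution, so no learning rate appears.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator: closed-form solution of the 1-D Lasso problem."""
    if rho < -lam:
        return rho + lam
    if rho > lam:
        return rho - lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Coordinate descent for min_beta 0.5*||y - X beta||^2 + lam*||beta||_1 (sketch)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual that excludes the contribution of feature j.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            # Exact minimizer along coordinate j: no learning rate needed.
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([2.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=100)
print(lasso_coordinate_descent(X, y, lam=5.0))   # sparse estimate close to true_beta
```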

Large eigenvalues caused by collinear columns

If an invertible matrix $A$ has an eigenvalue $\lambda$, then $A^{-1}$ has the eigenvalue $\frac{1}{\lambda}$:

$$A\mathbf{v} = \lambda\mathbf{v} \implies A^{-1}A\mathbf{v} = \lambda A^{-1}\mathbf{v} \implies A^{-1}\mathbf{v} = \frac{1}{\lambda}\mathbf{v}$$

If the matrix $A = X^{T}X$ has a very small eigenvalue, then its inverse $(X^{T}X)^{-1}$ has a very large eigenvalue. Recall that the least-squares formula for $\beta$ is
$$\hat \beta = (X^{T}X)^{-1}X^{T}y$$

If we plug in $y = X\beta^{*} + \epsilon$, then $\hat \beta = \beta^{*} + (X^{T}X)^{-1}X^{T}\epsilon$.

Multiplying the noise by $(X^{T}X)^{-1}$ has the potential to blow up the noise; this is called "noise amplification". If we set the error term $\epsilon$ to zero, there is no noise to amplify, so the huge eigenvalues of $(X^{T}X)^{-1}$ cause no problem and we still recover the correct answer. But even a little bit of error, and this goes out the window.

If we now add some regularization (a.k.a. weight decay), then
$$\hat \beta = (X^{T}X + \lambda I)^{-1}X^{T}y$$

Adding a small multiple of the identity to $X^{T}X$ barely changes the large eigenvalues, but it drastically changes the smallest one: every eigenvalue is increased by $\lambda$, so the smallest is now at least $\lambda$. Thus, in the inverse, the largest eigenvalue will be at most $\frac{1}{\lambda}$.
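
A small NumPy sketch (my illustration, with arbitrarily chosen noise levels and $\lambda$) of the whole argument: nearly collinear columns give $X^{T}X$ a tiny eigenvalue, plain least squares then amplifies even small noise, and adding $\lambda I$ keeps the estimate under control.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)             # nearly collinear with x1
X = np.column_stack([x1, x2])
beta_true = np.array([1.0, 1.0])
y = X @ beta_true + 0.01 * rng.normal(size=n)   # a little noise

A = X.T @ X
print(np.linalg.eigvalsh(A))             # one eigenvalue is tiny due to collinearity

# Ordinary least squares: the tiny eigenvalue of X^T X amplifies the noise.
beta_ols = np.linalg.solve(A, X.T @ y)
print(beta_ols)                          # typically far from (1, 1)

# Ridge: adding lambda*I lifts the smallest eigenvalue to at least lambda,
# so the largest eigenvalue of the inverse is at most 1/lambda.
lam = 0.1
beta_ridge = np.linalg.solve(A + lam * np.eye(2), X.T @ y)
print(beta_ridge)                        # close to (1, 1)
```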

Building models with prior knowledge (encoding relations into the model via regularization):

  • model + $\lambda \cdot$ regularization term
  • Constrained optimization
  • Probabilistic models (e.g., Probabilistic Graphical Models (PGMs))
