From the perspective of probability theory:
Author: bsdelf
Link: http://www.zhihu.com/question/20447622/answer/25186207
Source: Zhihu
Copyright belongs to the author. Please contact the author for permission before reposting.
- The closed-form solution of Least Squares can be derived using the Gaussian distribution and maximum likelihood estimation
- Ridge regression can be explained using the Gaussian distribution and maximum a posteriori estimation
- LASSO regression can be explained using the Laplace distribution and maximum a posteriori estimation
-------------------------------------------------------------------
Below is a derivation I wrote some time ago, for your reference; I believe you will find it enlightening. If you spot any errors, corrections are welcome -_-
Note:
- It is assumed that you already know: the Gaussian distribution, the Laplace distribution, maximum likelihood estimation, and maximum a posteriori estimation (MAP).
- Following Dr. Hang Li's view, the three elements of machine learning are: model, strategy, and algorithm. One model can have several solution strategies, and each strategy may in turn admit several computational methods. Below we discuss only models and strategies, not algorithms. (How to actually compute the solution, whether the problem is convex or non-convex, and how to write the program are numerical analysis questions.)
First, assume the linear regression model has the following form:
$$f(\mathbf x) = \sum_{j=1}^{d} x_j w_j + \epsilon = \mathbf x \mathbf w^\intercal + \epsilon$$
where $\mathbf x \in \mathbb R^{1 \times d}$, $\mathbf w \in \mathbb R^{1 \times d}$, and the error $\epsilon \in \mathbb R$.
Given $\mathbf X \in \mathbb R^{n \times d}$ and $\mathbf y \in \mathbb R^{n \times 1}$, how do we solve for $\mathbf w$?
Strategy 1. Assume $\epsilon_i \sim N(0, \sigma^2)$, which is to say $\mathbf y_i \sim N(\mathbf x_i \mathbf w^\intercal, \sigma^2)$. Then derive with maximum likelihood estimation:
$$\begin{align*} \text{arg\,max}_{\mathbf w} L(\mathbf w) & = \ln \prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\mathbf y_i - \mathbf x_i \mathbf w^\intercal}{\sigma}\right)^2\right) \\ & = -\frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf y_i - \mathbf x_i \mathbf w^\intercal)^2 - n \ln \sigma\sqrt{2\pi} \end{align*}$$
Dropping the terms that do not depend on $\mathbf w$ and flipping the sign, maximizing the likelihood is equivalent to:
$$\text{arg\,min}_{\mathbf w} f(\mathbf w) = \sum_{i=1}^n (\mathbf y_i - \mathbf x_i \mathbf w^\intercal)^2 = \left\lVert \mathbf y - \mathbf X \mathbf w^\intercal \right\rVert_2^2$$
This is exactly least squares.
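As a quick numerical sanity check (not part of the original answer), here is a minimal NumPy sketch that solves this objective in closed form via the normal equations; the data are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: n = 100 samples, d = 3 features (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# Closed-form least-squares solution: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ols)  # should roughly recover [1.5, -2.0, 0.5]
```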
Strategy 2. Assume $\epsilon_i \sim N(0, \sigma^2)$ and $\mathbf w_j \sim N(0, \tau^2)$. Then derive with maximum a posteriori estimation:
$$\begin{align*} \text{arg\,max}_{\mathbf w} L(\mathbf w) & = \ln \prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\mathbf y_i - \mathbf x_i \mathbf w^\intercal}{\sigma}\right)^2\right) \cdot \prod_{j=1}^d \frac{1}{\tau \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\mathbf w_j}{\tau}\right)^2\right) \\ & = -\frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf y_i - \mathbf x_i \mathbf w^\intercal)^2 - \frac{1}{2\tau^2} \sum_{j=1}^d \mathbf w_j^2 - n \ln \sigma\sqrt{2\pi} - d \ln \tau\sqrt{2\pi} \end{align*}$$
Dropping constants and multiplying through by $2\sigma^2$, with $\lambda = \sigma^2 / \tau^2$, this is equivalent to:
$$\begin{align*} \text{arg\,min}_{\mathbf w} f(\mathbf w) &= \sum_{i=1}^n (\mathbf y_i - \mathbf x_i \mathbf w^\intercal)^2 + \lambda \sum_{j=1}^d \mathbf w_j^2 \\ &= \left\lVert \mathbf y - \mathbf X \mathbf w^\intercal \right\rVert_2^2 + \lambda \left\lVert \mathbf w \right\rVert_2^2 \end{align*}$$
This is exactly Ridge regression.
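Ridge also has a closed form, so here is a similarly minimal sketch (again with made-up data and a hypothetical choice of lambda):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Closed-form Ridge solution: w = (X^T X + lambda * I)^{-1} X^T y,
# where lambda = sigma^2 / tau^2 in the MAP derivation above.
lam = 0.1  # hypothetical value; in practice chosen by cross-validation
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)  # shrunk toward zero relative to the least-squares solution
```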
Strategy 3. Assume $\epsilon_i \sim N(0, \sigma^2)$ and $\mathbf w_j \sim \text{Laplace}(0, b)$. Again derive with maximum a posteriori estimation:
$$\begin{align*} \text{arg\,max}_{\mathbf w} L(\mathbf w) & = \ln \prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\mathbf y_i - \mathbf x_i \mathbf w^\intercal}{\sigma}\right)^2\right) \cdot \prod_{j=1}^d \frac{1}{2b} \exp\left(-\frac{|\mathbf w_j|}{b}\right) \\ & = -\frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf y_i - \mathbf x_i \mathbf w^\intercal)^2 - \frac{1}{b} \sum_{j=1}^d |\mathbf w_j| - n \ln \sigma\sqrt{2\pi} - d \ln 2b \end{align*}$$
Again dropping constants and multiplying through by $2\sigma^2$, with $\lambda = 2\sigma^2 / b$, this is equivalent to:
$$\begin{align*} \text{arg\,min}_{\mathbf w} f(\mathbf w) &= \sum_{i=1}^n (\mathbf y_i - \mathbf x_i \mathbf w^\intercal)^2 + \lambda \sum_{j=1}^d |\mathbf w_j| \\ &= \left\lVert \mathbf y - \mathbf X \mathbf w^\intercal \right\rVert_2^2 + \lambda \left\lVert \mathbf w \right\rVert_1 \end{align*}$$
This is exactly LASSO.
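Unlike the previous two, LASSO has no closed form because the $\ell_1$ term is non-differentiable. Below is a minimal sketch of one standard solver, ISTA (proximal gradient descent with soft thresholding), on the same made-up data; the solver and the choice of lambda are my illustration, not part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (element-wise soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize ||y - X w||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)  # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w

w_lasso = lasso_ista(X, y, lam=5.0)  # hypothetical lambda; larger values drive more weights to exactly zero
print(w_lasso)
```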
I hope that all made sense; it really is a beautifully unified picture.
Homework :)
- The final objective functions of strategies 1 and 2 are ordinary extremum problems; try to derive their closed-form solutions.
- One common kind of regression has not been mentioned anywhere above, yet it also fits into this framework; find strategy 4 and derive it.