ML step by step | 2020-01-03

Machine Learning problem discussion

Proof that minimizing the regularized error function is equivalent to minimizing the unregularized sum-of-squares error

(2 points) Using the technique of Lagrange multipliers, show that minimization of the regularized error function

$$\frac{1}{2}\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}+\frac{\lambda}{2}\sum_{j=1}^{n}\left|\omega_{j}\right|^{q}$$

is equivalent to minimizing the unregularized sum-of-squares error

$$\frac{1}{2}\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}$$

subject to the constraint

$$\sum_{j=1}^{n}\left|\omega_{j}\right|^{q}\leq\eta$$

Proof:

$$\min_{\boldsymbol{\omega}}\ \frac{1}{2}\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}\quad\text{s.t.}\quad\sum_{j=1}^{n}\left|\omega_{j}\right|^{q}\leq\eta$$

For $q\geq 1$ this is a convex optimization problem. Its Lagrange function (with the multiplier written as $\frac{\lambda}{2}$ to match the regularized objective) is

$$L=\frac{1}{2}\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}+\frac{\lambda}{2}\left(\sum_{j=1}^{n}\left|\omega_{j}\right|^{q}-\eta\right)$$

Let $\omega^{*}$, $\lambda^{*}$ be the optimal solutions of the primal and dual problems. The KKT conditions require

$$0\leq\lambda^{*}$$

together with the component-wise stationarity condition (using the subgradient $k_{j}$, since $|\cdot|^{q}$ is not differentiable at 0 when $q=1$):

$$0=\frac{\partial L}{\partial\omega_{j}}=-\sum_{i=1}^{n}x_{ij}\left(y_{i}-\sum_{j'=1}^{n}\omega_{j'}x_{ij'}\right)+\frac{\lambda q}{2}\left|\omega_{j}\right|^{q-1}k_{j},\qquad k_{j}=\begin{cases}\subseteq[-1,1], & \text{if }\ \omega_{j}=0\\ 1, & \text{if }\ \omega_{j}>0\\ -1, & \text{if }\ \omega_{j}<0\end{cases}$$

$$\therefore\ \frac{\lambda q}{2}\left|\omega_{j}\right|^{q-1}k_{j}=\sum_{i=1}^{n}x_{ij}\left(y_{i}-\sum_{j'=1}^{n}\omega_{j'}x_{ij'}\right)$$

For the unconstrained regularized optimization function, the first-order condition is

$$\frac{\partial}{\partial\omega_{j}}\left[\frac{1}{2}\sum_{i=1}^{n}\left(y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}+\frac{\lambda}{2}\sum_{j=1}^{n}\left|\omega_{j}\right|^{q}\right]=-\sum_{i=1}^{n}x_{ij}\left(y_{i}-\sum_{j'=1}^{n}x_{ij'}\omega_{j'}\right)+\frac{\lambda q}{2}\left|\omega_{j}\right|^{q-1}k_{j}=0$$

$$\therefore\ \frac{\lambda q}{2}\left|\omega_{j}\right|^{q-1}k_{j}=\sum_{i=1}^{n}x_{ij}\left(y_{i}-\sum_{j'=1}^{n}\omega_{j'}x_{ij'}\right)$$

The two problems share the same first-order conditions: the minimizer $\omega^{*}$ of the regularized objective with parameter $\lambda$ also solves the constrained problem with $\eta=\sum_{j=1}^{n}|\omega_{j}^{*}|^{q}$, and conversely the constrained problem with multiplier $\lambda^{*}$ gives back the regularized objective. Hence the two minimizations are equivalent.
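The equivalence can be checked numerically for the $q=2$ (ridge) case, where the regularized problem has a closed-form solution. The sketch below is an illustration with made-up data (the matrix sizes, seed, and $\lambda=5$ are arbitrary assumptions): it solves the regularized problem in closed form, sets $\eta$ to the resulting $\sum_j\omega_j^2$, and verifies that the constrained sum-of-squares problem recovers the same minimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 5.0

# Regularized minimizer for q = 2: closed-form solution of
# (1/2)||y - Xw||^2 + (lam/2)||w||^2.
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Set the constraint level eta from the regularized solution ...
eta = np.sum(w_reg ** 2)

# ... and solve the constrained sum-of-squares problem with that eta.
sse = lambda w: 0.5 * np.sum((y - X @ w) ** 2)
constraint = {"type": "ineq", "fun": lambda w: eta - np.sum(w ** 2)}
res = minimize(sse, np.zeros(3), constraints=[constraint])

print(np.allclose(res.x, w_reg, atol=1e-3))  # → True
```

The constrained solver (SLSQP) lands on the boundary $\sum_j\omega_j^2=\eta$ and returns the same weights as the ridge solution, as the Lagrange-multiplier argument predicts.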

MAP of LASSO

(2 points) (MAP Estimation) We mentioned that when the prior on $\theta$ is an isotropic Laplace distribution, MAP corresponds to LASSO (L1-regularization). Now you are maximizing the likelihood function $\prod_{i=1}^{n}p(x_i|\theta)$ with prior distribution

$$p(\theta)=\frac{\lambda}{2}\exp\left(-\lambda\left|\theta\right|\right),\qquad\lambda>0$$

Please prove that this is equivalent to maximizing

$$\log\prod_{i=1}^{n}p(x_i|\theta)-\lambda\left|\theta\right|$$
**Proof:**

$$\begin{aligned}
\arg\max_{\theta}\ \prod_{i=1}^{n}p(x_i|\theta)\,p(\theta)
&=\arg\max_{\theta}\ \log\left(\prod_{i=1}^{n}p(x_i|\theta)\,p(\theta)\right)\\
&=\arg\max_{\theta}\ \log\prod_{i=1}^{n}p(x_i|\theta)+\log p(\theta)\\
&=\arg\max_{\theta}\ \log\prod_{i=1}^{n}p(x_i|\theta)+\log\frac{\lambda}{2}-\lambda\left|\theta\right|\\
&=\arg\max_{\theta}\ \log\prod_{i=1}^{n}p(x_i|\theta)-\lambda\left|\theta\right|
\end{aligned}$$

where the last step drops $\log\frac{\lambda}{2}$ because it does not depend on $\theta$.
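Because dropping the constant $\log\frac{\lambda}{2}$ cannot move the argmax, the MAP estimate and the penalized-likelihood estimate coincide exactly. A minimal grid-search sketch (the Gaussian likelihood, sample size, and $\lambda=4$ are illustrative assumptions, not from the problem):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.8, scale=1.0, size=30)
lam = 4.0
thetas = np.linspace(-3.0, 3.0, 6001)

# Gaussian log-likelihood with unit variance, up to a theta-independent constant.
loglik = -0.5 * np.sum((x[:, None] - thetas[None, :]) ** 2, axis=0)

# MAP with the Laplace prior p(theta) = (lam/2) exp(-lam |theta|) ...
log_prior = np.log(lam / 2.0) - lam * np.abs(thetas)
map_est = thetas[np.argmax(loglik + log_prior)]

# ... versus maximizing log-likelihood - lam * |theta| directly.
pen_est = thetas[np.argmax(loglik - lam * np.abs(thetas))]

print(map_est == pen_est)  # → True
```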

(2 points) (Mean Square Error) We mentioned the Bias-Variance Tradeoff in class. We define the MSE of $\hat{X}$, an estimator of $X$, as

$$MSE(\hat{X})\triangleq E\left[(\hat{X}-X)^{2}\right]$$

The variance of $\hat{X}$ is defined as

$$Var(\hat{X})\triangleq E\left[(\hat{X}-E[\hat{X}])^{2}\right]$$

and the bias is defined as

$$Bias(\hat{X})\triangleq E[\hat{X}]-X.$$

(a) Show that

$$MSE[\hat{X}]=Var[\hat{X}]+(Bias[\hat{X}])^{2}$$
(b) Our data are corrupted by an independent Gaussian noise $N$ with $E[N]=0$ and $E[N^{2}]=\sigma^{2}$, so we observe $X+N$; the estimator is $\hat{X}$. We define the empirical MSE as $E[(\hat{X}-X-N)^{2}]$. Show that

$$E\left[(\hat{X}-X-N)^{2}\right]=MSE[\hat{X}]+\sigma^{2}$$

This equation tells us that the empirical error is a good estimate of the true error (they differ only by the constant $\sigma^{2}$), so we can minimize the empirical error in order to minimize the true error.

Proof:

(a)

$$\begin{aligned}
MSE[\hat{X}]&=E\left[(\hat{X}-X)^{2}\right]\\
&=E\left[\left((\hat{X}-E[\hat{X}])+(E[\hat{X}]-X)\right)^{2}\right]\\
&=E\left[(\hat{X}-E[\hat{X}])^{2}\right]+(E[\hat{X}]-X)^{2}+2\,(E[\hat{X}]-X)\,E\left[\hat{X}-E[\hat{X}]\right]\\
&=Var[\hat{X}]+(Bias[\hat{X}])^{2}
\end{aligned}$$

where the cross term vanishes because $E[\hat{X}-E[\hat{X}]]=0$; here $X$ is treated as a fixed constant, so $E[\hat{X}]-X$ can be pulled out of the expectation.
(b) Assuming the noise $N$ is independent of the estimation error $\hat{X}-X$,

$$\begin{aligned}
E\left[(\hat{X}-X-N)^{2}\right]&=E\left[(\hat{X}-X)^{2}\right]-2E\left[(\hat{X}-X)N\right]+E\left[N^{2}\right]\\
&=MSE[\hat{X}]-2E\left[\hat{X}-X\right]E[N]+\sigma^{2}\\
&=MSE[\hat{X}]+\sigma^{2}
\end{aligned}$$

using $E[N]=0$ and $E[N^{2}]=\sigma^{2}$.
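Both identities are easy to check by simulation. The sketch below uses an arbitrary shrinkage estimator of a fixed true value (the numbers 3.0, 0.8, and $\sigma=0.5$ are illustrative assumptions): part (a) is an exact identity that holds even at the sample level, while part (b) holds up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
X = 3.0                 # true value, treated as a fixed constant
sigma = 0.5
n_trials = 200_000

# A deliberately biased estimator: shrink a noisy measurement toward zero.
X_hat = 0.8 * (X + rng.normal(0.0, 1.0, n_trials))

# (a) MSE = Var + Bias^2 -- exact, even for sample averages.
mse = np.mean((X_hat - X) ** 2)
var = np.var(X_hat)
bias = np.mean(X_hat) - X
print(np.isclose(mse, var + bias ** 2))  # → True

# (b) empirical MSE ≈ MSE + sigma^2, with independent noise N.
N = rng.normal(0.0, sigma, n_trials)
empirical = np.mean((X_hat - X - N) ** 2)
print(np.isclose(empirical, mse + sigma ** 2, atol=0.01))  # → True
```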

Applications of VC dimension

(4 points) (VC Dimension) Given a finite domain set $\chi$ and a number $k\leq|\chi|$, determine the VC-dimension of each of the following classes:

(a) (2 points)

$$H_{k}^{\chi}=\left\{h\in\{0,1\}^{\chi}:\left|\{x:h(x)=1\}\right|=k\right\}$$

That is, the set of all functions that assign the value 1 to exactly $k$ elements of $\chi$.

(b) (2 points)

$$H_{k}^{\chi}=\left\{h\in\{0,1\}^{\chi}:\left|\{x:h(x)=0\}\right|\leq k~~\text{or}~~\left|\{x:h(x)=1\}\right|\leq k\right\}$$
Solution:

(a)

Every hypothesis in $H_{k}^{\chi}$ assigns the value 1 to exactly $k$ elements of $\chi$. If we take $k+1$ points and label all of them 1, no hypothesis in the class realizes this labeling, so no set of $k+1$ points can be shattered. Symmetrically, if $|\chi|$ is not large enough, the all-zero labeling also fails: labeling $|\chi|-k+1$ points with 0 leaves only $k-1$ points available for the required $k$ ones, so no set of $|\chi|-k+1$ points can be shattered either. Conversely, any set of $\min(k,|\chi|-k)$ points can be shattered: given a labeling with $m$ ones ($m\leq k$), assign the remaining $k-m$ ones to points outside the set, which is always possible because at least $k$ points lie outside it. Hence the VC-dimension of $H_{k}^{\chi}$ is $\min(k,|\chi|-k)$.

(b)

Every hypothesis in this class assigns the value 1 to at most $k$ elements of $\chi$ or the value 0 to at most $k$ elements. Take any set of $2k+1$ points (assuming $|\chi|\geq 2k+1$). For any labeling of these points, either at most $k$ of them are labeled 1 or at most $k$ are labeled 0. In the first case, label every remaining point of $\chi$ with 0; in the second, label every remaining point with 1. Either way the resulting hypothesis belongs to the class, so the set is shattered. However, a set of $2k+2$ points with $k+1$ points labeled 1 and $k+1$ labeled 0 cannot be realized by any hypothesis in $H$, since the full labeling would then have more than $k$ ones and more than $k$ zeros. Hence the VC-dimension of $H_{k}^{\chi}$ is $2k+1$; taking $|\chi|$ into account, it is $\min(2k+1,|\chi|)$.
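Both answers can be verified by brute force on a small domain. The sketch below (the domain size $n=6$ and $k=2$ are arbitrary illustrative choices) enumerates every hypothesis in each class and searches for the largest shattered subset:

```python
from itertools import combinations, product

def vc_dimension(domain, hypotheses):
    """Largest m such that some m-element subset of the domain is shattered."""
    best = 0
    for m in range(1, len(domain) + 1):
        for S in combinations(domain, m):
            # S is shattered if every labeling of S is realized by some h.
            if all(any(tuple(h[x] for x in S) == lab for h in hypotheses)
                   for lab in product((0, 1), repeat=m)):
                best = m
                break
    return best

n, k = 6, 2
domain = list(range(n))

# (a) hypotheses that assign 1 to exactly k elements
H_a = [{x: int(x in ones) for x in domain}
       for ones in combinations(domain, k)]

# (b) hypotheses with at most k ones or at most k zeros
H_b = [{x: int(x in ones) for x in domain}
       for m in range(n + 1)
       for ones in combinations(domain, m)
       if m <= k or n - m <= k]

print(vc_dimension(domain, H_a))  # → 2  (= min(k, n - k))
print(vc_dimension(domain, H_b))  # → 5  (= min(2k + 1, n))
```

For $n=6$, $k=2$ the search returns $\min(k,n-k)=2$ for class (a) and $\min(2k+1,n)=5$ for class (b), matching the analysis above.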
