ML step by step|20200103

Machine Learning problem discussion

Proof that minimizing the regularized error function is equivalent to minimizing the unregularized sum-of-squares error

(2 points) Using the technique of Lagrange multipliers, show that minimization of the regularized error function

$$\frac{1}{2}\sum_{i=1}^{n}\left( y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}+\frac{\lambda }{2}\sum_{j=1}^{n}\left | \omega_j \right |^{q}$$

is equivalent to minimizing the unregularized sum-of-squares error

$$\frac{1}{2}\sum_{i=1}^{n}\left( y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}$$

subject to the constraint

$$\sum_{j=1}^{n}\left | \omega_j \right |^{q}\leqslant \eta$$

**Proof:**

$$\min_{\boldsymbol{\omega}}\ \frac{1}{2}\sum_{i=1}^{n}\left( y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}\quad \text{s.t.}\quad \sum_{j=1}^{n}\left | \omega_j \right |^{q}\leqslant \eta$$

This is a convex optimization problem (for $q \geq 1$). Its Lagrangian is

$$L(\boldsymbol{\omega},\lambda) =\frac{1}{2}\sum_{i=1}^{n}\left( y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}+\frac{\lambda}{2}\left(\sum_{j=1}^{n}\left | \omega_j \right |^{q}-\eta\right)$$
According to the KKT conditions, let $\boldsymbol{\omega}^{*}$ and $\lambda^{*}$ be the optimal solutions of the primal and dual problems. Then

$$0\leqslant \lambda^{*}$$

$$0=\nabla_{\omega_j}\left(\frac{1}{2}\sum_{i=1}^{n}\left( y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2}+\frac{\lambda}{2}\Big(\sum_{j=1}^{n}\left | \omega_j \right |^{q}-\eta\Big)\right)=-\sum_{i=1}^{n}x_{ij}\Big(y_i-\sum_{l=1}^{n}\omega_l x_{il}\Big)+\frac{\lambda q}{2}\left | \omega_j \right |^{q-1}k_j$$

where $k_j$ is the subgradient factor of $|\omega_j|$:

$$k_j=\begin{cases} \text{any value in } [-1,1], & \text{if } \omega_j=0 \\ 1, & \text{if } \omega_j>0 \\ -1, & \text{if } \omega_j<0 \end{cases}$$

$$\therefore\ \frac{\lambda q}{2}\left | \omega_j \right |^{q-1}k_j= \sum_{i=1}^{n}x_{ij}\Big(y_{i}-\sum_{l=1}^{n}\omega_l x_{il}\Big)$$

For the unconstrained regularized objective, the first-order condition (F.O.C.) with respect to $\omega_j$ is

$$\frac{\partial }{\partial \omega_j} \left [ \frac{1}{2}\sum_{i=1}^{n}\left( y_{i}-\boldsymbol{\omega}^{T}\mathbf{x}_{i}\right)^{2} + \frac{\lambda }{2}\sum_{j=1}^{n}\left | \omega_j \right |^{q} \right ] =-\sum_{i=1}^{n}x_{ij}\Big(y_i-\sum_{l=1}^{n}x_{il}\omega_l\Big)+\frac{\lambda q}{2}\left | \omega_j \right |^{q-1}k_j=0$$

$$\therefore\ \frac{\lambda q}{2}\left | \omega_j \right |^{q-1}k_j =\sum_{i=1}^{n}x_{ij}\Big(y_{i}-\sum_{l=1}^{n}\omega_l x_{il}\Big)$$

The two stationarity conditions are identical. Hence, for any $\lambda^{*}\geq 0$, the minimizer $\boldsymbol{\omega}^{*}$ of the regularized error function also satisfies the KKT conditions of the constrained problem with $\eta=\sum_{j=1}^{n}|\omega_j^{*}|^{q}$ (the constraint is then active, so complementary slackness holds), and conversely. The two problems are therefore equivalent.
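As an illustration, here is a minimal numerical sketch of this equivalence for $q=2$ (ridge regression, chosen because the penalized solution has a closed form). The synthetic data, the value $\lambda=5$, and the use of `scipy.optimize.minimize` are assumptions made only for this check, not part of the exercise.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 5.0
# Penalized solution: argmin 1/2*||y - Xw||^2 + lam/2*||w||^2 (closed form for q = 2).
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
eta = np.sum(w_pen ** 2)  # constraint level eta induced by this lambda

# Constrained solution: minimize the unregularized error subject to sum_j w_j^2 <= eta.
sse = lambda w: 0.5 * np.sum((y - X @ w) ** 2)
cons = {"type": "ineq", "fun": lambda w: eta - np.sum(w ** 2)}
w_con = minimize(sse, np.zeros(d), constraints=[cons], method="SLSQP").x

print(np.allclose(w_pen, w_con, atol=1e-4))  # expected: True, the two minimizers coincide
```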

MAP of LASSO

(2 points) (MAP Estimation) We mentioned that when the prior on $\theta$ is an isotropic Laplace distribution, MAP estimation corresponds to LASSO (L1-regularization). Now you are maximizing the likelihood function $\prod_{i=1}^{n}p(x_i|\theta)$ under the prior distribution

$$p(\theta )= \frac{\lambda }{2}\exp(-\lambda\left |\theta \right |),\qquad \lambda>0$$

Please prove that this is equivalent to maximizing

$$\log\prod_{i=1}^{n}p(x_i|\theta)-\lambda\left |\theta \right |$$
**Proof:**

$$\begin{aligned}
\arg\max_{\theta}\ \prod_{i=1}^{n}p(x_i|\theta)\,p(\theta)
&=\arg\max_{\theta}\ \log\Big(\prod_{i=1}^{n}p(x_i|\theta)\,p(\theta)\Big) \\
&=\arg\max_{\theta}\ \log\prod_{i=1}^{n}p(x_i|\theta)+\log p(\theta) \\
&=\arg\max_{\theta}\ \log\prod_{i=1}^{n}p(x_i|\theta)+\log\frac{\lambda}{2}-\lambda\left | \theta \right | \\
&=\arg\max_{\theta}\ \log\prod_{i=1}^{n}p(x_i|\theta)-\lambda\left | \theta \right |
\end{aligned}$$

The first equality holds because $\log$ is monotonically increasing, and the last because $\log\frac{\lambda}{2}$ is a constant that does not depend on $\theta$ and therefore does not change the maximizer.
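A quick numerical sketch of this equivalence, under the assumption (not stated in the exercise) that the likelihood is Gaussian, $p(x_i\mid\theta)=\mathcal{N}(\theta,1)$: on a grid of $\theta$ values, the MAP objective and the penalized log-likelihood peak at the same point.

```python
import numpy as np

# Assumed setup: Gaussian likelihood N(theta, 1) and lambda = 3 (illustrative choices).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=30)
lam = 3.0
thetas = np.linspace(-5.0, 5.0, 10001)

# Log-likelihood up to additive constants, evaluated on the grid.
log_lik = np.array([-0.5 * np.sum((x - t) ** 2) for t in thetas])
log_prior = np.log(lam / 2.0) - lam * np.abs(thetas)  # log of the Laplace prior

theta_map = thetas[np.argmax(log_lik + log_prior)]             # MAP estimate
theta_pen = thetas[np.argmax(log_lik - lam * np.abs(thetas))]  # penalized MLE
print(theta_map, theta_pen)  # the two maximizers coincide on the grid
```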

Bias-Variance Tradeoff and its applications

(2 points) (Mean Square Error) We mentioned the Bias-Variance Tradeoff in class. We define the MSE of $\hat{X}$, an estimator of $X$, as

$$MSE(\hat{X}) \triangleq E[(\hat{X}-X)^{2}]$$

The variance of $\hat{X}$ is defined as

$$Var(\hat{X}) \triangleq E[(\hat{X} -E[\hat{X}])^{2}]$$

and the bias is defined as

$$Bias(\hat{X}) \triangleq E[\hat{X}]-X.$$

(a) Please prove that

$$MSE[\hat{X}]=Var[\hat{X}]+(Bias[\hat{X}])^{2}$$

(b) Our data are corrupted by independent Gaussian noise, i.e. we observe $X + N$, where $E[N] = 0$ and $E[N^{2}] = \sigma^{2}$, and the estimator is $\hat{X}$. We define the empirical MSE as $E[(\hat{X} - X - N)^{2}]$.

Please prove that

$$E[(\hat{X} - X - N)^{2}]=MSE[\hat{X}]+\sigma ^{2}$$

This equation tells us that the empirical error is a good estimate of the true error (it differs only by the constant $\sigma^{2}$), so we can minimize the empirical error in order to minimize the true error.

**Proof:**

(a)

$$\begin{aligned}
MSE[\hat{X}]&=E[(\hat{X}-X)^{2}] \\
&=E\big[\big((\hat{X}-E[\hat{X}])+(E[\hat{X}]-X)\big)^{2}\big] \\
&=E[(\hat{X}-E[\hat{X}])^{2}]+E[(E[\hat{X}]-X)^{2}]+2E\big[(\hat{X}-E[\hat{X}])(E[\hat{X}]-X)\big] \\
&=Var[\hat{X}]+(Bias[\hat{X}])^{2}+2(E[\hat{X}]-X)\,E[\hat{X}-E[\hat{X}]] \\
&=Var[\hat{X}]+(Bias[\hat{X}])^{2}
\end{aligned}$$

where the cross term vanishes because $E[\hat{X}]-X$ is a constant and $E[\hat{X}-E[\hat{X}]]=0$.
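A Monte Carlo sketch of identity (a), using a deliberately biased estimator $\hat{X}=0.8\,(X+\varepsilon)$ of a fixed constant $X=3$ (both choices are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = 3.0                                        # the fixed quantity being estimated
noise = rng.normal(scale=1.0, size=1_000_000)
X_hat = 0.8 * (X + noise)                      # many draws of a biased estimator

mse = np.mean((X_hat - X) ** 2)
var = np.var(X_hat)
bias = np.mean(X_hat) - X
print(mse, var + bias ** 2)  # the two agree up to Monte Carlo error
```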
(b)

$$\begin{aligned}
E[(\hat{X} - X - N)^{2}]&=E\big[\big((\hat{X}-X)-N\big)^{2}\big] \\
&=E[(\hat{X}-X)^{2}]-2E[(\hat{X}-X)N]+E[N^{2}] \\
&=E[(\hat{X}-X)^{2}]-2E[\hat{X}-X]\,E[N]+\sigma^{2} \\
&=E[(\hat{X}-X)^{2}]+\sigma ^{2} \\
&=MSE[\hat{X}]+\sigma ^{2}
\end{aligned}$$

where the cross term vanishes because $N$ is independent of $X$ and $\hat{X}$ and $E[N]=0$.
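A companion sketch for (b): the same illustrative estimator, with independent Gaussian noise of standard deviation $\sigma=0.5$ (again an arbitrary choice) added to the target. The empirical MSE should exceed the true MSE by roughly $\sigma^{2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
X, sigma = 3.0, 0.5
X_hat = 0.8 * (X + rng.normal(scale=1.0, size=1_000_000))  # same biased estimator as above
N = rng.normal(scale=sigma, size=X_hat.shape)              # noise independent of X_hat

empirical_mse = np.mean((X_hat - X - N) ** 2)
true_mse = np.mean((X_hat - X) ** 2)
print(empirical_mse, true_mse + sigma ** 2)  # agree up to Monte Carlo error
```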

VC Dimension and its applications

(4 points) (VC Dimension) Given some finite domain set $\chi$ and a number $k \leq |\chi|$, determine the VC-dimension of each of the following classes:
(a) (2 points)
$$H_{k}^{\chi }=\left \{ h\in \left\{0,1\right\}^{\chi}:\left | \left \{ x:h(x)=1 \right \} \right |=k \right \}$$

That is, the set of all functions that assign the value 1 to exactly $k$ elements of $\chi$.
(b) (2 points)
$$H_{k}^{\chi }=\left \{ h\in \left\{0,1\right\}^{\chi}:\left | \left \{ x:h(x)=0 \right \} \right |\leq k ~~\text{or}~~\left | \left \{ x:h(x)=1 \right \} \right |\leq k \right \}$$
Solution:

(a)

Every hypothesis in this class assigns the value 1 to exactly $k$ elements of $\chi$. If we take any $k+1$ points and label them all "1", no hypothesis can realize that labeling; symmetrically, if we take any $|\chi|-k+1$ points and label them all "0", no hypothesis can realize that labeling either, because the remaining at most $k-1$ points cannot supply $k$ ones. Conversely, any set of at most $\min(k,\left |\chi \right |-k)$ points can be shattered: for any labeling, choose a hypothesis that assigns 1 to the points labeled "1" and to enough points outside the set to reach exactly $k$ ones, which is always possible because the set contains at most $k$ points labeled "1" and at most $|\chi|-k$ points labeled "0". Hence the VC-dimension of $H_{k}^{\chi}$ is $\min(k,\left |\chi \right |-k)$. A brute-force check on a small domain is sketched below.
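A small brute-force sketch as a sanity check, assuming a tiny domain $\chi=\{0,\dots,5\}$ and $k=2$ (chosen only so the search stays cheap): it enumerates the hypothesis class of part (a) and looks for the largest shattered subset.

```python
from itertools import combinations

chi = list(range(6))   # assumed tiny domain
k = 2
# Each hypothesis is identified with the set of points it maps to 1 (exactly k of them).
hypotheses = [set(s) for s in combinations(chi, k)]

def shattered(subset):
    # The subset is shattered iff every one of its 2^|subset| labelings is realized.
    labelings = {tuple(1 if x in h else 0 for x in subset) for h in hypotheses}
    return len(labelings) == 2 ** len(subset)

vc = max(d for d in range(len(chi) + 1)
         if d == 0 or any(shattered(s) for s in combinations(chi, d)))
print(vc, min(k, len(chi) - k))  # both should print 2
```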

(b)

Every hypothesis in this class assigns the value 1 to at most $k$ elements of $\chi$ or the value 0 to at most $k$ elements. Any set of $2k+1$ points can be shattered: in every labeling of such a set, either the points labeled "1" or the points labeled "0" number at most $k$ (both exceeding $k$ would require at least $2k+2$ points), so the hypothesis that assigns that label exactly to those points and the opposite label to every other point of $\chi$ realizes the labeling and belongs to the class. However, for a set of $2k+2$ points, the labeling that marks $k+1$ points as "1" and the other $k+1$ points as "0" cannot be realized by any hypothesis in $H$. Hence the VC-dimension of $H_{k}^{\chi}$ is $2k+1$; taking the size of the domain into account, it is $\min(2k+1,\left |\chi \right |)$. The brute-force check below confirms this on a small domain.
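The same brute-force check, now for the class in (b) with the assumed values $\chi=\{0,\dots,5\}$ and $k=1$, where the answer should be $\min(2k+1,|\chi|)=3$.

```python
from itertools import combinations

chi = list(range(6))   # assumed tiny domain
k = 1
# Identify each hypothesis with its set of 1-labeled points: at most k ones or at most k zeros.
all_ones_sets = [set(s) for d in range(len(chi) + 1) for s in combinations(chi, d)]
hypotheses = [s for s in all_ones_sets if len(s) <= k or len(chi) - len(s) <= k]

def shattered(subset):
    labelings = {tuple(1 if x in h else 0 for x in subset) for h in hypotheses}
    return len(labelings) == 2 ** len(subset)

vc = max(d for d in range(len(chi) + 1)
         if d == 0 or any(shattered(s) for s in combinations(chi, d)))
print(vc, min(2 * k + 1, len(chi)))  # both should print 3
```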
