Machine Learning Notes 1: Supervised learning

xdgs_2005

Category: Artificial Intelligence

Article link: https://blog.csdn.net/xdgs_2005/article/details/52423798


1.1 Classification and logistic regression
A classification problem is just like a regression problem, except that the values y we want to predict take on only a small number of discrete values. For now, we focus on the binary classification problem, in which y can take on only two values, 0 and 1. For instance, if we are building a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and y may be 1 if it is spam and 0 otherwise. 0 is also called the negative class and 1 the positive class; they are sometimes denoted by the symbols “-” and “+”. Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.
1.2 Logistic regression
Logistic regression uses the hypothesis

$h_{\theta}(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$

where

$g(z)=\frac{1}{1+e^{-z}}$

is the logistic (sigmoid) function.
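As a minimal sketch of the definition $g(z)=1/(1+e^{-z})$, the logistic function and the hypothesis can be written in plain Python. The names `sigmoid` and `hypothesis` are illustrative choices, and the split on the sign of `z` is an implementation detail added here to avoid overflow for large negative inputs; it is not part of the original notes.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equal form e^z / (1 + e^z),
    # so that math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for theta and x given as lists of floats."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))
```

The output always lies between 0 and 1, which is what lets it be read as a probability later on.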

A useful property of the derivative of g:

$g^{\prime}(z)=\frac{d}{dz}\frac{1}{1+e^{-z}}$

$=\frac{1}{(1+e^{-z})^2}\left(e^{-z}\right)$

$=\frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)$

$=g(z)(1-g(z))$
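The identity $g'(z)=g(z)(1-g(z))$ can be sanity-checked numerically. The sketch below compares the closed form with a central finite difference at a few arbitrarily chosen points; the helper names and the step size `eps` are assumptions made for illustration.

```python
import math

def g(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    """Closed form from the identity g'(z) = g(z)(1 - g(z))."""
    return g(z) * (1.0 - g(z))

def central_difference(f, z, eps=1e-6):
    """Numerical approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Largest disagreement between the two derivatives over a few test points.
points = (-3.0, -0.5, 0.0, 1.0, 4.0)
max_gap = max(abs(g_prime(z) - central_difference(g, z)) for z in points)
print(max_gap)
```

The printed gap should be tiny compared with the derivative values themselves, which peak at 0.25 when z = 0.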
Given the logistic regression model, how do we fit $\theta$ for it?
Let us assume that:

$P(y = 1 \mid x;\theta) = h_{\theta}(x)$

$P(y = 0 \mid x;\theta) = 1-h_{\theta}(x)$
This can be written more compactly as:

$p(y \mid x;\theta) = (h_{\theta}(x))^{y}(1-h_{\theta}(x))^{1-y}$
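To see why the compact form works, note that the exponent y switches one factor off: with y = 1 only $h_{\theta}(x)$ survives, and with y = 0 only $1-h_{\theta}(x)$ does. A short sketch, where `bernoulli_pmf` is a hypothetical helper name and `h` stands for the value of $h_{\theta}(x)$:

```python
def bernoulli_pmf(y, h):
    """p(y | x; theta) = h^y (1 - h)^(1 - y), where h = h_theta(x) and y is 0 or 1."""
    return (h ** y) * ((1.0 - h) ** (1 - y))

h = 0.8  # an arbitrary example value of h_theta(x)
p_positive = bernoulli_pmf(1, h)  # reduces to h
p_negative = bernoulli_pmf(0, h)  # reduces to 1 - h
```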

Assuming that the m training examples were generated independently, the likelihood of the parameters is:

$L(\theta)=p(Y \mid X;\theta)$

$=\prod_{i=1}^{m}p(y^{(i)} \mid x^{(i)};\theta)$

$=\prod_{i=1}^{m}(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}}$
It is easier to maximize the log likelihood:

$l(\theta) = \log L(\theta)$

$=\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)}) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$
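The log likelihood is a plain sum over the training set, so it is easy to evaluate directly. The sketch below assumes each $x^{(i)}$ is a list whose first entry is a constant 1 (an intercept term); the toy data is made up for illustration and does not come from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, X, Y):
    """l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    total = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return total

# Made-up toy data: x_0 = 1 is the intercept feature.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
Y = [0, 1, 0]

baseline = log_likelihood([0.0, 0.0], X, Y)  # every h is 0.5 here
improved = log_likelihood([0.0, 1.0], X, Y)
```

With all parameters at zero every prediction is 0.5, so the value is m times log 0.5; parameters that fit the labels better give a larger (less negative) value.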
To maximize the likelihood, we can use gradient ascent:

$\theta := \theta + \alpha\nabla_{\theta} l(\theta)$

For a single training example (x, y), the partial derivative is:

$\frac{\partial l(\theta)}{\partial \theta_j}=\left(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right)\frac{\partial}{\partial \theta_j}g(\theta^Tx)$

$=\left(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right)g(\theta^Tx)(1-g(\theta^Tx))\frac{\partial}{\partial \theta_j}\theta^Tx$

$=\left(y(1-g(\theta^Tx))-(1-y)g(\theta^Tx)\right)x_j$

$=(y-h_{\theta}(x))x_j$
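The result $(y-h_{\theta}(x))x_j$ can be checked against a finite-difference gradient of the single-example log likelihood. This is only a sketch: the function names, the example values of `theta`, `x`, and `y`, and the step size are all chosen arbitrarily for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik_single(theta, x, y):
    """Log likelihood of one example: y log h + (1 - y) log(1 - h)."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return y * math.log(h) + (1 - y) * math.log(1.0 - h)

def gradient_single(theta, x, y):
    """Analytic gradient: component j is (y - h_theta(x)) * x_j."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - h) * xj for xj in x]

def numerical_gradient(theta, x, y, eps=1e-6):
    """Central finite difference in each coordinate of theta."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((log_lik_single(up, x, y) - log_lik_single(down, x, y)) / (2.0 * eps))
    return grad

theta, x, y = [0.3, -0.7, 1.2], [1.0, 2.0, -0.5], 1
analytic = gradient_single(theta, x, y)
numeric = numerical_gradient(theta, x, y)
```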
This therefore gives us the stochastic gradient ascent rule:

$\theta_j := \theta_j + \alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}$
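A minimal sketch of the stochastic gradient ascent rule, updating $\theta$ after every single example. The learning rate, number of epochs, random shuffling, and the toy data set are assumptions made for this illustration; the notes do not prescribe them.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd_logistic(X, Y, alpha=0.1, epochs=200, seed=0):
    """Apply theta_j := theta_j + alpha * (y_i - h(x_i)) * x_ij, one example at a time."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order each epoch
        for i in order:
            h = sigmoid(sum(t * xj for t, xj in zip(theta, X[i])))
            error = Y[i] - h
            theta = [t + alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta

# Made-up toy data: x_0 = 1 is the intercept; the label is 1 when x_1 is positive.
X = [[1.0, -1.8], [1.0, -1.1], [1.0, -0.6], [1.0, 0.6], [1.0, 1.1], [1.0, 1.8]]
Y = [0, 0, 0, 1, 1, 1]

theta = sgd_logistic(X, Y)
predictions = [1 if sigmoid(sum(t * xj for t, xj in zip(theta, x))) >= 0.5 else 0 for x in X]
```

The update has the same form as the LMS rule for linear regression, but $h_{\theta}(x)$ is now a nonlinear function of $\theta^Tx$, so it is a different algorithm.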
1.3 Digression: The perceptron learning algorithm
Consider modifying the logistic regression method to “force” it to output values that are exactly 0 or 1. To do so, it seems natural to change the definition of g to be the threshold function: