Machine Learning Notes 1 - Supervised Learning

1.1 Generalized Linear Models
Both of the methods discussed previously are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems.
The exponential family
We say that a class of distributions is in the exponential family if it can be written in the form :

p(y; η) = b(y) exp(η^T T(y) − a(η))

η is called the natural parameter(also called the canonical parameter);
T(y) is the sufficient statistic ;
a(η) is the log partition function;
The quantity e^(−a(η)) essentially plays the role of a normalization constant, making sure that the distribution p(y; η) sums/integrates over y to 1.
A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η ; as we vary η , we then get different distributions within this family.
We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions:
Bernoulli(ϕ) :
the Bernoulli distribution with mean ϕ:
p(y = 1; ϕ) = ϕ;  p(y = 0; ϕ) = 1 − ϕ
We write the Bernoulli distribution as:
p(y; ϕ) = ϕ^y (1 − ϕ)^(1 − y)

= exp(y log ϕ + (1 − y) log(1 − ϕ))

= exp((log(ϕ/(1 − ϕ))) · y + log(1 − ϕ))

Thus:

η = log(ϕ/(1 − ϕ)), which inverts to give ϕ = 1/(1 + e^(−η))

b(y) = 1

T(y) = y

a(η) = −log(1 − ϕ) = log(1 + e^η)
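As a quick sanity check (not part of the original notes), a minimal Python sketch can confirm numerically that this exponential-family form reproduces the Bernoulli probabilities; the function names here are introduced for illustration:

```python
import math

# Sketch: check that p(y; phi) = b(y) * exp(eta * T(y) - a(eta))
# matches the Bernoulli pmf, with eta = log(phi / (1 - phi)),
# T(y) = y, a(eta) = log(1 + e^eta), and b(y) = 1.
def bernoulli_pmf(y, phi):
    return phi**y * (1 - phi)**(1 - y)

def bernoulli_exp_family(y, phi):
    eta = math.log(phi / (1 - phi))    # natural parameter
    a = math.log(1 + math.exp(eta))    # log partition function
    return 1 * math.exp(eta * y - a)   # b(y) = 1, T(y) = y

for phi in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert abs(bernoulli_pmf(y, phi) - bernoulli_exp_family(y, phi)) < 1e-12
```

The two expressions agree for every ϕ ∈ (0, 1) and y ∈ {0, 1}, as the algebra above predicts.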

Gaussian distribution
Recall that, when deriving linear regression, the value of σ² had no effect on our final choice of θ and hθ(x). Thus, we can choose an arbitrary value for σ² without changing anything. To simplify the derivation below, let's set σ² = 1. We then have:
p(y; μ) = (1/√(2π)) exp(−(1/2)(y − μ)²)

= (1/√(2π)) exp(−(1/2) y²) · exp(μy − (1/2) μ²)

Thus, we see that the Gaussian is in the exponential family, with:
η = μ

T(y) = y

a(η) = μ²/2 = η²/2

b(y) = (1/√(2π)) exp(−y²/2)
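The same numerical check works for the unit-variance Gaussian (a sketch, not from the original notes; function names are illustrative):

```python
import math

# Sketch: check that b(y) * exp(eta * y - a(eta)) matches the
# N(mu, 1) density, with eta = mu, a(eta) = eta^2 / 2, and
# b(y) = (1 / sqrt(2*pi)) * exp(-y^2 / 2).
def gaussian_pdf(y, mu):
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * (y - mu)**2)

def gaussian_exp_family(y, mu):
    eta = mu
    a = eta**2 / 2
    b = (1 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * y**2)
    return b * math.exp(eta * y - a)

for mu in (-1.0, 0.0, 2.5):
    for y in (-0.3, 0.0, 1.7):
        assert abs(gaussian_pdf(y, mu) - gaussian_exp_family(y, mu)) < 1e-12
```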

There are many other distributions that are members of the exponential family: the multinomial (which we'll see later); the Poisson (for modelling count data; also see the problem set); the gamma and the exponential (for modelling continuous, non-negative random variables, such as time intervals); the beta and the Dirichlet (for distributions over probabilities); and many more. In the next section, we will describe a general "recipe" for constructing models in which y (given x and θ) comes from any of these distributions.
Constructing GLMs
More generally, consider a classification or regression problem where we would like to predict the value of some random variable y as a function of x. To derive a GLM for this problem, we will make the following three assumptions about the conditional distribution of y given x and about our model:
1. y|x;θ ∼ ExponentialFamily( η ). I.e., given x and θ, the distribution of y follows some exponential family distribution, with parameter η .
2. Given x, our goal is to predict the expected value of T(y) given x. In most of our examples, we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E[y|x]. (Note that this assumption is satisfied in the choices for hθ(x) for both logistic regression and linear regression. For instance, in logistic regression, we had hθ(x) = p(y = 1|x; θ) = 0 · p(y = 0|x; θ) + 1 · p(y = 1|x; θ) = E[y|x; θ].)
3. The natural parameter η and the inputs x are related linearly: η = θ^T x. (Or, if η is vector-valued, then η_i = θ_i^T x.)
Ordinary Least Squares
To show that ordinary least squares is a special case of the GLM family of models, consider the setting where the target variable y (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of y given x as a Gaussian N(μ, σ²). (Here, μ may depend on x.) So, we let the ExponentialFamily(η) distribution above be the Gaussian distribution. As we saw previously, in the formulation of the Gaussian as an exponential family distribution, we had μ = η. So, we have:
hθ(x) = E[y|x; θ]

= μ

= η

= θ^T x
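The resulting hypothesis is just a linear function of x. A minimal sketch (with hypothetical θ and x chosen for illustration):

```python
# Sketch: for the Gaussian GLM the canonical response function is the
# identity, so the hypothesis is simply h(x) = theta^T x.
def h_ols(theta, x):
    # eta = theta^T x, and for the Gaussian, mu = eta = E[y|x]
    return sum(t_j * x_j for t_j, x_j in zip(theta, x))

theta = [0.5, -2.0]   # hypothetical parameters
x = [1.0, 3.0]        # x[0] = 1 acts as the intercept term
assert h_ols(theta, x) == 0.5 * 1.0 + (-2.0) * 3.0
```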

Logistic Regression
We now consider logistic regression. Here we are interested in binary classification, so y ∈ {0, 1}. Given that y is binary-valued, it therefore seems natural to choose the Bernoulli family of distributions to model the conditional distribution of y given x. In our formulation of the Bernoulli distribution as an exponential family distribution, we had ϕ = 1/(1 + e^(−η)). Furthermore, note that if y|x; θ ∼ Bernoulli(ϕ), then E[y|x; θ] = ϕ. So, following a similar derivation as the one for ordinary least squares, we get:
hθ(x) = E[y|x; θ]

= ϕ

= 1/(1 + e^(−η))

= 1/(1 + e^(−θ^T x))

So, this gives us hypothesis functions of the form hθ(x) = 1/(1 + e^(−θ^T x)). If you were previously wondering how we came up with the form of the logistic function 1/(1 + e^(−z)), this gives one answer: once we assume that y conditioned on x is Bernoulli, it arises as a consequence of the definition of GLMs and exponential family distributions. To introduce a little more terminology, the function g giving the distribution's mean as a function of the natural parameter (g(η) = E[T(y); η]) is called the canonical response function. Its inverse, g⁻¹, is called the canonical link function. Thus, the canonical response function for the Gaussian family is just the identity function, and the canonical response function for the Bernoulli is the logistic function.
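A small sketch (not from the original notes; names are illustrative) can make the response/link pairing for the Bernoulli case concrete — the logistic function and the log-odds are inverses of each other, and the hypothesis always outputs a valid probability:

```python
import math

# Sketch: canonical response function g and canonical link g^{-1}
# for the Bernoulli family.
def g(eta):
    # canonical response: natural parameter -> mean, the logistic function
    return 1 / (1 + math.exp(-eta))

def g_inv(phi):
    # canonical link: mean -> natural parameter, the log-odds
    return math.log(phi / (1 - phi))

def h(theta, x):
    # logistic-regression hypothesis: h_theta(x) = g(theta^T x)
    return g(sum(t_j * x_j for t_j, x_j in zip(theta, x)))

assert abs(g_inv(g(1.7)) - 1.7) < 1e-12     # g and g^{-1} are inverses
assert 0 < h([0.5, -2.0], [1.0, 3.0]) < 1   # output is a valid probability
```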
