Machine Learning Notes 2 - Supervised Learning

1.1 Probabilistic interpretation
Why might the least-squares cost function J be a reasonable choice?
Let us assume that the target variables and the inputs are related via the equation:

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise.
Let us further assume that the $\epsilon^{(i)}$ are distributed IID (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$:
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(\epsilon^{(i)}\right)^2}{2\sigma^2}\right)$$

This implies that the density of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$, is:
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$

likelihood function:
$$
\begin{aligned}
L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)
&= \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)
\end{aligned}
$$

Rather than maximizing $L(\theta)$ directly, it is simpler to maximize the log likelihood $\ell(\theta)$:
$$
\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) \\
&= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
\end{aligned}
$$

Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing $\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$, which is exactly the least-squares cost function $J(\theta)$.
To summarize: under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$.
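
As a quick sanity check of this equivalence, here is a small numerical sketch (my own, not from the notes): it fits $\theta$ once via the least-squares normal equation and once by maximizing the Gaussian log likelihood numerically, and the two estimates agree. The synthetic data, the fixed $\sigma$, and the use of scipy.optimize.minimize are all assumptions of this sketch.

```python
# Minimal sketch (assumed setup): maximum likelihood under Gaussian noise
# vs. the least-squares normal equation give the same theta.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.uniform(-3, 3, m)])  # design matrix with intercept
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, m)      # y = theta^T x + Gaussian noise

# (a) least squares: solve the normal equation X^T X theta = X^T y
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# (b) maximize the log likelihood, i.e. minimize its negative
def neg_log_likelihood(theta, sigma=0.5):
    resid = y - X @ theta
    return m * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print(theta_ls, theta_mle)  # the two estimates agree up to optimizer tolerance
```

Note also that $\sigma^2$ drops out of the arg max, which is why the maximum likelihood estimate of $\theta$ does not depend on it.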
1.2 Locally weighted linear regression
The leftmost figure below shows the result of fitting $y = \theta_0 + \theta_1 x$ to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good.
[Figure: the same dataset fit with a straight line (left), a quadratic (middle), and a 5th-order polynomial (right)]
Instead, if we had added an extra feature $x^2$, and fit $y = \theta_0 + \theta_1 x + \theta_2 x^2$, then we obtain a slightly better fit to the data (see the middle figure).
It might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a 5th-order polynomial $y = \sum_{j=0}^{5} \theta_j x^j$. We'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.
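
The pattern described above can be reproduced in a few lines of numpy; the toy dataset below is my own invention (roughly quadratic with noise), not the dataset from the figures.

```python
# Sketch: training error always drops as the polynomial degree grows,
# but a high-degree fit mostly chases the noise (overfitting).
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 7))
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, 0.05, x.size)

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)             # fit y = sum_j theta_j x^j
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(degree, mse)                            # degree 5 nearly interpolates the 7 points
```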
As discussed previously, and as shown in the example above, the choice of features is important to ensuring good performance of a learning algorithm. (When we talk about model selection, we'll also see algorithms for automatically choosing a good set of features.) In this section, let us briefly talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical.
In the original linear regression algorithm, to make a prediction at a query point x, we would:
1. Fit $\theta$ to minimize $\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$.
2. Output $\theta^T x$.
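
As a concrete reference point, here is a minimal sketch of these two steps using the normal equation; the function name and the assumption that X already contains an intercept column are mine.

```python
# Minimal sketch of (unweighted) linear regression prediction.
# X: (m, n) design matrix, y: (m,) targets, x_query: (n,) query point.
import numpy as np

def linear_regression_predict(X, y, x_query):
    # 1. Fit theta to minimize sum_i (y_i - theta^T x_i)^2  (normal equation)
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    # 2. Output theta^T x at the query point
    return x_query @ theta
```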
In contrast, the locally weighted linear regression algorithm does the following:
1. Fit $\theta$ to minimize $\sum_{i=1}^{m}\omega^{(i)}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$.
2. Output $\theta^T x$.
A fairly standard choice for the weights is:
$$\omega^{(i)} = \exp\left(-\frac{\left(x^{(i)} - x\right)^2}{2\tau^2}\right)$$

If $|x^{(i)} - x|$ is small, then $\omega^{(i)}$ is close to 1; and if $|x^{(i)} - x|$ is large, then $\omega^{(i)}$ is small. Hence, $\theta$ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point x.
The parameter $\tau$ controls how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point x; $\tau$ is called the bandwidth parameter.
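
Below is a minimal sketch of LWR prediction under this weight choice, solving the weighted normal equation $X^T W X \theta = X^T W \vec{y}$; it is written for vector-valued inputs (so the squared difference becomes a squared Euclidean distance), and the function name and default $\tau$ are my own choices.

```python
# Minimal sketch of locally weighted linear regression at a single query point.
# X: (m, n) design matrix, y: (m,) targets, x_query: (n,) query point.
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    # Gaussian weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Fit theta to minimize the weighted squared errors (weighted normal equation)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    # Output theta^T x at the query point
    return x_query @ theta
```

Note that $\theta$ is re-fit from scratch for every query point, which is why the entire training set has to be kept around; this is exactly the non-parametric behavior discussed next.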
Locally weighted linear regression is a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$), which are fit to the data. Once we've fit the $\theta_i$ and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis h grows linearly with the size of the training set.
