Machine Learning Notes 2 - Supervised Learning

1.1 Probabilistic interpretation
Why might the least-squares cost function J be a reasonable choice?
Let us assume that the target variables and the inputs are related via the equation:

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise.
Let us further assume that the $\epsilon^{(i)}$ are distributed IID (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$:
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(\epsilon^{(i)}\right)^2}{2\sigma^2}\right)$$

This implies that the density of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$, is:
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$

likelihood function:
$$
\begin{aligned}
L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)
&= \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \\
&= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)
\end{aligned}
$$

Rather than maximizing $L(\theta)$ directly, it is simpler to maximize the log likelihood $\ell(\theta)$:
$$
\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) \\
&= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
\end{aligned}
$$

Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing $\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$, which is exactly the least-squares cost function $J(\theta)$.
To summarize: under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$.
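
As a quick sanity check of this equivalence, here is a small numerical sketch (my own, not from the notes): it fits $\theta$ once via the least-squares normal equation and once by maximizing the Gaussian log likelihood numerically, and the two estimates agree. The synthetic data, the fixed $\sigma$, and the use of scipy.optimize.minimize are all assumptions of this sketch.

```python
# Minimal sketch (assumed setup): maximum likelihood under Gaussian noise
# vs. the least-squares normal equation give the same theta.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.uniform(-3, 3, m)])  # design matrix with intercept
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, m)      # y = theta^T x + Gaussian noise

# (a) least squares: solve the normal equation X^T X theta = X^T y
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# (b) maximize the log likelihood, i.e. minimize its negative
def neg_log_likelihood(theta, sigma=0.5):
    resid = y - X @ theta
    return m * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print(theta_ls, theta_mle)  # the two estimates agree up to optimizer tolerance
```

Note also that $\sigma^2$ drops out of the arg max, which is why the maximum likelihood estimate of $\theta$ does not depend on it.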
1.2 Locally weighted linear regression
The leftmost figure below shows the result of fitting $y = \theta_0 + \theta_1 x$ to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good.
[Figure: the same dataset fit with a straight line (left), a quadratic (middle), and a 5th-order polynomial (right)]
Instead, if we had added an extra feature $x^2$, and fit $y = \theta_0 + \theta_1 x + \theta_2 x^2$, then we obtain a slightly better fit to the data (see the middle figure).
It might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a 5th-order polynomial $y = \sum_{j=0}^{5} \theta_j x^j$. We'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.
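
The pattern described above can be reproduced in a few lines of numpy; the toy dataset below is my own invention (roughly quadratic with noise), not the dataset from the figures.

```python
# Sketch: training error always drops as the polynomial degree grows,
# but a high-degree fit mostly chases the noise (overfitting).
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 7))
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, 0.05, x.size)

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)             # fit y = sum_j theta_j x^j
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(degree, mse)                            # degree 5 nearly interpolates the 7 points
```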
As discussed previously, and as shown in the example above, the choice of features is important to ensuring good performance of a learning algorithm. (When we talk about model selection, we'll also see algorithms for automatically choosing a good set of features.) In this section, let us briefly talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical.
In the original linear regression algorithm, to make a prediction at a query point x, we would:
1. Fit $\theta$ to minimize $\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$.
2. Output $\theta^T x$.
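
As a concrete reference point, here is a minimal sketch of these two steps using the normal equation; the function name and the assumption that X already contains an intercept column are mine.

```python
# Minimal sketch of (unweighted) linear regression prediction.
# X: (m, n) design matrix, y: (m,) targets, x_query: (n,) query point.
import numpy as np

def linear_regression_predict(X, y, x_query):
    # 1. Fit theta to minimize sum_i (y_i - theta^T x_i)^2  (normal equation)
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    # 2. Output theta^T x at the query point
    return x_query @ theta
```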
In contrast, the locally weighted linear regression algorithm does the following:
1. Fit $\theta$ to minimize $\sum_{i=1}^{m}\omega^{(i)}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$.
2. Output $\theta^T x$.
A fairly standard choice for the weights is:
$$\omega^{(i)} = \exp\left(-\frac{\left(x^{(i)} - x\right)^2}{2\tau^2}\right)$$

If $|x^{(i)} - x|$ is small, then $\omega^{(i)}$ is close to 1; and if $|x^{(i)} - x|$ is large, then $\omega^{(i)}$ is small. Hence, $\theta$ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point x.
The parameter $\tau$ controls how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point x; $\tau$ is called the bandwidth parameter.
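
Below is a minimal sketch of LWR prediction under this weight choice, solving the weighted normal equation $X^T W X \theta = X^T W \vec{y}$; it is written for vector-valued inputs (so the squared difference becomes a squared Euclidean distance), and the function name and default $\tau$ are my own choices.

```python
# Minimal sketch of locally weighted linear regression at a single query point.
# X: (m, n) design matrix, y: (m,) targets, x_query: (n,) query point.
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    # Gaussian weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Fit theta to minimize the weighted squared errors (weighted normal equation)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    # Output theta^T x at the query point
    return x_query @ theta
```

Note that $\theta$ is re-fit from scratch for every query point, which is why the entire training set has to be kept around; this is exactly the non-parametric behavior discussed next.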
Locally weighted linear regression is a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$), which are fit to the data. Once we've fit the $\theta_i$ and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis h grows linearly with the size of the training set.
