PRML-3

given a training data set comprising N observations {x_n} together with corresponding target values {t_n}, the goal is to predict the value of t for a new value of x. from a probabilistic perspective, we aim to model the predictive distribution p(t|x), because this expresses our uncertainty about the value of t for each value of x, and then to minimize the expected value of a suitably chosen loss function (such as a squared loss).

linear basis function models

a simple linear regression model is:

y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D

this is a linear function of the parameters w, but it is also a linear function of the input variables x_i --> limitation!

extend the above model by considering linear combinations of fixed nonlinear functions of the input variables, of the form:

y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)

-----> y(x, w) = w^T \phi(x) (with the convention \phi_0(x) = 1)

where \phi_j(x) are known as basis functions and w_0 is called the bias parameter. so the prediction is basically a weighted combination of M-1 nonlinear functions, and each nonlinear function applies an operation to the input.

why use nonlinear basis functions? --> making y(x, w) a nonlinear function of the input vector x gains more flexibility, BUT note that it is still a linear function of the parameters w, which is why y(x, w) is still called a linear model.
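as a concrete illustration (my own sketch, not from the book), here is a minimal numpy evaluation of y(x, w) = w^T \phi(x) for a scalar input, using a simple polynomial basis as the fixed nonlinear functions:

```python
import numpy as np

def poly_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1; phi_0(x) = 1 plays the role of the bias term
    return np.array([x**j for j in range(M)])

def predict(x, w):
    # y(x, w) = w^T phi(x): nonlinear in x, but linear in the parameters w
    return w @ poly_basis(x, len(w))

w = np.array([0.5, -1.0, 2.0])   # [w0, w1, w2]
print(predict(1.5, w))           # 0.5 - 1.0*1.5 + 2.0*1.5**2 = 3.5
```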

some common basis functions:

  • Gaussian basis function: \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right), where \mu_j governs the location and s the spatial scale.

  • sigmoidal basis function: \phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right), where the logistic sigmoid function is defined as \sigma(a) = \frac{1}{1 + \exp(-a)}

  • tanh function: \tanh(a) = 2\sigma(2a) - 1 (note: because \tanh(a) is related to the logistic sigmoid in this way, a general linear combination of logistic sigmoid functions is equivalent to a general linear combination of tanh functions)

  • Fourier basis: an expansion in sinusoidal functions. if it is of interest to consider basis functions that are localized in both space and frequency, this leads to a class of functions known as wavelets.
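a quick numpy sketch (mine, not from the book) of the Gaussian and sigmoidal basis functions above, assuming a set of centres \mu_j and a shared scale s:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x - mu)**2 / (2 * s**2))

def logistic_sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def sigmoidal_basis(x, mu, s):
    # phi_j(x) = sigma((x - mu_j) / s)
    return logistic_sigmoid((x - mu) / s)

mu = np.linspace(-1, 1, 5)   # five basis-function centres
print(gaussian_basis(0.3, mu, s=0.4))
print(sigmoidal_basis(0.3, mu, s=0.4))
```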

maximum likelihood and least squares:

(single) target variable: t; (single) input variable: x; find function: y(x,w). 

assume the target is given by a deterministic function with additive Gaussian noise, t = y(x, w) + \epsilon, so that p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1}), where \beta is the noise precision. if a squared loss function is assumed, then the optimal prediction, for a new value of x, will be given by the conditional mean of the target variable:

E[t | x] = \int t \, p(t | x) \, dt = y(x, w)

extend the above conclusion to N inputs and N target values:

input variables: X = \{x_1, \ldots, x_N\}

target values: t = \{t_1, \ldots, t_N\}

then the likelihood function is (note that the nonlinear basis functions enter here):

p(t | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})

by taking the logarithm,

\ln p(t | w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w), \quad E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T \phi(x_n)\}^2

and setting the gradient with respect to w to zero, the ML solution is:

w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t -----> the normal equations

  \Phi: this is an N x M matrix (the design matrix), and each ROW, not COLUMN, contains the basis functions evaluated at one input: \Phi_{nj} = \phi_j(x_n).

\Phi^\dagger \equiv (\Phi^T \Phi)^{-1} \Phi^T is known as the Moore-Penrose pseudo-inverse of the matrix \Phi. it can be regarded as a generalization of the notion of matrix inverse to non-square matrices (for a square, invertible matrix it simplifies to \Phi^{-1}).

similarly, the precision parameter is obtained under the ML condition:

\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{t_n - w_{ML}^T \phi(x_n)\}^2
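a minimal numpy sketch of the closed-form ML fit; the data and the polynomial basis are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + rng.normal(0, 0.2, size=50)   # noisy targets

# design matrix: row n holds phi_0(x_n), ..., phi_{M-1}(x_n) (polynomial basis)
M = 4
Phi = np.vander(x, M, increasing=True)                # shape (N, M)

w_ml = np.linalg.pinv(Phi) @ t                        # Moore-Penrose pseudo-inverse
beta_ml_inv = np.mean((t - Phi @ w_ml)**2)            # 1 / beta_ML
print(w_ml, beta_ml_inv)
```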

sequential learning:

sequential algorithm <==> on-line algorithm

the sequential learning algorithm here is called the LMS (least-mean-squares) algorithm. it is obtained by using a technique called stochastic gradient descent, also known as sequential gradient descent:

if the error function comprises a sum over data points, E = \sum_n E_n, then after presentation of pattern n, the stochastic gradient descent algorithm updates the parameter vector w using:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

where \tau denotes the iteration number and \eta is the learning rate.

for the sum-of-squares error function, the learning rule becomes:

w^{(\tau+1)} = w^{(\tau)} + \eta (t_n - w^{(\tau)T} \phi(x_n)) \phi(x_n)

this is known as the LMS algorithm. the value of \eta needs to be chosen with care to make sure the algorithm converges.
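a sketch of the LMS loop on the same kind of made-up data as above; the learning rate \eta = 0.05 is picked by hand, not tuned:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + rng.normal(0, 0.2, size=50)
Phi = np.vander(x, 4, increasing=True)        # same design matrix as above

w = np.zeros(Phi.shape[1])
eta = 0.05                                    # too large a value diverges
for _ in range(200):                          # repeated passes over the data
    for phi_n, t_n in zip(Phi, t):
        w += eta * (t_n - w @ phi_n) * phi_n  # LMS update after pattern n
print(w)                                      # approaches the batch w_ML
```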

regularized least squares:

when a regularization term is introduced to control over-fitting, the error function becomes:

E(w) = E_D(w) + \lambda E_W(w)

\lambda here is the regularization coefficient that controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w).

one of the simplest forms of regularizer is the sum of squares of the weight vector elements:

E_W(w) = \frac{1}{2} w^T w

so the total error function becomes:

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T \phi(x_n)\}^2 + \frac{\lambda}{2} w^T w

(this particular choice of regularizer is known in the machine learning literature as weight decay because, in sequential learning algorithms, it encourages weight values to decay towards zero unless supported by the data.)

solving for w by setting the gradient of the above error function to zero:

w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t
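the closed-form solution above is a one-liner in numpy; ridge_fit is my own name for this sketch, and it can be applied to the Phi and t generated earlier:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

using np.linalg.solve rather than forming the inverse explicitly is the numerically safer way to evaluate this expression.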

when a more general regularizer is used, the error function is:

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T \phi(x_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

(q = 2 recovers the quadratic regularizer above; q = 1 is known as the lasso and tends to produce sparse solutions.)

minimizing this error is equivalent to minimizing the unregularized sum-of-squares error E_D(w) subject to the constraint \sum_{j=1}^{M} |w_j|^q \le \eta for an appropriate value of the parameter \eta (for example, use Lagrange multipliers to show this, as sketched below).
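a sketch of the Lagrange-multiplier argument (standard, not spelled out in the original notes):

```latex
% constrained form:  min_w E_D(w)   subject to   \sum_j |w_j|^q \le \eta
% Lagrangian with multiplier \lambda/2 \ge 0:
L(w, \lambda) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - w^T\phi(x_n)\}^2
              + \frac{\lambda}{2}\Big(\sum_{j=1}^{M}|w_j|^q - \eta\Big)
% for fixed \lambda, minimizing L over w is the same as minimizing the
% regularized error function (they differ only by the constant -\lambda\eta/2),
% so the two problems share the same stationary points in w.
```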

multiple outputs:

with K targets, a more common approach is to use the same set of basis functions to model all of the components of the target vector:

y(x, W) = W^T \phi(x)

where W is an M x K parameter matrix.

then the conditional distribution of the target vector is an isotropic Gaussian:

p(t | x, W, \beta) = \mathcal{N}(t | W^T \phi(x), \beta^{-1} I)

maximizing the likelihood decouples over the K target dimensions and gives a solution of the same form as before:

W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T

where T is the N x K matrix whose n-th row is t_n^T.
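since the solution decouples, the single-output code carries over unchanged; a sketch with made-up two-dimensional targets:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
T = np.column_stack([np.sin(np.pi * x), np.cos(np.pi * x)])  # N x K targets (K = 2)
T += rng.normal(0, 0.2, size=T.shape)

Phi = np.vander(x, 4, increasing=True)   # N x M design matrix, shared by all targets
W_ml = np.linalg.pinv(Phi) @ T           # M x K: each column solves one regression
print(W_ml.shape)                        # (4, 2)
```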
