PRML-3

given a training data set comprising N observations {x_n} together with corresponding target values {t_n}, the goal is to predict the value of t for a new value of x. from a probabilistic perspective, we aim to model the predictive distribution p(t|x), because this expresses our uncertainty about the value of t for each value of x, and then to minimize the expected value of a suitably chosen loss function (such as a squared loss).

linear basis function models

a simple linear regression model is:

y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D

this is a linear function of the parameters w, but it is also a linear function of the input variables x_i --> limitation!

extend the above model by considering linear combinations of fixed nonlinear functions of the input variables, of the form:

y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)

-----> y(x, w) = w^T \phi(x) (with the convention \phi_0(x) = 1)

where \phi_j(x) are known as basis functions and w_0 is called the bias parameter. so the prediction is basically a weighted combination of M-1 nonlinear functions, and each nonlinear function applies an operation to the input.

why use nonlinear basis functions? --> making y(x, w) a nonlinear function of the input vector x gains more flexibility, BUT note that it is still a linear function of the parameters w, which is why y(x, w) is still called a linear model.
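as a concrete illustration (my own sketch, not from the book), here is a minimal numpy evaluation of y(x, w) = w^T \phi(x) for a scalar input, using a simple polynomial basis as the fixed nonlinear functions:

```python
import numpy as np

def poly_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1; phi_0(x) = 1 plays the role of the bias term
    return np.array([x**j for j in range(M)])

def predict(x, w):
    # y(x, w) = w^T phi(x): nonlinear in x, but linear in the parameters w
    return w @ poly_basis(x, len(w))

w = np.array([0.5, -1.0, 2.0])   # [w0, w1, w2]
print(predict(1.5, w))           # 0.5 - 1.0*1.5 + 2.0*1.5**2 = 3.5
```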

some common basis functions:

  • Gaussian basis function: \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right), where \mu_j governs the location and s the spatial scale.

  • sigmoidal basis function: \phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right), where the logistic sigmoid function is defined as \sigma(a) = \frac{1}{1 + \exp(-a)}

  • tanh function: \tanh(a) = 2\sigma(2a) - 1 (note: because \tanh(a) is related to the logistic sigmoid in this way, a general linear combination of logistic sigmoid functions is equivalent to a general linear combination of tanh functions)

  • Fourier basis: an expansion in sinusoidal functions. if it is of interest to consider basis functions that are localized in both space and frequency, this leads to a class of functions known as wavelets.
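a quick numpy sketch (mine, not from the book) of the Gaussian and sigmoidal basis functions above, assuming a set of centres \mu_j and a shared scale s:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x - mu)**2 / (2 * s**2))

def logistic_sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def sigmoidal_basis(x, mu, s):
    # phi_j(x) = sigma((x - mu_j) / s)
    return logistic_sigmoid((x - mu) / s)

mu = np.linspace(-1, 1, 5)   # five basis-function centres
print(gaussian_basis(0.3, mu, s=0.4))
print(sigmoidal_basis(0.3, mu, s=0.4))
```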

maximum likelihood and least squares:

(single) target variable: t; (single) input variable: x; find function: y(x,w). 

assume the target is given by a deterministic function with additive Gaussian noise, t = y(x, w) + \epsilon, so that p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1}), where \beta is the noise precision. if a squared loss function is assumed, then the optimal prediction, for a new value of x, will be given by the conditional mean of the target variable:

E[t | x] = \int t \, p(t | x) \, dt = y(x, w)

extend the above conclusion to N inputs and N target values:

input variables: X = \{x_1, \ldots, x_N\}

target values: t = \{t_1, \ldots, t_N\}

then the likelihood function is (note that the nonlinear basis functions enter here):

p(t | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})

by taking the logarithm,

\ln p(t | w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w), \quad E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T \phi(x_n)\}^2

and setting the gradient with respect to w to zero, the ML solution is:

w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t -----> the normal equations

  \Phi: this is an N x M matrix (the design matrix), and each ROW, not COLUMN, contains the basis functions evaluated at one input: \Phi_{nj} = \phi_j(x_n).

\Phi^\dagger \equiv (\Phi^T \Phi)^{-1} \Phi^T is known as the Moore-Penrose pseudo-inverse of the matrix \Phi. it can be regarded as a generalization of the notion of matrix inverse to non-square matrices (for a square, invertible matrix it simplifies to \Phi^{-1}).

similarly, the precision parameter is obtained under the ML condition:

\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{t_n - w_{ML}^T \phi(x_n)\}^2
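a minimal numpy sketch of the closed-form ML fit; the data and the polynomial basis are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + rng.normal(0, 0.2, size=50)   # noisy targets

# design matrix: row n holds phi_0(x_n), ..., phi_{M-1}(x_n) (polynomial basis)
M = 4
Phi = np.vander(x, M, increasing=True)                # shape (N, M)

w_ml = np.linalg.pinv(Phi) @ t                        # Moore-Penrose pseudo-inverse
beta_ml_inv = np.mean((t - Phi @ w_ml)**2)            # 1 / beta_ML
print(w_ml, beta_ml_inv)
```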

sequential learning:

sequential algorithm <==> on-line algorithm

the sequential learning algorithm here is called the LMS (least-mean-squares) algorithm. it is obtained by using a technique called stochastic gradient descent, also known as sequential gradient descent:

if the error function comprises a sum over data points, E = \sum_n E_n, then after presentation of pattern n, the stochastic gradient descent algorithm updates the parameter vector w using:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

where \tau denotes the iteration number and \eta is the learning rate.

for the sum-of-squares error function, the learning rule becomes:

w^{(\tau+1)} = w^{(\tau)} + \eta (t_n - w^{(\tau)T} \phi(x_n)) \phi(x_n)

this is known as the LMS algorithm. the value of \eta needs to be chosen with care to make sure the algorithm converges.
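a sketch of the LMS loop on the same kind of made-up data as above; the learning rate \eta = 0.05 is picked by hand, not tuned:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + rng.normal(0, 0.2, size=50)
Phi = np.vander(x, 4, increasing=True)        # same design matrix as above

w = np.zeros(Phi.shape[1])
eta = 0.05                                    # too large a value diverges
for _ in range(200):                          # repeated passes over the data
    for phi_n, t_n in zip(Phi, t):
        w += eta * (t_n - w @ phi_n) * phi_n  # LMS update after pattern n
print(w)                                      # approaches the batch w_ML
```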

regularized least squares:

when a regularization term is introduced to control over-fitting, the error function becomes:

E(w) = E_D(w) + \lambda E_W(w)

\lambda here is the regularization coefficient that controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w).

one of the simplest forms of regularizer is the sum of squares of the weight vector elements:

E_W(w) = \frac{1}{2} w^T w

so the total error function becomes:

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T \phi(x_n)\}^2 + \frac{\lambda}{2} w^T w

(this particular choice of regularizer is known in the machine learning literature as weight decay because, in sequential learning algorithms, it encourages weight values to decay towards zero unless supported by the data.)

solving for w by setting the gradient of the above error function to zero:

w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t
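the closed-form solution above is a one-liner in numpy; ridge_fit is my own name for this sketch, and it can be applied to the Phi and t generated earlier:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

using np.linalg.solve rather than forming the inverse explicitly is the numerically safer way to evaluate this expression.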

when a more general regularizer is used, the error function is:

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - w^T \phi(x_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

(q = 2 recovers the quadratic regularizer above; q = 1 is known as the lasso and tends to produce sparse solutions.)

minimizing this error is equivalent to minimizing the unregularized sum-of-squares error E_D(w) subject to the constraint \sum_{j=1}^{M} |w_j|^q \le \eta for an appropriate value of the parameter \eta (for example, use Lagrange multipliers to show this, as sketched below).
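a sketch of the Lagrange-multiplier argument (standard, not spelled out in the original notes):

```latex
% constrained form:  min_w E_D(w)   subject to   \sum_j |w_j|^q \le \eta
% Lagrangian with multiplier \lambda/2 \ge 0:
L(w, \lambda) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - w^T\phi(x_n)\}^2
              + \frac{\lambda}{2}\Big(\sum_{j=1}^{M}|w_j|^q - \eta\Big)
% for fixed \lambda, minimizing L over w is the same as minimizing the
% regularized error function (they differ only by the constant -\lambda\eta/2),
% so the two problems share the same stationary points in w.
```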

multiple outputs:

with K targets, a more common approach is to use the same set of basis functions to model all of the components of the target vector:

y(x, W) = W^T \phi(x)

where W is an M x K parameter matrix.

then the conditional distribution of the target vector is an isotropic Gaussian:

p(t | x, W, \beta) = \mathcal{N}(t | W^T \phi(x), \beta^{-1} I)

maximizing the likelihood decouples over the K target dimensions and gives a solution of the same form as before:

W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T

where T is the N x K matrix whose n-th row is t_n^T.
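since the solution decouples, the single-output code carries over unchanged; a sketch with made-up two-dimensional targets:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
T = np.column_stack([np.sin(np.pi * x), np.cos(np.pi * x)])  # N x K targets (K = 2)
T += rng.normal(0, 0.2, size=T.shape)

Phi = np.vander(x, 4, increasing=True)   # N x M design matrix, shared by all targets
W_ml = np.linalg.pinv(Phi) @ T           # M x K: each column solves one regression
print(W_ml.shape)                        # (4, 2)
```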
