Machine Learning Study Notes: PRML Chapter 1.1 Introduction

Chapter 1.1: Introduction

PRML, Oxford University Deep Learning Course, Machine Learning, Pattern Recognition
Christopher M. Bishop, PRML

1. Basic Terminology

  • a training set: $x_1, \dots, x_N$, where each $x_i$ is a $d$-dimensional column vector, i.e., $x_i \in \mathbb{R}^d$.
  • target vector: $t$, forming a pair $(x, t)$ for supervised learning.
  • generalization: The ability to categorize correctly new examples that differ from those used for training is known as generalization.
  • pre-processing stage, aka feature extraction. Why pre-process? Reasons: 1) the transformation can make the pattern recognition problem easier to solve; 2) pre-processing may also speed up computation, e.g., through dimensionality reduction.
  • reinforcement learning: is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. Typically there is a sequence of states and actions in which the learning algorithm is interacting with its environment. In many cases, the current action not only affects the immediate reward but also has an impact on the reward at all subsequent time steps. A general feature of reinforcement learning is the trade-off between exploration, in which the system tries out new kinds of actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward. Too strong a focus on either exploration or exploitation will yield poor results (a minimal sketch follows this list).
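
PRML does not develop reinforcement learning further, but a minimal sketch can make the exploration/exploitation trade-off concrete. Below is a hypothetical three-armed bandit with made-up reward probabilities, solved with an ε-greedy strategy (not anything from the book):

```python
import random

# Hypothetical 3-armed bandit: reward probabilities unknown to the agent.
TRUE_REWARD_PROB = [0.2, 0.5, 0.8]
EPSILON = 0.1  # fraction of steps spent exploring

counts = [0, 0, 0]        # pulls per arm
values = [0.0, 0.0, 0.0]  # running mean reward per arm

for step in range(1000):
    if random.random() < EPSILON:
        arm = random.randrange(3)                     # explore: try a random arm
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit: best arm so far
    reward = 1.0 if random.random() < TRUE_REWARD_PROB[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print(counts)  # with EPSILON = 0.1, most pulls go to the best arm (index 2)
```

Setting EPSILON = 0 (pure exploitation) can lock onto a suboptimal arm early, while EPSILON = 1 (pure exploration) never capitalizes on what it has learned.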

2. Different Applications:

  • 1) classification (supervised learning): from training data (x, t), learn a model y = y(x) where y takes values in a finite number of discrete categories;
  • 2) regression (supervised learning): from training data (x, t), learn a model y = y(x) where the output y consists of one or more continuous variables;
  • 3) unsupervised learning: training data x only, without the target vector t, including:
    • clustering, to discover groups of similar examples within the data;
    • density estimation, to determine the distribution of data within the input space;
    • visualization, to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

3. Linear supervised learning: Linear Prediction/Regression

3.1 Workflow:

Here the model is represented by parameters $\theta$; for an unseen input $x_{N+1}$, it makes a prediction $\hat{y}(x_{N+1})$.

[Figure: workflow of linear prediction, from fitting $\theta$ on the training set to predicting $\hat{y}(x_{N+1})$.]

3.2 Linear Prediction

The linear model: $\hat{y}(x) = x^T\theta$. Stacking the $N$ training inputs as rows of a design matrix $X \in \mathbb{R}^{N \times d}$ gives the vector of predictions $X\theta$.

3.3 Optimization approach

Error function J(θ)

$$J(\theta) = (\mathbf{y} - X\theta)^T(\mathbf{y} - X\theta) = \sum_{i=1}^{N}(y_i - x_i^T\theta)^2$$

Finding the solution by differentiation:

Note: matrix differentiation identities: $\frac{\partial A\theta}{\partial \theta} = A^T$ and $\frac{\partial\, \theta^T A \theta}{\partial \theta} = 2A^T\theta$ (the latter holds for symmetric $A$; here $A = X^T X$ is symmetric).

We get

$$\frac{\partial J(\theta)}{\partial \theta} = -2X^T\mathbf{y} + 2X^T X\theta = 0.$$

The optimal parameter: $\theta^* = (X^T X)^{-1} X^T \mathbf{y}$.
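
A minimal NumPy sketch of this closed-form solution on made-up data (using np.linalg.solve on the normal equations rather than forming the inverse explicitly, which is algebraically the same but numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
X = rng.normal(size=(N, d))                    # design matrix, one example per row
theta_true = np.array([1.0, -2.0, 0.5])        # made-up ground-truth parameters
y = X @ theta_true + 0.1 * rng.normal(size=N)  # targets with small noise

# Solve the normal equations X^T X theta = X^T y, i.e. theta* = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```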

4. A Regression Problem: Polynomial Curve Fitting

4.1 Training data:

Given a training data set comprising N observations of x, written $\mathbf{x} \equiv (x_1, \dots, x_N)^T$, together with corresponding observations of the values of t, denoted $\mathbf{t} \equiv (t_1, \dots, t_N)^T$.

4.2 Synthetically generated data:

[Figure 1.2: training points generated from sin(2πx) plus Gaussian noise, with the underlying function shown as a curve.]

Method:
each target is a function value y(x) (e.g., sin(2πx)) plus Gaussian noise.
The input data set x in Figure 1.2 was generated by choosing values of xn , for n=1,...,N , spaced uniformly in range [0, 1], and the target data set t was obtained by first computing the corresponding values of the function sin(2πx) and then adding a small level of random noise having a Gaussian distribution to each such point in order to obtain the corresponding value tn .
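
A minimal sketch of this generation procedure; N and the noise standard deviation are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0.0, 1.0, N)  # x_n spaced uniformly in [0, 1]
# t_n = sin(2*pi*x_n) plus a small level of Gaussian noise
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
```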

Discussion:
By generating data in this way, we are capturing a property of many real data sets, namely that they possess an underlying regularity, which we wish to learn, but that individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved.

4.3 Why called Linear Model?

1) polynomial function

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j \tag{1.1}$$

where M is the order of the polynomial, and xj denotes x raised to the power of j. The polynomial coefficients w0,...,wM are collectively denoted by the vector w .


Question: Why is this called a linear model or linear prediction? Why "linear"?
Answer: Although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w. Functions, such as the polynomial, which are linear in the unknown parameters have important properties; they are called linear models and will be discussed extensively in Chapters 3 and 4.


2) Error Function E(w)

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 \tag{1.2}$$
where the factor of 1/2 is included for later convenience. Since E(w) is a quadratic function of the coefficients w, minimizing it yields a unique optimal solution in closed form, denoted by w*.

  • Model Comparison or Model Selection: choosing the order M of the polynomial. A dilemma: large M causes over-fitting, while small M gives a rather poor fit to the training data (see the sketch after this list).
  • over-fitting: when M is large enough that the number of coefficients matches the number of data points, the polynomial passes exactly through each data point and E(w*) = 0. However, the fitted curve oscillates wildly and gives a very poor representation of the function sin(2πx). This behavior is known as over-fitting.
  • Model complexity: loosely, the number of free parameters in the model.
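
To make the dilemma concrete, here is a minimal sketch (not from the book) that fits polynomials of increasing order M to synthetic data as in Section 4.2 and compares the training error with the error on freshly generated test data; N, the noise level, and the orders tried are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

for M in (0, 1, 3, 9):
    w = np.polyfit(x, t, M)  # least-squares polynomial fit of order M
    rms_train = np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))
    rms_test = np.sqrt(np.mean((np.polyval(w, x_test) - t_test) ** 2))
    print(f"M={M}: train RMS {rms_train:.3f}, test RMS {rms_test:.3f}")
```

The training error falls monotonically with M (reaching roughly zero at M = 9, where the polynomial can interpolate all ten points), while the test error eventually rises: the signature of over-fitting.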

4.5 Bayesian perspective

Least Squares (i.e., Linear Regression) Estimate vs. Maximum Likelihood Estimate:

We shall see that the least squares approach (i.e., linear regression) to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the over-fitting problem can be avoided. We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.

How to formulate the likelihood for linear regression? (to be discussed in later sections.)
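
As a brief preview (this anticipates Section 1.2.5; β denotes the noise precision, following PRML): assume each target is the model output corrupted by zero-mean Gaussian noise, so that

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\big),$$

$$-\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi).$$

Maximizing this likelihood with respect to w is therefore equivalent to minimizing the sum-of-squares error E(w).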

4.6 Regularization, Regularizer: to control over-fitting

  • regularization: involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values.
  • form of regularizer: e.g., a quadratic regularizer, which gives ridge regression. In the context of neural networks, this approach is known as weight decay.

    $$\widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

The modified error function includes two terms:
  • the first term: the sum-of-squares error;
  • the second term: the regularizer, which has the desired effect of reducing the magnitude of the coefficients.
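
A minimal sketch of ridge regression under these definitions: setting the gradient of the modified error to zero gives the closed form $\mathbf{w} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$, where $\Phi$ is the polynomial design matrix with entries $\Phi_{nj} = x_n^j$. Here λ, M, and the data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 10, 9, 1e-3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

Phi = np.vander(x, M + 1, increasing=True)  # design matrix: Phi[n, j] = x[n]**j
# Closed-form ridge solution: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print(np.round(w, 2))  # coefficients stay modest compared with the unregularized fit
```

With λ = 0 this reduces to the unregularized least-squares fit, whose M = 9 coefficients grow very large; even a small λ keeps them in check.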


End of Chapter 1.1 Introduction
