Machine Learning Trial Exam

PCA

Given mean-centered data in 3D for which the covariance matrix is

$$\mathbf{C} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}.$$

Also given is a data transformation matrix

$$\mathbf{R} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & -\tfrac{\sqrt{3}}{2} \\ 0 & \tfrac{\sqrt{3}}{2} & \tfrac{1}{2} \end{pmatrix},$$

by which we can linearly transform every data vector $\mathbf{x}$ (taken as a column vector) to a new 3D column vector $\mathbf{z}$ through $\mathbf{z} = \mathbf{R}\mathbf{x}$. Note that $\mathbf{R}$ is actually a rotation matrix that rotates in the second and third coordinates, and that its inverse is

$$\mathbf{R}^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{\sqrt{3}}{2} \\ 0 & -\tfrac{\sqrt{3}}{2} & \tfrac{1}{2} \end{pmatrix}.$$

1. principal component

Q: What is the first principal component of the original data for which we have the covariance matrix $\mathbf{C}$?

A: The first principal component is the eigenvector of $\mathbf{C}$ with the largest eigenvalue (variance 4), i.e. $\begin{pmatrix} 0\\ 0 \\ 1 \end{pmatrix}$.
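As a quick numerical check (a minimal numpy sketch; the variable names are my own), the eigenvector belonging to the largest eigenvalue of $\mathbf{C}$ is indeed the third axis:

```python
import numpy as np

# Covariance matrix from the exercise
C = np.diag([1.0, 2.0, 4.0])

# np.linalg.eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)

print(eigvals[-1])     # 4.0 -> largest variance
print(eigvecs[:, -1])  # [0. 0. 1.] -> first principal component
```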

2. covariance of the transformed data

Q: Assume we transform all the data by the transformation matrix $\mathbf{R}$. What does the covariance matrix of the transformed data become?

A: Treating the mean-centered data as the columns of $\mathbf{X}$ (and ignoring the $1/n$ normalisation, which does not change the result),
$$\mathbf{C'} = \mathbf{RX}(\mathbf{RX})^T = \mathbf{R}\mathbf{X}\mathbf{X}^T\mathbf{R}^T = \mathbf{R}\mathbf{C}\mathbf{R}^T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{7}{2} & -\tfrac{\sqrt{3}}{2} \\ 0 & -\tfrac{\sqrt{3}}{2} & \tfrac{5}{2} \end{pmatrix}.$$
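The matrix product can be verified numerically; a small numpy sketch under the same definitions of $\mathbf{R}$ and $\mathbf{C}$:

```python
import numpy as np

C = np.diag([1.0, 2.0, 4.0])
s = np.sqrt(3) / 2
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5,  -s],
              [0.0,   s, 0.5]])

# Covariance of z = Rx is R C R^T
C_prime = R @ C @ R.T
print(C_prime)
# [[ 1.     0.     0.   ]
#  [ 0.     3.5   -0.866]
#  [ 0.    -0.866  2.5  ]]
```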

3. Principal component of the transformed data

Q: What is the first principal component for the transformed data? [not sure]

A: The principal directions rotate along with the data, so the first principal component of the transformed data is $\mathbf{R}$ applied to the original one:
$$\mathbf{R}\begin{pmatrix} 0\\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ -\tfrac{\sqrt{3}}{2} \\ \tfrac{1}{2} \end{pmatrix} \quad \text{or equivalently} \quad \begin{pmatrix} 0 \\ \tfrac{\sqrt{3}}{2} \\ -\tfrac{1}{2} \end{pmatrix},$$
since the sign of a principal component is arbitrary.
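This can be checked by confirming that the rotated vector is an eigenvector of $\mathbf{C'}$ with the largest eigenvalue, 4 (a self-contained numpy sketch):

```python
import numpy as np

s = np.sqrt(3) / 2
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5,  -s],
              [0.0,   s, 0.5]])
C_prime = R @ np.diag([1.0, 2.0, 4.0]) @ R.T

# Rotate the original first principal component into the new coordinates
v = R @ np.array([0.0, 0.0, 1.0])
print(v)                                 # [ 0.    -0.866  0.5  ]
print(np.allclose(C_prime @ v, 4 * v))   # True: eigenvector with eigenvalue 4
```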

2D Classification

Assume we have a 2-dimensional, two-class classification problem. The first class is distributed uniformly in a square between $0 \le x_1 \le 2$ and $0 \le x_2 \le 2$. The second class is distributed uniformly in a circle with center $(2,2)$ and radius 1, so the two classes overlap in the quarter of the circle that lies inside the square.

1. Bayes error

Q: Assume that both classes are equally likely. What is the Bayes error for this problem?

A: The first question is where the decision boundary lies, i.e. where $p(x|\omega_1)p(\omega_1) \gtrless p(x|\omega_2)p(\omega_2)$. Because both classes are equally likely, the priors are $p(\omega_1) = p(\omega_2) = \tfrac{1}{2}$. The height of the pdf of class 1 (see the Note below) is $p(x|\omega_1) = \tfrac{1}{2 \cdot 2} = 0.25$, and the height of the pdf of class 2 is $\tfrac{1}{\pi r^2} = \tfrac{1}{\pi} \approx 0.32$. So the overlap region is assigned to class 2, and we only make errors on class 1:
$$\epsilon_1 = \frac{\textrm{overlap area}}{\textrm{area of square}} = \frac{\tfrac{1}{4}\pi r^2}{2 \cdot 2} = \frac{\pi}{16} \approx 0.196.$$
Class 2 is never misclassified, so $\epsilon_2 = 0$ and the total error is $\epsilon = \epsilon_1 \cdot \tfrac{1}{2} + \epsilon_2 \cdot \tfrac{1}{2} \approx 0.098$.
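The same number can also be approximated by Monte Carlo simulation (a hedged sketch; the sample size and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Class 1: uniform on the square [0,2] x [0,2]
x1 = rng.uniform(0, 2, size=(n, 2))

# The Bayes rule assigns the overlap (inside the circle around (2,2)) to class 2,
# so class-1 points that fall inside the circle are errors.
eps1 = np.mean(np.sum((x1 - 2) ** 2, axis=1) < 1)   # ~ pi/16 ~ 0.196

# Class-2 points are never misclassified under this rule (eps2 = 0)
print(0.5 * eps1 + 0.5 * 0.0)   # ~ 0.098
```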

Note:

The area under the graph of a probability density function is 1. The use of ‘density’ in this term relates to the height of the graph.
The height of the probability density function represents how closely the values of the random variable are packed at places on the x-axis.
Probability density function (for a continuous random variable)

2. Changed prior

Q: Now assume that the prior of class 1 is changed to 0.8. What will be the Bayes error now?

A: Now $p(x|\omega_1)p(\omega_1) = \frac{0.8}{2 \cdot 2} = 0.2$, and $p(x|\omega_2)p(\omega_2) = \frac{0.2}{\pi r^2} \approx 0.06$.
So the overlap region is now assigned to class 1.
We make an error on class 2 of $\epsilon_2 = \frac{\textrm{overlap area}}{\textrm{area of circle}} = \frac{\tfrac{1}{4}\pi r^2}{\pi r^2} = \frac{1}{4}$, while $\epsilon_1 = 0$.
Total error: $\epsilon = \epsilon_1 \cdot 0.8 + \epsilon_2 \cdot 0.2 = 0.05$.
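Again a Monte Carlo check is possible; now the errors come from class-2 points that fall inside the square (same caveats as the sketch above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Class 2: uniform on the unit disc around (2, 2)
r = np.sqrt(rng.uniform(0, 1, n))     # sqrt gives a uniform density on the disc
theta = rng.uniform(0, 2 * np.pi, n)
x2 = np.stack([2 + r * np.cos(theta), 2 + r * np.sin(theta)], axis=1)

# With p(w1) = 0.8 the overlap is assigned to class 1, so class-2 points
# inside the square are the errors.
eps2 = np.mean((x2[:, 0] <= 2) & (x2[:, 1] <= 2))   # ~ 1/4

print(0.8 * 0.0 + 0.2 * eps2)   # ~ 0.05
```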

3. Logistic classifier

Q: Assume we fit a logistic classifier $p(\omega_1|x)=\frac{1}{1+\exp(-\mathbf{w}^T x - w_0)}$ on a very large training set. In which direction will $\mathbf{w}$ point?

A: The further you move towards the upper right corner of the feature space, the more likely you are to find class 2, so $p(\omega_1|x)$ should decrease in that direction.
When $x$ gets larger and larger ($x=[4,4]$ for instance) we want the denominator to become larger (so that $p(\omega_1|x)$ gets smaller). Therefore $-\mathbf{w}^T x - w_0$ should become large and positive, i.e. $\mathbf{w}^T x + w_0$ should become very negative. Because all elements of $x$ are positive in that corner, all elements of $\mathbf{w}$ should be negative.
So $\mathbf{w}$ points in the direction $[-1,-1]^T$.
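A small simulation supports this (a sketch, not part of the exam answer: it assumes scikit-learn's `LogisticRegression` and my own sampling code; the exact weight magnitudes depend on the regularisation, only the signs matter here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Class 1 (label 1): uniform on the square [0,2] x [0,2]
x1 = rng.uniform(0, 2, size=(n, 2))
# Class 2 (label 0): uniform on the unit disc around (2, 2)
r, th = np.sqrt(rng.uniform(0, 1, n)), rng.uniform(0, 2 * np.pi, n)
x2 = np.stack([2 + r * np.cos(th), 2 + r * np.sin(th)], axis=1)

X = np.vstack([x1, x2])
y = np.hstack([np.ones(n), np.zeros(n)])   # 1 = class omega_1

clf = LogisticRegression().fit(X, y)
print(clf.coef_)   # both weights negative, roughly along the direction [-1, -1]
```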

4. Choose a classifier

Q: Now we have three classifiers available: (1) the nearest mean classifier, (2) the quadratic classifier and (3) the 1-nearest neighbour classifier. What classifier should you choose for (a) very small training set sizes, and for (b) very large training set sizes?

A:
(a) If we have only a small number of training samples, we need a very simple, inflexible and stable classifier: the nearest mean classifier.
(b) If we have a very large number of training samples, we can afford a complex, flexible classifier. The most flexible of the given classifiers is the 1-nearest neighbour classifier.

Alternative perceptron classifier

Q: Assume we optimise a linear classifier $\hat{y}=\mathrm{sign}(\mathbf{w}^T \mathbf{x}+w_0)$ by minimising an alternative perceptron loss:
$$J(\mathbf{w},w_0)=\sum_{\textrm{misclassified}\ \mathbf{x}_i} \sqrt{-y_i (\mathbf{w}^T\mathbf{x}_i + w_0)}.$$
We start with the initialisation $\mathbf{w}=[1,0]^T$, $w_0=0.01$, and we use a learning rate of $\eta=0.1$.
Given the dataset $(\mathbf{x}_1=[0,-1]^T, y_1=-1)$, $(\mathbf{x}_2=[1.5,0]^T, y_2=+1)$, $(\mathbf{x}_3=[0,+1]^T, y_3=+1)$, what are the parameter values after one update step?

A: The (alternative) perceptron should minimise the given loss. This is done by gradient descent: $\mathbf{w}_{\textrm{new}} = \mathbf{w}_{\textrm{old}} - \eta \frac{\partial J}{\partial \mathbf{w}}$. So we need the derivative of the loss w.r.t. $\mathbf{w}$ and $w_0$:
$$\frac{\partial J}{\partial \mathbf{w}} = \sum_{\textrm{misclassified}\ \mathbf{x}_i} \frac{-y_i \mathbf{x}_i}{2\sqrt{-y_i (\mathbf{w}^T\mathbf{x}_i + w_0)}}.$$
Next we need to find which objects are misclassified. Only $\mathbf{x}_1$ is misclassified (its predicted sign is $+1$ while $y_1=-1$), so
$$\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{2}\frac{[0,-1]^T}{\sqrt{0.01}} = [0,-5]^T.$$
For $w_0$ we get the derivative
$$\frac{\partial J}{\partial w_0} = \sum_{\textrm{misclassified}\ \mathbf{x}_i} \frac{-y_i}{2\sqrt{-y_i (\mathbf{w}^T\mathbf{x}_i + w_0)}} = \frac{1}{2}\frac{1}{\sqrt{0.01}} = 5.$$
The update step then gives
$$\mathbf{w} = \mathbf{w} - \eta\,[0,-5]^T = [1, 0.5]^T, \qquad w_0 = w_0 - \eta \cdot 5 = 0.01 - 0.5 = -0.49.$$
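The same update can be reproduced with a few lines of numpy (a sketch of exactly this one gradient step; variable names are mine):

```python
import numpy as np

X = np.array([[0.0, -1.0], [1.5, 0.0], [0.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
w, w0, eta = np.array([1.0, 0.0]), 0.01, 0.1

# A sample is misclassified when y_i * (w^T x_i + w0) < 0
margins = y * (X @ w + w0)
mis = margins < 0                         # only x_1 here

# Gradient of J = sum over misclassified sqrt(-y_i (w^T x_i + w0))
denom = 2 * np.sqrt(-margins[mis])        # 2 * sqrt(0.01) = 0.2
grad_w = np.sum(-y[mis, None] * X[mis] / denom[:, None], axis=0)   # [0, -5]
grad_w0 = np.sum(-y[mis] / denom)                                  # 5

print(w - eta * grad_w, w0 - eta * grad_w0)   # [1.  0.5] -0.49
```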

Note: Linear classification algorithms: the Perceptron

1D Regression

Given are 5 one-dimensional input data points $\mathbf{X}=(-1,-1,0,1,1)^T$ and their 5 corresponding outputs $\mathbf{Y}=(0,0,1,0,1)^T$.
We are going to have a look at linear regression using polynomial basis functions.

1. bias term

Q: Fit a linear function (including the bias term) to this data under the standard least-squares loss. What value does the bias term take on?

A: Add a column of ones to the original $\mathbf{X}$ to model the intercept (data column first, ones column second). Then $X^TX =\begin{bmatrix} 4 & 0 \\ 0 & 5 \end{bmatrix}$, $X^TY = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, and $(X^TX)^{-1}X^TY =\begin{bmatrix} \tfrac{1}{4}\\ \tfrac{2}{5} \end{bmatrix}$. So the intercept equals $\tfrac{2}{5}$.

2. slope of linear function

Q: Fit a linear function (including the bias term) to this data under the standard least-squares loss. What value does the slope take on (i.e. what is the coefficient for the linear term)?

A: From the result calculated in the previous question, the slope is $\tfrac{1}{4}$.
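Both coefficients can be checked with a standard least-squares solve (a minimal numpy sketch; the column order matches the derivation above, data column first, ones column second):

```python
import numpy as np

x = np.array([-1.0, -1.0, 0.0, 1.0, 1.0])
Y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

# Design matrix: the inputs plus a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coef)   # [0.25 0.4 ] -> slope 1/4, intercept 2/5
```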

3. total loss

Q: Let us now fit a parabola, a second-order polynomial, to this data. Again, we use the standard squared loss as our optimality criterion. What total loss (i.e., the loss summed over all training data points) does the optimal second-order polynomial attain? (Rather than doing the computations, you may want to have a look at a sketch of the situation.)

A: With three degrees of freedom, a least-squares fit can match three target values exactly, provided they lie at three distinct input locations. Input $-1$ occurs twice, but the corresponding outputs are the same, so this is effectively one point that needs to be fitted. At $+1$ we find two different outputs, so the best we can do there is to go right in between them. All in all, we can get 0 error at $x=-1$, 0 error at $x=0$, and $2 \cdot (1/2)^2 = 1/2$ from the two points at $x=+1$.
So the total loss is $\tfrac{1}{2}$.
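The value can be confirmed by actually fitting the parabola (a short numpy sketch):

```python
import numpy as np

x = np.array([-1.0, -1.0, 0.0, 1.0, 1.0])
Y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

# Least-squares fit of a second-order polynomial
coef = np.polyfit(x, Y, deg=2)
residuals = Y - np.polyval(coef, x)
print(np.sum(residuals ** 2))   # 0.5
```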

4. total loss with higher order polynomial

Q: Again determine the total loss over the training data, but now assume we optimally fitted a third-order polynomial.

A: Going to a third-order polynomial (or any higher order, for that matter) cannot improve on the second-order fit: the remaining loss comes from the two different outputs at the same input $x=+1$, which no function can fit simultaneously. So the total loss remains $\tfrac{1}{2}$.

5. MLE

Q: Rather than just fitting a least-squares model, we consider a maximum likelihood solution under an assumed Gaussian noise model. That is, we assume that outputs are obtained as a function $f$ of $x$ plus some fixed-variance, independent Gaussian noise.
If our fit to the 5 data points equals the constant zero function, i.e. $f(x)=0$, what then is the maximum likelihood estimate of the variance of the Gaussian noise?

A: The variance equals $\frac{1}{\textrm{precision}}$ and is simply estimated by the average squared residual on the training data, i.e., $(0 + 0 + 1 + 0 + 1)/5 = 2/5$.
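In code (a minimal sketch of this estimate; `f_x` is my name for the constant zero fit):

```python
import numpy as np

Y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
f_x = np.zeros_like(Y)   # the fitted function f(x) = 0

# ML estimate of the Gaussian noise variance: the mean squared residual
# (note: divide by n, not n-1, for the maximum likelihood estimate)
sigma2_ml = np.mean((Y - f_x) ** 2)
print(sigma2_ml)   # 0.4
```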

Note:

Maximum Likelihood Estimation Explained - Normal Distribution

Curves

Assume that we have a two-class classification problem. Each of the classes has a Gaussian distribution in $k$ dimensions: $p(\mathbf{x}|\omega_i) = \mathcal{N}(\mathbf{x};\mu_i,\mathbf{I})$, where $\mathbf{I}$ is the $k \times k$ identity matrix. The means of the two classes are $\mu_1 = [0,0,\ldots,0]^T$ and $\mu_2 = [2,2,\ldots,2]^T$. We have $n$ objects per class. On this data a nearest mean classifier is trained.

1. number of features

Q: When the number of features increases…
A: The Bayes error decreases. Each of the features contributes a bit to the discrimination between the classes, so if we knew the distributions perfectly, the class overlap would keep decreasing.
A: The true error first decreases, then increases again. The classifier is trained on a finite amount of training data, so at some point it will suffer from the curse of dimensionality and the performance will deteriorate: the true error first goes down (more useful information) and later goes up (overfitting in a too large feature space).

2. feature reduction & influence of the number of features

Q: Before we train a classifier, we also perform a forward feature selection to reduce the number of features to $m=\lceil\tfrac{k}{2}\rceil$. When the number of features increases…
A: The true error first decreases, then increases again. Feature reduction can be used to combat the curse of dimensionality, but when you push it too far (you keep increasing the number of features), the classifier will still suffer from the curse. Fundamentally nothing has changed: first the true error goes down, but at a certain moment it increases again.
