Review & Gradient Descent -- L4 for Data Science

Goal: We want to find the combination of betas that minimizes the residual sum of squares.

 The interpretation of the betas after fitting:

  1. They quantify the relation between each x_i and y: when we change x_1 by one unit while holding the other x_i fixed, beta_1 is the corresponding change in y.
  2. The betas we get are only point estimates. It is possible that the actual value of beta for some x_i is 0, which would mean that x_i has no relation to y at all. To check this, we compute a confidence interval around the point estimate \hat{\beta}_i. If that interval includes 0, we cannot reject the null hypothesis (beta_i = 0).

 So the confidence interval is how we quantify the uncertainty in the estimated coefficients (betas).

This uncertainty carries over when we make a new prediction of y. The prediction interval adds to the uncertainty from estimating the betas, because a single new prediction also contains inherent noise (recall our assumption that the noise is normally distributed and centered at 0). We must account for that noise since we are making just one prediction.

Think back to the confidence interval: when we estimate the coefficients, we have many measurements in the given dataset, which average out the normal noise just mentioned. But averaging never removes the uncertainty in the betas themselves: that uncertainty is inherent to the original dataset and cannot be removed once the data are given.

So when we fit the betas and then make one new prediction (measurement), we face both the uncertainty from the betas and the normal noise. This explains why prediction intervals are always wider than confidence intervals.
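The gap between the two intervals is easy to see in the textbook formulas for simple linear regression. A minimal numpy sketch with synthetic data (all variable names here are hypothetical): the standard error for a single new observation has an extra "1 +" term for the inherent noise, so it is always larger.

```python
import numpy as np

# Hypothetical data: y = 2 + 0.5*x + normal noise.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)

# Fit by least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - 2)           # noise variance estimate

x0 = 5.0                                # new point where we predict
xbar = x.mean()
Sxx = ((x - xbar) ** 2).sum()

# Standard error of the *mean* prediction (confidence interval)...
se_mean = np.sqrt(s2 * (1 / n + (x0 - xbar) ** 2 / Sxx))
# ...and of a *single new* observation (prediction interval): the
# extra "1 +" term is the inherent noise of one measurement.
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))

print(se_mean, se_pred)  # se_pred is always the larger of the two
```

Multiplying each standard error by the appropriate t-quantile gives the interval half-widths; the ordering se_pred > se_mean is what makes the prediction interval wider.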


 Interpretation example:

When we spend $82.5 on Facebook ads, $112.8 on Google, and $53.7 on TV, the revenue is $345.

We notice that for Google, every dollar spent on that channel appears to reduce our revenue by $0.3. Is that true? To find out, we have to run a hypothesis test, which means looking at either the p-value of the coefficient or its 95% confidence interval.

We find that the p-values for both Google and Facebook are quite large (remember the benchmark we usually use is 0.05). If a p-value is larger than 0.05, the test statistic is not in the rejection region, and we cannot reject the null hypothesis that the coefficient is 0.

 In this case, we cannot draw a conclusion about the relationship. What makes this happen?

If we check the raw data, we find a sudden, simultaneous increase in Facebook and Google spending somewhere between day 20 and day 40 (the x-axis indicates date). What usually happens is that a marketing team in the company decided to increase spending on both channels at the same time. This means the Facebook and Google expenses are highly correlated.

Remember: if the input independent variables are (nearly) linear functions of each other, the resulting collinearity leads to unreliable point estimates in linear regression.
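This is straightforward to demonstrate numerically. In OLS the coefficient covariance is sigma^2 (X'X)^{-1}, so we can compare the coefficient variances with independent predictors versus two predictors that move together (as the Facebook/Google spends do). A minimal sketch with synthetic data; the variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
sigma = 1.0

# Case 1: independent predictors.
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)
# Case 2: x2 nearly duplicates x1 (two channels whose spend moves together).
x2_coll = x1 + 0.05 * rng.normal(size=n)

def coef_variances(x2):
    X = np.column_stack([np.ones(n), x1, x2])
    # Var(beta_hat) = sigma^2 * diag((X'X)^-1)
    return sigma**2 * np.diag(np.linalg.inv(X.T @ X))

v_indep = coef_variances(x2_indep)
v_coll = coef_variances(x2_coll)
print(v_coll[1] / v_indep[1])  # variance blows up under collinearity
```

The inflated variance is exactly why the p-values come out large: the point estimates are so unstable that 0 cannot be excluded.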


In this graph, we find that in each department, the acceptance rate for women is never lower than for men.

But if we aggregate the data, the overall acceptance rate for women is lower than for men. How is that possible?

Someone who only looks at the last row will reach a conclusion opposite to someone who sees the first 5 rows. This is known as Simpson's paradox.

Let's extend this into linear regression case.

We have a completely identical dataset in the two graphs. Looking only at the black dots, we see a downward relation between x and y; fitting the data gives a line with negative slope, which seems fine.

But suppose we know more about the dataset, for example that some of the data points are clustered together. In the graph on the right, imagine the different colors of data points come from different departments. Running a separate linear regression per department gives a completely different picture: each colored line shows an upward relation.

 On the surface, neither method is wrong. We can treat all the black points as one entity, or, if we have more information about the data, such as clearly separated clusters, it may make more sense to break the data down into different linear regressions. It really depends on the problem setting.

This gives us the idea that the data is fixed, but we sometimes need to go out and learn the details and nuances of the data before we apply linear regression.
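The regression version of the paradox can be reproduced in a few lines: build clusters whose internal trend is upward but whose centers drift downward, then fit pooled and per-group lines. A sketch with synthetic data (three hypothetical "departments"):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_slope(x, y):
    # least-squares slope of y on x
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Three clusters: each has slope +1 internally,
# but the cluster centers trend downward.
centers = [(0, 10), (5, 5), (10, 0)]
xs, ys, group_slopes = [], [], []
for cx, cy in centers:
    x = cx + rng.uniform(-1, 1, 30)
    y = cy + 1.0 * (x - cx) + rng.normal(0, 0.2, 30)
    group_slopes.append(fit_slope(x, y))
    xs.append(x)
    ys.append(y)

pooled_slope = fit_slope(np.concatenate(xs), np.concatenate(ys))
print(pooled_slope)    # negative: the pooled trend goes down
print(group_slopes)    # all positive: each group trends up
```

The pooled fit is dominated by the between-cluster trend, while the per-group fits see only the within-cluster trend; both are "correct", which is exactly the point.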



Recall the first order Taylor expansion:

For x in the vicinity of x_0:  f(x) \approx f(x_0) + f'(x_0)(x - x_0)

If we take the gradient-descent step \Delta\beta = \beta - \beta_0 = -\alpha \nabla_\beta f(\beta)\big|_{\beta_0} and plug it back into the Taylor expansion, we have:

f(\beta) \approx f(\beta_0) - \alpha \left( \nabla_\beta f(\beta)\big|_{\beta_0} \right)^2

Then f(\beta) is smaller than f(\beta_0), because we subtract a positive value. If we keep doing this, f(\beta) keeps decreasing, and eventually we reach a minimum.


 Let's take the relation between horsepower and mpg as an example:

 If we fix the slope beta_1 and keep changing the intercept beta_0, we get many parallel red lines. For each red line we compute the loss function (the residual sum of squares); plotting loss against the intercept gives a parabola, and we can see the intercept value that yields the smallest loss.

We now try to find this intercept beta_0 by gradient descent:

* We use a superscript to indicate the iteration: \beta^{(k)} is the value at the kth iteration.

We also fix beta_1 = -0.2 and start from a random value of beta_0; here it is 60.

So in the first iteration, we just plug in the value to get the loss:

L^{(1)} = L(\beta^{(1)}) = \sum_{i=1}^{n}(y_i-(\beta_0^{(1)}+\beta_1 x_{i1}))^2, \qquad \beta_0^{(1)} = 60,\ \beta_1 = -0.2

Then we update beta_0 using the gradient of the loss function with respect to beta_0 (because beta_1 is fixed):

\beta_0^{(2)} = \beta_0^{(1)} - \alpha(-2\sum_{i=1}^{n}(y_i-(\beta_0^{(1)}+\beta_1 x_{i1})))

Then we use the new beta_0 to compute the loss function again:

 L^{(2)} = L(\beta^{(2)})=\sum_{i=1}^{n}(y_i-(\beta_0^{(2)}+\beta_1 x_{i1}))^2

Finally, we take the difference between the two losses: L^{(2)} - L^{(1)}.

In the same way, we get \beta_0^{(3)} from \beta_0^{(2)}:

\beta_0^{(3)} = \beta_0^{(2)} - \alpha(-2\sum_{i=1}^{n}(y_i-(\beta_0^{(2)}+\beta_1 x_{i1})))

Then we get L^{(3)}, then L^{(3)}-L^{(2)}, and so on. We keep updating beta_0 until the difference in loss converges, i.e. stops changing significantly; that is where we set the stopping threshold.

So the final value of beta_0 gives us (approximately) the smallest value of the loss function.

An important point: every update requires a pass through all the data points (every x_i and y_i) to compute the new beta_0.
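The update loop above can be sketched directly in code. The data here are synthetic stand-ins for the horsepower/mpg example (the true relationship and noise level are made up), with beta_1 fixed at -0.2 and beta_0 starting at 60, exactly as in the notes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
hp = rng.uniform(50, 200, n)                    # hypothetical horsepower values
mpg = 40.0 - 0.2 * hp + rng.normal(0, 2.0, n)   # hypothetical mpg values

beta1 = -0.2          # slope is held fixed, as in the notes
beta0 = 60.0          # starting intercept
alpha = 0.001         # learning rate
tol = 1e-8            # convergence threshold on the change in loss

def loss(b0):
    r = mpg - (b0 + beta1 * hp)
    return r @ r

prev = loss(beta0)
for _ in range(10000):
    # dL/dbeta0 = -2 * sum(y_i - (beta0 + beta1 * x_i)); uses every data point
    grad = -2.0 * np.sum(mpg - (beta0 + beta1 * hp))
    beta0 = beta0 - alpha * grad                 # the update rule
    cur = loss(beta0)
    if abs(prev - cur) < tol:                    # diff in loss has converged
        break
    prev = cur

print(beta0)  # close to the closed-form optimum mean(mpg - beta1*hp)
```

With beta_1 fixed, the optimal intercept has a closed form, mean(y - beta_1 x), so the loop's answer can be checked against it directly.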

Remember that we fixed one coefficient and updated a single scalar. We can now stop fixing beta_1 and extend the problem to higher dimensions:

The only difference is that we are now updating beta_0 and beta_1 at the same time. Assume the black point is the minimum. 

We can visualize the process in two different ways. 

Thoughts:

In reality, we may face problems with hundreds or thousands of coefficients. Solving linear regression with the normal equation then involves a large matrix, and recall that the normal equation requires taking a matrix inverse. That inversion is expensive: its time complexity is cubic in the number of coefficients, which can quickly make you wait forever for the matrix calculation.

On the other hand, with gradient descent each update passes through the data only once, so the cost per update is linear in the size of the dataset. We may not get the exact closed-form coefficients this way, but we get a very good approximation in a reasonable amount of time.
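The two routes can be compared side by side. A sketch on synthetic data (dimensions and coefficients are made up): the normal equation solves the system in one shot, while batch gradient descent repeats a linear-cost pass until it lands on essentially the same answer.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(0, 0.1, n)

# Normal equation: forming X'X is O(n p^2), solving it is O(p^3).
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: each update is a single O(n p) pass over the data.
beta_gd = np.zeros(p + 1)
alpha = 1e-3
for _ in range(2000):
    grad = -2.0 * X.T @ (y - X @ beta_gd)
    beta_gd = beta_gd - alpha * grad

print(np.max(np.abs(beta_gd - beta_ne)))  # the two solutions nearly agree
```

With only a few coefficients the normal equation wins easily; the iterative route pays off when p is large enough that the cubic inversion dominates.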


Let's move back to our cancer example. Notice the negative sign: it turns the problem of maximizing the log-likelihood into a minimization problem, which fits the gradient descent method.

So different combinations of beta_0 and beta_1 give us different logistic curves (the S-shaped red curves in the right figure). What we want is the combination of beta_0 and beta_1 that lands us at the minimum point in the left figure.

The essence of the slide is that when we use the logistic function, the gradient of the loss function can be expressed very compactly.
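That compact form is the standard result for the negative log-likelihood of logistic regression: the gradient is X^T(p - y), where p = sigmoid(X beta). A small sketch that verifies it against finite differences on made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(beta, X, y):
    # negative log-likelihood: minimizing it = maximizing the likelihood
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(beta, X, y):
    # the compact gradient: X^T (p - y)
    return X.T @ (sigmoid(X @ beta) - y)

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = (rng.uniform(size=20) < 0.5).astype(float)
beta = np.array([0.1, -0.3])

# Check the analytic gradient against central finite differences.
eps = 1e-6
num = np.array([
    (nll(beta + eps * e, X, y) - nll(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(2)
])
print(np.max(np.abs(num - grad(beta, X, y))))  # ~0: the compact form matches
```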


Remember that the amount we move from one point to another in each step is the product of the learning rate and the gradient. If the learning rate is large, we take a large step. In the figure below, at the starting point the gradient tells us to move left to go downhill, but with lr = 1.01 we overstep, ending up even farther from the minimum. We end up oscillating, and the value diverges instead of converging.
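A one-dimensional sketch of this effect: gradient descent on f(beta) = beta^2, whose gradient is 2*beta. With lr = 1.01 (the value in the figure) each step multiplies beta by 1 - 2*lr = -1.02, so the iterate flips sign and grows; with a small lr it shrinks toward the minimum at 0.

```python
def descend(lr, beta=1.0, steps=50):
    # gradient descent on f(beta) = beta^2, gradient = 2 * beta
    for _ in range(steps):
        beta = beta - lr * 2 * beta
    return beta

small = descend(lr=0.1)    # |beta| shrinks toward the minimum at 0
big = descend(lr=1.01)     # update factor is -1.02: oscillates and diverges
print(abs(small), abs(big))
```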

 

Remember that we pick our initial point randomly, and the step size depends on the learning rate, so it is possible to fall into a local minimum and never escape, since the value still converges there.

 

Why can we add noise?

The gradient only tells us which direction to go from the current position. If we add noise, we may randomly move a little sideways, which might push us onto a different path, one that leads to the global minimum.

Why SGD?

  • By computing the gradient from just one data point, we no longer move in the steepest-descent direction of the full loss function. This opens up new paths.
  • It lets us update our coefficients faster, because we don't have to go through the whole dataset for each update.

So batch gradient descent and SGD are the two extremes; in between, we can update using any number of data points between 1 and the full dataset size. This is called mini-batch gradient descent.
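A minimal SGD sketch on synthetic linear-regression data (the data and step size are made up): each update uses the gradient from a single point, so it is cheap but noisy, and the loss still comes down over the epochs.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 + 1.5 * x + rng.normal(0, 1.0, n)
X = np.column_stack([np.ones(n), x])

beta = np.zeros(2)
alpha = 0.005

def loss(b):
    r = y - X @ b
    return r @ r / n

start = loss(beta)
for epoch in range(20):
    for i in rng.permutation(n):          # visit points in random order
        xi, yi = X[i], y[i]
        g = -2.0 * (yi - xi @ beta) * xi  # gradient from a single point
        beta = beta - alpha * g           # one cheap, noisy update

print(loss(beta) < start)  # noisy path, but the loss still decreases overall
```

Replacing the single index `i` with a small batch of indices (and averaging their gradients) turns this same loop into mini-batch gradient descent.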

So the green path is batch gradient descent from the previous example, while SGD gives a zig-zag path like the red curves. A given update is not guaranteed to decrease the loss function; sometimes the loss even increases! But that's normal in SGD. The hope is that in the long run it still leads us to the global minimum, and we get more chances to jump out of a local minimum if one exists.

 

A saddle point can be described as above.

We can also get stuck at a saddle point if we simply follow steepest descent. Imagine going down along the red line: when we arrive at the saddle point, it is also a maximum along the grey direction, so the gradient there is 0.

We can escape from the saddle point simply by using the noise method we mentioned.
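The classic toy surface f(x, y) = x^2 - y^2 has a saddle at the origin, and a short sketch shows both the trap and the escape: starting exactly on the ridge y = 0, plain descent slides into the saddle and stops; a tiny nudge in y is then amplified by every subsequent step.

```python
def step(p, lr=0.1):
    x, y = p
    # f(x, y) = x^2 - y^2; gradient = (2x, -2y)
    return (x - lr * 2 * x, y + lr * 2 * y)

# Plain descent starting on the ridge y = 0: we slide into the saddle (0, 0).
p = (1.0, 0.0)
for _ in range(100):
    p = step(p)
print(p)  # stuck at the saddle point

# Nudge y with a tiny bit of noise and keep descending: we escape.
q = (p[0], p[1] + 1e-6)
for _ in range(100):
    q = step(q)
print(abs(q[1]))  # grows: descent now moves away from the saddle
```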


A Commonly used method: 

 Here we just change the update rule: previously we only considered the gradient at the current position. Now we add a momentum term that remembers the direction of the updates before the current one (the red box term). This works like physical momentum.

See the figure above: by adding momentum, instead of zig-zagging, every update carries some momentum in the direction we are going down the hill. This usually helps us reach the minimum faster.

For \eta, we always take a value between 0 and 1.
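A sketch of the two update rules side by side on an ill-conditioned quadratic f(b) = 0.5 b'Ab (the matrix, step sizes, and \eta below are illustrative choices, not values from the slides). Momentum accumulates the previous directions in a velocity term and reaches the minimum in far fewer effective steps:

```python
import numpy as np

A = np.diag([1.0, 25.0])      # f(b) = 0.5 * b' A b, minimum at the origin
def grad(b):
    return A @ b

b0 = np.array([1.0, 1.0])

# Plain gradient descent.
b = b0.copy()
alpha_gd = 2.0 / (1.0 + 25.0)             # a good fixed step for this quadratic
for _ in range(100):
    b = b - alpha_gd * grad(b)
err_gd = np.linalg.norm(b)

# Gradient descent with momentum ("heavy ball").
b = b0.copy()
v = np.zeros(2)
alpha, eta = 1.0 / 9.0, 4.0 / 9.0         # eta in (0, 1): the momentum weight
for _ in range(100):
    v = eta * v - alpha * grad(b)         # remember the previous direction...
    b = b + v                             # ...and carry it into this update
err_mom = np.linalg.norm(b)

print(err_gd, err_mom)  # momentum ends up much closer to the minimum
```

The narrow valley (the 25-to-1 curvature ratio) is what forces plain gradient descent to zig-zag; momentum damps the oscillation across the valley while accelerating along it.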
