2.4 Understanding Exponentially Weighted Averages

布纸所云

Category: Deep Learning

Article link: https://blog.csdn.net/XindiOntheWay/article/details/82255247

If beta equals 0.9, you get the red line. If it’s much closer to one, say 0.98, you get the green line. And if it’s much smaller, maybe 0.5, you get the yellow line.

Let’s look at that a bit more closely to understand how this is computing averages of the daily temperature. So here’s that equation again, and let’s set beta equal to 0.9 and write out a few of the equations that this corresponds to.

$$\begin{aligned} v_{100}&=0.1\theta_{100}+0.9v_{99}\\ &=0.1\theta_{100}+0.9(0.1\theta_{99}+0.9v_{98})\\ &=0.1\theta_{100}+0.9(0.1\theta_{99}+0.9(0.1\theta_{98}+0.9v_{97}))\\ &\;\;\vdots\\ &=0.1\theta_{100}+0.1\cdot 0.9\,\theta_{99}+0.1\cdot 0.9^2\,\theta_{98}+0.1\cdot 0.9^3\,\theta_{97}+\cdots \end{aligned}$$

So one way to draw this in pictures would be this: let’s say we have some number of days of temperature. So this is theta and this is t. So theta 100 will be some value, then theta 99 will be some value, then theta 98, and so on; so this is t equals 100, 99, 98, and so on, for some number of days of temperature. And what we have then is an exponentially decaying function. So it starts from 0.1, then 0.1 times 0.9, then 0.1 times 0.9 squared, and so on. So you have this exponentially decaying function. And the way you compute $v_{100}$ is you take the element-wise product between these two functions and sum it up. So you take this value, theta 100 times 0.1, plus this value of theta 99 times 0.1 times 0.9, that’s the second term, and so on. So it’s really taking the daily temperature, multiplying it by this exponentially decaying function, and then summing it up. And this becomes your $v_{100}$.
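The equivalence between the recursive update and the "multiply by a decaying function and sum" picture can be checked numerically. The following is a minimal sketch, assuming $v_0 = 0$ and $\beta = 0.9$; the function names and the temperature values are made up for illustration and are not from the lecture.

```python
def ewa_recursive(thetas, beta=0.9):
    """Run v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0."""
    v = 0.0
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
    return v


def ewa_explicit(thetas, beta=0.9):
    """Element-wise product of the temperatures with the decaying weights, summed."""
    t = len(thetas)
    # The latest value gets weight (1 - beta); a value k days older gets
    # (1 - beta) * beta**k, which is the exponentially decaying function.
    weights = [(1 - beta) * beta ** (t - 1 - i) for i in range(t)]
    return sum(w * theta for w, theta in zip(weights, thetas))


# Hypothetical temperatures for days 1..100 (illustrative values only).
temperatures = [40 + 0.3 * day + (5 if day % 7 == 0 else 0) for day in range(1, 101)]

v_recursive = ewa_recursive(temperatures)
v_explicit = ewa_explicit(temperatures)
print(v_recursive, v_explicit)  # the two values agree up to floating-point error
```

Both functions compute the same $v_{100}$; the recursive form just never materializes the list of weights.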

It turns out that, up to details that are for later, all of these coefficients add up to one, or add up to very close to one, up to a detail called bias correction, which we’ll talk about in the next video. And because of that, this really is an exponentially weighted average.
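As a short worked check of that claim, assuming $v_0 = 0$ and writing the weight on $\theta_{t-k}$ as $(1-\beta)\beta^{k}$ (as in the expansion of $v_{100}$), the coefficients after $t$ days form a finite geometric series:

$$\sum_{k=0}^{t-1}(1-\beta)\,\beta^{k} \;=\; (1-\beta)\cdot\frac{1-\beta^{t}}{1-\beta} \;=\; 1-\beta^{t}$$

For $\beta = 0.9$ and $t = 100$ this gives $1 - 0.9^{100} \approx 0.99997$, which is very close to one. For small $t$ the sum is noticeably below one, and that gap is what bias correction deals with.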

And finally, you might wonder how many days’ temperature this is averaging over. Well, it turns out that $0.9^{10}\approx 0.35$, and this turns out to be about $1/e$, where $e$ is the base of the natural logarithm. And, more generally, if you have $1-\epsilon$, so in this example $\epsilon$ would be 0.1, since $\beta$ was 0.9, then

$(1-\epsilon)^{\frac{1}{\epsilon}}\approx \frac{1}{e} \approx 0.35$

And so, in other words, it takes about 10 days for the height of this to decay to around 1/3, or really $1/e$, of the peak. It’s because of this that when $\beta$ equals 0.9, we say that this is as if you’re computing an exponentially weighted average that focuses on just the last 10 days’ temperature. Because it’s after 10 days that the weight decays to less than about a third of the weight of the current day.

Whereas, in contrast, if beta was equal to 0.98, then, well, what do you need to raise 0.98 to the power of in order for this to be really small? It turns out that 0.98 to the power of 50 will be approximately equal to $1/e$. So the weights will be pretty big, bigger than $1/e$, for the first 50 days, and then they’ll decay quite rapidly after that. So intuitively, and this isn’t a hard and fast rule, you can think of this as averaging over about 50 days’ temperature. Because, in this example, to use the notation here on the left, it’s as if epsilon is equal to 0.02, so one over epsilon is 50.

And this, by the way, is how we got the formula that we’re averaging over $1/(1-\beta)$ or so days. Here, epsilon plays the role of $1-\beta$. It tells you, up to some constant, roughly how many days’ temperature you should think of this as averaging over. But this is just a rule of thumb for how to think about it, and it isn’t a formal mathematical statement.
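To see how rough the rule of thumb is, the sketch below computes the window $1/(1-\beta)$ and the relative weight $\beta^{1/(1-\beta)}$ for the three values of beta mentioned earlier. The helper names are hypothetical and chosen only for this illustration.

```python
import math


def effective_window(beta):
    """Rule-of-thumb number of days averaged over: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


def weight_ratio_after_window(beta):
    """Weight of a value one window old, relative to the current day's weight."""
    return beta ** effective_window(beta)


for beta in (0.5, 0.9, 0.98):
    print(
        f"beta={beta}: window ~ {effective_window(beta):.0f} days, "
        f"relative weight = {weight_ratio_after_window(beta):.3f}, "
        f"1/e = {1 / math.e:.3f}"
    )
```

For $\beta = 0.9$ the relative weight is about 0.349 and for $\beta = 0.98$ about 0.364, compared with $1/e \approx 0.368$. The approximation gets tighter as beta approaches one and is loose for small beta such as 0.5, which is consistent with treating it as a rule of thumb only.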

Finally, let’s talk about how you actually implement this. Recall that we start with $v_0$ initialized as zero, then compute $v_1$ on the first day, $v_2$, and so on. Now, to explain the algorithm, it was useful to write down $v_0$, $v_1$, $v_2$, and so on as distinct variables.
So just to say this again in a different format: you set $v_\theta$ equal to zero, and then, repeatedly, once each day, you get the next $\theta_t$, and then $v_\theta$ gets updated as $\beta$ times the old value of $v_\theta$, plus $1-\beta$ times the current value of $\theta_t$.

So one of the advantages of this exponentially weighted average formula is that it takes very little memory. You just need to keep one real number in computer memory, and you keep overwriting it with this formula based on the latest values that you got. And it’s really for this reason, the efficiency: it takes up basically one line of code, and storage and memory for just a single real number, to compute this exponentially weighted average.
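The single-variable update described here might look like the following in Python. This is a sketch rather than the course’s reference code: the generator name `running_ewa` and the sample temperatures are assumptions made for illustration.

```python
def running_ewa(stream, beta=0.9):
    """Yield the exponentially weighted average after each new observation.

    Only one float (v) is carried from one observation to the next.
    """
    v = 0.0  # initialized to zero, as described in the text
    for theta in stream:
        v = beta * v + (1 - beta) * theta  # the one-line update, overwriting v
        yield v


# Hypothetical daily temperatures (illustrative values only).
daily_temperatures = [40.0, 49.0, 45.0, 44.0, 50.0]
averages = list(running_ewa(daily_temperatures))
print(averages)
```

Because `v` starts at zero, the first few averages come out well below the actual temperatures; that early underestimate is what bias correction addresses.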

It’s really not the best way, not the most accurate way, to compute an average. If you were to compute a moving window, where you explicitly sum over the last 10 days’ or the last 50 days’ temperature and just divide by 10 or by 50, that usually gives you a better estimate. But the disadvantage of explicitly keeping all the temperatures around and summing over the last 10 days is that it requires more memory, and it’s just more complicated to implement and computationally more expensive.
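For comparison, here is one possible way to write the explicit moving-window average, using `collections.deque` from the Python standard library. The function name and the data are hypothetical; the point is only that this version has to store up to `window` values instead of a single number.

```python
from collections import deque


def moving_window_average(stream, window=10):
    """Yield the plain average of the most recent `window` observations."""
    recent = deque(maxlen=window)  # keeps up to `window` values in memory
    for theta in stream:
        recent.append(theta)  # the oldest value is dropped automatically when full
        yield sum(recent) / len(recent)


# Hypothetical daily temperatures (illustrative values only).
temperatures = [40.0, 49.0, 45.0, 44.0, 50.0, 52.0]
window_averages = list(moving_window_average(temperatures, window=3))
print(window_averages)
```

This sketch re-sums the window on every step for simplicity. A running sum would avoid that cost, but the memory for the stored window remains, which is the trade-off the text describes.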

So for cases where you need to compute averages of a lot of variables, and we’ll see some examples in the next few videos, this is a very efficient way to do so, from both a computation and a memory efficiency point of view, which is why it’s used in a lot of machine learning. Not to mention that it’s just one line of code, which is, maybe, another advantage.

So, now you know how to implement exponentially weighted averages. There’s one more technical detail that’s worth knowing about, called bias correction.
