Go Big by Being Small

Latest recommended article published on 2021-06-17 10:28:35

weixin_26707803

Original link: https://towardsdatascience.com/go-big-by-being-small-618d2da54b49

When I was in my final year at university, I was collecting datasets for the research paper that served as my final-year project. I was just casually scrolling through the internet and voila! It didn’t take me long to gather all of the datasets I needed. But just when I thought it was smooth sailing, a Kraken appeared. Not the sea monster, of course, but a problem that required tons of brainstorming sessions. The datasets I had collected were too small to work with: I’m talking 20 to 30 periodic observations, yikes. You may ask, why didn’t I realize they were insufficient just by looking at the number of observations? Well, to be frank, I did feel a little worried when I saw the “handful” of observations. But it only hit me when I realized they weren’t enough for the model I was researching.

After quite a few hours, a book, and a cup of coffee, I finally found inspiration on how to work with these small datasets: extrapolate them, appropriately. At first, I genuinely thought my idea would cause quite an error in the model, but thankfully it went well and I finished my paper. So in this article, I want to share the methods I used for a univariate dataset and a new method that I’ve developed for a multivariate dataset.

Let’s start easy: Univariate Dataset with Margin of Error (MOE)

A dataset that comes with an MOE is very useful for this extrapolation method, because the MOE is one of the key factors in how accurate the extrapolated values will be. In this case, I’ll be using the US annual mean income from the United States Census Bureau, Table S1901. With the MOE on board, we can easily get the minimum and maximum values of mean income for each year (the estimate minus and plus the MOE). Knowing these, we extrapolate each year’s value by generating random variates from the Uniform(0,1) distribution to represent the standardized values of the mean income. Then we convert the standardized values back to actual values using the minimum and maximum: actual value = minimum + u × (maximum − minimum), where u is the standardized value.
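The conversion described above can be sketched in a few lines of Python. This is only an illustration: the function name `extrapolate_with_moe` and the income figures are made up for the example and are not the Census values used in the article.

```python
import random

def extrapolate_with_moe(estimate, moe, n_values, rng):
    """Draw n_values points uniformly between estimate - moe and estimate + moe."""
    low, high = estimate - moe, estimate + moe
    # u ~ Uniform(0, 1) plays the role of the standardized value;
    # rescale it back to the [low, high] interval.
    return [low + rng.random() * (high - low) for _ in range(n_values)]

# Made-up annual figures: year -> (mean income, margin of error). NOT Census data.
annual = {2017: (80000.0, 400.0), 2018: (83000.0, 450.0), 2019: (86000.0, 500.0)}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
monthly = {
    year: extrapolate_with_moe(est, moe, 12, rng)  # 12 variates per year
    for year, (est, moe) in annual.items()
}

for year, values in monthly.items():
    print(year, [round(v) for v in values[:3]], "...")
```

Every generated value stays inside that year’s interval, which is why the overall trend survives while the series becomes noisier.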

Say I want to extrapolate the dataset to recreate monthly mean income. I’ll need 12 random uniform variates converted for each year. Here’s a side-by-side plot comparison of the real and the extrapolated datasets.

As we can see, the increasing trend is still there; it’s just noisier now that it has monthly instead of annual values. And if we check the difference between the statistical properties, it doesn’t differ much :)

Univariate Dataset without MOE

Now, this is the problem I mentioned earlier. I was confused about how I was supposed to get information on the periodic variance of the data I was working with. Luckily, the solution only requires two main ingredients: a time-series model that fits the dataset and some randomized standardized values.

In this example, I’m going to use the monthly sunspots dataset, which you can acquire here. And yes, it’s already a huge dataset, so no need for extrapolation, am I right? But let’s say you’re only given the last 3 years of observations and are told to generate daily values for those 3 years based on that.

Now let’s pick the model. From the beginning, we know that this is a monthly dataset. So why don’t we pick something simple? We’re going to fit a linear seasonal regression model to the dataset. Here’s the result:
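As a rough sketch of what such a fit involves, the simplest seasonal regression uses one indicator per calendar month and no trend term. In that case the least-squares estimates are just the monthly means, and each standard error comes from the pooled residual variance. Whether the article’s model also included a trend term is not stated, so treat this specification, the function name `fit_seasonal_means`, and the synthetic series as assumptions.

```python
import math
import random
from statistics import mean

def fit_seasonal_means(values, period=12):
    """OLS fit of y_t = beta_{month(t)} + error (one indicator per month, no trend).

    Returns (estimates, std_errors), each a list of length `period`.
    """
    groups = [values[m::period] for m in range(period)]
    # With indicator regressors only, each coefficient is that month's mean.
    estimates = [mean(g) for g in groups]
    # Residual variance pooled over all months, with N - period degrees of freedom.
    rss = sum((y - estimates[m]) ** 2 for m, g in enumerate(groups) for y in g)
    sigma2 = rss / (len(values) - period)
    std_errors = [math.sqrt(sigma2 / len(g)) for g in groups]
    return estimates, std_errors

# Synthetic stand-in for 3 years of monthly observations (NOT the sunspot data).
rng = random.Random(0)
series = [50 + 20 * math.sin(2 * math.pi * t / 12) + rng.gauss(0, 5) for t in range(36)]

estimates, std_errors = fit_seasonal_means(series)
for month, (est, se) in enumerate(zip(estimates, std_errors), start=1):
    print(f"month {month:2d}: estimate={est:6.2f}, std error={se:5.2f}")
```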

That’s quite a good fit. Now we’re going to use the estimates and the standard errors from this result to extrapolate the data. In other words, looking back at the previous example, the estimates play the role of the “mean income” and the standard errors give us the MOE (the standard error scaled by the critical value for the chosen confidence level). Since we’re going to generate daily values, the values will be generated according to the number of days in each month, along with the estimate and standard error. I’m using a confidence level of 95% from this point on. Here are the extrapolated daily values:
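A minimal sketch of this step, assuming the normal critical value 1.96 for the 95% level (the article does not say whether a normal or a t critical value was used). The function name `daily_values` and the numbers passed to it are hypothetical.

```python
import calendar
import random

Z_95 = 1.96  # normal critical value for a 95% confidence level (an assumption)

def daily_values(year, month, estimate, std_error, rng):
    """Generate one uniform draw per day of the month inside estimate +/- MOE."""
    moe = Z_95 * std_error
    low, high = estimate - moe, estimate + moe
    n_days = calendar.monthrange(year, month)[1]  # number of days in that month
    return [low + rng.random() * (high - low) for _ in range(n_days)]

rng = random.Random(1)
# Hypothetical fitted estimate and standard error for February 2020.
feb_2020 = daily_values(2020, 2, estimate=45.0, std_error=3.0, rng=rng)
print(len(feb_2020), [round(v, 1) for v in feb_2020[:5]])
```

Because every day in a month is drawn from the same interval, the generated series can only be as good as the fitted monthly estimates it is built on.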

One thing that immediately feels off is that the extrapolated values lack the decreasing trend seen in the original dataset. I’m doing this on purpose to show how important it is to pick a model appropriate to the dataset we’re working on. From this result, we can conclude that the linear seasonal regression model is not the perfect fit for this dataset. Moreover, by using a regression we immediately assume the dataset is stationary, which causes the extrapolated values to look like a stationary time series.

Multivariate Dataset

Down to the last example. It took me quite a while to think of a way to extrapolate a multivariate dataset. Nevertheless, here’s one method of doing it. In this last example, I’m using the New Delhi climate training dataset from Kaggle.

Likewise, let’s investigate the dataset first. Since I was expecting a correlation between the variables, I’ll start with the scatterplots between the variables.

Now my eyes immediately go to the pressure panels, despite the apparent negative correlation between temperature and humidity. Something feels off with the plot, and I immediately realize there must be some outliers, since some values differ greatly from the rest. I’m no expert in climate science, so I’m calling our best friend and jack-of-all-trades, Google, to help me find the normal values for air pressure, and it sent me here. Turns out, the values should be around 1013.25 millibars. Hence, according to the dataset and the website, pressure values between 990 and 1024 will be considered normal. The outliers will then be replaced according to the distribution of the dataset.
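The article does not spell out how the replacement is done. One possible reading, sketched below purely as an assumption, is to fit a normal distribution to the in-range readings and redraw each outlier from it until the draw falls inside the accepted range. The function name `replace_outliers` and the readings are made up for illustration.

```python
import random
from statistics import mean, stdev

LOW, HIGH = 990.0, 1024.0  # "normal" pressure range from the article, in millibars

def replace_outliers(pressures, rng):
    """Replace out-of-range readings with draws from a normal fitted to the rest."""
    normal = [p for p in pressures if LOW <= p <= HIGH]
    mu, sigma = mean(normal), stdev(normal)
    cleaned = []
    for p in pressures:
        while not LOW <= p <= HIGH:
            p = rng.gauss(mu, sigma)  # redraw until the value is inside the range
        cleaned.append(p)
    return cleaned

readings = [1015.2, 1008.7, 7679.3, 1011.0, -3.0, 1002.4]  # made-up readings
cleaned = replace_outliers(readings, random.Random(7))
print([round(p, 1) for p in cleaned])
```

In-range readings pass through unchanged, so only the suspicious values are touched.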

You might be wondering, there must be a twist to this example since there are already a lot of observations. YOU GUESSED IT RIGHT! (really sorry for my corny jokes trying to get your attention back lol)

The twist here is that you’re actually given the monthly average of each variable and you need to convert it back to daily values. Now, based on the last two examples, please answer this question:

Is it going to work? Is it possible to do so?

Save your answer until the end of this article, and let’s see.

First, as we did earlier, let’s take a look at the scatterplots between the variables.

Well, it seems the variables are correlated with each other. Here’s what I can see from this plot:

The most definite relation is between temperature and pressure; it’s a negative correlation.
The rest seem to have a moderate correlation and look like they might fit a quadratic model.

With these in mind, I decided to create linear and quadratic regression models for every possible pair of variables, then compare their R-squared and adjusted R-squared values. Also, I’m going to create a linear seasonal regression model for each variable, since each definitely has a seasonal pattern based on the plots below.
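The linear-versus-quadratic comparison for a single pair of variables could look roughly like the sketch below, with both variables standardized first. It uses NumPy’s `polyfit` and `polyval`; the helper names `r_squared` and `standardize` and the synthetic pressure and temperature values are assumptions for illustration, not the Kaggle data.

```python
import numpy as np

def r_squared(x, y, degree):
    """Fit a polynomial of the given degree and return (R^2, adjusted R^2)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    n, k = len(y), degree  # k predictors, intercept not counted
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

def standardize(v):
    """Center to mean 0 and scale to unit sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

# Synthetic monthly averages with a curved, negative relation (NOT the real data).
rng = np.random.default_rng(3)
pressure = rng.normal(1008.0, 7.0, size=48)
dev = pressure - 1008.0
temperature = 25.0 - 0.9 * dev + 0.05 * dev ** 2 + rng.normal(0.0, 2.0, size=48)

x, y = standardize(pressure), standardize(temperature)
for degree, label in [(1, "linear"), (2, "quadratic")]:
    r2, adj = r_squared(x, y, degree)
    print(f"{label}: R^2={r2:.3f}, adjusted R^2={adj:.3f}")
```

Adjusted R-squared is the fairer yardstick here, since plain R-squared can never decrease when the quadratic term is added.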

Before doing the regressions, it’s best to standardize the values, since the variables have very different scales. Here’s the result of the model fitting:

Let’s focus on the relation between the variables. Excluding the seasonal regression results (rows 1–4), the highest R-squared value belongs to the quadratic model with pressure as the independent variable and temperature as the dependent one. The other models don’t seem to fit well, even though the scatterplots indicated correlation. Fortunately, the seasonal model is a great fit for all variables. With these in mind, here’s my plan:

And now, the moment you’ve been waiting for, the comparison of the real versus the extrapolated values (the blue line is the extrapolated one).

The extrapolated values fit the actual values well, and not too badly for temperature. But our million-dollar question hasn’t been answered yet. To convert the values back to daily values, we’re going to need a little bit of math here.

in which n is the number of samples. Then, we can acquire the variance of the monthly averages, which is

in which Yj^s is the standardized version of the monthly averages. Finally, we derive the standard error of the daily values with this set of equations:

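A standard relationship that matches this description, offered here as an assumption rather than the author’s exact equations, links the variance of an average to the variance of the individual values. It assumes the n daily values within a month are independent with a common standard deviation σ, and it writes m for the number of monthly averages (a symbol introduced only for this sketch):

```latex
\operatorname{Var}(\bar{Y}) = \frac{\sigma^2}{n},
\qquad
s_{\bar{Y}}^2 = \frac{1}{m-1}\sum_{j=1}^{m}\left(Y_j^{s} - \bar{Y}^{s}\right)^2,
\qquad
\hat{\sigma} = \sqrt{n}\, s_{\bar{Y}}
```

Under that independence assumption, the spread of the daily values is the spread of the monthly averages inflated by a factor of √n, which already hints at how noisy the daily extrapolation can become.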
Aaaandd without further ado, let’s see how the daily extrapolated values turned out.

My first reaction was “What kind of noisy time-series is this? This is nuts!”. I don’t think we need to explain anything to answer the question: it’s a definite no, at least using this method. The extrapolated values become too noisy and are only effective for the short term, since we use extrapolated data to extrapolate (#extrapo-ception). Moreover, the monthly averages don’t carry the “jumps” that the daily values do, so the extrapolated daily values are unable to capture them.

Conclusion

This extrapolation method can only create values according to the dataset used in the calculations, and the generated values will follow that dataset’s characteristics.
The stationarity assumption may be why the method fails to detect “jumps”. A more appropriate model might therefore generate better-fitting extrapolated values.
Even if the extrapolated values are perfect, that doesn’t mean they would be a perfect representation of the population. Nevertheless, it’s still better to have an estimated depiction of what the population might look like.

What’s next?

I might not be an expert in this, but I did learn to work creatively with a time-series dataset. Even so, I would like to hear any suggestions that may improve this method further. So, below is my GitHub repo for this time-series extrapolation method. I will definitely post more data science or actuarial science projects in the near future, so stay tuned!