How Popular Is Your Birthday?


In the years before 2020, it was common for a large number of school children (20–30 or more) to physically colocate for their math lessons. And in many a class, students were asked to compute the probability that two of them had the same birthday.


To do this computation, students typically make the default assumption that everybody is equally likely to be born on any given day (the uniform distribution). Without this assumption, the problem becomes hard to do by hand. On the other hand, you will often see visualizations floating around the Internet similar to the one below.

[Figure: Relative Popularity of Birthdays]

I’ll discuss the data used to generate the figures and values in this article more below. This chart shows the relative likelihood of someone having a birthday on a given day.


  • A value of 1 (like Dec 16) means that a random person has the same probability of being born on that day as you would expect from a uniform distribution.

  • A value lower than one (notably Jan 1, Jul 4, and Dec 24–26) means that a random person is that many times less likely to be born on that day than if the day were chosen uniformly at random.

  • A value above one (notably mid September) means that a random person is more likely to be born on that day versus what you would naïvely expect.


Note that for February 29th, which occurs approximately¹ every 4 years, the uniform distribution would imply that one is one-fourth as likely to be born on February 29th as on February 28th. This is accounted for in the chart above (without the correction, a Feb 29 birthday has a relative frequency of 0.92/4 = 0.23 compared to the probability of being born on any other day of the year assuming the uniform distribution).

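As a rough illustration of the leap-day correction, the sketch below computes the share of days that each calendar date would receive under a uniform distribution. It assumes a simple 4-year cycle with one leap day, as the text does; the function names are hypothetical and this is not the article's own code.

```python
# Sketch: uniform baseline for each calendar date over an assumed 4-year cycle.
DAYS_IN_CYCLE = 4 * 365 + 1  # three common years plus one leap year


def uniform_share(month, day):
    """Share of all days in the cycle that fall on this calendar date."""
    occurrences = 1 if (month, day) == (2, 29) else 4
    return occurrences / DAYS_IN_CYCLE


def relative_popularity(observed_share, month, day):
    """Observed share of births divided by the uniform baseline for that date."""
    return observed_share / uniform_share(month, day)


# Under the uniform assumption, Feb 29 gets one-fourth the baseline of Feb 28.
print(uniform_share(2, 29) / uniform_share(2, 28))  # 0.25
```

Dividing by a date-specific baseline is what lets February 29 sit on the same scale as every other day in the chart.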

The unpopularity of July 4th, a major holiday in the United States, should tip you off that this data is for the US only.


In the first part of this article, we will focus on the probability question, namely: what are the odds that two people in a room have the same birthday? First, we will compute the solution to the birthday problem using the assumption that birthdays are uniformly distributed through the year. Second, we will use Monte Carlo methods to compute the solution given the observed distribution.

The upshot is that the answer is unchanged even when we use the empirical distribution (though it gets close!). For each version of the problem we consider, 23 people are just enough to make the probability of two people sharing a birthday at least 50%.

In the second part of the article, we’ll explore the distribution of birthdays a little bit more and build a simple model to understand the variation in births.


The Uniform Birthday Problem

In school, the question could be given as follows:


Suppose there are n people in a room. Assume that nobody has a birthday on February 29th and that their birthdays are random variables uniformly distributed across the other 365 days of the year. Find the minimum value of n such that the probability of at least two students sharing a birthday is at least 50%.


To solve this problem, we instead compute the probability that no two students share a birthday. We number the students 1 to n. The probability of the first student not sharing a birthday with any previous student is 365/365=1. For the second student, there are 364 days not overlapping with previous students, so the probability is 364/365 that they don’t share a birthday with a previous student. The next student is 363/365 and so on.


The result is the following quantity which we will denote q. The goal is to find the minimal n such that q ≤0.5.


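$$q = \frac{365}{365}\cdot\frac{364}{365}\cdot\frac{363}{365}\cdots\frac{365-(n-1)}{365} = \prod_{k=0}^{n-1}\left(1-\frac{k}{365}\right)$$

Probability that no two students have a birthday in common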

At this point most people reach for their calculator. But we can continue to estimate by hand by taking a logarithm and a Taylor expansion:


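$$\begin{aligned}
\ln q &= \sum_{k=0}^{n-1} \ln\left(1-\frac{k}{365}\right) \\
&\approx -\sum_{k=0}^{n-1} \frac{k}{365} \\
&= -\frac{n(n-1)/2}{365}
\end{aligned}$$

Approximation of ln q using a first-order Taylor expansion for the logarithm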

In the first line, we take the logarithm, changing the product into a sum. We also rewrite each fraction in the form 1–x. The first-order Taylor expansion ln(1–x) ≈ –x is a good approximation when x is small; this is what we apply in the second line. The third line uses the formula for the sum of an arithmetic series (we are just adding up 1 + 2 + 3 + ⋯ + (n–1) in the numerator). We need ln q ≤ ln(1/2) = –ln 2, and ln 2 is famously about 0.69. Multiplying, 0.69 × 365 ≈ 252 (rounding up), so we need n(n–1)/2 ≥ 252, meaning n should be a bit over the square root of 2 × 252 = 504. So n = 23 is our guess based on these approximations (yielding 253 in the numerator of the ln q approximation). This is in fact the correct answer, as the following chart (computed exactly, to numerical precision) shows. Code is linked at the end.

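The hand calculation above can be checked numerically. The following is a minimal sketch, not the code from the linked repo; the function names are made up for illustration. It computes the exact product for q, the Taylor approximation, and the smallest n that reaches 50%.

```python
import math


def prob_shared_uniform(n, days=365):
    """Exact probability that at least two of n people share a birthday."""
    q = 1.0  # probability that all n birthdays are distinct
    for k in range(n):
        q *= (days - k) / days
    return 1.0 - q


def prob_shared_taylor(n, days=365):
    """Approximation based on ln q ≈ -n(n-1) / (2 * days)."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * days))


def min_group_size(threshold=0.5, days=365):
    """Smallest n whose exact shared-birthday probability reaches the threshold."""
    n = 1
    while prob_shared_uniform(n, days) < threshold:
        n += 1
    return n


print(min_group_size())                    # 23
print(round(prob_shared_uniform(22), 3))   # 0.476
print(round(prob_shared_uniform(23), 3))   # 0.507
```

The Taylor version lands on n = 23 as well, although only barely, which is why the exact product is the safer check.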

[Figure: probability that at least two of n people share a birthday, uniform distribution]

The Empirical Birthday Problem

Now we turn to solving the birthday problem using real data. The data in question is provided by FiveThirtyEight and is based on Social Security Administration (government) data giving birth counts on each day from Jan 1, 2000 to Dec 31, 2014. It’s reasonable to assume this covers nearly every birth in the United States during that time frame.


Taking the data as given, we turn to estimating the probability of two people having the same birthday. We assume the birthdays of our n people are independent draws from the empirical distribution of our data (i.i.d.). There isn’t a feasible direct way to compute this from the data, so we turn to Monte Carlo methods.

Specifically, for each n, we will sample from the distribution and observe if any birthdays are shared. Again, code (and test cases) are in the repo linked at the end. Somewhat surprisingly, the results are substantively unchanged. The results are graphed below, with the values for the uniform birthday problem plotted as well. The empirical result also has a 99% confidence interval, which is so small you cannot see it (using 100,000 simulations per n).

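The sampling procedure can be sketched as follows. This is an illustrative stand-in rather than the repo's implementation: the function name is hypothetical, and the example uses uniform weights because the FiveThirtyEight birth counts are not reproduced here. With the real data, `weights` would hold the number of births observed on each calendar day.

```python
import itertools
import random


def simulate_shared_birthday(weights, n, trials=20_000, seed=0):
    """Monte Carlo estimate of P(at least two of n people share a birthday).

    weights[i] is proportional to the probability of being born on day i.
    """
    rng = random.Random(seed)
    days = range(len(weights))
    cum_weights = list(itertools.accumulate(weights))  # computed once, reused per draw
    hits = 0
    for _ in range(trials):
        sample = rng.choices(days, cum_weights=cum_weights, k=n)
        if len(set(sample)) < n:  # fewer distinct days than people: a shared birthday
            hits += 1
    return hits / trials


# Placeholder distribution: uniform over 365 days (swap in real birth counts here).
uniform_weights = [1.0] * 365
estimate = simulate_shared_birthday(uniform_weights, n=23)
print(estimate)
```

With uniform weights the estimate should land near the exact value of about 0.507, which gives a quick sanity check before plugging in real counts.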

[Figure: probability of a shared birthday versus n, empirical and uniform distributions]

This distribution is close enough to uniform that the result n=23 continues to hold for the empirical case! In a sense, this was, in hindsight, an unnecessary analysis. The difference between the two curves is graphed below, and you can see that they are statistically indistinguishable.


[Figure: difference between the empirical and uniform curves]

To be clear, the distributions are not the same. However, this simulation was run with 100,000 draws per value of n, not enough statistical power to reveal the differences. Larger numbers of simulations would reveal statistically significant (but substantively small) differences.² For example, running 5,000,000 draws just for n=23 finally yields enough statistical power to distinguish. The empirical distribution has P=50.786%±0.022%, while the uniform distribution has P=50.730%. (The margin of error is 1 standard deviation, not a 95% confidence interval).

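The quoted margin of error appears consistent with the usual binomial standard error of a Monte Carlo proportion, sqrt(p(1–p)/N). The sketch below assumes that formula (the article does not state how its margin was computed) and compares the gap between the two probabilities with the standard error at both simulation sizes.

```python
import math


def monte_carlo_se(p, trials):
    """Binomial standard error of an estimated proportion p from `trials` draws."""
    return math.sqrt(p * (1.0 - p) / trials)


gap = 0.50786 - 0.50730  # empirical minus uniform, from the text
for trials in (100_000, 5_000_000):
    se = monte_carlo_se(0.50786, trials)
    print(f"{trials:>9,} draws: se = {se:.5f}, gap = {gap / se:.1f} standard errors")
```

Under this assumption the gap is roughly 0.4 standard errors at 100,000 draws and roughly 2.5 at 5,000,000, which matches the claim that only the larger run can separate the two curves.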

The Single-Year Birthday Problem

Now, the deviation of the birth distribution from uniform in any fixed year should be higher than in the pooled data. Much of the variation that we would expect to see in any given year is smoothed out when the years are combined. For example:

  • Any given date will cycle through the days of the week over the 15-year period (2000–2014), so the deviation from uniform due to days of the week (discussed below) will be hidden.
  • Thanksgiving (the 4th Thursday in November in the US) is not on the same date every year. Reviewing the table of relative popularities of birthdays, we can see that there is a dip towards the end of November, but it is smoothed out by the slight change in the precise date each year.

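Both smoothing effects are easy to see with the standard library. The sketch below, with a hypothetical helper name, lists the weekdays on which July 4th fell and the November dates on which Thanksgiving fell during 2000–2014.

```python
from datetime import date, timedelta


def thanksgiving(year):
    """Date of the 4th Thursday of November in the given year."""
    nov1 = date(year, 11, 1)
    days_to_first_thursday = (3 - nov1.weekday()) % 7  # Monday=0, Thursday=3
    return nov1 + timedelta(days=days_to_first_thursday + 21)


years = range(2000, 2015)
july4_weekdays = {date(y, 7, 4).weekday() for y in years}
thanksgiving_days = sorted({thanksgiving(y).day for y in years})

print(len(july4_weekdays))   # 7
print(thanksgiving_days)     # [22, 23, 24, 25, 26, 27, 28]
```

July 4th lands on every weekday at least once, and Thanksgiving is spread across seven different November dates, so any single-date effect is diluted in the pooled data.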

Going back to the birthday problem, this might make us worry. If you ask a typical class of 4th graders in fall 2020, most were born between September 2010 and August 2011. We would expect higher variability than in the pooled empirical distribution because each of them is much less likely to have been born on a date that, say, happened to be Labor Day in the fall of 2010. This could lead to a different solution (perhaps n=22?) of the birthday problem.

To explore this, we can proceed as follows.


  1. We can quantify the difference. The Kullback–Leibler divergence (KL-div) is a measure of the difference between two probability distributions. For the dataset overall and for each year, we can compute the divergence of the distribution from uniform.

  2. We can re-solve the birthday problem using Monte Carlo methods, drawing from a distribution based on a single year. For simplicity, we’ll keep it to a calendar year.

For the first one, the chart below confirms what we expected. The divergence from uniform is about 10 times higher on an annual basis (0.021±0.006) than on an overall basis (0.002). There also appears to be an interesting trend in the data. The only explanation I can think of is the gradual cycling of the calendar through the 7 possible starting weekdays of the year. The cycle is disrupted by leap years, but roughly speaking we expect, for example, July 4th to move from Monday to Tuesday to Wednesday, etc., over a 6–7 year period.

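For reference, a divergence from uniform can be computed from daily counts in a few lines. This is a sketch under assumptions: the article does not state the direction of the divergence or the logarithm base, so the version below uses D(P‖U) in nats, and the function name and toy counts are made up.

```python
import math


def kl_from_uniform(counts):
    """KL divergence D(P || U) in nats, where P is the normalized counts."""
    total = sum(counts)
    k = len(counts)
    divergence = 0.0
    for c in counts:
        if c > 0:  # days with zero births contribute nothing to the sum
            p = c / total
            divergence += p * math.log(p * k)  # p * log(p / (1/k))
    return divergence


# Toy example with two "days": a 3:1 split versus an even split.
print(round(kl_from_uniform([3, 1]), 4))  # 0.1308
```

Applied once to the pooled counts and once to each year's counts, this kind of function would give the overall and annual figures being compared.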

[Figure: Kullback–Leibler divergence from uniform, by year and overall]

In terms of re-estimating the birthday problem with annual data, we can do so (using a smaller simulation size of 10,000 runs per n). The curves are slightly higher than the uniform distribution (shown in blue). The solution almost changes, but in fact n=23 still holds up for a room full of people all born in the same year. For n=22, the probability creeps up to as much as 49.1%±.05% for people born in 2009, compared to 47.6% for the uniform distribution.

[Figure: probability of a shared birthday versus n for single-year distributions, with the uniform curve in blue]

So, it turns out that the uniform assumption is good enough for this problem and there’s no need to fret.


The Distribution of Birthdays

Since we’re here, let’s see what other insights we can get from the data.


Day of the Week

You might be surprised to learn that weekends are not a popular time to give birth. The figure below shows the mean number of births on any given day of the week in our dataset. The 95% confidence intervals are shown and are tight, indicating a statistically significant³ difference between the weekend and weekdays, as well as a noticeably smaller number of births on Mondays.


[Figure: mean number of births by day of the week, with 95% confidence intervals]

There are two hypotheses you might want to keep in mind. Let’s call them the negative and positive hypotheses (for how you might feel about the reasons).

  • The negative one, based on previous conversations I’ve had with physicians, is that it’s inconvenient for the obstetricians (the doctors delivering the baby) to have a birth on a weekend. Why be on call all weekend when you can go to your summer house instead? If true, this would be a bit sad because C-sections are expensive to the US healthcare system and they would represent unnecessary surgeries with the attendant risks for the mother.


  • The positive one, which you can find in this FiveThirtyEight analysis of the data, is that hospitals are fully staffed during the week, so it’s safer and easier to schedule a C-section then.


You can read a bit more about this in that FiveThirtyEight article.


Regression Model

Overall, the general question that you might want to ask about this data is: what explains the variability in births on any given day? Stated this way, we have a classic regression problem. We want to model the data and see what we learn.


My goal here is to build a very quick and simple model. For example, we could use sophisticated time-series/forecasting tools like Facebook’s Prophet to keep track of a variety of holidays and the run-up to them, each of which could be entered as a regressor. I don’t want to get that fancy, so for covariates we’ll use:

  1. The month.

  2. The year (to account for secular trends/variation in births).

  3. The day of the week.

  4. A handful of holidays non-scientifically chosen by looking at the chart and thinking about major US holidays. The full list is in an appendix below. In total there are 17 holiday covariates.

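To make the covariate list concrete, here is one way a single date could be turned into a row of regressors. It is a sketch only: the column names are invented, just three of the 17 holiday indicators are shown, and the use of January and Monday as reference levels is an assumption of this example rather than a description of the repo's encoding.

```python
from datetime import date


def encode_covariates(d):
    """Build one row of regressors for date d (dummy coding with reference levels)."""
    row = {"years_since_2000": d.year - 2000}
    for month in range(2, 13):        # January is the omitted reference month
        row[f"month_{month}"] = int(d.month == month)
    for weekday in range(1, 7):       # Monday (weekday 0) is the omitted reference day
        row[f"weekday_{weekday}"] = int(d.weekday() == weekday)
    # Illustrative subset of the holiday indicators (fixed-date holidays only).
    row["valentines_day"] = int((d.month, d.day) == (2, 14))
    row["july_4"] = int((d.month, d.day) == (7, 4))
    row["christmas"] = int((d.month, d.day) == (12, 25))
    return row


row = encode_covariates(date(2010, 12, 25))
print(row["month_12"], row["weekday_5"], row["christmas"])  # 1 1 1
```

A regular Monday in January produces a row of zeros apart from the year term, so under this coding the intercept of a fitted model corresponds to that baseline day.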

Again, a more sophisticated analysis would use geographically disaggregated data and consider a variety of factors like the weather, local sports teams’ performance⁴, the (local) economy, and just about anything that broadly affects people and makes them and their partner (or surrogates, I guess) decide to get pregnant (or do so accidentally).


The resulting model is pretty good with an R² of 94%.⁵ This means that 94% of the variation in births is explained by the model. Here are the coefficients:


[Figure: Coefficients for our Regression Model for Births]

Each coefficient represents the expected change in number of births holding everything else constant. You can find the exact coefficients in the GitHub repo linked at the end. The orange bars show 95% confidence intervals. For example, on Christmas, the coefficient of approximately –5300 means that about 5300 fewer babies are born on Christmas in our dataset than we would otherwise expect after accounting for the month (December), day of the week (varying), and year.


For the Months table, the coefficient represents the change relative to January. Thus if we took a day in September versus January in the same year and held everything else constant (day of week, holidays), we would expect to see about 1140 more births on a given day in September. Likewise, for the day of the week, everything is compared to a Monday.


The intercept (not shown) is the number of births on a regular Monday in January, approximately 12,000 per day. There is also a slight decrease over time, with about 33 fewer births per day for each year after 2000, though further examination (not shown) shows the effect isn’t linear and shouldn’t be over-interpreted.

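As a worked example of reading the coefficients, the sketch below adds up the rounded values quoted in the text for a non-holiday Monday in September 2010. The constants are approximations taken from the prose, not the exact fitted coefficients, and the helper name is hypothetical.

```python
INTERCEPT = 12_000         # regular Monday in January (approximate, from the text)
SEPTEMBER_EFFECT = 1_140   # September relative to January (from the text)
YEAR_TREND = -33           # change in births per day for each year after 2000


def expected_births(years_since_2000, month_effect=0, weekday_effect=0, holiday_effect=0):
    """Linear prediction: intercept plus the effects that apply to the day."""
    return (INTERCEPT
            + YEAR_TREND * years_since_2000
            + month_effect
            + weekday_effect
            + holiday_effect)


# A Monday in September 2010: Monday is the reference weekday, so its effect is 0.
print(expected_births(10, month_effect=SEPTEMBER_EFFECT))  # 12810
```

Because the model is additive, a holiday or weekend day would simply add its own (typically negative) coefficient to this total.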

Valentine’s Day is popular for giving birth, while Memorial Day weekend seems just as good as any other weekend. Among the remaining holidays, Thanksgiving, Christmas, Labor Day, and Memorial Day are quite unpopular. We see substantial but smaller effects for the other days around major holidays.

In terms of months, we see a clear trend for births rising from the spring into late summer/early fall, peaking in September. Presumably, this represents what parents were choosing to do with their time the previous winter. The day-of-week analysis is substantively similar to what we saw earlier.

The next question we want to ask is, are there any other days we missed? Displayed below are the residuals of the model, averaged over dates. You should interpret this as the relative popularity of a day after accounting for day of the week, month, and a limited selection of holidays.


Specifically, it shows the mean anomalous number of babies born/not born on each day. These residuals are all rounded to the nearest 10.

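Averaging residuals by calendar date could look like the sketch below. The record layout, the function name, and the toy numbers are all hypothetical; only the idea of averaging actual minus predicted births per date and rounding to the nearest 10 comes from the text.

```python
from collections import defaultdict
from datetime import date


def mean_residual_by_date(records):
    """records: iterable of (date, actual_births, predicted_births) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # (month, day) -> [sum of residuals, count]
    for d, actual, predicted in records:
        entry = totals[(d.month, d.day)]
        entry[0] += actual - predicted
        entry[1] += 1
    # round(x, -1) rounds to the nearest 10
    return {key: round(total / count, -1) for key, (total, count) in totals.items()}


# Made-up numbers for two Halloweens, for illustration only.
toy = [
    (date(2005, 10, 31), 11_700, 12_000),
    (date(2006, 10, 31), 11_660, 12_000),
]
print(mean_residual_by_date(toy))  # {(10, 31): -320.0}
```

A negative value means fewer births than the model predicted for that date, which is how an unmodeled unpopular day shows up.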

[Figure: Unexplained Variation in Births Per Day, by date]

A few things jump out immediately. Halloween (October 31) is an unpopular day to give birth that we forgot to include. July 5 is unpopular. In my quick-and-dirty coding of holidays, the Monday after a weekend July 4th (which is often when the day is observed) would not have been included, likely explaining this. December 23rd is also unpopular.


The vertical blue stripe on the 13th indicates that, all other things equal, people also don’t like giving birth on the 13th! Nor (since 2001 at least) do they like giving birth on September 11. Looking at the month of December, we see that people prefer to give birth towards the later end of the month: either the week before Christmas or the week between it and New Years.


Conclusion

It’s pretty fascinating that people can somehow control when they give birth. It’s hard to say immediately whether it’s a subconscious phenomenon, something about the differing levels of stress, or a deliberate medical intervention, or something else.


We’ve seen that overall, the distribution of births is pretty close to uniform (close enough that the solution to the birthday problem is still n=23). On the other hand, we saw major variation on smaller time scales: in days of the week and when examined on an annual basis. But, even looking on an annual basis, it still wasn’t sufficient to change the solution to the birthday problem.


Finally, we looked into modeling the variation in births and saw that it is mostly explained by holidays, months, and days of the week. To the remaining unexplained variance, we can point to a few days we forgot to model: the 13th, Halloween, July 4 weekend, Sep 11, and the days around Christmas.


Appendix: Holiday List

There are a total of 17 covariates introduced by holidays. For each holiday, if any associated weekend days are included, all of those weekend days are coded as the same covariate. For example, there are two indicator variables for Labor Day: one for Labor Day itself, and one for Labor Day Weekend.

  1. New Years, New Years Eve, and the Day After (3)

  2. Valentines Day (1)

  3. Leap Day (1)

  4. Memorial Day and the preceding weekend (2)

  5. July 4th and any adjacent weekend days, if they exist (2)

  6. Labor Day and Labor Day Weekend (2)

  7. Thanksgiving, the day before, and the Fri–Sun following (3)

  8. Christmas, Christmas Eve, and the Day After (3)


Notes

[1] It is a sometimes forgotten fact that leap years occur every 4th year unless the year is divisible by 100 but with the caveat that the exception does not apply if the year is divisible by 400. That means that 1900 and 2100 are not leap years, but 2000 is alongside the usual ones: 1996, 2004, 2008 etc. There is no one currently known to be alive that has not experienced a leap year every 4th year during their life (and the data is from 2000–2014), so we can safely treat a leap year as occurring every 4 years for the purpose of our analysis. We will definitely ignore esoteric things like leap seconds which would not substantively affect our analysis.


[2] I am not going to give a formal proof that a non-uniform distribution should give a different curve. But, intuitively, the uniform distribution minimizes the probability of two people having the same birthday across all distributions supported on the days of the year. Any deviation from this (quantified by the Kullback–Leibler divergence) must necessarily cause the probability of two people having the same birthday to go up. In the repo, you can find an example where we keep skewing the data more and more, and the difference eventually becomes visible to the naked eye.

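For the simplest case of two people, the intuition in note [2] can be made precise. The derivation below is a sketch for n = 2 only; the symbols p_i (probability of being born on day i) and N (number of days) are notation introduced here for illustration.

```latex
P(\text{shared birthday}) = \sum_{i=1}^{N} p_i^2
  = \sum_{i=1}^{N} \left(p_i - \frac{1}{N}\right)^2 + \frac{1}{N}
  \;\ge\; \frac{1}{N},
\qquad \text{with equality if and only if } p_i = \frac{1}{N} \text{ for all } i.
```

The middle step uses the fact that the p_i sum to 1, so any departure from uniform adds a strictly positive squared term.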

[3] Technically we should account for multiple comparisons when assigning p-values. However, it is quite clear that the results will be statistically significant just from looking at the data.


[4] Personally speaking, I was born in Minneapolis 252 days after the Minnesota Twins won the World Series (the standard gestation time for a human is 280 days).

[5] For more data about the goodness of fit, see the GitHub repo. The RMSE of the fit is about 550 births per day, though the distribution has fat tails.

Translated from: https://towardsdatascience.com/how-popular-is-your-birthday-91ab133f7fc4
