NYC Subway math

cumei1658

Original link: https://www.pybloggers.com/2017/01/nyc-subway-math/

About Erik: Dad and CTO (Chief Troll Officer) at a fintech startup in NYC. Ex-Spotify, co-organizing NYC ML meetup, open source sometimes (Luigi, Annoy), blogs random stuff

NYC Subway math

Apparently MTA (the company running the NYC subway) has a real-time API. My fascination for the subway takes autistic proportions and so obviously I had to analyze some of the data. The documentation is somewhat terrible, but here’s some relevant code for how to use the API:

import os
import urllib.request

from google.transit import gtfs_realtime_pb2

for feed_id in [1, 2, 11]:
    feed = gtfs_realtime_pb2.FeedMessage()
    # The API key is read from the MTA_KEY environment variable
    response = urllib.request.urlopen('http://datamine.mta.info/mta_esi.php?key=%s&feed_id=%d' % (os.environ['MTA_KEY'], feed_id))
    feed.ParseFromString(response.read())
    print(feed)

I started tracking all subway trains one day and completely forgot about it. Several weeks later I had a 3GB data dump full of all the arrivals for 1, 2, 3, 4, 5, 6, L, SI and GC (the latter two being the Staten Island Railway and the Grand Central Shuttle).

Let’s do some cool stuff with this data!

For instance, here are a bunch of subway trains for a while on the 1 line:

The reason I started looking at this data was to understand to what extent waiting for a subway is “sunk cost” vs. an investment. In particular, what is the optimal strategy if you’re waiting for the subway? My intuition told me that there’s a T such that the expected additional time you have to wait goes down as you approach T, but then goes up afterwards. Until T, every second you wait gets you closer to the next subway. After T, there’s most likely some random issue with the subway and you should just give up.

Turns out there is such a thing. But let’s start by just looking at a plot of subway delays. The distribution of time between two trains P(t) looks like this:

This is a probability distribution made with the distplot function in Seaborn. It’s a histogram (with 1 minute bins) combined with a kernel density estimate of the probability distribution.
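
To make the histogram part concrete, here is a minimal sketch of how gaps between trains could be computed and binned. It assumes arrivals are timestamps in seconds for a single platform; the function names and the sample data are made up for illustration, and this is not the code behind the plot.

```python
from collections import Counter

def gaps_between_trains(arrival_times):
    """Return the gaps (in seconds) between consecutive arrivals."""
    ordered = sorted(arrival_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def histogram_minutes(gaps, bin_minutes=1):
    """Count gaps per bin; keys are each bin's lower edge in minutes."""
    bin_seconds = 60 * bin_minutes
    counts = Counter(int(gap // bin_seconds) * bin_minutes for gap in gaps)
    return dict(sorted(counts.items()))

# Hypothetical arrivals at one platform (seconds since some reference time).
arrivals = [0, 290, 610, 640, 1850, 2140]
gaps = gaps_between_trains(arrivals)
hist = histogram_minutes(gaps)
print(hist)
```

In practice the gaps would be computed per station and direction before pooling them, since mixing platforms would produce meaningless differences.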

An interesting thing is that the distribution is multimodal with the biggest peak around 5 minutes and another around 20 minutes. I suspect this reflects rush hour vs night traffic. There’s also a peak just after 0 which I suspect is just what happens during rush hour traffic when subways end up clustering.

Note that this is not the distribution of waiting times, which is a bit different. If you assume that you are equally likely to arrive at any subway stop at any time of day, then the waiting time until the next subway looks like the distribution below. This represents a probability distribution where at any time of day, you pick a subway line, go to a random subway station, and wait for the next train.

This distribution is a bit more regular. The most likely time you have to wait (the mode) is actually about 1 minute, although the mean and the median are much larger.
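
A waiting-time distribution like this can be approximated by dropping imaginary riders onto the timeline at random. The sketch below assumes a sorted list of arrival timestamps (seconds) for one platform; `sample_waiting_times` is a hypothetical helper, not something from the MTA API or from the original analysis.

```python
import bisect
import random

def sample_waiting_times(arrival_times, n_samples, seed=0):
    """Sample waits for riders who show up at uniformly random times."""
    ordered = sorted(arrival_times)
    rng = random.Random(seed)
    waits = []
    for _ in range(n_samples):
        # The rider shows up at a random moment between the first and last train.
        t = rng.uniform(ordered[0], ordered[-1])
        # Find the first train arriving at or after t.
        next_train = ordered[bisect.bisect_left(ordered, t)]
        waits.append(next_train - t)
    return waits

# Hypothetical timetable: a train exactly every 300 seconds.
arrivals = list(range(0, 3001, 300))
waits = sample_waiting_times(arrivals, n_samples=20000)
mean_wait = sum(waits) / len(waits)
print(round(mean_wait))
```

With perfectly regular trains the waits are spread evenly between zero and the gap length. With irregular gaps, riders are more likely to land in the long gaps, which is why the waiting-time distribution differs from the gap distribution.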

Erik please digress and talk about the relationship between the two curves

Complete side note, but I realized that in general you can take the distribution of time between events P(t) and convert it to the distribution of time to the next event Q(t) using the relation

$$Q(t) = \frac{ \int_t^\infty P(s) \, ds }{ \int_0^\infty s P(s) \, ds }$$

My math is a bit rusty so please don’t use this for heart surgery. But it seems to work: if you plug in a Dirac delta P(t) = δ(t − d) then you get the uniform distribution back: Q(t) = 1/d for 0 ≤ t ≤ d. In the data above I just implemented it in a dumb way by sampling.
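
One way to see where the relation comes from (my own sketch of the argument, not something spelled out in the post): a random point in time lands in a gap of length s with probability proportional to s P(s), and once you are inside a gap of length s, the remaining wait is uniform on [0, s], so it has density 1/s for t ≤ s. Writing μ for the mean gap:

$$Q(t) = \int_t^\infty \frac{s P(s)}{\mu} \cdot \frac{1}{s} \, ds = \frac{1}{\mu} \int_t^\infty P(s) \, ds, \qquad \mu = \int_0^\infty s P(s) \, ds$$

For the Dirac delta case P(s) = δ(s − d), the mean gap is μ = d, and the integral from t to infinity is 1 for t < d and 0 for t > d, which gives Q(t) = 1/d on 0 ≤ t ≤ d.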

Waiting time by line

Let’s plot the average time to arrival by line. This is limited to the lines in the API. Let’s switch to a violin plot using Seaborn.

Interestingly, L stacks up pretty well against the other subway lines, despite its notorious delays (and websites such as is the L train f*$cked). The median waiting time is the smallest out of all the lines, and even the extreme case compares favorably.
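
The per-line comparison is essentially a group-by followed by a median. A minimal sketch, assuming the data has been reduced to (line, wait in seconds) pairs; the record layout and the numbers are made up to show the shape of the computation.

```python
from collections import defaultdict
from statistics import median

def median_wait_by_line(records):
    """records: iterable of (line, wait_seconds). Returns {line: median wait}."""
    waits = defaultdict(list)
    for line, wait in records:
        waits[line].append(wait)
    return {line: median(values) for line, values in waits.items()}

# Made-up numbers, purely to show the input and output format.
records = [("L", 120), ("L", 200), ("L", 340),
           ("1", 150), ("1", 400), ("1", 260), ("1", 90)]
medians = median_wait_by_line(records)
print(medians)
```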

(Btw the chart uses MTA’s official color scheme. Did you know that the color of L is not a perfect gray but actually #A7A9AC, marginally more blue? Amazing)

Waiting time by time of day

Obviously time of day is an extremely important factor here so let’s look at the waiting time by time of day. Each point in time gives us a probability distribution over waiting time, so let’s plot some of the percentiles and how they change over the day!

The 50th percentile line (blue) describes the median time you have to wait based on the time of day. The 90th percentile line (yellow) describes how long you have to wait if you are unlucky: nine times out of ten, your wait will be shorter than this. Which line you pick depends on how risk averse you are: if you have to make it to a flight you should probably pick the 90th percentile, but if it doesn’t matter if you are late, pick the 50th.
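
For illustration, here is one way such percentile curves could be computed. The sketch assumes (hour of day, wait in minutes) samples and uses a simple nearest-rank percentile with integer percentages; both the helper names and the data are hypothetical, and a plotting library would normally interpolate slightly differently.

```python
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]

def percentiles_by_hour(samples, pcts=(50, 90)):
    """samples: iterable of (hour_of_day, wait_minutes). Returns {hour: {pct: value}}."""
    by_hour = defaultdict(list)
    for hour, wait in samples:
        by_hour[hour].append(wait)
    return {hour: {p: percentile(waits, p) for p in pcts}
            for hour, waits in sorted(by_hour.items())}

# Made-up waits (minutes) for a rush hour and a night hour.
samples = [(8, w) for w in [1, 2, 2, 3, 3, 4, 4, 5, 6, 9]]
samples += [(4, w) for w in [2, 5, 8, 10, 12, 15, 18, 20, 25, 40]]
table = percentiles_by_hour(samples)
print(table)
```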

Not shockingly, the waiting times peak in the wee hours — in particular the 90th percentile shoots up around 4AM. The 7AM-7PM window is very stable, and then it shoots up again.

Waiting for subway and sunk cost

Let’s say you wait for the subway for 10 minutes and it hasn’t arrived yet. Should you give up? Probably not. But if you have waited for the subway for an hour, there’s probably no point. Up to a certain point waiting for the subway is an investment in getting home sooner.

It also depends on your risk aversion again: if you need to make it to a flight, you might just give up and get a cab at some point. So given that you have spent t minutes so far waiting for the subway, what’s the additional time you’re going to have to wait?

There’s a tricky bias here, because the times where you waited longer tend to skew towards nights. This would be a confounding factor, so I limited the data set to 7AM-7PM.

The interesting conclusion is that after about five minutes, the longer you wait, the longer you will have to wait. If you waited for 15 min, the median additional waiting time is another 8 minutes. But 8 minutes later if the train still hasn’t come, the median additional waiting time is now another 12 minutes.

So when should you give up waiting? One way to think about it is how much time you think it’s worth waiting. The time you already waited is “sunk cost” so it doesn’t really matter. What matters is how much additional time you are willing to wait. Let’s assume you want to optimize for a wait time that’s less than 30 min in 90% of the cases. Then the max time you should wait is about 11 minutes until giving up (this is at the point where the yellow line cuts the 30 min mark).
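
The give-up rule can be sketched in a few lines. This assumes a list of sampled total waiting times in minutes and a nearest-rank percentile; `remaining_wait` and `give_up_after` are hypothetical names, and the sample data is invented, so the threshold it produces has nothing to do with the 11 minutes read off the real chart.

```python
def remaining_wait(waits, waited, pct):
    """pct-th percentile of the additional wait, given `waited` minutes already spent."""
    remaining = sorted(w - waited for w in waits if w > waited)
    if not remaining:
        return None  # nobody in the sample waited this long
    rank = max(1, (pct * len(remaining) + 99) // 100)  # nearest rank
    return remaining[rank - 1]

def give_up_after(waits, budget, pct=90, max_minutes=60):
    """First whole minute at which the pct-th percentile of remaining wait exceeds budget."""
    for waited in range(max_minutes + 1):
        r = remaining_wait(waits, waited, pct)
        if r is not None and r > budget:
            return waited
    return None

# Made-up total waits (minutes): mostly short, with a few very long ones.
waits = [3] * 40 + [7] * 40 + [14] * 17 + [45] * 3
threshold = give_up_after(waits, budget=30)
print(threshold)
```

Note that the time already waited only enters as the condition `w > waited`; the decision itself depends purely on the distribution of what is still to come, which is the sunk cost point.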

This reminds me a bit of project management. The longer a project has been going on, the larger the expected additional time is. Whatever resources you spent are sunk cost; what matters is the most likely estimate of project completion going forward. But of course the more overdue a project is, the longer that estimate is.

Of course, there’s nothing “magic” about these kinds of distributions. There are certain probability distributions where waiting is an “investment”: the expected time until the next event goes down for every second you wait. There is exactly one type of probability distribution where waiting doesn’t affect the time until the next event at all. This is the exponential distribution and the particular property is referred to as memorylessness. Then, there are “fat-tailed” distributions where the expected time to the next event goes up for every second you wait. The NYC subway distribution exhibits all those behaviors in different parts of the curve.
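
The three regimes are easy to see in a quick simulation. The sketch below picks one textbook example of each (uniform, exponential, Pareto); these choices and the helper `mean_remaining` are mine, chosen for illustration, and are not fitted to the subway data.

```python
import random

def mean_remaining(samples, waited):
    """Average additional time until the event, among samples that exceed `waited`."""
    remaining = [s - waited for s in samples if s > waited]
    return sum(remaining) / len(remaining)

rng = random.Random(42)
n = 200_000
uniform = [rng.uniform(0, 10) for _ in range(n)]        # waiting is an "investment"
exponential = [rng.expovariate(1.0) for _ in range(n)]  # memoryless
pareto = [rng.paretovariate(3.0) for _ in range(n)]     # fat-tailed

for name, samples, times in [("uniform", uniform, (2, 8)),
                             ("exponential", exponential, (0, 2)),
                             ("pareto", pareto, (1, 4))]:
    print(name, [round(mean_remaining(samples, t), 2) for t in times])
```

In theory the uniform case on [0, 10] gives (10 − t)/2, the exponential with rate 1 stays at 1 no matter how long you have waited, and a Pareto with shape 3 and minimum 1 gives t/2 for t ≥ 1, so the simulated numbers should land close to those values.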
