Handling Time in Modern Web Applications

by Ning Shi, VP of Engineering at Zoba. Ning loves distributed systems, high performance data processing and analytics, and can be found on Twitter at @ihsgnin.

Zoba provides demand forecasting and optimization tools to shared mobility companies, from micromobility to car shares and beyond.

A long time ago, in a time zone far far away…

Software engineers have been fighting the never ending battle for the one true representation of time that is simple, clearly defined, and unambiguous. Luckily for Zoba, we have already won half the battle by headquartering in the right time zone.

XKCD: Earth Standard Time

Time is a foundational component of today’s software, especially in distributed services and systems. Despite the fact that computer scientists have been working hard to tame time since the inception of computers, we still have fun challenges like time synchronization, leap seconds, time storage and representation, etc. Whether we will ever be able to conquer time, only time can tell (pun intended).

Although Zoba is a geospatial data science company, we care not only about where an event happens but also when. As a result, handling time series data is central to everything we build.

It is impossible to cover any time-related topic in depth in a single blog post. Each topic is worthy of its own research and already has decades of academic literature on it. In this post I will focus on the niche use cases we have at Zoba and some of the unique challenges we have come across in our work.

Zoba use case

At the heart of Zoba, we process a lot of events from customers and turn them into incredible insights and actions that help customers increase revenue. Our machine learning models continuously extract geospatial and temporal information from time series data.

Since our customers are in time zones that span 19 hours and all output date times need to be location aware, we have to do a fair amount of time zone conversions.

Zoba offers a suite of intelligence API endpoints and a dashboard for analytics. When querying metrics for a given city, results are computed based on events in the city’s time zone so that they reflect what happened in the city on any given day. This is essential for customers to compare the performance of different cities without having to worry about metrics being cut off at arbitrary day boundaries.
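
To make the day-boundary point concrete, here is a minimal sketch of turning a local calendar day into a UTC timestamp range. It uses Python's standard `zoneinfo` module rather than anything from our code base; the function name `local_day_bounds` and the choice of `America/Chicago` as Austin's zone are assumptions made for illustration.

```python
from datetime import date, datetime, time, timedelta
from zoneinfo import ZoneInfo


def local_day_bounds(day: date, tz_name: str) -> tuple[int, int]:
    """Return [start, end) POSIX timestamps covering one local calendar day.

    Hypothetical helper: assumes local midnight exists in the given zone,
    which holds for most zones but not all of them historically.
    """
    tz = ZoneInfo(tz_name)
    start = datetime.combine(day, time.min, tzinfo=tz)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz)
    return int(start.timestamp()), int(end.timestamp())


# The same calendar day covers different UTC ranges in different cities.
berlin = local_day_bounds(date(2020, 6, 1), "Europe/Berlin")
austin = local_day_bounds(date(2020, 6, 1), "America/Chicago")
print(berlin, austin)
```

Filtering events with `start <= ts < end` then yields the events for that city's day without converting each event to a local date time.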

Similarly to analytics, when we schedule tasks that deliver optimization results to customers, the schedule has to be aware of the cities’ time zones. An operations shift that deploys scooters at 5am in Berlin runs at a different time than a shift that goes out at 5am in Austin. A subsequent blog post will elaborate on the unique challenges we faced building such a scheduling system.

Unlike the two use cases above where it is critical to manage time zones correctly but less essential to maximize speed, the third use case requires both. In order to discover patterns in vehicle usage in a city, we group events together by time of week. Time of week is defined as the hours from the beginning of the Monday of that week. There are a total of 168 hours in a week. For example, operators see different fleet usage during the Monday morning rush hour versus on a Sunday morning, but Mondays across multiple consecutive weeks often share similar patterns.
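
As a rough illustration of the definition above, the sketch below computes the hour of week for a timestamp in a given city's time zone. It is the straightforward, non-optimized version built on the standard `zoneinfo` module; the function name `hour_of_week` is an assumption for this example, and it counts wall clock hours, so it glosses over the 167 and 169 hour weeks around DST transitions.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

HOURS_PER_WEEK = 7 * 24  # 168


def hour_of_week(ts: float, tz_name: str) -> int:
    """Wall clock hours since local Monday 00:00 (0..167) for a POSIX timestamp."""
    local = datetime.fromtimestamp(ts, tz=ZoneInfo(tz_name))
    # weekday() is 0 for Monday, which matches the Monday-based definition.
    return local.weekday() * 24 + local.hour


# One instant: early Monday morning in Berlin, still Sunday evening in Austin.
ts = datetime(2020, 6, 1, 2, 0, tzinfo=timezone.utc).timestamp()
print(hour_of_week(ts, "Europe/Berlin"))    # 4   (Monday 04:00)
print(hour_of_week(ts, "America/Chicago"))  # 165 (Sunday 21:00)
```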

As part of the preprocessing steps to the machine learning models, events that happen at a similar time of week are grouped together. The number of events passing through our data pipelines typically ranges from hundreds of thousands to tens of millions in a single run. The grouping not only has to be aware of the given city’s time zone, but also be efficient.

Most engineers who have worked intimately with date time in software development probably have nightmares of the time zone goblins chasing them in Eternity. The handling of time zones in modern software leaves much to be desired.

We will cover the different technical challenges we faced and share our learnings below.

Storage

The first challenge we had was choosing the proper time storage format. Since we store data from many different time zones, a standard time storage format across all types of data makes development easier and code less error prone.

One choice is to store all date times without time zone information, sometimes referred to as naive date time. Unless the usage is strictly limited and the time zone-less date times are never exposed, this approach is strongly discouraged. More often than not, this approach leads to error prone spaghetti code and errors that are extremely hard to debug.

We decided to separate time zone information from the date times and store the date times as POSIX timestamps (also known as Unix time) that are consistent throughout the code base. One common misconception is that POSIX timestamps (or timestamps for short) have no time zone information. On the contrary, POSIX timestamps are by definition always in UTC. A timestamp is defined as the number of seconds elapsed since the Unix epoch, which is 1970-01-01 00:00:00 UTC. All computers understand this concept because it is baked into the operating system’s kernel. If you accidentally use the elapsed seconds since 1970-01-01 00:00:00 of any time zone other than UTC as a timestamp, it will lead to confusion and weird bugs throughout the system stack.
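
The following small sketch illustrates both points with nothing but the standard library: the same instant has the same timestamp no matter which offset it is displayed in, and counting seconds from a local 1970-01-01 shifts the result by exactly that zone's offset. The fixed UTC-5 offset and the variable names are chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Timestamp 0 is the Unix epoch, expressed in UTC.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch.isoformat())  # 1970-01-01T00:00:00+00:00

# The same instant viewed at a fixed UTC-5 offset has the same timestamp.
utc_minus_5 = timezone(timedelta(hours=-5))
instant_utc = datetime(2020, 3, 8, 6, 0, tzinfo=timezone.utc)
instant_local = instant_utc.astimezone(utc_minus_5)
same_timestamp = instant_utc.timestamp() == instant_local.timestamp()
print(same_timestamp)  # True

# The mistake: counting seconds since a *local* 1970-01-01 00:00:00.
local_epoch = datetime(1970, 1, 1, tzinfo=utc_minus_5)
wrong = (instant_local - local_epoch).total_seconds()
offset_error = instant_utc.timestamp() - wrong
print(offset_error)  # 18000.0, i.e. five hours off
```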

One of the benefits of using timestamps is that you can represent any point in time using a single number (except when you really care about the leap seconds). It is much easier and more efficient to store an integer representation of a timestamp than its string form. We store timestamps as 64-bit signed integers, which is enough for a little over 292 billion years in the future. We will leave the overflow problem for another eon.

The other benefit of storing date times as timestamps, which is probably also the biggest, is that we can delay time zone conversion until the timestamps need to be displayed for human consumption. By doing this, we can have consistent time handling throughout the code base assuming date times are in UTC.

Cities rarely change time zones, so time zone information is stored as an attribute of a city. We use IANA time zone names to avoid ambiguity around daylight saving transitions. For example, instead of using Eastern Daylight Time (EDT) for New York which only applies during the summer months, we use America/New_York or its older form US/Eastern. When accepting time zone input from users, it is best practice to validate that it is in IANA form to avoid unexpected time zone bugs.
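
One possible way to validate user-supplied time zone names is sketched below, using the standard `zoneinfo` module instead of pytz. The helper name `is_valid_iana_zone` is an assumption for this example, and the result depends on the time zone database installed on the machine. Note that a few legacy abbreviations such as EST exist in the database as fixed-offset zones, so a stricter allow-list may be preferable in practice.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_valid_iana_zone(name: str) -> bool:
    """Return True if `name` resolves to a zone in the local tz database."""
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError):
        # Unknown keys raise ZoneInfoNotFoundError; malformed keys
        # (for example absolute paths) raise ValueError.
        return False
    return True


print(is_valid_iana_zone("America/New_York"))  # True
print(is_valid_iana_zone("EDT"))               # False: an abbreviation, not a zone
```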

There are more exotic time storage formats in various systems. One such format, commonly used for globally unique ID generation, is the number of seconds elapsed since the inception of the product, which saves some bits. For example, if the reference point is today, 32 bits can represent 136 years into the future. That should be plenty to let most engineers sleep tight. If POSIX time is used here, the same 32 bits can only represent 86 years into the future. This technique is common in distributed systems where time is only used for comparisons, not as wall clock time.
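
The arithmetic behind these numbers is easy to check. The sketch below also shows what a custom epoch encoding could look like; the launch date, `CUSTOM_EPOCH`, `encode` and `decode` are all hypothetical names picked for this example, not part of any real ID scheme.

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# How far 32 unsigned bits and 63 bits (signed 64-bit) of seconds reach.
years_32 = round(2**32 / SECONDS_PER_YEAR, 1)
billion_years_63 = round(2**63 / SECONDS_PER_YEAR / 1e9, 1)
print(years_32)          # 136.1
print(billion_years_63)  # 292.3

# Hypothetical product launch used as the reference point.
CUSTOM_EPOCH = int(datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp())


def encode(posix_ts: int) -> int:
    """Seconds since the custom epoch; must fit in 32 unsigned bits."""
    offset = posix_ts - CUSTOM_EPOCH
    if not 0 <= offset < 2**32:
        raise ValueError("timestamp outside the representable range")
    return offset


def decode(offset: int) -> int:
    """Inverse of encode: back to a POSIX timestamp."""
    return CUSTOM_EPOCH + offset
```

Because the encoded values keep their ordering, they can still be compared directly, which is all that ID generation usually needs.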

Daylight saving time

When we talk about time zones, it is impossible to ignore Daylight Saving Time (DST for short). It is worth noting that DST is not an intrinsic property of the universe. It is merely a human invention to remind ourselves that not all inventions can withstand the test of time.

Twice a year, some parts of the world wake up to find themselves with a day that is one hour longer or shorter than normal. If this is not confusing enough, imagine different parts of the world making this transition on different days. Sometimes even different parts of the same state¹ make different transitions. Unfortunately, this is the world we live in. If anything related to time handling will bite you in your professional life, it is almost guaranteed to be DST.

XKCD: Supervillain plan

If you only care about local wall clock time, DST probably matters less to your use case. If I wake up at 7am local time everyday, it’s the same wall clock time regardless of whether DST is in effect or not. However, if you ever want to compare two date times or make time zone conversions, making sure DST is taken into account is essential. Depending on whether DST is in effect or not, the difference between two time zones or two date times in the same or different time zones can be different.

Because DST is seasonal and location specific, it does not make sense to ask if DST is in effect without giving the specific date time and location. For this reason, it does not make sense to calculate the difference between date times or time zones without knowing if DST is in effect, either. If you use Python, this usually means that you have to localize a naive date time or normalize a time zone aware date time to get the proper time zone offset with DST taken into account.

from datetime import datetime, timedelta
from pytz import timezone


tz = timezone('America/New_York')
# Naive datetime without time zone information
dt = datetime(2020, 3, 8, 1)
# Datetime with time zone information
dt_tz = tz.localize(dt)

One of the pitfalls in DST handling is forgetting to correct the time zone offset after date time arithmetic. The textbook example is adding an hour during the “spring-forward” DST transition. For example, adding an hour to 2020-03-08 01:00:00 New York local time actually results in 2020-03-08 03:00:00, because 2am does not exist on 2020-03-08 in New York. How nice! Without normalizing the date time after the addition of an hour, you will end up with a date time with an incorrect time zone offset.

# dt_tz: 2020-03-08T01:00:00-05:00
new_dt = dt_tz + timedelta(hours=1)
# new_dt: 2020-03-08T02:00:00-05:00 (wrong time zone offset)
tz.normalize(new_dt)
# After normalization: 2020-03-08T03:00:00-04:00 (correct time zone offset)

On the other hand, an extra hour materializes out of thin air during the “fall-back” DST transition. 1am appears twice on 2020-11-01 in New York. In other words, every year people in New York will experience a 23-hour day and a 25-hour day in local time. UTC, however, only has 24-hour days. This is yet another reason to use UTC internally.
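
The repeated hour and the uneven day lengths can be observed directly. This sketch uses the standard `zoneinfo` module and the `fold` attribute from the standard library rather than pytz, and `day_length_hours` is a hypothetical helper written for this example. Day lengths are measured through timestamps because subtracting two date times that share the same tzinfo compares wall clock values only.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")

# 1am happens twice on 2020-11-01; `fold` selects which occurrence is meant.
first = datetime(2020, 11, 1, 1, 0, tzinfo=ny)           # fold=0: still EDT
second = datetime(2020, 11, 1, 1, 0, fold=1, tzinfo=ny)  # fold=1: back on EST
print(first.utcoffset(), second.utcoffset())


def day_length_hours(year: int, month: int, day: int, tz: ZoneInfo) -> float:
    """Elapsed hours between local midnight and the next local midnight."""
    start = datetime(year, month, day, tzinfo=tz)
    end = start + timedelta(days=1)  # wall clock arithmetic: next midnight
    return (end.timestamp() - start.timestamp()) / 3600


print(day_length_hours(2020, 3, 8, ny))   # 23.0 (spring forward)
print(day_length_hours(2020, 11, 1, ny))  # 25.0 (fall back)
print(day_length_hours(2020, 6, 1, ny))   # 24.0
```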

We have developed a good practice of always writing unit tests to cover DST transition days in multiple time zones for any code that manipulates time. It is highly recommended if you want production stable code.

Fast time bucketing

Python date time libraries like pytz or Pendulum make the above operations relatively easy. However, we have found them to be lacking in performance for our last use case, grouping millions of events by hour of week.

In a simple microbenchmark on a 2019 15-inch MacBook Pro, date time localization takes roughly 3 times longer than normalization, and about 100 times longer than pure integer arithmetic.

from datetime import datetime, timedelta
from pytz import timezone


tz = timezone('America/New_York')


# Naive datetime used only as benchmark input
dt = datetime.utcnow()
dt_tz = tz.localize(dt)


# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit tz.localize(dt)
# 14.9 µs ± 60.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit tz.normalize(dt_tz + timedelta(hours=1))
# 4.68 µs ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Even though the individual operations look very fast, repeating them tens of millions of times can add up to minutes for a single run.

Given that our use case is to find the right time of week buckets for UTC date times in a given city’s local time zone, we built a ring data structure using pure integer arithmetic. This approach yielded a 6x speed up and removed a major bottleneck in one of our data pipelines.

The ring data structure represents time of week as a point along the ring; the ring “wraps around” so that the end of one week is the start of the next. The ring has length 168, with each unit of length covering an hour of the week. It wraps around at the origin into the next or previous week depending on whether the direction of traversal is clockwise or counterclockwise. The 168 ranges can have arbitrary bucket names assigned to them. For example, 7am to 10am on Monday can be named “Monday rush hours”.

Ring data structure mapping hours of week to time buckets.
An illustration of a simple ring with min of 0 and max of 168. Min and max share the same spot on the ring. The ring is divided up into 4 ranges, [0, 30), [30, 84), [84, 126), [126, 168) with the labels R1, R2, R3, and R4, respectively. Note that it’s inclusive at range start and exclusive at range end. The number 11 falls into the first range since it’s between 0 and 30, thus having the label R1. The number 228 is greater than the max of 168, so it wraps around the ring and falls into the range [30, 84), thus having the label R2. Likewise, the number -10 is smaller than the min of 0, so it wraps around the ring backwards and falls into the range [126, 168), thus having the label R4.
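
The lookup described in the illustration can be sketched in a few lines. This is not our production implementation: it swaps in a sorted tuple with `bisect` for simplicity, and the class name `Ring` and its method names are assumptions made for this example.

```python
from bisect import bisect_right


class Ring:
    """Maps positions on a circular axis to labelled half-open ranges."""

    def __init__(self, starts, labels, period=168):
        if len(starts) != len(labels) or not starts or starts[0] != 0:
            raise ValueError("need one label per range, first range starting at 0")
        if list(starts) != sorted(starts) or starts[-1] >= period:
            raise ValueError("range starts must be sorted and below the period")
        # Tuples keep the structure immutable and compact once constructed.
        self._starts = tuple(starts)
        self._labels = tuple(labels)
        self._period = period

    def lookup(self, position):
        """Label of the range containing `position`, wrapping around the ring."""
        # Python's % yields a value in [0, period) even for negative input.
        wrapped = position % self._period
        index = bisect_right(self._starts, wrapped) - 1
        return self._labels[index]


ring = Ring([0, 30, 84, 126], ["R1", "R2", "R3", "R4"])
print(ring.lookup(11))   # R1
print(ring.lookup(228))  # R2 (wraps forward to 60)
print(ring.lookup(-10))  # R4 (wraps backward to 158)
```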

For any event that spans a time range, the ring makes it very easy for us to answer questions like “which bucket does the event start and end in?” and “what buckets does the event span?”. The second question would be hard to answer if we only had a lookup table from hour of week to bucket name. The ring comes in really handy in this case.

The ring can be thought of as a lookup table that also preserves ordering and supports infinite iteration in both directions. Given a week’s start time, we can locate a timestamp’s position on the ring in O(log n) time, where n is the number of buckets. Traversing the ring from start to end gives us the buckets covered by a time range in chronological order.
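
Answering “what buckets does the event span?” could look roughly like the sketch below. It reuses the four example ranges from the illustration, again uses a sorted tuple with `bisect` as a stand-in, and the function name `buckets_spanned` is hypothetical.

```python
from bisect import bisect_right

# Example ranges from the illustration: [0, 30), [30, 84), [84, 126), [126, 168)
STARTS = (0, 30, 84, 126)
LABELS = ("R1", "R2", "R3", "R4")
PERIOD = 168


def buckets_spanned(start, end):
    """Labels of the ranges touched by [start, end), in chronological order."""
    if end <= start:
        raise ValueError("end must be after start")
    labels = []
    position = start
    while position < end:
        wrapped = position % PERIOD
        index = bisect_right(STARTS, wrapped) - 1
        labels.append(LABELS[index])
        # Jump to the start of the next range, wrapping at the end of the week.
        next_start = STARTS[index + 1] if index + 1 < len(STARTS) else PERIOD
        position += next_start - wrapped
    return labels


print(buckets_spanned(10, 20))    # ['R1']
print(buckets_spanned(100, 200))  # ['R3', 'R4', 'R1', 'R2']
```

An event that crosses the end of the week simply keeps walking around the ring, which is what a flat lookup table from hour of week to bucket name cannot express directly.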

At a low level, the ring data structure is implemented as an immutable and compact binary search tree. The advantage of using such a data structure is that it is CPU cache friendly, and so it is faster when performing millions of lookups at once.

The tokens on the ring define the boundaries of different buckets. They are all floating point numbers representing hours of week to support buckets that don’t align on the hour. All ring operations are performed on timestamps to avoid the overhead of Python date time manipulation. When the final results are returned to the caller, they are converted back to time zone aware datetime objects.
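
To give a feel for why working on raw timestamps is cheap, here is a minimal sketch of computing an hour of week with integer arithmetic only. It is a simplification: it assumes the UTC offset is looked up once and stays constant for the batch being processed, which does not hold across a DST transition, and the names used are hypothetical.

```python
from datetime import datetime, timezone

SECONDS_PER_HOUR = 3600
HOURS_PER_WEEK = 168
# 1970-01-01 was a Thursday, i.e. 72 hours after Monday 00:00.
EPOCH_HOUR_OF_WEEK = 3 * 24


def hour_of_week(ts: int, utc_offset_seconds: int = 0) -> int:
    """Hours since Monday 00:00 for a POSIX timestamp at a fixed UTC offset."""
    local_hours = (ts + utc_offset_seconds) // SECONDS_PER_HOUR
    return (local_hours + EPOCH_HOUR_OF_WEEK) % HOURS_PER_WEEK


# Cross-check against the datetime-based calculation for one instant.
moment = datetime(2020, 6, 1, 2, 0, tzinfo=timezone.utc)  # a Monday, 02:00 UTC
ts = int(moment.timestamp())
print(hour_of_week(ts) == moment.weekday() * 24 + moment.hour)  # True
print(hour_of_week(ts, utc_offset_seconds=-5 * 3600))  # 165 (Sunday 21:00 at UTC-5)
```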

Because the ring stores static information unrelated to individual events, it can be constructed beforehand and cached throughout the run. The ring itself is immutable once constructed and all time zone conversions take place at the API layer. The ring can even be shared across multiple runs. The whole data structure is compact and can easily be serialized into several kilobytes for persistence.

As efficient as this approach is, it is not what we started with. If not for the significant test coverage we had on this use case and all the baselines we had built up over time to verify correctness, we would not have felt confident introducing such a sophisticated time manipulation solution in our core product. Our experience has taught us that when working with time, it is better to stick to the standards, use consistent approaches, and cover the corner cases because they matter.

In this post, I shared some of the technical challenges we have faced working with time series data but this is just one of the many engineering challenges we face at Zoba. No system is a silver bullet. Engineering is all about tradeoffs, even for the best use cases. Understanding the use case at hand will save you trouble when making tradeoffs. If you are interested in solving problems like these and would like to challenge yourself, we are hiring software engineers like you.

[1]: Arizona does not observe DST. However, the part of Navajo Nation inside Arizona does observe DST.

Translated from: https://medium.com/zoba-blog/datetime-7884313b52cb
