Byte Down: Making Netflix's Data Infrastructure Cost-Effective

By Torio Risianto, Bhargavi Reddy, Tanvi Sahni, Andrew Park

Background on data efficiency

At Netflix, we invest heavily in our data infrastructure which is composed of dozens of data platforms, hundreds of data producers and consumers, and petabytes of data.

At many other organizations, an effective way to manage data infrastructure costs is to set budgets and other heavy guardrails to limit spending. However, due to the highly distributed nature of our data infrastructure and our emphasis on freedom and responsibility, those processes are counter-cultural and ineffective.

Our efficiency approach, therefore, is to provide cost transparency and place the efficiency context as close to the decision-makers as possible. Our highest leverage tool is a custom dashboard that serves as a feedback loop to data producers and consumers — it is the single holistic source of truth for cost and usage trends for Netflix’s data users. This post details our approach and lessons learned in creating our data efficiency dashboard.

Netflix's data platform landscape

Netflix’s data platforms can be broadly classified as data at rest and data in motion systems. Data at rest stores such as S3 Data Warehouse, Cassandra, Elasticsearch, etc. physically store data and the infrastructure cost is primarily attributed to storage. Data in motion systems such as Keystone, Mantis, Spark, Flink, etc. contribute to data infrastructure compute costs associated with processing transient data. Each data platform contains thousands of distinct data objects (i.e. resources), which are often owned by various teams and data users.

Creating usage and cost visibility

To get a unified view of cost for each team, we need to be able to aggregate costs across all these platforms while also retaining the ability to break them down by a meaningful resource unit (table, index, column family, job, etc.).

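To make this concrete, such a unified view can be modeled as a common cost record that each platform pipeline emits, which can then be rolled up by team or broken down by resource. The sketch below is purely illustrative; the field names are hypothetical, not Netflix's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CostRecord:
    # Hypothetical unified cost record; field names are illustrative.
    platform: str               # e.g. "s3_warehouse", "cassandra", "spark"
    resource_id: str            # table, index, column family, job, etc.
    owner_team: str             # resolved from resource metadata
    annualized_cost_usd: float

def cost_by_team(records: list[CostRecord]) -> dict[str, float]:
    """Roll platform-level cost records up to a per-team total."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.owner_team] += r.annualized_cost_usd
    return dict(totals)
```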

Data flow

[Figure: data flow for usage and cost visibility]

• S3 Inventory: provides a list of objects and their corresponding metadata (such as size in bytes) for S3 buckets that are configured to generate inventory lists.
• Netflix Data Catalog (NDC): an in-house federated metadata store that represents a single comprehensive knowledge base for all data resources at Netflix.
• Atlas: a monitoring system that generates operational metrics for a system (CPU usage, memory usage, network throughput, etc.).

Cost calculations and business logic

As the source of truth for cost data, AWS billing is categorized by service (EC2, S3, etc.) and can be allocated to various platforms based on AWS tags. However, this granularity is not sufficient to provide visibility into infrastructure costs by data resource and/or team. We have used the following approach to further allocate these costs:

EC2-based platforms: Determine bottleneck metrics for the platform, namely CPU, memory, storage, IO, throughput, or a combination. For example, Kafka data streams are typically network bound, whereas Spark jobs are typically CPU and memory bound. Next, we identify the consumption of bottleneck metrics per data resource using Atlas, platform logs, and various REST APIs. Cost is allocated based on the consumption of bottleneck metrics per resource (e.g., % CPU utilization for Spark jobs). The detailed calculation logic can vary depending on a platform's architecture. The following is an example of cost attributions for jobs running in a CPU-bound compute platform:

[Figure: example cost attribution for jobs in a CPU-bound compute platform]
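
In code, the proportional split for a CPU-bound platform might look like the following minimal sketch (the function name and the numbers are illustrative, not the actual platform logic):

```python
def attribute_platform_cost(platform_cost_usd: float,
                            cpu_core_hours_by_job: dict[str, float]) -> dict[str, float]:
    """Allocate a CPU-bound platform's bill across jobs, proportional to
    each job's share of total CPU core-hours consumed."""
    total = sum(cpu_core_hours_by_job.values())
    return {job: platform_cost_usd * hours / total
            for job, hours in cpu_core_hours_by_job.items()}

# Example: a $10,000 bill split across three jobs by CPU consumption.
print(attribute_platform_cost(10_000, {"job_a": 600, "job_b": 300, "job_c": 100}))
# -> {'job_a': 6000.0, 'job_b': 3000.0, 'job_c': 1000.0}
```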

S3-based platforms: We use AWS's S3 Inventory (which has object-level granularity) to map each S3 prefix to the corresponding data resource (e.g. a Hive table). We then translate storage bytes per data resource into cost, based on S3 storage prices from AWS billing data.

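As a rough sketch of that translation, the following aggregates inventory rows into per-table storage cost. The prefix-to-table mapping and the price constant are illustrative assumptions; actual billed rates come from AWS billing data.

```python
from collections import defaultdict
from typing import Iterable

# Illustrative assumptions: a tiny prefix-to-table map and a flat
# standard-class list price; real rates come from AWS billing data.
PREFIX_TO_TABLE = {"warehouse/prod/playback_events": "prod.playback_events"}
S3_USD_PER_GB_MONTH = 0.023

def storage_cost_by_table(inventory_rows: Iterable[tuple[str, int]]) -> dict[str, float]:
    """inventory_rows: (s3_prefix, size_bytes) pairs from S3 Inventory."""
    costs: dict[str, float] = defaultdict(float)
    for prefix, size_bytes in inventory_rows:
        table = PREFIX_TO_TABLE.get(prefix)
        if table is None:
            continue  # prefix not mapped to a known data resource
        costs[table] += size_bytes / 1024**3 * S3_USD_PER_GB_MONTH
    return dict(costs)
```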

Dashboard view

We use a Druid-backed custom dashboard to relay cost context to teams. The primary target audiences for our cost data are the engineering and data science teams, as they have the best context to act on such information. In addition, we provide cost context at a higher level for engineering leaders. Depending on the use case, cost can be grouped by the data resource hierarchy or the org hierarchy. Both snapshot and time-series views are available.

Note: The following snippets containing costs, comparable business metrics, and job titles do not represent actual data and are for ILLUSTRATIVE purposes only.

[Figure: illustrative summary facts showing annualized costs and comparable business metrics]
[Figure: illustrative annualized data cost split by organization hierarchy]
[Figure: illustrative annualized data cost split by resource hierarchy for a specific team]
[Figure: illustrative time-series showing week-over-week cost (annualized) for a specific team, by platform]

Automated storage recommendations — Time to live (TTL)

In select scenarios where the engineering investment is worthwhile, we go beyond providing transparency and provide optimization recommendations. Since data storage has a lot of usage and cost momentum (i.e. save-and-forget build-up), we automated the analysis that determines the optimal duration of storage (TTL) based on data usage patterns. So far, we have enabled TTL recommendations for our S3 big data warehouse tables.

Our big data warehouse allows individual table owners to choose the length of retention. Based on these retention values, data stored in date-partitioned S3 tables is cleaned up by a data janitor process, which drops partitions older than the TTL value on a daily basis. Historically, most data owners did not have a good way of understanding usage patterns in order to decide the optimal TTL.

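For intuition, the janitor's retention rule can be sketched as follows; this helper is illustrative, not the actual janitor implementation:

```python
from datetime import date, timedelta

def expired_partitions(partition_dates: list[date], ttl_days: int,
                       today: date) -> list[date]:
    """Return the date partitions a daily janitor run would drop:
    anything older than the table's TTL (illustrative sketch)."""
    cutoff = today - timedelta(days=ttl_days)
    return [d for d in partition_dates if d < cutoff]

# Example: with a 90-day TTL, only the oldest partition is dropped.
print(expired_partitions([date(2020, 1, 1), date(2020, 6, 1)],
                         ttl_days=90, today=date(2020, 6, 15)))
# -> [datetime.date(2020, 1, 1)]
```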

Data flow

[Figure: data flow for TTL recommendations]

• S3 Access logs: AWS-generated logs for any S3 request, providing detailed records of which S3 prefix was accessed, the time of access, and other useful information.
• Table Partition Metadata: generated from an in-house metadata layer (Metacat) that maps a Hive table and its partitions to a specific underlying S3 location and stores this metadata. This is used to map S3 access logs to the DW table accessed in each request.
• Lookback days: the difference between the date partition accessed and the date on which it was accessed (e.g. reading the 2020-01-01 partition on 2020-03-01 gives a lookback of 60 days).

Cost calculations and business logic

The largest S3 storage cost comes from transactional tables, which are typically partitioned by date. Using S3 access logs and the S3 prefix-to-table-partition mapping, we can determine which date partitions are accessed on any given day. Next, we look at access (read/write) activity over the last 180 days and identify the maximum lookback days; this maximum determines the ideal TTL for a given table. In addition, we calculate the potential annual savings (based on today's storage level) that the optimal TTL would realize.

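A minimal sketch of that calculation follows, assuming access events arrive as (partition_date, access_date) pairs from the access logs, and, for the savings estimate only, that storage is spread roughly evenly across date partitions (the real pipeline need not make that assumption):

```python
from datetime import date

def recommend_ttl(access_events: list[tuple[date, date]]) -> int:
    """Ideal TTL = maximum lookback (access_date - partition_date) seen
    in the last 180 days of access (read/write) activity."""
    return max((accessed - partition).days for partition, accessed in access_events)

def annual_savings(current_bytes: int, current_ttl_days: int,
                   recommended_ttl_days: int,
                   usd_per_gb_month: float = 0.023) -> float:
    """Rough annual savings at today's storage level, assuming bytes are
    roughly uniform across date partitions (illustrative assumption)."""
    freed = max(0, current_ttl_days - recommended_ttl_days) / current_ttl_days
    return current_bytes / 1024**3 * freed * usd_per_gb_month * 12
```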

Dashboard view

From the dashboard, data owners can look at the detailed access patterns, recommended vs. current TTL values, as well as the potential savings.

[Figure: an illustrative example of a table with sub-optimal TTL]

Communication and alerting users

Checking data costs should not be part of any engineering team's daily job, especially for teams with insignificant data costs. To that end, we invested in email push notifications to raise data cost awareness among teams with significant data usage. Similarly, we send automated TTL recommendations only for tables with material cost-saving potential. Currently, these emails are sent monthly.

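The gating logic amounts to simple thresholding; the values below are illustrative placeholders, not Netflix's actual cut-offs:

```python
# Illustrative thresholds; the actual cut-offs are a product decision.
TEAM_COST_THRESHOLD_USD = 100_000     # annualized spend to warrant an email
TABLE_SAVINGS_THRESHOLD_USD = 10_000  # potential annual savings per table

def monthly_notification_targets(team_costs: dict[str, float],
                                 table_savings: dict[str, float]):
    """Pick only the teams and tables whose numbers are material enough to email."""
    teams = [t for t, c in team_costs.items() if c >= TEAM_COST_THRESHOLD_USD]
    tables = [t for t, s in table_savings.items() if s >= TABLE_SAVINGS_THRESHOLD_USD]
    return teams, tables
```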

Learnings and challenges

Identifying and maintaining metadata of assets is critical for cost allocation

What is a resource? What is the complete set of data resources we own? These questions form the primary building blocks of cost efficiency and allocation. As described earlier, we extract metadata for a myriad of platforms across in-motion and at-rest systems. Different platforms store their resource metadata in different ways. To address this, Netflix is building a metadata store called the Netflix Data Catalog (NDC). The NDC enables easier data access and discovery to support data management requirements for both existing and new data. We use the NDC as the starting point for cost calculations. Having a federated metadata store ensures a universally understood and accepted way of defining what resources exist and which resources are owned by individual teams.

Time trends are challenging

Time trends carry a much higher maintenance burden than point-in-time snapshots. Given data inconsistencies and ingestion latencies, showing a consistent view over time is often challenging. Specifically, we dealt with the following two challenges:

• Changes in resource ownership: for a point-in-time snapshot view, this change should be reflected automatically. However, for a time-series view, any change in ownership should be reflected in historical metadata as well.

• Loss of state in case of data issues: resource metadata is extracted from a variety of sources, many of which are API extractions, so it's possible to lose state if jobs fail during data ingestion. API extractions in general have drawbacks because the data is transient. It's important to explore alternatives, such as pumping events to Keystone, so that we can persist data for a longer period.

Conclusion

When faced with a myriad of data platforms and a highly distributed, decentralized data user base, consolidating usage and cost context to create feedback loops via dashboards provides great leverage in tackling efficiency. When reasonable, creating automated recommendations to further reduce the efficiency burden is warranted; in our case, data warehouse table retention recommendations had a high ROI. So far, these dashboards and TTL recommendations have contributed to over a 10% decrease in our data warehouse storage footprint.

What's next?

In the future, we plan to further push data efficiency by using different storage classes for resources based on usage patterns as well as identifying and aggressively deleting upstream and downstream dependencies of unused data resources.

Interested in working with large scale data? Platform Data Science & Engineering is hiring!

Translated from: https://netflixtechblog.com/byte-down-making-netflixs-data-infrastructure-cost-effective-fee7b3235032
