The Rise of the Data Engineer

by Maxime Beauchemin

I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.

I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely.

My team was at the forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs on traditional methods.

We were pioneers. We were data engineers!

Data Engineering?

Data science as a discipline was going through its adolescence of self-affirming and defining itself. At the same time, data engineering was the slightly younger sibling, but it was going through something similar. The data engineering discipline took cues from its sibling, while also defining itself in opposition, and finding its own identity.

Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization.

Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to data science.

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings in more elements from software engineering. The discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale.

In smaller companies — where no data infrastructure team has yet been formalized — the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.

In smaller environments people tend to use hosted services offered by Amazon or Databricks, or get support from companies like Cloudera or Hortonworks — which essentially subcontracts the data engineering role to other companies.

In larger environments, there tends to be specialization and the creation of a formal role to manage this workload as the need for a data infrastructure team grows. In those organizations, the role of automating some of the data engineering processes falls to both the data engineering and data infrastructure teams, and it’s common for these teams to collaborate to solve higher level problems.

While the engineering aspect of the role is growing in scope, other aspects of the original business intelligence role are becoming secondary. Areas like crafting and maintaining portfolios of reports and dashboards are not a data engineer’s primary focus.

We now have better self-service tooling, and analysts, data scientists and the general “information worker” are becoming more data-savvy and can take care of data consumption autonomously.

ETL is changing

We’ve also observed a general shift away from drag-and-drop ETL (Extract, Transform, and Load) tools towards a more programmatic approach. Product know-how on platforms like Informatica, IBM DataStage, Cognos, Ab Initio or Microsoft SSIS isn’t common amongst modern data engineers, and is being replaced by more generic software engineering skills, along with an understanding of programmatic or configuration-driven platforms like Airflow, Oozie, Azkaban or Luigi. It’s also fairly common for engineers to develop and manage their own job orchestrator/scheduler.

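To make the orchestrator idea concrete, here is a minimal sketch — not the API of Airflow, Luigi, or any real tool — of the core of a home-grown job scheduler: tasks form a DAG, and each task runs only after its upstream dependencies have completed. Real orchestrators layer scheduling, retries, and state persistence on top of exactly this.

```python
# Minimal sketch of a job orchestrator's core: a DAG of tasks executed
# in dependency order. Cycle detection, retries, and scheduling are
# deliberately omitted to keep the idea visible.

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: [upstream names]}.
    Returns the task names in the order they were executed."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)                # run dependencies first
        tasks[name]()                    # all upstreams done: execute
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

# A toy extract -> transform -> load pipeline:
results = []
dag_tasks = {
    "extract": lambda: results.append("raw rows"),
    "transform": lambda: results.append("clean rows"),
    "load": lambda: results.append("warehouse table"),
}
dag_deps = {"transform": ["extract"], "load": ["transform"]}
print(run_dag(dag_tasks, dag_deps))  # -> ['extract', 'transform', 'load']
```

The essential design point survives even in this toy: dependencies are declared as data, and execution order is derived from them rather than hard-coded.
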
There’s a multitude of reasons why complex pieces of software are not developed using drag-and-drop tools, but ultimately it’s because code is the best abstraction there is for software. While it’s beyond the scope of this article to argue this topic, it’s easy to infer that the same reasons apply to writing ETL as to any other software. Code allows for arbitrary levels of abstraction, allows for all logical operations in a familiar way, integrates well with source control, and is easy to version and collaborate on. The fact that ETL tools evolved to expose graphical interfaces seems like a detour in the history of data processing, and would certainly make for an interesting blog post of its own.

Let’s highlight the fact that the abstractions exposed by traditional ETL tools are off-target. Sure, there’s a need to abstract the complexity of data processing, computation and storage. But I would argue that the solution is not to expose ETL primitives (like source/target, aggregations, filtering) into a drag-and-drop fashion. The abstractions needed are of a higher level.

For example, a needed abstraction in a modern data environment is the configuration for the experiments in an A/B testing framework: what are all the experiments? What are the related treatments? What percentage of users should be exposed? What are the metrics that each experiment expects to affect? When is the experiment taking effect? In this example, we have a framework that receives precise, high-level input, performs complex statistical computation, and delivers computed results. We expect that adding an entry for a new experiment will result in extra computation and results being delivered. What is important to note in this example is that the input parameters of this abstraction are not the ones offered by a traditional ETL tool, and that building such an abstraction in a drag-and-drop interface would not be manageable.

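A hedged sketch of what such a declarative experiment entry might look like — every field name and the bucketing helper below are invented for illustration, not taken from any real framework:

```python
import hashlib

# Hypothetical declarative config: each entry fully describes one
# experiment. Field names are illustrative only.
EXPERIMENTS = [
    {
        "name": "new_signup_flow",
        "treatments": ["control", "variant_a"],
        "exposure_pct": 50,                      # % of users exposed
        "metrics": ["signup_rate", "retention_7d"],
        "start_date": "2017-01-15",
    },
]

def assign_treatment(experiment, user_id):
    """Deterministically bucket a user: hash into 0-99, expose the
    configured percentage, then split exposed users across treatments."""
    digest = hashlib.md5(f"{experiment['name']}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket >= experiment["exposure_pct"]:
        return None                              # not in the experiment
    return experiment["treatments"][bucket % len(experiment["treatments"])]
```

The point of the abstraction is that adding one dict to `EXPERIMENTS` is all it takes to launch an experiment; bucketing and, downstream, the statistical computations are all derived from the config rather than wired up by hand.
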
To a modern data engineer, traditional ETL tools are largely obsolete because their logic cannot be expressed in code. As a result, the abstractions needed cannot be expressed intuitively in those tools. Knowing that the data engineer’s role consists largely of defining ETL, and that a completely new set of tools and methodology is needed, one can argue that this forces the discipline to rebuild itself from the ground up. New stack, new tools, a new set of constraints, and in many cases, a new generation of individuals.

Data modeling is changing

Typical data modeling techniques — like the star schema — which defined our approach to data modeling for the analytics workloads typically associated with data warehouses, are less relevant than they once were. The traditional best practices of data warehousing are losing ground on a shifting stack. Storage and compute are cheaper than ever, and with the advent of distributed databases that scale out linearly, the scarcer resource is engineering time.

Here are some changes observed in data modeling techniques:

  • further denormalization: maintaining surrogate keys in dimensions can be tricky, and it makes fact tables less readable. The use of natural, human readable keys and dimension attributes in fact tables is becoming more common, reducing the need for costly joins that can be heavy on distributed databases. Also note that support for encoding and compression in serialization formats like Parquet or ORC, or in database engines like Vertica, address most of the performance loss that would normally be associated with denormalization. Those systems have been taught to normalize the data for storage on their own.

  • blobs: modern databases have growing support for blobs through native types and functions. This opens new moves in the data modeler’s playbook, and can allow fact tables to store multiple grains at once when needed.

  • dynamic schemas: since the advent of MapReduce, with the growing popularity of document stores and with support for blobs in databases, it’s becoming easier to evolve database schemas without executing DML. This makes it easier to take an iterative approach to warehousing, and removes the need to get full consensus and buy-in prior to development.

  • systematically snapshotting dimensions (storing a full copy of the dimension for each ETL schedule cycle, usually in distinct table partitions) as a generic way to handle slowly changing dimensions (SCD) is a simple approach that requires little engineering effort, and that, unlike the classical approach, is easy to grasp when writing ETL and queries alike. It’s also easy and relatively cheap to denormalize the dimension’s attributes into the fact table to keep track of their values at the moment of the transaction. In retrospect, complex SCD modeling techniques are not intuitive and reduce accessibility.

  • conformance, as in conformed dimensions and metrics, is still extremely important in a modern data environment, but with the need for data warehouses to move fast, and with more teams and roles invited to contribute to this effort, it’s less imperative and more of a tradeoff. Consensus and convergence can happen as a background process in the areas where the pain points of divergence become out of hand.

Also, more generally, it’s arguable that with the commoditization of compute cycles, and with more people being data-savvy than before, there’s less need to precompute and store results in the warehouse. For instance, you can have a complex Spark job that computes a complex analysis on demand only, and is not scheduled to be part of the warehouse.

Roles & responsibilities

The data warehouse

A data warehouse is a copy of transaction data specifically structured for query and analysis. — Ralph Kimball

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process. — Bill Inmon

The data warehouse is just as relevant as it ever was, and data engineers are in charge of many aspects of its construction and operation. The data engineer’s focal point is the data warehouse and gravitates around it.

The modern data warehouse is a more public institution than it was historically, welcoming data scientists, analysts, and software engineers to partake in its construction and operation. Data is simply too central to the company’s activity to have limitations around which roles can manage its flow. While this allows scaling to match the organization’s data needs, it often results in a much more chaotic, shape-shifting, imperfect piece of infrastructure.

The data engineering team will often own pockets of certified, high quality areas in the data warehouse. At Airbnb for instance, there’s a set of “core” schemas that are managed by the data engineering team, where service level agreements (SLAs) are clearly defined and measured, naming conventions are strictly followed, business metadata and documentation are of the highest quality, and the related pipeline code follows a set of well defined best practices.

It also becomes the role of the data engineering team to be a “center of excellence” through the definition of standards, best practices and certification processes for data objects. The team can evolve to partake in or lead an education program sharing its core competencies to help other teams become better citizens of the data warehouse. For instance, Facebook has a “data camp” education program and Airbnb is developing a similar “Data University” program, where data engineers lead sessions that teach people how to be proficient with data.

Data engineers are also the “librarians” of the data warehouse, cataloging and organizing metadata, and defining the processes by which one files or extracts data from the warehouse. In a fast growing, rapidly evolving, slightly chaotic data ecosystem, metadata management and tooling become a vital component of a modern data platform.

Performance tuning and optimization

With data becoming more strategic than ever, companies are growing impressive budgets for their data infrastructure. This makes it increasingly rational for data engineers to spend cycles on performance tuning and optimization of data processing and storage. Since the budgets are rarely shrinking in this area, optimization is often coming from the perspective of achieving more with the same amount of resources or trying to linearize exponential growth in resource utilization and costs.

Knowing that the complexity of the data engineering stack is exploding, we can assume that the complexity of optimizing such stacks and processes can be just as challenging. Where it can be easy to get huge wins with little effort, the law of diminishing returns typically applies.

It’s definitely in the interest of the data engineer to build [on] infrastructure that scales with the company, and to be resource conscious at all times.

Data Integration

Data integration, the practice behind integrating businesses and systems through the exchange of data, is as important and as challenging as it’s ever been. As Software as a Service (SaaS) becomes the new standard way for companies to operate, the need to synchronize referential data across these systems becomes increasingly critical. Not only does SaaS need up-to-date data to function — we often want to bring the data generated on its side into our data warehouse so that it can be analyzed alongside the rest of our data. Sure, SaaS products often have their own analytics offering, but they systematically lack the perspective that the rest of your company’s data offers, so more often than not it’s necessary to pull some of this data back.

Letting these SaaS offerings redefine referential data without integrating and sharing a common primary key is a disaster that should be avoided at all costs. No one wants to manually maintain two employee or customer lists in two different systems — and worse: to have to do fuzzy matching when bringing their HR data back into their warehouse.

Worse, company executives often sign deals with SaaS providers without really considering the data integration challenges. The integration workload is systematically downplayed by vendors to facilitate their sales, and leaves data engineers stuck doing unaccounted, underappreciated work. Let alone the fact that typical SaaS APIs are often poorly designed, unclearly documented, and “agile”: meaning that you can expect them to change without notice.

Services

Data engineers are operating at a higher level of abstraction and in some cases that means providing services and tooling to automate the type of work that data engineers, data scientists or analysts may do manually.

Here are a few examples of services that data engineers and data infrastructure engineers may build and operate.

  • data ingestion: services and tooling around “scraping” databases, loading logs, fetching data from external stores or APIs, …

  • metric computation: frameworks to compute and summarize engagement, growth or segmentation related metrics

  • anomaly detection: automating data consumption to alert people when anomalous events occur or when trends are changing significantly

  • metadata management: tooling around allowing generation and consumption of metadata, making it easy to find information in and around the data warehouse

  • experimentation: A/B testing and experimentation frameworks are often a critical piece of a company’s analytics, with a significant data engineering component to them

  • instrumentation: analytics starts with logging events and attributes related to those events; data engineers have vested interests in making sure that high quality data is captured upstream

  • sessionization: pipelines that are specialized in understanding series of actions in time, allowing analysts to understand user behaviors

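As an illustration of the sessionization service in the list above, the core logic usually reduces to grouping a user's ordered event timestamps by an inactivity gap (a 30-minute timeout is a common convention, assumed here):

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that close a session

def sessionize(timestamps, timeout=SESSION_TIMEOUT):
    """Split one user's event timestamps (unix seconds) into sessions:
    a new session starts whenever the gap exceeds `timeout`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)      # continues the current session
        else:
            sessions.append([ts])        # gap too large: new session
    return sessions

events = [0, 60, 120, 4000, 4100]        # 4000 is > 30 minutes after 120
print(sessionize(events))  # -> [[0, 60, 120], [4000, 4100]]
```

A production pipeline would express the same logic per user over billions of events (for example as a window computation in a distributed engine), but the session definition itself is exactly this small.
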
Just like software engineers, data engineers should be constantly looking to automate their workloads and build abstractions that allow them to climb the complexity ladder. While the nature of the workflows that can be automated differs depending on the environment, the need to automate them is common across the board.

Required Skills

SQL mastery: if English is the language of business, SQL is the language of data. How successful a businessman can you be if you don’t speak good English? While generations of technologies age and fade, SQL still stands strong as the lingua franca of data. A data engineer should be able to express any degree of complexity in SQL using techniques like “correlated subqueries” and window functions. SQL/DML/DDL primitives are simple enough that they should hold no secrets for a data engineer. Beyond the declarative nature of SQL, she/he should be able to read and understand database execution plans, and have an understanding of what all the steps are, how indices work, the different join algorithms, and the distributed dimension within the plan.

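A tiny, self-contained illustration of both techniques, run through Python's built-in sqlite3 with invented table data (window functions require SQLite 3.25+, which ships with modern Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salaries (dept TEXT, name TEXT, amount INTEGER);
    INSERT INTO salaries VALUES
        ('eng', 'ada', 120), ('eng', 'bob', 100),
        ('ops', 'cat', 90),  ('ops', 'dan', 95);
""")

# Correlated subquery: each row is compared against an aggregate
# computed over that row's own department.
above_avg = conn.execute("""
    SELECT name FROM salaries s
    WHERE amount > (SELECT AVG(amount) FROM salaries
                    WHERE dept = s.dept)
    ORDER BY name
""").fetchall()
print(above_avg)  # -> [('ada',), ('dan',)]

# Window function: rank within each department without collapsing rows.
ranked = conn.execute("""
    SELECT dept, name,
           RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rk
    FROM salaries ORDER BY dept, rk
""").fetchall()
print(ranked)  # -> [('eng', 'ada', 1), ('eng', 'bob', 2), ('ops', 'dan', 1), ('ops', 'cat', 2)]
```

The two forms are complementary: the correlated subquery filters rows against a per-group aggregate, while the window function annotates every row with a per-group computation.
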
Data modeling techniques: for a data engineer, entity-relationship modeling should be a cognitive reflex, along with a clear understanding of normalization and a sharp intuition around denormalization tradeoffs. The data engineer should be familiar with dimensional modeling and the related concepts and lexical field.

ETL design: writing efficient, resilient and “evolvable” ETL is key. I’m planning on expanding on this topic in an upcoming blog post.

Architectural projections: like any professional in any given field of expertise, the data engineer needs to have a high-level understanding of most of the tools, platforms, libraries and other resources at their disposal: the properties, use cases and subtleties behind the different flavors of databases, computation engines, stream processors, message queues, workflow orchestrators, serialization formats and other related technologies. When designing solutions, she/he should be able to make good choices as to which technologies to use, and have a vision as to how to make them work together.

All in all

Over the past 5 years working in Silicon Valley at Airbnb, Facebook and Yahoo!, and having interacted profusely with data teams of all kinds working for companies like Google, Netflix, Amazon, Uber, Lyft and dozens of companies of all sizes, I’m observing a growing consensus on what “data engineering” is evolving into, and felt a need to share some of my findings.

I’m hoping that this article can serve as some sort of manifesto for data engineering, and I’m hoping to spark reactions from the community operating in the related fields!

Translated from: https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/
