Introducing Observable Self-Documenting ELT

Virevol ai is a collaborative visual shopping app. We run a remote-first, lean team, and have for 2+ years. To do our magic, we swim in a lot of data.

We get millions of rows from hundreds of sources every day, run our models on them, and serve them in a variety of ways. We are on the Google stack and have GCP cloud functions, services, containers, you name it. They trigger other events on file drops, queue messages, hooks, and schedules. Everything is event-based and provisioned just in time (we are stingy).

There are too many data paths to keep in our heads.

With observable ELT, we can trace the provenance of data down to each row, with matching logs, inputs, and versions of every job that touched it. We can do ELT accounting in dollars, a pipe dream for most CTOs. This enables us to rework the parts of the pipe that cost us the most, not all legacy code.

Here is how we keep sane, and how we ship fast while keeping data quality.

How data moves

ETL is a series of batch jobs. When one finishes, the next one is triggered. Sometimes two, sometimes they are Y-joined back to one.

You get a directed graph. Graphs are easy™.

Graphs go from top to bottom (or left to right if you like).

The complexity arises from handling what happens when some step in the middle breaks or takes too long. Think Rube Goldberg machine.

Or worse, nothing breaks, and you find the error days down the road while making sense of the data.

To manage this complex pipeline coordination, people use all sorts of great tools: Luigi, dbt, Airflow, etc.

It’s a very top-down way of instrumenting.

The end-user cares only about the last step. If the report needs to go out at 9am and the data isn’t ready, you’ll get a call. That’s your SLA.

If a service needs refreshed recommendations for a user every 10 minutes, that’s your SLA.

The end-user sadly is not impressed by DAGs or fancy tools.

When things break, a tearful junior developer tries to debug ‘why’ at 3am. She joined The Company to get her hands dirty with neural nets, but was bait-and-switched into doing ETL. Worse, it is called ‘Data Janitorial Work’ by some, not ‘Data Lorax’ as it should be.

She toils through some documentation of what should have happened. If the person who wrote that still works there, a bit of Slack. Then she goes through the logs to make sense of things. If she is really lucky, there might even be a mostly up-to-date runbook for this job. She hops from Pagerduty to JIRA, to wiki, to github, to logs.

Birds sing in the distance to welcome the break of dawn.

Observable ELT

We introduce a new concept called a Global Trace Identifier, or tid for short.

Every function, batch job or service that will transform data needs to get a tid for that specific run.

It provides all old tids, its version, and instance variables (e.g. X-AppEngine-QueueName, X-Appengine-Taskname). These are logged in BigQuery in the service_tracker table.

Here is what a call looks like in Python:

from requests import post
import json


GET_TID_URI='https://XXXX/get_service_trace_id'
TID_SERVICE_NAME='my_service'
TID_SERVICE_VERSION='1.0' # Automate this in CI/CD and read it in
TID_SERVICE_INSTANCE='default' # Get from Google environment variables
# For example
# TID_SERVICE_INSTANCE=f'{request.headers["X-Appengine-Queuename"]}-{request.headers["X-Appengine-Taskname"]}'


headers = {'content-type': 'application/json'}

prev_tids = []  # the tids already carried by the rows this run reads (its upstream runs)

data = {
    'service': TID_SERVICE_NAME,
    'instance': TID_SERVICE_INSTANCE,
    'version': TID_SERVICE_VERSION,
    'metadata': json.dumps({'prev_tids': prev_tids}) # A list of previous tids
}


r = post(GET_TID_URI, data=json.dumps(data), headers=headers)


assert(r.status_code == 200)
tid = r.json().get('tid')

It has to log that run with this tid.

When done, for every row it wrote, it has to set the last_tid column to this tid. If it is a new row, it also has to populate the tid column of that row. Scroll down to the “Slowly Changing Dimensions” section for optimizations.
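
As a minimal sketch of that bookkeeping (an assumption on our part: the rows come back as a pandas dataframe, as in the sample further down; the merge sample in the ELT section does the same thing in SQL):

import pandas as pd


def stamp_lineage(rows: pd.DataFrame, tid: str) -> pd.DataFrame:
    # Every row this run touched records the run in last_tid
    rows['last_tid'] = tid
    # Brand-new rows also record this run as their originating tid
    if 'tid' not in rows.columns:
        rows['tid'] = tid
    else:
        rows['tid'] = rows['tid'].fillna(tid)
    return rows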

Here is the deployed Google function that does all we need for this.

# Deployed function
import json
from datetime import datetime
from flask import abort  # Python Cloud Functions serve HTTP requests via Flask
import elt  # internal helper module that provides get_tid; a sketch of it appears below

def get_service_trace_id(request):
	if (request.method != 'POST'):
		abort(403)


	if (request.headers['content-type'] != 'application/json'):
		raise NameError("ERROR: JSON content-type expected")
	request_json = request.get_json(silent=True)
	if not request_json or "service" not in request_json:
		raise NameError("Please provide parameter 'service' name, 'instance' name optional")


	service = request_json["service"]
	instance = request_json["instance"] if "instance" in request_json else "default"
	version = request_json["version"] if "version" in request_json else None
	metadata = request_json["metadata"] if "metadata" in request_json else None


	return service_trace_id(service, instance, version, metadata)


def service_trace_id(service, instance='default', version=None, metadata=None):
	global service_tracker
	global bq_cl
	if not service_tracker or not bq_cl:
		__connect_service_tracker()


	tid = elt.get_tid(service, instance)
	rows = [
		{
			'service': service.upper(),
			'instance': instance,
			'tid': tid,
			'version': version,
			'metadata': metadata,
			'added_ts': datetime.now()
		}
	]


	errors = bq_cl.insert_rows(service_tracker, rows)


	if errors:
		print('Errors:')
		for error in errors:
			print(error)
	return json.dumps({'tid': tid})
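
The helpers above are not shown in the snippet; here is a minimal sketch of what they might look like, under our assumptions (the real elt.get_tid is internal, and any globally unique id per run will do; the table id is a placeholder):

import uuid
from google.cloud import bigquery

service_tracker = None  # BigQuery Table object for service_tracker, set lazily
bq_cl = None            # BigQuery client, set lazily


def __connect_service_tracker():
    global service_tracker, bq_cl
    bq_cl = bigquery.Client()
    # Columns: service, instance, tid, version, metadata, added_ts
    service_tracker = bq_cl.get_table('XXX.tracking.service_tracker')


def get_tid(service, instance):
    # Sketch of elt.get_tid: any unique identifier for this specific run works
    return f'{service.upper()}-{instance}-{uuid.uuid4().hex}'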

How does it work?

Let’s draw a graph outlining these cases:

(Figure: an example directed graph of processes A through F.)

Here A, B, C, …, F are all different processes. Each takes a thing, wakes up, sends a thing, and shuts down. But we do not need to have a documented graph like this one anywhere.

So how do we generate it?

From the perspective of a table row that got inserted at the end of D, we have the tid of A, and the last_tid of D, which traces to [B], and B traces to [A].

So we know: A -> B -> D -> this row. We can trace every row here to the run and code version that generated it.

Let’s do a more complex case.

From the perspective of the table row that got inserted at the end of F, we have the tid of A, and a last_tid of F, which traces to the tids for [E, D]. We can trace each one of the tids mentioned in the list to reverse the graph.
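
The walk itself needs nothing but service_tracker. Here is a minimal sketch of it (assuming the bq_cl client from above and a placeholder table name; metadata is the JSON we posted when fetching the tid):

import json


def trace_lineage(tid):
    """Walk prev_tids backwards from a row's last_tid and return the chain of runs."""
    chain = []
    frontier = [tid]
    while frontier:
        current = frontier.pop()
        rows = bq_cl.query(
            f"select service, tid, metadata from XXX.service_tracker where tid = '{current}'"
        ).result()
        for row in rows:
            chain.append((row.service, row.tid))
            meta = json.loads(row.metadata) if row.metadata else {}
            frontier.extend(meta.get('prev_tids', []))
    return chain  # e.g. [('D', ...), ('B', ...), ('A', ...)] for the simple case above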

When something breaks (F), we won’t get a row for it in the data warehouse. Our fearless developer can look at the previous run that references F as a prev_tid (they are timestamped). She can then debug the log of F’s run directly and will find the rows affected by that break in E.

She can do all this without asking anyone else. You still need runbooks for service F, right in github.

She goes from Pagerduty, to Google logging, and to github for code and issues. If she finds an issue with the service, she can update and push code for it right from there, which triggers CI/CD tests.

She can document right in code and point to the tid that helped her find that issue. The code is the best ‘current’ documentation. The table is a time series of pointers to it.

We can do more than that: we can audit how long each table takes to produce and which jobs along the chain cause it. We can figure out when the chains changed (ML job C got inserted into the mix).

We can figure out which legacy code needs rework by looking at the cost of each run for each service. This enables a new way to optimize. We optimize not by just ‘feeling’ which code irks us the most, but by an objective function — cost.

Think of it as profiling a program, except you see it in terms of functions and dollars, not subroutine timings.

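Here is a sketch of what that audit can look like from service_tracker alone (assumptions: added_ts is a TIMESTAMP and metadata holds the JSON posted above; joining per-job cost from a billing export would follow the same shape):

audit_sql = """
    with runs as (
        select tid, service, added_ts,
            json_extract_array(metadata, '$.prev_tids') as prev_tids
        from XXX.service_tracker
    )
    select
        child.service,
        count(*) as runs,
        avg(timestamp_diff(child.added_ts, parent.added_ts, second)) as avg_seconds_after_upstream
    from runs child, unnest(child.prev_tids) prev_tid
    join runs parent on parent.tid = json_value(prev_tid)
    group by child.service
    order by avg_seconds_after_upstream desc
"""
lag_by_service = bq_cl.query(audit_sql).to_dataframe()
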
Let’s talk about small but important details to get better results.

ELT not ETL

Since your data is in the cloud, moving data is the largest share of your bill. Do ELT instead of ETL wherever possible. The data will be transformed right on the server and saved back, with minimal transfers needed.

And it will be faster for another reason. BigQuery and most modern data warehouses are doing map-reduce for you on their large cluster of servers. You would have to spin up a k8s instance to get the same level of performance on a large load.

To get the most bang for your buck, use database-specific SQL, not abstractions, to transform data. Writing SQL is the right tool for the job in this case. Get over it.

How complex can SQL get? As an exercise, here is tf-idf being calculated in SQL, just for fun (in practice you’d use something like spaCy because you probably need more steps). You might also pick up styling tips from the sample that make SQL enjoyable.

For observable ELT, use a Google function that fetches the tid and runs the parameterized SQL from a dictionary.

Something like this:

def get_elt_queries():
	"""Gets parameterised queries for ELT"""
	return {
		# Parameters: service
		'input_data': """
			select
				data,
				added_ts
			from
				XXX.TABLE
			where
				service = @service
			order by
				added_ts desc
		""",
		# Parameters: tid
		'output_merge': """
			merge into target t
			using (
				select * except(tid),  -- assuming input_table carries its own tid column (exposed as last_tid below)
				@tid as tid,
				timestamp_utc as last_updated_ts, 
				tid as last_tid 
				from input_table
			) s
			on t.some_id = s.some_id
			when matched then
			update set
				t.last_updated_ts = s.last_updated_ts,
				t.last_tid = s.last_tid
			when not matched then
				insert row;
		"""
	}


# Deploy a wrapper around this function to wake up on file drop of some file
def sample_transform_table():
	input_data = get_input_df('input_data', 'some_service') # Returns pandas dataframe
	r = service_trace_id('sample_transform_table', ...., input_data['tid'].unique().tolist())
	tid = json.loads(r).get('tid')
	# When writing data back,
	# For new rows, insert this new tid into the tid column.
	# For updating rows, insert this new tid into the last_tid column.
  
	# Check out merge example to do all in one step (when possible)
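
The get_input_df helper is not shown; a minimal sketch of it, assuming the bq_cl client and the query dictionary above, could be:

from google.cloud import bigquery


def get_input_df(query_name, service):
    """Run one of the parameterised queries above and return the result as a pandas dataframe."""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter('service', 'STRING', service)]
    )
    return bq_cl.query(get_elt_queries()[query_name], job_config=job_config).to_dataframe()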

If you can do it all in SQL, then use the merge statement directly to be faster (see sample).

Oh, and make sure you have test cases for all this fancy SQL you now have.

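A sketch of what one such test can look like with pytest (the fixture tables are hypothetical; the idea is to run the parameterised merge against a tiny copy of input_table and target, then assert on the lineage columns):

from google.cloud import bigquery


def test_output_merge_stamps_lineage():
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter('tid', 'STRING', 'TEST-TID')]
    )
    # Point the queries at a small fixture copy of input_table / target before running this
    client.query(get_elt_queries()['output_merge'], job_config=job_config).result()
    rows = list(client.query(
        "select count(*) as n from target where tid = 'TEST-TID'"
    ).result())
    assert rows[0].n > 0  # new rows were stamped with this run's tid
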
Slowly Changing Dimensions

We use three other niceties that make life beautiful.

One is to wrap every table behind a view. So a table named ‘students’ becomes ‘v_students’. This enables easy refactoring. It’s something I came up with at Rent The Runway and kept.

The second is to use a version of SCD type 2.

For the most important tables, every row is only ever appended, and the table is partitioned by time. The view on top then gives us the latest version of that table using a rank() over added_ts.
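
For the ‘students’ / ‘v_students’ example above, that view is a thin rank() wrapper over the append-only table. A sketch (the key column and dataset name are made up):

latest_view_sql = """
    create or replace view XXX.v_students as
    select * except(rnk)
    from (
        select
            *,
            rank() over (partition by student_id order by added_ts desc) as rnk
        from XXX.students
    ) t
    where rnk = 1
"""
bq_cl.query(latest_view_sql).result()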

This might seem wasteful, but it keeps us sane, and the cost of storage isn’t as high as one might think. We can always drop old partitions when the table gets too large.

Lastly, every job is restartable. This is usually a standard for most mature data teams, so I won’t dwell on it.

With observable ELT, we had to invent a couple of new tools, but we do not need any other external tool for coordination. It coexists with Google Cloud Task Queues, Pub/Sub, and Apache Beam pipes.

So far we are happy. We will let you know if we still are once we have a slightly larger team servicing a billion customers.

These ideas can be extended to logging as well. We recently came across this thread on logging and are excited to try it out.

We would love to hear from you if this gave you any ideas about your own ETL flow. This decentralized approach should help you ship faster.

If you like to work on challenging problems and are an unconventional thinker, please reach out.

Translated from: https://medium.com/swlh/introducing-observable-self-documenting-elt-41aa8b124098
