BigQuery: Populating a Development Environment with Hashed Data

Introduction

A common dilemma I encounter when working on data engineering projects is coming up with a means to adequately test data processes in a development environment. This typically requires there to be some form of test data in a development environment. However, a development environment tends to be less secure than a production system. As such, many do not wish to store real data in their development environment. A common approach I see employed is to hash data from a production system and use it to populate the development environment.

While this approach makes sense, I’ve found implementing it often gets very convoluted and difficult to maintain. The aim of this post is to provide a little context into the problem and also a method for leveraging BigQuery’s built-in hashing functions to tackle this problem in an easier way.

Please note that in the examples used, there isn’t any sensitive data that needs to be hashed. These are just used to demonstrate the functionality. In a real-world scenario, there will probably be certain columns in your data that don’t need to be hashed as they don’t contain any sensitive information.

Problem

As a data platform evolves, it becomes more and more necessary to come up with a solution for testing your data processes. Consider the following scenario where BigQuery is being used as the data warehousing solution for a company. Over time, more and more processes have become dependent on the data warehouse.

[Image: the various users and tools feeding into the BigQuery data warehouse]

When seeing all the different users/tools that feed into our data warehouse, it’s pretty clear that having a development environment where we could test out changes to these processes would be very useful. In addition, there may also be processes happening within BigQuery that we would like to test, e.g. data modelling. In order to test these data processes before deploying to production, having some test data to work with in our development environment would be very beneficial.

The benefits of having test data to work with include:

  • Ability to test changes to data pipelines
  • Performing dry runs of queries and views before deploying to production
  • Testing data modelling

Incorporating these sorts of processes into our workflow will help prevent us from making mistakes when deploying to production and make the data warehouse more stable for everyone who uses it.

Requirements

As touched upon in the intro, there are a couple of requirements we want to meet when tackling this problem:

  1. We need to have data of similar characteristics to our production data in our development environment — size, structure, table names etc.
  2. The data should be hashed. Since this is a development environment, we do not want to populate it with real data.
  3. We would like to maintain typing, e.g. a FLOAT data type in production should ideally be a FLOAT in the development environment.

Using BigQuery’s Hash Functions

I’ve used the BigQuery Python client to demonstrate an example of how BigQuery’s built-in hashing functions can be leveraged to meet the above requirements. Essentially, the approach is to construct SQL query statements that will hash our data and then output it to destination tables in our development environment. The following are the hash functions supported in BigQuery:

  • FARM_FINGERPRINT
  • MD5
  • SHA1
  • SHA256
  • SHA512

Below is a table schema from a public BigQuery dataset bigquery-public-data.london_bicycles.cycle_stations. This table contains several different data types.

[Image: schema of the cycle_stations table]

A combination of BigQuery functions can be used to generate a SQL statement that hashes each field and returns a value of the same type as the input. If we were to hash the INTEGER field docks_count, this would look as follows:

FARM_FINGERPRINT(CAST(`docks_count` as STRING)) as `docks_count`

There are a couple of things happening here:

  • Firstly, the docks_count field gets cast to a STRING type as this is what the FARM_FINGERPRINT hash function accepts. Most data types in BigQuery can be cast to a STRING.

  • The FARM_FINGERPRINT function is then called which returns a hashed INTEGER value. Since this is the desired data type we want to be returned, nothing else needs to be done here.

The output for this field looks as follows:

[Image: the hashed docks_count output]

A similar process is carried out for each field and the results are then outputted to a target environment. For instance, the hashing SQL statement for a FLOAT field could look like:

CAST(FARM_FINGERPRINT(CAST(`{col}` as STRING)) as FLOAT64) as `{col}`

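Expressions like these can be generated per data type. Below is a minimal sketch of how such a type-to-expression mapping might look; the STRING and BOOLEAN templates are my own assumptions along these lines, not taken from the article:

```python
# Hypothetical mapping from BigQuery data types to type-preserving hashing
# expressions; `{col}` is substituted with the column name.
HASH_TEMPLATES = {
    "INTEGER": "FARM_FINGERPRINT(CAST(`{col}` as STRING)) as `{col}`",
    "FLOAT": "CAST(FARM_FINGERPRINT(CAST(`{col}` as STRING)) as FLOAT64) as `{col}`",
    # Assumption: MD5 returns BYTES, so TO_HEX converts it back to a STRING.
    "STRING": "TO_HEX(MD5(`{col}`)) as `{col}`",
    # Assumption: reduce the hash to a BOOL by checking its parity.
    "BOOLEAN": "MOD(ABS(FARM_FINGERPRINT(CAST(`{col}` as STRING))), 2) = 0 as `{col}`",
}

def build_select(source_table, schema):
    """Build a SELECT that hashes every column while preserving its type.

    schema: list of (column_name, data_type) pairs.
    """
    exprs = [HASH_TEMPLATES[dtype].format(col=col) for col, dtype in schema]
    return "SELECT\n  " + ",\n  ".join(exprs) + f"\nFROM `{source_table}`"

query = build_select(
    "bigquery-public-data.london_bicycles.cycle_stations",
    [("id", "INTEGER"), ("name", "STRING"), ("latitude", "FLOAT")],
)
```

Running the generated statement produces a table with the same column names and types as the source.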
Depending on the data type in question, you can use different combinations of functions to achieve the desired output. Some more examples are here. Executing SQL statements like this for each field in a table results in:

  • A table of hashed data in a development environment which has a schema that is the same as that in production

  • When hashing multiple tables, joins will still be possible across them. The hashing functions will always produce the same output value for a given input. I’ve found this to be very useful for testing out views and queries in a development environment. Since the JOIN statements will still work, it is possible to execute a dry run of these queries before deploying to production.
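Putting the pieces together, the whole job can be sketched with the BigQuery Python client. The destination table ID and column list here are illustrative, and this is a sketch rather than the article's exact script:

```python
def run_hash_job(sql, dest_table):
    """Execute a hashing query and write the results to a table in the
    development project, overwriting any previous copy."""
    # Lazy import: the google-cloud-bigquery package is only needed here.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination=dest_table,  # e.g. "dev-project.dataset.cycle_stations"
        write_disposition="WRITE_TRUNCATE",
    )
    return client.query(sql, job_config=job_config).result()

# A generated hashing SELECT, abbreviated to two columns for brevity:
sql = (
    "SELECT\n"
    "  FARM_FINGERPRINT(CAST(`id` as STRING)) as `id`,\n"
    "  FARM_FINGERPRINT(CAST(`docks_count` as STRING)) as `docks_count`\n"
    "FROM `bigquery-public-data.london_bicycles.cycle_stations`"
)

# With credentials configured, this would populate the development table:
# run_hash_job(sql, "dev-project.dataset.cycle_stations")
```

Because `FARM_FINGERPRINT` is deterministic, running this against two related tables preserves join keys across them.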

Other considerations

Performance

I benchmarked the script against several tables with the following characteristics:

  • Over 300 million rows
  • Over 130 GB in size (in BigQuery storage)
  • Over 30 columns

For these tables the hashing query jobs executed successfully and populated the target development table in around 2 minutes. I found the speed of this approach to be very useful. If new data is ingested into the production system, this process can quickly be executed to populate equivalent tables in the development environment to enable testing. Additionally, you may not need to hash the entirety of these tables. If the data in your warehouse is large, you could easily add a limit to the query statements as you see fit.

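For instance, a cap could be appended to the generated statement before it runs; the helper and row count below are illustrative:

```python
def with_row_limit(sql, n_rows):
    """Append a LIMIT clause so only a subset of production rows is hashed
    into the development table."""
    return f"{sql}\nLIMIT {int(n_rows)}"

limited = with_row_limit(
    "SELECT `id` FROM `prod-project.dataset.events`", 100000
)
```

Note that LIMIT reduces the rows written, not the bytes scanned; for large tables, sampling the source may be worth considering as well.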
Scheduling

It’s very easy to run this in Airflow or another scheduler to regularly update your development environment with data.

Security

If you wanted to be extra confident about the hashing being irreversible, it’s possible to add “salt” to the process. The data in the input fields could be extended with a random secret string using the CONCAT function, for example.

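A salted version of the per-column expression might look like the sketch below. The salt value shown is a placeholder; in practice it should come from a secret manager rather than being hard-coded:

```python
# Illustrative placeholder; store the real value in a secret manager.
SALT = "some-long-random-secret"

def salted_hash_sql(col, salt=SALT):
    """Concatenate a secret salt onto the stringified value before hashing,
    so outputs can't be matched against precomputed hashes of known values."""
    return (
        f"FARM_FINGERPRINT(CONCAT(CAST(`{col}` as STRING), '{salt}')) "
        f"as `{col}`"
    )

expr = salted_hash_sql("docks_count")
```

As long as the same salt is used for every table, joins across hashed tables keep working.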
Conclusion

BigQuery’s hashing functions are very easy to use and are a very performant method of data hashing. Since this functionality is built-in, there’s very little maintenance involved in using these functions, and it’s easy to leverage them to automate the process with minimal coding effort. Populating a development environment with data in this way does a good job of providing useful test data that is very similar in structure and size to real data. Overall, I’ve found there are a lot of benefits to this approach, and it has helped me a lot to test in our development BigQuery environment before deploying to production.

Translated from: https://medium.com/@darraghcopley/bigquery-populating-a-development-environment-with-hashed-data-af8900a241c5
