Data Lake on GCP Using Terraform


Back in the old days, dealing with physical infrastructure was a huge burden: it not only required teams of experts to manage but was also time-consuming. In the modern cloud computing era, however, you can deploy hundreds of computers instantly, with the click of a button, to solve your problems. Well, to be realistic, most day-to-day problems we are trying to solve won't require that much computing power.



What is infrastructure-as-code (IaC)?

Infrastructure as code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.


Wikipedia


So instead of placing a physical server on a rack, setting up all the cables, configuring the network, and installing the required operating systems, you can write code to do all of that. To use IaC, you typically use a source control repository, write code to configure your infrastructure, and either run the code locally or set up automation to execute it on every commit.


Why IaC?

Why bother setting up and learning new tools to manage your infrastructure when you can do it through every cloud provider's console, you ask? There are many advantages to having our infrastructure under version control.


Reproducibility

IaC enables you to create different environments with ease, especially when your project is complex, and there are a lot of moving parts. For example, after setting up a development environment with multiple projects, VPCs, storage, compute instances, and IAM, it is counter-productive to do the same thing for staging and production environment.


With IaC, you can replicate your environment with some minor changes to your codebase. You can even customize your environment to suit your needs. For instance, you can set up different machine types for prod and dev, or set up a more permissive IAM for dev.


Imagine an intern messes up your production environment: you can tear everything down and rebuild your environments with ease (given that your underlying data are unscathed).


Security

On a large project, you can have environments with complex permissions and policies. Not only do you have to worry about designing and setting up these environments, enforcing the policies and permissions can also be challenging.


With IaC, every change in your environment is versioned, so you know who made what changes (given that you limit admin permissions to only your IaC service). You can also periodically scan your environments for differences between the configurations and the actual environments. Most IaC services can then revert your environment to the configuration if they detect any alterations.


Collaboration

When your infrastructures are managed by code, you can share them between teams or use them for later projects. You can also generate infrastructure documentation automatically based on the configurations.


Getting started with IaC using Terraform on GCP

Great, so let's get started setting up the infrastructure for a data lake on Google Cloud Platform using Terraform. You can use any other IaC tool on any other cloud provider; I chose this combination because it is familiar to me.


Terraform is an open-source infrastructure-as-code software tool created by HashiCorp. It enables users to define and provision data center infrastructure using a declarative configuration language known as HashiCorp Configuration Language, or optionally JSON. Terraform manages external resources with “providers”. — Wikipedia


A summary of what we will be building in this project (image by author)

In this project, we will use Terraform code to provision resources and permissions for a data lake on GCP. The diagram above is a simplified version of a data lake, and we will write code to provision and set up everything on GCP.


Let me briefly walk through the architecture of a typical data lake on GCP (for simplicity, I only consider batch pipelines). You would generally have systems producing data, running on-premise or in other cloud providers/projects, that you need to connect to. You would connect to these systems via VPN or Interconnect for security purposes. Then you need an orchestration/staging server to extract data and load it into your storage buckets.


The data is then classified and loaded into different buckets. Raw data is typically ingested into the landing bucket. Data with sensitive customer information is treated separately (masking, de-identification, separate permission policies) and loaded into the sensitive bucket. The work bucket holds work-in-progress data for data engineers and data scientists, and the backup bucket holds backups of cleansed data.


The data is then loaded into the data warehouse, where it is separated based on the system of ingestion (different companies may do this differently). Data here is cleansed, normalized/denormalized as needed, and modeled for later use. Data from the warehouse is further modeled and aggregated and loaded into data marts. Data marts are often organized by business function, such as marketing, sales, and finance.


You can see that we have several layers and different teams accessing these layers. The architecture above is based on our work and specific business problems. Any constructive feedback is welcomed :)


Before we start

You will need to do the following for this project:


  • Download and set up the Terraform CLI: Use this getting started guide to install the Terraform CLI on your local machine.


  • Create a Google Cloud account: Sign up for a Google Cloud account, if you haven’t already. You will get $300 credit when signing up, more than enough to get you through this tutorial without spending a dollar.


  • Get your billing ID: Follow the guide here to find out your billing ID on GCP. You will need it for later use.


  • Install the gcloud CLI: Use this link to help you install the gcloud CLI locally.


You can view the full code for this project here.



Authenticate with GCP

First things first, we need to authenticate with GCP. Paste the following command into a terminal and follow the instructions.


gcloud auth application-default login

Set up main.tf

Create a main.tf file with the following content:


provider "google" {}

This sets the provider for our Terraform project. Initialize the project by running:

terraform init

Create projects

Now we can start setting up our infrastructure. We will start by creating separate projects for the data lake, the data warehouse, and the data marts. You can keep all of your settings in one giant main.tf file, but I recommend separating them by service. Let's create a new project.tf file where we will define our projects.

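The embedded gist does not survive in this copy of the article; a minimal project.tf consistent with the plan output shown further down might look like the following sketch (the billing account value is a placeholder you must fill in):

```hcl
# project.tf — a sketch; replace the project IDs and billing account with your own.
resource "google_project" "data-lake" {
  name            = "Data Lake"
  project_id      = "cloud-iac-data-lake"      # must be globally unique
  billing_account = ""                         # your billing ID goes here
}

resource "google_project" "data-warehouse" {
  name            = "Data Warehouse"
  project_id      = "cloud-iac-data-warehouse"
  billing_account = ""
}

resource "google_project" "data-marts" {
  name            = "Data Marts"
  project_id      = "cloud-iac-data-marts"
  billing_account = ""
}
```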

The first line defines the type of resource we want to create: google_project. The next bit, data-lake, is the name other resources use to refer to it. Replace project_id with a globally unique ID (include your name or your project), and billing_account with your own. Then run:

terraform apply

You will see output like so:


  # google_project.data-lake will be created
  + resource "google_project" "data-lake" {
      + auto_create_network = true
      + billing_account     = ""
      + folder_id           = (known after apply)
      + id                  = (known after apply)
      + name                = "Data Lake"
      + number              = (known after apply)
      + org_id              = (known after apply)
      + project_id          = "cloud-iac-data-lake"
      + skip_delete         = true
    }

  # google_project.data-warehouse will be created
  + resource "google_project" "data-warehouse" {
      + auto_create_network = true
      + billing_account     = ""
      + folder_id           = (known after apply)
      + id                  = (known after apply)
      + name                = "Data Warehouse"
      + number              = (known after apply)
      + org_id              = (known after apply)
      + project_id          = "cloud-iac-data-warehouse"
      + skip_delete         = true
    }

  # google_project.data-marts will be created
  + resource "google_project" "data-marts" {
      + auto_create_network = true
      + billing_account     = ""
      + folder_id           = (known after apply)
      + id                  = (known after apply)
      + name                = "Data Marts"
      + number              = (known after apply)
      + org_id              = (known after apply)
      + project_id          = "cloud-iac-data-marts"
      + skip_delete         = true
    }

This prompt details what Terraform will create. Study it to make sure the result matches what you intend, then type yes in the terminal.


You have successfully created three projects: Data Lake, Data Warehouse, and Data Marts! Go to the GCP console to verify your results. Note that you may have a limit of three projects per billing account, which may prevent you from proceeding further.


Define variables

Before moving on, let's talk about variables. You can see that in the Terraform code in project.tf above, we used specific names for our project IDs. That is not always the best way to go. Imagine we wanted to use the code somewhere else: we would have to change all the names manually.


Instead, we can define a variables.tf file that will be used throughout the project. We can have commonly used variables stored there. There are different types of variables that we can use, but I will use local variables for simplicity. You can read more about Terraform variables here.


locals {
  region     = "asia-southeast1"
  unique_id  = "cloud-iac"
  billing_id = ""
}

Create GCS resources

In a similar fashion to the three projects, we can create the four GCS buckets we require: the landing, sensitive, work, and backup buckets. Create a gcs.tf file and paste in the following:

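The original gist is missing here; a sketch of gcs.tf consistent with the description might look like this (the bucket naming scheme and force_destroy are assumptions, and the locals come from variables.tf above):

```hcl
# gcs.tf — one bucket per layer of the data lake.
resource "google_storage_bucket" "buckets" {
  for_each = toset(["landing", "sensitive", "work", "backup"])

  name          = "${local.unique_id}-${each.key}"   # bucket names are globally unique
  project       = google_project.data-lake.project_id
  location      = local.region
  force_destroy = true   # allow terraform destroy to delete non-empty buckets
}
```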

Run terraform apply and enter yes, and you will have created four buckets in our data lake project. In the code above, we use variables to refer to the project and the bucket regions. If we need to create the data lake again (perhaps for a different client or a different company), we only need to change the values in variables.tf. Pretty powerful stuff!


Configure ACL permissions for the GCS buckets

Now we need different permissions for different teams. For example, DE should have access to all buckets, while DS cannot access the sensitive bucket, can only read landing and backup, and can write to work. We can set that up easily with the following code:

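The gist is not preserved in this copy; a sketch of the ACLs under the stated policy might look like the following (the group emails are hypothetical placeholders, bucket names follow the naming scheme from variables.tf, and the entity syntax is "ROLE:group-<email>"):

```hcl
# Bucket ACLs — DE owns everything; DS has no access to sensitive,
# read-only on landing/backup, and write on work.
resource "google_storage_bucket_acl" "landing" {
  bucket = "${local.unique_id}-landing"
  role_entity = [
    "OWNER:group-de@yourcompany.com",
    "READER:group-ds@yourcompany.com",
  ]
}

resource "google_storage_bucket_acl" "sensitive" {
  bucket = "${local.unique_id}-sensitive"
  role_entity = [
    "OWNER:group-de@yourcompany.com",   # DS is deliberately absent here
  ]
}

resource "google_storage_bucket_acl" "work" {
  bucket = "${local.unique_id}-work"
  role_entity = [
    "OWNER:group-de@yourcompany.com",
    "WRITER:group-ds@yourcompany.com",  # DS can write work-in-progress data
  ]
}

resource "google_storage_bucket_acl" "backup" {
  bucket = "${local.unique_id}-backup"
  role_entity = [
    "OWNER:group-de@yourcompany.com",
    "READER:group-ds@yourcompany.com",
  ]
}
```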

We create Google Groups to manage people on different teams, which makes permission control easier (instead of listing ten different emails here, we only need one email per team).


Keep in mind that if a group email does not exist, the terraform command will fail.


Data warehouse

Next, we will create the datasets for our data warehouse. Referring back to the diagram, we have three source systems and thus will create three corresponding datasets. Unlike GCS, we can define BigQuery ACLs inside the google_bigquery_dataset definition.

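Again the gist is missing; a sketch of the datasets, with the ACLs embedded in the access blocks as described below, could look like this (dataset names and group emails are illustrative):

```hcl
# One dataset per source system in the data warehouse project.
resource "google_bigquery_dataset" "warehouse" {
  for_each = toset(["system1", "system2", "system3"])

  dataset_id = each.key
  project    = google_project.data-warehouse.project_id
  location   = local.region

  access {
    role           = "OWNER"
    group_by_email = "de@yourcompany.com"   # in production, prefer a service account
  }
  access {
    role           = "WRITER"
    group_by_email = "ds@yourcompany.com"
  }
  access {
    role           = "READER"
    group_by_email = "da@yourcompany.com"
  }
}
```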

We will configure the same ACL for all data warehouse datasets. DE will be the owner of those datasets (in a production environment, it is recommended to make a service account the owner), DS will be the writer, and DA will be the reader.


Data marts

For our data marts, we will have similar configurations to the data warehouse, but with different access permissions.

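The original code is not preserved; a sketch along the same lines as the warehouse datasets, with a permission mapping that is purely an assumption (mart names, roles, and group emails are illustrative):

```hcl
# One dataset per business function in the data marts project.
resource "google_bigquery_dataset" "marts" {
  for_each = toset(["marketing", "sales", "finance"])

  dataset_id = each.key
  project    = google_project.data-marts.project_id
  location   = local.region

  access {
    role           = "OWNER"
    group_by_email = "de@yourcompany.com"
  }
  access {
    role           = "WRITER"
    group_by_email = "da@yourcompany.com"      # analysts build the marts
  }
  access {
    role           = "READER"
    group_by_email = "business@yourcompany.com" # hypothetical per-function group
  }
}
```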


Compute engine

For the orchestration part, we will build a VPC network, an orchestration instance, and a static external IP address. If you read through the code, nothing complicated is going on. You can read the Terraform documentation on how to create an instance here.

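The embedded code is missing from this copy; a sketch covering the three pieces named above might look like the following (machine type, image, and zone suffix are assumptions):

```hcl
# A VPC network for the orchestration server.
resource "google_compute_network" "orchestration" {
  name                    = "orchestration-network"
  project                 = google_project.data-lake.project_id
  auto_create_subnetworks = true
}

# A static external IP so the address survives instance restarts.
resource "google_compute_address" "orchestration" {
  name    = "orchestration-ip"
  project = google_project.data-lake.project_id
  region  = local.region
}

# The orchestration/staging instance itself.
resource "google_compute_instance" "orchestration" {
  name         = "orchestration"
  project      = google_project.data-lake.project_id
  machine_type = "e2-standard-2"
  zone         = "${local.region}-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = google_compute_network.orchestration.id
    access_config {
      nat_ip = google_compute_address.orchestration.address
    }
  }
}
```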

IAM permissions

Last but not least, we need to set up IAM permissions for our projects. I will only provide an example for this part, but we can map each group to any role, like below.

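The example gist is missing here; a sketch of project-level bindings in the same spirit (the roles and group emails are illustrative, not the article's exact mapping):

```hcl
# DE gets broad access on the data lake project.
resource "google_project_iam_member" "de_data_lake" {
  project = google_project.data-lake.project_id
  role    = "roles/editor"
  member  = "group:de@yourcompany.com"
}

# DA gets read-only BigQuery access on the data marts project.
resource "google_project_iam_member" "da_data_marts" {
  project = google_project.data-marts.project_id
  role    = "roles/bigquery.dataViewer"
  member  = "group:da@yourcompany.com"
}
```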

Conclusion

In this mini-project, we have created most of the infrastructure necessary to run a data lake on Google Cloud. You might need to dive deeper and customize the code for your particular needs.


Clean up

Don't forget to clean up unused resources to avoid any accidental charges. With Terraform, you can run a single command to tear everything down:


terraform destroy

Hope you learned something :)


Originally published at: https://towardsdatascience.com/data-lake-on-gcp-using-terraform-469062a205ad
