Streaming Kafka to BigQuery with Dataflow


Disclaimer: I work at Google in the cloud team. Opinions are my own and not the views of my current employer.


Streaming analytics

Many organizations rely on the open-source streaming platform Kafka to build real-time data pipelines and applications. The same organizations are often looking to modernize their IT landscape and adopt BigQuery to meet their growing analytics needs. By connecting Kafka streaming data to BigQuery's analytics capabilities, these organizations can quickly analyze and act on data-derived insights as they happen, instead of waiting for a batch process to complete. This powerful combination enables real-time streaming analytics use cases such as fraud detection, inventory or fleet management, dynamic recommendations, predictive maintenance, and capacity planning.


Lambda, Kappa and Dataflow

Organizations have been implementing the Lambda or Kappa architecture to support both batch and streaming data processing. But both of these architectures have some drawbacks. With Lambda, for example, the batch and streaming sides each require a different code base. And with Kappa, everything is treated as a stream of data: even large files have to be fed to the stream processing system, which can sometimes impact performance.



More recently (2015), Google published the Dataflow model paper, which describes a unified programming model for both batch and streaming. One could say that this model is a Lambda architecture, but without the drawback of having to maintain two different code bases. Apache Beam is the open-source implementation of this model. Apache Beam supports many runners. In Google Cloud, Beam code runs best on the fully managed data processing service that shares its name with the whitepaper linked above: Cloud Dataflow.

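To make this concrete, here is a sketch of what launching a pipeline with different runners can look like, assuming a hypothetical Beam pipeline written with the Python SDK in a file called my_pipeline.py; only the launch options change, not the pipeline code:

# Run the (hypothetical) pipeline locally with the direct runner, handy for testing
python my_pipeline.py --runner=DirectRunner

# Run the exact same pipeline, unchanged, on the managed Dataflow service
python my_pipeline.py --runner=DataflowRunner \
  --project=<my-project> --region=europe-west1 \
  --temp_location=gs://<my-bucket>/tmp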

The following is a step-by-step guide on how to use Apache Beam running on Google Cloud Dataflow to ingest Kafka messages into BigQuery.


Environment setup

Let’s start by installing a Kafka instance.


Navigate to the Google Cloud Marketplace and search for “kafka”. In the list of solutions returned, select the Kafka solution provided by Google Click to Deploy, as highlighted in blue in the picture below.



Pick the region/zone where you want your VM to be located, for example europe-west1-b 🇧🇪. Leave the default settings for everything else (unless you want to use a custom network) and click “Deploy”.



Create a BigQuery table

While our VM is being deployed, let’s define a JSON schema and create our BigQuery table. It is usually best practice to create the BigQuery table up front, rather than having it created by the first Kafka message that arrives. This is because the first Kafka message might have some optional fields not set, so a BigQuery schema inferred from it using schema auto-detection would be incomplete.


Note that if your schema cannot be defined because it changes too frequently, storing the JSON as a single STRING column inside BigQuery is definitely an option. You could then use JSON functions to parse it.

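As a quick sketch, assuming a hypothetical table kafka_to_bigquery.raw_events with a single STRING column named payload holding the raw JSON, parsing it could look like this:

# Hypothetical table with one STRING column `payload` containing the raw JSON
bq query --use_legacy_sql=false '
SELECT
  JSON_EXTRACT_SCALAR(payload, "$.first_name") AS first_name,
  JSON_EXTRACT_SCALAR(payload, "$.city") AS city
FROM `kafka_to_bigquery.raw_events`
LIMIT 10'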

For the purpose of this article, we will be creating a table storing sample purchase events for multiple products. In a file named schema.json, copy/paste the following JSON:


[
  {
    "description": "Transaction time",
    "name": "transaction_time",
    "type": "TIMESTAMP",
    "mode": "REQUIRED"
  },
  {
    "description": "First name",
    "name": "first_name",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "description": "Last name",
    "name": "last_name",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "description": "City",
    "name": "city",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "description": "List of products",
    "name": "products",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "description": "Product name",
        "name": "product_name",
        "type": "STRING",
        "mode": "REQUIRED"
      },
      {
        "description": "Product price",
        "name": "product_price",
        "type": "FLOAT64",
        "mode": "NULLABLE"
      }
    ]
  }
]

To create our empty BigQuery table, we would ideally use an IaC tool like Terraform triggered by a CI/CD system. But that is maybe a topic for another article, so let’s just use the bq mk command to create our dataset and table.


Open Cloud Shell and upload the schema.json you created earlier:



Then, run the following commands in Cloud Shell to create our timestamp partitioned table. Don’t forget to replace <my-project> below with your GCP project ID:


gcloud config set project <my-project>
bq mk --location EU --dataset kafka_to_bigquery
bq mk --table \
  --schema schema.json \
  --time_partitioning_field transaction_time \
  kafka_to_bigquery.transactions  # "transactions" is an example table name, pick your own

Instead of using the bq mk command, you can also create a dataset and a table using the BigQuery web UI.

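Either way, you can quickly verify that the table was created with the expected schema, for example:

# "transactions" is the example table name used in the bq mk command above
bq show --schema --format=prettyjson kafka_to_bigquery.transactions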

Send a message to a Kafka topic

We are almost done with the environment setup! The final step is to create a Kafka topic and send a Kafka message to it. Navigate to the Google Cloud Console and open Compute Engine > VM instances. You should see our Kafka VM created earlier. Click the SSH button as highlighted in blue in the picture below.

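If you prefer the command line, the same SSH session can also be opened from Cloud Shell; a sketch, assuming the instance name shown in your VM list and the zone picked earlier:

# Replace <your-kafka-vm-name> with the instance name shown in the VM list
gcloud compute ssh <your-kafka-vm-name> --zone europe-west1-b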


In the terminal window that opens, enter the following command to create our Kafka topic, named txtopic:


/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 \
  --partitions 1 --topic txtopic

Confirm the topic has been created by listing the different topics. You should see txtopic being returned when entering the following command:
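# A sketch of the listing command, assuming the same local Zookeeper address used above
/opt/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181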

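Finally, to push a sample purchase event into the topic, one option is the console producer that ships with Kafka. A minimal sketch, assuming the broker listens on localhost:9092 (on newer Kafka versions the flag is --bootstrap-server rather than --broker-list) and using a message that matches the schema defined earlier:

# Sample purchase event matching schema.json; adjust the broker address/flag to your Kafka version
echo '{"transaction_time": "2020-07-20 15:45:00 UTC", "first_name": "John", "last_name": "Doe", "city": "Brussels", "products": [{"product_name": "book", "product_price": 12.5}]}' \
  | /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic txtopic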
