
by Aftab Ansari

How to import Google BigQuery tables to AWS Athena

As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is something you might have already faced or will face at some point.

In this post, I will show how I imported Google BigQuery tables to AWS Athena. If you only need a list of tools to be used with some very high-level guidance, you can quickly look at this post that shows how to import a single BigQuery table into Hive metastore. In this article, I will show one way of importing a full BigQuery project (multiple tables) into both Hive and Athena metastore.

There are a few import limitations: for example, when you import data from partitioned tables, you cannot import individual partitions. Please check the limitations before starting the process.

In order to successfully import Google BigQuery tables to Athena, I performed the steps shown below. I used AVRO format when dumping data and the schemas from Google BigQuery and loading them into AWS Athena.

Step 1. Dump BigQuery data to Google Cloud Storage

Step 2. Transfer data from Google Cloud Storage to AWS S3

Step 3. Extract AVRO schema from AVRO files stored in S3

Step 4. Create Hive tables on top of AVRO data, use schema from Step 3

Step 5. Extract Hive table definition from Hive tables

Step 6. Use the output of Steps 3 and 5 to create Athena tables

So why do I have to create Hive tables in the first place when the end goal is to have the data in Athena? This is because:

  • Athena does not support using avro.schema.url to specify the table schema.

  • Athena requires you to explicitly specify field names and their data types in the CREATE statement.

  • Athena also requires the AVRO schema in JSON format under avro.schema.literal.

  • You can check this AWS doc for more details.

So, Hive tables can be created directly by pointing to AVRO schema files stored on S3. But to have the same tables in Athena, the columns and schema are required in the CREATE TABLE statement.

One way to overcome this is to first extract the schema from the AVRO data, to be supplied as avro.schema.literal. Second, for the field names and data types required in the CREATE statement, create Hive tables based on the AVRO schemas stored in S3 and use SHOW CREATE TABLE to dump/export the Hive table definitions, which contain the field names and data types. Finally, create the Athena tables by combining the extracted AVRO schema and the Hive table definition. I will discuss this in detail in the subsequent sections.

For the demonstration, I have the following BigQuery tables that I would like to import to Athena.

So, let’s get started!

Step 1. Dump BigQuery data to Google Cloud Storage

It is possible to dump BigQuery data to Google Cloud Storage with the help of the Google Cloud UI. However, this can become a tedious task if you have to dump several tables manually.

To tackle this problem, I used Google Cloud Shell. In Cloud Shell, you can combine regular shell scripting with BigQuery commands and dump multiple tables relatively fast. You can activate Cloud Shell as shown in the picture below.

From Cloud Shell, the following operation provides the BigQuery extract commands to dump each table of the “backend” dataset to Google Cloud Storage.

bq ls backend | cut -d ' ' -f3 | tail -n+3 | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY <dataset>.@ gs://<bucket>/<path>/@/@-*.avro

In my case it prints:

aftab_ansari@cloudshell:~ (project-ark-archive)$ bq ls backend | cut -d ' ' -f3 | tail -n+3 | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY backend.@ gs://plr_data_transfer_temp/bigquery_data/backend/@/@-*.avro
bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_daily_phase2 gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-*.avro
bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_detailed_phase2 gs://plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-*.avro
bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_phase2 gs://plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-*.avro
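
If you would rather not copy and paste each printed command by hand, one small convenience (my own addition, not part of the original workflow) is to pipe the generated commands straight into bash so they run immediately:

# Same generator as above, but executing the generated commands instead of printing them.
bq ls backend | cut -d ' ' -f3 | tail -n+3 | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY backend.@ gs://plr_data_transfer_temp/bigquery_data/backend/@/@-*.avro | bash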

Please note --compression SNAPPY: this is important, as uncompressed, big files can cause the gsutil command (which is used to transfer data to AWS S3) to get stuck. The wildcard (*) makes bq extract split bigger tables (>1GB) into multiple output files. Running those commands on Cloud Shell copies the data to the following Google Cloud Storage directory:

gs://plr_data_transfer_temp/bigquery_data/backend/table_name/table_name-*.avro

Let's run ls to see the dumped AVRO file:

aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil ls gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2
gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro

I can also browse from the UI and find the data, as shown below.

Step 2. Transfer data from Google Cloud Storage to AWS S3

Transferring data from Google Storage to AWS S3 is straightforward. First, set up your S3 credentials. On Cloud Shell, create or edit the .boto file (vi ~/.boto) and add these lines:

[Credentials]
aws_access_key_id = <your aws access key ID>
aws_secret_access_key = <your aws secret access key>
[s3]
host = s3.us-east-1.amazonaws.com
use-sigv4 = True

Please note: s3.us-east-1.amazonaws.com has to correspond with the region where the bucket is.
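
If you are not sure which region your bucket lives in, one way to check (assuming the AWS CLI is already configured wherever you run it) is to ask S3 directly:

# Prints the bucket's region; a null LocationConstraint means us-east-1.
aws s3api get-bucket-location --bucket my-bucket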

After setting up the credentials, execute gsutil to transfer data from Google Storage to AWS S3. For example:

gsutil rsync -r gs://your-gs-bucket/your-extract-path/your-schema s3://your-aws-bucket/your-target-path/your-schema

Add the -n flag to the command above to display the operations that would be performed using the specified command without actually running them.
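
For example, a dry run of the transfer used later in this post would look like this:

# Dry run: shows what rsync would copy without actually transferring anything.
gsutil rsync -n -r gs://plr_data_transfer_temp/bigquery_data/backend s3://my-bucket/bq_data/backend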

In this case, to transfer the data to S3, I used the following:

aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil rsync -r gs://plr_data_transfer_temp/bigquery_data/backend s3://my-bucket/bq_data/backend
Building synchronization state…
Starting synchronization…
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-000000000000.avro [Content-Type=application/octet-stream]...
| [3 files][987.8 KiB/987.8 KiB]
Operation completed over 3 objects/987.8 KiB.

Let’s check if the data got transferred to S3. I verified that from my local machine:

aws s3 ls --recursive  s3://my-bucket/bq_data/backend --profile smoke | awk '{print $4}'
bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro
bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro
bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro

Step 3. Extract AVRO schema from AVRO files stored in S3

To extract the schema from AVRO data, you can use the Apache avro-tools-<version>.jar with the getschema parameter. The benefit of using this tool is that it returns the schema in a form you can use directly in the WITH SERDEPROPERTIES statement when creating Athena tables.

You may have noticed that I got only one .avro file per table when dumping the BigQuery tables. This was because of the small data volume; otherwise, I would have gotten several files per table. Regardless of whether a table has one file or many, it's enough to run avro-tools against any single file of that table to extract the table's schema.

I downloaded the latest version of avro-tools, which is avro-tools-1.8.2.jar. I first copied all .avro files from S3 to local disk:

[hadoop@ip-10-0-10-205 tmpAftab]$ aws s3 cp s3://my-bucket/bq_data/backend/ bq_data/backend/ --recursive
download: s3://my-bucket/bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro to bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro
download: s3://my-bucket/bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro to bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro
download: s3://my-bucket/bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro to bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro

The avro-tools command should look like java -jar avro-tools-1.8.2.jar getschema your_data.avro > schema_file.avsc. This can become tedious if you have several AVRO files (in reality, I've done this for a project with many more tables). Again, I used a shell script to generate the commands. I created extract_schema_avro.sh with the following content:

schema_avro=(bq_data/backend/*)
for i in ${!schema_avro[*]}; do
  input_file=$(find ${schema_avro[$i]} -type f)
  output_file=$(ls -l ${schema_avro[$i]} | tail -n+2 \
    | awk -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9);
    print $9 }')
  commands=$(
    echo "java -jar avro-tools-1.8.2.jar getschema " \
      $input_file" > bq_data/schemas/backend/avro/"$output_file
  )
  echo $commands
done

Running extract_schema_avro.sh provides the following:

[hadoop@ip-10-0-10-205 tmpAftab]$ sh extract_schema_avro.sh
java -jar avro-tools-1.8.2.jar getschema bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro > bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
java -jar avro-tools-1.8.2.jar getschema bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro > bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
java -jar avro-tools-1.8.2.jar getschema bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro > bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc
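
One way to actually run these printed commands (my own shortcut, assuming the output directory does not exist yet) is to create it and pipe the generator's output to bash:

# Create the schema output directory, then execute the generated avro-tools commands.
mkdir -p bq_data/schemas/backend/avro
sh extract_schema_avro.sh | bash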

Executing the above commands copies the extracted schema under bq_data/schemas/backend/avro/:

[hadoop@ip-10-0-10-205 tmpAftab]$ ls -l bq_data/schemas/backend/avro/* | awk '{print $9}'
bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc

Let’s also check what’s inside an .avsc file.

[hadoop@ip-10-0-10-205 tmpAftab]$ cat bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc
{"type" : "record","name" : "Root","fields" : [ {"name" : "uid","type" : [ "null", "string" ]}, {"name" : "platform","type" : [ "null", "string" ]}, {"name" : "version","type" : [ "null", "string" ]}, {"name" : "country","type" : [ "null", "string" ]}, {"name" : "sessions","type" : [ "null", "long" ]}, {"name" : "active_days","type" : [ "null", "long" ]}, {"name" : "session_time_minutes","type" : [ "null", "double" ]} ]}

As you can see, the schema is in a form that can be used directly in Athena's WITH SERDEPROPERTIES. But before Athena, I used the AVRO schemas to create Hive tables. If you want to avoid Hive table creation, you can read the .avsc files to extract the field names and data types, but then you have to map the data types yourself from the AVRO format to the Athena table creation DDL.
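
If you do go down that route, here is a rough sketch (my own illustration, not something the original workflow needs) of how the mapping could be scripted with jq. It assumes flat records whose field types are simple nullable primitives like the ones above; nested or logical types would need extra handling, and the trailing comma on the last column still has to be trimmed:

# Hypothetical sketch: print Athena column definitions from an .avsc file,
# assuming every field type looks like ["null", "<primitive>"].
jq -r '.fields[]
  | .name as $n
  | ([.type[] | select(. != "null")] | first) as $t
  | ($t | if . == "long" then "bigint"
          elif . == "int" then "int"
          elif . == "double" then "double"
          elif . == "float" then "float"
          elif . == "boolean" then "boolean"
          else "string" end) as $athena_type
  | "`\($n)` \($athena_type),"' \
  bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc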

The complexity of the mapping task depends on how complex the data types in your tables are. For simplicity (and to cover everything from simple to complex data types), I let Hive do the mapping for me. So I created the tables first in the Hive metastore. Then I used SHOW CREATE TABLE to get the field names and data types part of the DDL.

Step 4. Create Hive tables on top of AVRO data, use schema from Step 3

As discussed earlier, Hive allows creating tables by using avro.schema.url. So once you have the schema (.avsc file) extracted from the AVRO data, you can create tables as follows:

CREATE EXTERNAL TABLE table_name
STORED AS AVRO
LOCATION 's3://your-aws-bucket/your-target-path/avro_data'
TBLPROPERTIES ('avro.schema.url'='s3://your-aws-bucket/your-target-path/your-avro-schema');

First, upload the extracted schemas to S3 so that avro.schema.url can refer to their S3 locations:

[hadoop@ip-10-0-10-205 tmpAftab]$ aws s3 cp bq_data/schemas s3://my-bucket/bq_data/schemas --recursive
upload: bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc to s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc
upload: bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc to s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc
upload: bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc to s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc

After having both the AVRO data and the schemas in S3, the DDL for the Hive tables can be created using the template shown at the beginning of this section. I used another shell script, create_tables_hive.sh (shown below), to cover any number of tables:

schema_avro=$(ls -l bq_data/backend | tail -n+2 | awk '{print $9}')
for table_name in $schema_avro; do
  file_name=$(ls -l bq_data/backend/$table_name | tail -n+2 | awk \
    -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9); print $9 }')
  table_definition=$(
    echo "CREATE EXTERNAL TABLE IF NOT EXISTS backend."$table_name"\nSTORED AS AVRO""\nLOCATION 's3://my-bucket/bq_data/backend/"$table_name"'""\nTBLPROPERTIES('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/"$file_name"');"
  )
  printf "\n$table_definition\n"
done

Running the script provides the following:

[hadoop@ip-10-0-10-205 tmpAftab]$ sh create_tables_hive.sh
CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');

CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');

CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2'
TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');

I ran the above on the Hive console to actually create the Hive tables:

[hadoop@ip-10-0-10-205 tmpAftab]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: false

hive> CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
    > STORED AS AVRO
    > LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc');
OK
Time taken: 4.24 seconds

hive>
    > CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2 STORED AS AVRO
    > LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
    > TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc');
OK
Time taken: 0.563 seconds

hive>
    > CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
    > STORED AS AVRO
    > LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2' TBLPROPERTIES ('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc');
OK
Time taken: 0.386 seconds

So I have created the Hive tables successfully. To verify that the tables work, I ran this simple query:

hive> select count(*) from backend.sessions_detailed_phase2;
Query ID = hadoop_20190214152548_2316cb5b-29f1-4416-922e-a6ff02ec1775
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1550010493995_0220)
----------------------------------------------------------------------------------------------
VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 8.17 s
----------------------------------------------------------------------------------------------
OK
6130

So it works!

Step 5. Extract Hive table definition from Hive tables

As discussed earlier, Athena requires you to explicitly specify field names and their data types in the CREATE statement. In Step 3, I extracted the AVRO schema, which can be used in WITH SERDEPROPERTIES of the Athena table DDL, but I also have to specify all the field names and their (Hive) data types. Now that I have the tables in the Hive metastore, I can easily get those by running SHOW CREATE TABLE. First, prepare the Hive DDL queries for all tables:

[hadoop@ip-10-0-10-205 tmpAftab]$ ls -l bq_data/backend | tail -n+2 | awk '{print "hive -e '\''SHOW CREATE TABLE backend."$9"'\'' > bq_data/schemas/backend/hql/backend."$9".hql;" }'
hive -e 'SHOW CREATE TABLE backend.sessions_daily_phase2' > bq_data/schemas/backend/hql/backend.sessions_daily_phase2.hql;
hive -e 'SHOW CREATE TABLE backend.sessions_detailed_phase2' > bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql;
hive -e 'SHOW CREATE TABLE backend.sessions_phase2' > bq_data/schemas/backend/hql/backend.sessions_phase2.hql;

Executing the above commands copies Hive table definitions under bq_data/schemas/backend/hql/. Let’s see what’s inside:

[hadoop@ip-10-0-10-205 tmpAftab]$ cat bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql
CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`(
  `uid` string COMMENT '',
  `platform` string COMMENT '',
  `version` string COMMENT '',
  `country` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://my-bucket/bq_data/backend/sessions_detailed_phase2'
TBLPROPERTIES (
  'avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc',
  'transient_lastDdlTime'='1550157659')

By now all the building blocks needed for creating AVRO tables in Athena are there:

  • Field names and data types can be obtained from the Hive table DDL (to be used in the columns section of the CREATE statement).

  • The AVRO schema (JSON) can be obtained from the extracted .avsc files (to be supplied in WITH SERDEPROPERTIES).

Step 6. Use the output of Steps 3 and 5 to create Athena tables

If you are still with me, you have done a great job coming this far. I am now going to perform the final step, which is creating the Athena tables. I used the following script to combine the .avsc and .hql files into Athena table definitions:

[hadoop@ip-10-0-10-205 tmpAftab]$ cat create_tables_athena.sh
# directory where extracted avro schemas are stored
schema_avro=(bq_data/schemas/backend/avro/*)
# directory where extracted HQL schemas are stored
schema_hive=(bq_data/schemas/backend/hql/*)
for i in ${!schema_avro[*]}; do
  schema=$(awk -F '{print $0}' '/CREATE/{flag=1}/STORED/{flag=0}\
   flag' ${schema_hive[$i]})
  location=$(awk -F '{print $0}' '/LOCATION/{flag=1; next}\
  /TBLPROPERTIES/{flag=0} flag' ${schema_hive[$i]})
  properties=$(cat ${schema_avro[$i]})
  table=$(echo $schema '\n' \
    "WITH SERDEPROPERTIES ('avro.schema.literal'='\n"$properties \
    "\n""')STORED AS AVRO \n" \
    "LOCATION" $location";\n\n")
  printf "\n$table\n"
done \
  > bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql

Running the above script copies Athena table definitions to bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql. In my case it contains:

[hadoop@ip-10-0-10-205 all_athena_tables]$ cat all_athena_tables.hql
CREATE EXTERNAL TABLE `backend.sessions_daily_phase2`(
  `uid` string COMMENT '',
  `activity_date` string COMMENT '',
  `sessions` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "activity_date", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_daily_phase2';

CREATE EXTERNAL TABLE `backend.sessions_detailed_phase2`(
  `uid` string COMMENT '',
  `platform` string COMMENT '',
  `version` string COMMENT '',
  `country` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "platform", "type" : [ "null", "string" ] }, { "name" : "version", "type" : [ "null", "string" ] }, { "name" : "country", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] } ')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_detailed_phase2';

CREATE EXTERNAL TABLE `backend.sessions_phase2`(
  `uid` string COMMENT '',
  `sessions` bigint COMMENT '',
  `active_days` bigint COMMENT '',
  `session_time_minutes` double COMMENT '')
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }')
STORED AS AVRO
LOCATION 's3://my-bucket/bq_data/backend/sessions_phase2';

And finally, I ran the above scripts in Athena to create the tables.
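
I used the Athena console for this, but the statements could also be submitted one at a time with the AWS CLI, roughly like this (a sketch: backend.sessions_daily_phase2.ddl is a hypothetical file holding exactly one of the CREATE statements, the output location is a placeholder, and a backend database is assumed to exist in the catalog):

# Submit a single CREATE statement to Athena; Athena accepts one statement per call.
aws athena start-query-execution \
  --query-string file://backend.sessions_daily_phase2.ddl \
  --query-execution-context Database=backend \
  --result-configuration OutputLocation=s3://my-bucket/athena-query-results/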

There you have it.

I feel that the process is a bit lengthy. However, it has worked well for me. The other approach would be to use the AWS Glue wizard to crawl the data and infer the schema. If you have used the AWS Glue wizard, please share your experience in the comment section below.
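
For reference, such a crawler could also be set up from the CLI along these lines (a sketch I did not run for this migration; the crawler name and IAM role are placeholders, and the role would need access to the bucket):

# Hypothetical sketch: let a Glue crawler infer the schemas from the AVRO files in S3.
aws glue create-crawler \
  --name bq-backend-crawler \
  --role AWSGlueServiceRole-bq-import \
  --database-name backend \
  --targets '{"S3Targets": [{"Path": "s3://my-bucket/bq_data/backend/"}]}'
aws glue start-crawler --name bq-backend-crawler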

Translated from: https://www.freecodecamp.org/news/how-to-import-google-bigquery-tables-to-aws-athena-5da842a13539/
