BigQuery Data Types: Step-by-Step Guide to Load Data into BigQuery

In this Part 6 of the series, “Modernisation of a Data Platform”, we will focus a little more on BigQuery’s key concepts, which are essential for designing a DWH.

In this part, we will see how to deal with table design in BigQuery using different methods, load a covid19_tweets dataset, and run a query to analyse the data.

Creating a Schema:

We can create a schema in BigQuery either while migrating data from an existing data warehouse or while ingesting data into BigQuery from various data sources, whether on cloud or on-premises.

Other than manually creating the schema, BigQuery also gives an option to auto-detect the schema.

How does this auto-detect work?

BigQuery compares the header row of an input file with a representative sample of 100 records taken from row 2 onwards. If the data types of the sampled rows differ from those of the first row, BigQuery assumes the first row is a header and uses its values as column names. The user just has to enable auto-detect to have the schema created automatically while the load happens.

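As a minimal, hypothetical sketch, auto-detect can also be enabled from the Python client library when loading a CSV; the bucket path and table id below are placeholders, not values from this article:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # let BigQuery infer column names and types from a sample
)
load_job = client.load_table_from_uri(
    "gs://<bucket-name>/<file>.csv",     # placeholder source URI
    "<project-id>.<dataset>.<table>",    # placeholder destination table
    job_config=job_config,
)
load_job.result()  # waits for the load job to finish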

Datatypes in BigQuery:

While most of the data types are standard ones such as INTEGER, FLOAT, NUMERIC, BOOLEAN, etc., one special data type that we need to discuss is STRUCT.

This data type is particularly used for nested and repeated fields. The best example to represent a STRUCT is addresses. Normally, addresses have multiple sub-fields such as Is_active, address line 1, address line 2, town, city, post code, number of years at the address, and so on.

All these fields can be nested under the parent field ‘Addresses’. While ordinary fields are either NULLABLE or REQUIRED, a STRUCT that holds repeated data such as multiple addresses has its mode defined as REPEATED.

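As a rough sketch, such a schema could be declared with the Python client library as a REPEATED RECORD (STRUCT) field; the field names below are illustrative, not taken from an actual table:

from google.cloud import bigquery

addresses = bigquery.SchemaField(
    "Addresses",
    "RECORD",          # the STRUCT type
    mode="REPEATED",   # a user can have several addresses
    fields=[
        bigquery.SchemaField("Is_active", "BOOLEAN"),
        bigquery.SchemaField("address_line_1", "STRING"),
        bigquery.SchemaField("address_line_2", "STRING"),
        bigquery.SchemaField("town", "STRING"),
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("post_code", "STRING"),
        bigquery.SchemaField("number_of_years_addr", "INTEGER"),
    ],
)
schema = [bigquery.SchemaField("user_name", "STRING"), addresses]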

Creating Tables & Managing Access:

The easiest way to create a table in BQ is by using the Cloud Console. The UI is extremely friendly and the user can navigate to the BigQuery console to create tables in a dataset.

Alternatively, there is a REST API service that can be used to insert tables with a specific schema into the dataset.

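The client libraries wrap that same API. Below is a minimal sketch of creating a table with an explicit schema via the Python client; the table id and columns are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("user_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("user_location", "STRING"),
    bigquery.SchemaField("date", "TIMESTAMP"),
]
table = bigquery.Table("<project-id>.<dataset>.<table>", schema=schema)
table = client.create_table(table)  # calls the tables.insert API under the hood
print("Created table {}".format(table.full_table_id))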

BigQuery provides an option to restrict access at a dataset level. However, there is a beta feature (as of when this article is being published) to grant access at a table level or view level too. Access can be granted as a data viewer, data admin, data editor, data owner, and so on.

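As an illustrative sketch, a dataset-level grant can be added from the Python client by appending an access entry; the e-mail address and dataset id are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("<project-id>.<dataset>")
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                    # roughly a data viewer
        entity_type="userByEmail",
        entity_id="analyst@example.com",  # placeholder user
    )
)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ["access_entries"])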

BigQuery allows users to copy a table, delete a table, alter the expiration time of a table, update the table description, and so on. All these actions are possible via the Console, the API, the bq command line, or the client libraries.

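A brief, hypothetical Python sketch of a few of those management actions; the table ids are placeholders:

import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Copy a table (result() waits for the copy job to finish).
client.copy_table("<project>.<dataset>.source_table",
                  "<project>.<dataset>.dest_table").result()

# Update the description and expiration time of the copied table.
table = client.get_table("<project>.<dataset>.dest_table")
table.description = "Copy of the source table"
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=30)
client.update_table(table, ["description", "expires"])

# Delete the original table.
client.delete_table("<project>.<dataset>.source_table", not_found_ok=True)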

What is Partitioning & Clustering?

Query performance is paramount in BigQuery, and one of the key features that enables it is table partitioning. Dividing large tables into small partitions is key to enhancing query performance and fetching results quicker.

Partitioning can be done by the following methods (a short Python sketch follows the list):

· Ingestion time — On the basis of the time the data arrives in BigQuery

Ex: If the data load to BigQuery happens on a daily basis, then a partition is created for every day. A pseudo column called “_PARTITIONTIME” is created, which records the date of the load. By default the schema of all these partitions will be the same, but BigQuery provides an option to change the schema.

· Date/Timestamp — Based on a date or timestamp column of the table

Ex: Based on the timestamp at which a row is created, a partition can be created. In a user table, every user who gets registered will have a registration timestamp. We can define the partition based on the registration timestamp at a daily or even hourly granularity.

· Integer range — Based on the range of an integer column.

Perhaps the simplest one, where partitions are created on the basis of IDs. Ex: Customer IDs 1 to 100 in Partition 1 and so on…

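As promised above, here is a hypothetical Python sketch of creating partitioned tables; the table ids and column names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Partitioning on a timestamp column at daily granularity
# (omit `field` to fall back to ingestion-time partitioning).
users = bigquery.Table("<project>.<dataset>.users", schema=[
    bigquery.SchemaField("user_id", "INTEGER"),
    bigquery.SchemaField("registration_ts", "TIMESTAMP"),
])
users.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="registration_ts",
)
client.create_table(users)

# Integer-range partitioning on customer IDs, 100 IDs per partition.
customers = bigquery.Table("<project>.<dataset>.customers", schema=[
    bigquery.SchemaField("customer_id", "INTEGER"),
    bigquery.SchemaField("customer_name", "STRING"),
])
customers.range_partitioning = bigquery.RangePartitioning(
    field="customer_id",
    range_=bigquery.PartitionRange(start=1, end=10000, interval=100),
)
client.create_table(customers)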

In case we do not care on what basis the tables are to be split, we can use clustering techniques instead. Both will improve query performance, but partitioning is more specific and is a matter of user choice.

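A short illustrative sketch of clustering with the Python client follows; the names are placeholders, and clustering can also be combined with partitioning on the same table:

from google.cloud import bigquery

client = bigquery.Client()

tweets = bigquery.Table("<project>.<dataset>.tweets_clustered", schema=[
    bigquery.SchemaField("user_name", "STRING"),
    bigquery.SchemaField("user_location", "STRING"),
    bigquery.SchemaField("date", "TIMESTAMP"),
    bigquery.SchemaField("text", "STRING"),
])
# Rows are co-located by these columns, so filters on them scan less data.
tweets.clustering_fields = ["user_location", "user_name"]
client.create_table(tweets)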

Let us load a dataset which is a collection of some of the tweets related to Covid19 and do a short analysis.

Step 1: Create a project on Google Cloud, e.g. “Test Project”

Step 2: Enable BigQuery API to enable calls from client libraries.

The BigQuery sandbox lets users load up to 10 GB of data and query up to 1 TB of data free of cost without enabling a billing account.

Step 3: Install Cloud SDK to run the commands from your local machine. Alternatively, one can log in to the Google Cloud console and click on the Cloud Shell icon adjacent to the search bar on the header of the console.

On Cloud SDK type gcloud init to initialise.

Follow the instructions, such as selecting the credentials with which you choose to log in. The credentials are normally the Gmail credentials associated with the Google Cloud project. Most of it is self-explanatory.

In case you choose to operate via Google Cloud Shell, then the following commands will set it up.

gcloud config set project <project-id>
sudo apt-get update
sudo apt-get install virtualenv
virtualenv -p python3 venv
source venv/bin/activate

The above commands set up the project, install a virtual environment, and activate it.

Once the venv is activated, install Google Cloud’s BigQuery client library into the Cloud Shell VM’s virtual environment.

pip install --upgrade google-cloud-bigquery

On Cloud Console, navigate to IAM > Service accounts > Create service account > download the key JSON to your local machine.

export GOOGLE_APPLICATION_CREDENTIALS='PATH TO JSON FILE'

This will allow us to operate BigQuery via client libraries as well as via native Cloud Shell commands.

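As a quick, optional sanity check that the credentials are picked up, the Python client can be instantiated and asked to list datasets; it reads GOOGLE_APPLICATION_CREDENTIALS automatically:

from google.cloud import bigquery

client = bigquery.Client()  # uses the exported service-account key
for dataset in client.list_datasets():
    print(dataset.dataset_id)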

Let us now create a Google Cloud Storage bucket in order to store the csv file. Via Cloud Console, the user can just navigate to Storage > Create Bucket. It is as simple as creating a folder on Google Drive and uploading files to it. If you choose to do the same using shell commands:

gsutil mb -b on -l us-east1 gs://<bucket-name>

The above command is split as below:

gsutil — A command used for accessing Google Cloud Storage via Cloud Shell as well as Cloud SDK

mb — Command to create (“make”) a bucket

-b on — Turns on uniform bucket-level access for the new bucket

-l <location name> — Creates the bucket in a specific location of choice. This effectively means that the data will be stored in the servers of that location.

gs://<bucket-name> — The bucket name we wish to assign

Once the bucket is created we can upload data. Our data is in CSV format. Having the data in a Google Cloud Storage bucket gives 99.999% availability and ensures it is not lost, as the data is replicated redundantly, as we discussed in one of the previous posts.

gsutil cp <path of csv on local> gs://<bucket-name>

In simple words, the cp command copies the file from the path we mention and pastes it inside the bucket we created previously.

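If you prefer to stay in Python, a roughly equivalent upload can be sketched with the google-cloud-storage client library (assuming it is installed with pip install google-cloud-storage); the bucket name and local path are placeholders:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("<bucket-name>")
blob = bucket.blob("covid19_tweets.csv")  # object name inside the bucket
blob.upload_from_filename("/path/to/covid19_tweets.csv")  # local CSV path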

BigQuery is democratisation in its true sense, making its features easily understandable and accessible to everyone irrespective of their technical acumen.

Now we have the file we want to load into BigQuery, available in the Google Cloud Storage bucket. The next task is to load it into BigQuery for further analysis.

Similar to “gsutil” for Google Cloud Storage, the shell command for BigQuery is “bq”.

We can perform “bq” commands either via Cloud SDK or Cloud Shell in the console.

BigQuery organises tables into datasets, and these datasets can be created in a particular geography, just as we create VMs in a particular location.

bq --location=<location-name> mk -d <dataset-name>

We can add additional parameters to the command, such as setting a default table expiration or a default partition expiration for the dataset. Ex:

bq --location=<location> mk \
    --dataset \
    --default_table_expiration <integer1> \
    --default_partition_expiration <integer2> \
    --description <description> \
    <project_id>:<dataset>
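
A hypothetical Python equivalent for creating the dataset; the location and expiration values are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
dataset = bigquery.Dataset("<project-id>.mydataset1")
dataset.location = "US"                                      # placeholder location
dataset.default_table_expiration_ms = 90 * 24 * 3600 * 1000  # e.g. 90 days
dataset = client.create_dataset(dataset, exists_ok=True)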

After creating a dataset, we create a table on BigQuery. The below command creates a table called covid19_tweets in the dataset called mydataset1. We are creating an empty table without any schema.

bq mk -t mydataset1.covid19_tweets

We will now load the data using another bq command. We also need to create the schema while loading the data; for that we will use the --autodetect flag along with bq load. We will give the path of the Cloud Storage object and the destination table details in the command.

bq load --autodetect  --source_format=CSV mydataset1.covid19_tweets gs://srikrishna1234/covid19_tweets.csv

Once run we see the below messages on the terminal indicating that the job is in progress and finally DONE.

Waiting on bqjob_r716b7160069a1bce_00000173ede404b9_1 ... (0s) Current status: RUNNING
Waiting on bqjob_r716b7160069a1bce_00000173ede404b9_1 ... (5s) Current status: RUNNING
Waiting on bqjob_r716b7160069a1bce_00000173ede404b9_1 ... (11s) Current status: DONE

So far, we have seen how to use the Console UI and the CLI to operate BigQuery. Now that the data is loaded, let us run some queries to analyse a scenario using Python’s client libraries.

Let us find out from the dataset how many users from the United Kingdom have made controversial tweets, or rather have the word “controversial” in their user description.

The query would look like below:

with X as (
  select user_name, user_location, user_description
  from `<project-name>.mydataset1.covid19_tweets`
  where user_description like '%controversial%'
)
select count(user_location) as count
from X
where user_location in ('United Kingdom')

A simple Python program with the SQL query that hits BigQuery and extracts the information for us:

from google.cloud import bigquery

def query_tweeples():
    client = bigquery.Client()
    query_job = client.query("""
        with X as (
            select user_name, user_location, user_description
            from `canvas-seat-267117.mydataset1.covid19_tweets`
            where user_description like '%controversial%'
        )
        select count(user_location) as count
        from X
        where user_location in ('United Kingdom')
    """)
    results = query_job.result()  # Waits for job to complete.
    for row in results:
        print("{} people".format(row.count))

if __name__ == "__main__":
    query_tweeples()

Save the above code as a .py file. Ex: bigquery.py

Remember, we have already created a service account, downloaded the JSON key file, exported it to GOOGLE_APPLICATION_CREDENTIALS, and installed the google-cloud-bigquery library. Time to run the Python code.

cd <path where the python file is placed>
python bigquery.py

You will find the output as ‘43 people’, which means there are 43 people in the dataset who are in the United Kingdom and have the word ‘controversial’ in their user description.

In case you wish to use a bq command for the above activity, use the below command:

bq query --use_legacy_sql=false --destination_table mydataset1.covid19_tweets '<query>'

So, in this part of the series we learnt how to interact with BigQuery via the console and the command line interface using bq commands, and also performed some actions using client libraries.

Hope you are enjoying the series. In the next part we will learn about streaming data into BigQuery.

Source: https://medium.com/front-end-weekly/step-by-step-guide-to-load-data-into-bigquery-8a25ad415342
