bigquery_python google bigquery入门

bigquery

BigQuery is Google’s highly-scalable, serverless and cost-effective solution for enterprise interested in collecting data and storing the data. You can view BigQuery as a cloud-based data warehouse that has some interesting machine learning and BI-Engine features.

BigQuery是Google的高度可扩展,无服务器且具有成本效益的解决方案,适用于有兴趣收集数据和存储数据的企业。 您可以将BigQuery视为具有一些有趣的机器学习和BI引擎功能的基于云的数据仓库。

Formplus is an easy-to-use form builder which allows anyone to build forms on top of cloud storage services. At Formplus, we must collect the data that our users get on their forms, we don’t stop at collecting these data, data is mostly useless in its raw state, so we analyse these data so that we can help them make sense of it.

Formplus是易于使用的表单构建器,它允许任何人在云存储服务之上构建表单。 在Formplus,我们必须收集用户在其表单上获得的数据,我们不应该停止收集这些数据,数据在原始状态下几乎是无用的,因此我们对这些数据进行分析,以便我们可以帮助他们理解它们。

Formplus application is powered by Google Cloud Platform (GCP), we use GCP to host and store data, we find BigQuery as the ultimate data warehouse because it is optimised for the challenges that come with real-time data analysis, another factor is that BigQuery is cost-effective, also you don’t have to worry about the architecture and it works seamlessly with SQL for real-time data streaming.

Formplus应用程序由Google Cloud Platform(GCP)提供支持,我们使用GCP托管和存储数据,我们发现BigQuery是最终的数据仓库,因为它针对实时数据分析带来的挑战进行了优化,另一个因素是BigQuery具有成本效益,也不必担心体系结构,它可与SQL无缝协作以进行实时数据流传输。

As an engineer at Formplus, I want to share some fundamental tips on how to get started with BigQuery with Python.

作为Formplus的工程师,我想分享一些有关如何开始使用BigQuery和Python的基本技巧。

如何安装和设置BigQuery。 (How To Install and Setup BigQuery.)

This tutorial is not for total beginners, so I assume that you know how to create a GCP project or have an existing GCP project, if not, you should read this on how to get started with GCP.

本教程不是针对所有初学者的,因此我假定您知道如何创建GCP项目或拥有现有的GCP项目,否则,您应该阅读有关如何开始使用GCP的内容

After creating a GCP project, There are four basic ways to interact with BigQuery according to the official Google BigQuery documentation. However, this article will only focus on how to interact with BigQuery from your Python code. So let’s get started.

创建GCP项目后,根据官方的Google BigQuery文档 ,有四种与BigQuery进行交互的基本方法。 但是,本文仅关注如何通过Python代码与BigQuery进行交互。 因此,让我们开始吧。

Interacting with BigQuery from Python can be done with either the BigQuery REST API or client libraries, I have found myself using both of them in some cases. I prefer using the Python client library because it’s like using the BigQuery REST API but on steroid. The BigQuery REST API makes it a little bit harder to access some methods that can easily be done with the Python client.

可以使用BigQuery REST API或客户端库与Python中的BigQuery进行交互,我发现自己在某些情况下会同时使用它们。 我更喜欢使用Python客户端库,因为它类似于使用BigQuery REST API,但使用的是类固醇。 BigQuery REST API使得访问某些可以通过Python客户端轻松完成的方法变得更加困难。

BigQuery can be used by making the popular HTTP request to the server, I am going to talk about this later in the article. Google developers have provided an easy way to make these HTTP requests to the server using your favourite language. A lot of Google APIs do not have API client libraries, luckily for us, Google BigQuery has one. This article is majorly on how to query your Google BigQuery using the Python client.

BigQuery可以通过向服务器发出流行的HTTP请求来使用,我将在本文稍后讨论。 Google开发人员提供了一种使用您喜欢的语言向服务器发出这些HTTP请求的简便方法。 许多Google API没有API客户端库,幸运的是,Google BigQuery拥有一个。 本文主要介绍如何使用Python客户端查询Google BigQuery。

In the below section, I will be explaining how to install, setup and use the Python client library to manage your data.

在下一节中,我将说明如何安装,设置和使用Python客户端库来管理数据。

如何安装和连接到BigQuery客户端 (How To Install and Connect to BigQuery Client)

Image for post
Install BigQuery API client
安装BigQuery API客户端

BigQuery needs authentication by using Google Application Credentials, to authorize BigQuery, you can either let BigQuery client determine the authentication credentials from the environment or you can directly set the path of your credential JSON to the environment variables. Check out this link for more information on how to get the credential JSON.

BigQuery需要使用Google应用程序凭据进行身份验证,才能授权BigQuery,您可以让BigQuery客户端从环境中确定身份验证凭据,也可以直接将凭据JSON的路径设置为环境变量。 请查看此链接 ,以获取有关如何获取凭证JSON的更多信息。

Image for post
set path to credentials
设置凭证的路径

After setting the path of your credential JSON to the environment variable, the next thing is to authenticate BigQuery client.

将凭证JSON的路径设置为环境变量后,下一步是对BigQuery客户端进行身份验证。

Image for post
authenticate BigQuery 1
验证BigQuery 1

Or if your application is set up to use Google Cloud Platform (GCP), you can dynamically authenticate BigQuery client by letting your application environment determine your project ID and credentials.

或者,如果您的应用程序设置为使用Google Cloud Platform(GCP),则可以通过让您的应用程序环境确定您的项目ID和凭据来动态认证BigQuery客户端。

Image for post
authenticate BigQuery 2
验证BigQuery 2

Congratulations 😃 you have successfully authenticated your BigQuery client using the Python client library.

恭喜😃您已成功使用Python客户端库对BigQuery客户端进行了身份验证。

Now that we have our BigQuery running, let us delve into some of the basic things you can do with BigQuery.

现在,我们已经运行了BigQuery,让我们深入研究一下BigQuery可以完成的一些基本操作。

使用BigQuery客户端创建数据集 (Create A Dataset Using BigQuery Client)

If you already have a dataset, you should skip this part.

如果已经有数据集,则应跳过此部分。

A dataset is to a folder, as tables in a dataset are to files inside a folder, in other words, a dataset is a collection of tables. For the rest of this article, I will be authenticating BigQuery by letting the application environment determine the project ID and credentials.

数据集是文件夹,就像数据集中的表是文件夹内的文件一样,换句话说,数据集是表的集合。 对于本文的其余部分,我将通过让应用程序环境确定项目ID和凭据来对BigQuery进行身份验证。

Image for post
create BigQuery dataset
创建BigQuery数据集

使用BigQuery客户端删除数据集 (Delete A Dataset Using BigQuery Client)

There are two ways to delete a dataset, you can either delete a dataset without deleting the table that is contained in the dataset or you can delete a dataset including the tables inside the dataset.

删除数据集的方法有两种,可以删除数据集而不删除数据集中包含的表,也可以删除包括数据集内的表在内的数据集。

Image for post
delete dataset
删除数据集

使用BigQuery客户端创建表 (Create a Table Using BigQuery Client)

Tables are inside the dataset, they are like the files inside folders I told you about. Here is how to create a table. You have to specify the Schema for each column in the table, you can check out BigQuery official documentation on different types of schema fields. Also, you can specify if a particular column of your table should be required or not.

表位于数据集中,就像我告诉过您的文件夹中的文件一样。 这是创建表格的方法。 您必须为表中的每一列指定Schema,然后可以查看有关不同类型的schema字段的BigQuery官方文档。 另外,您可以指定是否需要表的特定列。

In the example below, I have three columns apart from the first column which holds the row counts.

在下面的示例中,除了第一列之外,我还有三列用于保存行数。

Image for post

Here is what my table would look like after creation: This is to show that the table is empty, we only created the schema for the table.

这是我的表在创建后的样子:这是为了表明该表为空,我们只为该表创建了架构。

Image for post
Empty table
空表

使用BigQuery Client在表中插入数据行 (Insert Rows with Data in Table Using BigQuery Client)

Image for post
insert rows in a table
在表格中插入行

Here is what my table would look like after inserting two rows with data:

这是插入两行数据后我的表的样子:

Image for post
Table with 2 rows
两行表

Now that you have your data stored in BigQuery table, this is the right time to talk about how to query your table using SQL.

现在您已将数据存储在BigQuery表中,现在是时候讨论如何使用SQL查询表了。

使用BigQuery Client在表中查询包含数据的行 (Query Rows With Data in Table Using BigQuery Client)

Here, you will learn to run queries using SQL, and how to use the QueryJobConfig() method to enable BigQuery cache when fetching data, this will reduce the cost of fetching data from BigQuery table. The idea is that when the cache is enabled, the query will try to fetch the cached result if that same query has run in the past and if the query can’t find the result in the cache, it will then fetch the data from the table.

在这里,您将学习使用SQL运行查询,以及在获取数据时如何使用QueryJobConfig()方法启用BigQuery缓存,这将减少从BigQuery表获取数据的成本。 想法是,启用缓存后,如果该查询过去已经运行过,则查询将尝试获取缓存的结果;如果查询无法在缓存中找到结果,则查询将从缓存中获取数据。表。

After fetching the result, you can either get the row values by the field name or index.

提取结果后,您可以按字段名称或索引获取行值。

Image for post
SQL query in BigQuery
BigQuery中SQL查询

使用BigQuery客户端删除表 (Delete a Table Using BigQuery Client)

Image for post
delete table
删除表

Congratulation, 🎊 these are the basic things you need to know to get started with working with Google BigQuery using Python Client Library. For more powerful things you can do with BigQuery Python client, see the official documentation

恭喜,🎊这些是使用Python客户端库开始使用Google BigQuery所需的基本知识。 有关您可以使用BigQuery Python客户端执行的更强大的操作, 请参阅官方文档

In the next section, I will be talking on how to use the BigQuery REST API.

在下一部分中,我将讨论如何使用BigQuery REST API。

BigQuery REST API (BigQuery REST API)

You can also use the BigQuery REST API to perform some operations on datasets and tables. Firstly, let us see how you can create a BigQuery service, this is similar to creating a BigQuery client using the Python client library.

您还可以使用BigQuery REST API对数据集和表执行一些操作。 首先,让我们看看如何创建BigQuery服务,这类似于使用Python客户端库创建BigQuery客户端。

使用BigQuery REST API连接到BigQuery客户端 (Connect to BigQuery client using BigQuery REST API)

​This way of connecting to BigQuery project is suitable for when inserting or fetching all the data in your table without the power of data manipulation that can be done with SQL.

这种连接到BigQuery项目的方式适用于在不使用SQL进行数据操作的情况下插入或获取表中的所有数据的情况。

I will be getting my credentials through the OAuth2 client in AppEngine.

我将通过AppEngine中的OAuth2客户端获取凭据。

Image for post
BigQuery service using REST API
使用REST API的BigQuery服务

Congratulations 😃 you have successfully connected your client using BigQuery REST API.

恭喜😃您已使用BigQuery REST API成功连接了客户端。

使用BigQuery REST API将行插入表格 (Insert row to a table using BigQuery REST API)

In this part, we will learn to insert rows to BigQuery table using the BigQuery REST API v2. We will be referring to the same table, dataset and project used in the section above.

在这一部分中,我们将学习使用BigQuery REST API v2将行插入到BigQuery表中。 我们将引用上一节中使用的相同表,数据集和项目。

Image for post
Insert rows in a table using REST API
使用REST API在表格中插入行

Here is what my table would look like after inserting rows with data:

这是插入带有数据的行后我的表的样子:

Image for post
Table with 3 rows
3行桌

BigQuery的缺点 (Disadvantages of BigQuery)

Google BigQuey is an excellent tool for people that only want a data warehouse without the need of understanding the underlying architecture behind it. However, the issue I have with Google BigQuery is that it is read-only. You can’t modify the data in a BigQuery, you can’t delete any row. You can only delete the whole table or append to the table. This is as a result of the technology behind it.

对于只需要数据仓库而不需要了解其背后的基础架构的人们来说,Google BigQuey是一个出色的工具。 但是,我对Google BigQuery的问题是它是只读的。 您无法修改BigQuery中的数据,也无法删除任何行。 您只能删除整个表或追加到表中。 这是其背后技术的结果。

结论 (​Conclusion)

In a nutshell, we have learnt how to use basic BigQuery to achieve results and we have seen how powerful BigQuery is when combined with the Python Client Library. The advantages of BigQuery far outweigh its disadvantages. Compared to other data warehouses out there, BigQuery has an edge of processing speed of complex queries. BigQuery is the perfect tool for when speed is a factor because it can process Terabytes of data in seconds, it is also very suitable for continuous streaming of data to your tables.

简而言之,我们了解了如何使用基本的BigQuery来获得结果,并且了解了与Python客户端库结合使用时BigQuery的功能。 BigQuery的优点远大于缺点。 与现有的其他数据仓库相比,BigQuery具有处理复杂查询的速度优势。 BigQuery是速度因素的理想选择,因为它可以在几秒钟内处理TB级的数据,它也非常适合将数据连续流式传输到表中。

翻译自: https://blog.formpl.us/getting-started-with-python-google-bigquery-1ebc82ca9368

bigquery

  • 1
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值