Key BigQuery Concepts

This is Part 5 of the series Modernising a Data Platform and BigQuery Concepts. In this part and the next few, we discuss some of the key concepts of BigQuery for data warehousing professionals.

In the first four parts of the series, we focused on the concept of modernisation, data warehouse modelling and fundamentals, the characteristics of a modernised data platform, and the architecture that drives big data analytical platforms.

BigQuery is a modern cloud data warehouse that offers a wide range of benefits and addresses quite a few of the pain points that a traditional data warehouse poses to its users.

Several characteristics make BigQuery a comprehensive solution for a modern data warehouse: fast query processing; storage for petabytes of data at a relatively low cost; a serverless, NoOps architecture that removes maintenance and operational overhead for users; compatibility with other well-known technologies; ease of migration; and ML capabilities.

BigQuery does not deviate much from conventional data warehouse (DWH) concepts such as data marts, data lakes, tables, views and grants, but it is more organised. The data marts of a traditional data warehouse are called datasets in BigQuery. The data lake, a raw data store that corresponds to Google Cloud Storage and Google Drive, can be queried directly from BigQuery through its external data source integration. Google's Identity and Access Management (IAM) controls access to BigQuery datasets at a very granular level.

Image: Access to public datasets for exploration

Image: Assigning specific roles to users while sharing a table

BigQuery has the following key features:

1. Loading and exporting data

2. Querying and viewing data

3. Managing data

The data can be transactional data, analytical data, or logs that are streamed in for further analysis.

Data can be written from external sources directly into BigQuery, or BigQuery's analytical engine can be used on its own to process data without using BigQuery storage.

A user can interact with BigQuery in three ways:

1. UI/Console

2. REST API

3. Command line

As mentioned earlier, BigQuery is serverless, so we do not have to worry about provisioning or allocating resources. A user just has to follow these steps:

1. Create a project

2. Create a dataset

3. Create a schema as per the design

4. Import data into the tables using the various ETL techniques native to GCP
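As a rough illustration of steps 2 to 4, the sketch below renders them as GoogleSQL statements built in Python. It only produces strings and does not connect to BigQuery. The project, dataset, table, column names and the `gs://` URI are hypothetical placeholders, and step 1 (creating the project) happens outside SQL, so it is not shown.

```python
def quote(path: str) -> str:
    """Wrap a project/dataset/table path in backticks for use in SQL."""
    return f"`{path}`"


def setup_statements(project, dataset, table, columns, source_uri):
    """Return SQL strings for: create dataset, create table, load data.

    All arguments are illustrative placeholders, not values from the article.
    """
    dataset_path = f"{project}.{dataset}"
    table_path = f"{dataset_path}.{table}"
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return [
        # Step 2: a dataset is created with CREATE SCHEMA
        f"CREATE SCHEMA IF NOT EXISTS {quote(dataset_path)}",
        # Step 3: the table schema as per the design
        f"CREATE TABLE IF NOT EXISTS {quote(table_path)} ({column_list})",
        # Step 4: one possible way to import data, from a CSV file in Cloud Storage
        f"LOAD DATA INTO {quote(table_path)} "
        f"FROM FILES (format = 'CSV', uris = ['{source_uri}'])",
    ]


statements = setup_statements(
    "my-project",
    "sales_mart",
    "orders",
    [("customer_id", "STRING"), ("order_value", "NUMERIC")],
    "gs://my-bucket/orders.csv",
)
for statement in statements:
    print(statement)
```

The statements could then be run from the console, the command line or the REST API; loading with SQL is only one of several ingestion options.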

BigQuery keeps data organised in datasets and tables, so a table is always addressed as project.dataset.table.
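To make the three-level naming concrete, here is a minimal sketch that splits a fully qualified table name into its parts. The helper names `TableRef` and `parse_table_id` are made up for this example, and the sketch assumes none of the three parts contains a dot.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The three levels of a BigQuery table name."""
    project: str
    dataset: str
    table: str


def parse_table_id(table_id: str) -> TableRef:
    """Split 'project.dataset.table' into its three components."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    return TableRef(*parts)


ref = parse_table_id("bigquery-public-data.samples.shakespeare")
print(ref.project, ref.dataset, ref.table)
```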

Recall Part 4, where we discussed the technologies underlying BigQuery and big data analytical processing systems: Dremel and Borg. They play a major role in allocating resources and distributing the processing load across those resources.

BigQuery distributes load in units called slots. A slot is a combination of CPU, memory and networking resources; one slot can be thought of as a virtual CPU with some amount of memory. When a user submits a query, BigQuery decides how many slots it needs based on the query's complexity: the more complex the query, the more slots are requested. A few important points to remember about slots:

1. BigQuery allocates them dynamically while the query is in flight. If a query demands more slots than are available, part of the work is queued.

2. Slots can be allocated under the on-demand pricing model, where billing is based on the bytes processed by the query.

3. Slots can also be allocated under flat-rate pricing, where a number of slots is reserved for the project and billing is month on month.
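The queueing behaviour and the bytes-based billing can be pictured with a toy model. This is a deliberate simplification and not how BigQuery's scheduler works internally: it assumes the work splits into equal units that each occupy one slot, and the price per TiB is a made-up parameter, not a quoted rate.

```python
import math


def execution_waves(work_units: int, available_slots: int) -> int:
    """Number of rounds needed when each unit of work occupies one slot.

    If the query wants more slots than are available, the surplus units
    wait in a queue and run in a later round.
    """
    if available_slots <= 0:
        raise ValueError("available_slots must be positive")
    return math.ceil(work_units / available_slots)


def on_demand_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Cost of one query when billing is proportional to bytes processed."""
    return bytes_processed / 2**40 * price_per_tib


# 2,500 units of work on 1,000 slots: two full rounds plus a partial one
print(execution_waves(2500, 1000))
# Half a TiB scanned at a hypothetical price of 5.0 per TiB
print(on_demand_cost(2**39, 5.0))
```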

Run-length encoding:

We looked briefly at columnar data storage in Part 4. As an extension, there are two neat ways of compressing the data even further: bitmaps and run-length encoding (RLE). Let us look at an example.

Consider an e-commerce dataset with a table that captures the monetary value of orders per customer. Column A holds the customer ID and column B holds the total value of the transaction. Let's say column B has 100 values, of which 75 are distinct. In other words, there are 100 transactions in total, with 75 distinct transaction values among them.

Bitmap compression looks like this:

Transaction value: 100.00 [1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,…,0]

Transaction value: 110.00 [0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,…,0]

...

Transaction value: 170.00 [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,…,1]

Each binary value inside the square brackets indicates whether the transaction in that row had the given transaction value: 1 means yes and 0 means no.
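The bitmap idea can be sketched in a few lines of Python. This is only an illustration of the technique, not BigQuery's implementation; the function name `bitmap_encode` is made up, and the eight-row sample column is invented so that its bitmaps match the first eight positions shown above (the value 120.0 in the fourth row is an assumption).

```python
def bitmap_encode(values):
    """Build one bitmap per distinct value in a column.

    Position i of a value's bitmap is 1 if row i holds that value, else 0.
    """
    bitmaps = {}
    for row, value in enumerate(values):
        # Create an all-zero bitmap the first time a value is seen
        bitmaps.setdefault(value, [0] * len(values))[row] = 1
    return bitmaps


# Hypothetical first eight rows of column B
column_b = [100.0, 110.0, 110.0, 120.0, 170.0, 100.0, 100.0, 100.0]
bitmaps = bitmap_encode(column_b)
for value, bits in bitmaps.items():
    print(value, bits)
```

Each row sets exactly one bit across all bitmaps, which is why the bitmaps are mostly zeros and compress well.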

Taking it a step further, RLE compresses these bitmaps even more. Each number is the length of a run of identical bits, and the runs alternate between 1s and 0s:

Transaction value: 100 [1,4,3,92] → [one 1, four 0s, three 1s, and the rest 0s]

Transaction value: 110 [1,2,97] → [one 0, two 1s, and the rest 0s]

In the RLE representation above, we can tell how many rows in the entire dataset had a transaction value of 100, 110 and so on.
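Run-length encoding of such a bitmap can be sketched as follows. The function names are made up for this example, and the output stores each run as a (bit, length) pair rather than a bare length, so that it is unambiguous whether a bitmap starts with a 1 or a 0. The bitmaps are the two 100-row examples from the text.

```python
from itertools import groupby


def run_length_encode(bits):
    """Collapse consecutive identical bits into (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bits)]


def run_length_decode(runs):
    """Expand (bit, run_length) pairs back into the original bitmap."""
    return [bit for bit, length in runs for _ in range(length)]


bitmap_100 = [1, 0, 0, 0, 0, 1, 1, 1] + [0] * 92  # transaction value 100
bitmap_110 = [0, 1, 1] + [0] * 97                 # transaction value 110

runs_100 = run_length_encode(bitmap_100)
runs_110 = run_length_encode(bitmap_110)
print(runs_100)  # [(1, 1), (0, 4), (1, 3), (0, 92)]
print(runs_110)  # [(0, 1), (1, 2), (0, 97)]
```

The run lengths match the [1,4,3,92] and [1,2,97] lists above, and summing the lengths of the 1-runs gives the number of rows that hold each value.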

Combining all of these, the complete encoded view looks like this:

Image: the combined encoded view

All of this happens inside BigQuery's Capacitor. Once the data is encoded, it is stored in Colossus, Google's distributed data storage.

One important aspect that BigQuery brings to the table is the separation of compute and storage. In addition, redundant replication of datasets ensures that data is not lost. As discussed in Part 4, storage and compute are connected via Jupiter, Google's petabit-bandwidth network.

Translated from: https://medium.com/front-end-weekly/key-bigquery-concepts-a80269118115
