[3] Hive3.x Materialized view

最新推荐文章于 2023-07-23 00:51:02 发布

hjw199089

最新推荐文章于 2023-07-23 00:51:02 发布

阅读量2.5k

点赞数

分类专栏： [1]Hive

[1]Hive 专栏收录该内容

21 篇文章 0 订阅

订阅专栏

Objectives

一般来说，查询加速的最有效方法即

关系聚合预计算（pre-computation of relevant summaries）
物化视图（materialized views）

Hive3.0开始尝试引入物化视图，并提供对于物化视图的查询自动重写（基于Apache Calcite实现）；值得注意的是，3.0中提供了物化视图存储选择机制，可以本地存储在hive，同时可以通过用户自定义storage handlers存储在其他系统（如Druid）。Hive3.0提供了对于物化视图生命周期管理（如数据更新）的控制。

Not a view not a table, meet the Materialized view

According to Wikipedia a SQL View is the result set of a stored query on the data. Let’s say you have a lot of different tables that you are constantly requesting, using always the same joins, filters and aggregations. With a view, you could simplify access to those datasets while providing more meaning to the end user. It avoids repeating the same complex queries and eases schema evolution.

For example, an application needs access to a products dataset with the product owner and the total number of order for each product. Such queries would need to join the User and Order tables with the Product table. A view would mask the complexity of the schema to the end users by only providing one table with custom and dedicated ACLs.

However such views in Hive used to be virtual and implied huge and slow queries. Instead, you could create an intermediate table to store the results of your query, but such operations require changing your access patterns and has the challenge of making sure the data in the table stays fresh.

We can identify four main types of optimization:

Change data’s physical properties (distribute, sort).
Filter or partition rows.
Denormalization(Denormalization is the operation of grouping two or more tables into one bigger table. Basically it removes the need of a heavy JOIN operation)
Preaggregation.

The goal of Materialized views (MV) is to improve the speed of queries while requiring zero maintenance operations.

The main features are:

Storing the result of a query just like a table (the storage can be in Hive or Druid).
The definition of the MV is used to rewrite query and requires no change in your previous patterns.
The freshness of the data is ensured by the system.
A simple insert in the table is very efficient since it does not require rebuilding the view.

Management of materialized views in Hive

Materialized views creation
支持的基本特性：

partition columns
custom storage handler
passing table properties

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
  [DISABLE REWRITE]
  [COMMENT materialized_view_comment]
  [PARTITIONED ON (col_name, ...)]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
      | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;

说明

（1）物化视图创建后，query的执行数据自动落地，"自动"也即在query的执行期间，任何用户对该物化视图是不可见的
（2）默认，该物化视图可被用于查询优化器optimizer查询重写（在物化视图创建期间可以通过DISABLE REWRITE参数设置禁止使用）
(3) SerDe和storage format非强制参数，可以用户配置，默认可用hive.materializedview.serde、 hive.materializedview.fileformat
（4）物化视图可以使用custom storage handlers存储在外部系统（如druid）例如：

CREATE MATERIALIZED VIEW druid_wiki_mv
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time, page, user, c_added, c_removed
FROM src;

Other operations for materialized view management

目前支持物化视图的drop和show操作，后续会增加其他操作

-- Drops a materialized view
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
-- Shows materialized views (with optional filters)
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’];
-- Shows information about a specific materialized view
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;

Materialized view-based query rewriting

物化视图创建后即可用于相关查询的加速，用户提交查询query，若该query经过重写后可命中已建视图，则被重写命中相关已建视图实现查询加速。
是否重写查询使用物化视图可以通过全局参数控制（hive.materializedview.rewriting，默认为true, ）
SET hive.materializedview.rewriting=true;
用户可选择性的失能物化视图的重写，materialized views are enabled for rewriting at creation time. To alter that behavior, the following statement can be used: ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;

基于Calcite重写物化视图，其中支持的重写样例可参见：
Materialized Views

Materialized view maintenance

当数据源变更（新数据插入inserted、数据修改modified），物化视图也需要更新以保持数据一致性，目前需要用户主动触发rebuild：

ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;

增量更新
Hive supports incremental view maintenance, i.e., only refresh data that was affected by the changes in the original source tables. Incremental view maintenance will decrease the rebuild step execution time. In addition, it will preserve LLAP cache for existing data in the materialized view.

By default, Hive will attempt to rebuild a materialized view incrementally, falling back to full rebuild if it is not possible. Current implementation only supports incremental rebuild when there were INSERT operations over the source tables, while UPDATE and DELETE operations will force a full rebuild of the materialized view.

To execute incremental maintenance, following conditions should be met:

The materialized view should only use transactional tables, either micromanaged or ACID.
If the materialized view definition contains a Group By clause, the materialized view should be stored in an ACID table, since it needs to support MERGE operation. For materialized view definitions consisting of Scan-Project-Filter-Join, this restriction does not exist.
A rebuild operation acquires an exclusive write lock over the materialized view, i.e., for a given materialized view, only one rebuild operation can be executed at a given time.

Materialized view lifecycle

By default, once a materialized view contents are stale, the materialized view will not be used for automatic query rewriting.

However, in some occasions it may be fine to accept stale data, e.g., if the materialized view uses non-transactional tables and hence we cannot verify whether its contents are outdated, however we still want to use the automatic rewriting. For those occasions, we can combine a rebuild operation run periodically, e.g., every 5minutes, and define the required freshness of the materialized view data using the hive.materializedview.rewriting.time.window configuration parameter, for instance:

SET hive.materializedview.rewriting.time.window=10min;
The parameter value can be also overridden by a concrete materialized view just by setting it as a table property when the materialization is created.

Materialized view related setting parameters

<property>
    <name>hive.materializedview.rewriting</name>
    <value>true</value>
    <description>Whether to try to rewrite queries using the materialized views enabled for rewriting</description>
  </property>
  <property>
    <name>hive.materializedview.rewriting.strategy</name>
    <value>heuristic</value>
    <description>
      Expects one of [heuristic, costbased].
      The strategy that should be used to cost and select the materialized view rewriting. 
        heuristic: Always try to select the plan using the materialized view if rewriting produced one,choosing the plan with lower cost among possible plans containing a materialized view
        costbased: Fully cost-based strategy, always use plan with lower cost, independently on whether it uses a materialized view or not
    </description>
  </property>
  <property>
    <name>hive.materializedview.rewriting.time.window</name>
    <value>0min</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is min if not specified.
      Time window, specified in seconds, after which outdated materialized views become invalid for automatic query rewriting.
      For instance, if more time than the value assigned to the property has passed since the materialized view was created or rebuilt, and one of its source tables has changed since, the materialized view will not be considered for rewriting. Default value 0 means that the materialized view cannot be outdated to be used automatically in query rewriting. Value -1 means to skip this check.
    </description>
  </property>
  <property>
    <name>hive.materializedview.rewriting.incremental</name>
    <value>false</value>
    <description>
      Whether to try to execute incremental rewritings based on outdated materializations and
      current content of tables. Default value of true effectively amounts to enabling incremental
      rebuild for the materializations too.
    </description>
  </property>
  <property>
    <name>hive.materializedview.rebuild.incremental</name>
    <value>true</value>
    <description>
      Whether to try to execute incremental rebuild for the materialized views. Incremental rebuild
      tries to modify the original materialization contents to reflect the latest changes to the
      materialized view source tables, instead of rebuilding the contents fully. Incremental rebuild
      is based on the materialized view algebraic incremental rewriting.
    </description>
  </property>
  <property>
    <name>hive.materializedview.fileformat</name>
    <value>ORC</value>
    <description>
      Expects one of [none, textfile, sequencefile, rcfile, orc].
      Default file format for CREATE MATERIALIZED VIEW statement
    </description>
  </property>
  <property>
    <name>hive.materializedview.serde</name>
    <value>org.apache.hadoop.hive.ql.io.orc.OrcSerde</value>
    <description>Default SerDe used for materialized views</description>
  </property>

Example

（1）新建一张transactional表depts

SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;
CREATE TABLE depts (
  deptno INT,
  deptname VARCHAR(256),
  locationid INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

（2）导入数据

hive> INSERT OVERWRITE TABLE depts
    > select
    > id,name,1 as loc from student;
Query ID = didi_20181128204405_c06c8983-a363-458b-b1f8-443deeb514c2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
  ....
  
hive> select * from depts;
OK
1001	zhangsan	1
1002	lisi	1
Time taken: 0.195 seconds, Fetched: 2 row(s)
...

（3）对depts建立聚合物化视图

hive> CREATE MATERIALIZED VIEW depts_agg
    > AS
    > SELECT  deptno, count(1) as deptno_cnt from depts group by deptno;
    
Query ID = didi_20181128204706_be53ca94-f594-49a2-beda-7cec2b2f2c71
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1543385586294_0004, Tracking URL = http://localhost:8088/proxy/application_1543385586294_0004/
Kill Command = /..../software/hadoop/hadoop-2.7.4/bin/mapred job  -kill job_1543385586294_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
.....

注意.
这里日志可见，当执行CREATE MATERIALIZED VIEW，与一遍creat table 不同，会启动一个MR（这里没有指定其他类型的引擎如spark，默认为MR）对物化视图进行构建

打开资源UI，可见具体的构建作业
在这里插入图片描述

（4）对原始表deptno查询
由于会命中物化视图，重写query查询物化视图，查询速度会加快（没有启动MR，只是普通的tablescan）

hive> SELECT  deptno, count(1) as deptno_cnt from depts group by deptno;
OK
1001	1
1002	1
Time taken: 0.414 seconds, Fetched: 2 row(s)

具体可见执行过程
查询被自动重写为TableScan alias: hive3_test.depts_agg

hive> explain SELECT  deptno, count(1) as deptno_cnt from depts group by deptno;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: hive3_test.depts_agg
          Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: deptno (type: int), deptno_cnt (type: bigint)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 0.275 seconds, Fetched: 17 row(s)

Roadmap

Many improvements are planned :

Improving the rewriting algorithm inside Apache Calcite
Control distribution of data inside the view (SORT BY, CLUSTER BY, DISTRIBUTE BY)
Supports UPDATE/DELETE in incremental rebuild of the view