Optimizing Joins with In-Memory Aggregation

最新推荐文章于 2024-07-26 14:20:58 发布

彦祖的小号

最新推荐文章于 2024-07-26 14:20:58 发布

阅读量185

点赞数

分类专栏： Oracle 文章标签：数据库

本文链接：https://blog.csdn.net/linsuhangoracle/article/details/118569171

版权

Oracle 专栏收录该内容

193 篇文章 0 订阅

订阅专栏

（一）IM Aggregation简介

IM aggregation又称VECTOR aggregation或VECTOR GROUP BY aggregation，注这里的aggregation与GROUP BY并非指原查询有涉及聚合函数或是GROUP BY查询，而是IM aggregation的实现过程会涉及使用聚合函数及GROUP BY进行SQL transformation，在下面原理部分会详细说明

IM aggregation用于优化一个大表连结多个小表的查询

IM aggregation is also called vector aggregation and VECTOR GROUP BY aggregation.

IM aggregation optimizes query blocks involving aggregation and joins from a large table to multiple small tables.

使用IM aggregation前提是INMEMORY_SIZE值非零即可。另外IM aggregation不会对GROUP BY ROLLUP, GROUPING SETS以及CUBE操作生效。

IM aggregation requires INMEMORY_SIZE to be set to a nonzero value. However, IM aggregation does not require that the referenced tables be populated in the IM column store.

The optimizer does not choose VECTOR GROUP BY aggregations for GROUP BY ROLLUP, GROUPING SETS, or CUBE operations.

IM aggregation的原理是在join前对fact table通过dimension table过滤条件进行过滤

A typical analytic query aggregates from a fact table, and joins it to dimension tables. This type of query scans a large volume of data, with optional filtering, and performs a GROUP BY of between 1 and 40 columns. The first aggregation on the fact table processes the most rows.

Before Oracle Database 12c, the only GROUP BY operations were HASH and SORT. The VECTOR GROUP BY is an additional cost-based transformation that transforms a join between a dimension and fact table into a filter. The database can apply this filter during the fact table scan. The joins use key vectors, which are similar to Bloom filters, and the aggregation uses a VECTOR GROUP BY.

IM aggregation对行、列存储均适用，即它不依赖于IM column store。但它通过SIMD vector作用于In-Memory data更为高效

Both row-store tables and tables in the IM column store can benefit from IM aggregation.

Although vector transformations are independent of the IM column store, they can be applied very efficiently to In-Memory data through SIMD vector processing.

IM aggregation enables vector joins and GROUP BY operations to occur simultaneously with the scan of the large table. Thus, these operations aggregate as they scan, and do not need to wait for table scans and join operations to complete. IM aggregation optimizes CPU usage, especially the CPU cache.

IM aggregation can greatly improve query performance. The database can create a report outline dynamically, and then fill in report details during the scan of the fact table.

When IM Aggregation Is Useful

IM aggregation improves performance of queries that join relatively small tables to a relatively large fact table, and aggregate data in the fact table. This typically occurs in a star or snowflake query.

Example 7-1 VECTOR GROUP BY

Consider the following query, which performs a join of the customers dimension table with the sales fact table:

SELECT c.customer_id, s.quantity_sold, s.amount_sold

FROM customers c, sales s

WHERE c.customer_id = s.customer_id

AND c.country_id = 'FR';

When both tables are populated in the IM column store, the database can use SIMD vector processing to scan the row sets and apply filters.

这个例子比较特殊，过滤列正好是连接列，下面有非连接列也可对fact table进行过滤

The following figure shows how the query uses vector joins. The optimizer converts the predicate on the customers table, c.country_id='FR' into a filter on the sales fact table. The filter is country_id='FR'. Because sales is stored in columnar format, the query only needs to scan one column to determine the result.

When IM Aggregation Is Not Beneficial

IM aggregation benefits certain star queries when sufficient system resources exist. Other queries may receive little or no benefit. Specifically, VECTOR GROUP BY aggregation does not benefit performance in the following scenarios:

Joins are performed between two very large tables.

By default, the optimizer chooses a VECTOR GROUP BY transformation only if a relatively small table is joined to a relatively large table.

Dimensions contain more than 2 billion(20亿) rows.

The VECTOR GROUP BY transformation is not used if a dimension contains more than 2 billion rows Dimension

The system does not have sufficient memory.

Most databases that use the IM column store benefit from IM aggregation.

（二）In-Memory Aggregation原理

在通过small table对fact table过滤时按以下三个步骤进行，这三个步骤合称为一个DFO

VECTOR GROUP BY aggregation spends extra time processing the small tables up front to accelerate the per-row work performed on the large table. This optimization is possible because a typical analytic query distributes rows among processing stages:

Filtering tables and producing row sets
Joining row sets
Aggregating rows

The unit of work between stages is called a data flow operator (DFO).

VECTOR GROUP BY会对每个dimension创建一个key vector structure以及一个临时表

VECTOR GROUP BY aggregation uses a DFO for each dimension to create a key vector structure and temporary table. When aggregating measure columns from the fact table, the database uses this key vector to translate a fact join key to its dense grouping key. The late materialization step joins on the dense grouping keys to the temporary tables.

Optimizer Chooses IM Aggregation

IM aggregation的使用由优化器基于成本评估，在join key有较低的基数(如性别)时倾向使用

The optimizer decides whether to use vector transformation based on the size of the key vector (that is, the distinct join keys), the number of distinct grouping keys, and other factors. The optimizer tends to choose this transformation when dimension join keys have low cardinality.

Oracle Database uses VECTOR GROUP BY aggregation to perform data aggregation when the following conditions are met:

The queries or subqueries aggregate data from a fact table and join the fact table to one or more dimensions.

Multiple fact tables joined to the same dimensions are also supported assuming that these fact tables are connected only through joins to the dimension. In this case, VECTOR GROUP BY aggregates fact table separately and then joins the results on the grouping keys.

The dimensions and fact table are connected to each other only through join columns.

Specifically, the query must not have any other predicates that refer to columns across multiple dimensions or from both a dimension and the fact table. If a query performs a join between two or more tables and then joins the result to the fact, then VECTOR GROUP BY aggregation treats the multiple dimensions as a single dimension.

VECTOR GROUP BY aggregation does not support the following:

Semi-joins and anti-joins across multiple dimensions or between a dimension and the fact table
Equijoins across multiple dimensions
Aggregations performed using the DISTINCT function

Bloom filters与VECTOR GROUP BY是排它的，只能用一个

Bloom filters and VECTOR GROUP BY aggregation and are mutually exclusive. Therefore, if a query uses Bloom filters to join row sets, then VECTOR GROUP BY aggregation is not applicable to the processing of this query.

Key Vector

dense join key与dense grouping key概念

A key vector is a data structure that maps between dense join keys and dense grouping keys.

A dense key is a numeric key that is stored as a native integer and has a range of values. A dense join key represents all join keys whose join columns come from a particular fact table or dimension.

A dense grouping key represents all grouping keys whose grouping columns come from a particular fact table or dimension. A key vector enables fast lookups.

Example 7-2 Key Vector

hr.locations有country_id列的部分值如下：

Assume that the hr.locations tables has values for country_id as shown (only the first few results are shown):

SQL> SELECT country_id FROM locations;

如果条件使用了country_id='US'，则此过滤条件对应的key vector类似如下数组，其中1表示符合0表示不符合

A complex analytic query applies the filter WHERE country_id='US' to the locations table. A key vector for this filter might look like the following one-dimensional array:

In the preceding array, 1 is the dense grouping key for country_id='US'. The 0 values indicate rows in locations that do not match this filter. If a query uses the filter WHERE country_id IN ('US','JP'), then the array might look as follows, where 2 is the dense grouping key for JP and 1 is the dense grouping key for US:

Two Phases of IM Aggregation

第一步是所有dimension table并行执行的，第二步是对fact table的顺序过滤

Typically, VECTOR GROUP BY aggregation processes each dimension in sequence, and then processes the fact table. When performing IM aggregation, the database proceeds as follows:

Process each dimension sequentially as follows:

Find the unique dense grouping keys.
Create a key vector.
Create a temporary table (CURSOR DURATION MEMORY).

The following figure illustrates the steps in this phase, beginning with the scan of the dimension table in DFO 0, and ending with the creation of a temporary table. In the simplest form of parallel GROUP BY or join processing, the database processes each join or GROUP BY in its own DFO.

Figure 7-2 Phase 1 of In-Memory Aggregation

Process the fact table.

Process all the joins and aggregations using the key vectors created in the preceding phase.
Join back the results to each temporary table.

Figure 7-3 illustrates phase 2 in a join of the fact table with two dimensions. In DFO 0, the database performs a full scan of the fact table, and then uses the key vectors for each dimension to filter out nonmatching rows. DFO 2 joins the results of DFO 0 with DFO 1. DFO 4 joins the result of DFO 2 with DFO 3.

Figure 7-3 Phase 2 of In-Memory Aggregation

IM Aggregation: Scenario

This section gives a conceptual example of how VECTOR GROUP BY aggregation works.

4.1 Sample Analytic Query of a Star Schema

This sample star schema in this scenario contains the sales_online fact table and two dimension tables: geography and products.

Each row in geography is uniquely identified by the geog_id column. Each row in products is uniquely identified by the prod_id column. Each row in sales_online is uniquely identified by the geog_id, prod_id, and amount sold.

Table 7-1 Sample Rows in geography Table

country	state	city	geog_id
USA	WA	seattle	2
USA	WA	spokane	3
USA	CA	SF	7
USA	CA	LA	8

Table 7-2 Sample Rows in products Table

manuf	category	subcategory	prod_id
Acme	sport	bike	4
Acme	sport	ball	3
Acme	electric	bulb	1
Acme	electric	switch	8

Table 7-3 Sample Rows in sales_online Table

prod_id	geog_id	amount
8	1	100
9	1	150
8	2	100
4	3	110
2	30	130
6	20	400
3	1	100
1	7	120
3	8	130
4	3	200

A manager asks the business question, "How many Acme products in each subcategory were sold online in Washington, and how many were sold in California?" To answer this question, an analytic query of the sales_online fact table joins the products and geography dimension tables as follows:

SELECT p.category, p.subcategory, g.country, g.state, SUM(s.amount)

FROM sales_online s, products p, geography g

WHERE s.geog_id = g.geog_id

AND s.prod_id = p.prod_id

AND g.state IN ('WA','CA')

AND p.manuf = 'ACME'

GROUP BY category, subcategory, country, state

注意这里group by了p表的category/subcategory以及g表的country/state，下面的临时表中会分别对属于已表的列进行group by，相当于拆分了操作

4.2 Step 1: Key Vector and Temporary Table Creation for geography Dimension

In the first phase of VECTOR GROUP BY aggregation for this query, the database creates a dense grouping key for each city/state combination for cities in the states of Washington or California.

In Table 7-6, the 1 is the USA,WA grouping key, and the 2 is the USA,CA grouping key.

Table 7-4 Dense Grouping Key for geography

country	state	city	geog_id	dense_gr_key_geog
USA	WA	seattle	2	1
USA	WA	spokane	3	1
USA	CA	SF	7	2
USA	CA	LA	8	2

A key vector for the geography table looks like the array represented by the final column in Table 7-5. The values are the geography dense grouping keys. Thus, the key vector indicates which rows in sales_online meet the geography.state filter criteria (a sale made in the state of CA or WA) and which country/state group each row belongs to (either the USA,WA group or USA,CA group).

Table 7-5 Online Sales

prod_id	geog_id	amount	key vector for geography
8	1	100	0
9	1	150	0
8	2	100	1
4	3	110	1
2	30	130	0
6	20	400	0
3	1	100	0
1	7	120	2
3	8	130	2
4	3	200	1

Internally, the database creates a temporary table similar to the following:

CREATE TEMPORARY TABLE tt_geography AS

SELECT MAX(country), MAX(state), KEY_VECTOR_CREATE(...) dense_gr_key_geog

FROM geography

WHERE state IN ('WA','CA')

GROUP BY country, state

Table 7-6 shows rows in the tt_geography temporary table. The dense grouping key for the USA,WA combination is 1, and the dense grouping key for the USA,CA combination is 2.

Table 7-6 tt_geography

country	state	dense_gr_key_geog
USA	WA	1
USA	CA	2

4.3 Step 2: Key Vector and Temporary Table Creation for products Dimension

The database creates a dense grouping key for each distinct category/subcategory combination of an Acme product.

For example, in Table 7-7, the 4 is the dense grouping key for an Acme electric switch.

Table 7-7 Sample Rows in products Table

manuf	category	subcategory	prod_id	dense_gr_key_prod
Acme	sport	bike	4	1
Acme	sport	ball	3	2
Acme	electric	bulb	1	3
Acme	electric	switch	8	4

A key vector for the products table might look like the array represented by the final column in Table 7-8. The values represent the products dense grouping key. For example, the 4 represents the online sale of an Acme electric switch. Thus, the key vector indicates which rows in sales_online meet the products filter criteria (a sale of an Acme product).

Table 7-8 Key Vector

prod_id	geog_id	amount	key vector for products
8	1	100	4
9	1	150	0
8	2	100	4
4	3	110	1
2	30	130	0
6	20	400	0
3	1	100	2
1	7	120	3
3	8	130	2
4	3	200	1

Internally, the database creates a temporary table similar to the following:

CREATE TEMPORTARY TABLE tt_products AS

SELECT MAX(category), MAX(subcategory), KEY_VECTOR_CREATE(...) dense_gr_key_prod

FROM products

WHERE manuf = 'ACME'

GROUP BY category, subcategory

Table 7-9 shows rows in this temporary table.

Table 7-9 tt_products

category	subcategory	dense_gr_key_prod
sport	bike	1
sport	ball	2
electric	bulb	3
electric	switch	4

4.4 Step 3: Key Vector Query Transformation

In this phase, the database processes the fact table.

The optimizer transforms the original query into the following equivalent query, which accesses the key vectors:

SELECT KEY_VECTOR_PROD(prod_id),

KEY_VECTOR_GEOG(geog_id),

SUM(amount)

FROM sales_online

WHERE KEY_VECTOR_PROD_FILTER(prod_id) IS NOT NULL

AND KEY_VECTOR_GEOG_FILTER(geog_id) IS NOT NULL

GROUP BY KEY_VECTOR_PROD(prod_id), KEY_VECTOR_GEOG(geog_id)

The preceding transformation is not an exact rendition of the internal SQL, which is much more complicated, but a conceptual representation designed to illustrate the basic concept.

4.5 Step 4: Row Filtering from Fact Table

This phase obtains the amount sold for each combination of grouping keys.

The database uses the key vectors to filter out unwanted rows from the fact table. In Table 7-10, the first three columns represent the sales_online table. The last two columns provide the dense grouping keys for the geography and products tables.

Table 7-10 Dense Grouping Keys for the sales_online Table

prod_id	geog_id	amount	dense_gr_key_prod	dense_gr_key_geog
7	1	100	4
9	1	150
8	2	100	4	1
4	3	110	1	1
2	30	130
6	20	400
3	1	100	2
1	7	120	3	2
3	8	130	2	2
4	3	200	1	1

As shown in Table 7-11, the database retrieves only those rows from sales_online with non-null values for both dense grouping keys, indicating rows that satisfy all the filtering criteria.

Table 7-11 Filtered Rows from sales_online Table

geog_id	prod_id	amount	dense_gr_key_prod	dense_gr_key_geog
2	8	100	4	1
3	4	110	1	1
3	4	200	1	1
7	1	120	3	2
8	3	130	2	2

4.6 Step 5: Aggregation Using an Array

The database uses a multidimensional array to perform the aggregation.

In Table 7-12, the geography grouping keys are horizontal, and the products grouping keys are vertical. The database adds the values in the intersection of each dense grouping key combination. For example, for the intersection of the geography grouping key 1 and the products grouping key 1, the sum of 110 and 200 is 310.

Table 7-12 Aggregation Array

dgkp/dgkg	1	2
1	110,200
2		130
3		120
4	100

4.7 Step 6: Join Back to Temporary Tables

In the final stage of processing, the database uses the dense grouping keys to join back the rows to the temporary tables to obtain the names of the regions and categories.

The results look as follows:

CATEGORY SUBCATEGORY COUNTRY STATE AMOUNT

-------- ----------- ------- ----- ------

electric bulb USA CA 120

electric switch USA WA 100

sport ball USA CA 130

sport bike USA WA 310

（三）IM Aggregation操作

IM aggregation is integrated with the optimizer. No new SQL or initialization parameters are required. IM aggregation does not need additional indexes, foreign keys, or dimensions.

You can use the following pairs of hints:

Query block hints

VECTOR_TRANSFORM enables the vector transformation on the specified query block, regardless of costing. NO_VECTOR_TRANSFORM disables the vector transformation from engaging on the specified query block.

Table hints

You can use the following pairs of hints:

VECTOR_TRANSFORM_FACT includes the specified FROM expressions in the fact table generated by the vector transformation. NO_VECTOR_TRANSFORM_FACT excludes the specified FROM expressions from the fact table generated by the vector transformation.
VECTOR_TRANSFORM_DIMS includes the specified FROM expressions in enabled dimensions generated by the vector transformation. NO_VECTOR_TRANSFORM_DIMS excludes the specified from expressions from enabled dimensions generated by the vector transformation.

（四）In-Memory Aggregation示例

In this example, the business question is "How many products were sold in each category in each calendar year?"

You write the following query, which joins the times, products, and sales tables:

SELECT t.calendar_year, p.prod_category, SUM(quantity_sold)

FROM times t, products p, sales s

WHERE t.time_id = s.time_id

AND p.prod_id = s.prod_id

GROUP BY t.calendar_year, p.prod_category;

Example 7-3 VECTOR GROUP BY Execution Plan

The following example shows the execution plan contained in the current cursor. Steps 4 and 8 show the creation of the key vectors for the dimension tables times and products. Steps 17 and 18 show the use of the previously created key vectors. Steps 3, 7, and 15 show the VECTOR GROUP BY operations.

SQL_ID 0yxqj2nq8p9kt, child number 0

-------------------------------------

SELECT t.calendar_year, p.prod_category, SUM(quantity_sold) FROM

times t, products p, sales f WHERE t.time_id = f.time_id AND

p.prod_id = f.prod_id GROUP BY t.calendar_year, p.prod_category

Plan hash value: 2377225738

---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------

| 0|SELECT STATEMENT | | | |285(100)| | | |

| 1| TEMP TABLE TRANSFORMATION | | | | | | | |

| 3| VECTOR GROUP BY | | 5| 80 | 3(100)|00:00:01| | |

| 4| KEY VECTOR CREATE BUFFERED | :KV0000 |1826|29216| 3(100)|00:00:01| | |

| 5| TABLE ACCESS INMEMORY FULL | TIMES |1826|21912| 1(100)|00:00:01| | |

| 7| VECTOR GROUP BY | | 5| 125 | 1(100)|00:00:01| | |

| 8| KEY VECTOR CREATE BUFFERED | :KV0001 | 72| 1800| 1(100)|00:00:01| | |

| 9| TABLE ACCESS INMEMORY FULL | PRODUCTS | 72| 1512| 0 (0)| | | |

|10| HASH GROUP BY | | 18| 1440|282 (99)|00:00:01| | |

|11| HASH JOIN | | 18| 1440|281 (99)|00:00:01| | |

|12| HASH JOIN | | 18| 990 |278(100)|00:00:01| | |

|13| TABLE ACCESS FULL |SYS_TEMP_0FD9D6644_11CBE8| 5| 80 | 2 (0)|00:00:01| | |

|14| VIEW | VW_VT_AF278325 | 18| 702 |276(100)|00:00:01| | |

|15| VECTOR GROUP BY | | 18| 414 |276(100)|00:00:01| | |

|16| HASH GROUP BY | | 18| 414 |276(100)|00:00:01| | |

|17| KEY VECTOR USE | :KV0000 |918K| 20M|276(100)|00:00:01| | |

|18| KEY VECTOR USE | :KV0001 |918K| 16M|272(100)|00:00:01| | |

|19| PARTITION RANGE ALL | |918K| 13M|257(100)|00:00:01|1|28|

|20| TABLE ACCESS INMEMORY FULL| SALES |918K| 13M|257(100)|00:00:01|1|28|

|21| TABLE ACCESS FULL |SYS_TEMP_0FD9D6645_11CBE8| 5 | 125| 2 (0)|00:00:01| | |

---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

11 - access("ITEM_10"=INTERNAL_FUNCTION("C0") AND "ITEM_11"="C2")

12 - access("ITEM_8"=INTERNAL_FUNCTION("C0") AND "ITEM_9"="C2")

Note

-----

- vector transformation used for this statement

45 rows selected.

彦祖的小号

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Optimizing Joins with In-Memory Aggregation

（一）IM Aggregation简介IM aggregation又称VECTOR aggregation或VECTOR GROUP BYaggregation，注这里的aggregation与GROUP BY并非指原查询有涉及聚合函数或是GROUP BY查询，而是IM aggregation的实现过程会涉及使用聚合函数及GROUP BY进行SQL transformation，在下面原理部分会详细说明IM aggregation用于优化一个大表连结多个小表的查询IM aggregatio...
复制链接

扫一扫

专栏目录