Learning hash join

cuibizhang4005

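------------------Hash join concepts
Optimal hash join: read the build table and hash its rows into buckets. If all of the build table's data fits in memory, the join is an optimal hash join.
One-pass hash join: read the build table and hash its rows into buckets. If the build table's data does not all fit in memory, the overflow is written to temp.
               Once the build table has been read, the probe table is read according to a certain algorithm. Probe rows that may match the build rows already written to temp are written to temp as well, organized by partition.
               The build data in temp is then read back one partition at a time. If each partition's build data can be loaded into memory and processed in a single read, the join is a one-pass hash join.
Multi-pass hash join: this builds on the one-pass case. If a partition's build data cannot be read into memory in one go, it is split into several parts that are read one at a time.
               After each part is read, the probe data for the same partition is read from temp and matched against it. Because every part of the build data requires reading all of that partition's probe data, the probe data is read n times, hence the name multi-pass hash join.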
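
The three cases above differ only in how the build data compares with the available memory. The sketch below is a simplified illustration of that decision, not Oracle's actual logic: the function names are hypothetical, and it assumes build rows spread evenly across partitions.

```python
import math


def classify_hash_join(build_bytes, memory_bytes, num_partitions):
    """Classify a hash join using the definitions above (simplified).

    Assumption: build rows are spread evenly across partitions.
    """
    if build_bytes <= memory_bytes:
        # The whole build table fits in memory.
        return "optimal"
    partition_bytes = math.ceil(build_bytes / num_partitions)
    if partition_bytes <= memory_bytes:
        # Each spilled partition can be reloaded in a single read.
        return "one-pass"
    # A partition must be split into parts, so its probe data is re-read.
    return "multi-pass"


def probe_reads_per_partition(build_bytes, memory_bytes, num_partitions):
    """How many times one partition's probe data is read (the n above)."""
    partition_bytes = math.ceil(build_bytes / num_partitions)
    return max(1, math.ceil(partition_bytes / memory_bytes))


print(classify_hash_join(1_000, 4_000, 8))    # build fits entirely
print(classify_hash_join(16_000, 4_000, 8))   # each partition fits
print(classify_hash_join(64_000, 4_000, 8))   # partitions are too large
```

In the last call each partition holds 8,000 bytes against 4,000 bytes of memory, so the build data is read in two parts and the probe data for that partition is read twice.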

-------Notes on hash join execution plans
----Example 1
access("A"."SITE_ID"="B"."SITE_ID")
       filter("A"."CATEGORY"="B"."FROM_CAT" OR
              "A"."CATEGORY2"="B"."FROM_CAT")
This means the hash is built on the "A"."SITE_ID"="B"."SITE_ID" columns, and category, category2 and from_cat are applied as a filter.
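
The difference between the access and filter predicates can be sketched in a few lines. This is a minimal illustration, not how Oracle implements it: the `hash_join` helper, the choice of A as the build side, and the sample rows are all made up for the example. The access predicate decides which columns form the hash key; the filter predicate is evaluated afterwards on each pair of rows that share a key.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key, residual=None):
    """Toy in-memory hash join: key functions play the role of the
    access predicate, `residual` plays the role of the filter predicate."""
    table = defaultdict(list)
    for b in build_rows:
        table[build_key(b)].append(b)
    result = []
    for p in probe_rows:
        for b in table.get(probe_key(p), []):
            if residual is None or residual(b, p):
                result.append((b, p))
    return result


# Hypothetical sample data.
a_rows = [
    {"SITE_ID": 1, "CATEGORY": "x", "CATEGORY2": "y"},
    {"SITE_ID": 2, "CATEGORY": "x", "CATEGORY2": "z"},
]
b_rows = [
    {"SITE_ID": 1, "FROM_CAT": "y"},
    {"SITE_ID": 1, "FROM_CAT": "q"},
    {"SITE_ID": 2, "FROM_CAT": "x"},
]

# Hash on SITE_ID only; the OR condition is checked per matching pair.
key_then_filter = hash_join(
    a_rows, b_rows,
    build_key=lambda a: a["SITE_ID"],
    probe_key=lambda b: b["SITE_ID"],
    residual=lambda a, b: a["CATEGORY"] == b["FROM_CAT"]
    or a["CATEGORY2"] == b["FROM_CAT"],
)

# Both equalities in the hash key; no separate filter step is needed.
composite_key = hash_join(
    a_rows, b_rows,
    build_key=lambda a: (a["CATEGORY"], a["SITE_ID"]),
    probe_key=lambda b: (b["FROM_CAT"], b["SITE_ID"]),
)

print(len(key_then_filter), len(composite_key))
```

An OR condition cannot be folded into a single hash key, which is why it shows up as a filter; when all join conditions are ANDed equalities, they can all go into the key.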
-----Example 2
access("A"."CATEGORY"="B"."FROM_CAT" AND
              "A"."SITE_ID"="B"."SITE_ID")
This means the hash is built on the columns in "A"."CATEGORY"="B"."FROM_CAT" AND "A"."SITE_ID"="B"."SITE_ID".
---------The following explains the key parts of a 10104 trace
Join Type: INNER join
Original hash-area size: 4374014 : the value of hash_size
Memory for slot table: 2949120   : memory available for creating buckets
Calculated overhead for partitions and row/slot managers: 1424894
Hash-join fanout: 8
Number of partitions: 8 : number of partitions
Number of slots: 24      : number of slots (clusters)
Multiblock IO: 15        : blocks read per I/O, which is also the number of blocks in one cluster; the system reads and writes in units of clusters
Block size(KB): 8        : block size
Cluster (slot) size(KB): 120 : cluster size
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 192
Estimated build size (KB): 0   : how much more memory is needed to run an optimal join; this example is already optimal, so the value is 0
Estimated Build Row Length (includes overhead): 15
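
Several of the trace figures above can be derived from one another. The short check below uses only numbers copied from the trace; the relationships are inferred from how the figures line up in this example and are not taken from Oracle documentation, so treat them as a reading aid rather than a specification.

```python
# Values copied from the trace above.
hash_area_size = 4374014       # Original hash-area size (bytes)
slot_table_memory = 2949120    # Memory for slot table (bytes)
overhead = 1424894             # Calculated overhead (bytes)
multiblock_io = 15             # blocks per cluster (slot)
block_size_kb = 8
num_partitions = 8
bit_vector_kb = 128

# One cluster (slot) is multiblock_io blocks.
slot_size_kb = multiblock_io * block_size_kb

# The slot table memory divided by the slot size gives the slot count.
num_slots = slot_table_memory // (slot_size_kb * 1024)

# In this trace, hash area = slot table memory + overhead.
remaining = hash_area_size - slot_table_memory

# The bit vector memory is split evenly across partitions.
per_partition_bit_vector_kb = bit_vector_kb // num_partitions

print(slot_size_kb, num_slots, remaining, per_partition_bit_vector_kb)
```

The four values printed correspond to the trace lines Cluster (slot) size(KB), Number of slots, Calculated overhead, and Per partition bit vector length(KB).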

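----Partition information
Total number of partitions: 8 : total number of partitions
Number of partitions which could fit in memory: 8 : number of partitions that can fit in memory
Number of partitions left in memory: 8 : number of partitions kept in memory
Total number of slots in in-memory partitions: 7 : total number of slots in memory
Total number of rows in in-memory partitions: 19 : total number of rows in memory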
(used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 113040

### Partition Distribution ###
Partition:0    rows:0          clusters:0      slots:0      kept=1
Partition:1    rows:1          clusters:1      slots:1      kept=1
Partition:2    rows:2          clusters:1      slots:1      kept=1
Partition:3    rows:5          clusters:1      slots:1      kept=1
Partition:4    rows:3          clusters:1      slots:1      kept=1
Partition:5    rows:4          clusters:1      slots:1      kept=1
Partition:6    rows:1          clusters:1      slots:1      kept=1
Partition:7    rows:3          clusters:1      slots:1      kept=1
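Partition: partition number
rows: number of rows in the partition
clusters: number of clusters occupied by all rows of the partition
slots: number of the partition's clusters currently in memory
kept: whether the partition is still in memory (1 = yes, 0 = no)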
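
The per-partition lines can be cross-checked against the summary figures reported earlier in the trace. The sketch below parses the Partition Distribution lines copied from above with a regular expression written for this example (it is an assumption that other trace versions use the same layout) and adds up the columns.

```python
import re

# Partition Distribution lines copied from the trace above.
TRACE = """\
Partition:0    rows:0          clusters:0      slots:0      kept=1
Partition:1    rows:1          clusters:1      slots:1      kept=1
Partition:2    rows:2          clusters:1      slots:1      kept=1
Partition:3    rows:5          clusters:1      slots:1      kept=1
Partition:4    rows:3          clusters:1      slots:1      kept=1
Partition:5    rows:4          clusters:1      slots:1      kept=1
Partition:6    rows:1          clusters:1      slots:1      kept=1
Partition:7    rows:3          clusters:1      slots:1      kept=1
"""

LINE = re.compile(
    r"Partition:(\d+)\s+rows:(\d+)\s+clusters:(\d+)\s+slots:(\d+)\s+kept=(\d)"
)


def parse_partitions(text):
    """Return one dict per Partition line found in `text`."""
    fields = ("partition", "rows", "clusters", "slots", "kept")
    return [dict(zip(fields, map(int, m.groups()))) for m in LINE.finditer(text)]


partitions = parse_partitions(TRACE)
total_rows = sum(p["rows"] for p in partitions)
total_slots = sum(p["slots"] for p in partitions)
kept_in_memory = sum(p["kept"] for p in partitions)

print(total_rows, total_slots, kept_in_memory)
```

The totals agree with the trace: 19 rows and 7 slots in in-memory partitions, with all 8 partitions kept in memory.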
----Build table bucket information
Number of buckets with   0 rows:         17
Total buckets: 32 Empty buckets: 17 Non-empty buckets: 15
Total number of rows: 19
Maximum number of rows in a bucket: 2
----Explanation:
Total buckets: total number of buckets, 32
Empty buckets: number of empty buckets, 17
Non-empty buckets: number of non-empty buckets, 15
Total number of rows: total number of rows, 19
Maximum number of rows in a bucket: the largest number of rows in any single bucket, 2
Number of buckets with   0 rows:         17
means that 17 buckets contain 0 rows
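
To make the bucket statistics concrete, the sketch below computes the same figures for a toy hash table. The hash function (key modulo bucket count) and the 19 integer keys are invented so that the output reproduces the numbers in the trace; Oracle's real hash function and the actual join keys are not shown in the source.

```python
from collections import Counter


def bucket_stats(keys, num_buckets):
    """Distribute integer keys over buckets and summarize the result.

    Assumption: a toy hash, bucket = key % num_buckets.
    """
    counts = [0] * num_buckets
    for key in keys:
        counts[key % num_buckets] += 1
    return {
        "total_buckets": num_buckets,
        "empty_buckets": counts.count(0),
        "non_empty_buckets": num_buckets - counts.count(0),
        "total_rows": sum(counts),
        "max_rows_in_bucket": max(counts),
        # Maps "rows per bucket" to "number of buckets with that many rows".
        "histogram": dict(Counter(counts)),
    }


# 19 made-up keys: 0..14 fill 15 buckets, 32..35 collide with buckets 0..3.
keys = list(range(15)) + [32, 33, 34, 35]
stats = bucket_stats(keys, num_buckets=32)
print(stats)
```

With 19 rows in 15 non-empty buckets and at most 2 rows per bucket, exactly 4 buckets must hold 2 rows and 11 must hold 1 row, which is what the histogram shows.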
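-----Multi-pass hash join
Getting a pair of flushed partions. (one of the partitions; the information for every partition has the same format)
BUILD PARTION: nrows:42728 size=(175 slots, 1400K) ---number of build table rows in one partition written to disk
PROBE PARTION: nrows:42728 size=(175 slots, 1400K) ---number of probe table rows in one partition written to disk
175 slots: the number of slots these rows occupy (in this example one slot is one block, so 175 blocks)
Number of blocks that may be used to build the hash hable 8 : the number of blocks available for building the hash table. Because the partition has 175 blocks, it has to be read in 175/8 = 21.875, rounded up to 22, passes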

Number of rows left to be iterated over (start of function): 42728 : rows not yet processed
Number of rows iterated over this function call: 2100 : rows processed in this call
Number of rows left to be iterated over (end of function): 40628 : rows still waiting to be processed
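
The multi-pass behaviour for one pair of flushed partitions can be sketched as a loop over parts of the build partition, re-reading the probe partition for each part. The function name, the tuple row layout, and the sample data are hypothetical; the two arithmetic checks at the end use the figures from the trace above.

```python
import math


def multipass_join(build_rows, probe_rows, rows_per_part, key=lambda r: r[0]):
    """Join one build/probe partition pair when the build side does not
    fit in memory: hash one part at a time and re-read the probe rows."""
    matches = []
    probe_reads = 0
    for start in range(0, len(build_rows), rows_per_part):
        part = build_rows[start:start + rows_per_part]
        table = {}
        for b in part:
            table.setdefault(key(b), []).append(b)
        probe_reads += 1  # the whole probe partition is scanned again
        for p in probe_rows:
            for b in table.get(key(p), []):
                matches.append((b, p))
    return matches, probe_reads


# Hypothetical data: 10 build rows, memory for 4 rows at a time.
build = [(i, "build") for i in range(10)]
probe = [(i, "probe") for i in range(0, 10, 2)]
matches, probe_reads = multipass_join(build, probe, rows_per_part=4)
print(len(matches), probe_reads)

# Figures from the trace: 175 blocks read 8 blocks at a time.
passes = math.ceil(175 / 8)
# Rows left after one call that processed 2100 of 42728 rows.
rows_left = 42728 - 2100
print(passes, rows_left)
```

The toy run needs three parts, so the probe rows are scanned three times while each matching pair is still produced exactly once.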

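From "ITPUB博客" (ITPUB Blog), link: http://blog.itpub.net/69265/viewspace-464315/. If you repost this article, please credit the source; otherwise legal action will be pursued.

Reposted from: http://blog.itpub.net/69265/viewspace-464315/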
