1:数据分布(Data Distribution)
Certain PEs support multiplexor channels and others provide LAN support.
数据一般总是以8-bit ASCII码形式存储在AMP中。如果以EDCDIC形式输入,PE会在哈希化和分布发生前将其转换为ASCII
Teradata没有预分配表空间的概念。所有哈希分布的表的行通过哈希值分布到所有的AMP中存储。
(Data)FROM HOST--->(Parsing Engine)CONVERTED AND HASHED--> (BYNET)DISTRIBUTED-->(AMP)FORMATTED-->(Vdisk)STORED
2:哈希化
哈希化是TeradataRDBMS利用的通过主索引值来分布数据行的机制.
It takes up to 16 columns of mixed data as input and generates a single 32-bit binary value called a Row Hash.
Row Hash是行的逻辑存储位置。这个值的一部分用来定位AMP。
Teradata使用Row Hash值来分布,放置和获取行。
哈希算法是随机的但一致的。虽然连续的PI值不会产生连续的哈希值,同样的主索引值却总是得到相同的Row Hash值。相同Row Hash值的行总是分布在同一个AMP中。
注意:一个计算出的值并不能说是真正的随机的。不同的PI值很少产生相同的Row Hash值。当这情况发生的话,那就是哈希同义或者哈希冲突
Input to the algorithm is the Primary Index (PI) value of a row.
The Hashing Algorithm creates a fixed length value from any length input string.
The output from the algorithm is the Row Hash:
– A 32-bit binary value.
– The logical storage location of the row.
– Used to identify the AMP of the row.
– Table ID + Row Hash is used to locate the Cylinder and Data Block.
– Used for distribution, placement, and retrieval of the row.
Row Hash uniqueness depends directly on PI uniqueness.
主索引值的唯一性决定了Row Hash值的唯一性,而Row Hash值的唯一性决定了数据分布性。
Good data distribution depends directly on Row Hash uniqueness.
The algorithm produces (approximately) random,but consistent, Row Hashes.
The same PI value and data type combination always hash identically.
Rows with the same Row Hash will always go to the same AMP.
Different PI values rarely produce the same Row Hash (Collisions).
----------------------------------------------------------------
3:多列哈希化
在处理多列组成的索引哈希算法使用加和乘来作为可交换操作
算法将不同的数字数据类型哈希化出同样数字类型值Row Hash
PI = (A, B)
=Hash(A)*Hash(B)+Hash(A)+Hash(B)=Hash (B) * Hash (A) + Hash (B) + Hash (A)
例子:
A B Hash (A) * Hash (B) + Hash (A) + Hash (B)
1 100 1 100 1 100 =201
2 99 2 99 2 99 =299
-----------------------------------------------------------------
4:主索引哈希映射
主索引值通过哈希算法产生Row Hash.行通过连接层。通过哈希图确定哪个AMP接收该行。哈希图入口由BYNET维护。Hash Map entries are maintained by the BYNET.
-------------------------------------------------------------
5:哈希图
哈希图是确定哪个AMP获取行的机制。它们在系统的每个节点上被复制。有四种哈希图
[1]Current Configuration Primary(指明行存储在哪里)
[2]Current Configuration Fallback(指明行的拷贝存储在哪里)
[3]Reconfiguration Primary(指明在系统重新配置后行移动到哪里)
[4]Reconfiguration Fallback(指明行的拷贝在系统重新配置后移动到哪里)
每个节点上都有四个哈希图
每个哈希图是65536个入口的排列
The Communications Layer Interfaces通过设计好的哈希图检查着所有的进入信息
Only the AMP whose number appears in the referenced Hash Map entry is interrupted.
----------------------------------------------------------------
6:主哈希图
The Fallback Hash Map IS NOT an exact copy of the Primary Hash Map.
Row Hash的前16 bits是DSW(目的地选择词)
DSW在哈希图中指向一个入口
The referenced Hash Map entry identifies the AMP for the row hash.
“Destination Selection Word” is also referred to as “Bucket Number.”
The Primary Hash Map identifies which AMP the first (Primary) copy of a row belongs to.
The Communications Layer只使用Row Hash中的DSW来确定哪个AMP接收行。而AMP使用全部32 bit的Row Hash来确定行的逻辑磁盘存储位置
系统具有相同数量的AMP的主哈希图是一样的
Fallback Hash Maps differ due to clustering differences at each site.
----------------------------------------------------------------
7:数据分布--------Teradata RDBMS for UNIX
潜在问题:
Skewing of Hash Bucket Distribution:Hash Bucket 分布的滞后
Row Hash Synonyms (with NUPIs):哈希同义
Row Hash Collisions:哈希冲突
Hashing Numeric Values:哈希化数量值
Redistribution with Joins:连接的重新分布
----------------------------------------------------------------
8:哈希关系表达式Hash-Related Expressions
TeradataRDBMS包含了对SQL的扩充->Teradata SQL.
HASHROW (column(s))--------->Returns the row hash value of a given sequence
返回给定序列的哈希值
eg:
SELECT COUNT(*)/ COUNT (DISTINCT(HASHROW (C1,C2)))
(FLOAT) FROM T;
HASHBUCKET (hashrow)---->The grouping for the specific hash value
eg:
SELECT HASHBUCKET (HASHROW(C1,C2)), COUNT(*) FROM T
GROUP BY 1 ORDER BY 1;
HASHAMP (hashbucket)---->The AMP that owns the hash bucket
eg:
SELECT HASHAMP (HASHBUCKET (HASHROW (C1,C2))),
COUNT(*) FROM T3 GROUP BY 1 ORDER BY 1;
HASHBAKAMP (hashbucket)-->The fallback AMP that owns the hash bucket
----------------------------------------------------------------------
9:哈希函数
[1]确定the Hash Buckets
如果你怀疑由于哈希冲突而导致的数据滞后,你能使用HASHBUCKET函数来确定每个hash bucket中的行数量.
The hashbucket function requires the HashRow of the columns that make up the primary index or are being considered for the primary index.
[2]确定主AMP
一旦你确定由于哈希冲突而导致数据滞后存在,使用hashbucket来确定拥有过量hash bucket的AMP
eg:计算选定主索引的NUPI重复和同义冲突的影响:
SELECT hashrow (empno, deptno) as “Hash Value”,
count (*)
FROM employee
GROUP BY 1
ORDER BY 2 DESC;
Hash Value Count(*)
63524 27
89087 27
(Portion of output cut due to length…)
863 1
28465 1
eg:预测主索引选择后的AMP分布情况
SELECT hashamp (hashbucket (hashrow (empno, deptno)))
as “AMP”, Count (*)
FROM employee
GROUP BY 1
ORDER BY 2 DESC;
AMP Count(*)
33 3467
28 3337
(Portion of output cut due to length…)
3 3113
29 3101
----------------------------------------------------
10:不平均数据分布的影响
数据行的分布基于主索引的哈希值
在一个袭用中Hash buckets平均分布到节点中的AMP中
歪斜的AMP中的数据分布会影响节点CPU的并行处理效率
Node CPU utilization will reflect these differences because a node is only as fast as its slowest AMP.
---------------------------------------------------
11:数据分布:另外的问题
数据是基于Row Hash code来分布的。Row hash code的hash bucket部分定位了哪个AMP来存储行。
对于连接组合的列,Row hash value是连接列计算出来的.
列的连接只能发生在连接列的Row hash value在同一个AMP上
For joins on columns, row hash value is computed for the join columns.
Joins on columns can take place only when the row hash value of the columns are on the same AMP.
When join columns are different from the primary index,the rows must be redistributed. This is a necessary overhead.
-------------------------------------------------------------
12:歪斜的连接和聚类处理
如果你的数据没有歪斜但是系统并行处理效率还是很差的话,你应该在连接或者聚类处理过程中寻找歪斜的标识。
poor parallel efficiency occurs when the join field is highly skewed.
Skewed Join and Aggregation Processing
Note: Skewed processing can also occur with Referential Integrity when the referencing
column has skewed demographics, e.g., referenced column is city code.
Join:连接
SELECT ...
FROM T1, T2
WHERE T1.city = T2.city;
Aggregation:聚类
SELECT . . .
FROM T1
WHERE . . .
GROUP BY city;
------------------------------------------------------
13:重新配置Reconfiguration
Reconfiguration:是对哈希图的重新设置控制并改变系统中AMP数量的过程
系统重新设置哈希图通过重新标识哈希入口来映射新的AMP配置。
This is done in a way that minimizes the number of rows (and Hash Map Entries) reassigned to a new AMP.
在行以后完成以后,主哈希图的重新配置成为了主哈希图的当前配置。并且备份哈希图的重新配置变成了当前备份哈希图的当前配置。
系统创建新的哈希图来容纳新配置
Old and new maps are compared.(新哈希图和老哈希图是可比的)
Each AMP reads its rows, and moves only those that hash to a new AMP.
It is not necessary to offload and reload data due to a reconfiguration.
不需要数据重组
移动到新AMP中的行的百分比=新的AMP数量/老的+新的AMP的总和
------------------------------------------------------------
14:使用主索引
SELECT语句:SELECT * FROM tablename WHERE primaryindex = value(s)
|
|
分析器输出三部分Table ID,Row Hash和Primary Index value组成的信息.
48 bit Table ID从数据字典中查得。
32 bit Row Hash value由哈希算法产生
Primary Index value来自于SQL请求
|
|
BYNETs使用(16bit)DSW来定位AMP to interrupt and pass on the message.
|
|
AMP使用Table ID和Row Hash 来确定和定位合适的数据块,然后使用Row Hash和PI值来定位到指定的行.PI值用来区别哈希同义
Only the AMP whose number appears in the referenced
Hash Map is interrupted.
-------------------------------------------------------------
15:Row ID
光靠Row Hash并不足以确定出表中的某一行。由于Row Hash是基于主索引值的,所以多行可以有相同的Row Hash.这会导致哈希同义或者NUPI重复。
Row ID=(32 bit)Row Hash+(32 bit)Uniqueness Value
For Primary Index retrievals, only the Row Hash and Primary Index values are needed to find the qualifying row(s).
主索引中,使用Row Hash+PI值来确定行
The Uniqueness Value is needed for Secondary Index support.
次索引中,使用Row Hash+Uniqueness Value.
Teradata使用Row ID作为次索引的指示器
在插入的时候,系统同时存储了数据值和Row ID
ROW ID = ROW HASH and UNIQUENESS VALUE
Row Hash:
Row Hash is based on Primary Index value.
Multiple rows in a table could have the same Row Hash.
NUPI duplicates and hash synonyms have the same Row Hash.
Uniqueness Value:
Type system creates a numeric 32-bit Uniqueness Value.
The first row for a Row Hash has a Uniqueness Value of 1.
Additional rows have ascending Uniqueness Values.
Row IDs determine sort sequence within a Data Block.决定排序顺序
Row IDs support Secondary Index performance.
The Row ID makes every row within a table uniquely identifiable.
Duplicate Rows:
Row ID uniqueness does not imply data uniqueness.
Row ID的唯一性并不说明了数据的唯一性
------------------------------------------------------------
16:定位行
DSW定位了哈希图的入口
哈希图的入口标识了一个配置中的特定AMP
[1]一旦目标AMP被确定了,AMP中的Master Index用来标识出Cylinder Index
[2]Cylinder Index标识出目标数据块
[3]通过三段信息(Table ID,Row Hash,PI)来定位到数据块中的行
The AMP accesses its Master Index.-->The Master Index is always memory-resident.(Master Index总是驻留内存的)
An entry in the Master Index标识了一个索引列和它的当前版本
The Cylinder Index may or may not be memory resident.
An entry in the Cylinder Index标识出数据块
数据块是物理输入输出设备可以驻留内存也可以不驻留内存
A search of the Data Block(s) locates the row(s).
AMP MEMORY
|---------------------|
| Master Index |
|_____________________|
| Cylinder Index Cache|
|_____________________|
| Data Cache |
|_____________________|
----------------------------------------------------------------
17:表ID Table ID
Table ID是三部分信息的第一部分.是分析器提供的48 bit number
Table ID由两个主要成分组成:
[1]The first component of the Table ID is the Unique Value.
[2]The second component of the Table ID is known as the Subtable ID.
Table ID=UNIQUE VALUE (32 bit)+ SUB-TABLE ID(16 bit)
The Subtable ID is a 16-bit value that tells the file system which type of blocks to search for.
表,视图和宏的Unique Value都来自于数据字典中的DBC.Next.
The Table ID, together with the Row ID, gives Teradata a way to uniquely
identify every single row in the entire system.
Unique Value for Tables, Views, and Macros comes from DBC.Next dictionary table.
有时候会将Unique Value直接代表Table ID使用
Unique Value一般定义表的类型:
普通数据表
永久日志
Spool文件或者volatile table
Subtable ID identifies what part of a table the system
is looking at:
Table Header
Primary data rows
Fallback data rows
First secondary index primary rows
First secondary index fallback rows
Second secondary index primary rows
Second secondary index fallback rows
Third secondary index primary rows
and so on…
Table ID加上了Row ID使得每个行在系统中都是唯一的
--------------------------------------------------------
18:Master Index Foramt
Master Index是一张显示在该AMP中的所有data cylinder的表
Entries in the Master Index are sorted by the first two columns that show
the lowest Table ID and Row Hash that can be found on the cylinder.
不包含数据的Cylinder不在Master Index中显示
Master Index中的关键元素:
Header
Cylinder Index Descriptors(CIDs)
Cylinder Index Descriptor Reference Array(CID Ref.Array)
Master Index defines all cylinders in use for the AMP.
Three elements are Header, CID, and CID Ref. Array.
CIDs reside in the Master Index Heap.
CIDs define First Table ID/Row ID and Last Table ID/Row Hash.
There is a CID Ref. Array pointer for every CID entry.
Each master index entry contains the following data sorted on Table ID and row hash:[1]Lowest Table ID in the cylinder
[2]Lowest RowID value on the cylinder (associated with the lowest Table ID)
[3]Highest Table ID in the cylinder
[4]Highest row hash (not RowID) value on the cylinder (associated with the highest Table ID)
[5]Cylinder number
系统使用Table ID和RowID(或者row hash value)来检索master index从而获得cylinder number
-----------------------------------------------------------------
19:Master Index
Teradata使用Master Index 来定位包含符合的Table ID和Row Hash的Cylinder.
在每个AMP中每个cylinder只有对应的一个入口。
Cylinders with data appear on the Master Index.
Cylinders without data appear on the free Cylinder List.
Each index entry identifies its cylinder’s lowest Table ID and Row ID.
每个索引入口标识出它的cylinder`s lowest Table ID和 Row ID.
索引入口存储在Table ID和row hash 中
很多表都有相同cylinder的行
A table may have rows on many cylinders on different PDISKS on an AMP.
The Free Cylinder List is sorted by Cylinder Number.
------------------------------------------------------------------
20:Cylinder Index设计规划
每个cylinder都有它自己的Cylinder Index(CI).CI描述了在cylinder中的一列数据块和空扇区.Table ID和Lowest Row ID序列化CI
It is a tool to identify which block a specific row is in.
CI包括四个主要部分:
Cylinder Index Header
Cylinder Index Heap
Subtable Reference Array Descriptors
Data Block Descriptor Reference Array
一个Cylinder包含着许多不同表的数据块
一个表(或者子表)可能跨越超过一个cylinder
An SRD(Subtable Reference Descriptors) exists on a CI for every sub-table that has Data Blocks on the cylinder.
Each SRD existing in the CI’s heap is referenced
by an SRD Reference Array entry.
The DBD Reference Array Indexes to the Data
Blocks for a particular sub table.
The DBD Reference Array entries are in a
descending order sort based on the sub table ID.
---------------------------------------------------------------
21:Cylinder Index
There is an entry for each block on this cylinder.
Blocks containing rows appear in the Cylinder Index.
Those entries are sorted ascending on Table ID and row hash.
Blocks without data appear on the Free Block List.
The Free Block list is sorted ascending on Sector Number.
Only rows belonging to the same table and subtable appear in a block.
Blocks belonging to the same table and sub-table can vary in size.
-----------------------------------------------------------------
22:General Block Layout物理块设计
块是Teradata I/O的设备.块中包含1条或者更多的数据行。都属于一个表。
数据块的最大尺寸是:63.5KB
一个数据块包括:The Data Block Header数据块头部
The Row Heap 行堆
The Row Reference Array
行不能被数据块分割。Row Reference Array 被放置在数据块的最后。
[Data Block Format]
Header---Row- ---|
-Row---Row-- |-->Row Heap
--Row---Row-- ---|
-Reference Array--Trailer--
一个数据块是物理输出输入单位(Max size=63.5kb).
The block header包括Table ID(6 bytes)
只有同一个表和子表中的行在一个数据块中
块可以从512字节到63.5KB(1 to 127磁盘扇区)
Blocks within a table vary in size.
The system adjusts block sizes dynamically
Data blocks are not chained together
Rows are not split across block bundaries
Reference Array Pointers are sorted in ascending order on Row ID within blocks
Rows are(16 bit) word aligned within a block
-------------------------------------------------------------------
23:General Row Layout
ROW LEGNTH+ROW ID+ADDITIONAL OVERHEAD+COLUMN DATA VALUES+ROW REF.ARRAY
2bytes 4bytes+4bytes 2bytes 可变的 2bytes
|
|
用来确定行在块中的起始位置
Teradata支持可变长的行
行的最大长为64KB
主索引值确定了Row Hash值
系统产生Uniqueness Value
表中的行是可变长的。行最大长度为64KB
在SET表中不允许出现重复行
-----------------------------------------------------------------
24:AMP Read I/O 概要
Master Index是常驻内存的
如果Cylinder Index不在内存中,则AMP读取Cylinder Index
如果Data Blocks不在内存中,则AMP读取数据块
AMP memory, cache size, and locality of reference determine if either of these steps require physical I/O.
Often, the Cylinder Index is memory resident and a Unique Primary Index retrieval requires only one (1) I/O.
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/16723161/viewspace-1019832/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/16723161/viewspace-1019832/