第一章 系统安装与物理设计
一、概述
主要任务包括:
n 决定DW/BI系统的规模
n 生产、测试、开发环境的软硬件配置
n 逻辑模型到物理模型转换
n 如何对事实表分区
n 物理数据如何存储
n 索引计划
二、系统模型考虑
可以从四个方面来考虑:数据容量、应用复杂性、同时在线用户数与系统可用性需求。
Overall System Size | Small | Medium | Large |
Range of Configurations | One “Commodity” 32-bit or 64-bit 4-way | One high-end (8 or more processors) 64-bit system with max memory, or distribute components to multiple commodity servers | Distributed system on high-end 64-bit servers; some clusters |
Data Volume | < 500 M fact rows, < 50 GB atomic fact data | < 5 B fact rows, < 500 GB atomic fact data | Billions and billions… |
Usage Complexity | > 60% Simple use 30% Medium complexity < 10% Demanding use | 50% Simple use 35% Medium 15% Demanding | 35% Simple 40% Medium 25% Demanding |
Number of Users | < a dozen simultaneous users | < 200 simultaneous users | Thousands |
Availability | Several hours of downtime each night acceptable | < 1 hour downtime or query slowdown each night | 24x7 with excellent query performance |
三、系统配置考虑
预算
单机/分布式系统
2.存储系统选择
RAID/SAN
Note | Directly attached RAID disks can offer better performance than SANs, particularly for sequential I/O. However, the advantages of the SAN technology usually, although not always, outweigh the difference in performance for DW/BI applications. |
3.高可用性
并行计算;分区技术;数据库镜像技术
四、软件安装与配置
角色不同,需要安装的软件也不同。
We strongly recommend that all relational database files be placed on RAID arrays, either RAID-1 or RAID-5, preferably with hardware controllers rather than managed by the operating system. RAID-1 is significantly better than RAID-5, although predictably more expensive, because it’s faster both for writes and for recovery after a disk failure.
The SQL Server system databases master, model, and msdb can be placed on the RAID-1 array that holds the operating system. These databases are typically very small, and there’s usually plenty of room for them on that array. Alternatively, place them on their own small RAID-1 array. A third alternative is to place them on the same RAID array as user databases. It is vital that these databases, especially master, be placed on a fault-tolerant array.
The fourth system database, tempdb, could grow significantly as the data warehouse database is being used. You should pre-allocate tempdb to a large size to avoid auto-growth during query operations. Don’t place tempdb on the system RAID-1 array which is usually small. Use at least one file per CPU for high-performance tempdb operations. Be sure to spread tempdb out over many drives to maximize I/O performance. If the total number of drives on your system is limited, you can place tempdb on the same RAID array as the user databases, especially if you use RAID-1 or RAID-0+1 for the user databases. However, if you set up the user databases on a RAID-5 array, you should consider separating out tempdb onto its own RAID-1 array to minimize potential bottlenecks.
五、物理设计
代理键
使用数值型,并且占用字节尽可能小
2.字符字段
定义三种类型 :
Short string with 20 characters
Intermediate string with 50 characters
Long string with 100 characters
小于10位用CHAR型
3.Insert an Unknown Member Row
为每个维度插入一条缺省数据
4.Table and Column Extended Properties
将元数据定义在数据库对象的扩展属性上,如为表个字段增加描述,SCD等其它元数据
5.Housekeeping Columns
定义审核维,并在事实表中与它关联;用来跟踪数据如何被加载进DW系统,一般在ETL阶段填充数据。
6.索引与约束
应当定义主外键,但在加载时应让其失效。少定义复合索引,通常一个外键定义一个索引。聚集索引一般定义为日期字段。
7.事实表分区
分区函数两端应是空分区。在初始加载时通常用到swap partition,以利于并行加载
8.Staging Tables
只将加载时需要查询的数据放出关系数据库,如不需要转换,则应使用文件系统。
9.元数据
定义元数据库