A data warehouse is a central aggregation of data (which can be distributed physically);
A data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.
Database Schema
Tables
Relationships
Domains
Business Rules
Required Components to Build a Data Warehouse
Data Extraction Tools
Extracted Data
Metadata
DBMS
Data Management Tools
Presentation Services
End-User Analysis Tools
User Training Courses and Materials
Fact Tables
A central table in a data warehouse schema that contains numerical measures and keys relating facts to dimension tables. Fact tables contain data that describes specific events within a business, such as bank transactions or product sales.
Dimension Tables
A table in a data warehouse whose entries describe data in a fact table. Dimension tables contain the data from which dimensions are created
Dimension Types
Star Schema
Snow Flake Schema
Parent – Child
A parent-child dimension is based on two dimension table columns that together define the lineage relationships among the members of the dimension. One column, called the member key column, identifies each member; the other column, called the parent key column, identifies the parent of each member.
Virtual
A logical dimension that is based on the values of properties of members of a physical dimension. For example, a virtual dimension that contains the colours red, green, and blue can be based on the colour member property of a product dimension
Data Mining Model
Introduction to Data Mining
Model
Rules
Patterns
Training a model
Algorithms
Decision Trees
Clustering
Design Guideline
Use a star schema if possible
Design dimension tables for the users.
Apply commonsense normalization to dimension table design. Do not over-summarize in the fact table.
Use a common fact table structure for similar data
Do not create auxiliary tables of summarized data.
Create indexes on key fields
Ensure referential integrity
OLAP
On-Line Analytical Processing (OLAP) is the technology that enables client applications to efficiently access this data. OLAP provides many benefits to analytical users
OLAP provides a multidimensional presentation of data warehouse data, creating cubes that organize and summarize data for efficient analytical querying. The design of the data warehouse structure can affect how easily these cubes can be designed and constructed.
Network Communication
Oracle uses its Transparent Network Substrate - TNS
Microsoft provides Network Libraries set using network Tabulate Data Stream - TDS
Secure application data
Identifying object privilege
Granting object privilege
Both Oracle and SQL Server follow the same object privileges (two-tier permission model)
First tier is access to an object
Second is permission to commands, such as insert and delete
ad hoc
[5Ad5hCk]
adv.
特别
数据挖掘系统研究报告
数据挖掘是从数据中自动地抽取模式、关联、变化、异常和有意义的结构
数据挖掘大部分的价值在于利用数据挖掘技术改善预言模型
数据挖掘文化
知识发现文化(KD)
输出的是规则
预言模型文化(PM)
输出的是预言模型
共同点
两种文化输入的都是学习集(learning sets)
目的都是尽可能多的自动化数据挖掘过程
数据挖掘过程并不能完全自动化,只能半自动化
数据挖掘过程
步骤 步骤名称 描述
1 数据仓库
Data Warehouse 数据仓库管理用于决策支持的数据。在该步骤内,数据从操作型系统以及第三方的数据源聚集、清洗、以及转换到数据仓库中,供决策分析使用。
2 数据挖掘
Data Mining 在这个步骤中,数据从数据仓库抽取出来,用来产生预言模型或者规则集。该步骤可以自动化。
3 预言模型
Predictive Modeling 在该步骤内,为了产生一个优化的模型,一个或多个预言模型被选择或者联合。这些预言模型可能从数据挖掘系统产生,也可能从统计模型中产生,或者通过第三方购买 。
4 预言记分
Predictive Scoring 在这个步骤中,选择的预言模型对操作型数据或者交易数据进行记分(score) 。
数据挖掘系统
第一代数据挖掘系统
支持一个或少数几个数据挖掘算法,这些算法设计用来挖掘向量数据(vector-valued data),这些数据模型在挖掘时候,一般一次性调进内存进行处理。许多这样的系统已经商业化。
第二代数据挖掘系统
目前的研究,是改善第一代数据挖掘系统,开发第二代数据挖掘系统。第二代数据挖掘系统支持数据库和数据仓库,和它们具有高性能的接口,具有高的可扩展性。例如,第二代系统能够挖掘大数据集、更复杂的数据集、以及高维数据。这一代系统通过支持数据挖掘模式(data mining schema)和数据挖掘查询语言(DMQL)增加系统的灵活性。
第三代数据挖掘系统
第三代的特征是能够挖掘Internet/Extranet的分布式和高度异质的数据,并且能够有效地和操作型系统集成。这一代数据挖掘系统关键的技术之一是提供对建立在异质系统上的多个预言模型以及管理这些预言模型的元数据提供第一级别(first class)的支持。
第四代数据挖掘系统
第四代数据挖掘系统能够挖掘嵌入式系统、移动系统、和普遍存在(ubiquitous)计算设备产生的各种类型的数据 。
Module 2 Defining Data Warehouse Structures
OLAP Components
Numeric Measures
Data values or facts that users analyze
Dimensions
Business categories that provide context to numeric measures
Sourced from columns in star schema dimensions
Members are organized into hierarchies
Cubes
Combine dimensions and measures into one conceptual model
Logical storage medium for an OLAP database
Module 3 Populating Data Warehouse Structures
Verifying the Dimension Source Data
Dimension Data Load Examples
Maintaining Integrity of the Dimension
Managing Changing Dimension Data
Type 1: Overwriting the Dimension Record
Type 2: Writing Another Dimension Record
Loading OLAP Dimensions
Module 10 Overview
Defining Dimension Basics
Working with Balanced Dimensions
Each level corresponds to a dimension table column
Dimension contains a finite number of levels
Can come from star or snowflake dimension tables
Working With Unbalanced Dimensions
Dimension contains an indeterminate number of levels
Visible levels defined from member’s parent
Usually handled using a parent-child dimension
Working With Time Dimensions
Using Multiple Hierarchies
Handling Lots of Children
Module 11 Using Measures
Defining Measures
Categories of Measures
Design Alternatives
Categories of Measures
“Additive”
Non-additive
Semi-additive (over time)
Distinct Count