数据仓库学习笔记

 A data warehouse is a central aggregation of data (which can be distributed physically);
A data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.

  Database Schema
Tables
Relationships
Domains
Business Rules


  Required Components to Build a Data Warehouse
Data Extraction Tools
Extracted Data
Metadata
DBMS
Data Management Tools
Presentation Services
End-User Analysis Tools
User Training Courses and Materials

  Fact Tables
A central table in a data warehouse schema that contains numerical measures and keys relating facts to dimension tables. Fact tables contain data that describes specific events within a business, such as bank transactions or product sales.

  Dimension Tables
A table in a data warehouse whose entries describe data in a fact table. Dimension tables contain the data from which dimensions are created

  Dimension Types
Star Schema
Snow Flake Schema
Parent – Child
 A parent-child dimension is based on two dimension table columns that together define the lineage relationships among the members of the dimension. One column, called the member key column, identifies each member; the other column, called the parent key column, identifies the parent of each member.
Virtual
 A logical dimension that is based on the values of properties of members of a physical dimension. For example, a virtual dimension that contains the colours red, green, and blue can be based on the colour member property of a product dimension
Data Mining Model


  Introduction to Data Mining
Model
 Rules
 Patterns
Training a model
Algorithms
 Decision Trees
 Clustering


  Design Guideline
Use a star schema if possible
Design dimension tables for the users.
Apply commonsense normalization to dimension table design. Do not over-summarize in the fact table.
Use a common fact table structure for similar data
Do not create auxiliary tables of summarized data.
Create indexes on key fields
Ensure referential integrity

  OLAP
On-Line Analytical Processing (OLAP) is the technology that enables client applications to efficiently access this data. OLAP provides many benefits to analytical users
OLAP provides a multidimensional presentation of data warehouse data, creating cubes that organize and summarize data for efficient analytical querying. The design of the data warehouse structure can affect how easily these cubes can be designed and constructed.

  
  Network Communication
Oracle uses its Transparent Network Substrate - TNS
Microsoft provides Network Libraries set using network Tabulate Data Stream - TDS


Secure application data
Identifying object privilege
Granting object privilege

Both Oracle and SQL Server follow the same object privileges (two-tier permission model)
First tier is access to an object
Second is permission to commands, such as insert and delete


ad hoc
[5Ad5hCk]
adv.
特别


   数据挖掘系统研究报告
数据挖掘是从数据中自动地抽取模式、关联、变化、异常和有意义的结构
数据挖掘大部分的价值在于利用数据挖掘技术改善预言模型

  数据挖掘文化
知识发现文化(KD)
 输出的是规则
预言模型文化(PM)
 输出的是预言模型
共同点
 两种文化输入的都是学习集(learning sets)
 目的都是尽可能多的自动化数据挖掘过程
 数据挖掘过程并不能完全自动化,只能半自动化


 数据挖掘过程
步骤 步骤名称   描述
1 数据仓库
 Data Warehouse  数据仓库管理用于决策支持的数据。在该步骤内,数据从操作型系统以及第三方的数据源聚集、清洗、以及转换到数据仓库中,供决策分析使用。
2 数据挖掘
 Data Mining  在这个步骤中,数据从数据仓库抽取出来,用来产生预言模型或者规则集。该步骤可以自动化。
3 预言模型
 Predictive Modeling 在该步骤内,为了产生一个优化的模型,一个或多个预言模型被选择或者联合。这些预言模型可能从数据挖掘系统产生,也可能从统计模型中产生,或者通过第三方购买 。
4 预言记分
 Predictive Scoring 在这个步骤中,选择的预言模型对操作型数据或者交易数据进行记分(score) 。
 
 数据挖掘系统
第一代数据挖掘系统
 支持一个或少数几个数据挖掘算法,这些算法设计用来挖掘向量数据(vector-valued data),这些数据模型在挖掘时候,一般一次性调进内存进行处理。许多这样的系统已经商业化。
第二代数据挖掘系统
 目前的研究,是改善第一代数据挖掘系统,开发第二代数据挖掘系统。第二代数据挖掘系统支持数据库和数据仓库,和它们具有高性能的接口,具有高的可扩展性。例如,第二代系统能够挖掘大数据集、更复杂的数据集、以及高维数据。这一代系统通过支持数据挖掘模式(data mining schema)和数据挖掘查询语言(DMQL)增加系统的灵活性。
第三代数据挖掘系统
 第三代的特征是能够挖掘Internet/Extranet的分布式和高度异质的数据,并且能够有效地和操作型系统集成。这一代数据挖掘系统关键的技术之一是提供对建立在异质系统上的多个预言模型以及管理这些预言模型的元数据提供第一级别(first class)的支持。
第四代数据挖掘系统
 第四代数据挖掘系统能够挖掘嵌入式系统、移动系统、和普遍存在(ubiquitous)计算设备产生的各种类型的数据 。


   Module 2 Defining Data Warehouse Structures
OLAP Components
Numeric Measures
 Data values or facts that users analyze
Dimensions
 Business categories that provide context to numeric measures
 Sourced from columns in star schema dimensions
 Members are organized into hierarchies
Cubes
 Combine dimensions and measures into one conceptual model
 Logical storage medium for an OLAP database

 
   Module 3 Populating Data Warehouse Structures
Verifying the Dimension Source Data
Dimension Data Load Examples
Maintaining Integrity of the Dimension
Managing Changing Dimension Data
Type 1: Overwriting the Dimension Record
Type 2: Writing Another Dimension Record
Loading OLAP Dimensions

   Module 10 Overview
Defining Dimension Basics
Working with Balanced Dimensions
 Each level corresponds to a dimension table column
 Dimension contains a finite number of levels
 Can come from star or snowflake dimension tables
Working With Unbalanced Dimensions
 Dimension contains an indeterminate number of levels
 Visible levels defined from member’s parent
 Usually handled using a parent-child dimension
Working With Time Dimensions
Using Multiple Hierarchies
Handling Lots of Children


   Module 11 Using Measures
Defining Measures
Categories of Measures
Design Alternatives

 Categories of Measures
“Additive”
Non-additive
Semi-additive (over time)
Distinct Count

 

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值