文章目录
Main Contents
- Data warehouse 数据仓库
- OLAP 联机决策分析应用
- Data mining 数据挖掘
- Information retrieval 信息检索
- Semi-structured data and XML 半结构化数据和XML
8.1 Data warehouse & OLAP
- 操作型数据库,OLTP 联机事务处理应用
- Challenges come from huge amount of data in network era
- How to use them efficiently?
- How to find useful information in such a data ocean? 挖掘信息
- If they cannot be used by human being, they are garbage.
- Data is the most important resources
- The importance of decision-making scientifically
- Data requirements in decision-making
Data Requirements in Decision-making
- Need summarized data while not detail data 需要统计数据而不是细节数据
- Need historical data 需要历史数据
- Need large amount of external data (multi data source) 大量外部数据
- Need decision subject oriented data, while not the data facing daily transaction process 面向决策主题
- The data don’t need real time updating. They are mainly read
- OLTP and OLAP
OLTP & OLAP
On-Line Transaction Processing (OLTP) On-Line Analytical Processing (OLAP)
Data Feature Detail data Summarized data
Data prescripotion Current data Current and historical data
Data Source Inner data of a enterprise Inner & Extenal data; Distributed; Heterogeneous 异构的
Data organization Surrounding transaction processing Surrounding decision subject
Data Updating Updated instantly Periodical or on demand
Data Amount Involving less data in one operation Involving less amount of data in one operation
Operating Feature Mainly simple and repeated short transaction Mainly long transaction processing complex queries
What is Data warehouse
- A data warehouse is a repository of information gathered from multiple sources 来自于多个不同的数据源,
- Decision subject oriented 面向决策主题
- Providers a single consolidated interface to data 提供了单一的访问接口
- Data stored for an extended period, providing access to historical data 数据存储相当长的周期,提供对历史数据的访问
- Mainly retrieved 主要是查询
- Data/Updates are periodically downloaded from online transaction processing (OLTP) systems. 数据更新时周期性的
- Typically, download happens each night. 每天晚上下载
- Data may not be completely up-to-date, but is recent enough for analysis.
- Running large queries at the warehouse ensures that OLTP system are not affected by the decision-support workload. 运行大的查询
Warehousing Issues
- Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas. 语义集成,
- Heterogeneous Source: Must access data from a variety of source formats and repositories. 异构数据源
- Replication capabilities can be exploited here.
- Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data. 加载,刷新,清洗
- Metadata Management: Must keep track of source 源数据的管理
Software Solutions that use Warehouse
- Online Analytical Processing (OLAP)
- enables users to analyses data across multiple dimensions and hierarchies 多维主题分析
- Analysis and Query Reporting Solutions 分析和查询报表的解决方案
- custom built analysis tools use mathematical models to produce speicialized interactive solutions
- Data Mining 数据挖掘
- enable users to identify patterns and corelations with a set of data, or to create predictive models from the data 分析寻找规律
An Example
Cross-tabulation of number by size and color of sample relation sales
- enable users to identify patterns and corelations with a set of data, or to create predictive models from the data 分析寻找规律
Multidimensional Data Model
- Collection of numeric measures, which depend on a set of dimensions.
- E.g., measure Sales, dimensions Product(key: pid), Location (locid), and Time (timeid).
Dimension Hierarchies
- For each dimension, the set of values can be organized in a hierarchy.
MOLAP vs ROLAP
- Multidimensional data can be stored physically in an (disk-resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems.
- The main relation which relates dimensions relation, to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
- E.g., Products(pid, pname, category, price)
- Fact tables are much larger than dimensional tables.
Materialized View
- Star or snowflake schema are common storing schema in data warehouse. But decisionmaking are generally not based on star or snowflake schema directly. They are based on different kinds of summarized data computed from star or snowflake schema Because the schema. computation of aggregation function is very time-consuming, the computing results are often stored as materialized view in data warehouse.
- Take P (Products), S (Stores), D (Dates) as example. Their star schema as above.
PSD View
-
Take the computation of SUM function as example. Other aggregation functions are similar.
-
Sales of every product in every store at every day.
CREATE VIEW PSD (PID, SID, DID, TotalSales) AS
SELECT PID, SID, DID, SUM (SaledMoney) AS TotalSales
FROM Sales
GROUP BY PID, SID, DID;
PS, SD, and PD View
-
Total sales of every product in every store (for all times)
-
SD and PD views are similar (for all products or for all stores)
CREATE VIEW PS (PID, SID, TotalSales) AS
SELECT PID, SID, SUM(TotalSales) AS TotalSales
FROM PSD
GROUP BY PID, SID; -
PS View can be expressed as PS ALL
P, S, and D View
-
Total sales of every product (for all stores and all times)
-
S and D views are similar (aggregated according to store or date respectively)
CREATE VIEW P (PID, TotalSales) AS
SELECT PID, SUM(TotalSales) AS TotalSales
FROM PS /* or PD */
GROUP BY PID; -
P View can be expressed as P ALL ALL
All View
Total sales for all products and all stores and all times
CREATE VIEW ALL (TotalSales) AS
SELECT SUM(TotalSales) AS TotalSales
Principles of Database Systems, Xu Lizhen 21
FROM P /* or S or D */
Reliant Relationships Between Views
- There are three dimensions in Sales: {P, S, D}, every sub-set of it is corresponding to a view. These are equivalent to 23 elements of power-set ρ({P, S, D}).
- The following is the reliant relationships between these materialized views:
OLAP Queries
-
A common operation is to aggregate a measure over one or more dimensions.
- Find total sales.
- Find total sales for each city, or for each state.
- Find top five products ranked by total sales.
-
Roll-up: Aggregating at different levels of a dimension hierarchy. 上卷,在某一个维度上,取更高的统计力度
- E.g., Given total sales by city, we can roll-up to get sales by state.
-
Drill-down: The inverse of roll-up. 下钻
- E.g., Given total sales by state, can drill-down to get total sales by city.
-
E.g., Can also drill-down on different dimension to get total sales by product for each state.
-
Pivoting: Aggregation on selected dimensions.
-
Slicing and Dicing: Equality and range selections on one or more dimensions.
The CUBE Operator
-
Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
-
CUBE PID SID DID BY SUM Sales PID, SID, Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:
Examples of Queries on Cube
SELECT PID, SID, Quarter, SUM (SaledMoney) AS TotalSales
FROM Sales S, Dates D
WHERE S.DID=D.DID
CUBE BY PID, SID, Quarter;
— Generate a data cube including 2³ views.
- CUBE(TV,No1,Q1)
- Query TV’s total sales of store No1 in first quarter.
- CUBE(ALL,No1,Q1)
- Query total sales of store No1 in first quarter.
- CUBE(ALL,ALL,Q1)
- Query total sales in first quarter.
- CUBE(ALL,ALL,ALL)
- Query total sales in whole year.
- CUBE(TV,No1,Q1) + CUBE(VCD,No1,Q1)
- CUBE(ALL,ALL,Q1) / CUBE(ALL,ALL,ALL) * 100%