8. 数据库在新领域的应用

最新推荐文章于 2021-06-22 10:49:18 发布

林林丶

最新推荐文章于 2021-06-22 10:49:18 发布

阅读量426

点赞数

本文深入探讨了数据仓库和在线分析处理(OLAP)的基本概念及其在决策制定中的重要作用。文章强调了数据仓库如何从多个来源收集信息，提供统一的数据访问接口，并存储大量历史数据以供分析使用。同时，详细解释了OLTP与OLAP的区别，以及OLAP在多维数据分析上的优势。

摘要由CSDN通过智能技术生成

文章目录

Main Contents
8.1 Data warehouse & OLAP
Materialized View
PSD View
PS, SD, and PD View
P, S, and D View
All View
OLAP Queries
The CUBE Operator
Examples of Queries on Cube

Main Contents

Data warehouse 数据仓库
OLAP 联机决策分析应用
Data mining 数据挖掘
Information retrieval 信息检索
Semi-structured data and XML 半结构化数据和XML

8.1 Data warehouse & OLAP

操作型数据库，OLTP 联机事务处理应用
Challenges come from huge amount of data in network era
- How to use them efficiently?
- How to find useful information in such a data ocean? 挖掘信息
- If they cannot be used by human being, they are garbage.
Data is the most important resources
The importance of decision-making scientifically
Data requirements in decision-making

Data Requirements in Decision-making

Need summarized data while not detail data 需要统计数据而不是细节数据
Need historical data 需要历史数据
Need large amount of external data (multi data source) 大量外部数据
Need decision subject oriented data, while not the data facing daily transaction process 面向决策主题
The data don’t need real time updating. They are mainly read
OLTP and OLAP

OLTP & OLAP

                	On-Line Transaction Processing (OLTP)   	On-Line Analytical Processing (OLAP)

Data Feature Detail data Summarized data
Data prescripotion Current data Current and historical data
Data Source Inner data of a enterprise Inner & Extenal data; Distributed; Heterogeneous 异构的
Data organization Surrounding transaction processing Surrounding decision subject
Data Updating Updated instantly Periodical or on demand
Data Amount Involving less data in one operation Involving less amount of data in one operation
Operating Feature Mainly simple and repeated short transaction Mainly long transaction processing complex queries

What is Data warehouse

A data warehouse is a repository of information gathered from multiple sources 来自于多个不同的数据源，
- Decision subject oriented 面向决策主题
- Providers a single consolidated interface to data 提供了单一的访问接口
- Data stored for an extended period, providing access to historical data 数据存储相当长的周期，提供对历史数据的访问
- Mainly retrieved 主要是查询
- Data/Updates are periodically downloaded from online transaction processing (OLTP) systems. 数据更新时周期性的
  - Typically, download happens each night. 每天晚上下载
  - Data may not be completely up-to-date, but is recent enough for analysis.
- Running large queries at the warehouse ensures that OLTP system are not affected by the decision-support workload. 运行大的查询
  Warehousing Issues
Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas. 语义集成，
Heterogeneous Source: Must access data from a variety of source formats and repositories. 异构数据源
- Replication capabilities can be exploited here.
Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data. 加载，刷新，清洗
Metadata Management: Must keep track of source 源数据的管理

Software Solutions that use Warehouse

Online Analytical Processing (OLAP)
- enables users to analyses data across multiple dimensions and hierarchies 多维主题分析
Analysis and Query Reporting Solutions 分析和查询报表的解决方案
- custom built analysis tools use mathematical models to produce speicialized interactive solutions
Data Mining 数据挖掘
- enable users to identify patterns and corelations with a set of data, or to create predictive models from the data 分析寻找规律
  An Example
  Cross-tabulation of number by size and color of sample relation sales

Multidimensional Data Model

Collection of numeric measures, which depend on a set of dimensions.
- E.g., measure Sales, dimensions Product(key: pid), Location (locid), and Time (timeid).

Dimension Hierarchies

For each dimension, the set of values can be organized in a hierarchy.

MOLAP vs ROLAP

Multidimensional data can be stored physically in an (disk-resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems.
The main relation which relates dimensions relation, to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
E.g., Products(pid, pname, category, price)
Fact tables are much larger than dimensional tables.

Materialized View

Star or snowflake schema are common storing schema in data warehouse. But decisionmaking are generally not based on star or snowflake schema directly. They are based on different kinds of summarized data computed from star or snowflake schema Because the schema. computation of aggregation function is very time-consuming, the computing results are often stored as materialized view in data warehouse.
Take P (Products), S (Stores), D (Dates) as example. Their star schema as above.

PSD View

Take the computation of SUM function as example. Other aggregation functions are similar.
Sales of every product in every store at every day.

CREATE VIEW PSD (PID, SID, DID, TotalSales) AS
SELECT PID, SID, DID, SUM (SaledMoney) AS TotalSales
FROM Sales
GROUP BY PID, SID, DID;

PS, SD, and PD View

Total sales of every product in every store (for all times)
SD and PD views are similar (for all products or for all stores)

CREATE VIEW PS (PID, SID, TotalSales) AS
SELECT PID, SID, SUM(TotalSales) AS TotalSales
FROM PSD
GROUP BY PID, SID;
PS View can be expressed as PS ALL

P, S, and D View

Total sales of every product (for all stores and all times)
S and D views are similar (aggregated according to store or date respectively)

CREATE VIEW P (PID, TotalSales) AS
SELECT PID, SUM(TotalSales) AS TotalSales
FROM PS /* or PD */
GROUP BY PID;
P View can be expressed as P ALL ALL

All View

Total sales for all products and all stores and all times

CREATE VIEW ALL (TotalSales) AS
SELECT SUM(TotalSales) AS TotalSales
Principles of Database Systems, Xu Lizhen 21
FROM P /* or S or D */

Reliant Relationships Between Views

There are three dimensions in Sales: {P, S, D}, every sub-set of it is corresponding to a view. These are equivalent to 23 elements of power-set ρ({P, S, D}).
The following is the reliant relationships between these materialized views:

OLAP Queries

A common operation is to aggregate a measure over one or more dimensions.
- Find total sales.
- Find total sales for each city, or for each state.
- Find top five products ranked by total sales.
Roll-up: Aggregating at different levels of a dimension hierarchy. 上卷，在某一个维度上，取更高的统计力度
- E.g., Given total sales by city, we can roll-up to get sales by state.
Drill-down: The inverse of roll-up. 下钻
- E.g., Given total sales by state, can drill-down to get total sales by city.
E.g., Can also drill-down on different dimension to get total sales by product for each state.
Pivoting: Aggregation on selected dimensions.
Slicing and Dicing: Equality and range selections on one or more dimensions.

The CUBE Operator

Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
CUBE PID SID DID BY SUM Sales PID, SID,  Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

Examples of Queries on Cube

SELECT PID, SID, Quarter, SUM (SaledMoney) AS TotalSales
FROM Sales S, Dates D
WHERE S.DID=D.DID
CUBE BY PID, SID, Quarter;

— Generate a data cube including 2³ views.

CUBE(TV,No1,Q1)
- Query TV’s total sales of store No1 in first quarter.
CUBE(ALL,No1,Q1)
- Query total sales of store No1 in first quarter.
CUBE(ALL,ALL,Q1)
- Query total sales in first quarter.
CUBE(ALL,ALL,ALL)
- Query total sales in whole year.
CUBE(TV,No1,Q1) + CUBE(VCD,No1,Q1)
- CUBE(ALL,ALL,Q1) / CUBE(ALL,ALL,ALL) * 100%