8. 数据库在新领域的应用

本文深入探讨了数据仓库和在线分析处理(OLAP)的基本概念及其在决策制定中的重要作用。文章强调了数据仓库如何从多个来源收集信息,提供统一的数据访问接口,并存储大量历史数据以供分析使用。同时,详细解释了OLTP与OLAP的区别,以及OLAP在多维数据分析上的优势。
摘要由CSDN通过智能技术生成

Main Contents

  • Data warehouse 数据仓库
  • OLAP 联机决策分析应用
  • Data mining 数据挖掘
  • Information retrieval 信息检索
  • Semi-structured data and XML 半结构化数据和XML

8.1 Data warehouse & OLAP

  • 操作型数据库,OLTP 联机事务处理应用
  • Challenges come from huge amount of data in network era
    • How to use them efficiently?
    • How to find useful information in such a data ocean? 挖掘信息
    • If they cannot be used by human being, they are garbage.
  • Data is the most important resources
  • The importance of decision-making scientifically
  • Data requirements in decision-making

Data Requirements in Decision-making

  • Need summarized data while not detail data 需要统计数据而不是细节数据
  • Need historical data 需要历史数据
  • Need large amount of external data (multi data source) 大量外部数据
  • Need decision subject oriented data, while not the data facing daily transaction process 面向决策主题
  • The data don’t need real time updating. They are mainly read
  • OLTP and OLAP

OLTP & OLAP

                	On-Line Transaction Processing (OLTP)   	On-Line Analytical Processing (OLAP)    

Data Feature Detail data Summarized data
Data prescripotion Current data Current and historical data
Data Source Inner data of a enterprise Inner & Extenal data; Distributed; Heterogeneous 异构的
Data organization Surrounding transaction processing Surrounding decision subject
Data Updating Updated instantly Periodical or on demand
Data Amount Involving less data in one operation Involving less amount of data in one operation
Operating Feature Mainly simple and repeated short transaction Mainly long transaction processing complex queries

What is Data warehouse

  • A data warehouse is a repository of information gathered from multiple sources 来自于多个不同的数据源,
    • Decision subject oriented 面向决策主题
    • Providers a single consolidated interface to data 提供了单一的访问接口
    • Data stored for an extended period, providing access to historical data 数据存储相当长的周期,提供对历史数据的访问
    • Mainly retrieved 主要是查询
    • Data/Updates are periodically downloaded from online transaction processing (OLTP) systems. 数据更新时周期性的
      • Typically, download happens each night. 每天晚上下载
      • Data may not be completely up-to-date, but is recent enough for analysis.
    • Running large queries at the warehouse ensures that OLTP system are not affected by the decision-support workload. 运行大的查询
      Warehousing Issues
  • Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas. 语义集成,
  • Heterogeneous Source: Must access data from a variety of source formats and repositories. 异构数据源
    • Replication capabilities can be exploited here.
  • Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data. 加载,刷新,清洗
  • Metadata Management: Must keep track of source 源数据的管理

Software Solutions that use Warehouse

  • Online Analytical Processing (OLAP)
    • enables users to analyses data across multiple dimensions and hierarchies 多维主题分析
  • Analysis and Query Reporting Solutions 分析和查询报表的解决方案
    • custom built analysis tools use mathematical models to produce speicialized interactive solutions
  • Data Mining 数据挖掘
    • enable users to identify patterns and corelations with a set of data, or to create predictive models from the data 分析寻找规律
      An Example
      Cross-tabulation of number by size and color of sample relation sales

Multidimensional Data Model

  • Collection of numeric measures, which depend on a set of dimensions.
    • E.g., measure Sales, dimensions Product(key: pid), Location (locid), and Time (timeid).

Dimension Hierarchies

  • For each dimension, the set of values can be organized in a hierarchy.

MOLAP vs ROLAP

  • Multidimensional data can be stored physically in an (disk-resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems.
  • The main relation which relates dimensions relation, to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
  • E.g., Products(pid, pname, category, price)
  • Fact tables are much larger than dimensional tables.

Materialized View

  • Star or snowflake schema are common storing schema in data warehouse. But decisionmaking are generally not based on star or snowflake schema directly. They are based on different kinds of summarized data computed from star or snowflake schema Because the schema. computation of aggregation function is very time-consuming, the computing results are often stored as materialized view in data warehouse.
  • Take P (Products), S (Stores), D (Dates) as example. Their star schema as above.

PSD View

  • Take the computation of SUM function as example. Other aggregation functions are similar.

  • Sales of every product in every store at every day.

    CREATE VIEW PSD (PID, SID, DID, TotalSales) AS
    SELECT PID, SID, DID, SUM (SaledMoney) AS TotalSales
    FROM Sales
    GROUP BY PID, SID, DID;

PS, SD, and PD View

  • Total sales of every product in every store (for all times)

  • SD and PD views are similar (for all products or for all stores)

    CREATE VIEW PS (PID, SID, TotalSales) AS
    SELECT PID, SID, SUM(TotalSales) AS TotalSales
    FROM PSD
    GROUP BY PID, SID;

  • PS View can be expressed as PS ALL

P, S, and D View

  • Total sales of every product (for all stores and all times)

  • S and D views are similar (aggregated according to store or date respectively)

    CREATE VIEW P (PID, TotalSales) AS
    SELECT PID, SUM(TotalSales) AS TotalSales
    FROM PS /* or PD */
    GROUP BY PID;

  • P View can be expressed as P ALL ALL

All View

Total sales for all products and all stores and all times

CREATE VIEW ALL (TotalSales) AS
SELECT SUM(TotalSales) AS TotalSales
Principles of Database Systems, Xu Lizhen 21
FROM P /* or S or D */

Reliant Relationships Between Views

  • There are three dimensions in Sales: {P, S, D}, every sub-set of it is corresponding to a view. These are equivalent to 23 elements of power-set ρ({P, S, D}).
  • The following is the reliant relationships between these materialized views:

OLAP Queries

  • A common operation is to aggregate a measure over one or more dimensions.

    • Find total sales.
    • Find total sales for each city, or for each state.
    • Find top five products ranked by total sales.
  • Roll-up: Aggregating at different levels of a dimension hierarchy. 上卷,在某一个维度上,取更高的统计力度

    • E.g., Given total sales by city, we can roll-up to get sales by state.
  • Drill-down: The inverse of roll-up. 下钻

    • E.g., Given total sales by state, can drill-down to get total sales by city.
  • E.g., Can also drill-down on different dimension to get total sales by product for each state.

  • Pivoting: Aggregation on selected dimensions.

  • Slicing and Dicing: Equality and range selections on one or more dimensions.

The CUBE Operator

  • Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.

  • CUBE PID SID DID BY SUM Sales PID, SID,  Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

Examples of Queries on Cube

SELECT PID, SID, Quarter, SUM (SaledMoney) AS TotalSales
FROM Sales S, Dates D
WHERE S.DID=D.DID
CUBE BY PID, SID, Quarter;

— Generate a data cube including 2³ views.

  • CUBE(TV,No1,Q1)
    • Query TV’s total sales of store No1 in first quarter.
  • CUBE(ALL,No1,Q1)
    • Query total sales of store No1 in first quarter.
  • CUBE(ALL,ALL,Q1)
    • Query total sales in first quarter.
  • CUBE(ALL,ALL,ALL)
    • Query total sales in whole year.
  • CUBE(TV,No1,Q1) + CUBE(VCD,No1,Q1)
    • CUBE(ALL,ALL,Q1) / CUBE(ALL,ALL,ALL) * 100%
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值