文章目录
- Rest of DB \textbf{Rest of DB} Rest of DB
- 1. Data Administration \textbf{1. Data Administration} 1. Data Administration
- 2. DB Backup \textbf{2. DB Backup} 2. DB Backup
- 3. Data Warehouse \textbf{3. Data Warehouse} 3. Data Warehouse
- 4. Distributed DB \textbf{4. Distributed DB} 4. Distributed DB
- 5. NoSQL: \textbf{5. NoSQL: } 5. NoSQL: 专门用来存数据
更多 资料与 笔记
Rest of DB \textbf{Rest of DB} Rest of DB
1. Data Administration \textbf{1. Data Administration} 1. Data Administration
1.1. Overview \textbf{1.1. Overview} 1.1. Overview
1️⃣ Capacity Planning: \text{Capacity Planning:} Capacity Planning: 预测未来 DB \text{DB} DB的负荷 → \text{→} →预测系统会何时饱和 → \text{→} →尽可能延长饱和的时间
2️⃣大小计算
Size \textbf{Size} Size 描述 Database Size \text{Database Size} Database Size Sum of all Table sizes \text{Sum of all Table sizes} Sum of all Table sizes Table Size \text{Table Size} Table Size Number of rows(Cardinality) * Average row width(Degree) \text{Number of rows(Cardinality) * Average row width(Degree)} Number of rows(Cardinality) * Average row width(Degree) Row Size \text{Row Size} Row Size Sum of sizes of all attributes \text{Sum of sizes of all attributes} Sum of sizes of all attributes 1.2. \textbf{1.2. } 1.2. 例题
1️⃣ DB \text{DB} DB更新数据
表格 起始量 更新 Suppliers \text{Suppliers} Suppliers 0 \text{0} 0 新增 1000 Suppliers/Year \text{1000 Suppliers/Year} 1000 Suppliers/Year Puerchases \text{Puerchases} Puerchases 0 \text{0} 0 新增 10000 Purchases/Year \text{10000 Purchases/Year} 10000 Purchases/Year, 1-Purchase ↔ 对应 10-PurchaseItem \text{1-Purchase}\xleftrightarrow{对应}\text{10-PurchaseItem} 1-Purchase对应 10-PurchaseItem Items \text{Items} Items 0 \text{0} 0 新增 2000 Items/Year \text{2000 Items/Year} 2000 Items/Year
- PurchaseItem \text{PurchaseItem} PurchaseItem作为联结实体,起始量也一定是 0 \text{0} 0
2️⃣ Size of Row \text{Size of Row} Size of Row
Table Name \textbf{Table Name} Table Name Size of Row(Byte) \textbf{Size of Row(Byte)} Size of Row(Byte) Purchase(Normalized) \text{Purchase(Normalized)} Purchase(Normalized) 50 \text{50} 50 Purchase(Denormalized) \text{Purchase(Denormalized)} Purchase(Denormalized) 110 \text{110} 110 Supplier \text{Supplier} Supplier 50 \text{50} 50 PurchaseItem \text{PurchaseItem} PurchaseItem 12 \text{12} 12 Item \text{Item} Item 50 \text{50} 50 3️⃣ Normalized \text{Normalized} Normalized模式下:数据库运行两年后,会占用多大存储空间 ? \text{ ?} ?
- 2×1000×50+2×10000×50+2×100000×12+2×2000×50=3700000Bytes=3613.28KB \text{2×1000×50+2×10000×50+2×100000×12+2×2000×50=3700000Bytes=3613.28KB} 2×1000×50+2×10000×50+2×100000×12+2×2000×50=3700000Bytes=3613.28KB
4️⃣ Denoemalized \text{Denoemalized} Denoemalized模式下:数据库运行两年后,会占用多大存储空间 ? \text{ ?} ?
- 2×10000×110+2×100000×12+2×2000×50=4800000Bytes=4687.5KB \text{2×10000×110+2×100000×12+2×2000×50=4800000Bytes=4687.5KB} 2×10000×110+2×100000×12+2×2000×50=4800000Bytes=4687.5KB
2. DB Backup \textbf{2. DB Backup} 2. DB Backup
2.1. Overview \textbf{2.1. Overview} 2.1. Overview
1️⃣可能出现数据丢失的 Fails \text{Fails} Fails
Failure Type \textbf{Failure Type} Failure Type Description \textbf{Description} Description Statement failure \text{Statement failure} Statement failure 语法错误 User Process failure \text{User Process failure} User Process failure 处理过程中的错误,比如 Power Outage \text{Power Outage} Power Outage断电 Network failure \text{Network failure} Network failure 断网 User error \text{User error} User error 用户手欠导致的错误,比如误删掉某个表格 Memory failure \text{Memory failure} Memory failure 内存出错 Media Failure \text{Media Failure} Media Failure 硬盘出错 2️⃣ Backup \text{Backup} Backup的意义就在于,出现上述情况时防止数据丢失
2.2. Backup \textbf{2.2. Backup} 2.2. Backup种类
1️⃣ Backup \text{Backup} Backup种类: Physical & Logical \text{Physical \& Logical} Physical & Logical
- Physical & Logical \text{Physical \& Logical} Physical & Logical性质概述
Item \textbf{Item} Item Physical \textbf{Physical} Physical Logical \textbf{Logical} Logical Backup \text{Backup} Backup的内容 实体表格( File+Dictionary \text{File+Dictionary} File+Dictionary) SQL \text{SQL} SQL代码 数据库状态 希望 Offline(Cold Backup) \text{Offline(Cold Backup)} Offline(Cold Backup),但也可以 Online \text{Online} Online 必须 Online \text{Online} Online 备份速度 快(直接复制文件表格,无需转成 SQL \text{SQL} SQL) 慢(需转成 SQL \text{SQL} SQL) Log \text{Log} Log日志文件 包含在备份中 不包含在备份中 关于系统配置 备份后的文件要在相似的系统配置上才能运行 备份独立于具体的机器 - 如何选择
情景 选择 需要 Time-Critical \text{Time-Critical} Time-Critical要求 Pyhsical \text{Pyhsical} Pyhsical DB \text{DB} DB需要在不同硬件配置上运行 Logical \text{Logical} Logical
- 如果是就选择
2️⃣ Backup Option \text{Backup Option} Backup Option
Option \textbf{Option} Option 描述 适用情景 Online \text{Online} Online 备份时数据库要在线( Available \text{Available} Available) 数据库 24h×7day \text{24h×7day} 24h×7day不停歇工作,逻辑备份 Offline \text{Offline} Offline 备份时数据库下线( Shut Down \text{Shut Down} Shut Down) 物理备份 Full \text{Full} Full 备份起始于 DB \text{DB} DB诞生 数据库不常用(周末) Incremental \text{Incremental} Incremental 备份起始于上次备份完 数据库需要频繁使用(工作日) Onsite \text{Onsite} Onsite 备份在本地==(更快)== Time-Critical \text{Time-Critical} Time-Critical,需要快速导入备份数据 Offsite \text{Offsite} Offsite 备份在云端 上传云端要有意义
- Full & Incremental \text{Full \& Incremental} Full & Incremental通常会混用
- Offsite \text{Offsite} Offsite上传云端有意义的实例:
- 有多个工厂共用一套 DB \text{DB} DB时,可以本地也可以云端
- 只有一个工厂时,则必须上传云端,以防自然灾害数据损失
3. Data Warehouse \textbf{3. Data Warehouse} 3. Data Warehouse
3.1. Data Warehouse \textbf{3.1. Data Warehouse} 3.1. Data Warehouse概述
1️⃣两种 DB \text{DB} DB
- 概述
DB Type \textbf{DB Type} DB Type 适用情况 Transactional \text{Transactional} Transactional 涉及表格少(不需要太多 Join \text{Join} Join),比如 Relational DB \text{Relational DB} Relational DB Informational \text{Informational} Informational 涉及表格多,解决 Analytical \text{Analytical} Analytical问题 - 二者间的关系: Trans DB → Data Warehouse Info DB \text{Trans DB}\xrightarrow{\text{Data Warehouse}}\text{Info DB} Trans DBData WarehouseInfo DB
- Informantional \text{Informantional} Informantional数据库是 Read Only \text{Read Only} Read Only的,如果硬要修改则遵从以下步骤
- 从 Info DB \text{Info DB} Info DB回到 Trans DB \text{Trans DB} Trans DB
- 修改 Trans DB \text{Trans DB} Trans DB,然后将修改 ETL \text{ETL} ETL到 Info DB \text{Info DB} Info DB
更改Trans DB → ETL/DBA Process { ETL: 即Extract/Transform/Load过程 DBA: 即DB Administration(人为干预) 更改Info DB/Data Warehouse 更改\text{Trans DB}\xrightarrow[\text{ETL/DBA Process}]{\begin{cases}\text{ETL: 即Extract/Transform/Load过程}\\\\\text{DBA: 即DB Administration(人为干预)}\end{cases}}更改\text{Info DB/Data Warehouse} 更改Trans DB{ETL: 即Extract/Transform/Load过程DBA: 即DB Administration(人为干预)ETL/DBA Process更改Info DB/Data Warehouse2️⃣ Data Warehouse \text{Data Warehouse} Data Warehouse特点
特点 描述 Subject Oriented \text{Subject Oriented} Subject Oriented Data Warehouse \text{Data Warehouse} Data Warehouse需要有特定的分析目标 Validated, Integrated Data \text{Validated, Integrated Data} Validated, Integrated Data 数据转为同一模式(即 Dimensional Model \text{Dimensional Model} Dimensional Model) Time Variant \text{Time Variant} Time Variant 存在时间维度,历史数据被用来分析趋势 Non-Volatile \text{Non-Volatile} Non-Volatile(非易失性) 用户对 Data Warehouse \text{Data Warehouse} Data Warehouse只读,如果要更改则见下补充 3.2. Dimensional Modelling \textbf{3.2. Dimensional Modelling} 3.2. Dimensional Modelling
1️⃣ Fact & Dimension: \text{Fact \& Dimension:} Fact & Dimension: 可类比为因变量/自变量
Item \text{Item} Item 描述 示例 Fact \text{Fact} Fact (因变量)测量与分析的目标 产品 G \text{G} G过去三月的销量 Dimension \text{Dimension} Dimension (自变量)衡量目标的维度 产品销售地区,产品促销力度,产品投放时间 2️⃣ Star Schema: Fact Table+Dimension Table \text{Star Schema: Fact Table+Dimension Table} Star Schema: Fact Table+Dimension Table
- Fact Table \text{Fact Table} Fact Table置于中间
- 存放了所有 Dimension Table \text{Dimension Table} Dimension Table的 PK \text{PK} PK(其实是 PFK \text{PFK} PFK),这样利于不同表格快速相 Join \text{Join} Join
- 还存放了 Fact \text{Fact} Fact,此处为 PRICE/QUANTITY \text{PRICE/QUANTITY} PRICE/QUANTITY
- Dimension Table: \text{Dimension Table:} Dimension Table:
- 置于周围,有 N \text{N} N个 Dimension Table \text{Dimension Table} Dimension Table就是 N-Dimension \text{N-Dimension} N-Dimension
- 处于 Denormalized \text{Denormalized} Denormalized状态,若存在 Hierarchies \text{Hierarchies} Hierarchies(比如 Product \text{Product} Product高 Sale \text{Sale} Sale一级)则可标准化
- 当问到一个模型是不是 Star Shceme \text{Star Shceme} Star Shceme,需要阐述
- 存在(几个) Dimension \text{Dimension} Dimension
- 每个 Dimension \text{Dimension} Dimension都是 Denormalized \text{Denormalized} Denormalized的
3️⃣ Star Schema → 向外延展 ( 可以延申很多层 ) 将Dimension Table完全规范化 Snowflake Schema \text{Star Schema}\xrightarrow[向外延展(可以延申很多层)]{将\text{Dimension Table}完全规范化}\text{Snowflake Schema} Star Schema将Dimension Table完全规范化向外延展(可以延申很多层)Snowflake Schema
4. Distributed DB \textbf{4. Distributed DB} 4. Distributed DB
4.1. Distributed DB \textbf{4.1. Distributed DB} 4.1. Distributed DB概念
1️⃣几种数据库
Type \textbf{Type} Type 描述 Centralized DB \text{Centralized DB} Centralized DB 将一个数据库放到一个 Location \text{Location} Location Decentralized DB \text{Decentralized DB} Decentralized DB 将一个数据库分为几块,每块放到不同的 Location \text{Location} Location ( Aka Node \text{Aka Node} Aka Node) Distributed DB \text{Distributed DB} Distributed DB 将一个数据库分为几块,每块放到不同的 Location \text{Location} Location,并通过网络互联 2️⃣ Distributed DB \text{Distributed DB} Distributed DB结构示意图
- DB \text{DB} DB不同部分由 Communication Network/Link \text{Communication Network/Link} Communication Network/Link互联,任一部分可访问其他所有部分的数据
- 此处有两个 DB1 \text{DB1} DB1这是允许的,但是要确保连个 DB1 \text{DB1} DB1的数据尽可能一致
4.2. Distributed DB \textbf{4.2. Distributed DB} 4.2. Distributed DB的特点
1️⃣ Distributed DB \text{Distributed DB} Distributed DB的优点
- 加快访问/处理数据的速度:
- 原理:将数据存储在最长使用它的地区
- 示例: L1 \text{L1} L1处要频繁访问 DB1 \text{DB1} DB1,将 DB1 \text{DB1} DB1放在 L1 \text{L1} L1处就可(相比访问整体)加快访问速度
- 允许 Modular Growth \text{Modular Growth} Modular Growth/具有 Horizontal Scalability \text{Horizontal Scalability} Horizontal Scalability(横向扩展)
- Vertical Scaling: \text{Vertical Scaling:} Vertical Scaling: 在 Centralized DB \text{Centralized DB} Centralized DB中增加磁盘空间
- Horizontal Scaling: \text{Horizontal Scaling:} Horizontal Scaling: 在 Distributed DB \text{Distributed DB} Distributed DB中增加节点,并用网络连接;成本要低得多
- 稳定性更高:一个节点崩了,其它节点不受牵连,整个系统不至于崩( Partition Tolerance \text{Partition Tolerance} Partition Tolerance)
2️⃣ Distributed DB \text{Distributed DB} Distributed DB的缺点
- Management and Control \text{Management and Control} Management and Control更复杂,维护更复杂
- 需要保证数据一致( Integrity \text{Integrity} Integrity):比如有多个 DB1 \text{DB1} DB1节点备份,用户改变一 DB1→ \text{DB1→} DB1→所有 DB1 \text{DB1} DB1都变
- 安全问题更严峻:节点越多越容易被黑
3️⃣ Features \text{Features} Features
- Local Transparency: \text{Local Transparency: } Local Transparency: 用户不必知道具体某个数据存放在哪里
- Local Autonomy: \text{Local Autonomy: } Local Autonomy: 如果 Communication \text{Communication} Communication断掉了,还可以继续使用本地的节点
4.3. Distribution Option \textbf{4.3. Distribution Option} 4.3. Distribution Option
1️⃣ Partitioned/Replicated \text{Partitioned/Replicated} Partitioned/Replicated方式:但注意一般实际会采用二者的结合
Option \textbf{Option} Option 描述:将 DB _ _ \small\text{DB}\_\_ DB__ 示例 可靠性 访问(本地)数据 存储 Replicated \text{Replicated} Replicated 完整复制几块 ABC→3×ABC \small\text{ABC→3×ABC} ABC→3×ABC 高 慢 大 Partitioned \text{Partitioned} Partitioned 拆为不同几块 ABC→A/B/C或AB/C \small\text{ABC→A/B/C或AB/C} ABC→A/B/C或AB/C 低 快 小 Mixed \text{Mixed} Mixed 前二者混合 ABC→AB/BC/C \small\text{ABC→AB/BC/C} ABC→AB/BC/C N/A \text{N/A} N/A N/A \text{N/A} N/A N/A \text{N/A} N/A 🤔 Replicated \text{Replicated} Replicated方式特点
- 优点:减少了 Network Traffic \text{Network Traffic} Network Traffic,因为所有数据都在本地,无需频繁联网取数据
- 缺点: Update \text{Update} Update操作很费时(难以保持一致性),比如改动一个 Node \text{Node} Node其余所有点必定都需更新
2️⃣ Partitioned \text{Partitioned} Partitioned的两种方式: 把数据横着/竖着切开 →Horizontal/Vertical \text{→Horizontal/Vertical} →Horizontal/Vertical
方式 描述:分割表的操作 数据还原 Horizontal \text{Horizontal} Horizontal 把不同的 Row \text{Row} Row放在不同的 Location \text{Location} Location 将不同 Location \text{Location} Location数据 UNION \text{UNION} UNION Vertical \text{Vertical} Vertical 把不同的 Column \text{Column} Column放在不同的 Location \text{Location} Location 将不同 Location \text{Location} Location数据 JOIN \text{JOIN} JOIN
- Vertical \text{Vertical} Vertical会改变表格的 Schema \text{Schema} Schema,因此更复杂
4.4. CAP \textbf{4.4. CAP} 4.4. CAP理论:不可能三角
1️⃣概述
性质 描述 对应更新选项 Partition Tolerance \text{Partition Tolerance} Partition Tolerance 某节点网络失联后,系统任能运行 N/A \text{N/A} N/A Consistency \text{Consistency} Consistency 所有用户看到的数据都一致 数据立即更新( Synchronous \text{Synchronous} Synchronous) Availability \text{Availability} Availability 每个请求都有回应,即使节点故障 数据延迟更新( Asynchronous \text{Asynchronous} Asynchronous) 2️⃣ Distributed DB: \text{Distributed DB:} Distributed DB: 由于有多个节点,必须选择 Partition Tolerance \text{Partition Tolerance} Partition Tolerance
- 当节点失联后还使用节点 → \text{→} →再选择 Availability \text{Availability} Availability
- 当节点失联后就丢弃节点 → \text{→} →再选择 Consistency \text{Consistency} Consistency
3️⃣ Transactional DB: \text{Transactional DB:} Transactional DB: 只有一个节点(不存在网络问题) → \to →大概率只能选 Availability+Consistency \text{Availability+Consistency} Availability+Consistency
5. NoSQL: \textbf{5. NoSQL: } 5. NoSQL: 专门用来存数据
5.1. NoSQL \textbf{5.1. NoSQL} 5.1. NoSQL概述
1️⃣ NoSQL \text{NoSQL} NoSQL的引入:要解决 Objected Oriented \text{Objected Oriented} Objected Oriented(面向对象)问题,而传统 Relational Model \text{Relational Model} Relational Model办不到
2️⃣ NoSQL \text{NoSQL} NoSQL的作用:存储大量数据
- Bigdata \text{Bigdata} Bigdata的特点: 3V \text{3V} 3V ( Volume \text{Volume} Volume/量大, Variety \text{Variety} Variety/种类多, Velocity \text{Velocity} Velocity/数据生成快)
特点 示例 Volume \text{Volume} Volume Scalability / Data Lake \text{Scalability / Data Lake} Scalability / Data Lake Variety \text{Variety} Variety Bespoke QL / Different Data Structure \text{\textcolor{red}{\colorbox{yellow}{Bespoke QL}} / Different Data Structure} Bespoke QL / Different Data Structure Velocity \text{Velocity} Velocity Streaming Data / Eventual Consistency \text{Streaming Data / \textcolor{red}{\colorbox{yellow}{Eventual Consistency}}} Streaming Data / Eventual Consistency - NoSQL \text{NoSQL} NoSQL存储数据的方式: Schema On Read \text{Schema On Read} Schema On Read
方式 描述 适用 Schema On Read \text{Schema On Read} Schema On Read 先快速存储数据,稍后再为数据选择 Model \text{Model} Model NoSQL \text{NoSQL} NoSQL Schema On Write \text{Schema On Write} Schema On Write 先决定数据的 Model \text{Model} Model,再存储数据 传统数据库 - Data Lake: \text{Data Lake:} Data Lake: 存储数据的大型集成 Repository \text{Repository} Repository
3️⃣ NoSQL \text{NoSQL} NoSQL更契合 Distributed DB \text{Distributed DB} Distributed DB的思想:
- 在 CAP \text{CAP} CAP中选择 Partition Tolerance+Availability \text{Partition Tolerance+Availability} Partition Tolerance+Availability
- 由于主要用于存数据,所以比起一致性对数据可访问性要求更高
4️⃣优点: Flexible Model / Scalability / Performance / High Availability \text{Flexible Model / Scalability / Performance / High Availability} Flexible Model / Scalability / Performance / High Availability
5.2. NoSQL \textbf{5.2. NoSQL} 5.2. NoSQL种类
1️⃣ Key-Value \text{Key-Value} Key-Value
- 结构:类似于目录结构,通过 Primary Key \text{Primary Key} Primary Key索引,对应其下的 Value \text{Value} Value可以放任何东西
- 特点:
- 最灵活,图片/视频/ PDF \text{PDF} PDF都能放 (杂七杂八格式要放一起时,大概率用 Key-Value \text{Key-Value} Key-Value)
- 结构化程度最弱( Unstructured \text{Unstructured} Unstructured)
2️⃣ Document: \text{Document:} Document:
- 结构:将数据存储在文件中,比如 Json/XML \text{Json/XML} Json/XML,比较偏门的还有== MARC \text{MARC} MARC==
- 特点: Semi-Structured \text{Semi-Structured} Semi-Structured或者说 Object-Relational Structured \text{Object-Relational Structured} Object-Relational Structured
- 示例: MangoDB \text{MangoDB} MangoDB
3️⃣ Column Family: \text{Column Family:} Column Family: 类似于 Vertical Partition \text{Vertical Partition} Vertical Partition将每列存储为一个表格
4️⃣ Graphic-Oriented: \text{Graphic-Oriented:} Graphic-Oriented:
- 图边 ↔ \xleftrightarrow{} 关系,图节点 ↔ \xleftrightarrow{} 实体
- 适合追溯不同实体间的关系 ( Track Relationship \text{Track Relationship} Track Relationship)
5.3. BASE \textbf{5.3. BASE} 5.3. BASE理论 For NoSQL \textbf{For NoSQL} For NoSQL
对应 Transaction \text{Transaction} Transaction的 ACID \text{ACID} ACID性质
1️⃣ Basically Avaliable: \text{Basically Avaliable:} Basically Avaliable: 选择 Partition Tolerance \text{Partition Tolerance} Partition Tolerance基础上,确保 Availbility \text{Availbility} Availbility而非 Consistency \text{Consistency} Consistency
2️⃣ Evental Consistency: \text{Evental Consistency:} Evental Consistency: 系统从停止接收数据开始,终将会(慢慢更新)到达 Evental Consistency \text{Evental Consistency} Evental Consistency
3️⃣ Soft State: \text{Soft State:} Soft State: 在最终一致状态( Evental Consistency \text{Evental Consistency} Evental Consistency)前,系统状态会随时改变