200316 Introduction to Databases LECTURE 9: Disks, I/O, Indexes 纯理论

🚩2020/03/16
我太喜欢这个老师了😂上课风趣👇(虽然很多梗我翻译成中文也听不太懂)
在这里插入图片描述

很抱歉没有认真听课,好多英文听不太懂,有些东西太偏cs理论一头雾水,准备花时间慢慢去琢磨,然后整理出笔记方便期末复习(这部分笔记是上课prof提到的点,skip部分课件)
【仅供学习分享,遵纪守法共同进步】
老师上课的课件 书本Database System Concepts 7th Edition


what is database?

The term database refers to a collection of data that is managed by a DBMS(Database Management System数据库管理系统)
The DBMS is expected to

  • allow users to create new databases and specify their schemas (logical structure of the data), usung a specialized data-definition language(DDL)
  • give users the ability to query and modify the data (query language or data-manipulation langauge(DML))
  • support the storage if very large amounts of data over a long period of time, allowing efficient access to the data for queries and database modifications
  • enable durability, the recovery of the database in the face of failuers, errors of many kinds, or intentional misuse
  • control access to data from many users at once, without allowing unexpected interations among users (called isolation) and without actions on the data to be performed partially but not completely (called atomicity)

Purpose of Database Systems

In the early days, database applications were built directly on top of file systems, which leads to:

  • Data redundancy and inconsistency: data is stored in multiple file formats resulting induplication of information in different files
  • Difficulty in accessing data
    • Need to write a new program to carry out each new task
  • Data isolation
    • Multiple files and formats
  • Integrity problems
    • Integrity constraints (e.g., account balance > 0) become “buried” in program code rather than being stated explicitly
    • Hard to add new constraints or change existing ones
  • Atomicity of updates
    • Failures may leave database in an inconsistent state with partial
    updates carried out
    • Example: Transfer of funds from one account to another should either complete or not happen at all
  • Concurrent access by multiple users
    • Concurrent access needed for performance
    • Uncontrolled concurrent accesses can lead to inconsistencies
    Ex: Two people reading a balance (say 100) and updating it by withdrawing money (say 50 each) at the same time
  • Security problems
    • Hard to provide user access to some, but not all, data
    Database systems offer solutions to all the above problems
    在这里插入图片描述

Lecture 9 DBMS Architecture and Implementation: Disks, I/O, Indexes

(Database Systems Concepts, V7, Ch. 12)

Classification of Physical Storage Media

  1. Can differentiate storage into:
  • volatile(不稳定的) storage: loses contents when power is switched off
  • non-volatile storage:
    • Contents persist even when power is switched off.
    • Includes secondary and tertiary storage, as well as batter-backed up main-memory.
  1. Factors affecting choice of storage media include
    • Speed with which data can be accessed
    • Cost per unit of data
    • Reliability

在这里插入图片描述

P.S :
cache is basically on the same chip as the processor. Main memory is addressable. It’s on another chip. 除这两个其他的are external
magnetic tapes 磁带
optical disk 光盘 like physical DVD
magnetic disk 磁盘: the spinning disks, the hard disk drives, so if you’ve got a computer with a hard disk drive or you’ve got an external storage device, it’s got a magnetic hard disk.
flash memory: semiconductor memory. There’s no moving parts, but it’s not volatile.

  • primary storage: Fastest media but volatile (cache, main memory).
  • secondary storage: next level in hierarchy, non-volatile, moderately fast access time
    • Also called on-line storage
    • E.g., flash memory, magnetic disks
  • tertiary(第三的) storage: lowest level in hierarchy, non-volatile, slow access time
    • also called off-line storage and used for archival storage
    • e.g., magnetic tape, optical storage
    • Magnetic tape

这里太理论了没听懂
在这里插入图片描述

  • Read-write head
  • Surface of platter divided into circular tracks
    • Over 50K-100K tracks per platter on typical hard disks
  • Each track is divided into sectors
    • A sector is the smallest unit of data that can be read or written
    • Sector size typically 512 bytes(最小)
    • Typical sectors per track: 500 to 1000 (on inner tracks) to 1000 to 2000 (on outer tracks)
  • To read/write a sector
    • disk arm swings to position head on right track
    • platter spins continually; data is read/written as sector passes under head
  • Head-disk assemblies
    • multiple disk platters on a single spindle (1 to 5 usually)
    • one head per platter, mounted on a common arm.
  • Cylinder i consists of i t h i^{th} ith track of all the platters
  • Disk controller – interfaces between the computer system and the disk
    drive hardware.
    • accepts high-level commands to read or write a sector
    • initiates actions such as moving the disk arm to the right track and actually reading or writing the data
    • Computes and attaches checksums to each sector to verify that data is read back correctly. (If data is corrupted, with very high probability stored checksum won’t match recomputed checksum)
    • Ensures successful writing by reading back sector after writing it
    • Performs remapping of bad sectors

Performance Measures of Disks

  • Access time – the time it takes from when a read or write request is
    issued to when data transfer begins. Consists of:
    Seek time – time it takes to reposition the arm over the correct track.
    Rotational latency – time it takes for the sector to be accessed to appear under the head.
  • Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
  • Disk block is a logical unit for storage allocation and retrieval
  • Sequential access pattern
    • Successive requests are for successive disk blocks
    • Disk seek required only for first block
  • Random access pattern
    • Successive requests are for blocks that can be anywhere on disk
    • Each access requires a seek
    • Transfer rates are low since a lot of time is wasted in seeks
  • I/O operations per second (IOPS)
    Number of random block reads that a disk can support per second
  • Mean time to failure (MTTF) – the average time the disk is expected to
    run continuously without any failure.

RAID

RAID (redundant array of independent disks) is a data storage virtualization technology that combines multiple physical disk drive components into a single logical unit for the purposes of data redundancy, performance improvement, or both.

  • Disk organization techniques that manage a large numbers of disks, providing a view of a single disk of
    high capacity and high speed by using multiple disks in parallel,
    high reliability by storing data redundantly, so that data can be recovered even if a disk fails
  • The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.
    在这里插入图片描述
    在这里插入图片描述
    在这里插入图片描述
    P.S. 橙色的是error correcting codes
    One of these things(蓝色) fails, I can still recover the data
    比较常见的是RAID-0/1/5/6
    在这里插入图片描述

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

过时但是是基础👇
在这里插入图片描述


Data Storage Structures

(Database Systems Concepts, V7, Ch. 13)

File Organization

  • The database is stored as a collection of files. Each file is a sequence of records. A record is a sequence of fields.
    P.S. We describe it abstractly: file is a table, records as a row, field as a column
  • One approach
    • Assume record size is fixed
    • Each file has records of one particular type only
    • Different files are used for different relations
    This case is easiest to implement; will consider variable length records later
  • We assume that records are smaller than a disk block

Terminology

• A tuple in a relation maps to a record. Records may be
– Fixed length
all the records take up the same number of bytes. so all the rows are the same size
– Variable length
within a table, the rows have got different sizes
– Variable format (which we will see in Neo4J, DynamoDB, etc).
• A block
– Is the unit of transfer between disks and memory (buffer pools).
– Contains multiple records, usually but not always from the same relation.
• The database address space contains
– All of the blocks and records that the database manages
– Including blocks/records containing data
– And blocks/records containing free space.

# Fixed length record
# Sample DDL
CREATE TABLE `products` (
`product_id` char(8) NOT NULL,
`product_name` varchar(16) NOT NULL,
`product_description` char(8) NOT NULL,
`product_brand` enum('IBM','HP','Acer','Lenovo','Some really long brand
name') NOT NULL, PRIMARY KEY (`product_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

enum 枚举用法

1. Fixed-length records

Deletion of record i: alternatives:
• moverecords i+1,…,n to i,…,n–1 -> Record 3 deleted
在这里插入图片描述

• move record n to i -> Record 3 deleted and replaced by record 11
在这里插入图片描述
• do not move records, but link all free records on a free list
在这里插入图片描述

2. Variable-length records

(1)一条记录中如何取出每一列?
在这里插入图片描述
(2)一个block中如何提取每一条记录?-> Slotted Page Structure
在这里插入图片描述

一个空的block 开始插入时第一个记录存在 block 的末尾, 同时在Block Header 中的数组加一项指向这个记录。第二条记录接着往前存,以此类推。 可用空间被夹在了中间部分.

删除一条记录直接回收它所暂用的空间,指针数组中把它标记为已删除(如可以把它的 size 设置为 -1)。它前面的记录依次往后移动, 同时 header 中的指针数组也要相应的更新。因为一个 block 4~8k,这个移动并没有太大影响。

Slotted-page 要求所有的指针不能直接指向记录本身,只能指向header 中指针数组项(二级指针)。这样记录移动时,引用该记录的地方不需要改变。
作者:lesliefang 链接:https://www.jianshu.com/p/a79995718fc8 来源:简书

Organization of Records in Files

  • Heap
    在这里插入图片描述

  • Sequential
    在这里插入图片描述在这里插入图片描述

  • In a multitable clustering file organization records of several different relations can be stored in the same file
    • Motivation: store related records on the same block to minimize I/O
    在这里插入图片描述
    在这里插入图片描述

  • B + B^+ B+-tree file organization
    • Ordered storage even with inserts/deletes
    • More on this in Chapter 14

  • Hashing – a hash function computed on search key; the result specifies in
    which block of the file the record should be placed
    • More on this in Chapter 14

Partitioning

  • Table partitioning: Records in a relation can be partitioned into smaller
    relations that are stored separately (Take the records from a single table and keep them into different files.)
    E.g., transaction relation may be partitioned into transaction_2018, transaction_2019, etc.
  • Queries written on transaction must access records in all partitions
    • Unless query has a selection such as year=2019, in which case only one partition in needed
  • Partitioning
    • Reduces costs of some operations such as free space management
    • Allows different partitions to be stored on different storage devices
    E.g., transaction partition for current year on SSD, for older years on magnetic disk

Data Dictionary Storage

在这里插入图片描述

show table from information_schema;
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值