Relational Database Management Systems (RDBMS):
Around since the 1970s
Countless examples in which they actually do make sense
The dawn of Big Data:
Previously: data sources were ignored because there was no cost-effective way to store everything; one option was to prune, retaining only the data for the last N days.
Today: store everything! Pruned data cannot provide the base needed to build useful mathematical models.
Batch processing
Hadoop and MapReduce:
Excels at storing semi-structured and unstructured data
Data interpretation takes place at analysis time (“schema on read”)
Flexibility in data classification
Batch Processing: A complement to RDBMS:
A scalable sink for data; processing is launched when the time is right
Optimized for large file storage
Optimized for “streaming” access
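To make the three points above concrete, here is a minimal sketch of “streaming” access to a large file stored in HDFS, using the standard Hadoop FileSystem API. The file path and the tab-separated record format are hypothetical; note that each line is interpreted only here, at analysis time, not when it was stored.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open a (hypothetical) large raw log file and read it sequentially,
        // front to back -- the access pattern HDFS is optimized for.
        try (FSDataInputStream in = fs.open(new Path("/data/raw/access.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // No schema was imposed at write time: the raw line is
                // interpreted only now, at analysis time.
                String[] fields = line.split("\t");
                // ... feed fields into the batch computation ...
            }
        }
    }
}
```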
Random access:
Users need to “interact” with the data, especially the data “crunched” by a MapReduce job
This is historically where RDBMS excel: random access for structured data.
Column-Oriented Databases
Data layout:
(1). Save their data grouped by columns
(2). Subsequent column values are stored contiguously on disk
(3). This is substantially different from traditional RDBMS, which store data row by row (see the layout sketch below)
Specialized databases for specific workloads:
(1). Reduced I/O
(2). Better suited to compression -> efficient use of I/O bandwidth
Indeed, column values are often similar and differ little from row to row
(3). Real-time access to data
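A minimal sketch of the difference between the two layouts; the three-record table (id, country, clicks) is invented for illustration. Grouping values by column places similar values next to each other, which is why column stores compress well and can read only the columns a query needs.

```java
public class LayoutDemo {
    // Three logical records of a hypothetical table: (id, country, clicks)
    static int[]    id      = {1, 2, 3};
    static String[] country = {"IT", "IT", "FR"};
    static int[]    clicks  = {10, 12, 11};

    public static void main(String[] args) {
        // Row-oriented layout: all values of ONE record are contiguous.
        // On disk: 1,IT,10 | 2,IT,12 | 3,FR,11
        for (int i = 0; i < id.length; i++) {
            System.out.print(id[i] + "," + country[i] + "," + clicks[i] + " | ");
        }
        System.out.println();

        // Column-oriented layout: all values of ONE column are contiguous.
        // On disk: 1,2,3 | IT,IT,FR | 10,12,11
        // Adjacent values are similar (IT,IT), so they compress well, and a
        // query that only needs `clicks` reads just that column's bytes.
        System.out.println(joinInts(id) + " | "
                + String.join(",", country) + " | " + joinInts(clicks));
    }

    static String joinInts(int[] xs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < xs.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(xs[i]);
        }
        return sb.toString();
    }
}
```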
Important NOTE:
(1). HBase is not a column-oriented DB in the typical sense
(2). HBase uses an on-disk column-oriented storage format
(3). Provides key-based access to a specific cell of data, or to a sequential range of cells.
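A minimal sketch of both access patterns with the HBase client API (2.x style). The table name shorturl, the column family data, and the row keys are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("shorturl"))) {

            // Key-based access to a specific cell.
            Get get = new Get(Bytes.toBytes("a5col"));
            get.addColumn(Bytes.toBytes("data"), Bytes.toBytes("url"));
            Result result = table.get(get);
            byte[] url = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("url"));
            System.out.println(Bytes.toString(url));

            // Sequential range of cells: scan between two row keys.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("a5col"))
                    .withStopRow(Bytes.toBytes("a5con"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // process each row in the key range, in key order
                }
            }
        }
    }
}
```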
The Problem with RDBMS
RDBMS are still relevant
(1). Persistence layer for frontend applications
(2). Store relational data
(3). Work well for a limited number of records
Example: Hush
(1). Used throughout this course
(2). URL shortener service
Let’s see the “scalability story” of such a service
(1). Assumption: service must run with a reasonable budget
(2). A few thousand users: use a LAMP stack
Normalize data
Use foreign keys
Use indexes
Find all short URLs for a given user
JOIN user and shorturl tables
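A sketch of that query over a normalized schema, issued through plain JDBC. The table and column names (user, shorturl, the user_id foreign key) are assumptions about the Hush schema; an index on shorturl.user_id is what keeps this JOIN cheap at small scale.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UserUrls {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/hush", "hush", "secret")) {
            // shorturl.user_id is a foreign key into user.id.
            String sql = "SELECT s.short_id, s.url "
                       + "FROM shorturl s JOIN user u ON s.user_id = u.id "
                       + "WHERE u.username = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "someuser");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                    }
                }
            }
        }
    }
}
```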
Stored Procedures
(1). Consistently update data from multiple clients
(2). The underlying DB system guarantees coherency
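A sketch of the idea, assuming a hypothetical MySQL stored procedure add_click that increments a click counter server-side: every client calls the same routine, so the update logic lives in one place and the DB keeps the data coherent.

```java
import java.sql.CallableStatement;
import java.sql.Connection;

public class ClickCounter {
    // The procedure is assumed to be defined on the server, e.g.:
    //   CREATE PROCEDURE add_click(IN p_short_id VARCHAR(8))
    //     UPDATE shorturl SET clicks = clicks + 1 WHERE short_id = p_short_id;
    public static void addClick(Connection conn, String shortId) throws Exception {
        try (CallableStatement cs = conn.prepareCall("{call add_click(?)}")) {
            cs.setString(1, shortId);
            cs.execute();  // all clients run the same server-side logic
        }
    }
}
```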
Transactions
(1). Make sure you can update tables in an atomic fashion
(2). RDBMS -> Strong Consistency (ACID properties)
(3). Referential Integrity
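A minimal JDBC sketch of an atomic two-table update (the schema is again hypothetical): either the new short URL row and the owner's counter update both commit, or neither does. The foreign key from shorturl.user_id to user.id is what referential integrity protects.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class AtomicUpdate {
    public static void createShortUrl(Connection conn, String shortId,
                                      String url, long userId) throws Exception {
        conn.setAutoCommit(false);  // start a transaction
        try (PreparedStatement insertUrl = conn.prepareStatement(
                 "INSERT INTO shorturl (short_id, url, user_id) VALUES (?, ?, ?)");
             PreparedStatement bumpCount = conn.prepareStatement(
                 "UPDATE user SET url_count = url_count + 1 WHERE id = ?")) {
            insertUrl.setString(1, shortId);
            insertUrl.setString(2, url);
            insertUrl.setLong(3, userId);
            insertUrl.executeUpdate();

            bumpCount.setLong(1, userId);
            bumpCount.executeUpdate();

            conn.commit();            // both changes become visible together
        } catch (Exception e) {
            conn.rollback();          // ... or neither is applied
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```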
Scaling up to tens of thousands of users
(1). Increasing pressure on the database server
(2). Adding more application servers is easy: they keep their shared state in the same central DB
(3). CPU and I/O start to become a problem on the DB server
Master-Slave architecture
(1). Add DB servers so that READS can be served in parallel
(2). The master DB takes all the WRITES (which are few in the Hush application)
(3). Slave DBs replicate the master DB and serve all READS (but you need a load balancer)
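A minimal sketch of the read/write split as seen from the application; the JDBC URLs are placeholders, and the random pick over slaves is a trivial stand-in for a real load balancer.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ReplicatedDb {
    private final String masterUrl = "jdbc:mysql://master/hush";
    private final List<String> slaveUrls = List.of(
            "jdbc:mysql://slave1/hush",
            "jdbc:mysql://slave2/hush");

    // All WRITES go to the single master.
    Connection writeConnection() throws Exception {
        return DriverManager.getConnection(masterUrl);
    }

    // READS are spread across the slaves, which replicate the master.
    Connection readConnection() throws Exception {
        int i = ThreadLocalRandom.current().nextInt(slaveUrls.size());
        return DriverManager.getConnection(slaveUrls.get(i));
    }
}
```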
Scaling up to hundreds of thousands
(1). READS are still the bottleneck
(2). Slave servers begin to fall short in serving client requests.
Caching
(1). Add a caching layer, e.g. Memcached or Redis
(2). Offload READS to a fast in-memory system
You lose consistency guarantees
Cache invalidation is critical for keeping the DB and the caching layer consistent (see the cache-aside sketch after this list)
(3). WRITES are the bottleneck
(4). The master DB is hit too hard by WRITE load
(5). Vertical scalability: beef up your master server
This becomes costly, as you may also have to replace your RDBMS
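A sketch of the cache-aside pattern behind points (1) and (2), including the invalidation step that keeps the DB and the cache from diverging. A ConcurrentHashMap stands in for Memcached or Redis, and UrlStore is a hypothetical DAO over the RDBMS.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedUrlLookup {
    // Stand-in for Memcached/Redis; in production this is a remote client.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final UrlStore db;  // hypothetical DAO wrapping the RDBMS

    public CachedUrlLookup(UrlStore db) { this.db = db; }

    // Cache-aside READ: serve from memory when possible, else hit the DB.
    public String resolve(String shortId) {
        String url = cache.get(shortId);
        if (url == null) {
            url = db.lookupUrl(shortId);  // slow path: the READ reaches the DB
            if (url != null) cache.put(shortId, url);
        }
        return url;
    }

    // WRITE path: update the DB, then invalidate the stale cache entry.
    // Skipping the invalidation is exactly how cache and DB diverge.
    public void update(String shortId, String newUrl) {
        db.updateUrl(shortId, newUrl);
        cache.remove(shortId);
    }

    interface UrlStore {
        String lookupUrl(String shortId);
        void updateUrl(String shortId, String newUrl);
    }
}
```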
SQL JOINs become a bottleneck
(1). Schema de-normalization (see the sketch after this list)
(2). Cease using stored procedures, as they become slow and eat up a lot of server CPU
(3). Materialized views (they speed up READS)
(4). Drop secondary indexes, as they slow down WRITES
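A sketch of what de-normalization buys on the hot READ path, assuming a hypothetical username column copied into each shorturl row (e.g. added with ALTER TABLE shorturl ADD COLUMN username VARCHAR(64)): the lookup becomes a single-table query with no JOIN.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DenormalizedLookup {
    // Before: SELECT ... FROM shorturl s JOIN user u ON s.user_id = u.id
    // After de-normalization the username lives in the shorturl row itself.
    public static String[] resolve(Connection conn, String shortId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT url, username FROM shorturl WHERE short_id = ?")) {
            ps.setString(1, shortId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                // Trade-off: a username change now touches many rows, and the
                // application, not the schema, keeps the copies in sync.
                return new String[] { rs.getString(1), rs.getString(2) };
            }
        }
    }
}
```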
What if your application needs to further scale up?
Vertical scalability vs. horizontal scalability
Sharding
Partition your data across multiple databases
Essentially, you break your tables horizontally and ship the pieces to different servers (see the routing sketch after this list)
This is done using fixed boundaries
Re-sharding to achieve load-balancing
This is an operational nightmare
Re-sharding takes a huge toll on I/O resources
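A minimal sketch of hash-based sharding with fixed boundaries; the shard URLs are placeholders. The routing rule also shows why re-sharding hurts: adding one more shard changes userId mod N for almost every key, so rows must be moved between servers in bulk.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;

public class ShardRouter {
    // Fixed boundaries: each shard permanently owns a slice of the key space.
    private final List<String> shardUrls = List.of(
            "jdbc:mysql://shard0/hush",
            "jdbc:mysql://shard1/hush",
            "jdbc:mysql://shard2/hush");

    // Horizontal partitioning: every user's rows live on exactly one shard,
    // chosen by hashing the user id modulo the (fixed) number of shards.
    Connection shardFor(long userId) throws Exception {
        int shard = (int) Math.floorMod(userId, (long) shardUrls.size());
        return DriverManager.getConnection(shardUrls.get(shard));
    }
}
```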