Databases Revisited
Scalability, Performance, Relaxed Consistency, Agility, Intricacy and Necessity (SPRAIN)
Databases Revisited (cont.)
Motivation of NoSQL
• Designed to handle large amount of data across multiple servers
• There is a lot of unorganized data out there
• Easy to implement and deploy
• Mimics traditional relational database systems, but with triggers and lightweight transactions
• Raw, simple data structures
CAP Theorem Revisited
Guarantee to fulfill two out of three conditions
o Consistency
o All clients have same view of data
o Availability
o Writeable in the face of node failure
o Partition Tolerance
o Processing can continue in the face of network failure (crashed router, broken network)
Cassandra: A Quick Introduction
Apache Cassandra is
• an open source
• distributed and decentralized / distributed storage system (database)
• a wide column store database
• for managing very large amounts of structured data spread out across the world
• It provides highly available service with no single point of failure
• The current version: 4.0
• It is scalable, fault-tolerant, and consistent
• It is a column-oriented database
• Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable
• Created at Facebook, it differs sharply from relational database management systems
• Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family” data model
• Cassandra is being used by some of the biggest companies such as Facebook, Twitter, Cisco, Rackspace, ebay, Twitter, Netflix, and more
Cassandra: A Brief History
• Cassandra was created to power the Facebook Inbox Search
• Facebook open-sourced Cassandra in 2008 and became an
Apache Incubator project
• In 2010, Cassandra graduated to a top-level project, regular update and releases followed
Cassandra: An Evolution
• Google Bigtable (2006)
• Consistency model: strong
• Data Model: sparse map
• Clones: Hbase, Hypertable
• Amazon Dynamo (2007)
• O(1) DHT- distributed hash table
• Consistency Model: Client tune-able
• Clones: Riak, Voldemort
Cassandra ~= Bigtable + Dynamo
Cassandra: General Design Features
Emphasis on performance over analysis
• Still has support for analysis tools such as Hadoop
Organization
• Rows are organized into tables
• First component of a table’s primary key is the partition key
• Rows are clustered by the remaining columns of the key
• Columns may be indexed separately from the primary key
• Tables may be created, dropped, altered at runtime without blocking queries Language
• CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve)
Cassandra: Some Features
Elastic scalability − Cassandra is highly scalable; it allows to add more hardware to accommodate more customers and more data as per requirement.
Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.
Fast linear-scale performance − Cassandra is linearly scalable,i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
Flexible data storage − Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
Easy data distribution − Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.
Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency
Flexible, Familiar interface – Cassandra Query Language is quite similar to SQL. Relatively easy query language.
Cassandra: Some Highlights
Staged Event Architecture
• A general-purpose framework for high concurrency & load conditioning.
• Decomposes applications into stages separated by queues
Data Replication
• Configurable replication factor • Replica placement strategy
• Rack unaware →Simple Strategy
• Rack aware → Old Network Topology Strategy
• Data center shard → Network Topology Strategy
Peer-to-Peer Cluster
• Decentralized design
• Each node has the same role
• No single point of failure
• Avoids issues of master-slave DBMS’s
• No bottlenecking
Fault Tolerant/Durability
• Failures happen all the time with multiple nodes
• Replication
• Data is automatically replicated to multiple nodes
• Allows failed nodes to be immediately replaced
• Distribution of data to multiple data centers
• An entire data center can go down without data loss occurrin
Cassandra Architecture
• Partitioning
• How data is partitioned across nodes
• Replication
• How data is duplicated across nodes
• Cluster Membership
• How nodes are added, deleted to the cluster
Gossip Protocol
• Network Communication protocols inspired for real life rumour spreading.
• Periodic, Pairwise, inter-node communication.
• Low frequency communication ensures low cost.
• Random selection of peers.
• Example – Node A wish to search for pattern in data • Round 1 – Node A searches locally and then gossips with node B.
• Round 2 – Node A, B gossips with C and D.
• Round 3 – Nodes A, B, C and D gossips with 4 other nodes ……
• Round by round doubling makes protocol very robust.
Write Operation Stages
o Logging data in the Commit Log
o Writing data to the Mem-Table
o Flushing data from the Mem-Table
o Storing data on disk in SSTables
Write Operation
o Commit Log
o First place a write is recorded
o Crash recovery mechanism
o Write not successful until recorded in commit log
o Once reecorded in commit log, data is written to Mem-Table
o Mem-Table
o Data structure in memory
o Once Mem-Table size reaches a threshold, it is flushed (appended) to SSTable
o Several may exist at once (1 current, any others waiting to be flushed)
o First place read operations look for data
o SSTable
o Kept on disk
o Immutable once written
o Periodically compacted for performance
Cassandra Data Model
Data Model
o Table is a multi-dimensional map indexed by key (row key)
o Columns are grouped into Column Families
o 2 Types of Column Families
o Simple
o Super (nested Column Families)
o Each Column has
o Name
o Value
o Timestamp
Key-Value Model
o Cassandra is a column-oriented NoSQL system
o Column families: Sets of keyvalue pairs
o Column family as a table and keyvalue pairs as a row (using relational database analogy)
o A row is a collection of columns labeled with a name
Cassandra Row
o The value of a row is itself a sequence of key-value pairs
o Such nested key-value pairs are columns
o Key = column name
o A row must contain at least 1 column
Key Space
o A Key Space is a group of column families together
o It is only a logical grouping of column families and provides an isolated scope for names
Cassandra Query Language (CQL)
Interface Data Definition Language (DDL)
o Creating a keyspace - namespace of tables
CREATE KEYSPACE “Key Space Name”
WITH replication = {‘class’: ’Strategy name’, ‘replication_factor’: No of replicas};
o To use namespace: USE “Key Space Name”;
o Creating tables:
CREATE TABLE table name (
column 1 name data type PRIMARY KEY,
column 2 name data type,
column 1 name data type, )
Interface Data Manipulation Language (DML)
o Inserting data
INSERT INTO table name (, , …) VALUES (, , …);
o Querying tables
SELECT expression reads one or more records from Cassandra column family and returns a result-set of rows
SELECT * FROM table; SELECT column FROM table WHERE condition;
Some Considerations
o Cassandra is designed as a distributed database management system
o Use it when you have a lot of data spread across multiple servers
o Cassandra write performance is always excellent, but read performance depends on write patterns
o It is important to spend enough time to design proper schema around the query pattern
o Having a high-level understanding of some internals is a plus o Ensures a design of a strong application built atop Cassandra.
Cassandra: Advantages
o Perfect for time-series data
o High performance
o Decentralization
o Nearly linear scalability
o Replication support
o No single points of failure
o MapReduce support
Cassandra: Disadvantages
o No referential integrity
o No concept of JOIN
o Querying options for retrieving data are limited
o Sorting data is a design decision
o No GROUP BY
o No support for atomic operations
o if operation fails, changes can still occur
o First think about queries, then about data model