· Authors:
o Prashant Malik
o Karthik Ranganathan
o Avinash Lakshman
· Structured storage system over P2P (keys are consistent-hashed over servers)
· Initially aimed at email inbox search problem
· Design goals:
o Cost Effective
o Highly Available
o Incrementally Scalable
o Efficient data layout
o Minimal administration
· Why Cassandra
o MySQL drives too many random I/Os
o File-based solutions require far too many locks
· What is Cassandra
o Structured storage over a distributed cluster
o Redundancy via replication
o Supports append/insert without reads
o Supports a caching layer
o Supports Hadoop operations
· Cassandra Architecture
o Core Cassandra Services:
§ Messaging (async, non-blocking)
§ Failure detector
§ Cluster membership
§ Partitioning scheme
§ Replication strategy
o Cassandra Middle Layer
§ Commit log
§ Mem-table
§ Compactions
§ Hinted handoff
§ Read repair
§ Bootstrap
o Cassandra Top Layer
§ Key, block, & column indexes
§ Read consistency
§ Touch cache
§ Cassandra API
§ Admin API
o Above the top layer:
§ Tools
§ Hadoop integration
§ Search API and Routing
· Cassandra Data Model
o Key (uniquely specifies a “row”)
§ Any arbitrary string
o Column families are declared or deleted in advance by administrative action
§ Columns can be added or deleted dynamically
§ Column families have attributes:
· Name: arbitrary string
· Type: simple or super
o Key can “contain” multiple column families
§ No requirement that two keys have any overlap in columns
o Columns can be added or removed arbitrarily from column families
o Columns:
§ Name: arbitrary string
§ Value: non-indexed blob
§ Timestamp (client provided)
o Column families have sort orders
§ Time-based sort or name-based sort
o Super-column families:
§ Bigtable calls them locality groups
§ Super-column families have a sort order
§ Essentially a multi-column index
o System column families
§ For internal use by Cassandra
o Example from email application
§ Mail-list (sorted by name)
· All mail that includes a given word
§ Thread-list (sorted by time)
· All threads that include a given word
§ User-list (sorted by time)
· All mail for a given user
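The data model above can be sketched as nested maps. This is a minimal illustration with hypothetical names (`store`, `insert`, `get` are not Cassandra's API): key → column family → column name → (value, client timestamp).

```python
# Sketch of the Cassandra data model as nested dicts (illustrative only):
# row key -> {column family -> {column name -> (value, timestamp)}}.
import time

store = {}

def insert(key, column_family, column, value, timestamp=None):
    """Add a column to a row dynamically; column families themselves
    are declared in advance by administrative action."""
    row = store.setdefault(key, {})
    cf = row.setdefault(column_family, {})
    # Timestamps are client-provided in Cassandra; default to wall clock here.
    cf[column] = (value, timestamp if timestamp is not None else time.time())

def get(key, column_family, column):
    """Return the value of one column; the value itself is a non-indexed blob."""
    return store[key][column_family][column][0]

# Two keys need not share any columns, even within the same column family.
insert("user:42", "Mail-list", "cassandra", "msg-17")
insert("user:99", "Mail-list", "hadoop", "msg-3")
print(get("user:42", "Mail-list", "cassandra"))  # -> msg-17
```

Note how columns attach per row: nothing forces `user:42` and `user:99` to overlap.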
· Cassandra API
o Simple get/put model
· Write model:
o Quorum write or async mode (used by email application)
o Async: send request to any node
§ That node will push the data to appropriate nodes but return to client immediately
o Quorum write:
§ Blocks until quorum is reached
o If a node is down, write to another node with a hint saying where the data should eventually be written
§ A harvester runs every 15 min, finds hints, and moves the data to the appropriate node
o At write time, you first write to a commit log (sequential)
§ After write to log it is sent to the appropriate nodes
§ Each node receiving write first records it in a local log
· Then makes update to appropriate memtables (1 for each column family)
§ Memtables are flushed to disk when:
· Out of space
· Too many keys (128 is default)
· Time duration (client provided – no cluster clock)
§ When a memtable is written out, two files go out:
· Data File
· Index File
o Key, offset pairs (points into data file)
o Bloom filter (all keys in data file)
§ When a commit log has had all its column families pushed to disk, it is deleted
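The local write path above (commit log first, then memtable, flush on too many keys) can be sketched roughly as follows. All names here (`Node`, `MEMTABLE_KEY_LIMIT`) are illustrative, not Cassandra's actual classes; only the 128-key default comes from the notes.

```python
# Sketch of a node's local write path: sequential commit-log append, then
# memtable update (one memtable per column family), then flush to an
# sstable-like (data, index) pair once the memtable holds too many keys.
from collections import defaultdict

MEMTABLE_KEY_LIMIT = 128  # default flush threshold per the notes

class Node:
    def __init__(self):
        self.commit_log = []                # sequential, append-only
        self.memtables = defaultdict(dict)  # column family -> {key: columns}
        self.sstables = []                  # flushed (sorted data, index) pairs

    def write(self, key, column_family, columns):
        self.commit_log.append((key, column_family, columns))  # log first
        self.memtables[column_family][key] = columns           # then memtable
        if len(self.memtables[column_family]) >= MEMTABLE_KEY_LIMIT:
            self.flush(column_family)

    def flush(self, column_family):
        # Data file: key-sorted rows. Index file: key -> offset into data.
        data = sorted(self.memtables[column_family].items())
        index = {k: pos for pos, (k, _) in enumerate(data)}
        self.sstables.append((data, index))
        self.memtables[column_family].clear()
```

A real node would also attach a Bloom filter over the data file's keys and delete the commit log once all its column families are flushed; those steps are omitted here.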
· Data files accumulate over time. Periodically, data files are merge-sorted into a new file (with a new index)
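The compaction step can be illustrated as a k-way merge of already-sorted data files, keeping the newest version of each key. The `(key, timestamp, value)` row shape below is an assumption for illustration, not Cassandra's on-disk format.

```python
# Sketch of compaction: merge-sort several key-sorted data files into one,
# keeping only the newest (highest-timestamp) row per key.
import heapq
from itertools import groupby

def compact(*data_files):
    """Each data file is a list of (key, timestamp, value), sorted by key."""
    merged = heapq.merge(*data_files)  # k-way merge of sorted inputs
    out = []
    for key, rows in groupby(merged, key=lambda r: r[0]):
        out.append(max(rows, key=lambda r: r[1]))  # newest timestamp wins
    return out

f1 = [("a", 1, "old"), ("b", 1, "x")]
f2 = [("a", 2, "new"), ("c", 1, "y")]
print(compact(f1, f2))  # [('a', 2, 'new'), ('b', 1, 'x'), ('c', 1, 'y')]
```

Because inputs stay sorted and the output is sorted, compaction needs only sequential disk access, consistent with the write properties below.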
· Write properties:
o No locks in critical path
o Sequential disk access only
o Behaves like a write through cache
§ If you read from the same node, you see your own writes. There doesn't appear to be any guarantee that a read sees the latest change in the failure case
o Atomicity guaranteed for a key
o Always writable
· Read Path:
o Connect to any node
o That node routes the request to the closest data copy, which serves it immediately
o If high consistency required, don’t return from local immediately
§ First send digest request to all replicas
§ If delta is found, the updates are sent to the nodes that don’t have current data (read repair)
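The digest-compare-and-repair flow above can be sketched as follows. Everything here is illustrative (replicas as dicts, md5 digests, client timestamps as version numbers); the point is only the shape of read repair: find the newest version, then push it to stale replicas.

```python
# Sketch of digest-based read repair: compare replica digests, return the
# newest version, and update any replica holding stale (or no) data.
import hashlib

def digest(version):
    """Cheap stand-in for a replica's digest response."""
    return hashlib.md5(repr(version).encode()).hexdigest()

def read_with_repair(key, replicas):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    versions = [r.get(key) for r in replicas]
    newest = max((v for v in versions if v is not None), key=lambda v: v[1])
    for replica, v in zip(replicas, versions):
        if v is None or digest(v) != digest(newest):
            replica[key] = newest  # read repair: send update to stale node
    return newest[0]

r1 = {"k": ("old", 1)}
r2 = {"k": ("new", 2)}
print(read_with_repair("k", [r1, r2]))  # -> new; r1 is repaired in passing
```

In the low-consistency path the local read would return first and this repair would happen in the background; here it is shown synchronously for clarity.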
· Replication supported via multiple consistent hash rings:
o Servers are hashed over ring
o Keys are hashed over ring
o Redundancy via walking around the ring and placing replicas:
§ On the next node (rack-position unaware), or
§ On the next node on a different rack (rack aware), or
§ On the next node in a different data center (the implication being that the ring can span data centers)
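The ring placement above can be sketched with a minimal consistent-hashing ring: servers and keys hash onto the same space, and a key's replicas are the next N distinct nodes walking clockwise. This shows only the rack-unaware case; node names and the `Ring` class are illustrative.

```python
# Minimal consistent-hash ring: nodes and keys share one hash space;
# replicas(key, n) walks clockwise from the key's position and returns
# the next n distinct nodes (rack-position-unaware placement).
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def replicas(self, key, n=3):
        i = bisect.bisect(self.points, (ring_hash(key), ""))
        out = []
        for j in range(len(self.points)):
            node = self.points[(i + j) % len(self.points)][1]
            if node not in out:
                out.append(node)
            if len(out) == n:
                break
        return out

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42"))  # three distinct nodes, clockwise from the key
```

Rack-aware or data-center-aware strategies would filter the walk by rack/DC metadata instead of taking the immediate successors.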
· Cluster membership
o Cluster membership and failure detection via gossip protocol
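A toy sketch of gossip-based membership: each round, every node merges views (member → heartbeat version) with one random peer, keeping the maximum version per member, so knowledge spreads epidemically. The structure here is illustrative, not Cassandra's actual gossip message format.

```python
# Toy anti-entropy gossip: each round, every node exchanges its membership
# view with a random peer; both keep the max heartbeat version per member.
import random

def gossip_round(views):
    """views: node name -> {member: heartbeat_version}."""
    for name in list(views):
        peer = random.choice([n for n in views if n != name])
        merged = {m: max(views[name].get(m, 0), views[peer].get(m, 0))
                  for m in set(views[name]) | set(views[peer])}
        views[name] = dict(merged)
        views[peer] = dict(merged)

# Initially each node knows only itself.
views = {"a": {"a": 3}, "b": {"b": 5}, "c": {"c": 1}}
for _ in range(10):
    gossip_round(views)
# After a few rounds, every node's view converges on all members.
```

Failure detection then hangs off these heartbeat versions: a member whose version stops advancing becomes a suspect for the accrual detector described next.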
· Accrual failure detector
o Default sets PHI to 5 in Cassandra
o Detection is 10 to 15 seconds with PHI=5
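The accrual detector outputs a continuous suspicion value phi rather than a binary up/down verdict; a node is suspected once phi crosses the threshold (5 per the notes). The exponential inter-arrival model below is a simplifying assumption for illustration; the real detector estimates the heartbeat distribution from a sampled history.

```python
# Sketch of an accrual (phi) failure detector, assuming exponentially
# distributed heartbeat inter-arrival times with a known mean.
import math

def phi(time_since_last_heartbeat, mean_interval):
    # P(a heartbeat still arrives after this much silence) = exp(-t/mean);
    # phi is -log10 of that probability, so it grows with the silence.
    p_later = math.exp(-time_since_last_heartbeat / mean_interval)
    return -math.log10(p_later)

mean = 1.0  # observed mean seconds between heartbeats
print(phi(1.0, mean) < 5)    # recent heartbeat: below threshold -> True
print(phi(12.0, mean) >= 5)  # long silence: suspected -> True
```

Under this model, phi scales linearly with silence (t / (mean * ln 10)), which is why raising PHI from 5 trades faster detection for more false suspicions.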
· UDP control messages and TCP for data messages
· Complies with Staged Event Driven Architecture (SEDA)
· Email system:
o 100m users
o 4B threads
o 25TB with 3x replication
o Uses joins across 4 tables:
§ Mailbox (user_id to thread_id mapping)
§ Msg_threads (thread to subject mapping)
§ Msg_store (thread to message mapping)
§ Info (user_id to user name mapping)
· Able to load using Hadoop at 1.5TB/hour
o Can load 25TB at network bandwidth over the Cassandra cluster