Related Work
Replicate files for high availability at the expense of consistency
Ficus
Coda
DFS not using any centralized server
Farsite
DFS using master/slave
GFS (the master is now made fault tolerant using the Chubby abstraction)
Distributed relational database that allows disconnected operations and provides eventual data consistency
Bayou
Allow disconnected operations and guarantee eventual consistency
Allow application-level conflict resolution
Bayou
Perform system-level conflict resolution
Coda
Ficus
Data Model
table
distributed multi-dimensional map indexed by a key
key
a string with no size restrictions
value
an object which is highly structured
operation
every operation under a single row key is atomic per replica
column
columns are grouped together into sets called column families, similar to Bigtable
Cassandra exposes two kinds of column families: Simple and Super column families.
Super column families can be visualized as a column family within a column family
API
insert(table, key, rowMutation)
get(table, key, columnName)
delete(table, key, columnName)
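A minimal sketch of how the three calls fit the data model above, assuming a purely in-memory, single-process stand-in. The ToyTable class, the dict-shaped row mutation, and the "family:column" naming are hypothetical illustrations, not Cassandra's actual types.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical stand-in for one table: row key -> column name -> value."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def apply(self, key, row_mutation):
        # row_mutation is assumed to be a dict of column name -> value;
        # all columns of one row key are updated together
        self.rows[key].update(row_mutation)


tables = defaultdict(ToyTable)


def insert(table, key, row_mutation):
    tables[table].apply(key, row_mutation)


def get(table, key, column_name):
    # returns None when the row or the column does not exist
    return tables[table].rows.get(key, {}).get(column_name)


def delete(table, key, column_name):
    tables[table].rows.get(key, {}).pop(column_name, None)


insert("users", "alice", {"profile:email": "a@example.com"})
print(get("users", "alice", "profile:email"))
```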
System Architecture
Characteristics the system needs to have
load balancing
membership
failure detection
failure recovery
replica
synchronization
overload handling
state transfer
concurrency
job scheduling
request marshalling
request routing
system monitoring
alarming
configuration management
Partitioning
consistent hashing
uses an order-preserving hash function
the basic consistent hashing algorithm presents some challenges
1. the random position assignment of each node on the ring leads to non-uniform data and load distribution
2. the basic algorithm is oblivious to the heterogeneity in the performance of nodes; solutions to this problem:
1) every node is assigned to multiple positions on the circle (as in Dynamo)
2) load information on the ring is analyzed and lightly loaded nodes move on the ring to alleviate heavily loaded nodes
Cassandra uses the second solution
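The ring lookup and the load-based move can be sketched as follows. This is an illustration under stated assumptions: key positions are given directly as integers (standing in for the output of an order-preserving hash), each node holds a single token, and the Ring class and load helper are hypothetical names.

```python
import bisect


class Ring:
    """Toy ring: a node owns the keys in (previous token, own token]."""

    def __init__(self, tokens):
        # tokens: dict of node name -> integer position on the ring
        self.tokens = dict(tokens)

    def owner(self, key_position):
        ordered = sorted((tok, node) for node, tok in self.tokens.items())
        positions = [tok for tok, _ in ordered]
        # first token >= key position, wrapping around to the smallest token
        i = bisect.bisect_left(positions, key_position) % len(ordered)
        return ordered[i][1]

    def move(self, node, new_token):
        # a lightly loaded node takes a new position to relieve a heavy node
        self.tokens[node] = new_token


def load(ring, key_positions):
    """Count how many of the given keys each node owns."""
    counts = {node: 0 for node in ring.tokens}
    for pos in key_positions:
        counts[ring.owner(pos)] += 1
    return counts


ring = Ring({"A": 10, "B": 20, "C": 90})
keys = [15, 25, 30, 40, 50, 60, 70]
print(load(ring, keys))   # C owns most of the keys
ring.move("A", 45)        # A moves into the middle of C's range
print(load(ring, keys))
```

In this toy run, moving A splits C's range, so the keys between 20 and 45 shift from C to A.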
Replication
Each key k is assigned to a coordinator node.
The coordinator is in charge of the replication of the data items
Cassandra elects a leader amongst its nodes using a system called Zookeeper
on joining the cluster, every node contacts the leader, who tells it which ranges it is a replica for
preference list
every node is aware of every other node in the system
Cassandra provides durability guarantees in the presence of node failures and network partitions by relaxing the quorum requirements
each row is replicated across multiple data centers
data centers are connected through high speed network links
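One way to picture a preference list is the coordinator followed by its successors on the ring. The sketch below assumes that simple placement rule (no rack or data center awareness), single integer tokens per node, and a hypothetical preference_list helper. It is not the leader-driven assignment described above.

```python
import bisect


def preference_list(tokens, key_position, n):
    """Return the coordinator for key_position plus its n-1 ring successors."""
    ordered = sorted((tok, node) for node, tok in tokens.items())
    positions = [tok for tok, _ in ordered]
    # the coordinator is the first node whose token is >= the key position
    start = bisect.bisect_left(positions, key_position) % len(ordered)
    count = min(n, len(ordered))  # cannot have more replicas than nodes
    return [ordered[(start + i) % len(ordered)][1] for i in range(count)]


tokens = {"A": 10, "B": 40, "C": 70, "D": 95}
print(preference_list(tokens, 50, 3))
```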
Membership
Cluster membership is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism
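The core idea of anti-entropy gossip, reconciling two views by keeping the newest version of each entry, can be sketched as below. This is a simplified push-pull merge, not the full Scuttlebutt protocol (which exchanges digests and deltas); the (version, state) tuples and function names are assumptions.

```python
def merge(local, remote):
    """Copy into local every entry that remote knows with a higher version."""
    for endpoint, (version, state) in remote.items():
        if endpoint not in local or local[endpoint][0] < version:
            local[endpoint] = (version, state)


def gossip_round(a, b):
    # push-pull exchange: afterwards both views hold the newest entries
    merge(a, b)
    merge(b, a)


view_a = {"n1": (3, "up"), "n2": (1, "joining")}
view_b = {"n2": (2, "up"), "n3": (1, "up")}
gossip_round(view_a, view_b)
print(view_a == view_b)
```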
Failure Detection
uses a modified version of the Accrual Failure Detector
- the failure detection module doesn't emit a Boolean value stating a node is up or down
- instead, the failure detection module emits a value which represents a suspicion level for each of the monitored nodes
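A suspicion level of this kind can be sketched as follows, assuming heartbeat inter-arrival times are approximated by an exponential distribution. The class name, the window size, and the use of plain numeric timestamps are hypothetical choices for illustration.

```python
import math
from collections import deque


class AccrualDetector:
    """Toy accrual detector for a single monitored node."""

    def __init__(self, window=1000):
        self.last_arrival = None
        self.intervals = deque(maxlen=window)  # sliding window of gaps

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: grows the longer the node stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0  # degenerate sample; no meaningful estimate
        elapsed = now - self.last_arrival
        # P(next heartbeat arrives later than elapsed) = exp(-elapsed / mean)
        # phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))


detector = AccrualDetector()
for t in (0, 1, 2, 3):
    detector.heartbeat(t)
print(detector.phi(4), detector.phi(10))
```

A caller would compare phi against a threshold of its own choosing rather than receive an up/down answer.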
Bootstrapping
when a node starts for the first time, it chooses a random token for its position in the ring
for fault tolerance, the mapping is persisted to disk locally and also in Zookeeper
the token information is then gossiped around the cluster
the node reads its configuration file which contains a list of a few contact points within the cluster
the initial contact points are called seeds of the cluster
Scaling the Cluster
Local Persistence
relies on the local file system for data persistence
typical write operation:
- a write into a commit log for durability and recoverability
- an update into an in-memory data structure
typical read operation:
- queries the in-memory data structure first
- then looks into the files on disk
- a bloom filter, summarizing the keys in the file, is stored in each file and also kept in memory
- a key in a column family could have many columns, so retrieving columns that are further away from the key requires special indexing
- column indices are maintained to allow jumping to the right chunk on disk for column retrieval
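The write and read paths above can be sketched with an in-memory toy. Everything here is an assumption made for illustration: lists and dicts stand in for the commit log and on-disk files, the flush step and its threshold exist only so the read path has files to search, and the bloom filter is a deliberately small one built on hashlib. Column indices are not modeled.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means definitely absent; True means possibly present
        return all((self.bits >> p) & 1 for p in self._positions(key))


class ToyStore:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # stands in for the on-disk commit log
        self.memtable = {}     # the in-memory data structure
        self.files = []        # flushed files, oldest first: (bloom, data)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log for durability
        self.memtable[key] = value            # 2. update memory
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.files.append((bloom, dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                  # memory first
            return self.memtable[key]
        for bloom, data in reversed(self.files):  # then files, newest first
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None


store = ToyStore()
store.write("k1", "v1")
store.write("k2", "v2")   # reaches the threshold and flushes to a "file"
store.write("k1", "v1b")  # newer value stays in memory
print(store.read("k1"), store.read("k2"), store.read("missing"))
```

The bloom filter lets a read skip files that cannot contain the key without touching their data.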
Implementation Details
the Cassandra process on a single machine consists primarily of the following abstractions:
- partitioning module
- cluster membership
- failure detection module
- storage engine module
- all modules are implemented in Java