· Authors:
o Prashant Malik
o Karthik Ranganathan
o Avinash Lakshman
· Structured storage system over P2P (keys are consistent-hashed over servers)
· Initially aimed at email inbox search problem
· Design goals:
o Cost Effective
o Highly Available
o Incrementally Scalable
o Efficient data layout
o Minimal administration
· Why Cassandra
o MySQL drives too many random I/Os
o File-based solutions require far too many locks
· What is Cassandra
o Structured storage over a distributed cluster
o Redundancy via replication
o Supports append/insert without reads
o Supports a caching layer
o Supports Hadoop operations
· Cassandra Architecture
o Core Cassandra Services:
§ Messaging (async, non-blocking)
§ Failure detector
§ Cluster membership
§ Partitioning scheme
§ Replication strategy
o Cassandra Middle Layer
§ Commit log
§ Mem-table
§ Compactions
§ Hinted handoff
§ Read repair
§ Bootstrap
o Cassandra Top Layer
§ Key, block, & column indexes
§ Read consistency
§ Touch cache
§ Cassandra API
§ Admin API
o Above the top layer:
§ Tools
§ Hadoop integration
§ Search API and Routing
· Cassandra Data Model
o Key (uniquely specifies a “row”)
§ Any arbitrary string
o Column families are declared or deleted in advance by administrative action
§ Columns can be added or deleted dynamically
§ Column families have attributes:
· Name: arbitrary string
· Type: simple or super
o Key can “contain” multiple column families
§ No requirement that two keys have any overlap in columns
o Columns can be added or removed arbitrarily from column families
o Columns:
§ Name: arbitrary string
§ Value: non-indexed blob
§ Timestamp (client provided)
o Column families have sort orders
§ Time-based sort or name-based sort
o Super-column families:
§ Bigtable calls them locality groups
§ Super-column families have a sort order
§ Essentially a multi-column index
o System column families
§ For internal use by Cassandra
o Example from email application
§ Mail-list (sorted by name)
· All mail that includes a given word
§ Thread-list (sorted by time)
· All threads that include a given word
§ User-list (sorted by time)
· All mail for a given user
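The data model above can be sketched as nested maps. This is a minimal illustration with hypothetical names (`store`, `insert`, `get` are not Cassandra's API): key → column family → column name → (value, client timestamp).

```python
# Sketch of the Cassandra data model as nested dicts (illustrative only):
# row key -> {column family -> {column name -> (value, timestamp)}}.
import time

store = {}

def insert(key, column_family, column, value, timestamp=None):
    """Add a column to a row dynamically; column families themselves
    are declared in advance by administrative action."""
    row = store.setdefault(key, {})
    cf = row.setdefault(column_family, {})
    # Timestamps are client-provided in Cassandra; default to wall clock here.
    cf[column] = (value, timestamp if timestamp is not None else time.time())

def get(key, column_family, column):
    """Return the value of one column; the value itself is a non-indexed blob."""
    return store[key][column_family][column][0]

# Two keys need not share any columns, even within the same column family.
insert("user:42", "Mail-list", "cassandra", "msg-17")
insert("user:99", "Mail-list", "hadoop", "msg-3")
print(get("user:42", "Mail-list", "cassandra"))  # -> msg-17
```

Note how columns attach per row: nothing forces `user:42` and `user:99` to overlap.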
· Cassandra API
o Simple get/put model
· Write model:
o Quorum write or async mode (used by email application)
o Async: send request to any node
§ That node will push the data to appropriate nodes but return to client immediately
o Quorum write:
§ Blocks until quorum is reached
o If a node is down, write to another node with a hint saying where the data should eventually be written
§ A harvester runs every 15 min, finds hints, and moves the data to the appropriate node
o At write time, you first write to a commit log (sequential)
§ After write to log it is sent to the appropriate nodes
§ Each node receiving write first records it in a local log
· Then makes update to appropriate memtables (1 for each column family)
§ Memtables are flushed to disk when:
· Out of space
· Too many keys (128 is default)
· Time duration (client provided – no cluster clock)
§ When a memtable is written out, two files go out:
· Data File
· Index File
o Key, offset pairs (points into data file)
o Bloom filter (all keys in data file)
§ When a commit log has had all its column families pushed to disk, it is deleted
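The local write path above (commit log first, then memtable, flush on too many keys) can be sketched roughly as follows. All names here (`Node`, `MEMTABLE_KEY_LIMIT`) are illustrative, not Cassandra's actual classes; only the 128-key default comes from the notes.

```python
# Sketch of a node's local write path: sequential commit-log append, then
# memtable update (one memtable per column family), then flush to an
# sstable-like (data, index) pair once the memtable holds too many keys.
from collections import defaultdict

MEMTABLE_KEY_LIMIT = 128  # default flush threshold per the notes

class Node:
    def __init__(self):
        self.commit_log = []                # sequential, append-only
        self.memtables = defaultdict(dict)  # column family -> {key: columns}
        self.sstables = []                  # flushed (sorted data, index) pairs

    def write(self, key, column_family, columns):
        self.commit_log.append((key, column_family, columns))  # log first
        self.memtables[column_family][key] = columns           # then memtable
        if len(self.memtables[column_family]) >= MEMTABLE_KEY_LIMIT:
            self.flush(column_family)

    def flush(self, column_family):
        # Data file: key-sorted rows. Index file: key -> offset into data.
        data = sorted(self.memtables[column_family].items())
        index = {k: pos for pos, (k, _) in enumerate(data)}
        self.sstables.append((data, index))
        self.memtables[column_family].clear()
```

A real node would also attach a Bloom filter over the data file's keys and delete the commit log once all its column families are flushed; those steps are omitted here.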
· Data files accumulate over time. Periodically, data files are merge-sorted into a new file (with a new index)
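The compaction step can be illustrated as a k-way merge of already-sorted data files, keeping the newest version of each key. The `(key, timestamp, value)` row shape below is an assumption for illustration, not Cassandra's on-disk format.

```python
# Sketch of compaction: merge-sort several key-sorted data files into one,
# keeping only the newest (highest-timestamp) row per key.
import heapq
from itertools import groupby

def compact(*data_files):
    """Each data file is a list of (key, timestamp, value), sorted by key."""
    merged = heapq.merge(*data_files)  # k-way merge of sorted inputs
    out = []
    for key, rows in groupby(merged, key=lambda r: r[0]):
        out.append(max(rows, key=lambda r: r[1]))  # newest timestamp wins
    return out

f1 = [("a", 1, "old"), ("b", 1, "x")]
f2 = [("a", 2, "new"), ("c", 1, "y")]
print(compact(f1, f2))  # [('a', 2, 'new'), ('b', 1, 'x'), ('c', 1, 'y')]
```

Because inputs stay sorted and the output is sorted, compaction needs only sequential disk access, consistent with the write properties below.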
· Write properties:
o No locks in critical path
o Sequential disk access only
o Behaves like a write through cache
§ If you read from the same node, you see your own writes. There doesn't appear to be any guarantee that a read sees the latest change in the failure case
o Atomicity guaranteed for a key
o Always writable
· Read Path:
o Connect to any node
o That node routes the request to the closest data copy, which serves it immediately
o If high consistency required, don’t return from local immediately
§ First send digest request to all replicas
§ If delta is found, the updates are sent to the nodes that don’t have current data (read repair)
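The digest-compare-and-repair flow above can be sketched as follows. Everything here is illustrative (replicas as dicts, md5 digests, client timestamps as version numbers); the point is only the shape of read repair: find the newest version, then push it to stale replicas.

```python
# Sketch of digest-based read repair: compare replica digests, return the
# newest version, and update any replica holding stale (or no) data.
import hashlib

def digest(version):
    """Cheap stand-in for a replica's digest response."""
    return hashlib.md5(repr(version).encode()).hexdigest()

def read_with_repair(key, replicas):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    versions = [r.get(key) for r in replicas]
    newest = max((v for v in versions if v is not None), key=lambda v: v[1])
    for replica, v in zip(replicas, versions):
        if v is None or digest(v) != digest(newest):
            replica[key] = newest  # read repair: send update to stale node
    return newest[0]

r1 = {"k": ("old", 1)}
r2 = {"k": ("new", 2)}
print(read_with_repair("k", [r1, r2]))  # -> new; r1 is repaired in passing
```

In the low-consistency path the local read would return first and this repair would happen in the background; here it is shown synchronously for clarity.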
· Replication supported via multiple consistent hash rings:
o Servers are hashed over ring
o Keys are hashed over ring
o Redundancy via walking around the ring and placing replicas:
§ On the next node (rack-position unaware), or
§ On the next node on a different rack (rack aware), or
§ On the next node in a different data center (the implication being that the ring can span data centers)
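The ring placement above can be sketched with a minimal consistent-hashing ring: servers and keys hash onto the same space, and a key's replicas are the next N distinct nodes walking clockwise. This shows only the rack-unaware case; node names and the `Ring` class are illustrative.

```python
# Minimal consistent-hash ring: nodes and keys share one hash space;
# replicas(key, n) walks clockwise from the key's position and returns
# the next n distinct nodes (rack-position-unaware placement).
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def replicas(self, key, n=3):
        i = bisect.bisect(self.points, (ring_hash(key), ""))
        out = []
        for j in range(len(self.points)):
            node = self.points[(i + j) % len(self.points)][1]
            if node not in out:
                out.append(node)
            if len(out) == n:
                break
        return out

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42"))  # three distinct nodes, clockwise from the key
```

Rack-aware or data-center-aware strategies would filter the walk by rack/DC metadata instead of taking the immediate successors.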
· Cluster membership
o Cluster membership and failure detection via gossip protocol
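A toy sketch of gossip-based membership: each round, every node merges views (member → heartbeat version) with one random peer, keeping the maximum version per member, so knowledge spreads epidemically. The structure here is illustrative, not Cassandra's actual gossip message format.

```python
# Toy anti-entropy gossip: each round, every node exchanges its membership
# view with a random peer; both keep the max heartbeat version per member.
import random

def gossip_round(views):
    """views: node name -> {member: heartbeat_version}."""
    for name in list(views):
        peer = random.choice([n for n in views if n != name])
        merged = {m: max(views[name].get(m, 0), views[peer].get(m, 0))
                  for m in set(views[name]) | set(views[peer])}
        views[name] = dict(merged)
        views[peer] = dict(merged)

# Initially each node knows only itself.
views = {"a": {"a": 3}, "b": {"b": 5}, "c": {"c": 1}}
for _ in range(10):
    gossip_round(views)
# After a few rounds, every node's view converges on all members.
```

Failure detection then hangs off these heartbeat versions: a member whose version stops advancing becomes a suspect for the accrual detector described next.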
· Accrual failure detector
o Default sets PHI to 5 in Cassandra
o Detection is 10 to 15 seconds with PHI=5
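The accrual detector outputs a continuous suspicion value phi rather than a binary up/down verdict; a node is suspected once phi crosses the threshold (5 per the notes). The exponential inter-arrival model below is a simplifying assumption for illustration; the real detector estimates the heartbeat distribution from a sampled history.

```python
# Sketch of an accrual (phi) failure detector, assuming exponentially
# distributed heartbeat inter-arrival times with a known mean.
import math

def phi(time_since_last_heartbeat, mean_interval):
    # P(a heartbeat still arrives after this much silence) = exp(-t/mean);
    # phi is -log10 of that probability, so it grows with the silence.
    p_later = math.exp(-time_since_last_heartbeat / mean_interval)
    return -math.log10(p_later)

mean = 1.0  # observed mean seconds between heartbeats
print(phi(1.0, mean) < 5)    # recent heartbeat: below threshold -> True
print(phi(12.0, mean) >= 5)  # long silence: suspected -> True
```

Under this model, phi scales linearly with silence (t / (mean * ln 10)), which is why raising PHI from 5 trades faster detection for more false suspicions.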
· UDP control messages and TCP for data messages
· Complies with Staged Event Driven Architecture (SEDA)
· Email system:
o 100m users
o 4B threads
o 25TB with 3x replication
o Uses joins across 4 tables:
§ Mailbox (user_id to thread_id mapping)
§ Msg_threads (thread to subject mapping)
§ Msg_store (thread to message mapping)
§ Info (user_id to user name mapping)
· Able to load using Hadoop at 1.5TB/hour
o Can load 25TB at network bandwidth over the Cassandra cluster