Techniques used by Dynamo:
1.Distributed Hash Tables (DHTs)
2.Consistent Hashing
3.Versioning
4.Vector Clocks
5.Quorum
6.Anti-Entropy Based Recovery
Amazon Simple Storage Service(Amazon S3)
Dynamo
Key points about Dynamo:
1.Provides a simple primary-key only interface
2.Data is partitioned and replicated using Consistent Hashing
3.Consistency is facilitated by object versioning
4.The consistency among replicas during updates is maintained by a quorum-like technique
and a decentralized replica synchronization protocol.
5.Employs a gossip based distributed failure detection and membership protocol.
6.Eventually consistent
7.Has a simple key/value interface, highly available with a clearly defined consistency window
System Assumptions and Requirements
Query Model:
1.Simple read and write operations to a data item that is uniquely identified by a key
2.State is stored as binary objects identified by unique keys (usually less than 1 MB)
3.No operations span multiple data items
4.There is no need for a relational schema
ACID Properties:
1.Data stores that provide ACID guarantees tend to have poor availability
2.Dynamo targets applications that can operate with weaker consistency if this results in higher availability
3.Dynamo does not provide any isolation guarantees and permits only single-key updates
Efficiency:
1.Services have stringent latency requirements, which are in general measured at the 99.9th percentile of the distribution
2.The tradeoffs are in performance, cost efficiency, availability, and durability guarantees
Other Assumptions:
1.Dynamo is used only by Amazon's internal services
2.Its operating environment is assumed to be non-hostile
3.There are no security-related requirements such as authentication and authorization
4.Each service uses its own distinct instance of Dynamo
Service Level Agreements(SLA)
1.Give services control over their system properties, such as durability and consistency
2.Let services make their own tradeoffs between functionality, performance, and cost-effectiveness
Design Considerations
1.Commercial systems traditionally demand synchronous replica coordination
in order to provide a strongly consistent data access interface
2.Optimistic replication
1) suitable for systems prone to server and network failures
2)Changes are allowed to propagate to replicas in the background
3)concurrent, disconnected work is tolerated
4)challenges:
a.It can lead to conflicting changes which must be detected and resolved
b.Conflict resolution introduces two problems:
i)when to resolve them
ii)who resolves them
3.Dynamo is designed to be an eventually consistent data store
4.When to resolve update conflicts
1)Many traditional data stores execute conflict resolution during writes, so writes may be rejected
if the data store cannot reach all (or a majority of) the replicas at a given time (e.g., W = N)
2)Dynamo is "always writeable"
5.Who resolves the conflicts
1)The application is usually the most suitable party to resolve conflicts
2)The data store can only apply simple policies to resolve conflicts, such as "last write wins"
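As a sketch of the "last write wins" policy mentioned above (illustrative only, not Dynamo's actual code), the store can keep a timestamp with each version and discard all but the newest:

```python
# Hypothetical "last write wins" reconciliation: each version carries a
# wall-clock timestamp, and the version with the newest timestamp survives.
def last_write_wins(versions):
    """versions: list of (timestamp, value) pairs from divergent replicas."""
    return max(versions, key=lambda v: v[0])[1]

# Three divergent versions of the same key, collected during a read:
versions = [(100.0, "cart-v1"), (105.5, "cart-v3"), (103.2, "cart-v2")]
print(last_write_wins(versions))  # prints "cart-v3"
```

Note that this policy silently drops concurrent updates, which is exactly why applications with richer semantics (e.g., merging shopping-cart versions) prefer to resolve conflicts themselves.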
Other key principles embraced in the design
Incremental scalability
Symmetry: every node in Dynamo should have the same set of responsibilities as its peers
Decentralization: the design should favor decentralized peer-to-peer techniques
Heterogeneity: the system needs to be able to exploit heterogeneity in the infrastructure it runs on.
Related Work
P2P Systems
unstructured, search by flooding:
Freenet
Gnutella
structured, employ a globally consistent routing protocol:
Pastry
Chord
storage systems built on top of these routing overlays:
OceanStore
PAST
Distributed File Systems and Databases
Replicate files for high availability at the expense of consistency:
Ficus
Coda
DFS
Farsite: achieves high availability and scalability using replication
GFS
Bayou: Distributed relational database system that allows disconnected operations and provides eventual consistency
Antiquity: wide-area distributed storage system designed to handle multiple server failures.
uses a secure log to preserve data integrity and replicates each log on multiple servers for durability
uses Byzantine fault tolerance protocols to ensure data consistency
Bigtable: a distributed storage system for managing structured data; it maintains a sparse, multi-dimensional sorted map
allows applications to access their data using multiple attributes
Dynamo target requirements
1.Dynamo is targeted mainly at applications that need an "always writeable" data store
2.Dynamo is built for an infrastructure within a single administrative domain where all nodes are trusted
3.Applications that use Dynamo do not require support for hierarchical namespaces or complex relational schemas
4.Dynamo is built for latency-sensitive applications that require at least 99.9% of
read and write operations to be performed within a few hundred milliseconds
System Architecture
core distributed systems techniques used in Dynamo:
partitioning
replication
versioning
membership
failure handling
scaling
Problem                            | Technique                                              | Advantage
Partitioning                       | Consistent hashing                                     | Incremental scalability
High availability for writes       | Vector clocks with reconciliation during reads         | Version size is decoupled from update rates
Handling temporary failures        | Sloppy quorum and hinted handoff                       | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees                        | Synchronizes divergent replicas in the background
Membership and failure detection   | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
Partitioning Algorithm
1.Dynamo's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts
2.Uses a variant of consistent hashing: in the ring, all nodes are "virtual nodes",
and a real node can be responsible for more than one virtual node
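A minimal sketch of this scheme (node names and the virtual-node count are made up for illustration): each physical node is hashed onto the ring several times, and a key belongs to the first virtual node found walking clockwise from the key's hash:

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    """Map an arbitrary string to a position on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node appears at several ring positions ("tokens"),
        # which spreads load evenly and lets a beefier node own more tokens.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key: str) -> str:
        # Walk clockwise: the first virtual node at or after hash(key),
        # wrapping around past the largest position.
        idx = bisect_right(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C"])
print(ring.owner("user:42"))  # deterministically one of "A", "B", "C"
```

Because only the tokens of a joining or leaving node move, adding node "D" here would reassign only the keys that fall on D's new tokens, which is the incremental-scalability property the table above claims.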
Replication
1.Each data item is replicated at N hosts, where N is configured "per-instance"
2.Each key is assigned to a coordinator node, which replicates the key at the N-1 clockwise successor nodes in the ring
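The replica placement above can be sketched as follows (an assumed layout, not the paper's code): starting from the key's clockwise owner, collect N distinct physical hosts, skipping extra virtual nodes of hosts already chosen:

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def preference_list(key, ring, n):
    """ring: sorted list of (position, node) tokens.
    Returns the N distinct physical nodes responsible for the key."""
    positions = [pos for pos, _ in ring]
    idx = bisect_right(positions, ring_hash(key)) % len(ring)
    result = []
    while len(result) < n:
        node = ring[idx % len(ring)][1]
        if node not in result:   # skip further virtual nodes of a chosen host
            result.append(node)
        idx += 1
    return result

# Four physical hosts, four virtual nodes each:
ring = sorted((ring_hash(f"{n}#{i}"), n) for n in "ABCD" for i in range(4))
print(preference_list("user:42", ring, n=3))  # 3 distinct replica hosts
```

The first node in this list acts as the coordinator; the rest are the clockwise successors that hold replicas.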
Data Versioning
1.Dynamo uses vector clocks in order to capture causality between different versions of the same object.
2.A vector clock is effectively a list of (node, counter) pairs
3.The list size is limited: when it exceeds a threshold, the oldest (node, counter) pairs are removed
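The causality test can be sketched with a vector clock as a plain dict of (node, counter) pairs (the node names Sx/Sy are illustrative; the paper's truncation scheme, which timestamps each pair and drops the oldest, is omitted here):

```python
# Vector clock sketch: {node: counter}. A write handled by a node bumps
# that node's counter; comparing two clocks reveals causality or conflict.
def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if clock a is causally equal to or after clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflicting(a, b):
    """True if neither clock descends from the other (concurrent writes)."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "Sx")     # first write, handled by node Sx
v2 = increment(v1, "Sx")     # second write at Sx: supersedes v1
v3 = increment(v1, "Sy")     # concurrent write at Sy, based on v1
print(descends(v2, v1))      # True: v2 happened after v1
print(conflicting(v2, v3))   # True: v2 and v3 must be reconciled
```

When a read returns conflicting versions like v2 and v3, Dynamo hands both (with their clocks) to the client for semantic reconciliation.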
Execution of get() and put() operations
Two strategies that a client can use to select a node:
1)Route its request through a generic load balancer that will select a node based on load information
2)Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes
Maintaining Consistency
1)Dynamo uses a consistency protocol similar to those used in quorum systems
2)Two key configurable values: R and W
R: the minimum number of nodes that must participate in a successful read operation
W: the minimum number of nodes that must participate in a successful write operation
Setting R + W > N yields a quorum-like system
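The reason R + W > N works is a simple overlap argument, sketched below: any set of R read replicas must intersect any set of W written replicas, so a read always reaches at least one replica holding the latest version (the (3, 2, 2) configuration is the common choice the paper mentions):

```python
# With N replicas, a write waits for W acks and a read waits for R replies.
def is_quorum(n, r, w):
    """True if any read set is guaranteed to intersect any write set."""
    return r + w > n

def min_overlap(n, r, w):
    """Worst-case number of replicas shared by a read set and a write set."""
    return max(0, r + w - n)

print(is_quorum(3, 2, 2), min_overlap(3, 2, 2))  # True 1
print(is_quorum(3, 1, 1), min_overlap(3, 1, 1))  # False 0
```

Lowering R (or W) below the quorum threshold trades this guarantee for lower read (or write) latency, which is exactly the tunable knob the two parameters give each service.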
Handling Failures: Hinted Handoff
1.all read and write operations are performed on the first N healthy nodes from the preference list
2.If the target node (A) is down, the replica is sent to a node (D) further down the preference list, with a hint
indicating that it was intended for A; upon detecting that A has recovered, D will send the replica back to A
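A toy sketch of hinted handoff (class and function names are made up; it assumes enough healthy fallback nodes exist past the preference list):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}    # key -> value, replicas this node owns
        self.hinted = {}    # intended_node_name -> {key: value}

def write(key, value, preference_list, fallback_nodes):
    """Write to each node on the preference list; if one is down, hand the
    replica to the next healthy fallback node along with a hint."""
    fallback = iter(n for n in fallback_nodes if n.up)
    for intended in preference_list:
        if intended.up:
            intended.store[key] = value
        else:
            stand_in = next(fallback)  # assumes a healthy fallback exists
            stand_in.hinted.setdefault(intended.name, {})[key] = value

def handoff(stand_in, recovered):
    """On detecting recovery, forward hinted replicas back and drop them."""
    for key, value in stand_in.hinted.pop(recovered.name, {}).items():
        recovered.store[key] = value

a, b, c, d = (Node(x) for x in "ABCD")
a.up = False
write("k1", "v1", [a, b, c], [d])
print(d.hinted)    # {'A': {'k1': 'v1'}}: D holds A's replica plus a hint
a.up = True
handoff(d, a)
print(a.store)     # {'k1': 'v1'}: the replica reached A after recovery
```

This is what makes Dynamo "always writeable": the write succeeds on N healthy nodes even while some of the intended replicas are down.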
Handling Permanent failures: Replica Synchronization
Merkle tree
1)A Merkle tree is a hash tree whose leaves are hashes of the values of individual keys
2)Parent nodes higher in the tree are hashes of their respective children
3)Minimizes the amount of data that needs to be transferred for synchronization
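A small sketch of the idea (the real system compares trees per key range, exchanging hashes top-down; here we just build the tree and locate the divergent leaf):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves):
    """leaves: list of leaf hashes. Returns all levels, leaves first,
    root level (a single hash) last; an odd node is paired with itself."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            right = prev[i + 1] if i + 1 < len(prev) else prev[i]
            nxt.append(h(prev[i] + right))
        levels.append(nxt)
    return levels

values_a = [b"v1", b"v2", b"v3", b"v4"]   # replica A's values for a range
values_b = [b"v1", b"v2", b"vX", b"v4"]   # replica B diverges on one key
tree_a = build([h(v) for v in values_a])
tree_b = build([h(v) for v in values_b])
print(tree_a[-1] == tree_b[-1])  # False: roots differ, so sync is needed
# Comparing leaf hashes pinpoints the single divergent key:
diff = [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]
print(diff)  # [2]
```

If the two roots match, the whole range is in sync and nothing is transferred; otherwise the replicas descend only into subtrees whose hashes differ, which is how the transfer stays proportional to the divergence rather than to the range size.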
Membership and Failure Detection
1.gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership
2.each node contacts a peer chosen at random every second and
the two nodes efficiently reconcile their persisted membership change histories.
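The reconciliation step can be sketched with a made-up data model: each node keeps a versioned membership view, and when two nodes gossip, each entry resolves to the higher version, so changes spread epidemically:

```python
import random

def reconcile(view_a, view_b):
    """Merge two membership views {member: version}, keeping the newer
    version of each entry."""
    merged = dict(view_a)
    for member, version in view_b.items():
        if version > merged.get(member, -1):
            merged[member] = version
    return merged

def gossip_round(views):
    """Each node contacts one randomly chosen peer; both adopt the merge."""
    for name in list(views):
        peer = random.choice([p for p in views if p != name])
        merged = reconcile(views[name], views[peer])
        views[name] = dict(merged)
        views[peer] = dict(merged)

# Three nodes, each initially knowing only about itself:
views = {"A": {"A": 1}, "B": {"B": 1}, "C": {"C": 1}}
rounds = 0
while not (views["A"] == views["B"] == views["C"]):
    gossip_round(views)
    rounds += 1
print(f"all views converged after {rounds} round(s)")
```

With random peer selection, each change reaches all nodes in a number of rounds that grows only logarithmically with cluster size, which is why a one-second gossip interval suffices.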
External Discovery
Seeds: Typically seeds are fully functional nodes in the Dynamo ring.
Failure Detection
1.A purely local notion of failure detection is entirely sufficient:
node A may consider node B failed if node B does not respond to node A's messages
2.A then periodically retries B to check for the latter's recovery
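A minimal sketch of such a local detector (class name and timeout value are assumptions, not from the paper): a node tracks when each peer last responded and treats silence beyond a timeout as failure, re-checking on later requests:

```python
class LocalFailureDetector:
    """Purely local view: A decides B is failed only from A's own traffic."""

    def __init__(self, timeout=1.0):
        self.timeout = timeout
        self.last_reply = {}   # peer -> time of its last response

    def record_reply(self, peer, now):
        self.last_reply[peer] = now

    def is_failed(self, peer, now):
        # A peer that has never replied, or has been silent longer than
        # the timeout, is considered failed (locally, and only for now).
        return now - self.last_reply.get(peer, float("-inf")) > self.timeout

fd = LocalFailureDetector(timeout=1.0)
fd.record_reply("B", now=0.0)
print(fd.is_failed("B", now=0.5))  # False: B replied recently
print(fd.is_failed("B", now=2.0))  # True: no reply within the timeout
```

Because hinted handoff routes around unresponsive peers anyway, this local, possibly stale view is good enough; no globally agreed failure state is required.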
Adding/Removing Storage Nodes
1.When node X is added to the system, it gets assigned a number of tokens that are randomly scattered on the ring