0. prelude
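After writing two posts I noticed my laziness creeping back, so I will try writing the upcoming course notes in English, as a bit of motivation to keep going... orz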
1. Overview design
Spanner is a semi-relational, globally distributed database with strong consistency and support for distributed transactions.
Figure 1 shows a Spanner deployment. Location proxies help clients locate their data in the corresponding zone. The paper describes transaction processing in the spanserver in detail.
Each spanserver is responsible for between 100 and 1000 instances of a tablet, which is like a mapping {key:string, timestamp:int64}->string. Each tablet is replicated with the Paxos protocol.
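To make the tablet mapping concrete, here is a minimal in-memory sketch of a multi-version map. The class name `Tablet` and its `write`/`read` methods are made up for illustration and say nothing about how Spanner actually stores data; replication is ignored entirely.

```python
import bisect


class Tablet:
    """Toy multi-version map: (key, timestamp) -> value. Hypothetical API."""

    def __init__(self):
        # For each key, two parallel lists kept sorted by timestamp.
        self._timestamps = {}
        self._values = {}

    def write(self, key, timestamp, value):
        ts_list = self._timestamps.setdefault(key, [])
        val_list = self._values.setdefault(key, [])
        # Insert the new version at its sorted position.
        i = bisect.bisect_right(ts_list, timestamp)
        ts_list.insert(i, timestamp)
        val_list.insert(i, value)

    def read(self, key, timestamp):
        # Return the newest version whose timestamp is <= the requested one,
        # or None if no such version exists.
        ts_list = self._timestamps.get(key, [])
        i = bisect.bisect_right(ts_list, timestamp)
        return self._values[key][i - 1] if i > 0 else None
```

Keeping every version instead of overwriting in place is what later allows reads at a past timestamp without taking locks.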
So here is the overall picture: there are a great many spanservers, each storing many tablets. For a single tablet, all the spanservers storing it form a Paxos group, with one server selected as the group leader.
2. TrueTime API
In each datacenter, there is a set of time master machines, plus a time slave daemon on every machine. Each time master is equipped with either GPS receivers with dedicated antennas or an atomic clock. Every time slave daemon periodically polls a subset of the time masters and corrects its own local clock based on Marzullo's algorithm.
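For intuition, here is a small sketch of the textbook form of Marzullo's algorithm: given one time interval per source, it finds the interval that the largest number of sources agree on. This is only the classic version and not necessarily what Spanner's daemons run; the function name `marzullo` and the tuple layout are my own choices.

```python
def marzullo(intervals):
    """Return (count, (lo, hi)): the range covered by the most source intervals."""
    # Each source interval contributes a start edge (+1) and an end edge (-1).
    # The middle field makes start edges sort before end edges at equal offsets.
    edges = []
    for lo, hi in intervals:
        edges.append((lo, 0, +1))
        edges.append((hi, 1, -1))
    edges.sort()

    best, count = 0, 0
    best_lo = best_hi = None
    for i, (offset, _, delta) in enumerate(edges):
        count += delta
        if delta == +1 and count > best:
            # New maximum overlap starts here and lasts until the next edge.
            best = count
            best_lo = offset
            best_hi = edges[i + 1][0]
    return best, (best_lo, best_hi)
```

For example, `marzullo([(8, 12), (11, 13), (10, 12)])` returns `(3, (11, 12))`, while a source whose interval overlaps nobody else simply never contributes to the best range.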
The TrueTime API has three functions:
- TT.now() returns an interval [a, b] that is guaranteed to contain the current real time.
- TT.after(t) returns true if time t has definitely passed.
- TT.before(t) returns true if time t has definitely not arrived.
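The three calls can be mimicked with a toy class that pads a local clock with a fixed uncertainty bound. This is a sketch under assumptions: the class names, the injectable `clock` parameter, and the `epsilon` default are all hypothetical, and the real system derives its uncertainty from the time masters rather than using a constant.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: local clock plus or minus an assumed uncertainty bound."""

    def __init__(self, epsilon=0.005, clock=time.time):
        self.epsilon = epsilon  # assumed uncertainty in seconds (illustrative value)
        self.clock = clock      # injectable clock, so tests can control time

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t is earlier than every possible current time.
        return t < self.now().earliest

    def before(self, t):
        # True only if t is later than every possible current time.
        return t > self.now().latest
```

Note that for a timestamp inside the current interval, both `after` and `before` return false: the API refuses to answer either way while the time is still uncertain.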
3. transaction process
Design goal: support distributed transactions and lock-free read-only transactions with strong consistency.
- A read-write transaction is a standard two-phase commit driven by the client.
  - The client first reads objects from the leaders of the relevant Paxos groups, acquiring locks as it goes, and buffers all writes in local memory.
  - It then chooses the leader of one Paxos group as the transaction coordinator and sends the writes to the corresponding Paxos groups.
  - Once the leader of a Paxos group has locked all written objects successfully, it sends the coordinator a prepare message that includes a prepare timestamp.
  - After receiving all prepare messages, the coordinator chooses an appropriate commit timestamp (I will skip the details of how the TrueTime API is used to choose a timestamp that guarantees strong consistency), logs a commit record through Paxos, goes through a commit wait stage, and then sends the commit message to the other Paxos groups and the client.
- A lock-free read-only transaction is implemented with multi-versioning.
  - Each read-only transaction is also given a read timestamp, and each spanserver returns the requested data with the largest timestamp no greater than the read timestamp.
  - A tricky point is that a transaction may need to read data across multiple Paxos groups, so it needs a consistent read timestamp. Spanner's solution is to have the client set the read timestamp to TT.now().latest.
  - Any replica can serve the read request once its safe time (a timestamp that every future write timestamp is guaranteed to exceed) is at least the read timestamp.
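The two transaction paths above can be sketched together in a tiny single-process simulation. Everything here is a simplification under stated assumptions: `FakeTrueTime`, `Replica`, and `commit` are hypothetical names, Paxos logging and locking are left out, the commit timestamp rule is reduced to "at least every prepare timestamp and TT.now().latest", and a replica's safe time is taken to be simply its highest applied commit timestamp. A replica is treated as able to serve a read when the read timestamp does not exceed its safe time.

```python
class FakeTrueTime:
    """Hypothetical TrueTime stand-in with a manually advanced clock."""

    def __init__(self, start, epsilon):
        self.t = start
        self.epsilon = epsilon

    def latest(self):
        # Upper bound of TT.now().
        return self.t + self.epsilon

    def after(self, ts):
        # True only once ts is below the lower bound of TT.now().
        return ts < self.t - self.epsilon

    def tick(self, dt=1):
        self.t += dt


class Replica:
    """Toy replica: versions per key plus a safe time."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), in timestamp order
        self.safe_time = 0  # simplification: highest applied commit timestamp

    def apply_write(self, key, commit_ts, value):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.safe_time = max(self.safe_time, commit_ts)

    def snapshot_read(self, key, read_ts):
        # Serve without locks only if this replica has caught up to read_ts.
        if read_ts > self.safe_time:
            raise RuntimeError("replica is behind the read timestamp")
        result = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts <= read_ts:
                result = value  # newest version not after read_ts
        return result


def commit(tt, prepare_timestamps, replica, writes):
    # Simplified choice: at least every prepare timestamp and TT.now().latest.
    commit_ts = max(max(prepare_timestamps), tt.latest())
    # A real coordinator would log the commit record through Paxos here.
    while not tt.after(commit_ts):  # commit wait
        tt.tick()                   # stands in for sleeping
    for key, value in writes.items():
        replica.apply_write(key, commit_ts, value)
    return commit_ts
```

With `tt = FakeTrueTime(start=100, epsilon=3)`, committing with prepare timestamps `[98, 101]` picks commit timestamp 103 and only returns once the fake clock reaches 107, so the wait is roughly twice the uncertainty. A later `snapshot_read` at a timestamp beyond the replica's safe time raises instead of returning possibly stale data, which is where a real replica would wait to catch up.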
4. takeaway
Lock-free read-only transactions can be implemented with multi-versioning, i.e. the snapshot isolation trick.