The paper by Castro and Liskov presents a practical algorithm that tolerates Byzantine faults. The algorithm works in asynchronous environments, such as the Internet, and improves response time compared with earlier algorithms. A Byzantine fault occurs in a distributed computing system when nodes may behave arbitrarily and there is imperfect information about which nodes are treacherous. For a distributed system to work, the healthy nodes must reach consensus despite the presence of treacherous nodes.
The algorithm, Practical Byzantine Fault Tolerance (PBFT), offers safety and liveness, provided that at most ⌊(n-1)/3⌋ of the n nodes are faulty at the same time. Safety means that the system behaves like a centralised system that executes operations atomically, one at a time. Liveness means that clients eventually receive replies to their requests.
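The fault bound above can be checked with a little arithmetic. The following is a minimal sketch; the function names `max_faulty` and `min_nodes` are illustrative and do not come from the paper.

```python
def max_faulty(n: int) -> int:
    """Largest number of simultaneously faulty nodes tolerated with n nodes."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1) // 3  # floor((n - 1) / 3)


def min_nodes(f: int) -> int:
    """Smallest number of nodes needed to tolerate f faulty nodes."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


if __name__ == "__main__":
    for n in (4, 6, 7, 10):
        print(f"n={n}: tolerates f={max_faulty(n)} faulty nodes")
```

Adding nodes beyond 3f+1 does not raise the number of tolerated faults until the next multiple of three is reached, so 4, 5 and 6 nodes all tolerate a single fault.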
The algorithm can be used to implement any deterministic replicated service with state and some operations. The operations can be reads or writes of the service state. Deterministic means that executing an operation with the same arguments in the same state always produces the same result.
The algorithm is a form of state machine replication, where the service is modelled as a state machine that is replicated across different nodes in a distributed system. Each state machine replica maintains the service state and implements the service operations. The replicas move through a succession of configurations called views. In a view, one replica is the primary and the others are backups.
The algorithm works roughly as follows:
- A client sends a request to invoke a service operation to the primary
- The primary multicasts the request to the backup nodes
- Nodes execute the request and send a reply to the client
- The client waits for f+1 replies from different nodes with the same result, where f is the maximum number of faulty nodes; this is the result of the operation.
If the nodes are deterministic and start from the same state, all healthy nodes agree on the outcome of executing a request despite the faulty nodes.
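The client side of these steps can be sketched as a vote count. In the Castro-Liskov paper the client accepts a result once f+1 different nodes report it, because at least one of those nodes must be healthy. The function name `accepted_result` and the `(node_id, result)` reply shape are assumptions made for illustration; real replies are authenticated messages that also carry a view number and timestamp.

```python
from collections import defaultdict


def accepted_result(replies, f):
    """Return the first result reported by f+1 distinct nodes, else None.

    replies: iterable of (node_id, result) pairs; result must be hashable.
    f: maximum number of faulty nodes.
    """
    voters = defaultdict(set)  # result -> ids of nodes that reported it
    for node_id, result in replies:
        voters[result].add(node_id)  # a set ignores duplicate replies from one node
        if len(voters[result]) >= f + 1:
            return result
    return None


if __name__ == "__main__":
    # Four nodes, f = 1; node 3 is faulty and sends a wrong result.
    replies = [(3, "wrong"), (0, "ok"), (1, "ok")]
    print(accepted_result(replies, f=1))
```

Counting distinct node ids rather than messages matters here: a faulty node could otherwise reach the threshold by repeating its own reply.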
In normal operation, when the primary receives a client request, it starts a three-phase protocol to multicast the request to the replicas. The three phases are pre-prepare, prepare and commit. The pre-prepare and prepare phases are used to totally order requests sent in the same view, even when the primary, which proposes the ordering of requests, is faulty. The prepare and commit phases are used to ensure that requests that commit are totally ordered across views. (A total order means that any two elements of the set are comparable.)
The figure above shows the operation of the algorithm in the normal case of no primary faults. Replica 0 is the primary, replica 3 is faulty, and C is the client.
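The quorum checks behind the three phases can be sketched as follows. In the Castro-Liskov paper, a replica treats a request as prepared once it has logged the pre-prepare and 2f matching prepare messages from different replicas, and as committed locally once it is prepared and has accepted 2f+1 matching commit messages. The `RequestLog` class is a hypothetical simplification: it tracks a single request and leaves out message validation, view numbers and sequence numbers.

```python
from dataclasses import dataclass, field


@dataclass
class RequestLog:
    """What one replica has logged for a single request (illustrative only)."""
    f: int                                       # maximum number of faulty nodes
    pre_prepared: bool = False                   # pre-prepare from the primary logged
    prepares: set = field(default_factory=set)   # ids of replicas with matching prepares
    commits: set = field(default_factory=set)    # ids of replicas with matching commits

    def prepared(self) -> bool:
        # Pre-prepare plus 2f matching prepares from different replicas.
        return self.pre_prepared and len(self.prepares) >= 2 * self.f

    def committed(self) -> bool:
        # Prepared plus 2f+1 matching commits, possibly including the replica's own.
        return self.prepared() and len(self.commits) >= 2 * self.f + 1


if __name__ == "__main__":
    # Four replicas, f = 1, replica 3 silent as in the figure.
    log = RequestLog(f=1)
    log.pre_prepared = True
    log.prepares.update({1, 2})
    print("prepared:", log.prepared(), "committed:", log.committed())
    log.commits.update({0, 1, 2})
    print("prepared:", log.prepared(), "committed:", log.committed())
```

With four replicas and replica 3 silent, the three healthy replicas still supply enough prepare and commit messages to pass both checks.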
The view-change protocol provides liveness by allowing the system to make progress when the primary fails. View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute. A backup is waiting for a request if it has received a valid request and has not yet executed it. If the timer expires before the request is executed, the backup starts a view change to move the system to the next view.
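The timeout rule can be sketched with a simple timer. The paper selects the primary of view v as replica v mod n; everything else here (the `Backup` class, its method names, and passing the clock in as `now`) is a hypothetical simplification. In particular, the real protocol moves to the new view only after the new primary collects view-change messages from enough replicas, whereas this sketch just advances the local view number.

```python
def primary_of(view: int, n: int) -> int:
    """Primary replica id for a view, following p = v mod n."""
    return view % n


class Backup:
    def __init__(self, n: int, timeout: float):
        self.n = n
        self.timeout = timeout
        self.view = 0
        self.pending = set()   # requests received but not yet executed
        self.deadline = None   # None means the timer is stopped

    def on_request(self, request_id, now: float) -> None:
        self.pending.add(request_id)
        if self.deadline is None:          # start the timer only if not running
            self.deadline = now + self.timeout

    def on_executed(self, request_id, now: float) -> None:
        self.pending.discard(request_id)
        # Stop the timer, or restart it if other requests are still waiting.
        self.deadline = now + self.timeout if self.pending else None

    def tick(self, now: float) -> bool:
        """Return True if the timer expired and a view change was started."""
        if self.deadline is not None and now >= self.deadline:
            self.view += 1
            self.deadline = now + self.timeout
            return True
        return False


if __name__ == "__main__":
    backup = Backup(n=4, timeout=5.0)
    backup.on_request("r1", now=0.0)
    if backup.tick(now=5.0):
        print("new view", backup.view, "primary", primary_of(backup.view, backup.n))
```

Passing the time in as an argument keeps the sketch deterministic; a real replica would read a clock or use its event loop's timers.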
To improve the response time of the algorithm, three optimisations are applied. The first avoids sending large replies. Only one node sends the full result; the other nodes send only a digest of the result.
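A minimal sketch of the digest optimisation follows. SHA-256 from Python's `hashlib` is an assumption chosen for illustration, as are the names `make_reply` and `check_result` and the tuple layout of a reply; the point is only that the client can check one full result against the digests sent by the other nodes.

```python
import hashlib


def digest(result: bytes) -> str:
    return hashlib.sha256(result).hexdigest()


def make_reply(node_id: int, designated_id: int, result: bytes):
    """Only the designated node includes the full result; others send None."""
    full = result if node_id == designated_id else None
    return (node_id, digest(result), full)


def check_result(replies, f: int):
    """Return the full result if f+1 distinct nodes vouch for its digest."""
    full_results = {}  # digest -> full result bytes
    voters = {}        # digest -> ids of nodes that sent this digest
    for node_id, d, full in replies:
        if full is not None:
            if digest(full) != d:
                continue  # full result does not match its own digest; ignore
            full_results[d] = full
        voters.setdefault(d, set()).add(node_id)
    for d, nodes in voters.items():
        if len(nodes) >= f + 1 and d in full_results:
            return full_results[d]
    return None


if __name__ == "__main__":
    result = b"balance=42"
    replies = [make_reply(i, designated_id=1, result=result) for i in range(3)]
    print(check_result(replies, f=1))
```

If the designated node is faulty and sends a wrong result, no quorum of digests matches it and `check_result` returns `None`, at which point a client would need to fetch the full result some other way.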
The second reduces the number of message delays. Nodes execute a request tentatively as soon as it is prepared, without waiting for the commit phase, and send tentative replies to the client. If 2f+1 replies match, the request is guaranteed to commit eventually, so the client does not need to retransmit the request.
The third improves the performance of read-only operations that do not modify the state. Read-only requests are executed immediately in the tentative state, and the reply is sent only after all requests reflected in the tentative state have committed. This prevents the client from observing uncommitted state.
The Istanbul BFT consensus is inspired by PBFT.