The term "Partition Tolerance" really is easy to get confused by. The paper "A Critique of the CAP Theorem" criticizes it this way: we can interpret partition tolerance as meaning “a network partition is among the faults that are assumed to be possible in the system.”
It is misleading to say that an algorithm “provides partition tolerance,” and it is better to say that an algorithm “assumes that partitions may occur.”
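As for Network Partition, it should be understood as the fault model that the CAP theorem discusses. Note that a Network Partition is not a node crash (node crashes belong to the fault model of FLP); the emphasis is on a state in which "two nodes temporarily cannot reach each other".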
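A Partition can be caused by an unreachable network, by a GC Stop The World pause that blocks for too long, or by a CPU pegged in an infinite loop; in short, all kinds of horror stories. aphyr once compiled a collection of such incidents, which is worth a look: aphyr/partitions-post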
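To make the "alive but unreachable" distinction concrete, here is a minimal sketch of a timeout-based heartbeat check on a logical clock. Everything in it (the `Observer` class, the `simulate` function, the tick counts and the timeout) is a hypothetical illustration and is not taken from the paper or from aphyr's post. The peer never crashes; it only goes silent for a while, as it would during a long GC pause, and the observer still ends up treating it as unreachable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observer:
    """Hypothetical observer that judges a peer purely by heartbeat silence."""
    timeout: int         # ticks of silence tolerated before suspecting the peer
    last_heard: int = 0  # logical time of the most recent heartbeat

    def on_heartbeat(self, now: int) -> None:
        self.last_heard = now

    def peer_unreachable(self, now: int) -> bool:
        # The observer cannot tell a crash, a network cut, and a long pause apart:
        # all it sees is that no heartbeat arrived within the timeout.
        return now - self.last_heard > self.timeout


def simulate(total_ticks: int, pause_start: int, pause_end: int, timeout: int) -> List[bool]:
    """Return, for each tick, whether the observer considers the peer unreachable.

    The peer is alive for the whole run; it is merely silent during
    [pause_start, pause_end), standing in for a GC pause or a network cut.
    """
    observer = Observer(timeout=timeout)
    view = []
    for now in range(total_ticks):
        paused = pause_start <= now < pause_end
        if not paused:
            observer.on_heartbeat(now)
        view.append(observer.peer_unreachable(now))
    return view


# Peer is silent during ticks 3..8; the observer tolerates 2 ticks of silence.
view = simulate(total_ticks=12, pause_start=3, pause_end=9, timeout=2)
```

In this run the observer flags the peer as unreachable from tick 5 through tick 8 and clears the suspicion at tick 9, when heartbeats resume. That matches the description above: the state is temporary, and the node itself never failed.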