Cloud Foundry HA with NATS and other explanations (by James Bayer)

There has been a previous post on this topic. When running on vSphere with a SAN, this is generally not an issue: we have relied on vSphere HA features for several years, and they provide a robust IaaS that immediately restarts any VM that is a single point of failure (SPOF). This is how we have been running CloudFoundry.com for about 2 years without meaningful downtime related to single points of failure. If you do not have an HA IaaS with resilient storage, then Cloud Foundry is not yet fully multi-site or HA with out-of-the-box configurations. We are working on removing all single points of failure for environments like AWS that do not have the same capabilities as vSphere.
- We have recently worked on MySQL support for the Cloud Controller DB, which means that RDS could be used when running on AWS.
- There have also been some recent discussions about removing single points of failure in NATS in the GitHub issues: https://github.com/cloudfoundry/cf-release/issues/32
- Health Manager is currently a SPOF
- UAA 1.4.x (almost deployed) will support horizontal scaling

So we are actively working on this, but we do not have all the pieces finished. We will be updating the cloudfoundry.github.com docs as we get closer.


Health Manager (HM) runs only as a single node and is therefore not HA, but a CF system should continue to operate in a degraded mode without it. In this degraded mode, the actual state of the world (application state, instance counts, etc.) and the intended state of the world will drift apart until HM becomes available again.
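
To make the idea of drift concrete, here is a minimal sketch in Python. It is not HM's actual data model or code: the `compute_drift` function, the dictionaries, and the app names are all hypothetical, chosen only to show how intended and actual instance counts can diverge while nothing is reconciling them.

```python
def compute_drift(desired, actual):
    """Return {app: delta}, where delta = desired instances - running instances.

    Hypothetical helper for illustration; apps with no drift are omitted.
    """
    drift = {}
    for app in set(desired) | set(actual):
        delta = desired.get(app, 0) - actual.get(app, 0)
        if delta != 0:
            drift[app] = delta
    return drift


# Intended state (what the CC DB says) and actual state (what is running).
desired = {"web": 3, "worker": 2}
actual = {"web": 3, "worker": 2}

# While HM is down, an instance of "web" crashes and nothing restarts it...
actual["web"] -= 1
# ...and an operator scales "worker" up, but nothing starts the new instances.
desired["worker"] = 4

drift = compute_drift(desired, actual)
# "web" is 1 instance short and "worker" is 2 instances short.
```

Once HM is available again, it would be the component responsible for closing gaps like these; until then they simply accumulate.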

NATS is still a SPOF, but we have completed a bunch of work to make sure CF components behave well when it is not around. Basically, the system should operate in a degraded mode until NATS returns. Previously, many components did not behave well when NATS went away. In a subsequent set of work, we will consider other HA options for NATS, including running a hot/warm NATS pair and using something like VRRP to migrate the IP, clustered NATS, or other options that keep NATS available. We decided that, for overall system health, planning for the complete loss of NATS was the better path for now than trying to prevent something that could conceivably still happen.
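
As a rough illustration of what "behaving well" can mean for a single component, the sketch below keeps retrying a connection instead of exiting. It does not use any real NATS client API: `connect_with_retry` and its parameters are hypothetical, the `connect` callable stands in for whatever the real client does, and the exponential backoff is an assumption made for this example rather than something the CF components are documented to do.

```python
import time


def connect_with_retry(connect, max_attempts=None, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failed attempts.

    connect:      hypothetical callable that returns a connection or raises
                  ConnectionError while the message bus is unreachable.
    max_attempts: None means retry forever (the "never give up" behavior).
    Returns the connection, or None if max_attempts was exhausted.
    """
    attempt = 0
    delay = base_delay
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            # Stay alive in degraded mode and try again later.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return None
```

A long-running component would call this with `max_attempts=None`, so that it stays up in degraded mode for as long as NATS is gone. The bounded form is mainly useful for tests.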

Examples of how this current set of NATS work affects particular CF components:

- All - should attempt to periodically reconnect to NATS instead of exiting or giving up.

- Cloud Controller - In the degraded mode when NATS is unavailable, Cloud Controller API requests that make changes to apps do not take full effect: push will fail to stage; scale should take effect in the CC DB but not be communicated to the DEAs; delete is untested (I'm not sure what happens here until I try it, but I'd expect the app to be removed from the CC DB and HM to garbage collect it later). Some read operations that need NATS, such as stats, will also not work while NATS is unavailable.

- Router - when the router cannot reach NATS, it will not expire the routes it knows about, so existing apps will continue to receive traffic.

- Health Manager - when the HM cannot connect to NATS, start/kill commands should not be evaluated until the NATS connection can be restored.

- DEAs - when the DEA cannot connect to NATS, running apps should not be stopped and the DEA should not exit.
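
The router behavior in the list above can be sketched as a route table that skips expiry while the message bus is unreachable. This is a simplified model, not the router's real implementation: the `RouteTable` class, its method names, and the TTL value are hypothetical, and refreshing timestamps on reconnect is an assumption added here so that routes are not dropped all at once before backends have had a chance to re-announce themselves.

```python
class RouteTable:
    """Hypothetical route table that stops expiring routes when NATS is down."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.routes = {}            # (host, backend) -> last-seen timestamp
        self.nats_connected = True

    def register(self, host, backend, now):
        # Called whenever a backend announces itself over the message bus.
        self.routes[(host, backend)] = now

    def lookup(self, host):
        return sorted(b for (h, b) in self.routes if h == host)

    def prune(self, now):
        """Drop stale routes, but only while NATS is reachable."""
        if not self.nats_connected:
            return 0  # degraded mode: keep routing to what we already know
        stale = [key for key, seen in self.routes.items()
                 if now - seen > self.ttl]
        for key in stale:
            del self.routes[key]
        return len(stale)

    def on_reconnect(self, now):
        # Assumption: give every known backend a full TTL to re-announce.
        self.nats_connected = True
        for key in self.routes:
            self.routes[key] = now


table = RouteTable(ttl_seconds=120)
table.register("app.example.com", "10.0.0.5:61001", now=0)
table.nats_connected = False       # NATS goes away
table.prune(now=1000)              # nothing is removed; the app stays routable
```

The trade-off in this design is that the router may keep sending traffic to a backend that has actually gone away, which seems preferable to dropping every route just because the message bus is down.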