From gearman's main page, they mention running with multiple job servers so if a job server dies, the clients can pick up a new job server. Given the statement and diagram below, it seems that the job servers do not communicate with each other.
Our question is what happens to those jobs that are queued in the job server that died? What is the best practice to have high-availability for these servers to make sure jobs aren't interrupted in a failure?
You are able to run multiple job servers and have the clients and workers connect to the first available job server they are configured with.
This way if one job server dies, clients and workers automatically fail over to another job server. You probably don't want to run too many job servers, but having two or three is a good idea for redundancy.
As far as I know there is no proper way to handle this at the moment, but as long as you run both job servers with permanent queues (using MySQL or another datastore - just don't use the same actual queue for both servers), you can simply restart the job
server and it'll load its queue from the database. This will allow all the queued tasks to be submitted to available workers, even after the server has died.
There is however no automagical way of doing this when a job server goes down, so if both the job server and the datastore goes down (a server running both locally goes down) will leave the tasks in limbo until it gets back online.
The permanent queue is only read on startup (and inserted / deleted from as tasks are submitted and completed).
I'm not sure about the complexity required to add such functionality to gearmand and whether it's actually wanted, but simple "task added, task handed out, task completed"-notifications between servers shouldn't been too complicated to handle.