I'd like to use MySQL as a job queue. Multiple machines will be producing and consuming jobs. Jobs need to be scheduled; some may run every hour, some every day, etc.
It seems fairly straightforward: for each job, have a "nextFireTime" column, and have worker machines search for the job with the nextFireTime, change the status of the record to "inProcess", and then update the nextFireTime when the job ends.
The problem comes in when a worker dies silently. It won't be able to update the nextFireTime or set the status back to "idle".
Unfortunately, jobs can be long-running, so a reaper thread that looks for jobs that have been inProcess too long isn't an option. There's no timeout value that would work.
Can anyone suggest a design pattern that would properly handle unreliable worker machines?
解决方案
Using MySQL as a job queue generally ends in pain, as it's a very poor fit for the usual goals of an RDBMS. User 'toong' already linked to http://www.engineyard.com/blog/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you/, which has a lot of interesting stuff to say about it. Unreliable workers are only one of the complications.
There are many, many systems for handling job distribution, mostly distinguished by the sophistication of their queueing and scheduling capabilities. On the simple FIFO end are things like Resque, Celery, Beanstalkd, and Gearman; on the sophisticated end are things like GridEngine, Torque/Maui, and PBS Pro. I highly recommend the new Amazon Simple Workflow system, if you can tolerate reliance on an Amazon service (I believe it does not require that you be in EC2).
To your original question: right now we're implementing a per-node supervisor that can tell if the node's jobs are still active, and sending a heartbeat back to a job monitor if so. It's a pain, but as you are discovering and will continue to discover, there are a lot of details and error cases to manage. Mostly, though, I have to encourage you to do yourself a favor by learning about this domain and build the system properly from the start.