We are thinking of evaluating the following 4 tools:
Kafka (LinkedIn): Looks promising, but it's written in Scala. We have heard mixed reports about Scala, so we're a bit concerned about its future.
Flume: Will be replaced by Flume NG, which is not yet production ready; it's not clear when it will be.
Scribe (FB): Not under active development. It will be replaced by Calligraphus, but there's no indication of when.
Storm (Twitter): Looks promising, but it's not clear whether it was designed with log processing in mind, although I can't see why it couldn't be used for that purpose.
Storm + Kafka is a very effective log processing solution. A number of users of Storm use this combination, including us at Twitter in a few instances. Kafka gives you a high throughput, reliable way to persist/replay log messages, and Storm gives you the ability to process those messages in arbitrarily complex ways.
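To make the persist/replay point concrete, here's a toy sketch of the idea: messages are appended to a durable log, and each consumer tracks its own offset, so a consumer (such as a Storm spout) can rewind and reprocess after a failure. This is an illustration of the model, not Kafka's actual API; all names are made up.

```python
# Toy model of an append-only log with consumer-tracked offsets.
# Kafka's real API differs; this only illustrates persist/replay.
class Log:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset):
        """Return all messages from the given offset onward."""
        return self.messages[offset:]

log = Log()
for line in ["login user=a", "login user=b", "error code=500"]:
    log.append(line)

offset = 0
batch = log.read(offset)   # first pass over the log
offset += len(batch)
replayed = log.read(0)     # replay from the beginning after a failure
print(len(batch), len(replayed))
```

Because the log is retained rather than consumed destructively, replaying is just reading again from an earlier offset.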
We've been developing logging and reporting solutions on top of Storm that archive and stream logging information. Further, Storm's ability to add a separate stream for the exceptional case has been key to making our logging infrastructure useful. I'd highly recommend it, whether you use Kafka, AMQP, or even direct syslog traffic at a spout. A custom Log4j appender is easy to write.
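The appender pattern is simple: intercept each log record and forward it to your transport instead of (or in addition to) a file. Here's a sketch of that idea using Python's stdlib logging.Handler rather than Log4j, since the pattern is identical and this version runs without dependencies; a real Log4j appender would hand the formatted record to a Kafka producer or AMQP client rather than an in-memory queue.

```python
import logging
import queue

class ForwardingHandler(logging.Handler):
    """Forwards formatted log records to a sink (illustrative stand-in
    for a Kafka producer or AMQP channel)."""
    def __init__(self, sink):
        super().__init__()
        self.sink = sink

    def emit(self, record):
        try:
            # Ship the formatted line to the transport.
            self.sink.put(self.format(record))
        except Exception:
            self.handleError(record)

log_queue = queue.Queue()
logger = logging.getLogger("app")
handler = ForwardingHandler(log_queue)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)

logger.warning("disk usage high")
msg = log_queue.get_nowait()
print(msg)
```

The Log4j version is the same shape: subclass the appender base class, override the method that receives each logging event, and write the event to your producer.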
It sounds like Storm by itself is not enough to do the log processing. A tool such as Kafka is needed for persistence. I guess then Storm can be used as a 'Consumer'?
Pardon my naive question but what functionality does Storm provide that's not built into Kafka? Sounds to me like we will have to maintain a cluster of machines for Kafka + a cluster of machines for Storm (plus our existing Hadoop cluster). Trying to figure out if so many layers are indeed needed.
Storm is then used as it's marketed: as a distributed stream processor. It will do whatever you need to actually process the logs (conditionally filter, extract text, and so on) in a distributed manner. Log processing is a really good use case for Storm, since there are typically a LOT of logs; it is truly a real-time big-data problem. So, instead of centralizing the logs and churning over the data using MapReduce, you're doing that work as streams within a Storm cluster, and your output is what you would normally output from your M/R algorithms.
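The per-message work described above (conditionally filter, extract fields) can be sketched as a plain function over a stream of lines. The log format, field names, and regex here are made up for illustration; in Storm this logic would sit in a bolt's execute() method, receiving one tuple per log line.

```python
import re

# Hypothetical log format: "<LEVEL> <service> <message>"
LOG_PATTERN = re.compile(r"(?P<level>\w+) (?P<service>\S+) (?P<message>.+)")

def process(lines):
    """Keep only ERROR lines and extract structured fields from each."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("level") == "ERROR":
            yield match.groupdict()

stream = [
    "INFO web request served in 12ms",
    "ERROR db connection refused",
    "ERROR web upstream timeout",
]
errors = list(process(stream))
print(errors)
```

The distribution part is what Storm adds: the same filter/extract logic runs in many bolt instances in parallel, with the topology handling partitioning and fault tolerance.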