Traditional log file aggregation is a respectable and scalable approach for supporting offline use cases like reporting or batch processing, but it has too much latency for real-time processing and tends to carry high operational complexity. Existing messaging and queuing systems, on the other hand, work well for real-time and near-real-time use cases, but handle large unconsumed queues poorly, often treating persistence as an afterthought.
This creates problems for feeding the data to offline systems like Hadoop that may only consume some sources once per hour or per day. Kafka is intended to be a single queuing platform that can support both offline and online use cases.
Even though Kafka writes messages to the broker's disk, it performs better than many message queues that keep messages in memory, because it relies on sequential file I/O. Kafka keeps a single pointer into each partition of a topic rather than per-message state: all messages before the pointer are considered consumed, and all messages after it are considered unconsumed. This eliminates most of the random I/O involved in acknowledging messages, since moving the pointer forward many messages at a time implicitly acknowledges them all.
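As a minimal sketch of this pointer model, the loop below uses the modern kafka-clients KafkaConsumer (a newer API than the one described in this section, so take it purely as an illustration); committing once per batch advances the per-partition offset and implicitly acknowledges every earlier message. The broker address, group id, and topic name are assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetPointerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "offset-pointer-demo");        // assumed group id
        props.put("enable.auto.commit", "false");             // we move the pointer ourselves
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("activity"));   // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());
                }
                // One commit advances the per-partition pointer past the whole batch,
                // implicitly acknowledging every message before it -- no per-message state.
                consumer.commitSync();
            }
        }
    }

    private static void process(String message) {
        System.out.println(message);
    }
}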
Kafka (a persistent, efficient, distributed message queue) played the role of our work/data queue. Storm filled the role of our data intake and flow mechanism, and Cassandra served as our system of record for all things storage.
Technologies
Storm
Stream processing:
Storm can be used to process a stream of new data and update databases in real-time. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
Continuous computation:
Storm can do a continuous query and stream the results to clients in real-time. An example is streaming trending topics on Twitter into browsers. The browsers will have a real-time view on what the trending topics are as they happen.
Distributed RPC:
Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
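A minimal sketch of the Distributed RPC pattern along the lines of the classic Storm DRPC example; the "exclamation" function name and the bolt are illustrative, and the backtype.storm DRPC classes are assumed to be on the classpath.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DrpcExample {
    // Illustrative bolt: appends "!" to the request argument.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            Object requestId = tuple.getValue(0);    // DRPC request id must be passed through
            String arg = tuple.getString(1);         // the function argument
            collector.emit(new Values(requestId, arg + "!"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
        builder.addBolt(new ExclaimBolt(), 3);       // parallelize the "query" across 3 tasks

        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", new Config(), builder.createLocalTopology(drpc));

        // The topology behaves like a distributed function: invoke it and get the result back.
        System.out.println(drpc.execute("exclamation", "hello"));   // prints "hello!"

        cluster.shutdown();
        drpc.shutdown();
    }
}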
Kafka
In a typical setup, there is a single Zookeeper instance and a cluster of Kafka servers. Each server is configured with a unique BrokerId that identifies the broker. When a Kafka broker starts up, it registers itself with Zookeeper. Topics and partitions are only created when a producer registers with the broker; the number of partitions is specified per topic on each server. All existing topics and partitions are also registered with Zookeeper, which stores the mapping from broker id to (host, port) and the mapping from (topic, broker id) to number of partitions. Zookeeper matches consumers to the partition data on the servers and keeps this matching up to date as the set of available consumers and brokers changes.
When a new Producer is instantiated, it consults either Zookeeper (for automatic broker discovery) or a broker list (a static list of Kafka brokers defined by brokerid, host, and port). Internally, the producer client keeps a local copy of the list of brokers and their number of partitions.
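A minimal producer sketch, assuming the Kafka 0.7-era Java producer API this section describes; zk.connect enables automatic broker discovery, while broker.list supplies a static brokerid:host:port list. The topic name and addresses are assumptions.

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.javaapi.producer.ProducerData;
import kafka.producer.ProducerConfig;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Option 1: automatic broker discovery through Zookeeper.
        props.put("zk.connect", "localhost:2181");
        // Option 2 (instead of zk.connect): a static broker list of brokerid:host:port entries.
        // props.put("broker.list", "0:localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
        // The producer keeps a local copy of the broker list and their partition counts
        // and picks a partition for each message it sends.
        producer.send(new ProducerData<String, String>("activity", "example message"));  // assumed topic
        producer.close();
    }
}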
There are two consumer APIs: the high level API (ConsumerConnector) and the low level API (SimpleConsumer). The big difference between them is that the high level API does broker discovery and consumer rebalancing and keeps track of state (i.e. offsets) in Zookeeper, while the low level API does not.
When a consumer needs to replay from specific offsets (e.g. a Storm spout, which may fail), it must keep track of that state manually, so the low level API should be used.
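For contrast, a minimal sketch of the high level API, assuming the Zookeeper-based ConsumerConnector; broker discovery, rebalancing, and offset tracking all happen inside the connector, which is exactly what a replaying consumer cannot rely on. The addresses, group id, and topic are assumptions.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");   // broker discovery + offset storage
        props.put("group.id", "example-group");             // consumers in a group share partitions

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One stream (thread) for the "activity" topic; rebalancing is handled for us.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("activity", 1));

        ConsumerIterator<byte[], byte[]> it = streams.get("activity").get(0).iterator();
        while (it.hasNext()) {
            System.out.println(new String(it.next().message()));
            // Offsets are committed to Zookeeper by the connector (auto-commit by default),
            // so there is no manual offset bookkeeping here.
        }
    }
}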
Storm transactional topologies require a source broker that:
Can treat a batch of tuples as an atomic unit
Can rewind to replay specific tuples it already played
Does not require an explicit dropping of tuples (like with Kestrel/RabbitMQ)
KafkaSpout
The KafkaSpout is a regular spout implementation that reads from a Kafka cluster. It takes three parameters: the Kafka properties used to construct the Kafka consumer, the topic to pull messages from, and a schema responsible for deserializing the messages coming from Kafka. When the Kafka consumers are created, they use Zookeeper for commit, offset, and segment consumption tracking; the KafkaSpout stores the offsets it has consumed in Zookeeper. The consumer is just one thread, since the real parallelism is already handled by Storm. The KafkaSpout has options for adjusting how it fetches messages from Kafka (buffer sizes, number of messages to fetch at a time, timeouts, etc.).
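A minimal configuration sketch, assuming the storm-kafka packaging of the spout (SpoutConfig plus ZkHosts); exact constructor parameters and tuning fields vary between versions, so treat the names here as one common variant rather than the definitive API. The topic name and parallelism hint are assumptions.

import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaSpoutExample {
    public static void main(String[] args) {
        // Broker discovery via Zookeeper; "activity" is an assumed topic name.
        ZkHosts hosts = new ZkHosts("localhost:2181");

        // The zkRoot and id tell the spout where in Zookeeper to store its consumed offsets.
        SpoutConfig spoutConfig = new SpoutConfig(hosts, "activity", "/kafkastorm", "activity-spout");

        // The scheme deserializes raw Kafka messages into Storm tuples.
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        // Fetch tuning knobs mentioned above (buffer sizes, fetch size, etc.).
        spoutConfig.fetchSizeBytes = 1024 * 1024;
        spoutConfig.bufferSizeBytes = 1024 * 1024;

        TopologyBuilder builder = new TopologyBuilder();
        // A single consumer thread per spout task; Storm supplies the parallelism.
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 4);
    }
}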
Cassandra
Cassandra is preferred to other relational database management systems for the following reasons:
- Massively Scalable Peer-to-Peer Architecture
- Linear Scale Performance
- Fault Tolerance
- Transparent Fault Detection and Recovery
- Dynamic Schema Data Modeling
- Distributed Design and Tunable Data Consistency
- Greater I/O Performance
CassandraBolt
This integrates Storm and Cassandra by providing a generic, configurable bolt implementation that writes Storm Tuple objects to a Cassandra Column Family.
CassandraBolt expects that a Cassandra hostname, port, and keyspace be set in the Storm topology configuration.
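A minimal sketch of that configuration; the key names below are illustrative assumptions rather than the library's actual constants, which differ between storm-cassandra versions.

import backtype.storm.Config;

public class CassandraBoltConfig {
    public static void main(String[] args) {
        Config config = new Config();
        // Illustrative (assumed) keys: the CassandraBolt reads its connection settings
        // from the topology configuration rather than from constructor arguments.
        config.put("cassandra.host", "localhost");
        config.put("cassandra.port", 9160);
        config.put("cassandra.keyspace", "activity_ks");
        // This config is then passed along with the topology when it is submitted,
        // so every CassandraBolt task picks up the same connection settings.
    }
}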
Architecture
1. Listener
The Listener is the entry point of this architecture. Standard data collectors include HTTP, logs, and batch upload (S3 bucket, FTP, etc.). The raw activity data has to be processed as a stream that can later be consumed by Storm.
2. Collector
This is the integration connector to the existing environment for collecting the required data.
Storm is a “distributed real-time computation system”. Kafka, for its part, is a messaging system that serves as the foundation for the activity stream and the data processing pipeline behind it. Together, Storm and Kafka can conduct stream processing at linear scale, with the assurance that every message is processed reliably and in real-time. In tandem, they can handle data velocities of tens of thousands of messages per second.
3. Consumer
All data processed by the Collector consists of a stream of data and some associated metadata. Events flow through a series of logical nodes chained together. The Consumer connects collectors, which produce events, to writers, which consume events.
In this system, the consumer looks for messages in the message broker and pulls them as soon as it sees them. In a push based system, by contrast, messages are pushed down to the consumer, which tends to be overwhelmed when its rate of consumption falls below the rate of production. A pull based system has the nicer property that the consumer simply falls behind and catches up when it can. A push based system can be mitigated with some kind of backoff protocol by which the consumer indicates it is overwhelmed, but getting the transfer rate to fully utilize the consumer is trickier than it seems.
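A small self-contained sketch of the pull model with backoff, using a BlockingQueue as a stand-in for the broker; the producer rate and backoff values are arbitrary assumptions.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PullConsumerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the broker: a queue the producer fills independently of the consumer.
        BlockingQueue<String> broker = new LinkedBlockingQueue<>();

        // Producer thread adds messages at its own rate; nothing is forced onto the consumer.
        Thread producer = new Thread(() -> {
            for (int i = 0; ; i++) {
                broker.offer("event-" + i);
                try { Thread.sleep(10); } catch (InterruptedException e) { return; }
            }
        });
        producer.setDaemon(true);
        producer.start();

        long backoffMs = 50;                                            // assumed starting backoff
        while (true) {
            String message = broker.poll(200, TimeUnit.MILLISECONDS);   // pull, with a bounded wait
            if (message == null) {
                // Nothing available: back off so an idle consumer does not spin.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 5_000);
            } else {
                backoffMs = 50;                                         // data is flowing again
                System.out.println("processed " + message);             // consumer works at its own pace
            }
        }
    }
}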
4. Writer
The standard data writer writes to applications (REST API), databases (MySQL/PostgreSQL, Cassandra, etc.), and file systems (HDFS, S3 bucket, local file system).
Here the Cassandra bolt writes Storm tuple objects to a Cassandra Column Family.
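Putting the pieces together, a minimal wiring sketch: the KafkaSpout configured as in the KafkaSpout section feeds a writer bolt. The PrintWriterBolt below is a hypothetical stand-in for the CassandraBolt so the sketch stays self-contained; topic names and parallelism hints are assumptions.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class PipelineTopology {

    // Hypothetical stand-in for the CassandraBolt: a terminal bolt that just logs each tuple.
    public static class PrintWriterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("writing: " + tuple.getString(0));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing is emitted downstream.
        }
    }

    public static void main(String[] args) {
        // Spout configuration as in the KafkaSpout section ("activity" is an assumed topic).
        SpoutConfig spoutConfig =
                new SpoutConfig(new ZkHosts("localhost:2181"), "activity", "/kafkastorm", "activity-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
        builder.setBolt("writer-bolt", new PrintWriterBolt(), 4).shuffleGrouping("kafka-spout");

        // Cassandra connection settings would be added to this Config as in the CassandraBolt section.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("kafka-to-writer", new Config(), builder.createTopology());
    }
}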