Hadoop Futures at Structure Big Data: DataStax Brisk, EMC, and MapR
GigaOm reported that MapR is
building a proprietary replacement for the Hadoop Distributed File System that is said to be three times faster than the current open source version. It comes with snapshots and no NameNode single point of failure (SPOF) and is said to be API compatible with HDFS, so it can be a drop-in replacement.
DataStax (formerly Riptano) provides support and commercial products for Cassandra, such as the recently announced management tool OpsCenter for Apache Cassandra. During the panel VP of Product, Ben Werther said that Brisk was motivated by customers like Netflix, which stores all their streaming data in Cassandra, and which are also heavy users of Hive for analytics. He noted Netflix wants to be able to have interactive response to Hive queries on ClickStream data without ETL delay. Werther told InfoQ that Brisk will ship within 45 days of the announcement, and that DataStax will be offering commercial support for the distribution. He also said that OpsCenter will allow managing multiple Data Centers, replica sets, and include basic Hadoop monitoring. Werther said that Twitter's Rainbird project for realtime counter analytics using Cassandra will soon be available in open source.
Brisk is based on Apache Hadoop 20.2 and includes:
- CassandraFS, a Hadoop-compatible file system that stores data using Cassandra.
- Input and output formats to read and write Cassandra column families for Hadoop jobs
- Hive support to read and write data stored in Cassandra and to allow transposing data, converting wide rows into multiple narrow rows.
- Updates for the JobTracker (JT) to allow restarting it when nodes fail. However, Werther clarified that Brisk does nothing to persist in-memory JobTracker state, so while Brisk will start up a new JT, running jobs wouldn't be able to complete
- Pre-built configuration: Werther told InfoQ DataStax had simplified the whole stack, with a set of predefined flags so Cassandra comes up as both real time and Hadoop nodes.
Cassandra is a BigTable inspired NoSQL database with a Dynamo architecture. It was initially created and open sourced by Facebook, but the majority of committers on the project work at DataStax including the project chairman and company co-founder Jonathan Ellis. Currently DataStax employs no Hadoop committers. Cassandra supports replication of data across multiple data centers, range scans, separate column families for storing data, and has recently added secondary indexes and the ability to replicate data to different replica groups to allow analysis to access a recent copy of data without impacting production serving requirements.
InfoQ asked Werther about the maturity of Cassandra and how it compares to HBase. Notably, Facebook which created Cassandra has been using HBase for serving large-scale messaging and for real-time analytics. He claimed that while Hadoop has a large community, HBase has a tiny one, whereas Cassandra has a larger community and more momentum. DataStax uses bug fixes, the backlog of unfixed bugs, community discussions, and downloads as metrics to compare adoption. In response to InfoQ's questioning about problems in past Cassandra deployments (such as for Digg) Werther said that the "rapidly maturing" technology was sometimes used a little early or in the wrong way, but that they have large successful customers including Cisco, Rackspace, Constant Contact, Real Networks, and Netflix. Werther also stated that Facebook had been invested in HBase so their decision to use it over Cassandra had more to do with internal decision making. He also claimed that consistency of storage is a red herring because Cassandra's support for eventual consistency is an option and one can run it with strong consistency.
When asked Werther said that Brisk is still being tested internally - there are no beta customers for the technology yet. InfoQ asked about large scale uses of Cassandra. Werther said the largest production deployment is a 700 node cluster being used by a government agency. In terms of transaction volumes, he said Twitter performs 200,000 writes per second for data ingest. In terms of data storage, he said there were clusters that are storing in the "low hundreds of Terabytes" of data.
|Current HDFS||Possible HDFS Improvements||CassandraFS|
|The Name Node (NN) is a single point of failure (SPOF)||Several approaches to amerliorating and elimating the NN SPOF are being developed.||CassandraFS stores data in Cassandra, which has no SPOF.|
|File metadata is held in RAM by a single process, limiting the total files.||Federated HDFS and use of BookKeeper are approaches to scaling HDFS that are being developed.||CassandraFS offers virtually unlimited file scalability.|
|No WAN Replication Support||No WAN Replication Support||Cassandra supports Multi-Data Center Replication|
|Supports append (in Cloudera Distribution for Hadoop 3 and Apache Hadoop 0.21)||n/a||The design allows for append, but the first release won't support it. However, HDFS Append has mostly been used to support HBase, which is an unlikely technology for those using Brisk.|
Werther noted that Brisk works with other Hadoop ecosystem code. In response to InfoQ's questions about how customers can load log data that didn't originate in Cassandra, he said customers could use Cloudera
Flume, which they have verified can be used works with Brisk. Likewise, Werther noted that the Cloudera
Hue browser-based interface for Hadoop works with Brisk.