Hadoop Futures at Structure Big Data: DataStax Brisk, EMC, and MapR

Reposted on October 13, 2011, 13:17:35
The Structure Big Data conference was filled with news and rumors of new Hadoop offerings. During a MapReduce panel, DataStax announced Brisk, a distribution of Hadoop that uses Cassandra to store data instead of the Hadoop Distributed File System. EMC published a full-page ad in the conference program that stated "05.09.11. EMC Greenplum. Apache Hadoop." And GigaOm, the conference presenter, published an article speculating that stealth-mode startup MapR Technologies "is building a proprietary version of Hadoop and is likely to launch later this year." The day after the conference, Hadoop was named the Guardian's "Innovator of the Year" and Cloudera engineer Todd Lipcon presented Hadoop in a keynote at EclipseCon.

GigaOm reported that MapR is

building a proprietary replacement for the Hadoop Distributed File System that is said to be three times faster than the current open source version. It comes with snapshots and no NameNode single point of failure (SPOF) and is said to be API compatible with HDFS, so it can be a drop-in replacement.

DataStax (formerly Riptano) provides support and commercial products for Cassandra, such as the recently announced management tool OpsCenter for Apache Cassandra. During the panel, VP of Product Ben Werther said that Brisk was motivated by customers like Netflix, which stores all of its streaming data in Cassandra and is also a heavy user of Hive for analytics. He noted that Netflix wants interactive responses to Hive queries on clickstream data without ETL delay. Werther told InfoQ that Brisk will ship within 45 days of the announcement, and that DataStax will offer commercial support for the distribution. He also said that OpsCenter will allow managing multiple data centers and replica sets, and will include basic Hadoop monitoring. Werther said that Twitter's Rainbird project for real-time counter analytics using Cassandra will soon be available as open source.

Brisk is based on Apache Hadoop 0.20.2 and includes:

  • CassandraFS, a Hadoop-compatible file system that stores data using Cassandra.
  • Input and output formats to read and write Cassandra column families for Hadoop jobs.
  • Hive support to read and write data stored in Cassandra and to allow transposing data, converting wide rows into multiple narrow rows.
  • Updates for the JobTracker (JT) to allow restarting it when nodes fail. However, Werther clarified that Brisk does nothing to persist in-memory JobTracker state, so while Brisk will start up a new JT, jobs that were running at the time of the failure would not be able to complete.
  • Pre-built configuration: Werther told InfoQ that DataStax had simplified the whole stack, with a set of predefined flags so that Cassandra nodes come up as both real-time and Hadoop nodes.
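
To make the "transposing" item above concrete, the following is a minimal Python sketch of the idea only: one wide row (a row key with many columns) is turned into several narrow rows of the form (row key, column name, value). It is not Brisk's Hive integration; the function name `transpose_wide_row` and the sample clickstream data are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Iterator, Tuple


def transpose_wide_row(
    row_key: str, columns: Dict[str, str]
) -> Iterator[Tuple[str, str, str]]:
    """Yield one narrow (row_key, column_name, value) tuple per column.

    Hypothetical helper: it only illustrates the shape of the
    transformation, not how Brisk or Hive implement it.
    """
    # Sort by column name so the output order is deterministic.
    for name in sorted(columns):
        yield (row_key, name, columns[name])


# A single wide row: one user, one column per click timestamp (made-up data).
wide_row = {
    "2011-03-23T10:00": "/home",
    "2011-03-23T10:01": "/search",
}

narrow_rows = list(transpose_wide_row("user42", wide_row))
for row in narrow_rows:
    print(row)
```

Each narrow row can then be treated like an ordinary table row by a SQL-style query engine, which is what makes the wide-row layout usable for analytics.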

Cassandra is a BigTable-inspired NoSQL database with a Dynamo architecture. It was initially created and open sourced by Facebook, but the majority of committers on the project work at DataStax, including the project chairman and company co-founder Jonathan Ellis. Currently DataStax employs no Hadoop committers. Cassandra supports replication of data across multiple data centers, range scans, and separate column families for storing data. It has recently added secondary indexes and the ability to replicate data to different replica groups, so that analysis jobs can access a recent copy of the data without impacting production serving requirements.

InfoQ asked Werther about the maturity of Cassandra and how it compares to HBase. Notably, Facebook, which created Cassandra, has been using HBase for serving large-scale messaging and for real-time analytics. He claimed that while Hadoop has a large community, HBase has a tiny one, whereas Cassandra has a larger community and more momentum. DataStax uses bug fixes, the backlog of unfixed bugs, community discussions, and downloads as metrics to compare adoption. In response to InfoQ's questions about problems in past Cassandra deployments (such as at Digg), Werther said that the "rapidly maturing" technology was sometimes used a little early or in the wrong way, but that DataStax has large, successful customers including Cisco, Rackspace, Constant Contact, Real Networks, and Netflix. Werther also stated that Facebook was already invested in HBase, so its decision to use HBase over Cassandra had more to do with internal decision making. He also claimed that consistency of storage is a red herring, because eventual consistency is an option in Cassandra and one can run it with strong consistency instead.
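
The consistency claim can be illustrated with the usual rule of thumb for Dynamo-style stores: if a write must be acknowledged by W replicas and a read must consult R replicas out of N total, reads are guaranteed to overlap the latest acknowledged write whenever R + W > N. The Python sketch below only checks that arithmetic; the function names are hypothetical and it does not call any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(
    replication_factor: int, read_replicas: int, write_replicas: int
) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


n = 3  # assumed replication factor, for illustration only
print(reads_see_latest_write(n, quorum(n), quorum(n)))  # quorum reads and writes
print(reads_see_latest_write(n, 1, 1))                  # single-replica reads and writes
```

With N = 3, quorum reads and writes (2 + 2 > 3) satisfy the condition, while reading and writing a single replica (1 + 1 ≤ 3) does not, which corresponds to the eventually consistent mode.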

When asked, Werther said that Brisk is still being tested internally and that there are no beta customers for the technology yet. InfoQ asked about large-scale uses of Cassandra. Werther said the largest production deployment is a 700-node cluster used by a government agency. In terms of transaction volume, he said Twitter performs 200,000 writes per second for data ingest. In terms of data storage, he said there are clusters storing in the "low hundreds of Terabytes" of data.

InfoQ interviewed Werther and lead engineer Jake Luciani about the architecture of Brisk and the file system implementation, CassandraFS. Some of the key differences between current Hadoop DFS (HDFS) versions, possible improvements to HDFS, and the planned CassandraFS are outlined below:

| Current HDFS | Possible HDFS Improvements | CassandraFS |
| The Name Node (NN) is a single point of failure (SPOF). | Several approaches to ameliorating and eliminating the NN SPOF are being developed. | CassandraFS stores data in Cassandra, which has no SPOF. |
| File metadata is held in RAM by a single process, limiting the total number of files. | Federated HDFS and use of BookKeeper are approaches to scaling HDFS that are being developed. | CassandraFS offers virtually unlimited file scalability. |
| No WAN replication support. | No WAN replication support. | Cassandra supports multi-data center replication. |
| Supports append (in Cloudera Distribution for Hadoop 3 and Apache Hadoop 0.21). | n/a | The design allows for append, but the first release won't support it. However, HDFS append has mostly been used to support HBase, which is an unlikely technology for those using Brisk. |

Technically, CassandraFS creates a table with paths as the key and inodes as the value, where each inode includes metadata such as the file owner, permissions, and a list of blocks. It then has another table with block IDs as the key and serialized blocks as values.
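
The two-table layout described above can be sketched in a few lines of Python. This is an in-memory toy that uses plain dictionaries in place of Cassandra tables; the class names `Inode` and `TinyFS`, the field names, and the 4-byte block size are all assumptions made for illustration and are not taken from the CassandraFS source.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_SIZE = 4  # deliberately tiny so the example splits into several blocks


@dataclass
class Inode:
    """Per-file metadata: owner, permissions, and an ordered list of block IDs."""
    owner: str
    permissions: int
    block_ids: List[str] = field(default_factory=list)


class TinyFS:
    """Toy model of the layout: path -> inode, and block ID -> block bytes."""

    def __init__(self) -> None:
        self.inodes: Dict[str, Inode] = {}   # stands in for the "inode" table
        self.blocks: Dict[str, bytes] = {}   # stands in for the "block" table

    def write(self, path: str, data: bytes,
              owner: str = "hadoop", permissions: int = 0o644) -> None:
        inode = Inode(owner, permissions)
        # Split the file into fixed-size blocks, each stored under its own ID.
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = uuid.uuid4().hex
            self.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            inode.block_ids.append(block_id)
        # Note: overwriting a path leaves the old blocks orphaned in this toy.
        self.inodes[path] = inode

    def read(self, path: str) -> bytes:
        # Look up the inode by path, then fetch its blocks in order.
        inode = self.inodes[path]
        return b"".join(self.blocks[block_id] for block_id in inode.block_ids)


fs = TinyFS()
fs.write("/logs/a.txt", b"hello world")
print(fs.read("/logs/a.txt"))
print(len(fs.inodes["/logs/a.txt"].block_ids), "blocks")
```

Because both lookups are plain key reads, neither table needs a central metadata server, which is the property the comparison table attributes to CassandraFS.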

Werther noted that Brisk works with other Hadoop ecosystem code. In response to InfoQ's questions about how customers can load log data that didn't originate in Cassandra, he said customers could use Cloudera Flume, which DataStax has verified works with Brisk. Likewise, Werther noted that the Cloudera Hue browser-based interface for Hadoop works with Brisk.
