Apache Hadoop over OpenStack Swift

Original article: http://bigdatacraft.com/archives/349 By Camuel Gilyadov, on March 1st, 2012

This is a post by Constantine Peresypkin and David Gruzman.

Lately we have been working on integrating Hadoop with OpenStack Swift. Hadoop needs no introduction, and neither does OpenStack. Swift is an object-storage system and the technology behind RackSpace Cloud Files (and quite a few other services, such as Korea Telecom object storage, Internap, etc.).

Before we go into the details of Hadoop-Swift integration, let's get some relevant background:
  1. Hadoop already has integration with Amazon S3 and is widely used to crunch S3-stored data: http://wiki.apache.org/hadoop/AmazonS3
  2. The NameNode is a known SPOF (single point of failure) in Hadoop. If it can be avoided, so much the better.
  3. The current S3 integration stages all data as temporary files on local disk before uploading to S3. That's because S3 needs to know the content length in advance; Content-Length is one of the required headers.
  4. The current S3 integration also suffers from a 5 GB maximum file size limitation, which is slightly annoying.
  5. Hadoop requires seek support, which means that HTTP Range support is required if it runs over an object store. S3 supports it.
  6. Append support is optional for Hadoop, but it is required for HBase. S3 doesn't support appends at all, so the native integration cannot run HBase over S3.
  7. While OpenStack Swift is compatible with S3, RackSpace CloudFiles is not, because RackSpace CloudFiles disables the S3-compatibility layer in Swift. This prevents existing Swift users from integrating with Hadoop.
  8. The only information available on the Internet about Hadoop-Swift integration is that it should work using Apache Whirr. But to the best of our knowledge that is relevant only to rolling out a Block FileSystem on top of Swift, not a Native FileSystem. In other words, we haven't found any solution for processing data that is already stored in RackSpace CloudFiles without costly re-importing.
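The seek and staging points above boil down to two pieces of HTTP plumbing: a filesystem driver maps Hadoop's seek()+read() to a Range request, and S3-style uploads need Content-Length known up front. A minimal sketch of the header handling involved; the helper names are illustrative, not from any real SDK:

```python
# Sketch of the HTTP header handling a Hadoop-over-object-storage driver
# needs: Range headers for seek() reads. Helper names are illustrative.

def range_header(offset, length=None):
    """Build an HTTP Range header for a read starting at `offset`.

    Hadoop's seek(pos) followed by read(n) maps to bytes=pos-(pos+n-1);
    an open-ended read to end-of-object maps to bytes=pos-.
    """
    if length is None:
        return {"Range": "bytes=%d-" % offset}
    return {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}

def parse_content_range(value):
    """Parse 'bytes start-end/total' from a 206 Partial Content response."""
    unit, _, spec = value.partition(" ")
    if unit != "bytes":
        raise ValueError("unexpected range unit: %r" % unit)
    rng, _, total = spec.partition("/")
    start, _, end = rng.partition("-")
    return int(start), int(end), int(total)
```

For example, `range_header(4096, 1024)` yields `{"Range": "bytes=4096-5119"}`. The Content-Length requirement is the mirror image on the write path: since the header must be sent before the body, the stock S3 integration buffers the whole file on local disk first, which is exactly what the work described below avoids.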
So, armed with the above information, let's examine what we've got here:
  1. In general, we instrumented Hadoop to run over Swift natively, without resorting to the S3-compatibility layer. This means it works with CloudFiles, which lacks the S3-compatibility layer.
  2. The CloudFiles client SDK doesn't support HTTP Range functionality, so we hacked it to allow using HTTP Range; this is a must for Hadoop to work.
  3. We removed the need for the NameNode, in a similar way to how it is removed in the Amazon S3 integration.
  4. As opposed to the S3 implementation, we avoided staging files on local disk on the way to and from CloudFiles/Swift. In other words, data is streamed directly between compute-node RAM and CloudFiles/Swift.
  5. The data is still processed remotely, though: extensive data shipping takes place between the compute nodes and CloudFiles/Swift. As frequent readers of this blog know, we are working on technology that will allow running code snippets directly in Swift; see http://www.zerovm.com for more details. As a next step, we plan to perform predicate-pushdown optimization to process most of the data completely locally inside a ZeroVM-enabled object-storage system.
  6. Support for native Swift large objects is also planned (something that's absent in Amazon S3).
  7. We are also working on append support for Swift (this could easily be done through Swift's large-object support, which uses versioning), so even HBase will work on top of Swift; this is not the case with S3 today.
  8. As with the Hadoop S3 integration, storing BigData in native format on Swift provides options for multi-site replication and CDN.
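The append idea in point 7 builds on how Swift's dynamic large objects work: the data lives in separate segment objects under a common name prefix, and a manifest object concatenates, on GET, every segment whose name matches that prefix, in lexicographic name order. Appending then amounts to uploading one more segment. An in-memory sketch of that mechanism, under the assumption of dynamic-large-object semantics (names and container layout are illustrative):

```python
# In-memory model of emulating append via Swift's dynamic large objects:
# data lives in numbered segment objects under a common prefix, and a GET
# on the manifest returns all matching segments concatenated in
# lexicographic name order. Names and layout here are illustrative.

class LargeObject:
    def __init__(self, container, name):
        self.prefix = "%s_segments/%s/" % (container, name)
        self.segments = {}  # segment object name -> bytes

    def append(self, data):
        # Zero-padded index keeps lexicographic order == upload order,
        # so the manifest always reads segments back in append order.
        seg_name = self.prefix + "%08d" % len(self.segments)
        self.segments[seg_name] = data

    def read(self):
        # What a GET on the manifest object returns.
        return b"".join(self.segments[k] for k in sorted(self.segments))

obj = LargeObject("logs", "hbase-wal")
obj.append(b"rec1;")
obj.append(b"rec2;")
```

The appeal for HBase is that each write-ahead-log append becomes a plain segment PUT with no read-modify-write of existing data, which is exactly the operation object stores handle well.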
