Apache Hadoop 3.0.0 GA Release Is Finally Out


Apache Hadoop 3.0.0

Apache Hadoop 3.0.0 incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).

This release is generally available (GA), meaning that it represents a point of API stability and quality that we consider production-ready.

Overview

Users are encouraged to read the full set of release notes. This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8

All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.


Support for erasure coding in HDFS

Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.
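
The overhead figures above follow from simple arithmetic: a Reed-Solomon (10,4) layout stores 10 data units plus 4 parity units for every 10 units of user data. The sketch below only reproduces that calculation; the function names are illustrative and are not part of any Hadoop API.

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw units stored per unit of user data for a (data, parity) code."""
    return (data_units + parity_units) / data_units


def replication_overhead(replicas: int) -> float:
    """N-way replication stores every block N times."""
    return float(replicas)


rs_10_4 = storage_overhead(10, 4)
triple = replication_overhead(3)
print(f"RS(10,4): {rs_10_4:.1f}x, 3-way replication: {triple:.1f}x")
```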

Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.

More details are available in the HDFS Erasure Coding documentation.


YARN Timeline Service v.2

We are introducing an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and provide feedback and suggestions for making it a ready replacement for Timeline Service v.1.x. It should be used only in a test capacity.

More details are available in the YARN Timeline Service v.2 documentation.


Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.

Incompatible changes are documented in the release notes, with related discussion on HADOOP-9902.

More details are available in the Unix Shell Guide documentation. Power users will also be pleased by the Unix Shell API documentation, which describes much of the new functionality, particularly related to extensibility.


Shaded client jars

The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

HADOOP-11804 adds new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. This avoids leaking Hadoop’s dependencies onto the application’s classpath.


Support for Opportunistic Containers and Distributed Scheduling

A notion of ExecutionType has been introduced, whereby applications can now request containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at a NodeManager (NM) even if no resources are available at the moment of scheduling. In that case, the containers are queued at the NM and wait for resources to become available before they start. Opportunistic containers have lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.
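
To make the queue-and-preempt behaviour concrete, here is a toy model of a single NodeManager. It is only a sketch of the idea described above, not YARN code: the class and field names are hypothetical, resources are reduced to one abstract number, and the choice of which opportunistic container to preempt is arbitrary.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    demand: int                  # abstract resource units (assumption)
    opportunistic: bool = False  # False means Guaranteed


@dataclass
class ToyNodeManager:
    capacity: int
    running: list = field(default_factory=list)
    queued: deque = field(default_factory=deque)
    preempted: list = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(c.demand for c in self.running)

    def submit(self, c: Container) -> None:
        if c.opportunistic:
            # Opportunistic: start if there is room, otherwise wait in the queue.
            if self.free() >= c.demand:
                self.running.append(c)
            else:
                self.queued.append(c)
            return
        # Guaranteed: preempt opportunistic containers until the request fits.
        while self.free() < c.demand:
            victim = next((r for r in self.running if r.opportunistic), None)
            if victim is None:
                raise RuntimeError("guaranteed container does not fit")
            self.running.remove(victim)
            self.preempted.append(victim)
        self.running.append(c)

    def finish(self, name: str) -> None:
        self.running = [c for c in self.running if c.name != name]
        # Start queued opportunistic containers in FIFO order while they fit.
        while self.queued and self.free() >= self.queued[0].demand:
            self.running.append(self.queued.popleft())


nm = ToyNodeManager(capacity=4)
nm.submit(Container("g1", 3))
nm.submit(Container("o1", 1, opportunistic=True))  # fits, starts immediately
nm.submit(Container("o2", 2, opportunistic=True))  # no room, queued at the NM
nm.submit(Container("g2", 1))                      # preempts o1 to make room
nm.finish("g1")                                    # frees room, o2 starts
print([c.name for c in nm.running], [c.name for c in nm.preempted])
```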

Opportunistic containers are by default allocated by the central RM, but support has also been added to allow opportunistic containers to be allocated by a distributed scheduler which is implemented as an AMRMProtocol interceptor.

Please see documentation for more details.


MapReduce task-level native optimization

MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.

See the release notes for MAPREDUCE-2841 for more detail.


Support for more than 2 NameNodes

The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.
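
The failure counts quoted above can be checked with a small calculation. This sketch assumes that the JournalNodes need a strict majority to accept edits and that at least one NameNode must survive; the helper names are illustrative only.

```python
def journal_failures_tolerated(journal_nodes: int) -> int:
    """A majority quorum of N nodes survives floor((N - 1) / 2) failures."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(namenodes: int) -> int:
    """At least one NameNode must remain to act as the active one."""
    return namenodes - 1


def cluster_failures_tolerated(namenodes: int, journal_nodes: int) -> int:
    """The weaker of the two tiers bounds what the cluster can survive."""
    return min(namenode_failures_tolerated(namenodes),
               journal_failures_tolerated(journal_nodes))


print(cluster_failures_tolerated(2, 3))  # original HA layout
print(cluster_failures_tolerated(3, 5))  # layout from the example above
```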

The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.


Default ports of multiple services have been changed

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our documentation has been updated appropriately, but see the release notes for HDFS-9427 and HADOOP-12811 for a list of port changes.
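
As a quick illustration, the check below tests whether a port falls inside the ephemeral range quoted above. The two port pairs (NameNode and DataNode web UIs) are the commonly cited examples of this change and are included here as an assumption; the release notes for HDFS-9427 and HADOOP-12811 remain the authoritative list.

```python
# 32768-61000 inclusive, the Linux ephemeral range quoted in the text.
EPHEMERAL_RANGE = range(32768, 61001)


def in_ephemeral_range(port: int) -> bool:
    return port in EPHEMERAL_RANGE


# (old default, new default); illustrative pairs, verify against the release notes.
port_changes = {
    "NameNode HTTP UI": (50070, 9870),
    "DataNode HTTP UI": (50075, 9864),
}

for service, (old, new) in port_changes.items():
    print(f"{service}: {old} ephemeral={in_ephemeral_range(old)}, "
          f"{new} ephemeral={in_ephemeral_range(new)}")
```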


Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors

Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.


Intra-datanode balancer

A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with skew between DataNodes, not skew within a single DataNode.

This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.
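
One way to picture intra-DataNode skew is to compare each disk's utilization with the node-wide average. The metric below is an illustrative sketch loosely inspired by the "volume data density" idea in the disk balancer documentation; the function name, input shape, and sign convention are assumptions, and this is not the balancer's actual code.

```python
def volume_data_density(disks: dict) -> dict:
    """Map disk name -> (ideal used ratio - actual used ratio).

    disks maps a name to (used_bytes, capacity_bytes). A positive value
    means the disk is emptier than the node average, a negative value
    means it is fuller.
    """
    total_used = sum(used for used, _ in disks.values())
    total_capacity = sum(capacity for _, capacity in disks.values())
    ideal = total_used / total_capacity
    return {name: ideal - used / capacity
            for name, (used, capacity) in disks.items()}


# A node whose second disk was just replaced with an empty one.
density = volume_data_density({"disk1": (600, 1000), "disk2": (0, 1000)})
print(density)
```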


Reworked daemon and task heap management

A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.

HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated. See the full release notes of HADOOP-10950 for more detail.

MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change. See the full release notes of MAPREDUCE-5785 for more details.
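
The inference described for MAPREDUCE-5785 can be sketched as follows. This is a simplified model, not Hadoop's implementation: it assumes a scaling ratio of 0.8 (believed to be the default of mapreduce.job.heap.memory-mb.ratio) and a 1024 MB fallback when neither value is set, so check both against the release notes before relying on them.

```python
import re

HEAP_RATIO = 0.8   # assumed default of mapreduce.job.heap.memory-mb.ratio
DEFAULT_MB = 1024  # assumed fallback when nothing is configured


def parse_xmx_mb(java_opts: str):
    """Return the -Xmx value in MB, or None if java_opts sets no -Xmx."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    bytes_per_unit = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}[unit]
    return value * bytes_per_unit // 1024**2


def resolve_task_memory(memory_mb: int, java_opts: str, ratio: float = HEAP_RATIO):
    """Return (container_mb, heap_mb); memory_mb <= 0 means 'not set'."""
    heap_mb = parse_xmx_mb(java_opts)
    if memory_mb > 0 and heap_mb is not None:
        return memory_mb, heap_mb                 # both given: left untouched
    if memory_mb > 0:
        return memory_mb, int(memory_mb * ratio)  # derive heap from container
    if heap_mb is not None:
        return round(heap_mb / ratio), heap_mb    # derive container from heap
    return DEFAULT_MB, int(DEFAULT_MB * ratio)    # assumption: nothing set


print(resolve_task_memory(2048, ""))
print(resolve_task_memory(-1, "-Xmx1024m"))
print(resolve_task_memory(4096, "-Xmx3000m"))
```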


S3Guard: Consistency and Metadata Caching for the S3A filesystem client

HADOOP-13345 adds an optional feature to the S3A client of Amazon S3 storage: the ability to use a DynamoDB table as a fast and consistent store of file and directory metadata.

See S3Guard for more details.


HDFS Router-Based Federation

HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients.
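
Conceptually, a server-side mount table maps a path in the federated view to a namespace and a path inside it. The sketch below shows one plausible resolution rule, longest-prefix matching; the nameservice names (ns1, ns2) and mount points are hypothetical, and this is not the Router's actual resolver.

```python
def resolve(mount_table: dict, path: str):
    """Resolve a federated path to (nameservice, remote_path).

    mount_table maps a mount point to (nameservice, target_path); the
    longest mount point that is a path prefix of `path` wins.
    """
    best = None
    for mount in mount_table:
        prefix = mount.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    nameservice, target = mount_table[best]
    suffix = path[len(best.rstrip("/")):]
    return nameservice, target.rstrip("/") + suffix


table = {
    "/data": ("ns1", "/data"),
    "/data/logs": ("ns2", "/logs"),
    "/tmp": ("ns1", "/tmp"),
}
print(resolve(table, "/data/logs/app.log"))
print(resolve(table, "/data/users/alice"))
```

Because the table lives in the routing layer, a client only ever sees the federated path and needs no mount configuration of its own.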

See HDFS-10467 and the HDFS Router-based Federation documentation for more details.


API-based configuration of Capacity Scheduler queue configuration

The OrgQueue extension to the capacity scheduler provides a programmatic way to change configurations by providing a REST API that users can call to modify queue configurations. This enables automation of queue configuration management by administrators in the queue’s administer_queue ACL.

See YARN-5734 and the Capacity Scheduler documentation for more information.


YARN Resource Types

The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.
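
A generalized resource model can be pictured as a vector of named, countable quantities. The sketch below checks whether a request fits on a node and subtracts it if so; the resource names ("memory-mb", "vcores", "gpu") and the helper functions are hypothetical and do not mirror YARN's classes.

```python
def fits(request: dict, available: dict) -> bool:
    """True if every requested resource type is available in sufficient quantity."""
    return all(available.get(name, 0) >= amount
               for name, amount in request.items())


def allocate(request: dict, available: dict):
    """Return the remaining resources after allocation, or None if it does not fit."""
    if not fits(request, available):
        return None
    return {name: amount - request.get(name, 0)
            for name, amount in available.items()}


node = {"memory-mb": 8192, "vcores": 8, "gpu": 2}
print(allocate({"memory-mb": 4096, "vcores": 2, "gpu": 1}, node))
print(allocate({"memory-mb": 1024, "vcores": 1, "gpu": 4}, node))
```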

See YARN-3926 and the YARN resource model documentation for more information.


Getting Started

The Hadoop documentation includes the information you need to get started using Hadoop. Begin with the Single Node Setup, which shows you how to set up a single-node Hadoop installation. Then move on to the Cluster Setup to learn how to set up a multi-node Hadoop installation.
