Hadoop 3.0.0 New features

A comparison of Hadoop 3.x with Hadoop 2.x: version differences and new features.

Minimum required Java version increased from Java 7 to Java 8

All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.

Support for erasure coding in HDFS

Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.

In other words, the space occupied by three full replicas can be shrunk to roughly 1.4x of the logical data size, while retaining fault tolerance comparable to 3x replication.
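The arithmetic behind those figures is easy to check; a minimal sketch:

```python
# Raw-storage overhead of Reed-Solomon (k, m) erasure coding versus
# n-way replication: RS(k, m) stores k data blocks plus m parity
# blocks, so the raw/logical storage ratio is (k + m) / k.
def rs_overhead(k, m):
    return (k + m) / k

def replication_overhead(n):
    return float(n)

print(f"RS(10,4) overhead:       {rs_overhead(10, 4):.1f}x")        # 1.4x
print(f"3x replication overhead: {replication_overhead(3):.1f}x")   # 3.0x
# RS(10,4) also survives the loss of any 4 of its 14 blocks, while
# 3x replication survives the loss of only 2 copies of a block.
```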

Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.

The performance impact of traditional erasure coding, especially on IOPS and latency, is still considerable, so for now its applicable scenarios are mainly limited to cold data such as archives and cloud storage.

YARN Timeline Service v.2

We are introducing an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Resource Types

The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.

By extending YARN's resource types, resources other than CPU and memory can be supported, such as the increasingly popular GPU computing, FPGAs, software licenses, and local storage.
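As an illustration, a custom countable resource type such as GPUs would be declared in resource-types.xml; the property name below is taken from the YARN resource model docs, and the value is only an example:

```xml
<!-- resource-types.xml sketch: declare a countable GPU resource
     (illustrative; verify the resource name against your release) -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```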

Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.

Parts of the scripts have been rewritten and a number of bugs fixed, though the individual changes are not spelled out here.

Shaded client jars

The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

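Hadoop 3 addresses this with shaded client artifacts that relocate the transitive dependencies instead of leaking them. A sketch of the Maven usage (the version number is illustrative):

```xml
<!-- pom.xml sketch: compile against the shaded API jar and ship the
     shaded runtime jar; Hadoop's transitive deps stay off your classpath -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.0.0</version>
  <scope>runtime</scope>
</dependency>
```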

Support for Opportunistic Containers and Distributed Scheduling.

A notion of ExecutionType has been introduced, whereby Applications can now request for containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at an NM even if there are no resources available at the moment of scheduling. In such a case, these containers will be queued at the NM, waiting for resources to be available for it to start. Opportunistic containers are of lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.

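A sketch of the yarn-site.xml settings involved (property names as documented for YARN opportunistic containers; verify them against your release):

```xml
<!-- yarn-site.xml sketch: enable centralized allocation of
     opportunistic containers and bound the per-NM queue length -->
<configuration>
  <property>
    <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <!-- how many opportunistic containers an NM may queue while waiting -->
    <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
    <value>10</value>
  </property>
</configuration>
```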

MapReduce task-level native optimization
MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.

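Jobs opt in through a collector-class setting; a sketch (the delegator class name is the one documented for the NativeTask module, so verify it against your release):

```xml
<!-- mapred-site.xml / per-job sketch: route map output collection
     through the native (C++) implementation -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```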

Support for more than 2 NameNodes.
The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.

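A sketch of the corresponding hdfs-site.xml fragment (the nameservice "mycluster" and the NameNode ids are illustrative names):

```xml
<!-- hdfs-site.xml sketch: a nameservice with three NameNodes,
     one active plus two standbys -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
```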

Default ports of multiple services have been changed.
Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS.

Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.

Intra-datanode balancer
A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.

This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.

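A typical command sequence, sketched from the HDFS Commands Guide (the hostname and plan path are illustrative):

```shell
# Generate a plan describing how to move data between the disks of one DataNode
hdfs diskbalancer -plan datanode1.example.com
# Execute the plan (the JSON path is printed by the -plan step)
hdfs diskbalancer -execute /system/diskbalancer/<timestamp>/datanode1.example.com.plan.json
# Check progress of the move
hdfs diskbalancer -query datanode1.example.com
```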

Reworked daemon and task heap management
A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.

Heap sizes can now be auto-tuned based on the host's memory, and the HADOOP_HEAPSIZE variable has been deprecated. MapReduce task heap configuration has also been simplified: the desired heap size no longer needs to be specified separately as a Java option in the task configuration.

S3Guard: Consistency and Metadata Caching for the S3A filesystem client

An optional feature has been added to the S3A client for Amazon S3 storage: the ability to use a DynamoDB table as a fast, consistent store for file and directory metadata.
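A sketch of how this might be switched on in core-site.xml (the class name is the one documented for S3Guard; treat it as an assumption to verify against your release):

```xml
<!-- core-site.xml sketch: back S3A metadata with a DynamoDB table -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
```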

HDFS Router-Based Federation
HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients.

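Mount entries are managed on the Router side; a hedged CLI sketch (the paths and the nameservice id "ns1" are illustrative):

```shell
# Publish a mount point through the Router's server-side mount table
hdfs dfsrouteradmin -add /data ns1 /data
# List the federated mount table
hdfs dfsrouteradmin -ls
```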

API-based configuration of Capacity Scheduler queue configuration
The OrgQueue extension to the capacity scheduler provides a programmatic way to change configurations by providing a REST API that users can call to modify queue configurations. This enables automation of queue configuration management by administrators in the queue’s administer_queue ACL.

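A hedged sketch of such a call (the host, queue name, and property are illustrative; check the exact payload schema in your release's Capacity Scheduler documentation):

```shell
# Update a queue property through the RM's scheduler-conf REST endpoint
curl -X PUT -H "Content-Type: application/xml" \
  http://rm-host:8088/ws/v1/cluster/scheduler-conf \
  -d '<sched-conf>
        <update-queue>
          <queue-name>root.default</queue-name>
          <params>
            <entry><key>maximum-capacity</key><value>75</value></entry>
          </params>
        </update-queue>
      </sched-conf>'
```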
