Introducing the Newly Redesigned Apache HAWQ [Author: Chang Lei]


Background

To build the modern, dynamic applications of today directly from data in the Apache Hadoop® Distributed File System (HDFS), many customers require SQL, specifically advanced Hadoop Native SQL analytics with enterprise-grade processing capabilities. The following are typical requirements from customers who use Hadoop for their daily analytical work:


  • Interactive queries: Interactive query response time is key to promoting data exploration, rapid prototyping, and other exploratory tasks.

  • Scalability: Linear scalability is necessary to keep up with the explosive growth of data volumes.

  • Consistency: Transaction support matters because it takes the responsibility for ensuring consistency away from application developers.

  • Extensibility: The system should support popular data formats, such as plain text and SequenceFile, and be easy to extend to new formats.

  • Standard compliance: Preserving the investment in existing BI and visualization tools requires compliance with standards, for example SQL and various other BI standards.

  • Productivity: It should be possible to make use of existing skill sets in the organization without losing productivity.


The Hadoop software stack could not satisfy all of the above requirements. In the database community, one of the recent trends is the wide adoption of Massively Parallel Processing (MPP) systems. MPP databases share many features with Hadoop, such as a scalable shared-nothing architecture. They particularly excel at fast query processing (often orders of magnitude faster than other solutions), automatic query optimization, and industry-standard SQL interfaces that are compatible with BI tools, which makes them easy to use for data analysts.


Motivated by the limitations of Hadoop and the performance advantages of MPP databases, we developed HAWQ, a SQL query engine that combines the merits of Pivotal Greenplum Database and Hadoop distributed storage. The HAWQ architecture differs from Greenplum Database because of the characteristics of the underlying distributed storage and the requirement of tight integration with the Hadoop stack.


In HAWQ 1.0, significant architectural innovations were made in various components, for example, distributed transactions, fault tolerance, the unified catalog service, and metadata dispatch. After the 1.0 release, customer requirements began trending toward greater elasticity to enhance the capability of running HAWQ on public cloud, private cloud, or shared physical cluster environments. The HAWQ team developed the next generation of HAWQ to address these cloud trends and continue the architectural innovation of HAWQ 1.0. Apache HAWQ (incubating) is based on HAWQ 2.0 Beta.


Architecture

The high-level architecture of Apache HAWQ is shown in Figure 1. In a typical deployment, each slave node has a physical HAWQ segment, an HDFS DataNode, and a YARN NodeManager installed. The masters for HAWQ, HDFS, and YARN run on separate nodes.


In this new release, HAWQ is tightly integrated with YARN for query resource management. HAWQ caches containers from YARN in a resource pool and then manages those resources locally, leveraging its own finer-grained resource management for users and groups. To execute a query, HAWQ allocates a set of virtual segments according to the query's cost, the resource queue definitions, data locality, and the current resource usage in the system. The query is then dispatched to the corresponding physical hosts (which can be a subset of the nodes in the cluster). The HAWQ resource enforcer on each node monitors and controls the real-time resources used by the query to avoid resource usage violations.
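
To illustrate the container-caching idea, the sketch below models a resource pool that buffers YARN containers so query requests can be served with low latency. It is a minimal, hypothetical Python sketch; the names (ResourcePool, Container, yarn_client) are assumptions made for the example, not HAWQ's actual API.

    # Illustrative sketch of buffering YARN containers in a local pool so
    # individual queries avoid YARN's allocation latency. All names here
    # are hypothetical, not HAWQ's actual implementation.

    class Container:
        def __init__(self, host, memory_mb, vcores):
            self.host = host
            self.memory_mb = memory_mb
            self.vcores = vcores

    class ResourcePool:
        def __init__(self, yarn_client):
            self.yarn_client = yarn_client   # client for the YARN ResourceManager
            self.cached = []                 # containers held by the HAWQ RM

        def refill(self, target):
            """Ask YARN for more containers when the pool runs low."""
            missing = target - len(self.cached)
            if missing > 0:
                self.cached.extend(self.yarn_client.allocate(missing))

        def grab(self, needed):
            """Serve a query's request from the local pool (low latency)."""
            granted, self.cached = self.cached[:needed], self.cached[needed:]
            return granted

        def give_back(self, containers):
            """Return containers to the pool; idle ones may go back to YARN."""
            self.cached.extend(containers)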



Figure 1: Apache HAWQ high-level architecture


In this new architecture, nodes can be added dynamically without data redistribution. Expansion is an operation that takes only seconds. When a new node is added, it automatically contacts the HAWQ master, which immediately makes the resources on that node available for future queries.


Component Highlights

Figure 2 shows more details on the internal components.

  • Master: Accepts user connections, parses and optimizes queries, dispatches them to segments, and coordinates query execution.

  • Resource Manager (RM): Obtains resources from YARN and answers resource requests from queries. Resources are buffered in the HAWQ RM to support low-latency queries.

  • Fault Tolerance Service (FTS): Responsible for detecting segment failures and accepting heartbeats from segments.

  • Dispatcher/Coordinator: Dispatches query plans to a selected subset of segments and coordinates query execution. The dispatcher and the RM are the main components for dynamic scheduling.

  • Catalog service: Stores all metadata, such as UDF/UDT information, relation information, security information and data file locations.

  • YARN: Hadoop resource management framework.

  • Segment: Performs data processing. There is only one physical segment on each host, which makes expansion, shrinking and resource management much easier. Each segment can start many Query Executors (QEs) for each query slice. This makes the single segment act like multiple virtual segments, which enables HAWQ 2.0 to better utilize all available resources.


Figure 2: Apache HAWQ components


Query Execution Flow

After a query is accepted on the master, it is parsed and analyzed. The analysis produces a query tree, which is handed to the query optimizer to generate a query plan. Based on the cost information of the plan, resources are requested from the HAWQ Resource Manager. The Resource Enforcer on each node ensures that the allocated resources are used properly and kills offending queries that use more memory than their allocated quota.
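
As a rough illustration of this flow, the following Python sketch strings the steps together. The functions and the toy cost model are placeholders invented for the example; they are not HAWQ's internal interfaces.

    # Simplified, hypothetical sketch of the master-side query flow:
    # parse -> analyze -> optimize -> acquire resources -> dispatch.

    from dataclasses import dataclass

    @dataclass
    class Plan:
        estimated_cost: float          # used to size the resource request

    def parse_and_analyze(sql: str):
        return {"sql": sql}            # stand-in for the real query tree

    def optimize(query_tree) -> Plan:
        return Plan(estimated_cost=float(len(query_tree["sql"])))  # toy cost model

    def run_query(sql, resource_manager, dispatcher):
        plan = optimize(parse_and_analyze(sql))

        # The resource request is sized from the plan's estimated cost and
        # answered by the HAWQ RM from its buffered containers.
        allocation = resource_manager.request(plan.estimated_cost)

        # The plan and its allocation are dispatched to segments together;
        # query executors enforce the quota during execution.
        results = dispatcher.dispatch(plan, allocation)

        resource_manager.release(allocation)   # resources go back to the RM
        return results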


Elasticity

In a classical MPP execution engine, the number of segments (the carriers of compute resources) used to run a query is fixed. This is true regardless of whether the query is big (e.g., involving large table scans and multi-table joins, probably requiring a lot of CPU, memory, and I/O resources) or small (e.g., a small table scan requiring few resources). Fixing the parallelism of query execution in this way simplifies the whole system architecture, but it does not use system resources efficiently, and in a cloud or shared cluster environment this becomes very apparent.


To address this issue, the elasticity feature was introduced. It is based on virtual segments, which are allocated on demand based on query cost. More specifically, a large number of virtual segments are started for big queries, while fewer virtual segments are started for small queries. Query execution is fully pipelined across the query executors started on all virtual segments.
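
As a rough sketch of the on-demand sizing, the snippet below derives a virtual segment count from an estimated query cost. The scaling rule and constants are invented for illustration; as described above, the real decision also considers resource queues, data locality, and current cluster load.

    # Hypothetical cost-based sizing of virtual segments. The constants are
    # arbitrary; HAWQ's real policy also weighs queues, locality, and load.

    def virtual_segments_for(query_cost, min_vsegs=1, max_vsegs=64):
        """Big queries get many virtual segments, small queries get few."""
        vsegs = max(min_vsegs, int(query_cost) // 1_000_000)  # toy scaling rule
        return min(vsegs, max_vsegs)                          # cap by cluster capacity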


Resource Management

Resource management is a key part of supporting elasticity. HAWQ supports three levels of resource management: global level, internal query level, and operator level; a simplified sketch of how the three levels fit together follows the list below.


  • Global level: responsible for obtaining resources from, and returning them to, global resource managers, for example, YARN or Mesos.

  • Internal query level: responsible for allocating acquired resources to different sessions and queries according to resource queue definitions.

  • Operator level: responsible for allocating resources across query operators in a query plan.
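
As referenced above, here is a minimal Python sketch of how the three levels could fit together. It only models the structure described in the list; the class and function names (ResourceQueue, acquire_from_global, split_across_operators) are hypothetical, not HAWQ's actual interfaces.

    # Hypothetical three-level model: global (YARN/Mesos), internal query
    # (resource queues), and operator level. Names and policies are
    # illustrative only, not taken from HAWQ's code base.

    class ResourceQueue:
        """Internal query level: a queue owns a share of cluster memory."""
        def __init__(self, name, memory_limit_mb):
            self.name = name
            self.memory_limit_mb = memory_limit_mb
            self.in_use_mb = 0

        def allocate_to_query(self, requested_mb):
            granted = min(requested_mb, self.memory_limit_mb - self.in_use_mb)
            self.in_use_mb += granted
            return granted

    def acquire_from_global(yarn_client, needed_mb):
        """Global level: obtain resources from YARN (returned when idle)."""
        return yarn_client.allocate_memory(needed_mb)

    def split_across_operators(query_quota_mb, operators):
        """Operator level: divide a query's quota across plan operators."""
        share = query_quota_mb // max(len(operators), 1)
        return {op: share for op in operators}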


Resources in the HAWQ Resource Manager (RM) are managed within resource queues. When a query is submitted, after query parsing, semantic analysis, and query planning, resources are obtained from the HAWQ RM, which in turn acquires them from YARN through libYARN, a C/C++ client in HAWQ for communicating with YARN. The resource allocation for each query is sent to the segments together with the plan. Consequently, each Query Executor knows the resource quota for the current query and enforces the resource consumption during query execution. When query execution finishes (or is cancelled), the resources are returned to the HAWQ RM.
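
The quota enforcement mentioned above can also be sketched briefly. This is a toy Python illustration with invented names; in HAWQ the enforcement is performed by the resource enforcer and the query executors themselves.

    # Toy illustration of per-query memory quota enforcement on a segment.
    # Class and method names are invented for the example.

    class QuotaExceeded(Exception):
        pass

    class ExecutorQuota:
        def __init__(self, quota_mb):
            self.quota_mb = quota_mb   # quota shipped along with the query plan
            self.used_mb = 0

        def charge(self, mb):
            """Called whenever a query executor allocates more memory."""
            self.used_mb += mb
            if self.used_mb > self.quota_mb:
                # In HAWQ the offending query is cancelled; here we just raise.
                raise QuotaExceeded(f"{self.used_mb} MB exceeds quota {self.quota_mb} MB")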


Further Reading

Read the SIGMOD paper: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop


