The original text is here.
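Design Goals
Platform Independence
The system architecture must support any hardware and operating system, such as RHEL, SLES, Ubuntu, and Windows. Platform-dependent components (for example, components that deal with yum, rpm packages, or Debian packages) should be pluggable behind well-defined interfaces.
Pluggable Components
The architecture must not assume specific tools and technologies. Any specific tool or technology must be encapsulated by a pluggable component. The architecture will focus on the pluggability of Puppet and related components, namely provisioning, the configuration tool of choice, and the database used to persist state. The goal is not to support replacements for Puppet immediately, but to make the architecture easy to extend so that such replacements are possible in the future.
The pluggability goal does not cover standardization of inter-component protocols or of the interfaces used to work with third-party components.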
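Version Management and Upgrade
Ambari components running on the various nodes must support multiple versions of the protocols, so that components can be upgraded independently. Upgrading any Ambari component must not affect the cluster state.
Extensibility
The design should support easy addition of new services, components, and APIs. Extensibility also implies ease in modifying any configuration or provisioning steps for the Hadoop stack. In addition, the possibility of supporting Hadoop stacks other than HDP needs to be taken into account.
Failure Recovery
The system must be able to recover from any component failure to a consistent state. After recovery, the system should try to complete pending operations. If certain errors are unrecoverable, the failure should still leave the system in a consistent state.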
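Security
Security implies:
1) authentication and role-based authorization of Ambari users (both API and Web UI),
2) installation, management, and monitoring of the Hadoop stack secured via Kerberos, and
3) authentication and encryption of over-the-wire communication between Ambari components (e.g., Ambari master-agent communication).
Error Trace
The design strives to simplify the process of tracing failures. Failures should be presented to the user with sufficient detail and pointers for analysis.
Near Real-Time Feedback for Operations
For operations that take a while to complete, the system needs to give the user timely (near real-time) feedback on the progress of the currently running tasks, the percentage of the operation completed, a link to the operation log, and so on. In previous versions of Ambari this was not available because of Puppet's master-agent architecture and its status reporting mechanism.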
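Terminology
Service
A service refers to a service in the Hadoop stack, such as HDFS, HBase, or Pig. A service may have multiple components (for example, HDFS has NameNode, Secondary NameNode, DataNode, etc.). A service can also be just a client library (for example, Pig does not have any daemon services, only a client library).
Component
A service consists of one or more components. For example, HDFS has 3 components: NameNode, DataNode, and Secondary NameNode. Components may be optional. A component may span multiple nodes (for example, DataNode instances on multiple nodes).
Node/Host
Node refers to a machine in the cluster. Node and host are used interchangeably in this document.
Node-Component
Node-component refers to an instance of a component on a particular node. For example, a particular DataNode instance on a particular node is a node-component.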
Operation
An operation refers to a set of changes or actions performed on a cluster to satisfy a user request or to achieve a desirable state change in the cluster. For example,
starting of a service is an operation and running a smoke test is an operation. If a
user requests to add a new service to the cluster and that includes running a smoke
test as well, then the entire set of actions to meet the user request will constitute an
operation. An operation can consist of multiple “actions” that are ordered (see
below).
Task
A task is the unit of work that is sent to a node to execute. A task is the work that a node has to carry out as part of an action. For example, an “action” can consist of
installing a datanode on Node n1 and installing a datanode and a secondary
namenode on Node n2. In this case, the “task” for n1 will be to install a datanode
and the “tasks” for n2 will be to install both a datanode and a secondary namenode.
Stage
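A stage refers to a set of tasks that are required to complete an operation and are independent of each other; all tasks in the same stage can be run on different nodes in parallel.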
Action
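An “action” consists of one or more tasks on a machine or a group of machines. Each action is tracked by an action ID, and nodes report status at least at the granularity of the action. An action can be thought of as a stage under execution. Unless otherwise specified, a stage and an action correspond one-to-one in this document. An action ID maps one-to-one to a (request ID, stage ID) pair.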
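The one-to-one mapping between an action ID and a (request ID, stage ID) pair can be illustrated with a minimal sketch. The `"<request>-<stage>"` string format and both function names are assumptions made for illustration; the document does not specify how Ambari encodes action IDs.

```python
def make_action_id(request_id, stage_id):
    """Build an action ID from a request ID and a stage ID.

    Hypothetical encoding: both IDs are assumed to be non-negative integers.
    """
    return "%d-%d" % (request_id, stage_id)


def parse_action_id(action_id):
    """Recover the (request ID, stage ID) pair from an action ID."""
    request_part, stage_part = action_id.split("-")
    return int(request_part), int(stage_part)
```

Because the encoding is reversible, a status report that carries only the action ID is enough to locate both the originating request and the stage it belongs to.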
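Stage Plan
An operation typically consists of multiple tasks on various machines, and these tasks usually have dependencies that require them to run in a particular order. Some tasks must complete before others can be scheduled. Therefore, the tasks required for an operation are divided into stages, such that every task in a stage must complete before the next stage starts, while all tasks in the same stage can be scheduled in parallel on different nodes.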
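One way to derive such a stage plan is to layer the task dependency graph: each stage holds every task whose prerequisites are already satisfied by earlier stages. The sketch below is an illustration under that assumption, not the actual Stage Planner; the function name `plan_stages` and the task names in the usage note are hypothetical.

```python
def plan_stages(dependencies):
    """Group tasks into ordered stages.

    `dependencies` maps each task name to the set of tasks that must finish
    before it can run. Tasks within one stage do not depend on each other.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    # Tasks that appear only as prerequisites have no dependencies themselves.
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    stages = []
    done = set()
    while remaining:
        # A task is ready once all of its prerequisites are done.
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic dependency among: %s" % sorted(remaining))
        stages.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return stages
```

For example, if `hbase-master-start` depends on `hdfs-start` and `zookeeper-start`, the first stage contains the two prerequisites (which may run in parallel) and the second stage contains `hbase-master-start`.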
Manifest
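Manifest refers to the definition of a task that is sent to a node for execution. The manifest must completely define the task and must be serializable. A manifest can also be persisted to disk for recovery or record keeping.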
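A minimal sketch of a serializable manifest follows. The field names (`request_id`, `stage_id`, `host`, `role`, `command`, `params`) and the choice of JSON are assumptions for illustration; the document only requires that a manifest fully define the task and be serializable.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Manifest:
    """Hypothetical task definition sent to a node."""
    request_id: int
    stage_id: int
    host: str
    role: str
    command: str
    params: dict = field(default_factory=dict)

    def to_json(self):
        # sort_keys gives a stable text form, convenient for logs and diffs.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))
```

The same JSON text can be sent over the wire and written to disk, so a persisted manifest can be reloaded with `Manifest.from_json` during recovery.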
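Role
A role maps to either a component (for example, NameNode or DataNode) or an action (for example, HDFS rebalancing, HBase smoke test, or other admin commands).
Ambari Architecture
The following figure describes the high-level architecture of Ambari:
The following figure describes the design of the Ambari server:
The following figure describes the design of the Ambari agent: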
Use cases
In this section we cover a few basic use cases and describe how the request is
served by the system at a high level and how components interact.
1. Add service:
Add a new service to an existing cluster. Let’s take a specific
example of adding the HBase service to an existing cluster, which is already running
HDFS. HBase master and slaves will be added to a subset of the existing nodes (no
additional nodes). It will go through the following steps:
• The request lands on the server via API and a request id is generated and
attached to the request. A handler for this API is invoked in the Coordinator.
• The API Handler implements the steps needed to add a new service to an
existing cluster. In this case the steps would be: install all the service
components along with the required prerequisites, start the prerequisites and
the service components in a specific order, and re-configure Nagios server to
add new service monitoring as well.
• The Coordinator will look up the prerequisites for HBase in the Dependency
Tracker. The Dependency Tracker will return the HDFS and ZooKeeper
components. The Coordinator will also look up the dependencies for the Nagios
server, which will return the HBase client. The Dependency Tracker will also
return the required state of the required components. Thus, the Coordinator
will know the entire set of components and their required states. The
Coordinator will set the desired state for all these components in the DB.
• During the previous step, the Coordinator may also determine that it requires
the user’s input to select nodes for ZooKeeper and may return an appropriate
response. This depends on the API semantics.
• The Coordinator will then pass the list of components and their desired states
to the Stage Planner. Stage Planner will return the staged sequence of
operations that need to be performed on each node where these components
are to be installed/started/modified. The Stage Planner will also generate the
manifest (tasks for each individual nodes for each stage) using the Manifest
Generator.
• Coordinator will pass this ordered list of stages to the Action Manager with
the corresponding request id.
• Action Manager will update the state of each node-component, in the FSM,
which will reflect that an operation is in progress. Note that the FSM for each
affected node-component is updated. In this step, the FSM may detect an
invalid event and throw failure, which will abort the operation and all actions
will be marked as failed with an error.
• Action Manager will create an action id for each operation and add it to the
Plan. The Action Manager will pick the first Stage from the plan and add each
action in this Stage to the queue for each of the affected nodes. The next
Stage will be picked when the first Stage completes. Action Manager will also
start a timer for scheduled actions.
• Heartbeat Handler will receive the response for the actions and notify the
Action Manager. Action Manager will send an event to the FSM to update the
state. In case of a timeout, the action will be scheduled again or marked as
failed. Once all nodes for an action have reached completion (response
received or final timeout) the action is considered completed. Once all actions
for a Stage are completed the Stage is considered completed.
• Action completion is also recorded in the database.
• The Action Manager proceeds to the next Stage and repeats.
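The stage-by-stage loop described in the last few steps can be sketched as follows. This is a simplified, synchronous illustration: the real Action Manager is driven by heartbeats and timers, and the names `run_plan`, `execute`, and `max_attempts` are hypothetical. A return value of `None` from `execute` stands in for a timeout.

```python
def run_plan(stages, execute, max_attempts=2):
    """Run stages in order; stop at the first stage that fails.

    `stages` is a list of stages, each a list of (host, task) pairs.
    `execute(host, task)` returns True on success, False on failure,
    or None to signal a timeout. Timed-out tasks are scheduled again
    until `max_attempts` (assumed >= 1) is reached.
    """
    completed = []
    for stage_id, stage in enumerate(stages):
        for host, task in stage:
            result = None
            for _ in range(max_attempts):
                result = execute(host, task)
                if result is not None:  # got a response, stop retrying
                    break
            if result is not True:
                return {"status": "FAILED", "failed_stage": stage_id,
                        "completed_stages": completed}
        # Every task in this stage reached completion; move to the next one.
        completed.append(stage_id)
    return {"status": "COMPLETED", "failed_stage": None,
            "completed_stages": completed}
```

A later stage is never started unless every task of the previous stage succeeded, which mirrors the ordering guarantee of the stage plan.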
2. Run Smoke Test:
The cluster is already running with the HDFS and HBase services
active, and the user wants to run the HBase smoke test.
• The request lands on the server via API and a request id is generated and
attached to the request. A handler for this API is invoked in the Coordinator.
• The API Handler invokes the Dependency Tracker and finds that the HBase
service should be up and running for this. The API Handler returns an error if
the HBase live status is not running. The Handler also determines where the
smoke test should be run. The Dependency Tracker will expose a method that
tells the Coordinator which client component is required on the host where the
smoke test should be run. The Stage Planner is invoked to generate a plan for
the smoke test. In this case the plan is simple: one stage with a single
node-component.
• The rest of the steps are similar to the previous use case. In this case the FSM
will be specifically for the hbase-smoketest role.
3. Reconfigure Service
• The cluster is already active with the services and their dependencies.
• A request to save the configuration lands on the server and the new config
snapshot is stored on the server. This request should also have information
on what service(s) and/or roles-hosts the config change affects. This implies
that the persistence layer needs to track, on a per service/component/node
basis, the last config version that was changed and that affects the
object in question.
• A user can make the above call multiple times to save multiple checkpoints.
• At some point, the user decides to deploy the new configs. In this scenario,
the user will send a request to deploy the new configs to the required
service/component/node-component.
• When this request lands on the server, via the coordinator handler, it will
result in updating of the desired config version of the object in question to the
version specified or the latest config based on the API specs.
• The Coordinator will execute re-configure in two steps. First, it will generate a
stage plan to stop the services in question. Then it will generate a stage plan
to start the services with new configurations. The Coordinator will append the
two plans, stop followed by start, and will pass the result to the Action Manager. The rest of
the steps proceed as in previous use cases.
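The save-then-deploy flow for configurations can be sketched as below. The class `ConfigStore`, its methods, and the 1-based version numbering are assumptions for illustration; the document only states that snapshots are stored, that several checkpoints may be saved, and that deploying sets the desired config version to the specified or the latest version.

```python
class ConfigStore:
    """Hypothetical persistence layer for config snapshots."""

    def __init__(self):
        self.snapshots = []        # snapshot with version v is at index v - 1
        self.desired_version = {}  # service/component/node name -> version

    def save(self, config):
        """Store a snapshot and return its version number."""
        self.snapshots.append(dict(config))
        return len(self.snapshots)

    def deploy(self, target, version=None):
        """Set the desired config version for `target`.

        Uses the latest saved version when `version` is None.
        """
        if version is None:
            version = len(self.snapshots)
        if not 1 <= version <= len(self.snapshots):
            raise ValueError("unknown config version: %r" % version)
        self.desired_version[target] = version
        return self.snapshots[version - 1]
```

Saving never changes what is deployed; only `deploy` moves the desired version, after which the stop and start plans would be generated.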
4. Ambari master crashed and restarted
• Assuming the Ambari master dies, there are multiple scenarios that need to be addressed.
• Assumptions:
o Desired state has been persisted in the DB
o All pending actions are defined in the database.
o Live status is tracked in the DB. However, the agent cannot currently send back
live status, so recovery will be based on DB state.
o All actions are idempotent.
• Actions required
o Ambari Master
▪ The action layer needs to re-queue all pending or previously
queued (but incomplete) stages back to the Heartbeat
Handler and allow the agents to re-execute the steps in
question.
▪ The Ambari server should not accept any new actions/requests
until it has recovered completely.*
▪ If there is any discrepancy between actual live state and the
desired state (i.e., live state is STARTED or STARTING but
desired state is STOPPED), there should be a trigger to the
stage planner/transaction manager to initiate the actions to get
state to the desired state. However, if the live state is
STOP_FAILED and desired_state is STOPPED, then this is a
no-op for recovery as the Ambari server does not initiate
actions itself in such scenarios. For the latter case, the onus is
on the admin/user to re-initiate a stop for the nodes that failed
[this may change depending on API specifications and product
decision].
o Ambari Agent
▪ The agent should not die if the master suddenly disappears. It
should continue to poll at regular intervals and recover as
needed when the master comes back up.
▪ The Ambari agent should keep all the necessary information it
planned to send to the master in case of a connection failure
and re-send the information after the master comes back up.
▪ It may need to re-register if it was previously in the process of
registering.
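The state comparison the master performs after a restart can be sketched as follows. The function name, the input shape, and the `NO_AUTO_RECOVERY` set are hypothetical; the only rules taken from the text are that a mismatch such as STARTED versus STOPPED triggers corrective actions, while STOP_FAILED versus STOPPED is a no-op.

```python
# Live states for which the server does not initiate actions by itself.
# The document names only STOP_FAILED; treating it as a set is an assumption.
NO_AUTO_RECOVERY = {"STOP_FAILED"}


def recovery_targets(node_components):
    """Return (name, desired_state) pairs that need corrective actions.

    `node_components` maps a node-component name to a
    (live_state, desired_state) tuple read from the DB.
    """
    targets = []
    for name, (live, desired) in sorted(node_components.items()):
        if live == desired:
            continue  # already consistent
        if live in NO_AUTO_RECOVERY:
            continue  # left to the admin/user to re-initiate
        targets.append((name, desired))
    return targets
```

The returned pairs would be handed to the stage planner so that it can produce a plan bringing each node-component to its desired state.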
5. Decommission a subset of Datanodes.
• Decommissioning will be implemented as an action performed on the hadoop-admin
role in the Puppet layer. hadoop-admin will be a new role defined to cover
hadoop admin operations. This role will require hadoop-client to be
installed and admin privileges available.
• The manifest for the decommission operation will consist of hadoop-admin
role with certain parameters like a list of datanodes and a flag specifying the
action.
• Coordinator will identify a node which is designated to run hadoop-admin
actions. This information must be available in cluster state.
• The decommission action will follow the state machine for node-roles that are
actions. This will be considered successful when the decommissioning has
been started at the namenode i.e. when the admin command to datanode
succeeds.
• Information whether nodes have been decommissioned or not should be
queried separately. This information is available in namenode UI. Ambari
should have another API to query decommissioning or decommissioned
datanodes.
• Ambari does not keep track whether a datanode is decommissioning or
decommissioned. To get rid of decommissioned nodes from Ambari, a
separate API call should be made to Ambari to stop/uninstall those
datanodes.
Agent:
Agents will heartbeat to the master every few seconds and will receive
commands from the master in the heartbeat responses. Heartbeat responses
will be the only way for master to send a command to the agent. The command
will be queued in the action queue, which will be picked up by the action
executioner. Action executioner will pick the right tool (Puppet, Python, etc) for
execution depending on the command type and action type. Thus the actions
sent in the heartbeat response will be processed asynchronously at the agent.
The action executioner will put the response or progress messages on the
message queue. The agent will send everything on the message queue to the
master in the next heartbeat.
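The agent's two queues and the heartbeat exchange can be sketched as below. This is an in-process illustration, not the Ambari agent: the class `Agent`, the command dictionary keys (`type`, `action_id`), and the `send_to_master` callable are all hypothetical, and the real agent runs the executioner asynchronously rather than through an explicit `run_pending` call.

```python
from collections import deque


class Agent:
    def __init__(self, executors):
        self.executors = executors    # command type -> callable(command)
        self.action_queue = deque()   # commands received from the master
        self.message_queue = deque()  # responses waiting for the next heartbeat

    def heartbeat(self, send_to_master):
        """Send queued messages; queue the commands in the response."""
        outgoing = list(self.message_queue)
        # May raise if the master is unreachable; in that case the
        # messages stay queued and are re-sent on a later heartbeat.
        commands = send_to_master(outgoing)
        for _ in outgoing:
            self.message_queue.popleft()
        self.action_queue.extend(commands)

    def run_pending(self):
        """Execute queued commands with the tool matching their type."""
        while self.action_queue:
            command = self.action_queue.popleft()
            executor = self.executors[command["type"]]
            self.message_queue.append(
                {"action_id": command["action_id"], "result": executor(command)})
```

Because messages are removed only after the master has accepted them, a connection failure does not lose responses, which matches the agent behavior required in the master-crash use case.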
Recovery
There are two main choices in terms of master recovery.
• Recovery based on actions: In this case every action is persisted and upon
a restart, the master checks for pending actions and reschedules them. The
state of the cluster is also persisted in the database and the master rebuilds the
state machines upon a restart. There could be a race condition in which some
actions complete but the master crashes before recording their
completion. This will be handled by ensuring that actions are idempotent; the
master will re-schedule all actions that are not marked as completed or failed
in the DB. Persisted actions can be viewed as redo logs, which is a very
common approach.
• An alternative to the above approach is recovery based on the desired state. In
this approach the master persists a desired state and upon restart it matches
desired state with the live state and tries to restore the cluster to the desired
state.
The approach based on actions gels better with our overall design because we pre-plan
an operation and persist it; therefore, even if we persist the desired state, the
recovery will require a re-planning of actions. Also, the desired state approach
doesn’t capture certain operations that don’t change the state of the cluster from
Ambari’s point of view, e.g., smoke tests or HDFS re-balancing. Persisted actions can
be viewed as redo logs.
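Action-based recovery amounts to replaying every persisted action that has no terminal status. A minimal sketch, assuming each persisted action is a record with hypothetical `action_id` and `status` fields and that `COMPLETED` and `FAILED` are the terminal statuses:

```python
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def actions_to_reschedule(persisted_actions):
    """Return the IDs of actions the master must re-queue after a restart.

    An action that actually finished but whose completion was never
    recorded is returned too; running it again is safe only because
    actions are required to be idempotent.
    """
    return [action["action_id"] for action in persisted_actions
            if action["status"] not in TERMINAL_STATUSES]
```

The order of the persisted records is preserved, so actions are replayed in the order in which they were originally planned.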
Agent recovery requires just an agent restart because the agent is stateless. An
agent failure will be detected by the master by heartbeat loss beyond a threshold of
time.
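Detecting a failed agent by heartbeat loss reduces to comparing each host's last heartbeat time against a threshold. In the sketch below the function name, the use of plain numeric timestamps in seconds, and the threshold value in the usage note are assumptions.

```python
def lost_agents(last_heartbeat, now, threshold_seconds):
    """Return hosts whose last heartbeat is older than the threshold.

    `last_heartbeat` maps a host name to the time (in seconds) at which
    the master last heard from that host's agent.
    """
    return sorted(host for host, seen in last_heartbeat.items()
                  if now - seen > threshold_seconds)
```

For instance, with a threshold of 30 seconds, a host last seen 45 seconds ago is reported as lost while one seen 5 seconds ago is not.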
TBD: How is agent restarted?
Security
API Authentication Methods
One way is to use HTTP Basic Authentication. This means the client would pass the
credentials (or a base64 encoded version of them) to the server on every request. The
server would have to validate credentials on every call. It makes it easy for CLI
clients to use the API this way. Also the session state is not stored on the server.
Another way is to utilize HTTP Session and Cookies. Once the server validates the
credentials, the server generates and stores the session ID in its store. The session
ID is sent back to the client to be stored as a cookie. The client can pass this
session ID on subsequent calls. This is more efficient in that credentials need not be
validated on every API request. VMware’s vCloud REST API 1.0 used to support
this. However, they dropped this in later versions due to certain security concerns:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025884
Also, this approach forces the server to store session state which is a big no no for
RESTfulness (IF we care).
We can support both approaches.
For CLI, the first method is preferred. For browser based applications, the latter
method avoids making redundant credential validations.
Both authentication methods (and any API call that requires an authenticated user)
should be done over HTTPS. User credentials or the session token should never be
transferred over HTTP.
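The two methods can be contrasted with a short sketch. The header construction follows the standard HTTP Basic scheme (base64 of `username:password`); the `SessionStore` class, its method names, and the in-memory dictionary are hypothetical and only illustrate that the session approach makes the server hold state.

```python
import base64
import secrets


def basic_auth_header(username, password):
    """Build the Authorization header a client sends on every request."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


class SessionStore:
    """Hypothetical server-side store mapping session IDs to users."""

    def __init__(self):
        self._sessions = {}

    def create(self, username):
        # Called once, after the credentials have been validated.
        session_id = secrets.token_urlsafe(32)
        self._sessions[session_id] = username
        return session_id

    def lookup(self, session_id):
        # Returns None for unknown session IDs.
        return self._sessions.get(session_id)
```

Note that base64 is an encoding, not encryption, so the header is only as safe as the transport; this is why both methods are required to run over HTTPS.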
Bootstrap
In HMC (released as 1.0) bootstrapping was a very integrated process of
installing/configuring the hosts. In Ambari 1.1, bootstrapping will be a helper routine
to install the Ambari agents on the hosts. In HMC 1.0, the bootstrap process figures
out the information about a host and adds that to the database. This will now be
done as a part of Ambari agent registering to the server. The bootstrap process will
just SSH onto the hosts and install the Ambari agent. No database changes will be
done as part of doing a bootstrap.