Responsibilities of a Hadoop Administrator (Translation)

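I recently read an English document about Hadoop; it is actually an excerpt from a book. I thought it was very good: it lays out the basic responsibilities of a Hadoop administrator. If you work with Hadoop day to day, have a look and ask yourself whether you already have the knowledge and skills it describes.
Translation:
Responsibilities of a Hadoop administrator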

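As interest in deriving insight from big data grows, organizations are actively planning and building their big data teams. To start working with their data, they need a good, solid infrastructure.
Once that infrastructure is in place, they need controls and system policies to maintain, manage, and troubleshoot the cluster.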

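Market demand for Hadoop administrators keeps growing, because their work (setting up and maintaining Hadoop clusters) is what makes data analysis possible in the first place.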

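A Hadoop administrator needs strong system operations skills across networking, operating systems, and storage, along with solid knowledge of computer hardware and how it operates in a complex network.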

Apache Hadoop runs mainly on Linux, so good Linux skills such as monitoring, troubleshooting, configuration, and security management are a must.

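Setting up nodes for a cluster involves a lot of repetitive work, so a Hadoop administrator should bring these servers up quickly and efficiently with configuration management tools such as Puppet, Chef, and CFEngine.
Apart from these tools, the administrator should also have good capacity planning skills to design and plan clusters.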

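Several nodes in a cluster need their data duplicated. For example, the fsimage file of the namenode daemon can be configured to be written to two different disks on the same node, or to a disk on a different node.
A Hadoop administrator therefore needs to understand NFS mount points and how to set them up within a cluster. The administrator may also be asked to set up RAID for the disks on specific nodes.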
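As a rough illustration of the fsimage duplication described above, the sketch below builds the configuration property that lists several namenode metadata directories. It assumes the Hadoop 1.x property name dfs.name.dir (dfs.namenode.name.dir in Hadoop 2 and later), which takes a comma-separated list of directories. The three paths, including the NFS mount, are hypothetical.

```python
import xml.etree.ElementTree as ET


def name_dir_property(directories, key="dfs.name.dir"):
    """Build a Hadoop <property> element listing every fsimage directory.

    The namenode writes its metadata to each directory in the list, so
    mixing local disks with an NFS mount gives redundant copies.
    """
    if not directories:
        raise ValueError("at least one directory is required")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = key
    ET.SubElement(prop, "value").text = ",".join(directories)
    return prop


# Hypothetical layout: two local disks plus one NFS mount point.
dirs = ["/data/1/dfs/nn", "/data/2/dfs/nn", "/mnt/nfs/dfs/nn"]
snippet = ET.tostring(name_dir_property(dirs), encoding="unicode")
print(snippet)
```

The printed element would go inside the <configuration> block of hdfs-site.xml; generating it from a list makes it easy to keep the same layout across nodes with a configuration management tool.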

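Because all Hadoop services and daemons are built on Java, a basic knowledge of the JVM (Java Virtual Machine) and the ability to understand Java exceptions are very useful.
This knowledge helps administrators identify issues quickly.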
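One way to put that knowledge to work is to tally which Java exceptions show up in a daemon log. The following is a minimal sketch, not a tool from the original text: the regular expression is a heuristic for fully qualified class names ending in Exception or Error, and the sample lines are synthetic, not real Hadoop output.

```python
import re
from collections import Counter

# Heuristic: lowercase package segments followed by a class name that
# ends in "Exception" or "Error", e.g. java.io.IOException.
EXCEPTION_RE = re.compile(
    r"\b((?:[a-z_][\w$]*\.)+[A-Z][\w$]*(?:Exception|Error))\b"
)


def count_exceptions(lines):
    """Return a Counter mapping exception class name to occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(EXCEPTION_RE.findall(line))
    return counts


# Synthetic sample lines, made up for illustration only.
sample_log = [
    "2013-05-01 10:00:01 INFO worker: started",
    "2013-05-01 10:00:02 ERROR worker: java.io.IOException: Connection reset by peer",
    "java.lang.OutOfMemoryError: Java heap space",
    "Caused by: java.io.IOException: Broken pipe",
]

for name, count in count_exceptions(sample_log).most_common():
    print(count, name)
```

Sorting by frequency tends to surface the dominant failure first, which is usually the place to start reading the full stack trace.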

A Hadoop administrator should be able to benchmark the cluster to test its performance under high-traffic scenarios.

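Clusters run continuously and process large amounts of data, so they are prone to failures. To monitor the health of the cluster, the administrator should deploy monitoring tools such as Nagios and Ganglia.
The administrator should also configure alerts and monitors for critical nodes, so that issues can be foreseen before they occur.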
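Alerts of this kind are often written as small check scripts. The sketch below follows the Nagios plugin convention of return codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN). The function name, the 80/90 percent thresholds, and the idea of checking disk usage are assumptions for illustration; how the usage figure is collected is left out.

```python
# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_usage(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a (return code, status line) pair.

    warn and crit are hypothetical thresholds; tune them per cluster.
    """
    if used_pct is None or not 0.0 <= used_pct <= 100.0:
        return UNKNOWN, "UNKNOWN - no valid usage value"
    if used_pct >= crit:
        return CRITICAL, "CRITICAL - usage at %.1f%%" % used_pct
    if used_pct >= warn:
        return WARNING, "WARNING - usage at %.1f%%" % used_pct
    return OK, "OK - usage at %.1f%%" % used_pct


code, message = check_usage(85.0)
print(code, message)
```

A real plugin would print the status line and then exit with the return code, so the monitoring server can raise a warning before the disk actually fills up.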

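Good knowledge of a scripting language such as Python, Ruby, or Shell greatly helps a Hadoop administrator.
Administrators are often asked to set up some kind of scheduled file staging from an external source into HDFS. Scripting skills let them handle such requests by writing scripts and automating them.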
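A scheduled staging job of this kind could look roughly like the sketch below. It only plans the `hadoop fs -put` commands for files that have not been staged yet; the function name, the directory names, and the way already-staged files are tracked (a set of file names) are assumptions for illustration.

```python
from pathlib import PurePosixPath


def plan_staging(local_files, already_staged, hdfs_dir):
    """Return one `hadoop fs -put` command per file not staged yet.

    local_files: paths found in the external drop directory.
    already_staged: file names that were uploaded by earlier runs.
    hdfs_dir: target directory in HDFS.
    """
    commands = []
    for path in sorted(local_files):
        name = PurePosixPath(path).name
        if name in already_staged:
            continue  # skip files uploaded by a previous run
        target = str(PurePosixPath(hdfs_dir) / name)
        commands.append(["hadoop", "fs", "-put", path, target])
    return commands


# Hypothetical paths for illustration.
plan = plan_staging(
    ["/staging/in/b.csv", "/staging/in/a.csv"],
    already_staged={"a.csv"},
    hdfs_dir="/data/incoming",
)
for command in plan:
    print(" ".join(command))
```

Each command list can be handed to `subprocess.run`, and the script itself scheduled with cron. Keeping the planning step separate from execution makes it possible to dry-run the job first.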

Above all, a Hadoop administrator should have a very good understanding of the Apache Hadoop architecture and its inner workings.

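The following are some of the key Hadoop-related operations that a Hadoop administrator should know:
Planning the cluster: deciding on the number of nodes based on the estimated amount of data the cluster will serve.
Installing and upgrading Apache Hadoop on a cluster.
Configuring and tuning Hadoop through its various configuration files.
Understanding all the Hadoop daemons, along with their roles and responsibilities in the cluster.
Knowing how to read and interpret Hadoop logs.
Adding and removing nodes in the cluster.
Rebalancing nodes in the cluster.
Enabling security with an authentication and authorization system such as Kerberos.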
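The first item in the list, sizing the cluster from the estimated data volume, can be sketched as a back-of-the-envelope calculation. The formula below is an assumption for illustration, not a rule from the original text: raw data is multiplied by the HDFS replication factor (3 by default) and by a hypothetical 25% allowance for temporary and intermediate data, then divided by the usable disk capacity per node.

```python
import math


def estimate_datanodes(raw_data_tb, usable_tb_per_node,
                       replication=3, temp_overhead=0.25):
    """Estimate the number of datanodes needed to store raw_data_tb.

    replication: HDFS replication factor (3 is the HDFS default).
    temp_overhead: assumed extra fraction for temporary and
        intermediate data; a placeholder value, not a Hadoop setting.
    """
    if usable_tb_per_node <= 0:
        raise ValueError("usable_tb_per_node must be positive")
    required_tb = raw_data_tb * replication * (1 + temp_overhead)
    return math.ceil(required_tb / usable_tb_per_node)


# 100 TB of raw data on nodes with 24 TB of usable disk each.
nodes = estimate_datanodes(100, 24)
print(nodes)  # 16
```

The estimate covers storage only; CPU, memory, and expected data growth would also need to be factored in before settling on a node count.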

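Almost all organizations follow a policy of backing up their data, and performing those backups is the responsibility of the Hadoop administrator.
A Hadoop administrator should therefore be well versed in server backup and recovery operations.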

Original text:
Responsibilities of a Hadoop administrator

With the increase in the interest to derive insight on their big data,
organizations are now planning and building their big data teams aggressively.
To start working on their data, they need to have a good solid infrastructure.
Once they have this setup, they need several controls and system policies in place to maintain, manage, and troubleshoot their cluster.

There is an ever-increasing demand for Hadoop Administrators in the market
as their function (setting up and maintaining Hadoop clusters) is what makes analysis really possible.

The Hadoop administrator needs to be very good at system operations, networking, operating systems, and storage.
They need to have a strong knowledge of computer hardware and their operations, in a complex network.

Apache Hadoop, mainly, runs on Linux. So having good Linux skills such as monitoring, troubleshooting, configuration, and security is a must.

Setting up nodes for clusters involves a lot of repetitive tasks
and the Hadoop administrator should use quicker and efficient ways to bring up these servers using configuration management tools
such as Puppet, Chef, and CFEngine.
Apart from these tools, the administrator should also have good capacity planning skills to design and plan clusters.

There are several nodes in a cluster that would need duplication of data,
for example, the fsimage file of the namenode daemon can be configured to write to two different disks on the same node
or on a disk on a different node.
An understanding of NFS mount points and how to set it up within a cluster is required.
The administrator may also be asked to set up RAID for disks on specific nodes.

As all Hadoop services/daemons are built on Java,
a basic knowledge of the JVM along with the ability to understand Java exceptions would be very useful.
This helps administrators identify issues quickly.

The Hadoop administrator should possess the skills to benchmark the cluster to test performance under high traffic scenarios.

Clusters are prone to failures as they are up all the time and are processing large amounts of data regularly.
To monitor the health of the cluster, the administrator should deploy monitoring tools such as Nagios and Ganglia
and should configure alerts and monitors for critical nodes of the cluster to foresee issues before they occur.

Knowledge of a good scripting language such as Python, Ruby, or Shell would greatly help the function of an administrator.
Often, administrators are asked to set up some kind of a scheduled file staging from an external source to HDFS.
The scripting skills help them execute these requests by building scripts and automating them.

Above all, the Hadoop administrator should have a very good understanding of the Apache Hadoop architecture and its inner workings.

The following are some of the key Hadoop-related operations that the Hadoop administrator should know:

Planning the cluster, deciding on the number of nodes based on the estimated amount of data the cluster is going to serve.

Installing and upgrading Apache Hadoop on a cluster.

Configuring and tuning Hadoop using the various configuration files available within Hadoop.

An understanding of all the Hadoop daemons along with their roles and responsibilities in the cluster.

The administrator should know how to read and interpret Hadoop logs.

Adding and removing nodes in the cluster.

Rebalancing nodes in the cluster.

Employ security using an authentication and authorization system such as Kerberos.

Almost all organizations follow the policy of backing up their data
and it is the responsibility of the administrator to perform this activity.
So, an administrator should be well versed with backups and recovery operations of servers.

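Hadoop_hbase

1. Recovering from a crashed Hadoop datanode
Go to Hadoop's bin directory and start the daemons:
    cd path/to/hadoop
    ./hadoop-daemon.sh start datanode
    ./hadoop-daemon.sh start tasktracker

2. Recovering from a crashed Hadoop namenode
    ./hadoop-daemon.sh start namenode
    ./hadoop-daemon.sh start tasktracker

3. Adding a new node
First, add the new node's IP address or hostname to the conf/slaves file on the master node.
Then log in to the new slave node and run:
    $ cd path/to/hadoop
    $ bin/hadoop-daemon.sh start datanode
    $ bin/hadoop-daemon.sh start tasktracker
After that, run the balancer on the master machine to rebalance the load:
    $ bin/hadoop balancer

4. Recovering from a crashed HBase regionserver
    ./hbase-daemon.sh start regionserver
    ./hbase-daemon.sh start zookeeper
(The zookeeper step applies only to regionservers that also run ZooKeeper, and only when the machine had to be rebooted.)

5. Recovering from a crashed HBase master
    ./hbase-daemon.sh start master
    ./hbase-daemon.sh start zookeeper
(The zookeeper step is optional.)

6. Restarting the entire cluster
First, as root, stop the firewall on every node:
    /etc/init.d/iptables stop
Then start the Hadoop cluster from the Hadoop installation path:
    ./start-all.sh
About two minutes after the whole cluster has started successfully, take the Hadoop file system out of safe mode:
    ./hadoop dfsadmin -safemode leave

About safe mode in the Hadoop file system:
The NameNode enters safe mode first when it starts. If the proportion of blocks missing from the datanodes reaches (1 - dfs.safemode.threshold.pct), the system stays in safe mode, that is, in a read-only state.
dfs.safemode.threshold.pct (default 0.999f) means that, at startup, HDFS can leave safe mode only once the number of blocks reported by the DataNodes reaches 0.999 times the number of blocks recorded in the metadata; until then it stays read-only. A value of 1 requires every block to be reported, and a value greater than 1 keeps HDFS in safe mode permanently.
There are two ways to leave safe mode:
(1) Set dfs.safemode.threshold.pct to a smaller value (the default is 0.999).
(2) Force it with the command hadoop dfsadmin -safemode leave.

Safe mode can be controlled with dfsadmin -safemode $value, where $value is one of:
enter - enter safe mode
leave - force the NameNode to leave safe mode
get - report whether safe mode is on
wait - wait until safe mode ends
// because we will use this later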
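The threshold rule for leaving safe mode can be written down as a small function. This is a simplified model of the behaviour described above, not the NameNode's actual implementation: it ignores any other conditions the NameNode applies, and the function and parameter names are made up for illustration.

```python
def can_leave_safe_mode(reported_blocks, total_blocks, threshold_pct=0.999):
    """Return True if enough blocks are reported to leave safe mode.

    threshold_pct models dfs.safemode.threshold.pct:
      <= 0  do not wait for any blocks
      > 1   safe mode is permanent
    """
    if threshold_pct > 1:
        return False  # can never be satisfied
    if threshold_pct <= 0 or total_blocks == 0:
        return True
    return reported_blocks >= threshold_pct * total_blocks


print(can_leave_safe_mode(990, 1000))    # False: only 99.0% reported
print(can_leave_safe_mode(1000, 1000))   # True: every block reported
```

With the default threshold, a cluster of 1000 blocks stays read-only while 10 of them are still unreported, which is why forcing `-safemode leave` too early can hide missing data.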