Hadoop: HDFS High-Availability Setup and Basic Usage
HDFS Overview
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
Hadoop is written in Java and is supported on all major platforms.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications with large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
HDFS was originally built as infrastructure for Apache Nutch. HDFS is part of the Apache Hadoop Core project.
The project URL is http://hadoop.apache.org/.
HDFS Assumptions and Goals
1. Hardware Failure - failure detection and fast, automatic recovery is a core architectural goal of HDFS
2. Streaming Data Access - streaming access to data sets
3. Large Data Sets
4. Simple Coherency Model - a write-once-read-many consistency model
5. Moving Computation is Cheaper than Moving Data
6. Portability Across Heterogeneous Hardware and Software Platforms
HDFS Architecture
NameNode and DataNodes
HDFS has a master/slave architecture: the NameNode is the master, and the DataNodes are the slaves.
1. NameNode - manages metadata
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
HDFS exposes a file system namespace and allows user data to be stored in files.
The NameNode manages the file system namespace and regulates client access to files; it executes operations such as opening, closing, and renaming files and directories, and determines the mapping of blocks to DataNodes.
2. DataNodes - store data
In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
A cluster has a number of DataNodes, usually one per node, which manage the storage attached to the nodes they run on. A file's blocks are stored across a set of DataNodes. The DataNodes serve read and write requests from clients, and create, delete, and replicate blocks upon instruction from the NameNode.
HDFS is built in Java, which makes it highly portable.
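To see this division of labor from the client side, here is a minimal Java sketch (assuming the cluster's core-site.xml/hdfs-site.xml are on the classpath; /foodir/myfile.txt is a hypothetical path reused from the FS Shell table below) that asks the NameNode for a file's block-to-DataNode mapping:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/foodir/myfile.txt"); // hypothetical example path
            FileStatus status = fs.getFileStatus(file);
            // Answered by the NameNode from its in-memory metadata
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println(b); // offset, length, and the DataNodes holding each replica
            }
        }
    }
}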
3. The File System Namespace
HDFS supports a traditional hierarchical file organization.
HDFS supports user quotas and access permissions.
HDFS does not support hard links or soft links.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
4. Data Replication
The block size and replication factor are configurable per file.
All blocks in a file except the last block are the same size; since support for variable-length blocks was added to append and hsync, users can start a new block without filling the last block up to the configured block size.
The NameNode makes all decisions regarding replication of blocks.
It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
4.1 Replica Placement
The placement of replicas is critical to HDFS reliability and performance.
If the replication factor is greater than 3, the placement of the 4th and following replicas are determined randomly while keeping the number of replicas per rack below the upper limit (which is basically (replicas - 1) / racks + 2).
Because the NameNode does not allow DataNodes to have multiple replicas of the same block, the maximum number of replicas created is the total number of DataNodes at that time.
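As a quick sanity check of that per-rack formula, a minimal sketch:

public class RackLimit {
    // Per-rack upper limit quoted above: (replicas - 1) / racks + 2, with integer division
    static int maxReplicasPerRack(int replicas, int racks) {
        return (replicas - 1) / racks + 2;
    }
    public static void main(String[] args) {
        // 10 replicas across 3 racks: (10 - 1) / 3 + 2 = 5 replicas per rack at most
        System.out.println(maxReplicasPerRack(10, 3)); // prints 5
    }
}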
4.2 Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica closest to the reader. If a replica exists on the same rack as the reader node, reading that replica directly is the most efficient choice. If the HDFS cluster spans multiple data centers, a replica resident in the local data center is preferred over any remote replica.
4.3 Block Placement Policies
In addition to the default policy, HDFS supports the following pluggable Block Placement Policies:
- BlockPlacementPolicyRackFaultTolerant
- BlockPlacementPolicyWithNodeGroup
- BlockPlacementPolicyWithUpgradeDomain
- AvailableSpaceBlockPlacementPolicy
- AvailableSpaceRackFaultTolerantBlockPlacementPolicy
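A policy is plugged in through the dfs.block.replicator.classname property in hdfs-site.xml; check the hdfs-default.xml of your Hadoop version for the exact fully-qualified class names.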
5. The Persistence of File System Metadata
5.1 EditLog
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata.
The NameNode uses a file in its local host OS file system to store the EditLog.
5.2 FsImage
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
5.3 How metadata and data are persisted
NameNode
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
When the NameNode starts up, or when a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all of the EditLog transactions to the in-memory representation of the FsImage, flushes this new version out to disk as a new FsImage, and truncates the old EditLog. This process is called a checkpoint. The purpose of a checkpoint is to keep the persisted file system metadata consistent with the in-memory state.
Reading an FsImage is efficient, but rewriting the FsImage for every individual change is not, so each edit is appended to the EditLog instead and only folded into the FsImage when a checkpoint is triggered.
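To make the FsImage/EditLog interplay concrete, here is a toy checkpoint model in plain Java; the map and list below merely stand in for the real on-disk structures, and none of these names come from Hadoop itself:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointToy {
    static Map<String, String> fsImage = new HashMap<>(); // last persisted snapshot
    static List<String[]> editLog = new ArrayList<>();    // {op, key, value} records

    static void logEdit(String op, String key, String value) {
        editLog.add(new String[]{op, key, value}); // cheap sequential append
    }

    static void checkpoint() {
        Map<String, String> ns = new HashMap<>(fsImage); // read the FsImage
        for (String[] e : editLog) {                     // replay every transaction
            if (e[0].equals("put")) ns.put(e[1], e[2]); else ns.remove(e[1]);
        }
        fsImage = ns;                                    // flush a new FsImage
        editLog.clear();                                 // truncate the old EditLog
    }

    public static void main(String[] args) {
        logEdit("put", "/a", "1");
        logEdit("put", "/b", "2");
        logEdit("del", "/a", null);
        checkpoint();
        System.out.println(fsImage); // {/b=2}: the edits were folded into the image
    }
}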
DataNode
The DataNode stores HDFS data in files in its local file system. It stores each block of HDFS data in a separate file in its local file system.
When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the Blockreport.
6. The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client connects to a configurable TCP port on the NameNode machine and talks the Client Protocol with the NameNode; DataNodes talk to the NameNode using the DataNode Protocol. An RPC abstraction wraps both protocols. By design, the NameNode never initiates any RPCs; it only responds to RPC requests issued by clients and DataNodes.
7. Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures, and network partitions.
8. Data Organization
Data Blocks
HDFS is designed to support very large files. HDFS supports write-once-read-many semantics on files.
A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
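For example, a 500 MB file is chopped into three full 128 MB blocks plus one 116 MB block, and each block is replicated independently.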
Replication Pipelining
During a write, data is pushed along a pipeline of DataNodes: a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one. Thus, the data is pipelined from one DataNode to the next.
9. Accessibility
Natively, HDFS provides a FileSystem Java API for applications to use.
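A minimal write-then-read sketch against that API, assuming fs.defaultFS points at the cluster (e.g. hdfs://mycluster, as configured later in this article) and reusing the hypothetical /foodir/myfile.txt path from the tables below:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/foodir/myfile.txt"); // hypothetical example path
            try (FSDataOutputStream out = fs.create(file, true)) { // write once
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            try (FSDataInputStream in = fs.open(file)) {           // read many
                IOUtils.copyBytes(in, System.out, 4096, false);    // stream to stdout
            }
        }
    }
}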
FS Shell
The FS Shell manages the files in HDFS from the command line.
Action | Command |
---|---|
Create a directory named /foodir | bin/hadoop fs -mkdir /foodir |
Remove a directory named /foodir | bin/hadoop fs -rm -R /foodir |
View the contents of a file named /foodir/myfile.txt | bin/hadoop fs -cat /foodir/myfile.txt |
DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster.
Action | Command |
---|---|
Put the cluster in Safemode | bin/hdfs dfsadmin -safemode enter |
Generate a list of DataNodes | bin/hdfs dfsadmin -report |
Recommission or decommission DataNode(s) | bin/hdfs dfsadmin -refreshNodes |
Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
10. Space Reclamation
File deletes are not immediate: deleted files are first moved to a trash directory.
If trash configuration is enabled, files removed by the FS Shell are not immediately removed from HDFS.
Instead, HDFS moves them to a trash directory (each user has their own trash directory under /user/<username>/.Trash). A file can be restored quickly as long as it remains in trash.
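Trash is enabled by setting fs.trash.interval in core-site.xml to a value greater than zero; the value is the number of minutes a deleted file is kept in trash before it is permanently removed.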
Decreasing the replication factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
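A minimal sketch of that call through the Java API (again using the hypothetical /foodir/myfile.txt path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Ask the NameNode to lower this file's replication factor to 2.
            // The call returns quickly; excess replicas are removed lazily as the
            // NameNode piggybacks delete instructions on DataNode heartbeats,
            // so the freed space appears in the cluster only after a delay.
            boolean accepted = fs.setReplication(new Path("/foodir/myfile.txt"), (short) 2);
            System.out.println("accepted: " + accepted);
        }
    }
}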
HDFS High-Availability Installation
1. Problem and solution
Problems with a single NameNode:
- If the machine crashes, the cluster is unavailable until an operator restarts it
- Planned software or hardware upgrades require a restart, resulting in windows of cluster downtime
The high-availability solution:
Deploy two NameNodes, one Active and one Standby. Since Hadoop 3.0, more than two can be deployed: one active and several standbys. The active NameNode is responsible for all client operations in the cluster, while the standbys simply maintain enough state to provide a fast failover when necessary.
2. Installing HDFS High Availability Using the Quorum Journal Manager
2.1 Architecture
**Deployment model:** active/standby, with one active NameNode and one or more standbys.
State synchronization: through a group of separate daemon processes called JournalNodes (JNs).
Data flow:
- Every namespace modification performed by the active NameNode is durably logged to the JNs. The standby NameNodes continuously watch the JNs for edit-log changes and apply them to their own namespaces, staying synchronized with the active NameNode so that, on a failure, a standby already holds the complete namespace state.
- To enable a fast failover, the standbys must also have up-to-date block location information, so the DataNodes send block location information and heartbeats to all of the NameNodes.
Ensuring only one active NameNode:
Why: otherwise the namespace state would quickly diverge between the two (a "split-brain" scenario), risking data loss or other incorrect results.
How: the JournalNodes only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode that becomes active takes over the role of writing to the JournalNodes.
2.2 Hardware resources
NameNode machines
The active and standby NameNode machines should have equivalent hardware.
JournalNode machines
There must be at least 3 JournalNode daemons.
You should run an odd number of JNs (i.e. 3, 5, 7, etc.), since edits must be acknowledged by a majority quorum.
With N JournalNodes, the system can tolerate at most **(N - 1) / 2 failures** and continue to function normally.
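For example, a 3-JN deployment tolerates the loss of 1 JournalNode, and a 5-JN deployment tolerates the loss of 2.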
Other notes:
In an HA cluster, the standby NameNodes also perform checkpoints of the namespace state, so it is unnecessary (and an error) to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster.
2.3 Deployment
To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.
hdfs-site.xml
<!--
The logical name of this nameservice; it is used in the authority component of absolute HDFS paths for the cluster.
Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.
-->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<!--
Unique identifiers for each NameNode in the nameservice; here three NameNodes: nn1, nn2, nn3.
Note: The minimum number of NameNodes for HA is two, but you can configure more. It is suggested not to exceed 5 - with 3 NameNodes recommended - due to communication overheads.
-->
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2,nn3</value>
</property>
<!--
The fully-qualified RPC address and port that each NameNode listens on
-->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hadoop1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hadoop2:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn3</name>
<value>hadoop3:8020</value>
</property>
<!--
The HTTP address of each NameNode's web UI
-->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hadoop1:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hadoop2:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn3</name>
<value>hadoop3:9870</value>
</property>
<!--
The URI identifying the group of JournalNodes to which the NameNodes write, and from which they read, edits
-->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/mycluster</value>
</property>
<!--
The Java class that HDFS clients use to determine which NameNode is currently active
-->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!--
Fencing: sshfence SSHes to the previously active NameNode and kills its process. Passwordless SSH access must be enabled, so the private key file is configured below.
-->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!--
Optionally, a non-standard username and port can be supplied to sshfence, and an SSH connect timeout (in milliseconds) can be configured. Note that this dfs.ha.fencing.methods value would replace the one above rather than add to it.
-->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence([[username][:port]])</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
core-site.xml
<!--
The default path prefix used by the Hadoop FS client when none is given
-->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<!--
The path on the local filesystem where the JournalNode daemon stores its state
(note: this and the dfs.ha.* property below are documented as hdfs-site.xml settings)
-->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/var/bigdata/hadoop/ha/dfs/jn</value>
</property>
<!--
Whether to prevent a NameNode that is in safe mode from becoming active
-->
<property>
<name>dfs.ha.nn.not-become-active-in-safemode</name>
<value>true</value>
</property>
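<!--
A base for Hadoop's other temporary and data directories (kept per-user here via ${user.name})
-->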
<property>
<name>hadoop.tmp.dir</name>
<value>/var/bigdata/hadoop/ha/hadoop-${user.name}</value>
</property>
Detailed deployment steps
1. First start the JournalNode daemons on their machines and wait for them to come up successfully:
hdfs --daemon start journalnode
2. Once the JNs have started, the on-disk metadata of the HA NameNodes must be synchronized initially:
- If you are setting up a fresh HDFS cluster, first run the format command on one of the NameNodes:
hdfs namenode -format
- After formatting, run the following on each unformatted standby NameNode to copy over the formatted metadata:
hdfs namenode -bootstrapStandby
Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.
- If you are converting a non-HA NameNode to be HA, run:
hdfs namenode -initializeSharedEdits
which initializes the JournalNodes with the edits data from the local NameNode's edits directories.
2.4 Automatic Failover
Failure detection
Each NameNode in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session expires, notifying the other NameNode(s) that a failover should be triggered.
Active NameNode election
ZooKeeper provides a simple mechanism to exclusively elect a node as active: when the active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active NameNode.
Deploying ZooKeeper itself is out of scope here; refer to the ZooKeeper documentation.
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
Configuring automatic failover
Add the following to hdfs-site.xml:
<!-- Enable automatic failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
Add the following to core-site.xml:
<!-- The addresses of the ZooKeeper quorum -->
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
Initializing HA state in ZooKeeper: run the following command from one of the NameNode hosts.
$HADOOP_HOME/bin/hdfs zkfc -formatZK
Running this command creates a znode in ZooKeeper in which the automatic failover system stores its data.
Starting the cluster with start-dfs.sh
[root@hadoop1 ~]# start-dfs.sh
WARNING: HADOOP_SECURE_DN_USER has been replaced by HDFS_DATANODE_SECURE_USER. Using value of HADOOP_SECURE_DN_USER.
Starting namenodes on [hadoop1 hadoop2 hadoop3]
Last login: Mon Aug 7 12:21:14 CST 2023 on pts/0
Starting datanodes
Last login: Fri Aug 18 10:46:08 CST 2023 on pts/0
Starting journal nodes [hadoop3 hadoop1]
ERROR: Attempting to operate on hdfs journalnode as root
ERROR: but there is no HDFS_JOURNALNODE_USER defined. Aborting operation.
Starting ZK Failover Controllers on NN hosts [hadoop1 hadoop2 hadoop3]
Last login: Fri Aug 18 10:46:11 CST 2023 on pts/0
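Note: the journalnode ERROR above is raised because start-dfs.sh is run as root without HDFS_JOURNALNODE_USER defined; defining HDFS_JOURNALNODE_USER (and HDFS_ZKFC_USER) in hadoop-env.sh, the same way the NameNode and DataNode users presumably were, lets the script start the JournalNodes as well.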
Since automatic failover has been configured, start-dfs.sh automatically starts a ZKFC daemon on every machine that runs a NameNode. When the ZKFCs start, they automatically select one of the NameNodes to become active.
Starting the cluster manually
If you manage the services on your cluster manually, you will need to manually start the zkfc daemon on each machine that runs a NameNode:
[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start zkfc