Hadoop Core Component HDFS: Introduction and Basic Commands


Introduction to HDFS, one of Hadoop's three core components

HDFS (Hadoop Distributed File System)

HDFS is the distributed file system component; it solves the problem of distributed storage.

1. HDFS Characteristics

Advantages of HDFS

1. Handles very large files
2. Runs on inexpensive commodity hardware
3. High fault tolerance
4. Streaming file access (write once, read many times)

Disadvantages of HDFS

1. Not suited to low-latency data access
2. Not suited to storing large numbers of small files
3. Not suited to concurrent writers or random file modification

2. HDFS dfsadmin

The dfsadmin command is used to administer an HDFS cluster (a short example session follows the table below).

Command                                          Function
hdfs dfsadmin -report                            Report basic cluster status and statistics
hdfs dfsadmin -safemode enter/leave              Enter or leave safe mode
hdfs dfsadmin -saveNamespace                     Save the current namespace to a new fsimage (requires safe mode)
hdfs dfsadmin -rollEdits                         Roll the edit log (finalize the current segment and start a new one)
hdfs dfsadmin -refreshNodes                      Refresh the set of DataNodes allowed to connect (re-reads the include/exclude files)
hdfs dfsadmin -getDatanodeInfo node1:8010        Get information about the given DataNode
hdfs dfsadmin -setQuota 10 /hdfs                 Set a name quota on a directory
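
A minimal sketch of a typical dfsadmin session; the directory /mydemo is just the placeholder used elsewhere in this article, so adjust paths to your cluster.

# Check overall cluster health: capacity, live/dead DataNodes, under-replicated blocks
hdfs dfsadmin -report

# Query, enter and leave safe mode (the namespace is read-only while in safe mode)
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave

# Limit a directory to at most 10 names (files + subdirectories), then verify with -count -q
hdfs dfsadmin -setQuota 10 /mydemo
hdfs dfs -count -q /mydemo

# Remove the quota again
hdfs dfsadmin -clrQuota /mydemo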

3. HDFS CLI (Command Line)

Basic format:
hdfs dfs -cmd
hadoop fs -cmd (older form; hdfs dfs is the preferred command for HDFS)
The commands are similar to their Linux equivalents.

All commands and their usage:

[root@zjw ~]# hdfs dfs -help
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-x] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touchz <path> ...]
	[-usage [cmd ...]]

-appendToFile <localsrc> ... <dst> :
  Appends the contents of all the given local files to the given dst file. The dst
  file will be created if it does not exist. If <localSrc> is -, then the input is
  read from stdin.

-cat [-ignoreCrc] <src> ... :
  Fetch all files that match the file pattern <src> and display their content on
  stdout.

-checksum <src> ... :
  Dump checksum information for files that match the file pattern <src> to stdout.
  Note that this requires a round-trip to a datanode storing each block of the
  file, and thus is not efficient to run on a large number of files. The checksum
  of a file depends on its content, block size and the checksum algorithm and
  parameters used for creating the file.

-chgrp [-R] GROUP PATH... :
  This is equivalent to -chown ... :GROUP ...

-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... :
  Changes permissions of a file. This works similar to the shell's chmod command
  with a few exceptions.
                                                                                 
  -R           modifies the files recursively. This is the only option currently 
               supported.                                                        
  <MODE>       Mode is the same as mode used for the shell's command. The only   
               letters recognized are 'rwxXt', e.g. +t,a+r,g-w,+rwx,o=r.         
  <OCTALMODE>  Mode specifed in 3 or 4 digits. If 4 digits, the first may be 1 or
               0 to turn the sticky bit on or off, respectively.  Unlike the     
               shell command, it is not possible to specify only part of the     
               mode, e.g. 754 is same as u=rwx,g=rx,o=r.                         
  
  If none of 'augo' is specified, 'a' is assumed and unlike the shell command, no
  umask is applied.

-chown [-R] [OWNER][:[GROUP]] PATH... :
  Changes owner and group of a file. This is similar to the shell's chown command
  with a few exceptions.
                                                                                 
  -R  modifies the files recursively. This is the only option currently          
      supported.                                                                 
  
  If only the owner or group is specified, then only the owner or group is
  modified. The owner and group names may only consist of digits, alphabet, and
  any of [-_./@a-zA-Z0-9]. The names are case sensitive.
  
  WARNING: Avoid using '.' to separate user name and group though Linux allows it.
  If user names have dots in them and you are using local file system, you might
  see surprising results since the shell command 'chown' is used for local files.

-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst> :
  Identical to the -put command.

-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
  Identical to the -get command.

-count [-q] [-h] [-v] [-x] <path> ... :
  Count the number of directories, files and bytes under the paths
  that match the specified file pattern.  The output columns are:
  DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  or, with the -q option:
  QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA
        DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  The -h option shows file sizes in human readable format.
  The -v option displays a header line.
  The -x option excludes snapshots from being calculated.

-cp [-f] [-p | -p[topax]] <src> ... <dst> :
  Copy files that match the file pattern <src> to a destination.  When copying
  multiple files, the destination must be a directory. Passing -p preserves status
  [topax] (timestamps, ownership, permission, ACLs, XAttr). If -p is specified
  with no <arg>, then preserves timestamps, ownership, permission. If -pa is
  specified, then preserves permission also because ACL is a super-set of
  permission. Passing -f overwrites the destination if it already exists. raw
  namespace extended attributes are preserved if (1) they are supported (HDFS
  only) and, (2) all of the source and target pathnames are in the /.reserved/raw
  hierarchy. raw namespace xattr preservation is determined solely by the presence
  (or absence) of the /.reserved/raw prefix and not by the -p option.

-createSnapshot <snapshotDir> [<snapshotName>] :
  Create a snapshot on a directory

-deleteSnapshot <snapshotDir> <snapshotName> :
  Delete a snapshot from a directory

-df [-h] [<path> ...] :
  Shows the capacity, free and used space of the filesystem. If the filesystem has
  multiple partitions, and no path to a particular partition is specified, then
  the status of the root partitions will be shown.
                                                                                 
  -h  Formats the sizes of files in a human-readable fashion rather than a number
      of bytes.                                                                  

-du [-s] [-h] [-x] <path> ... :
  Show the amount of space, in bytes, used by the files that match the specified
  file pattern. The following flags are optional:
                                                                                 
  -s  Rather than showing the size of each individual file that matches the      
      pattern, shows the total (summary) size.                                   
  -h  Formats the sizes of files in a human-readable fashion rather than a number
      of bytes.                                                                  
  -x  Excludes snapshots from being counted.                                     
  
  Note that, even without the -s option, this only shows size summaries one level
  deep into a directory.
  
  The output is in the form 
  	size	disk space consumed	name(full path)

-expunge :
  Empty the Trash

-find <path> ... <expression> ... :
  Finds all files that match the specified expression and
  applies selected actions to them. If no <path> is specified
  then defaults to the current working directory. If no
  expression is specified then defaults to -print.
  
  The following primary expressions are recognised:
    -name pattern
    -iname pattern
      Evaluates as true if the basename of the file matches the
      pattern using standard file system globbing.
      If -iname is used then the match is case insensitive.
  
    -print
    -print0
      Always evaluates to true. Causes the current pathname to be
      written to standard output followed by a newline. If the -print0
      expression is used then an ASCII NULL character is appended rather
      than a newline.
  
  The following operators are recognised:
    expression -a expression
    expression -and expression
    expression expression
      Logical AND operator for joining two expressions. Returns
      true if both child expressions return true. Implied by the
      juxtaposition of two expressions and so does not need to be
      explicitly specified. The second expression will not be
      applied if the first fails.

-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
  Copy files that match the file pattern <src> to the local name.  <src> is kept. 
  When copying multiple files, the destination must be a directory. Passing -p
  preserves access and modification times, ownership and the mode.

-getfacl [-R] <path> :
  Displays the Access Control Lists (ACLs) of files and directories. If a
  directory has a default ACL, then getfacl also displays the default ACL.
                                                                  
  -R      List the ACLs of all files and directories recursively. 
  <path>  File or directory to list.                              

-getfattr [-R] {-n name | -d} [-e en] <path> :
  Displays the extended attribute names and values (if any) for a file or
  directory.
                                                                                 
  -R             Recursively list the attributes for all files and directories.  
  -n name        Dump the named extended attribute value.                        
  -d             Dump all extended attribute values associated with pathname.    
  -e <encoding>  Encode values after retrieving them.Valid encodings are "text", 
                 "hex", and "base64". Values encoded as text strings are enclosed
                 in double quotes ("), and values encoded as hexadecimal and     
                 base64 are prefixed with 0x and 0s, respectively.               
  <path>         The file or directory.                                          

-getmerge [-nl] <src> <localdst> :
  Get all the files in the directories that match the source file pattern and
  merge and sort them to only one file on local fs. <src> is kept.
                                                        
  -nl  Add a newline character at the end of each file. 

-help [cmd ...] :
  Displays help for given command or all commands if none is specified.

-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. For a directory a
  list of its direct children is returned (unless -d option is specified).
  
  Directory entries are of the form:
  	permissions - userId groupId sizeOfDirectory(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) directoryName
  
  and file entries are of the form:
  	permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) fileName
  
    -C  Display the paths of files and directories only.
    -d  Directories are listed as plain files.
    -h  Formats the sizes of files in a human-readable fashion
        rather than a number of bytes.
    -q  Print ? instead of non-printable characters.
    -R  Recursively list the contents of directories.
    -t  Sort files by modification time (most recent first).
    -S  Sort files by size.
    -r  Reverse the order of the sort.
    -u  Use time of last access instead of modification for
        display and sorting.

-mkdir [-p] <path> ... :
  Create a directory in specified location.
                                                  
  -p  Do not fail if the directory already exists 

-moveFromLocal <localsrc> ... <dst> :
  Same as -put, except that the source is deleted after it's copied.

-moveToLocal <src> <localdst> :
  Not implemented yet

-mv <src> ... <dst> :
  Move files that match the specified file pattern <src> to a destination <dst>. 
  When moving multiple files, the destination must be a directory.

-put [-f] [-p] [-l] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                       
  -p  Preserves access and modification times, ownership and the mode. 
  -f  Overwrites the destination if it already exists.                 
  -l  Allow DataNode to lazily persist the file to disk. Forces        
         replication factor of 1. This flag will result in reduced
         durability. Use with care.

-renameSnapshot <snapshotDir> <oldName> <newName> :
  Rename a snapshot from oldName to newName

-rm [-f] [-r|-R] [-skipTrash] <src> ... :
  Delete all files that match the specified file pattern. Equivalent to the Unix
  command "rm <src>"
                                                                                 
  -skipTrash  option bypasses trash, if enabled, and immediately deletes <src>   
  -f          If the file does not exist, do not display a diagnostic message or 
              modify the exit status to reflect an error.                        
  -[rR]       Recursively deletes directories                                    

-rmdir [--ignore-fail-on-non-empty] <dir> ... :
  Removes the directory entry specified by each directory argument, provided it is
  empty.

-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>] :
  Sets Access Control Lists (ACLs) of files and directories.
  Options:
                                                                                 
  -b          Remove all but the base ACL entries. The entries for user, group   
              and others are retained for compatibility with permission bits.    
  -k          Remove the default ACL.                                            
  -R          Apply operations to all files and directories recursively.         
  -m          Modify ACL. New entries are added to the ACL, and existing entries 
              are retained.                                                      
  -x          Remove specified ACL entries. Other ACL entries are retained.      
  --set       Fully replace the ACL, discarding all existing entries. The        
              <acl_spec> must include entries for user, group, and others for    
              compatibility with permission bits.                                
  <acl_spec>  Comma separated list of ACL entries.                               
  <path>      File or directory to modify.                                       

-setfattr {-n name [-v value] | -x name} <path> :
  Sets an extended attribute name and value for a file or directory.
                                                                                 
  -n name   The extended attribute name.                                         
  -v value  The extended attribute value. There are three different encoding     
            methods for the value. If the argument is enclosed in double quotes, 
            then the value is the string inside the quotes. If the argument is   
            prefixed with 0x or 0X, then it is taken as a hexadecimal number. If 
            the argument begins with 0s or 0S, then it is taken as a base64      
            encoding.                                                            
  -x name   Remove the extended attribute.                                       
  <path>    The file or directory.                                               

-setrep [-R] [-w] <rep> <path> ... :
  Set the replication level of a file. If <path> is a directory then the command
  recursively changes the replication factor of all files under the directory tree
  rooted at <path>.
                                                                                 
  -w  It requests that the command waits for the replication to complete. This   
      can potentially take a very long time.                                     
  -R  It is accepted for backwards compatibility. It has no effect.              

-stat [format] <path> ... :
  Print statistics about the file/directory at <path> in the specified format.
  Format accepts filesize in blocks (%b), group name of owner(%g), filename (%n),
  block size (%o), replication (%r), user name of owner(%u), modification date
  (%y, %Y)

-tail [-f] <file> :
  Show the last 1KB of the file.
                                             
  -f  Shows appended data as the file grows. 

-test -[defsz] <path> :
  Answer various questions about <path>, with result via exit status.
    -d  return 0 if <path> is a directory.
    -e  return 0 if <path> exists.
    -f  return 0 if <path> is a file.
    -s  return 0 if file <path> is greater than zero bytes in size.
    -z  return 0 if file <path> is zero bytes in size, else return 1.

-text [-ignoreCrc] <src> ... :
  Takes a source file and outputs the file in text format.
  The allowed formats are zip and TextRecordInputStream and Avro.

-touchz <path> ... :
  Creates a file of zero length at <path> with current time as the timestamp of
  that <path>. An error is returned if the file exists with non-zero length

-usage [cmd ...] :
  Displays the usage for given command or all commands if none is specified.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Demonstration of common commands:

-ls
hdfs dfs -ls /

-mkdir
hdfs dfs -mkdir /mydemo

-put
hdfs dfs -put /opt/zjw.txt /mydemo

-rm
hdfs dfs -rm /mydemo/zjw.txt

-help
hdfs dfs -help

-cat (only suitable for viewing plain text files)
hdfs dfs -cat /mydemo/zjw.txt

Create nested directories with -p
hdfs dfs -mkdir -p /mydemo/zjw

Fetch a file to the local file system
hdfs dfs -get /mydemo/zjw/zjw.txt /opt/

4. HDFS Architecture

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode and multiple DataNodes.

NameNode: the master server that manages the file system namespace and regulates client access to files, such as opening, closing, and renaming files and directories. It maintains the directory tree, the mapping between files and blocks, and the mapping between blocks and DataNodes, and it serves client requests.

1. The file metadata is held in an in-memory directory tree.
2. On disk it is persisted as the fsimage and edits files.
3. These files, together with the block information reported by the DataNodes, are loaded into memory when the system starts (see the sketch below).
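
As a rough illustration, the persisted metadata lives in the directory configured as dfs.namenode.name.dir in hdfs-site.xml; the path used here is only an assumption.

# List the NameNode metadata directory (path is configuration-dependent; /hadoop/dfs/name is an assumption)
ls /hadoop/dfs/name/current
# Typically contains fsimage_* checkpoint files, edits_* segment files, an edits_inprogress_* segment,
# plus the seen_txid and VERSION files.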

DataNode: (data node) manages the storage attached to the node on which it runs and serves read and write requests from file system clients. DataNodes also perform block creation and deletion.

Client: accesses the file system on behalf of the user by interacting with the NameNode and the DataNodes. HDFS exposes a file system namespace and lets user data be stored as files; users communicate with HDFS through the client.

Blocks and replication:
The default disk block size in a Linux file system is 512 bytes, while the default HDFS block size in Hadoop 2.x is 128 MB. Why is the HDFS block designed to be so large?
The goal is to minimize the cost of seeks: if the block is large enough, the time spent transferring the data from disk is clearly longer than the time spent seeking to the start of the block.

Why store a file as blocks rather than as a whole file?
1. A single file can be larger than any one disk in the cluster, so splitting it into blocks lets HDFS store files of any size.
2. It simplifies the design of the storage subsystem: because blocks have a fixed size, it is easy to work out how many fit on a given disk.
3. The blocks of a file do not have to sit on one disk; they can be spread across the disks of many nodes, which makes replication, fault tolerance, and data-local computation easier (see the sketch below).
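
As a sketch (file names are placeholders), the block size can be chosen per file at write time through the client-side dfs.blocksize property and then inspected with -stat:

# Upload a file with a 64 MB block size instead of the 128 MB default
hdfs dfs -D dfs.blocksize=67108864 -put /opt/zjw.txt /mydemo/zjw_64m.txt

# %o prints the block size, %r the replication factor
hdfs dfs -stat "block size %o, replication %r" /mydemo/zjw_64m.txt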

(Figure: how blocks and replicas are distributed across the HDFS cluster)
The NameNode manages the file system namespace and maintains the file system tree and all of the files and directories in it. This information is persisted on the local disk as two files: the namespace image file (fsimage) and the edit log file (edits). The DataNodes are the workhorses of the file system: they store and retrieve blocks as requested and periodically report the list of blocks they hold back to the NameNode. This makes the NameNode critically important: if the NameNode is lost, the whole distributed file system becomes unusable, so fault tolerance for the NameNode is essential. Hadoop provides the following mechanism for it.

The metadata that makes up the file system is persisted as the namespace image fsimage (the file system's directory tree) and the edit log edits (a record of the modification operations applied to the file system). The on-disk fsimage is a checkpoint, a milestone-like baseline and synchronization point. Once a checkpoint exists, the NameNode for a long stretch of time only updates the directory image in memory while appending operations to edits on disk, until it is shut down. At the next startup, the NameNode loads the directory image fsimage from disk; that is the old checkpoint, typically the image saved after the previous startup, and every change made to the file system since then, up to shutdown, is recorded in the edits file. Replaying the operations recorded in edits on top of the previous image yields the new image, and writing it back to disk produces a new checkpoint (a new fsimage).

This has one big drawback: if edits has grown very large, rebuilding the image after startup also takes a long time. The improvement is to create a checkpoint whenever edits grows beyond a certain size, or at fixed intervals. Doing this work on the NameNode itself, however, would put a heavy load on it and hurt system performance. This is where the SecondaryNameNode comes in: it acts as an assistant to the NameNode and performs checkpoints on its behalf. The SecondaryNameNode's load is comparatively light, so if the NameNode is given a hot standby, the standby can take over this duty and a dedicated SecondaryNameNode is no longer needed.

(Figure: NameNode and SecondaryNameNode architecture)

(Figure: SecondaryNameNode checkpoint workflow)

The SecondaryNameNode downloads the fsimage and edits files from the NameNode, merges them into a new fsimage, and pushes it back to the NameNode. It works as follows:

1. The SecondaryNameNode asks the primary NameNode to stop using its current edits file and to record new write operations in a new file for the time being;
2. The SecondaryNameNode retrieves fsimage and edits from the primary NameNode (via HTTP GET);
3. The SecondaryNameNode loads fsimage into memory, applies each operation from edits, and builds a new fsimage file;
4. The SecondaryNameNode sends the new fsimage back to the primary NameNode (via HTTP POST);
5. The primary NameNode replaces its old fsimage with the one received from the SecondaryNameNode, and replaces the old edits with the new edits file started in step 1. It also updates the fstime file to record the time of the checkpoint;
6. In the end, the primary NameNode has an up-to-date fsimage and a smaller edits file. While the NameNode is in safe mode, an administrator can also run hdfs dfsadmin -saveNamespace to create a checkpoint manually (a minimal example follows below).

This process also makes clear why the SecondaryNameNode has memory requirements close to those of the primary NameNode (it too loads fsimage into memory). In a large cluster, the SecondaryNameNode should therefore run on a dedicated machine.
When a checkpoint is created is controlled by two configuration parameters. By default, the SecondaryNameNode creates a checkpoint every hour (set by the fs.checkpoint.period property); in addition, a checkpoint is created when the edit log reaches 64 MB (set by the fs.checkpoint.size property), a condition that is checked every five minutes.
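
A minimal sketch of the manually triggered checkpoint mentioned in step 6, run as the HDFS administrator; -saveNamespace only succeeds while the NameNode is in safe mode.

# Put the NameNode into safe mode so the namespace cannot change
hdfs dfsadmin -safemode enter

# Persist the current in-memory namespace to a new fsimage and reset the edit log
hdfs dfsadmin -saveNamespace

# Return to normal operation
hdfs dfsadmin -safemode leave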

5. The YARN Mechanism

Workflow:
(1) The client submits a job to the ResourceManager (this can be a Spark or MapReduce job).

(2) The ResourceManager allocates a container for the job.

(3) The ResourceManager contacts a NodeManager and asks it to launch the application's ApplicationMaster in the container that has just been allocated.

(4) The ApplicationMaster first registers with the ResourceManager; it can then request resources for the individual tasks and monitor how they are running.

(5) The ApplicationMaster requests and claims resources from the ResourceManager by polling (communicating over an RPC protocol).

(6) Once it has obtained resources, the ApplicationMaster contacts the NodeManagers and asks them to start the tasks.

Finally, the NodeManagers launch the tasks that make up the job.

Working mechanism (MapReduce on YARN):
(0) The MR program is submitted from the node where the client runs.

(1) The YarnRunner requests an application from the ResourceManager.

(2) The RM returns the application's resource staging path to the YarnRunner.

(3) The program uploads the resources it needs to run to HDFS.

(4) Once the resources have been uploaded, it requests that the MRAppMaster be run.

(5) The RM turns the user's request into a task.

(6) One of the NodeManagers picks up the task.

(7) That NodeManager creates a container and starts the MRAppMaster in it.

(8) The container copies the resources from HDFS to the local node.

(9) The MRAppMaster asks the RM for resources to run the MapTasks.

(10) The RM assigns the MapTasks to two other NodeManagers, which each pick up a task and create a container.

(11) The MRAppMaster sends the program startup scripts to the two NodeManagers that received the tasks; the two NodeManagers start the MapTasks, which partition and sort the data.

(12) Once all MapTasks have finished, the MRAppMaster asks the RM for containers to run the ReduceTasks.

(13) The ReduceTasks fetch their partitions of the data from the MapTasks.

(14) When the program has finished, the MRAppMaster asks the RM to deregister it (an end-to-end example submission is sketched below).
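
To watch this flow end to end, you can submit one of the example jobs bundled with Hadoop. The jar path, the /input and /output directories, and the output file name below are assumptions that depend on your installation and data.

# Submit the bundled WordCount example (jar path/version is installation-specific; /output must not exist yet)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output

# While it runs, list YARN applications and their states
yarn application -list

# Inspect the result written by the reduce task
hdfs dfs -cat /output/part-r-00000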

6. The HDFS Replication Mechanism

(1) Block: the data block

The most basic storage unit in HDFS.
Default block size: 128 MB (in 2.x).
Replication:
Purpose: to avoid data loss.
The default replication factor is 3.

(2) Replica placement policy (a way to inspect it is sketched below):

One replica on a node in the local rack.
One replica on a different node in the same rack.
One replica on a node in a different rack.
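
A small sketch for checking and changing replication, using the demo path from earlier in this article:

# Change the replication factor of a file to 2 and wait for re-replication to finish
hdfs dfs -setrep -w 2 /mydemo/zjw/zjw.txt

# fsck shows each block of the file and the DataNodes (and racks) holding its replicas
hdfs fsck /mydemo/zjw/zjw.txt -files -blocks -locations -racks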

7. HDFS High Availability

1. In the 1.x releases, the NameNode is a single point of failure.
2. This is addressed in 2.x:
HDFS Federation, in which multiple namespaces share the DataNode resources.
Active NameNode: serves client requests.
Standby NameNode: can be switched to Active when the Active NameNode fails (see the sketch below).
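
Assuming an HA pair whose NameNode IDs are configured as nn1 and nn2 (these IDs are defined in hdfs-site.xml and are assumptions here), the current roles can be checked and a manual failover triggered with hdfs haadmin:

# Show which NameNode is currently Active and which is Standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually fail over from nn1 to nn2 (requires HA and fencing to be configured)
hdfs haadmin -failover nn1 nn2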

8. Reading a File from HDFS

1. The client opens the file with the open() call;

2. DistributedFileSystem calls the NameNode (the metadata node) over RPC to obtain the file's block information;

3. For each block, the NameNode returns the addresses of the DataNodes that store it;

4. DistributedFileSystem returns an FSDataInputStream to the client for reading the data;

5. The client calls read() on the FSDataInputStream to start reading;

6. The FSDataInputStream connects to the closest DataNode that stores the first block of the file;

7. Data is streamed from that DataNode to the client;

8. When the block has been read completely, the FSDataInputStream closes the connection to that DataNode and connects to the closest DataNode holding the next block of the file;

9. When the client has finished reading, it calls close() on the FSDataInputStream;

10. If the client hits an error while communicating with a DataNode during the read, it tries the next DataNode that holds the block. Failed DataNodes are remembered and not contacted again.

9. Writing a File to HDFS

1. Using the client library provided by HDFS, the client issues an RPC request to the remote NameNode;

2. The NameNode checks whether the file to be created already exists and whether the creator has permission to perform the operation. On success it creates a record for the file; otherwise an exception is raised on the client;

3. When the client starts writing the file, the client library splits it into packets, manages them internally in a "data queue", and asks the NameNode for new blocks, obtaining a list of suitable DataNodes to store the replicas; the size of the list depends on the replication setting on the NameNode;

4. The packets are then written to all of the replicas through a pipeline. The client library streams each packet to the first DataNode; that DataNode stores the packet and forwards it to the next DataNode in the pipeline, and so on until the last DataNode, so the data flows through in a pipelined fashion;

5. After the last DataNode has stored a packet successfully, an ack packet is passed back along the pipeline to the client. The client library maintains an "ack queue" and removes a packet from it once the corresponding ack has been received from the DataNodes;

6. If a DataNode fails during the transfer, the current pipeline is closed and the failed DataNode is removed from it; the remaining data continues to be written through the remaining DataNodes in pipeline fashion, and the NameNode allocates a new DataNode to keep the number of replicas at the configured value.

10. HDFS File Formats

1. HDFS can store files of any type, in any format:
(1) text or binary
(2) uncompressed or compressed
2. For the best MapReduce processing, files should be splittable, for example:
(1) SequenceFile
(2) Avro File
(3) RCFile & ORCFile
(4) Parquet File
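
For example, -text (unlike -cat) decodes SequenceFiles and Avro data files before printing them; the path below is only a placeholder.

# Print a binary SequenceFile as readable text; -cat would show raw bytes
hdfs dfs -text /mydemo/data.seq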
