1. Getting to Know HDFS
HDFS: Hadoop Distributed File System
Use case: write once, read many times. HDFS does not support modifying file contents in place, so it suits data-analysis workloads rather than use as a network drive.
Advantages: high fault tolerance, good fit for big-data processing, runs on inexpensive commodity hardware.
Disadvantages: unsuitable for low-latency access, inefficient at storing large numbers of small files, and no support for concurrent writers or random file modification.
1.1. HDFS Architecture
- NameNode: maintains the mapping from files to data blocks, configures the replication policy, and handles client read/write requests;
- DataNode: stores the actual data blocks and performs block read/write operations;
- Client: splits files into blocks before upload, obtains block location information from the NameNode, reads/writes data directly from/to DataNodes, and manages (e.g., formatting the NameNode) and accesses HDFS (create/read/update/delete) via commands;
- Secondary NameNode: not a hot standby for the NameNode; it offloads work from the NameNode (periodically merging the Fsimage and Edits files) and can assist in recovering a failed NameNode. The sketch below shows the NameNode/DataNode division of labor from a client's point of view.
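A minimal sketch of that division of labor, using the Java client set up in section 2 (the URI hdfs://hadoop102:9820, the user atguigu, and the file path reuse values from examples later in this document; adjust them to your cluster): the block-location lookup below is answered entirely by the NameNode's metadata, and the hosts it returns are the DataNodes holding each replica.

package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        // Connect as user atguigu; the NameNode serves all metadata queries below
        FileSystem fs = FileSystem.get(
                new URI("hdfs://hadoop102:9820"), new Configuration(), "atguigu");
        FileStatus status = fs.getFileStatus(new Path("/sanguo/shuguo/kongming.txt"));
        // Metadata query answered by the NameNode: which blocks, on which DataNodes
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close(); // the block bytes themselves would be read from the DataNodes directly
    }
}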
1.2. HDFS Block Size
The block size is configured via dfs.blocksize in hdfs-default.xml. Sizing rule of thumb: BlockSize = block transfer time × disk transfer rate, where the transfer time is chosen so that the seek time is only a small fraction of it (about 1% is the commonly cited guideline); this way seeking and transferring dovetail smoothly. (Note: HDFS does not allow multiple threads to write to the same file.)
Consider the failure modes: if the block is too small, seek time dominates transfer time and the cluster spends its time locating blocks; if the block is too large, each block takes so long to transfer and process (e.g., one MapReduce task per block) that individual tasks become slow. Either way the HDFS job ends up waiting.
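A quick worked example under common textbook assumptions (seek time ≈ 10 ms, disk transfer rate ≈ 100 MB/s): for the seek to be about 1% of the transfer, the transfer time should be roughly 10 ms / 0.01 = 1 s, giving BlockSize ≈ 1 s × 100 MB/s = 100 MB, which rounds to the 128 MB (134217728-byte) default shown below.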
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>
    The default block size for new files, in bytes.
    You can use the following suffix (case insensitive):
    k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
    Or provide complete size in bytes (such as 134217728 for 128 MB).
  </description>
</property>
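To override the default, set dfs.blocksize in hdfs-site.xml. A minimal sketch assuming a 256 MB target (the value is illustrative, not a recommendation; the size suffixes described above are accepted):

<property>
  <name>dfs.blocksize</name>
  <value>256m</value>
</property>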
1.3. HDFS Shell Operations
In a distributed deployment these two forms behave identically: hadoop fs <command> OR hdfs dfs <command>
The difference between hadoop fs, hadoop dfs, and hdfs dfs: hadoop fs works with any filesystem Hadoop supports, hadoop dfs is deprecated, and hdfs dfs operates on HDFS only.
# List the supported commands
[atguigu@hadoop102 ~]$ hadoop fs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-v] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-head <file>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] [-s <sleep interval>] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines
The general command line syntax is:
command [genericOptions] [commandOptions]
View the help for a specific command
[atguigu@hadoop102 ~]$ hadoop fs -help rm
-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ... :
Delete all files that match the specified file pattern. Equivalent to the Unix
command "rm <src>"
-f If the file does not exist, do not display a diagnostic message or
modify the exit status to reflect an error.
-[rR] Recursively deletes directories.
-skipTrash option bypasses trash, if enabled, and immediately deletes <src>.
-safely option requires safety confirmation, if enabled, requires
confirmation before deleting large directory with more than
<hadoop.shell.delete.limit.num.files> files. Delay is expected when
walking over large directory recursively to count the number of
files to be deleted before the confirmation.
Moving files between HDFS and the local filesystem
# Move (cut) a local file to HDFS
[atguigu@hadoop102 mydata]$ hadoop fs -moveFromLocal ./kongming.txt /sanguo/shuguo
# Copy a local file to HDFS: copyFromLocal is equivalent to put
[atguigu@hadoop102 mydata]$ hadoop fs -copyFromLocal liubei.txt /sanguo/shuguo
[atguigu@hadoop102 mydata]$ hadoop fs -put liubei.txt /sanguo/shuguo
# Append a local file's contents to an HDFS file (the HDFS file is created if it does not exist)
[atguigu@hadoop102 mydata]$ hadoop fs -appendToFile liubei.txt /sanguo/shuguo/kongming.txt
# Copy a file from HDFS to the local filesystem: copyToLocal is equivalent to get
[atguigu@hadoop102 mydata]$ hadoop fs -copyToLocal /sanguo/shuguo/kongming.txt /opt/module/hadoop-3.1.3/mydata/
[atguigu@hadoop102 mydata]$ hadoop fs -get /sanguo/shuguo/kongming.txt /opt/module/hadoop-3.1.3/
# Merge multiple HDFS files and download them as a single local file
[atguigu@hadoop102 mydata]$ hadoop fs -getmerge /sanguo/shuguo/* /opt/module/hadoop-3.1.3/mydata/text.txt
Operating directly on HDFS
# ls, mkdir, cat, chgrp, chmod, chown, cp, mv, rm, and rmdir behave as they do on a Linux filesystem
[atguigu@hadoop102 mydata]$ hadoop fs -ls /
Found 4 items
drwxr-xr-x - atguigu supergroup 0 2021-12-12 22:09 /input
drwxr-xr-x - atguigu supergroup 0 2021-12-12 22:10 /output
drwxr-xr-x - atguigu supergroup 0 2021-12-14 21:21 /sanguo
drwxrwx--- - atguigu supergroup 0 2021-12-12 22:09 /tmp
# tail shows the last 1 KB of a file
[atguigu@hadoop102 mydata]$ hadoop fs -tail /sanguo/shuguo/kongming.txt
# Show directory size statistics (first column: file size; second column: space consumed across all replicas, 3x here because the replication factor is 3)
[atguigu@hadoop102 mydata]$ hadoop fs -du /sanguo/shuguo
35 105 /sanguo/shuguo/kongming.txt
13 39 /sanguo/shuguo/liubei.txt
[atguigu@hadoop102 mydata]$ hadoop fs -du -s /sanguo/shuguo
48 144 /sanguo/shuguo
# Set the replication factor of a file: the factor is recorded in NameNode metadata, but the actual number of replicas never exceeds the number of DataNodes
[atguigu@hadoop102 mydata]$ hadoop fs -setrep 10 /sanguo/shuguo/kongming.txt
2. Installing and Configuring Hadoop on Windows 10
Download winutils 3.1.0 (Hadoop needs these files to run on Windows) and configure the environment variables: typically, point HADOOP_HOME at the winutils directory and add %HADOOP_HOME%\bin to Path.
2.1. Creating a Maven Project
Add the required dependency coordinates to pom.xml:
<dependencies>
  <dependency>
    <!-- testing -->
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
  </dependency>
  <dependency>
    <!-- logging -->
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.12.0</version>
  </dependency>
  <dependency>
    <!-- Hadoop client -->
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.3</version>
  </dependency>
</dependencies>
The first Hadoop program
package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class HdfsClient {
    @Test
    public void testMkdirs() throws IOException, InterruptedException, URISyntaxException {
        // 1. Get the file system
        Configuration configuration = new Configuration();
        // Alternative: set the cluster address in the configuration and let get() pick it up
        // configuration.set("fs.defaultFS", "hdfs://hadoop102:9820");
        // FileSystem fs = FileSystem.get(configuration);
        // Arguments: HDFS URI, configuration object, user to act as
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9820"), configuration, "atguigu");
        // 2. Create the directory
        fs.mkdirs(new Path("/1108/daxian/banzhang"));
        // 3. Release the resource
        fs.close();
    }
}
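As a follow-on, here is a minimal sketch of uploading a file with the same client, mirroring the -put shell command from section 1.3. Add it as another test method in the HdfsClient class above; the local path d:/liubei.txt is a hypothetical example.

@Test
public void testCopyFromLocalFile() throws IOException, InterruptedException, URISyntaxException {
    Configuration configuration = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9820"), configuration, "atguigu");
    // Upload a local file to HDFS (equivalent to: hadoop fs -put)
    // d:/liubei.txt is a hypothetical local path; adjust to your machine
    fs.copyFromLocalFile(new Path("d:/liubei.txt"), new Path("/sanguo/shuguo"));
    fs.close();
}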