004 Hadoop: A First Look at HDFS

1. A First Look at HDFS

HDFS: Hadoop Distributed File System
Use case: write once, read many times. HDFS itself does not support in-place modification, which makes it a good fit for data analysis but a poor fit as a network drive.
Advantages: high fault tolerance, well suited to processing large data sets, built on inexpensive commodity machines.
Disadvantages: not suitable for low-latency access, inefficient at storing large numbers of small files, no support for concurrent writes to the same file or for random modification of files.

1.1 HDFS Architecture

  • NameNode: manages the block mapping (namespace metadata), configures the replication policy, and handles client read/write requests;
  • DataNode: stores the actual data blocks and performs block read/write operations;
  • Client: splits files into blocks before uploading them to HDFS, obtains block location information from the NameNode, reads/writes data from/to DataNodes, and manages and accesses HDFS through commands (e.g. formatting the NameNode, create/delete/update/query operations);
  • Secondary NameNode: not a hot standby for the NameNode; it offloads work from it (periodically merging the Fsimage and Edits files) and can assist in recovering the NameNode;

1.2 HDFS Block Size

The block size is configured with dfs.blocksize in hdfs-default.xml. The usual sizing rule is to keep the seek time at a small fraction (roughly 1%) of the transfer time, so BlockSize ≈ transfer time × disk transfer rate: with a seek time of about 10 ms, the target transfer time is about 1 s, and at roughly 100 MB/s that gives ~100 MB, hence the 128 MB default. (Note: HDFS does not allow multiple writers to the same file.) The block size can also be overridden per client or per file, as in the sketch after the property definition below.
Think about it: if the block is too large, the time to transfer the data from disk far exceeds the time to locate the block, so processing a single block becomes very slow; if the block is too small, seek time dominates the transfer time. Either way the HDFS program ends up waiting.

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>
      The default block size for new files, in bytes.
      You can use the following suffix (case insensitive):
      k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
      Or provide complete size in bytes (such as 134217728 for 128 MB).
  </description>
</property>
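
Because dfs.blocksize is a client-side, per-file attribute applied when a file is written, it can also be overridden without touching the cluster configuration. A minimal sketch using the Java client that Section 2 sets up (the 256 MB value and the paths are placeholders, not from the original post):

package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side override of the default 128 MB block size (size suffixes such as "m" are accepted)
        conf.set("dfs.blocksize", "256m");
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9820"), conf, "atguigu");

        // Per-file override: the last argument of create() is the block size in bytes
        FSDataOutputStream out = fs.create(
                new Path("/sanguo/shuguo/bigfile.dat"), true, 4096, (short) 3, 256L * 1024 * 1024);
        out.close();

        fs.close();
    }
}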

1.3 HDFS Shell Operations

In a distributed setup these two forms are equivalent: hadoop fs <command> OR hdfs dfs <command>.
The difference between hadoop fs, hadoop dfs, and hdfs dfs: hadoop fs works with any filesystem Hadoop supports (local, HDFS, etc.), hadoop dfs is deprecated, and hdfs dfs operates on HDFS only.

# List all supported commands
[atguigu@hadoop102 ~]$ hadoop fs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] [-s <sleep interval>] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]
View the help for a specific command:
[atguigu@hadoop102 ~]$ hadoop fs -help rm
-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ... :
  Delete all files that match the specified file pattern. Equivalent to the Unix
  command "rm <src>"
                                                                                 
  -f          If the file does not exist, do not display a diagnostic message or 
              modify the exit status to reflect an error.                        
  -[rR]       Recursively deletes directories.                                   
  -skipTrash  option bypasses trash, if enabled, and immediately deletes <src>.  
  -safely     option requires safety confirmation, if enabled, requires          
              confirmation before deleting large directory with more than        
              <hadoop.shell.delete.limit.num.files> files. Delay is expected when
              walking over large directory recursively to count the number of    
              files to be deleted before the confirmation. 
Exchanging files between the local filesystem and HDFS
# Move (cut) a local file to HDFS
[atguigu@hadoop102 mydata]$ hadoop fs -moveFromLocal ./kongming.txt /sanguo/shuguo
# Copy a local file to HDFS: copyFromLocal is equivalent to put
[atguigu@hadoop102 mydata]$ hadoop fs -copyFromLocal liubei.txt /sanguo/shuguo
[atguigu@hadoop102 mydata]$ hadoop fs -put liubei.txt /sanguo/shuguo
# Append a local file's contents to an HDFS file; the HDFS file is created if it does not exist
[atguigu@hadoop102 mydata]$ hadoop fs -appendToFile liubei.txt /sanguo/shuguo/kongming.txt
# Copy a file from HDFS to the local filesystem: copyToLocal is equivalent to get
[atguigu@hadoop102 mydata]$ hadoop fs -copyToLocal /sanguo/shuguo/kongming.txt /opt/module/hadoop-3.1.3/mydata/
[atguigu@hadoop102 mydata]$ hadoop fs -get /sanguo/shuguo/kongming.txt /opt/module/hadoop-3.1.3/
# Download multiple HDFS files merged into a single local file
[atguigu@hadoop102 mydata]$ hadoop fs -getmerge /sanguo/shuguo/* /opt/module/hadoop-3.1.3/mydata/text.txt
Operating directly on HDFS
# ls, mkdir, cat, chgrp, chmod, chown, cp, mv, rm, rmdir work the same as on a Linux filesystem
[atguigu@hadoop102 mydata]$ hadoop fs -ls /
Found 4 items
drwxr-xr-x   - atguigu supergroup          0 2021-12-12 22:09 /input
drwxr-xr-x   - atguigu supergroup          0 2021-12-12 22:10 /output
drwxr-xr-x   - atguigu supergroup          0 2021-12-14 21:21 /sanguo
drwxrwx---   - atguigu supergroup          0 2021-12-12 22:09 /tmp
# tail shows the last 1 KB of the file
[atguigu@hadoop102 mydata]$ hadoop fs -tail /sanguo/shuguo/kongming.txt
# Show directory size statistics (first column: file size; second: space consumed across all replicas)
[atguigu@hadoop102 mydata]$ hadoop fs -du /sanguo/shuguo
35  105  /sanguo/shuguo/kongming.txt
13  39   /sanguo/shuguo/liubei.txt
[atguigu@hadoop102 mydata]$ hadoop fs -du -s /sanguo/shuguo
48  144  /sanguo/shuguo
# Set the replication factor of a file: the actual number of replicas never exceeds the number of DataNodes
[atguigu@hadoop102 mydata]$ hadoop fs -setrep 10 /sanguo/shuguo/kongming.txt
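
All of these shell operations have programmatic counterparts in the FileSystem API introduced in Section 2. A minimal sketch of the put/get/setrep equivalents (same example paths as above; exception handling omitted):

package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                new URI("hdfs://hadoop102:9820"), new Configuration(), "atguigu");

        // hadoop fs -put liubei.txt /sanguo/shuguo
        fs.copyFromLocalFile(new Path("liubei.txt"), new Path("/sanguo/shuguo"));

        // hadoop fs -get /sanguo/shuguo/kongming.txt ./
        fs.copyToLocalFile(new Path("/sanguo/shuguo/kongming.txt"), new Path("./"));

        // hadoop fs -setrep 10 /sanguo/shuguo/kongming.txt
        fs.setReplication(new Path("/sanguo/shuguo/kongming.txt"), (short) 10);

        fs.close();
    }
}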

2. Installing and Configuring Hadoop on Windows 10

Download winutils 3.1.0 (these files are required for Hadoop to run on Windows) and configure the environment variables, typically setting HADOOP_HOME to the winutils directory and adding %HADOOP_HOME%\bin to Path.

2.1 Creating a Maven Project

Add the required dependency coordinates to pom.xml:
<dependencies>
    <dependency>
        <!-- testing -->
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <!-- logging -->
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>2.12.0</version>
    </dependency>
    <dependency>
        <!-- Hadoop client -->
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
</dependencies>
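
The log4j-slf4j-impl dependency expects a log4j2 configuration file on the classpath; without one, the client only prints a StatusLogger warning and error-level output. A minimal log4j2.xml under src/main/resources works as a starting point (the pattern and level below are just reasonable defaults, not from the original post):

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="error">
    <Appenders>
        <!-- Print all log output to the console -->
        <Console name="console" target="SYSTEM_OUT">
            <PatternLayout pattern="%d %p [%c] - %m%n"/>
        </Console>
    </Appenders>
    <Loggers>
        <Root level="info">
            <AppenderRef ref="console"/>
        </Root>
    </Loggers>
</Configuration>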
The first Hadoop program:
package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class HdfsClient {
    @Test
    public void testMkdirs() throws IOException, InterruptedException, URISyntaxException {

        // 1. Get the file system
        Configuration configuration = new Configuration();
        // To run against the cluster via the default filesystem instead:
        // configuration.set("fs.defaultFS", "hdfs://hadoop102:9820");
        // FileSystem fs = FileSystem.get(configuration);
        // Arguments: HDFS URI, configuration object, user to operate as
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9820"), configuration, "atguigu");

        // 2. Create the directory
        fs.mkdirs(new Path("/1108/daxian/banzhang"));

        // 3. Close the resource
        fs.close();
    }
}
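
As a follow-up, the same FileSystem handle can be used to inspect what was written: file length, replication factor, block size and block locations. A sketch of a second test method in the same class (additional imports needed: org.apache.hadoop.fs.LocatedFileStatus, org.apache.hadoop.fs.RemoteIterator and java.util.Arrays; the /sanguo path is the example directory used earlier):

    @Test
    public void testListFiles() throws IOException, InterruptedException, URISyntaxException {
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9820"), new Configuration(), "atguigu");

        // Recursively list the files under /sanguo and print their block details
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/sanguo"), true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            System.out.println(status.getPath()
                    + "  len=" + status.getLen()
                    + "  replication=" + status.getReplication()
                    + "  blockSize=" + status.getBlockSize());
            // Each BlockLocation shows which DataNodes hold a copy of the block
            System.out.println(Arrays.toString(status.getBlockLocations()));
        }

        fs.close();
    }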