HDFS API Operations

Article link: https://blog.csdn.net/weixin_46919419/article/details/112465891


HDFS API environment setup

I. Preparation

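1. Set up the Hadoop runtime environment on Windows. Without it, running the code fails with an error about missing files:

winutils.exe and hadoop.dll

Both files can be found on GitHub.

With Hadoop 2.10.1, the API appeared to work even without this setup.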

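Step 1: Put both files in the bin subfolder of a directory whose path contains only English characters and no spaces.

For example: D:\English_path\hadoop\bin
Step 2: Set the environment variable HADOOP_HOME to that directory (here D:\English_path\hadoop) and add %HADOOP_HOME%\bin to Path.

The two files live under bin, which is why \bin is appended.

Step 3: Copy hadoop.dll into C:\Windows\System32.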
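The setup above can be checked from Java before any HDFS code runs. The following is a minimal sketch, not part of the original article: the class and method names (HadoopHomeCheck, missingFiles) are made up for illustration, and it only inspects the bin folder under HADOOP_HOME, not System32.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class HadoopHomeCheck {
    // Returns the names of the required helper files that are missing from <hadoopHome>/bin.
    static List<String> missingFiles(Path hadoopHome) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {"winutils.exe", "hadoop.dll"}) {
            if (!Files.isRegularFile(hadoopHome.resolve("bin").resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String home = System.getenv("HADOOP_HOME");
        if (home == null || home.isEmpty()) {
            System.out.println("HADOOP_HOME is not set");
            return;
        }
        System.out.println("Missing under " + home + "\\bin: " + missingFiles(Paths.get(home)));
    }
}
```

An empty list means both files are where Step 1 and Step 2 expect them.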

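2. Add the Maven dependencies

To find the artifacts, search for "mvn" on Baidu to reach the Maven repository site, then search for the artifact name there.

Add the following dependencies: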

 <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>RELEASE</version>
            <scope>test</scope>
        </dependency>
        <!-- maven-shade-plugin is a build plugin, not a library, so it is declared only under <build><plugins> -->
    </dependencies>

Build plugins:
<build>
        <!-- compiler plugin -->
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- packaging (shade) plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <minimizeJar>true</minimizeJar>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

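API operations: obtaining a FileSystem

I. Accessing data through a URL (for reference only; rarely used)

First upload a file a.txt to the HDFS root for testing.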

// Access data through a URL and copy it to a local file
@Test
public void test1() throws Exception {
    // 1. Register the handler for hdfs:// URLs (this can be done only once per JVM)
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    // 2. Open an input stream on the HDFS file
    InputStream inputStream = new URL("hdfs://node01:8020/a.txt").openStream();
    // 3. Open an output stream on the local file
    FileOutputStream outputStream = new FileOutputStream(new File("D:\\code\\java_code\\HadoopTest\\src\\hello.txt"));
    // 4. Copy the data; IOUtils here must be org.apache.commons.io.IOUtils
    IOUtils.copy(inputStream, outputStream);
    // 5. Close the streams
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(outputStream);
}

After running it, the file has been copied to the local path.
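The copy step can also be written so that the stream is closed even when the copy fails part-way. The sketch below uses only the JDK; the class name UrlCopy is made up for illustration. It works with any URL the JVM can open, so for an hdfs:// URL it still relies on the FsUrlStreamHandlerFactory registration shown in the test above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class UrlCopy {
    // Copies the content behind the URL into target, replacing an existing file.
    // try-with-resources closes the input stream whether or not the copy succeeds.
    static long copy(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The return value is the number of bytes copied, which is handy for a quick sanity check against the file size shown in HDFS.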

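Extra: getting rid of the annoying log4j warnings

Create a file log4j.properties under resources

with the following content (there is no need to understand the syntax yet):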

log4j.rootLogger=info,appender

log4j.appender.appender=org.apache.log4j.ConsoleAppender

log4j.appender.appender.layout=org.apache.log4j.TTCCLayout

The log4j warnings are gone.

Note: in my case the warning came from using a JDK that was too new; with JDK 1.8 it does not appear.

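II. Accessing data through the FileSystem API (important)

The main classes involved are:

1. Configuration: an object of this class holds the client or server configuration.

2. FileSystem: an object of this class represents a file system and provides the methods for operating on files. It is obtained through the static method FileSystem.get.

get decides which concrete class to return based on the value of fs.defaultFS in conf.

If the code does not set fs.defaultFS and no configuration file on the project classpath sets it either, the value comes from core-default.xml inside the Hadoop jar.

That default is file:///, so what you get is not a DistributedFileSystem instance but a client for the local file system.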
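A small check makes this default visible. The sketch below assumes the Hadoop client jars are on the classpath and that no core-site.xml on the classpath overrides fs.defaultFS; the class name DefaultFsDemo is made up for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

class DefaultFsDemo {
    // Returns the FileSystem chosen when fs.defaultFS is left untouched.
    static FileSystem defaultFileSystem() throws IOException {
        return FileSystem.get(new Configuration());
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Print the effective setting and the concrete class that get() picked for it.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("implementation = " + FileSystem.get(conf).getClass().getName());
    }
}
```

Under those assumptions the implementation printed should be a local file system class rather than DistributedFileSystem, which is a quick way to notice a forgotten fs.defaultFS setting.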

1. Ways to obtain a FileSystem

// Way 1 to obtain a FileSystem
@Test
public void getFileSystem1() throws Exception {
    // 1. Create a Configuration object
    Configuration configuration = new Configuration();
    // 2. Specify which file system to use
    configuration.set("fs.defaultFS", "hdfs://node01:8020/");
    // 3. Get the file system for that configuration
    FileSystem fileSystem = FileSystem.get(configuration);
    // 4. Print it
    System.out.println(fileSystem);
}

// Way 2 to obtain a FileSystem
@Test
public void getFileSystem2() throws Exception {
    // URI stands for Uniform Resource Identifier
    // This overload of get is shorter to use; it takes two objects: a URI and a Configuration
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    System.out.println("fileSystem:" + fileSystem);
}

// Way 3 to obtain a FileSystem
@Test
public void getFileSystem3() throws Exception {
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://node01:8020/");
    // Same as way 1, with get replaced by newInstance
    FileSystem fileSystem = FileSystem.newInstance(configuration);
    System.out.println(fileSystem);
}

// Way 4 to obtain a FileSystem
@Test
public void getFileSystem4() throws Exception {
    // The URI overload of newInstance
    FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://node01:8020"), new Configuration());
    System.out.println(fileSystem);
}
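The four ways differ in one practical respect that the tests above do not show: as far as I know, get normally returns a shared, cached instance for the same URI and user, while newInstance builds a fresh object on every call. The sketch below illustrates this against the local file system so that it needs no cluster; it assumes the Hadoop client jars are on the classpath and default cache settings, and the class name FileSystemCacheDemo is made up for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

class FileSystemCacheDemo {
    // Two get() calls with the same configuration: compares the returned references.
    static boolean getReturnsSameInstance(Configuration conf) throws IOException {
        return FileSystem.get(conf) == FileSystem.get(conf);
    }

    // Two newInstance() calls with the same configuration: compares the returned references.
    static boolean newInstanceReturnsSameInstance(Configuration conf) throws IOException {
        FileSystem a = FileSystem.newInstance(conf);
        FileSystem b = FileSystem.newInstance(conf);
        try {
            return a == b;
        } finally {
            // Instances from newInstance are private to the caller, so closing them is safe.
            a.close();
            b.close();
        }
    }
}
```

This matters when closing: calling close() on an object obtained from get() closes the shared instance, which other code in the same JVM may still be using, whereas an object from newInstance can be closed without side effects.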

API operations: common tasks

I. Listing all files in HDFS

@Test
public void listMyFiles() throws Exception {
    // 1. Get the FileSystem
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    // 2. listFiles returns a RemoteIterator over files only (directories themselves are not returned);
    //    the first argument is the path to walk, the second says whether to recurse
    RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(new Path("/"), true);
    // 3. Iterate
    while (iterator.hasNext()) {
        LocatedFileStatus fileStatus = iterator.next();
        // fileStatus exposes the path, file name, block locations and more; print the path and name
        System.out.println(fileStatus.getPath() + "---" + fileStatus.getPath().getName());
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        System.out.println("block count: " + blockLocations.length);
    }
    // 4. Close the file system
    fileSystem.close();
}


II. Creating a directory in HDFS

@Test
public void mkdirs() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    // Calling mkdirs on the FileSystem is all it takes; missing parent directories are created too
    boolean mkdirs = fileSystem.mkdirs(new Path("/hello/mydir/test"));
    if (mkdirs) System.out.println("Directory created");
    fileSystem.close();
}

After running it, the directory exists in HDFS.

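III. Creating a file

// create() makes any missing parent directories and returns an output stream;
// close the stream so the new (empty) file is not left open for writing
fileSystem.create(new Path("/hello/mydir/test/a.txt")).close();

The file is created.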

API: upload and download

I. Downloading a file

@Test
public void downloadFile() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    // 1. Open an input stream on the HDFS file
    FSDataInputStream inputStream = fileSystem.open(new Path("/a.txt"));
    // 2. Open an output stream on the local file
    FileOutputStream outputStream = new FileOutputStream(new File("src/hello.txt"));
    // 3. Copy the data
    IOUtils.copy(inputStream, outputStream);
    // 4. Close the streams
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(outputStream);
    fileSystem.close();
}

// A simpler way: download, version 2
@Test
public void downloadFile2() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    // copyToLocalFile does the whole copy; FileSystem has many similar helpers, for example for uploading
    fileSystem.copyToLocalFile(new Path("/a.txt"), new Path("src/hello.txt"));
    fileSystem.close();
}

II. Uploading a file

@Test
public void putData() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
    // Upload with copyFromLocalFile
    // Note: if this local file has been modified, the upload can fail the CRC check against its stale .crc file;
    // rename the file or delete the .crc file
    fileSystem.copyFromLocalFile(new Path("src/a.txt"), new Path("/"));
    fileSystem.close();
}
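The "delete the crc file" workaround mentioned in the comment can be scripted. The sketch below uses only the JDK and rests on one assumption that should be checked against your own project folder: that Hadoop's local checksum file sits next to the data file and is named with a leading dot and a .crc suffix (for a.txt, that would be .a.txt.crc). The class name CrcSidecar is made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class CrcSidecar {
    // Deletes the checksum file assumed to belong to localFile (".<name>.crc" in the same folder).
    // Returns true if a checksum file existed and was removed.
    static boolean deleteStaleChecksum(Path localFile) throws IOException {
        Path crc = localFile.resolveSibling("." + localFile.getFileName() + ".crc");
        return Files.deleteIfExists(crc);
    }
}
```

Calling CrcSidecar.deleteStaleChecksum(Paths.get("src/a.txt")) before putData would remove a leftover checksum from an earlier download, if one exists.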

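API: access permission control

I. Test

Permission checking had been turned off in hdfs-site.xml,

so the rw bits have no effect.

Let us verify that.

Remove all permissions from a.txt:

hdfs dfs -chmod 000 /a.txt

Then run the download method written earlier.

The download succeeds, which shows the permission bits are being ignored.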

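Now turn permission checking on.

First stop HDFS:

stop-dfs.sh

Then set the permission switch in the configuration file to true,

and distribute the file to the other machines:

scp hdfs-site.xml node02:$PWD

Then start HDFS again.

Downloading a.txt from the web UI is now refused,

and the API code fails as well.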

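Now give the file's owner read and write permission:

hdfs dfs -chmod 600 /a.txt

It still cannot be read, because a.txt is owned by root and the Windows user falls under "other".

After changing the mode to 666 it can be read.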
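The outcome of the three chmod experiments follows from how the mode digits are read. The sketch below is a deliberately simplified model written for this article, not HDFS's actual permission checker: it only distinguishes the owner from everyone else, and it ignores the group digit and the superuser, who bypasses these checks. The class name PermissionModel and the user name "winuser" are made up for illustration.

```java
class PermissionModel {
    // mode is the octal permission (for example 0600); returns true if user may read the file.
    static boolean canRead(int mode, String owner, String user) {
        // The owner is checked against the first octal digit, everyone else against the last one.
        int shift = user.equals(owner) ? 6 : 0;
        // Bit value 4 within a digit is the read permission.
        return ((mode >> shift) & 4) != 0;
    }

    public static void main(String[] args) {
        System.out.println("600, winuser: " + canRead(0600, "root", "winuser"));
        System.out.println("666, winuser: " + canRead(0666, "root", "winuser"));
    }
}
```

With mode 600 only the first digit grants read access, so a non-owner is refused; with 666 the last digit grants it as well.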

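II. To access a file you do not own, use a technique called user impersonation

Use this overload of get:

get(final URI uri, final Configuration conf, String user)

The user argument sets which user the access is made as.

FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration(),"root");

With the impersonated user, the file is accessed as root.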
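The same effect can be spelled out with UserGroupInformation, which makes it clearer what "acting as another user" means on the client side. This is a sketch under assumptions: the Hadoop client jars are on the classpath, the cluster uses simple (non-Kerberos) authentication so that it trusts the user name the client sends, and the class and method names (RunAsUser, openAs) are made up for illustration.

```java
import java.net.URI;
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

class RunAsUser {
    // Obtains a FileSystem whose requests are made under the given user name.
    static FileSystem openAs(String user, final URI uri, final Configuration conf) throws Exception {
        // A "remote user" carries only a name; no password or ticket is involved.
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);
        return ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
            @Override
            public FileSystem run() throws Exception {
                return FileSystem.get(uri, conf);
            }
        });
    }
}
```

For example, RunAsUser.openAs("root", new URI("hdfs://node01:8020"), new Configuration()) would be used in place of the three-argument get. Because the server simply accepts the claimed name in this mode, the technique is suitable for learning setups like this one and not as a security mechanism.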

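API: merging small files

Merging from the command line

Option 1:

This command merges many HDFS files into one large file and downloads it to the local machine:

hdfs dfs -getmerge /config/*.xml ./hello.xml

The * matches all the xml files in that directory.

We demonstrate with txt files:

hdfs dfs -getmerge /a.txt /hello.txt ./big.txt

The merge succeeds.

Option 2 (used more often):

Merge the small files into one large file while uploading.

This requires a Java program.

Code: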

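@Test
public void mergeFile() throws Exception {
    // Permissions were enabled earlier, so impersonate root here; while learning it is easier to leave permission checking off
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration(), "root");
    // 1. Open an output stream for the large file
    FSDataOutputStream outputStream = fileSystem.create(new Path("/bigfile.txt"));
    // 2. Get a local file system (gives access to the local disk)
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    // 3. List the files in the local directory as an array
    FileStatus[] fileStatuses = local.listStatus(new Path("src/input"));
    // Loop over the small files
    for (FileStatus fileStatus : fileStatuses) {
        // 4. Open an input stream on the small file
        FSDataInputStream inputStream = local.open(fileStatus.getPath());
        // 5. Append the small file's data to the large file
        IOUtils.copy(inputStream, outputStream);
        // Close each input stream once it has been copied
        IOUtils.closeQuietly(inputStream);
    }
    // 6. Release resources
    IOUtils.closeQuietly(outputStream);
    local.close();
    fileSystem.close();
}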

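Download it from the web UI and inspect it:

the merge succeeded.

After merging, the data takes up only one file's worth of metadata, which eases the load on the NameNode.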
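The merge can be packaged as a reusable helper. The following is a sketch rather than the article's original code: it assumes the Hadoop client jars are on the classpath, the class and method names (SmallFileMerger, merge) are made up for illustration, and it makes three choices of its own. It skips subdirectories, it sorts the inputs by path so the order inside the big file is predictable, and it uses try-with-resources so a failure while closing the output is reported instead of being swallowed.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

class SmallFileMerger {
    // Concatenates every regular file directly under srcDir (on srcFs) into target (on dstFs).
    // Returns the number of files merged. target should not live inside srcDir.
    static int merge(FileSystem srcFs, Path srcDir, FileSystem dstFs, Path target) throws IOException {
        int merged = 0;
        try (FSDataOutputStream out = dstFs.create(target, true)) {
            FileStatus[] statuses = srcFs.listStatus(srcDir);
            Arrays.sort(statuses); // FileStatus compares by path
            for (FileStatus status : statuses) {
                if (!status.isFile()) {
                    continue; // skip subdirectories
                }
                try (FSDataInputStream in = srcFs.open(status.getPath())) {
                    // false: leave both streams open; try-with-resources closes them
                    IOUtils.copyBytes(in, out, 4096, false);
                }
                merged++;
            }
        }
        return merged;
    }
}
```

In the scenario above it would be called as SmallFileMerger.merge(FileSystem.getLocal(conf), new Path("src/input"), fileSystem, new Path("/bigfile.txt")), where conf and fileSystem are the objects created in mergeFile.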