hadoop-6 Small file management

爱吃甜食_

Category: hadoop

Article link: https://blog.csdn.net/a3125504x/article/details/106280062

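Why small files need to be managed

Every small file still needs its own metadata entry, and the NameNode keeps that metadata in memory, so a large number of small files wastes memory
Locating (seeking to) a large number of small files wastes time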
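
To put the metadata cost in numbers, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb that each namespace object (a file or a block) costs the NameNode on the order of 150 bytes of heap. That figure, the 128 MB block size, the 100 KB file size, and the class and method names are all assumptions for illustration, not measurements from this article.

```java
/** Rough estimate of the NameNode heap taken by file metadata (illustrative only). */
final class SmallFileMemoryEstimate {
    // Assumption: roughly 150 bytes of NameNode heap per namespace object (rule of thumb).
    static final long BYTES_PER_OBJECT = 150L;

    /**
     * Each file is counted as one file object plus one object per block.
     * Simplification: every file is treated as occupying at least one block.
     */
    static long estimateHeapBytes(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        long blocksPerFile = Math.max(1, (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
        return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // assumed HDFS block size
        long smallFile = 100L * 1024;          // assumed size of one small file: 100 KB
        long fileCount = 10_000_000L;          // ten million small files

        // The same amount of data stored as files of exactly one block each.
        long totalData = fileCount * smallFile;
        long mergedFiles = (totalData + blockSize - 1) / blockSize;

        System.out.println("small files : "
                + estimateHeapBytes(fileCount, smallFile, blockSize) + " bytes");
        System.out.println("merged files: "
                + estimateHeapBytes(mergedFiles, blockSize, blockSize) + " bytes");
    }
}
```

Under these assumptions the same data costs about three orders of magnitude more NameNode heap when it is stored as small files, which is the motivation for the two approaches described next.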

hadoop archive

Under the hood, hadoop archive runs a MapReduce job.

Official documentation: https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html

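Archive command

hadoop archive -archiveName <name> -p <parent path> [-r <replication factor>] <src>* <dest>

name: name of the archive to create. The name must end with .har
parent path: parent directory of the folders that contain the small files; the src paths are relative to it
r: not mentioned in the r1.2.1 documentation linked above; later versions of the Hadoop Archives guide describe -r as the desired replication factor
src: source directories, relative to parent path
dest: destination directory in which the archive is created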

 # dir1 and dir2 are two subdirectories of /testfile; one or more subdirectories can be given
 # dir1 dir2 can also be omitted, in which case everything under /testfile is archived
 hadoop archive -archiveName testhar.har -p /testfile -r 2 dir1 dir2 /tmp

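Listing commands

hdfs dfs -ls -R <file path>

This command lists the directory as it looks after archiving; the output matches what the web UI shows

hdfs dfs -ls -R har://<file path>

This command lists the original files inside the archive; the output matches the directory layout before archiving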

Unarchive command

hdfs dfs -cp har://<file path> <dest path>

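Sequence Files

A SequenceFile stores its data in binary format.
A SequenceFile mainly consists of a series of records; each record is a key-value pair.
A SequenceFile can serve as a container for small files: pack a large number of small files into one SequenceFile, using each small file's name as the record key and its content as the record value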

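SequenceFile structure

A header that starts with 4 bytes (the magic SEQ followed by a one-byte file version number)
A number of records
A number of sync markers placed at irregular positions between records
- A sync marker makes it easy to locate a record boundary. When a seek lands at a position that is not the start of a record, reading resumes from the next sync marker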
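
The idea of packing small files as records and resynchronising on a marker can be sketched with a toy container in plain Java. This is not the real SequenceFile on-disk format: the fixed marker bytes, the marker interval, the length-prefixed record layout, and all class and method names are assumptions made for illustration. A real SequenceFile generates a random 16-byte marker per file and writes an escape value in front of it where a record length would otherwise be.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy record container with sync markers; NOT the real SequenceFile format. */
final class ToySyncContainer {
    // Hypothetical fixed 16-byte marker, assumed never to occur inside record data.
    static final byte[] SYNC = "#SYNC-MARKER-16#".getBytes(StandardCharsets.US_ASCII);
    static final int SYNC_EVERY = 2; // assumption: one marker before every 2nd record

    static String[][] sample() {
        return new String[][] {
            {"a.txt", "alpha"}, {"b.txt", "beta"}, {"c.txt", "gamma"},
            {"d.txt", "delta"}, {"e.txt", "epsilon"}
        };
    }

    private static void put(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    private static void putInt(ByteArrayOutputStream out, int value) {
        put(out, ByteBuffer.allocate(4).putInt(value).array());
    }

    /** Packs {file name, content} pairs as length-prefixed records with periodic markers. */
    static byte[] write(String[][] files) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < files.length; i++) {
            if (i % SYNC_EVERY == 0) put(out, SYNC);
            byte[] key = files[i][0].getBytes(StandardCharsets.UTF_8);
            byte[] value = files[i][1].getBytes(StandardCharsets.UTF_8);
            putInt(out, key.length);
            put(out, key);
            putInt(out, value.length);
            put(out, value);
        }
        return out.toByteArray();
    }

    static boolean isSyncAt(byte[] data, int pos) {
        if (pos < 0 || pos + SYNC.length > data.length) return false;
        for (int i = 0; i < SYNC.length; i++) {
            if (data[pos + i] != SYNC[i]) return false;
        }
        return true;
    }

    /** Returns the start of the first marker at or after 'from', or -1 if there is none. */
    static int nextSync(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + SYNC.length <= data.length; i++) {
            if (isSyncAt(data, i)) return i;
        }
        return -1;
    }

    /** Reads record keys from an arbitrary offset by first moving to the next marker. */
    static List<String> readKeysFrom(byte[] data, int offset) {
        List<String> keys = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        int pos = nextSync(data, offset);
        while (pos >= 0 && pos < data.length) {
            if (isSyncAt(data, pos)) { pos += SYNC.length; continue; }
            int keyLen = buf.getInt(pos);
            keys.add(new String(data, pos + 4, keyLen, StandardCharsets.UTF_8));
            int valueLen = buf.getInt(pos + 4 + keyLen);
            pos += 4 + keyLen + 4 + valueLen; // jump to the next record boundary
        }
        return keys;
    }

    public static void main(String[] args) {
        byte[] data = write(sample());
        System.out.println(readKeysFrom(data, 0));  // starts on a marker
        System.out.println(readKeysFrom(data, 20)); // starts in the middle of a record
    }
}
```

Starting at offset 20, which falls inside the first record, the reader skips ahead to the next marker and therefore returns only the records that follow it. That is the trade-off of sync points: a reader can start anywhere, at the cost of ignoring the partial data before the first marker it finds.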

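Sequence File compression modes

No compression (NONE)
Compression per record (RECORD)
Compression per block, where a block is a SequenceFile block (BLOCK)
- This is the mode used in most cases
- A block contains multiple records, so compression can exploit the similarity between records and achieves a better ratio
- In a SequenceFile, the consecutive records between two sync markers form one block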
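
The claim that block-level compression benefits from similarity between records can be checked with a small experiment using only the JDK. This sketch uses java.util.zip.Deflater as a stand-in codec and made-up sample records; it does not touch the SequenceFile API, and the class and method names are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Compares per-record and per-block compressed sizes using DEFLATE. */
final class RecordVsBlockCompression {
    /** Returns the number of bytes DEFLATE produces for the given input. */
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** RECORD style: compress every record on its own and sum the sizes. */
    static int recordLevelSize(String[] records) {
        int total = 0;
        for (String record : records) {
            total += compressedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** BLOCK style: concatenate all records, then compress them in one pass. */
    static int blockLevelSize(String[] records) {
        return compressedSize(String.join("\n", records).getBytes(StandardCharsets.UTF_8));
    }

    /** Builds n similar records by cycling through a few sample sentences. */
    static String[] sampleRecords(int n) {
        String[] sentences = {
            "Hadoop Common: the common utilities that support the other Hadoop modules.",
            "Hadoop is designed to scale up from single servers to thousands of machines.",
            "Each machine in a Hadoop cluster offers local computation and storage."
        };
        String[] records = new String[n];
        for (int i = 0; i < n; i++) {
            records[i] = sentences[i % sentences.length];
        }
        return records;
    }

    public static void main(String[] args) {
        String[] records = sampleRecords(100);
        System.out.println("record-level bytes: " + recordLevelSize(records));
        System.out.println("block-level bytes : " + blockLevelSize(records));
    }
}
```

With records this repetitive the block-level total is far smaller, because the compressor can refer back to earlier records instead of starting from scratch on each one. Real data is less repetitive, so the gap will usually be narrower.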

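Writing data to a Sequence File

Converting existing data into a SequenceFile is slow.
Rather than writing small files first and then copying them into a SequenceFile, a better option is to write the data directly into a SequenceFile,
skipping the small files as an intermediate step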


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
// Only needed for option 3 below:
// import org.apache.hadoop.io.compress.BZip2Codec;

import java.io.IOException;

public class HDFSOperate {
    // Simulated data source
    private static final String[] TESTDATA = {
            "The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.",
            "It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.",
            "Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer,",
            "so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.",
            "Hadoop Common: The common utilities that support the other Hadoop modules."
    };

    public static void main(String[] args) throws IOException {
        // Output path: the Sequence File to create
        String uri = "hdfs://node01:8020/writeSeFile";
        Configuration conf = new Configuration();
        // Path of the Sequence File on HDFS
        Path path = new Path(uri);
        // Key type of each record
        IntWritable key = new IntWritable();
        // Value type of each record
        Text value = new Text();
        // Writer options: path, key class, value class, compression type
        SequenceFile.Writer.Option pathOption = SequenceFile.Writer.file(path);
        SequenceFile.Writer.Option keyOption = SequenceFile.Writer.keyClass(key.getClass());
        SequenceFile.Writer.Option valueOption = SequenceFile.Writer.valueClass(value.getClass());
        // Compression type: one of NONE | RECORD | BLOCK
        // Option 1: RECORD, no codec specified
        SequenceFile.Writer.Option compressionOption = SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD);
        // Option 2: BLOCK, no codec specified
        //SequenceFile.Writer.Option compressionOption = SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK);
        // Option 3: BLOCK with BZip2Codec; compression takes more time
        /*
        BZip2Codec bZip2Codec = new BZip2Codec();
        bZip2Codec.setConf(conf);
        SequenceFile.Writer.Option compressionOption = SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, bZip2Codec);
        */
        SequenceFile.Writer writer = null;
        try {
            // Create the writer; the compression option only takes effect if it is passed in here
            writer = SequenceFile.createWriter(conf, pathOption, keyOption, valueOption, compressionOption);
            for (int i = 0; i < 100000; i++) {
                // Set the key and value of the current record
                key.set(100 - i);
                value.set(TESTDATA[i % TESTDATA.length]);
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                // Append the record to the end of the SequenceFile
                writer.append(key, value);
            }
        } finally {
            // Close the writer even if an append fails
            IOUtils.closeStream(writer);
        }
    }
}

Viewing the Sequence File

 # | head -100 is optional
 hadoop fs -text hdfs://node01:8020/writeSeFile | head -100

Reading data from a Sequence File

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.IOException;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        // The SequenceFile to read
        String uri = "hdfs://node01:8020/writeSeFile";
        Configuration conf = new Configuration();
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;

        try {
            // Path option for the Reader
            SequenceFile.Reader.Option pathOption = SequenceFile.Reader.file(path);
            reader = new SequenceFile.Reader(conf, pathOption);
            // Instantiate key and value via reflection, using the classes recorded in the file header
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // Position of the record about to be read
            long position = reader.getPosition();

            while (reader.next(key, value)) {
                // Print * after the position if a sync marker was seen while reading this record
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // beginning of next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}

// Source code of the next(key, val) method called in the read loop

/** Read the next key/value pair in the file into <code>key</code> and
     * <code>val</code>.  Returns true if such a pair exists and false when at
     * end of file */
    public synchronized boolean next(Writable key, Writable val)
      throws IOException {
      if (val.getClass() != getValueClass())
        throw new IOException("wrong value class: "+val+" is not "+valClass);

      boolean more = next(key);
      
      if (more) {
        getCurrentValue(val);
      }

      return more;
    }
