Solutions to the HDFS small-files problem

ukakasu

Article link: https://blog.csdn.net/ukakasu/article/details/47205489

Small-file solution: merging handled by the application itself

package small;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.commons.io.IOUtils;

public class Mes {

	public static void main(String[] args) throws Exception {
		FileSystem fs = FileSystem.newInstance(new URI(
				"hdfs://192.168.1.182:9000"), new Configuration());
		Path path = new Path("/me4");
		// A single HDFS file receives the contents of every local file.
		FSDataOutputStream out = fs.create(path);
		File dir = new File("E:\\test");
		// Fixed-size copy buffer, independent of file size (also safe for empty files).
		byte[] buffer = new byte[4096];
		for (File file : dir.listFiles()) {
			System.out.println(file);
			BufferedInputStream in = new BufferedInputStream(new FileInputStream(file));
			int len;
			while ((len = in.read(buffer)) != -1) {
				// Write only the bytes actually read in this pass.
				out.write(buffer, 0, len);
			}
			in.close();
		}
		out.close();
	}

}

For text files, the read/write part can also be written as:

          for (File fileName : dir.listFiles()) {
               System.out.println(fileName.getAbsolutePath());
               final FileInputStream in = new FileInputStream(fileName.getAbsolutePath());
               // Note: this IOUtils is org.apache.commons.io.IOUtils;
               // org.apache.hadoop.io.IOUtils provides readFully instead.
               // Assumes the files are UTF-8 text.
               final List<String> readLines = IOUtils.readLines(in, "UTF-8");
               for (String line : readLines) {
                    // readLines strips line terminators, so add one back.
                    out.write((line + "\n").getBytes("UTF-8"));
               }
               in.close();
          }
          out.close();
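
Plain concatenation as above discards the boundaries between the original files, so they cannot be recovered individually later. One possible workaround, sketched below with the standard library only, is to prefix each file with its name and length. The class FramedMerge and its methods writeRecord and readAll are hypothetical names used for illustration and are not part of the original code; the in-memory streams stand in for the HDFS stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a length-prefixed container for many small files.
public class FramedMerge {

    /** Appends one record: file name, payload length, payload bytes. */
    static void writeRecord(DataOutputStream out, String name, byte[] data) throws IOException {
        out.writeUTF(name);
        out.writeInt(data.length);
        out.write(data);
    }

    /** Reads records back, in order, until the stream ends. */
    static Map<String, byte[]> readAll(DataInputStream in) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        while (true) {
            String name;
            try {
                name = in.readUTF();
            } catch (EOFException end) {
                return files; // no more records
            }
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            files.put(name, data);
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the merged HDFS file.
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(merged)) {
            writeRecord(out, "a.txt", "hello".getBytes(StandardCharsets.UTF_8));
            writeRecord(out, "b.txt", new byte[0]);
        }
        Map<String, byte[]> files = readAll(
                new DataInputStream(new ByteArrayInputStream(merged.toByteArray())));
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```

Because FSDataOutputStream extends DataOutputStream, writeRecord could in principle be given the stream returned by fs.create(path). SequenceFile, covered next, provides this kind of key/value framing out of the box.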

Small-file solution: SequenceFile

package small;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class Seq {

	public static void main(String[] args) throws Exception{
		FileSystem fs = FileSystem.newInstance(new URI(
				"hdfs://192.168.1.182:9000"), new Configuration());
		Path seqPath = new Path("/seq3.seq");
		
		SequenceFile.Writer writer = SequenceFile.createWriter(fs, new Configuration(), seqPath, Text.class, Text.class);
		File dir = new File("E:\\test");
		
		Text key = new Text();
		// Text expects UTF-8 content; for binary files BytesWritable is the usual value type.
		Text value = new Text();
		for (File file : dir.listFiles()) {
			System.out.println(file);
			// Key: the file name; value: the file's entire contents.
			key.set(file.getName());
			byte[] buffer = new byte[(int) file.length()];
			InputStream in = new BufferedInputStream(new FileInputStream(file));
			try {
				IOUtils.readFully(in, buffer, 0, buffer.length);
			} finally {
				// Close each file's stream before moving on to the next file.
				in.close();
			}
			value.set(buffer);
			writer.append(key, value);
		}
		IOUtils.closeStream(writer);
		
		SequenceFile.Reader reader=new SequenceFile.Reader(fs,seqPath,new Configuration());
		Text key2=new Text();
		Text value2=new Text();
		while(reader.next(key2,value2)){
		     System.out.println(key2);
		     System.out.println(value2);
		}
		IOUtils.closeStream(reader);
	}
}

MapFile

Configuration conf=new Configuration();
FileSystem fs=FileSystem.get(conf);
Path mapFile=new Path("mapFile.map");

// The Writer inner class handles writes; key and value are both assumed to be Text
MapFile.Writer writer=new MapFile.Writer(conf,fs,mapFile.toString(),Text.class,Text.class);

// Append records through the writer (MapFile requires keys in ascending order)
writer.append(new Text("key"),new Text("value"));
IOUtils.closeStream(writer);// close the writer

// The Reader inner class handles reads
MapFile.Reader reader=new MapFile.Reader(fs,mapFile.toString(),conf);

// Read the records back through the reader
Text key=new Text();
Text value=new Text();
while(reader.next(key,value)){
     System.out.println(key);
     System.out.println(value);
}
IOUtils.closeStream(reader);// close the reader

Hadoop Archives

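Hadoop Archives (HAR files) were introduced in version 0.18.0 to relieve the NameNode memory pressure caused by large numbers of small files. A HAR file works by building a layered file system on top of HDFS. It is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into the HAR. For the client, nothing changes when using a HAR: all the original files remain visible and accessible (through a har:// URL). On the HDFS side, however, the number of files is reduced.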

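Reading a file through a HAR is no more efficient than reading it directly from HDFS, and may in fact be slightly slower, because every access to a file in a HAR requires reading two levels of index files in addition to the file data itself. And although HAR files can be used as input to a MapReduce job, there is no special mechanism that lets map tasks treat the files packed in a HAR as a single HDFS file.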
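
To make the cost of the two index levels concrete, here is a toy in-memory model of such a lookup. It is only a sketch under simplifying assumptions: the class HarLookupSketch, its Entry type, and the path-ordered master index are hypothetical and do not reproduce the real HAR on-disk format. It only counts how many reads a single file access needs.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a two-level index in front of packed data (not the real HAR format).
public class HarLookupSketch {

    /** Hypothetical index entry: position of one packed file inside the part data. */
    static final class Entry {
        final int offset;
        final int length;
        Entry(int offset, int length) { this.offset = offset; this.length = length; }
    }

    // Level 1: first path covered by an index block -> block number.
    private final TreeMap<String, Integer> masterIndex = new TreeMap<>();
    // Level 2: one map per index block, from path to its entry.
    private final List<Map<String, Entry>> indexBlocks = new ArrayList<>();
    // All packed file contents, back to back.
    private final byte[] partData;
    /** Number of simulated reads performed so far. */
    int reads = 0;

    HarLookupSketch(TreeMap<String, byte[]> files, int entriesPerBlock) {
        ByteArrayOutputStream part = new ByteArrayOutputStream();
        Map<String, Entry> block = null;
        for (Map.Entry<String, byte[]> f : files.entrySet()) {
            if (block == null || block.size() == entriesPerBlock) {
                // Start a new index block and register it in the master index.
                block = new HashMap<>();
                masterIndex.put(f.getKey(), indexBlocks.size());
                indexBlocks.add(block);
            }
            block.put(f.getKey(), new Entry(part.size(), f.getValue().length));
            part.write(f.getValue(), 0, f.getValue().length);
        }
        partData = part.toByteArray();
    }

    /** Returns the file contents, or null if the path is not in the archive. */
    byte[] read(String path) {
        reads++; // read 1: master index
        Map.Entry<String, Integer> m = masterIndex.floorEntry(path);
        if (m == null) return null;
        reads++; // read 2: index block
        Entry e = indexBlocks.get(m.getValue()).get(path);
        if (e == null) return null;
        reads++; // read 3: the file data itself
        return Arrays.copyOfRange(partData, e.offset, e.offset + e.length);
    }

    public static void main(String[] args) {
        TreeMap<String, byte[]> files = new TreeMap<>();
        files.put("/a.txt", "alpha".getBytes(StandardCharsets.UTF_8));
        files.put("/b.txt", "beta".getBytes(StandardCharsets.UTF_8));
        files.put("/c.txt", "gamma".getBytes(StandardCharsets.UTF_8));
        HarLookupSketch har = new HarLookupSketch(files, 2);
        byte[] data = har.read("/c.txt");
        // Prints: gamma after 3 reads
        System.out.println(new String(data, StandardCharsets.UTF_8) + " after " + har.reads + " reads");
    }
}
```

In this model a direct read would cost one step, while the archived read costs three, which is the overhead the paragraph above describes.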

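Create an archive: hadoop archive -archiveName xxx.har -p /src /dest
List its contents: hadoop fs -lsr har:///dest/xxx.har (on newer releases, where -lsr is deprecated, use hadoop fs -ls -R har:///dest/xxx.har)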