1. Data Integrity
Detecting corruption: compute a checksum (e.g. CRC-32) when data first enters the system, and compute it again after the data has been transmitted through a channel; a mismatch indicates corruption.
In HDFS, a datanode verifies data against its checksum before storing the data and its checksum; clients also verify checksums when reading from a datanode. In addition, each datanode runs a DataBlockScanner background process that periodically verifies its stored blocks.
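The checksum-on-write, verify-on-read idea can be sketched in plain JDK Java with java.util.zip.CRC32 (a simplified illustration; the class and method names here are ours, not HDFS's actual ChecksumFileSystem code):

```java
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC-32 checksum over a byte array
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "hello hdfs".getBytes();
        long stored = checksum(data);          // computed when data is first ingested
        // ... data travels through a channel ...
        boolean ok = checksum(data) == stored; // recomputed and compared after transfer
        System.out.println(ok ? "data intact" : "data corrupted");
    }
}
```

A mismatch between the stored and recomputed values signals corruption somewhere along the path.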
2. Compression
These compression tools offer nine options (-1 through -9) controlling the speed/space trade-off: -1 optimizes for speed, -9 for space, e.g.:
% gzip -9 file
Compression ratio: bzip2 > gzip > LZO, LZ4, Snappy
Compression speed: LZO, LZ4, Snappy >> gzip > bzip2
Decompression speed: Snappy, LZ4 >> LZO > gzip > bzip2
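The speed/space trade-off behind the -1/-9 options can be demonstrated in plain JDK Java via java.util.zip.Deflater, which exposes the same compression levels (a sketch; class name LevelDemo and the sample data are ours):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class LevelDemo {
    // Compress input with a given level (1 = optimize speed, 9 = optimize space)
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("abcabcabc");
        byte[] data = sb.toString().getBytes();
        int fast  = compress(data, Deflater.BEST_SPEED).length;       // level 1
        int small = compress(data, Deflater.BEST_COMPRESSION).length; // level 9
        System.out.println("level 1: " + fast + " bytes, level 9: " + small + " bytes");
    }
}
```

On repetitive data like this, level 9 should produce output no larger than level 1, at the cost of more CPU time.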
3. Codecs in Hadoop (a codec is an implementation of a compression-decompression algorithm)
(1) Compressing and decompressing with CompressionCodec
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
/**
 * Compresses data read from standard input and writes it to standard output
 */
public class example1 {
public static void main(String[] args) throws ClassNotFoundException, IOException {
String codecClassname = args[0];
Class<?> codecClass = Class.forName(codecClassname);
Configuration conf = new Configuration();
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
//To compress, wrap the output stream with createOutputStream(); to decompress, use createInputStream()
CompressionOutputStream out = codec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, out, 4096, false);
out.finish();
}
}
//Example usage
% echo "hello" | hadoop example1 org.apache.hadoop.io.compress.GzipCodec | gunzip
hello
(2) Inferring the CompressionCodec with CompressionCodecFactory
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
/**
 * Picks a codec based on the file extension and decompresses the file
 */
public class example2 {
public static void main(String args[]) throws IOException {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path inputPath = new Path(uri);
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(inputPath);
if (codec == null) {
System.err.println("No codec found for " + uri);
System.exit(1);
}
String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
InputStream in = null;
OutputStream out = null;
try{
in = codec.createInputStream(fs.open(inputPath));
out = fs.create(new Path(outputUri));
IOUtils.copyBytes(in,out,conf);
}finally {
IOUtils.closeStream(in);
IOUtils.closeStream(out);
}
}
}
Example:
% hadoop example2 file.gz
(3) CodecPool
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
/**
 * Uses a codec pool to compress data read from standard input
 */
public class example3 {
public static void main(String[] args) throws ClassNotFoundException {
String codecClassName = args[0];
Class<?> codecClass = Class.forName(codecClassName);
Configuration conf = new Configuration();
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass,conf);
Compressor compressor = null;
try{
compressor = CodecPool.getCompressor(codec);
CompressionOutputStream out = codec.createOutputStream(System.out, compressor);
IOUtils.copyBytes(System.in, out, 4096, false);
out.finish();
} catch (IOException e) {
e.printStackTrace();
}finally {
CodecPool.returnCompressor(compressor);
}
}
}
4. Choosing a compression format in Hadoop
From most to least efficient:
container file formats (e.g. sequence files) > a compression format that supports splitting > splitting the file into chunks in the application and compressing each chunk with any format > leaving the file uncompressed
5. Using compression in MapReduce
Alternatively, set the configuration on FileOutputFormat:
//For regular compressed output
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
//For sequence-file output
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
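The same settings can also be expressed as job configuration properties (Hadoop 2.x property names; a configuration sketch, not tested against a running cluster):

```java
Configuration conf = new Configuration();
// Enable compression of the job output
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
// Select the codec (gzip here)
conf.set("mapreduce.output.fileoutputformat.compress.codec",
        "org.apache.hadoop.io.compress.GzipCodec");
// For sequence-file output, also choose the compression granularity
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
```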
6. Serialization
7. SequenceFile
HDFS and MapReduce are optimized for large files, so packing small files into a SequenceFile yields more efficient storage and processing.
(1) Writing a SequenceFile
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import java.io.IOException;
import java.net.URI;
/**
 * Writes key-value pairs to a SequenceFile
 */
public class example4 {
private static final String[] DATA = {
"One, two, buckle my shoe",
"Three, four, shut the door",
"Five, six, pick up sticks",
"Seven, eight, lay them straight",
"Nine, ten, a big fat hen"
};
public static void main(String[] args) throws IOException {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri);
IntWritable key = new IntWritable();
Text value = new Text();
SequenceFile.Writer writer = null;//writer instance for the sequence file
try {
//obtain a writer via SequenceFile.createWriter
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
for (int i = 0; i < 100; i++) {
key.set(100 - i);
value.set(DATA[i % DATA.length]);
//writer.getLength() returns the current position in the file
System.out.printf("[%s]\t%s\t%s\n", writer.getLength(),key,value);
writer.append(key, value);
}
}finally {
IOUtils.closeStream(writer);
}
}
}
The output shows each record's starting position in the file, followed by its key and value.
(2) Reading a SequenceFile
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.IOException;
import java.net.URI;
public class example5 {
public static void main(String[] args) throws IOException {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;//reader instance for the sequence file
try{
reader = new SequenceFile.Reader(fs, path, conf);
Writable key = (Writable)
ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable)
ReflectionUtils.newInstance(reader.getValueClass(), conf);
long position = reader.getPosition();//current position in the file
while(reader.next(key, value)){
String syncSeen = reader.syncSeen()?"*":"";//mark sync points with an asterisk
System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen
, key, value);
position = reader.getPosition();
}
}finally {
IOUtils.closeStream(reader);
}
}
}
The output again shows each record's position (sync points marked with *), followed by its key and value.
(3) Other operations
A sequence file can be displayed in text form with hadoop fs -text, e.g.:
% hadoop fs -text <sequence file>