Compressing Hive intermediate and final output data

Reference: Hadoop: The Definitive Guide, 2nd Edition, p. 79.
Hadoop's default compression codec:
DEFLATE org.apache.hadoop.io.compress.DefaultCodec
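DefaultCodec wraps the zlib DEFLATE algorithm. As a minimal sketch of what that algorithm does to the bytes, using only the JDK's java.util.zip (not Hadoop's codec classes), here is a compress/decompress round trip:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateRoundTrip {
    // Compress with DEFLATE, the same algorithm DefaultCodec uses.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] input) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String text = "hive ".repeat(50); // 250 bytes of repetitive data
        byte[] compressed = deflate(text.getBytes(StandardCharsets.UTF_8));
        byte[] restored = inflate(compressed);
        System.out.println(text.equals(new String(restored, StandardCharsets.UTF_8))); // prints true
        System.out.println(compressed.length < 250); // prints true: repetitive input shrinks
    }
}
```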

Whether final output compression is enabled: the configuration below is set to true, so it is enabled here.
This applies to the final output data of a query:
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
<description> This controls whether the final outputs of a query (to a local/hdfs file or a hive table) is compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>

The mapred.output.compression.codec option determines which compression codec is used.
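For example, both options can be set per session; the Gzip codec below is just one possible choice:

```sql
-- enable compression of the final query output for this session
SET hive.exec.compress.output=true;
-- pick the codec via the Hadoop-level option
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```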

This controls whether intermediate data is compressed: a single SQL query can generate multiple MapReduce jobs, and the output of every job except the last can be compressed.
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
<description> This controls whether intermediate files produced by hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>
The codec used to compress intermediate data:
<property>
<name>hive.intermediate.compression.codec</name>
<value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>
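These can likewise be set per session. The type value BLOCK below is an assumption for illustration (valid SequenceFile compression types are NONE, RECORD, and BLOCK):

```sql
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
SET hive.intermediate.compression.type=BLOCK;
```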
The default file format is SequenceFile:
<property>
<name>hive.default.fileformat</name>
<value>SequenceFile</value>
<description>Default file format for CREATE TABLE statement. Options are TextFile and SequenceFile. Users can explicitly say CREATE TABLE ... STORED AS <TEXTFILE|SEQUENCEFILE> to override</description>
</property>
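Per the description above, the default can be overridden per table with STORED AS; the table and column names here are made up for illustration:

```sql
-- hypothetical tables; STORED AS overrides hive.default.fileformat
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;
```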

In HiveConf, the corresponding entries and their default values:
COMPRESSRESULT("hive.exec.compress.output", false),
COMPRESSINTERMEDIATE("hive.exec.compress.intermediate", false),
COMPRESSINTERMEDIATECODEC("hive.intermediate.compression.codec", ""),
COMPRESSINTERMEDIATETYPE("hive.intermediate.compression.type", ""),


How hive.exec.compress.output takes effect, starting in SemanticAnalyzer:
private Operator genFileSinkPlan(String dest, QB qb, Operator input)
    throws SemanticException {
  // ...
  Operator output = putOpInsertMap(
      OperatorFactory.getAndMakeChild(
          new FileSinkDesc(
              queryTmpdir,
              table_desc,
              conf.getBoolVar(HiveConf.ConfVars.COMPRESSRESULT), // whether to compress the final output
              currentTableId,
              rsCtx.isMultiFileSpray(),
              rsCtx.getNumFiles(),
              rsCtx.getTotalFiles(),
              rsCtx.getPartnCols(),
              dpCtx),
          fsRS, input), inputRR);
  // ...
}

FileSinkOperator:
private void createEmptyBuckets(Configuration hconf, ArrayList<String> paths)
    throws HiveException, IOException {
  // ...
  for (String p : paths) {
    Path path = new Path(p);
    RecordWriter writer = HiveFileFormatUtils.getRecordWriter(
        jc, hiveOutputFormat, outputClass, isCompressed,
        tableInfo.getProperties(), path); // create the RecordWriter
    writer.close(false);
    LOG.info("created empty bucket for enforcing bucketing at " + path);
  }
  // ...
}

HiveFileFormatUtils:
public static RecordWriter getRecordWriter(JobConf jc,
    HiveOutputFormat<?, ?> hiveOutputFormat,
    final Class<? extends Writable> valueClass, boolean isCompressed,
    Properties tableProp, Path outPath) throws IOException, HiveException {
  if (hiveOutputFormat != null) {
    return hiveOutputFormat.getHiveRecordWriter(jc, outPath, valueClass,
        isCompressed, tableProp, null);
  }
  return null;
}

HiveSequenceFileOutputFormat:
public class HiveSequenceFileOutputFormat extends SequenceFileOutputFormat
    implements HiveOutputFormat<WritableComparable, Writable> {

  public RecordWriter getHiveRecordWriter(JobConf jc, Path finalOutPath,
      Class<? extends Writable> valueClass, boolean isCompressed,
      Properties tableProperties, Progressable progress) throws IOException {
    // ...
    FileSystem fs = finalOutPath.getFileSystem(jc);
    final SequenceFile.Writer outStream = Utilities.createSequenceWriter(jc,
        fs, finalOutPath, BytesWritable.class, valueClass, isCompressed);
    // ...
  }
}

Utilities:
public static SequenceFile.Writer createSequenceWriter(JobConf jc, FileSystem fs, Path file,
    Class<?> keyClass, Class<?> valClass, boolean isCompressed) throws IOException {
  CompressionCodec codec = null;
  CompressionType compressionType = CompressionType.NONE;
  Class codecClass = null;
  if (isCompressed) {
    compressionType = SequenceFileOutputFormat.getOutputCompressionType(jc);
    codecClass = FileOutputFormat.getOutputCompressorClass(jc, DefaultCodec.class); // the default codec is org.apache.hadoop.io.compress.DefaultCodec
    codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, jc);
  }
  return (SequenceFile.createWriter(fs, jc, file, keyClass, valClass, compressionType, codec));
}

FileOutputFormat:
public static Class<? extends CompressionCodec>
    getOutputCompressorClass(JobConf conf,
        Class<? extends CompressionCodec> defaultValue) {
  Class<? extends CompressionCodec> codecClass = defaultValue;

  String name = conf.get("mapred.output.compression.codec"); // configurable via this option
  if (name != null) {
    try {
      codecClass =
          conf.getClassByName(name).asSubclass(CompressionCodec.class);
    } catch (ClassNotFoundException e) {
      throw new IllegalArgumentException("Compression codec " + name +
          " was not found.", e);
    }
  }
  return codecClass;
}
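The lookup pattern above (fall back to a default class unless the configuration names a loadable class) can be sketched without Hadoop on the classpath. CODEC_KEY and the java.util.zip stand-in classes below are assumptions for illustration:

```java
import java.util.Properties;

public class CodecLookup {
    // hypothetical key, mirroring "mapred.output.compression.codec"
    static final String CODEC_KEY = "example.output.compression.codec";

    // Same shape as FileOutputFormat.getOutputCompressorClass: return the
    // default unless the configuration names a class resolvable by reflection.
    static Class<?> resolve(Properties conf, Class<?> defaultValue) {
        String name = conf.getProperty(CODEC_KEY);
        if (name == null) {
            return defaultValue;
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(
                "Compression codec " + name + " was not found.", e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // key unset: falls back to the default class
        System.out.println(resolve(conf, java.util.zip.Deflater.class).getSimpleName()); // prints Deflater
        // key set: resolved by reflection
        conf.setProperty(CODEC_KEY, "java.util.zip.GZIPOutputStream");
        System.out.println(resolve(conf, java.util.zip.Deflater.class).getSimpleName()); // prints GZIPOutputStream
    }
}
```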

Intermediate result compression:
GenMapRedUtils.splitTasks:
public static void splitTasks(Operator<? extends Serializable> op,
    Task<? extends Serializable> parentTask,
    Task<? extends Serializable> childTask, GenMRProcContext opProcCtx,
    boolean setReducer, boolean local, int posn) throws SemanticException {
  // ...
  // Create a file sink operator for this file name
  boolean compressIntermediate = parseCtx.getConf().getBoolVar(
      HiveConf.ConfVars.COMPRESSINTERMEDIATE);
  FileSinkDesc desc = new FileSinkDesc(taskTmpDir, tt_desc,
      compressIntermediate);
  if (compressIntermediate) {
    desc.setCompressCodec(parseCtx.getConf().getVar(
        HiveConf.ConfVars.COMPRESSINTERMEDIATECODEC));
    desc.setCompressType(parseCtx.getConf().getVar(
        HiveConf.ConfVars.COMPRESSINTERMEDIATETYPE));
  }
  Operator<? extends Serializable> fs_op = putOpInsertMap(OperatorFactory
      .get(desc, parent.getSchema()), null, parseCtx);
  // ...
}