Join optimization for call results (call_result) and sales history (sale_history):
CALL_RESULT: 3.2 billion rows / 444 GB; SALE_HISTORY: 1.7 billion rows / 439 GB
Original logic
Map: 3255 Reduce: 950 Cumulative CPU: 238867.84 sec HDFS Read: 587550313339 HDFS Write: 725372805057 SUCCESS 28.1 min
Enable intermediate result compression
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Map: 3255 Reduce: 950 Cumulative CPU: 268479.06 sec HDFS Read: 587548211067 HDFS Write: 725372805057 SUCCESS 31.6 min
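To make the comparison between the two runs easier to read, the counters can be turned into relative changes. The following is only a small sketch: the figures are copied from the two job summaries above, while the dictionary keys and the `pct_change` helper are names made up for this illustration.

```python
# Counters copied from the two job summaries above.
# The key names are illustrative only, not Hive counter names.
baseline = {
    "cpu_sec": 238867.84,
    "hdfs_read": 587550313339,
    "hdfs_write": 725372805057,
    "minutes": 28.1,
}
intermediate_compressed = {
    "cpu_sec": 268479.06,
    "hdfs_read": 587548211067,
    "hdfs_write": 725372805057,
    "minutes": 31.6,
}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Percentage change of every counter once intermediate compression is on.
changes = {
    key: pct_change(baseline[key], intermediate_compressed[key])
    for key in baseline
}

for key, value in changes.items():
    print(f"{key}: {value:+.4f}%")
```

Read this way, CPU time and wall-clock time both rise by roughly an eighth, HDFS write is unchanged, and the HDFS read difference is only about 2 MB out of roughly 587 GB.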
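The results show that CPU time rose (from 238867.84 to 268479.06 sec), which is the cost of compressing and decompressing the intermediate data. HDFS read volume dropped only marginally, possibly because the source tables are stored as RCFile and are already compressed. Overall run time therefore did not improve; it went from 28.1 to 31.6 min.
Enable intermediate and final output compression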
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.a