Hive OOM when writing ORCFile partitions
Error message
2022-03-01 10:53:14,868 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.hadoop.hive.ql.io.orc.OutStream.getNewInputBuffer(OutStream.java:107)
at org.apache.hadoop.hive.ql.io.orc.OutStream.write(OutStream.java:140)
at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
at com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:80)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.writeStripe(WriterImpl.java:724)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.writeStripe(WriterImpl.java:1609)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1991)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2283)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.abortWriters(FileSinkOperator.java:252)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1026)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:199)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2022-03-01 10:53:14,893 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.io.IOException: The client is stopped
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1534)
at org.apache.hadoop.ipc.Client.call(Client.java:1478)
at org.apache.hadoop.ipc.Client.call(Client.java:1439)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
at com.sun.proxy.$Proxy9.statusUpdate(Unknown Source)
at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:790)
at java.lang.Thread.run(Thread.java:748)
Cause
-- Inspect the full table definition
desc extended `table_name`;
.......
location:hdfs://solway-ha/user/hive/warehouse/xx.db/xxx_table_name, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:date, type:string, comment:null)], parameters:{orc.compress=ZLIB, transient_lastDdlTime=1645002562}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
When writing to partitions, the ORC writer buffers each compression stream in memory; the buffer size is controlled by orc.compress.size and defaults to 256 KB (262144 bytes). Lowering it with alter table table_name set tblproperties("orc.compress.size"="65536")
reduces memory use. Here I cut it to a quarter of the default and re-ran the job to test the setting.
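Why this helps: in a dynamic-partition insert, every partition that a task writes keeps its own open ORC writer, and each writer holds roughly one compression buffer per column stream. A rough, illustrative estimate (the 200 partitions, 50 columns, and 2 streams per column below are assumed figures, not taken from this job):

```sql
-- Approximate heap held by ORC compression buffers:
-- open_partitions × columns × streams_per_column × orc.compress.size
select 200 * 50 * 2 * 262144 / 1048576 as default_256kb_mb,  -- = 5000 MB
       200 * 50 * 2 * 65536  / 1048576 as tuned_64kb_mb;     -- = 1250 MB
```

Under these assumptions the default buffer alone approaches the typical map-task heap, so quartering orc.compress.size quarters this component of the writer's footprint.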
Solution
>alter table gdm.gdm_wt_minute set tblproperties("orc.compress.size"="65536");
-- Re-running the SQL now completes without error
>insert overwrite table xx.xxx_table_name partition(date) select * from xx.xxx_tmp_table;
......
MapReduce Jobs Launched:
Stage-Stage-1: Map: 19 Cumulative CPU: 1089.54 sec HDFS Read: 5407892294 HDFS Write: 312056735 SUCCESS
Stage-Stage-3: Map: 139 Cumulative CPU: 308.08 sec HDFS Read: 332061143 HDFS Write: 311133769 SUCCESS
Total MapReduce CPU Time Spent: 23 minutes 17 seconds 620 msec
OK
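To confirm the new value actually took effect, the property can be read back after the alter. A minimal check, using the anonymized table name from above and assuming a Hive version that accepts a db-qualified name here (otherwise run `use xx;` first):

```sql
-- Verify the lowered compression buffer (the default is 262144, i.e. 256 KB)
show tblproperties xx.xxx_table_name("orc.compress.size");
```

Note that the property only applies to files written after the change; existing ORC files keep the buffer size they were written with.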