Hive2.0以后,使用了新的API来读写ORC文件(https://orc.apache.org)。
本文中的代码,在本地使用Java程序生成ORC文件,然后加载到Hive表。
代码如下:
- package com.lxw1234.hive.orc;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
- import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
- import org.apache.orc.CompressionKind;
- import org.apache.orc.OrcFile;
- import org.apache.orc.TypeDescription;
- import org.apache.orc.Writer;
- public class TestORCWriter {
- public static void main(String[] args) throws Exception {
- //定义ORC数据结构,即表结构
- // CREATE TABLE lxw_orc1 (
- // field1 STRING,
- // field2 STRING,
- // field3 STRING
- // ) stored AS orc;
- TypeDescription schema = TypeDescription.createStruct()
- .addField("field1", TypeDescription.createString())
- .addField("field2", TypeDescription.createString())
- .addField("field3", TypeDescription.createString());
- //输出ORC文件本地绝对路径
- String lxw_orc1_file = "/tmp/lxw_orc1_file.orc";
- Configuration conf = new Configuration();
- FileSystem.getLocal(conf);
- Writer writer = OrcFile.createWriter(new Path(lxw_orc1_file),
- OrcFile.writerOptions(conf)
- .setSchema(schema)
- .stripeSize(67108864)
- .bufferSize(131072)
- .blockSize(134217728)
- .compress(CompressionKind.ZLIB)
- .version(OrcFile.Version.V_0_12));
- //要写入的内容
- String[] contents = new String[]{"1,a,aa","2,b,bb","3,c,cc","4,d,dd"};
- VectorizedRowBatch batch = schema.createRowBatch();
- for(String content : contents) {
- int rowCount = batch.size++;
- String[] logs = content.split(",", -1);
- for(int i=0; i<logs.length; i++) {
- ((BytesColumnVector) batch.cols[i]).setVal(rowCount, logs[i].getBytes());
- //batch full
- if (batch.size == batch.getMaxSize()) {
- writer.addRowBatch(batch);
- batch.reset();
- }
- }
- }
- writer.addRowBatch(batch);
- writer.close();
- }
- }
将该程序打成普通jar包:orcwriter.jar
另外,本地使用Java程序执行依赖的Jar包有:
- commons-collections-3.2.1.jar
- hadoop-auth-2.3.0-cdh5.0.0.jar
- hive-exec-2.1.0.jar
- slf4j-log4j12-1.7.7.jar
- commons-configuration-1.6.jar
- hadoop-common-2.3.0-cdh5.0.0.jar
- log4j-1.2.16.jar
- commons-logging-1.1.1.jar
- hadoop-hdfs-2.3.0-cdh5.0.0.jar
- slf4j-api-1.7.7.jar
run2.sh中封装了执行命令:
- #!/bin/bash
- PWD=$(dirname ${0})
- echo "PWD [${PWD}] .."
- JARS=`find -L "${PWD}" -name '*.jar' -printf '%p:'`
- echo "JARS [${JARS}] .."
- $JAVA_HOME/bin/java -cp ${JARS} com.lxw1234.hive.orc.TestORCWriter $*
执行./run2.sh
在Hive中建表并LOAD数据:
可以看到,生成的ORC文件可以正常使用。
大多情况下,还是建议在Hive中将文本文件转成ORC格式,这种用JAVA在本地生成ORC文件,属于特殊需求场景。