Following up on the earlier article 《Java API 读取Hive Orc文件》 (reading Hive ORC files with the Java API), this article shows how to write ORC-format files with the Java API.
The code below writes three rows of data:
张三,20
李四,22
王五,30
into the file /tmp/lxw1234/orcoutput/lxw1234.com.orc on HDFS.
package com.lxw1234.test;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

/**
 * lxw的大数据田地 -- http://lxw1234.com
 * @author lxw.com
 */
public class TestOrcWriter {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path("/tmp/lxw1234/orcoutput/lxw1234.com.orc");
        // Build a StructObjectInspector from MyRow via reflection; it tells
        // the SerDe the row layout (name: string, age: int).
        StructObjectInspector inspector =
                (StructObjectInspector) ObjectInspectorFactory
                        .getReflectionObjectInspector(MyRow.class,
                                ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        OrcSerde serde = new OrcSerde();
        OutputFormat outFormat = new OrcOutputFormat();
        RecordWriter writer = outFormat.getRecordWriter(fs, conf,
                outputPath.toString(), Reporter.NULL);
        // The key is ignored by OrcOutputFormat; each value is one row
        // serialized by OrcSerde according to the inspector.
        writer.write(NullWritable.get(),
                serde.serialize(new MyRow("张三", 20), inspector));
        writer.write(NullWritable.get(),
                serde.serialize(new MyRow("李四", 22), inspector));
        writer.write(NullWritable.get(),
                serde.serialize(new MyRow("王五", 30), inspector));
        writer.close(Reporter.NULL);
        fs.close();
        System.out.println("write success.");
    }

    static class MyRow implements Writable {
        String name;
        int age;

        MyRow(String name, int age) {
            this.name = name;
            this.age = age;
        }

        // Serialization is handled entirely by OrcSerde, so the Writable
        // methods are never called in this program.
        @Override
        public void readFields(DataInput in) throws IOException {
            throw new UnsupportedOperationException("readFields not supported");
        }

        @Override
        public void write(DataOutput out) throws IOException {
            throw new UnsupportedOperationException("write not supported");
        }
    }
}
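If you prefer not to go through OrcOutputFormat and a RecordWriter, hive-exec also exposes a lower-level writer through OrcFile.createWriter, which lets you tune stripe size and compression directly. Below is a minimal sketch, assuming the same MyRow class and hive-exec 0.13.1 on the classpath; the output path, stripe size, and compression choice are illustrative, not required values:

package com.lxw1234.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

public class TestOrcFileWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same reflection-based inspector as above, reusing TestOrcWriter.MyRow.
        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
                TestOrcWriter.MyRow.class,
                ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        // createWriter talks to ORC directly: no SerDe, no RecordWriter.
        // The path is hypothetical; 64MB stripes and ZLIB are just examples.
        Writer writer = OrcFile.createWriter(
                new Path("/tmp/lxw1234/orcoutput/direct.orc"),
                OrcFile.writerOptions(conf)
                        .inspector(inspector)
                        .stripeSize(64L * 1024 * 1024)
                        .compress(CompressionKind.ZLIB));
        writer.addRow(new TestOrcWriter.MyRow("张三", 20));
        writer.close();
        System.out.println("write success.");
    }
}

Hive reads the result the same way; the choice between the two APIs is mainly about how much control you want over the file layout.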
Package the program above as orc.jar, upload it to a Hadoop client machine, and run the following commands:
export HADOOP_CLASSPATH=/usr/local/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar:$HADOOP_CLASSPATH
hadoop jar orc.jar com.lxw1234.test.TestOrcWriter
After it runs successfully, check the file on HDFS:
[liuxiaowen@dev tmp]$ hadoop fs -ls /tmp/lxw1234/orcoutput/
Found 1 items
-rw-r--r-- 2 liuxiaowen supergroup 312 2015-08-18 18:09 /tmp/lxw1234/orcoutput/lxw1234.com.orc
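Before wiring the file into Hive, you can also inspect its metadata (schema, stripe layout, compression) with Hive's ORC file dump utility; whether your Hive version ships the orcfiledump service is an assumption you should verify:

[liuxiaowen@dev tmp]$ hive --orcfiledump /tmp/lxw1234/orcoutput/lxw1234.com.orc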
Next, create an external table in Hive with its location pointing at that directory and the file format set to ORC:
CREATE EXTERNAL TABLE lxw1234(
  name STRING,
  age INT
) STORED AS ORC
LOCATION '/tmp/lxw1234/orcoutput/';
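Incidentally, the table's column names do not have to match the Java field names: as far as I can tell, this Hive version resolves ORC columns by position, so only the column order and types matter. A hypothetical variation (table and column names made up) over the same directory should read the same data:

CREATE EXTERNAL TABLE lxw1234_renamed(
  user_name STRING,
  user_age INT
) STORED AS ORC
LOCATION '/tmp/lxw1234/orcoutput/';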
Query the table in Hive:
hive> desc lxw1234;
OK
name string
age int
Time taken: 0.148 seconds, Fetched: 2 row(s)
hive> select * from lxw1234;
OK
张三 20
李四 22
王五 30
Time taken: 0.1 seconds, Fetched: 3 row(s)
hive>
OK, the data shows up correctly.
Note: this program is only a feasibility test. For large volumes of ORC data it would need improvement, or you should use MapReduce instead; a follow-up article will cover reading and writing Hive ORC files with MapReduce.