Parquet with MapReduce

Original post: 2016-06-16 16:10:09

When using Parquet in MapReduce, the input/output format classes to choose depend on the serialization framework you pair it with. The example below uses Avro:
Use AvroParquetInputFormat and AvroParquetOutputFormat.

    @Override
    public int run(String[] strings) throws Exception {


        Path inputPath = new Path(strings[0]);
        Path outputPath = new Path(strings[1]);

        Job job = Job.getInstance(getConf(),"AvroParquetMapReduce");
        job.setJarByClass(getClass());

        // Read Avro records from the Parquet input files.
        job.setInputFormatClass(AvroParquetInputFormat.class);
        AvroParquetInputFormat.setInputPaths(job, inputPath);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // The map output types differ from the job's final output types,
        // so they must be declared explicitly.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        // Write the reducer's output as Parquet, using the StockAvg Avro schema.
        job.setOutputFormatClass(AvroParquetOutputFormat.class);
        FileOutputFormat.setOutputPath(job, outputPath);
        AvroParquetOutputFormat.setSchema(job, StockAvg.SCHEMA$);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // The mapper receives Avro Stock records (the key is always null for Parquet
    // input) and emits (symbol, open price) pairs.
    static class Map extends Mapper<Void, Stock, Text, DoubleWritable> {

        @Override
        protected void map(Void key, Stock value, Context context) throws IOException, InterruptedException {
            context.write(new Text(value.getSymbol().toString()),new DoubleWritable(value.getOpen()));
        }
    }

    // The reducer averages the open prices per symbol (Mean is Apache Commons
    // Math's streaming mean) and writes one StockAvg record per symbol.
    static class Reduce extends Reducer<Text, DoubleWritable, Void, StockAvg> {

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
            Mean mean = new Mean();

            for (DoubleWritable val :values){
                mean.increment(val.get());
            }

            StockAvg avg = new StockAvg();
            avg.setSymbol(key.toString());
            avg.setAvg(mean.getResult());
            context.write(null,avg);
        }
    }
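
For completeness, here is a minimal driver sketch. It assumes the run() method above lives in a class named AvroParquetMapReduce that extends Configured and implements Tool (the class name is an assumption; the ToolRunner pattern is the same as in the full example further down):

    public static void main(String[] args) throws Exception {
        // args[0] = Parquet input directory, args[1] = job output directory
        int code = ToolRunner.run(new AvroParquetMapReduce(), args);
        System.exit(code);
    }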

Here both the input and the output are Parquet files. If you want the input to be a plain text file instead, simply leave the InputFormatClass unset so the default TextInputFormat is used; a sketch of that variant follows.
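
A rough sketch of the text-input variant, assuming a hypothetical CSV layout in which the symbol is the first column and the opening price the third (the field indices and the class name TextMap are illustrative only):

    // Text input, Parquet output: the mapper parses lines instead of receiving Stock records.
    static class TextMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            // Hypothetical layout: parts[0] = symbol, parts[2] = open price.
            context.write(new Text(parts[0]),
                    new DoubleWritable(Double.parseDouble(parts[2])));
        }
    }

The rest of the job setup stays the same, except that job.setInputFormatClass(...) is omitted and the input paths point at the text files.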

If the input schema is changed and Avro can no longer load the specific generated class, it is forced to fall back to GenericData records instead.
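
A minimal sketch of what the mapper looks like in that case, assuming the records still expose symbol and open fields (the class name GenericMap is illustrative, and GenericRecord is org.apache.avro.generic.GenericRecord):

    static class GenericMap extends Mapper<Void, GenericRecord, Text, DoubleWritable> {

        @Override
        protected void map(Void key, GenericRecord value, Context context)
                throws IOException, InterruptedException {
            if (value != null) {
                // Fields are accessed by name because no generated Stock class is available.
                context.write(new Text(value.get("symbol").toString()),
                        new DoubleWritable((Double) value.get("open")));
            }
        }
    }

The complete example below builds on the first job and adds predicate pushdown (a record filter) and projection pushdown (a requested reader schema):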

// Imports assume the pre-Apache parquet-mr artifacts (parquet.* packages), which
// were current when this was written; newer releases use org.apache.parquet.*.
// Stock and StockAvg are Avro-generated classes; import them from whatever
// package they were generated into. Mean is from Apache Commons Math (the
// package prefix depends on the Commons Math version).
import java.io.IOException;
import java.util.List;

import com.google.common.collect.Lists;
import org.apache.avro.Schema;
import org.apache.commons.math3.stat.descriptive.moment.Mean;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import parquet.avro.AvroParquetInputFormat;
import parquet.avro.AvroParquetOutputFormat;
import parquet.column.ColumnReader;
import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;

public class AvroProjectionParquetMapReduce extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // Hard-coded HDFS paths for this example; normally they would be taken
        // from the command line rather than overwriting args.
        args = new String[2];
        args[0] = "hdfs://hadoop:9000/user/madong/parquet-input";
        args[1] = "hdfs://hadoop:9000/user/madong/parquet-output";

        int code = ToolRunner.run(new AvroProjectionParquetMapReduce(),args);
        System.exit(code);
    }

    @Override
    public int run(String[] strings) throws Exception {
        Path inputPath = new Path(strings[0]);
        Path outputPath = new Path(strings[1]);


        Job job = Job.getInstance(getConf(),"AvroProjectionParquetMapReduce");
        job.setJarByClass(AvroProjectionParquetMapReduce.class);

        job.setInputFormatClass(AvroParquetInputFormat.class);
        AvroParquetInputFormat.setInputPaths(job, inputPath);

        // Predicate pushdown: only records whose symbol column equals "GOOG"
        // are materialized by the reader (see GoogleStockFilter below).
        AvroParquetInputFormat.setUnboundRecordFilter(job, GoogleStockFilter.class);

        // Projection pushdown: build a reader schema containing only the
        // "symbol" and "open" fields so the other columns are never read.
        Schema projection = Schema.createRecord(Stock.SCHEMA$.getName(),
                Stock.SCHEMA$.getDoc(), Stock.SCHEMA$.getNamespace(), false);
        List<Schema.Field> fields = Lists.newArrayList();
        for (Schema.Field field : Stock.SCHEMA$.getFields()) {
            if ("symbol".equals(field.name()) || "open".equals(field.name())) {
                fields.add(new Schema.Field(field.name(), field.schema(), field.doc(),
                        field.defaultValue(), field.order()));
            }
        }
        projection.setFields(fields);
        AvroParquetInputFormat.setRequestedProjection(job, projection);


        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        job.setOutputFormatClass(AvroParquetOutputFormat.class);
        FileOutputFormat.setOutputPath(job, outputPath);
        AvroParquetOutputFormat.setSchema(job, StockAvg.SCHEMA$);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * Record filter that keeps only rows whose "symbol" column equals "GOOG".
     */
    public static class GoogleStockFilter implements UnboundRecordFilter {

        private final UnboundRecordFilter filter;

        public GoogleStockFilter() {
            filter = ColumnRecordFilter.column("symbol", ColumnPredicates.equalTo("GOOG"));
        }

        @Override
        public RecordFilter bind(Iterable<ColumnReader> readers) {
            return filter.bind(readers);
        }
    }

    static class Map extends Mapper<Void, Stock, Text, DoubleWritable> {

        @Override
        protected void map(Void key, Stock value, Context context) throws IOException, InterruptedException {
            // Records rejected by the record filter arrive as null, hence the check.
            if (value != null) {
                context.write(new Text(value.getSymbol().toString()),
                        new DoubleWritable(value.getOpen()));
            }
        }
    }

    static class Reduce extends Reducer<Text, DoubleWritable, Void, StockAvg> {

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
            Mean mean = new Mean();

            for (DoubleWritable val :values){
                mean.increment(val.get());
            }

            StockAvg avg = new StockAvg();
            avg.setSymbol(key.toString());
            avg.setAvg(mean.getResult());
            context.write(null,avg);
        }
    }
}