Hadoop Multiple Outputs Example

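Requirement: sample the raw data at an approximate ratio to split it into a training set and a test set. The training set is written to the train directory under the specified output directory, and the test set to the test directory under the same output directory.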
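The "approximate ratio" comes from deciding each record independently with a random draw, so the training share only converges to the ratio on large inputs. Below is a minimal plain-Java sketch of that idea without Hadoop. The class name BernoulliSplit, the Split holder, and the fixed seed are assumptions made for illustration and are not part of the original job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BernoulliSplit {
    /** Holds the two partitions produced by one pass over the records. */
    public static final class Split {
        public final List<String> train = new ArrayList<>();
        public final List<String> test = new ArrayList<>();
    }

    /** Routes each record to train with probability ratio, otherwise to test. */
    public static Split split(List<String> records, double ratio, long seed) {
        Random random = new Random(seed); // seeded only to make the sketch repeatable
        Split result = new Split();
        for (String record : records) {
            // nextDouble() is uniform on [0, 1), so this branch is taken with probability ratio.
            if (random.nextDouble() < ratio) {
                result.train.add(record);
            } else {
                result.test.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            records.add("record-" + i);
        }
        Split s = split(records, 0.2, 42L);
        double fraction = s.train.size() / (double) records.size();
        System.out.printf("train=%d test=%d fraction=%.4f%n",
                s.train.size(), s.test.size(), fraction);
    }
}
```

Every record lands in exactly one of the two sets, but the sizes are not exact: a ratio of 0.2 gives roughly, not precisely, 20% training data.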

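import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

class SampleMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    // Approximate fraction of records routed to the training set.
    private double ratio;
    private final Random random = new Random();
    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The driver stores the ratio in the job configuration under the key "ratio".
        ratio = Double.parseDouble(context.getConfiguration().get("ratio"));
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // nextDouble() is uniform on [0, 1), so "< ratio" holds with probability ratio.
        if (random.nextDouble() < ratio) {
            multipleOutputs.write(NullWritable.get(), value, "train/");
        } else {
            multipleOutputs.write(NullWritable.get(), value, "test/");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Close the underlying record writers so buffered records are flushed.
        multipleOutputs.close();
    }
}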
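
// Driver method; presumably a member of the Sampler class passed to setJarByClass.
// Needs org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.mapreduce.Job, org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
// and FileOutputFormat, TextOutputFormat, MultipleOutputs from
// org.apache.hadoop.mapreduce.lib.output.
public static void job(Configuration config, Path inputPath, Path outputPath, String ratio) throws IOException {
    // Set the ratio before Job.getInstance, which takes a copy of the configuration.
    config.set("ratio", ratio);
    Job job = Job.getInstance(config);
    job.setJobName("Random Sample");
    job.setJarByClass(Sampler.class);
    job.setMapperClass(SampleMapper.class);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);
    // Map-only job: mapper output goes straight to the output directory.
    job.setNumReduceTasks(0);
    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputPath);
    MultipleOutputs.addNamedOutput(job, "train", TextOutputFormat.class, NullWritable.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "test", TextOutputFormat.class, NullWritable.class, Text.class);
    try {
        job.waitForCompletion(true);
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        // Restore the interrupt flag so callers can still observe the interruption.
        Thread.currentThread().interrupt();
        e.printStackTrace();
    }
}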

Key code:

multipleOutputs.write(NullWritable.get(), value, "train/");
multipleOutputs.write(NullWritable.get(), value, "test/");

FileOutputFormat.setOutputPath(job, outputPath);
MultipleOutputs.addNamedOutput(job, "train", TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "test", TextOutputFormat.class, NullWritable.class, Text.class);

The third argument to write is a base output path resolved relative to the job's output directory, so the two calls write under train/ and test/ respectively.
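Because the base output path passed to write ends with a slash, the files inside train and test appear to get names that start with a dash. The sketch below is not a Hadoop API. It is a hypothetical helper, uniqueFile, that mimics the "base, task type letter, five-digit partition" naming that FileOutputFormat is understood to apply, to show what each base path would produce.

```java
public class OutputNameSketch {
    /**
     * Hypothetical helper (not part of Hadoop) that mimics the
     * "<base>-<m|r>-<5-digit partition>" file naming convention.
     */
    static String uniqueFile(String baseOutputPath, boolean mapTask, int partition) {
        return String.format("%s-%s-%05d", baseOutputPath, mapTask ? "m" : "r", partition);
    }

    public static void main(String[] args) {
        // Base path as used in the mapper above.
        System.out.println(uniqueFile("train/", true, 0));     // train/-m-00000
        // A base path with a file prefix gives more conventional names.
        System.out.println(uniqueFile("train/part", true, 0)); // train/part-m-00000
    }
}
```

If names such as part-m-00000 are preferred, passing "train/part" and "test/part" as the base output path is one option. The output directories stay the same.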

Specify the sampling ratio, input path, and output path as:
hadoop.sampler.ratio = 0.2
hadoop.sampler.datainputpath = /lgh/data/input
hadoop.sampler.dataoutputpath = /lgh/sampleoutput
Output directories:
/lgh/sampleoutput/train
/lgh/sampleoutput/test
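The post does not show how these three keys are read. Here is one possible sketch, assuming they live in a standard Java properties file. The class SamplerSettings and its methods are hypothetical names introduced only for this illustration. Only the three property keys come from the original.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class SamplerSettings {
    public final double ratio;
    public final String inputPath;
    public final String outputPath;

    private SamplerSettings(double ratio, String inputPath, String outputPath) {
        this.ratio = ratio;
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    /** Parses properties-format text and validates the sampling ratio. */
    public static SamplerSettings parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        double ratio = Double.parseDouble(required(props, "hadoop.sampler.ratio"));
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]: " + ratio);
        }
        return new SamplerSettings(ratio,
                required(props, "hadoop.sampler.datainputpath"),
                required(props, "hadoop.sampler.dataoutputpath"));
    }

    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return value.trim();
    }

    /** Directory that receives records written with the "train/" base path. */
    public String trainDir() { return outputPath + "/train"; }

    /** Directory that receives records written with the "test/" base path. */
    public String testDir() { return outputPath + "/test"; }

    public static void main(String[] args) {
        SamplerSettings s = parse(String.join("\n",
                "hadoop.sampler.ratio = 0.2",
                "hadoop.sampler.datainputpath = /lgh/data/input",
                "hadoop.sampler.dataoutputpath = /lgh/sampleoutput"));
        System.out.println(s.trainDir()); // /lgh/sampleoutput/train
        System.out.println(s.testDir());  // /lgh/sampleoutput/test
    }
}
```

The parsed values could then be handed to the job method as new Path(s.inputPath), new Path(s.outputPath), and String.valueOf(s.ratio).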
