Hadoop data cleaning example

Example (remove blank lines and records that begin with a space):

  1. Input data: D:\data\testdata.txt

    zhangsan 500 450 jan
    zhangsan 550 450 feb
    lisi 210 150 jan
    lisi 200 150 feb
    zhangsan 400 150 march

    zhangsan 600 500 april
    lisi 190 150 april
    800 100 jan
    BLU 2000 200 feb
    lisi 110 10 may

  2. DataCleanMapper

    package com.blu.dataclean;

    import java.io.IOException;

    import org.apache.commons.lang3.StringUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DataCleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String val = value.toString();
            String[] vals = val.split(" ");
            // Skip the record when its first field is empty. This covers blank
            // lines and lines that start with a space. A line made up only of
            // spaces splits to an empty array, so check the length first.
            if (vals.length == 0 || StringUtils.isEmpty(vals[0])) {
                return;
            }
            // Emit the original line unchanged; no value is needed.
            context.write(value, NullWritable.get());
        }
    }

  3. DataCleanJob

    package com.blu.dataclean;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DataCleanJob {

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(DataCleanJob.class);
            job.setMapperClass(DataCleanMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            // Map-only job: set the number of reduce tasks to 0.
            job.setNumReduceTasks(0);
            // args[0] is the input path, args[1] is the output directory.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            boolean flag = job.waitForCompletion(true);
            System.exit(flag ? 0 : 1);
        }
    }

  4. Run arguments:

    D:\data\testdata.txt D:\data\output

  5. Result:

    zhangsan 500 450 jan
    zhangsan 550 450 feb
    zhangsan 400 150 march
    zhangsan 600 500 april
    BLU 2000 200 feb
    lisi 110 10 may
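
The filtering rule in DataCleanMapper depends on how String.split treats blank lines and leading spaces, which is easy to check without a cluster. The following is a minimal sketch, not part of the original project: the class LineFilterSketch and its methods keep and clean are hypothetical names, and it uses only the JDK (String.isEmpty stands in for StringUtils.isEmpty). It also guards against a line consisting solely of spaces, for which split returns an empty array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (an assumption, not from the original post): it applies
// the mapper's rule to plain strings so the rule can be tried without Hadoop.
class LineFilterSketch {

    // Returns true when the line should be kept.
    //   "".split(" ")         -> [""]              first token empty, drop
    //   " 800 100".split(" ") -> ["", "800", "100"] first token empty, drop
    //   "   ".split(" ")      -> []                no tokens at all, drop
    static boolean keep(String line) {
        String[] tokens = line.split(" ");
        return tokens.length > 0 && !tokens[0].isEmpty();
    }

    // Filters a list of lines, preserving input order as a map-only job
    // does within a single input split.
    static List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (keep(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Small illustrative sample, not the author's data file.
        List<String> sample = Arrays.asList(
                "zhangsan 500 450 jan",
                "",
                " 800 100 jan",
                "BLU 2000 200 feb");
        // Prints the first and last sample lines only.
        for (String line : clean(sample)) {
            System.out.println(line);
        }
    }
}
```

Note that this rule only catches a leading space character. A line that starts with a tab would still be kept, so a stricter check would be needed if the input can contain tabs.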
