Hadoop: Data Cleaning Example

Example (drop blank lines and lines that begin with a space):

  1. Raw data: D:\data\testdata.txt
zhangsan 500 450 jan
zhangsan 550 450 feb
 lisi 210 150 jan
 lisi 200 150 feb
zhangsan 400 150 march

zhangsan 600 500 april
 lisi 190 150 april
      800 100 jan
BLU 2000 200 feb
lisi 110 10 may
  1. DataCleanMapper
package com.blu.dataclean;

import java.io.IOException;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DataCleanMapper extends Mapper<LongWritable, Text, Text, NullWritable>{
	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context)
			throws IOException, InterruptedException {
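		String val = value.toString();
		String[] vals = val.split(" ");
		// Skip blank lines and lines that start with a space: in both cases the
		// first token is empty. A line made up only of spaces splits into an
		// empty array, so check the length before reading vals[0].
		if (vals.length == 0 || StringUtils.isEmpty(vals[0])) {
			return;
		}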
		context.write(value, NullWritable.get());
	}
}
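The mapper's check depends on how `String.split(" ")` treats leading and trailing separators. The following is a minimal, Hadoop-free sketch of that behavior; the class name `SplitBehavior` and the helper `shouldDrop` are hypothetical names introduced here for illustration, and plain `String.isEmpty()` stands in for `StringUtils.isEmpty`.

```java
import java.util.Arrays;

class SplitBehavior {
    // Hypothetical helper that mirrors the mapper's rule without Hadoop
    // or commons-lang3: drop the line when its first token is empty.
    static boolean shouldDrop(String line) {
        String[] parts = line.split(" ");
        // A line of only spaces splits into an empty array because trailing
        // empty strings are discarded, so the length is checked first.
        return parts.length == 0 || parts[0].isEmpty();
    }

    public static void main(String[] args) {
        String[] samples = {"zhangsan 500 450 jan", " lisi 210 150 jan", "", "   "};
        for (String s : samples) {
            System.out.println("[" + s + "] tokens=" + Arrays.toString(s.split(" "))
                    + " drop=" + shouldDrop(s));
        }
    }
}
```

A leading space yields an empty first token, an empty line yields a single empty token, and an all-space line yields no tokens at all, which is why indexing `vals[0]` without a length check can throw on that last case.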
  1. DataCleanJob
package com.blu.dataclean;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataCleanJob {
	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance();
		job.setJarByClass(DataCleanJob.class);
		job.setMapperClass(DataCleanMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(NullWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		// Set the number of reduce tasks to 0, making this a map-only job
		job.setNumReduceTasks(0);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		boolean flag = job.waitForCompletion(true);
		System.exit(flag ? 0 : 1);
	}
}
  1. Run arguments (input file, then output directory):
D:\data\testdata.txt D:\data\output
  1. Output:
zhangsan 500 450 jan
zhangsan 550 450 feb
zhangsan 400 150 march
zhangsan 600 500 april
BLU 2000 200 feb
lisi 110 10 may
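To sanity-check the expected output without a Hadoop runtime, the same rule can be applied to the sample lines in plain Java. This is only a sketch under assumptions: `LocalCleanCheck` and `clean` are hypothetical names, and the filter uses an equivalent formulation of the rule (keep a line only if it is non-empty and does not start with a space) rather than the mapper's `split` call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LocalCleanCheck {
    // Keep a line only if it is non-empty and its first character is not a space.
    static List<String> clean(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.isEmpty() && line.charAt(0) != ' ')
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The sample input from the raw data listing, including the blank line.
        List<String> input = Arrays.asList(
                "zhangsan 500 450 jan",
                "zhangsan 550 450 feb",
                " lisi 210 150 jan",
                " lisi 200 150 feb",
                "zhangsan 400 150 march",
                "",
                "zhangsan 600 500 april",
                " lisi 190 150 april",
                "      800 100 jan",
                "BLU 2000 200 feb",
                "lisi 110 10 may");
        clean(input).forEach(System.out::println);
    }
}
```

Running it prints the six lines that do not start with a space, in input order, matching the job output listed above.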