## Data preprocessing with a MapReduce program

也想洒脱

Tags: mapreduce

Article link: https://blog.csdn.net/weixin_44629054/article/details/113869114

The dataset contains roughly 20,000 records, so only an excerpt is shown here (CSV format):

1月20日,北京,大兴区,2,0,0,北京市大兴区卫健委,https://m.weibo.cn/2703012010/4462638756717942,
1月20日,北京,昌平区,2,0,0,北京市卫健委,http://wjw.beijing.gov.cn/xwzx_20031/wnxw/202001/t20200121_1620353.html,
1月20日,北京,外地来京,1,0,0,北京市卫健委,http://wjw.beijing.gov.cn/xwzx_20031/wnxw/202001/t20200121_1620353.html,
1月20日,广东,深圳市,1,0,0,深圳市卫健委,http://wjw.sz.gov.cn/gsgg/202001/t20200120_18987619.htm,
1月20日,广东,深圳市,8,0,0,广东卫健委,http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2876926.html,
1月20日,广东,珠海市,3,0,0,珠海市卫健委,http://wsjkj.zhuhai.gov.cn/zwgk/tzgg/content/post_2461447.html,
1月20日,广东,湛江市,1,0,0,广东卫健委,http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2876926.html,https://www.zhanjiang.gov.cn/zjwjj/sy/gzdt/content/post_1031598.html,
1月20日,广东,惠州市,1,0,0,广东卫健委,http://wsjkw.gd.gov.cn/zwyw_yqxx/content/post_2876926.html,
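
Each record above is one comma-separated line that ends with a trailing comma, and one of the sample rows carries two URLs, so the number of fields is not constant. As a minimal sketch of how such a line splits in Java (the class `CsvSplitDemo` and the helper `splitKeepTrailing` are hypothetical names used only for illustration), note that `String.split(",")` silently drops the empty field after the last comma, while a limit of `-1` keeps it:

```java
class CsvSplitDemo {
    // Hypothetical helper: split one record on commas.
    // The limit -1 keeps trailing empty fields, so the empty
    // field after the final comma is preserved.
    static String[] splitKeepTrailing(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "1月20日,北京,大兴区,2,0,0,北京市大兴区卫健委,"
                + "https://m.weibo.cn/2703012010/4462638756717942,";

        String[] dropped = line.split(",");       // trailing empty field removed
        String[] kept = splitKeepTrailing(line);  // trailing empty field kept

        System.out.println(dropped.length);  // prints 8
        System.out.println(kept.length);     // prints 9
        // Only the "kept" variant round-trips back to the original line.
        System.out.println(String.join(",", kept).equals(line));  // prints true
    }
}
```

Keeping the trailing field matters if the preprocessed output is expected to have the same shape as the input.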

1.1 Data conversion: rewrite the date field of every record so that it has the format xxxx年xx月xx日 (year, month, day).
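
Before moving to the cluster, the rewrite can be tried on a single record in plain Java. The following is only a sketch: the class `DateFieldDemo` and its methods are hypothetical names, the year 2020 is hard-coded because the data itself carries no year, and the zero-padded variant is an assumption about what "xx月xx日" might require rather than something the task states explicitly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateFieldDemo {
    // Assumption: every record belongs to 2020 (the year is not in the data).
    static final String YEAR_PREFIX = "2020年";
    static final Pattern MONTH_DAY = Pattern.compile("^(\\d{1,2})月(\\d{1,2})日$");

    // Prefix the first field with the year and keep all other fields unchanged.
    static String addYear(String line) {
        String[] fields = line.split(",", -1);   // -1 keeps the trailing empty field
        fields[0] = YEAR_PREFIX + fields[0];
        return String.join(",", fields);
    }

    // Optional variant: also zero-pad month and day, e.g. 1月20日 -> 2020年01月20日.
    // A field that does not look like "M月D日" only receives the year prefix.
    static String padDate(String dateField) {
        Matcher m = MONTH_DAY.matcher(dateField);
        if (!m.matches()) {
            return YEAR_PREFIX + dateField;
        }
        int month = Integer.parseInt(m.group(1));
        int day = Integer.parseInt(m.group(2));
        return String.format("%s%02d月%02d日", YEAR_PREFIX, month, day);
    }

    public static void main(String[] args) {
        String line = "1月20日,广东,珠海市,3,0,0,珠海市卫健委,"
                + "http://wsjkj.zhuhai.gov.cn/zwgk/tzgg/content/post_2461447.html,";
        System.out.println(addYear(line));
        System.out.println(padDate("1月20日"));  // prints 2020年01月20日
    }
}
```

Only index 0 is touched, so rows with a different number of fields pass through unchanged apart from the date.
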

Command executed on the cluster and its result:

[image not preserved]

The Java code is as follows:


```java
package webgame_demo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class yq_replace {

	public static class MyMap extends Mapper<LongWritable, Text, LongWritable, Text> {
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			// Skip blank lines so they do not turn into a bare "2020年"
			if (line.trim().isEmpty()) {
				return;
			}
			// Split the record on commas; the limit -1 keeps the empty
			// field that follows the trailing comma
			String[] split = line.split(",", -1);
			// Prefix the date field with the year
			split[0] = "2020年" + split[0];
			// Rejoin with commas so the output is still CSV
			context.write(key, new Text(String.join(",", split)));
		}
	}

	public static class MyReduce extends Reducer<LongWritable, Text, NullWritable, Text> {
		@Override
		protected void reduce(LongWritable k2, Iterable<Text> v2s, Context context)
				throws IOException, InterruptedException {
			// Emit only the value: with a NullWritable key, TextOutputFormat
			// writes neither a key nor the tab separator
			for (Text text : v2s) {
				context.write(NullWritable.get(), text);
			}
		}
	}

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// Set up the configuration
		Configuration conf = new Configuration();
		// Create the job
		Job job = Job.getInstance(conf, yq_replace.class.getSimpleName());
		// Specify the jar by class
		job.setJarByClass(yq_replace.class);
		// Input path on HDFS, taken from the first argument
		FileInputFormat.addInputPath(job, new Path(args[0]));
		// Mapper class and the key/value types of the map output
		job.setMapperClass(MyMap.class);
		job.setMapOutputKeyClass(LongWritable.class);
		job.setMapOutputValueClass(Text.class);
		// Reducer class and the key/value types of the final output
		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(NullWritable.class);
		job.setOutputValueClass(Text.class);
		// Output path on HDFS, taken from the second argument
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Submit the job; passing true prints progress information.
		// The exit code reflects whether the job succeeded.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
```
