Hadoop WordCount: counting words

天黑要加班

Category: mapreduce

Article link: https://blog.csdn.net/weixin_42767528/article/details/83683978

MapReduce processing passes through the following stages:

input text file -> InputFormat -> map -> shuffle -> reduce -> OutputFormat

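Summary:
A MapReduce computation runs in two phases.
Phase 1, map: the input data is divided into splits, and each split is handled by one MapTask. If a file is divided into 10 splits, 10 MapTasks run in parallel without interfering with one another.

Phase 2, reduce: the outputs of all MapTasks are merged.
While a MapReduce job is running, the following processes exist:
1 MapTask
2 ReduceTask
3 MRAppMaster, the process that manages the job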
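
The two phases can be imitated in plain Java without Hadoop. The following is only a rough sketch: a thread pool stands in for the cluster, each list of lines stands in for one split, and the names SplitParallelSketch, mapTask, reducePhase and run are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SplitParallelSketch {

    // Stand-in for one MapTask: counts the words of a single split,
    // without looking at any other split.
    static Map<String, Integer> mapTask(List<String> split) {
        Map<String, Integer> partial = new HashMap<>();
        for (String line : split) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    partial.merge(word, 1, Integer::sum);
                }
            }
        }
        return partial;
    }

    // Stand-in for the reduce phase: merges the outputs of all map tasks.
    static Map<String, Integer> reducePhase(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // One task per split, all submitted to the pool before any result is read.
    static Map<String, Integer> run(List<List<String>> splits) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (List<String> split : splits) {
                Callable<Map<String, Integer>> task = () -> mapTask(split);
                futures.add(pool.submit(task));
            }
            List<Map<String, Integer>> partials = new ArrayList<>();
            for (Future<Map<String, Integer>> future : futures) {
                partials.add(future.get());
            }
            return reducePhase(partials);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("Hello world"),
                List.of("Hadoop spark", "Hadoop java"));
        // prints {Hadoop=2, Hello=1, java=1, spark=1, world=1}
        System.out.println(run(splits));
    }
}
```

In a real job the splits are computed by the InputFormat and the tasks are scheduled by MRAppMaster; the sketch only shows why independent map tasks can run side by side and be merged afterwards.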

Example: WordCount

Text: the InputFormat passes it to map as <k,v> pairs

Hello world

Hadoop spark

Hadoop java

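TextInputFormat, the default FileInputFormat implementation, reads the text one line at a time and hands each line to map as a k,v pair: the key is the byte offset at which the line starts in the file, and the value is the content of the line. With one-byte characters and a single \n ending each line, the three lines above become:

<0,Hello world>
<12,Hadoop spark>
<25,Hadoop java>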
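
The byte-offset keys can be reproduced with a few lines of plain Java. This is a minimal sketch, assuming UTF-8 text in which every line ends with a single \n; LineOffsetSketch and records are hypothetical names, and the code is not how Hadoop's own record reader is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsetSketch {

    // Builds "<offset,line>" records, where offset is the byte position
    // at which the line starts in the file content.
    static List<String> records(String fileContent) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            out.add("<" + offset + "," + line + ">");
            // Advance by the line's byte length plus 1 for the '\n' removed by split
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "Hello world\nHadoop spark\nHadoop java\n";
        records(content).forEach(System.out::println);
    }
}
```

Each key is the running total of the bytes in all earlier lines, including their line terminators, which is why the keys depend on the line lengths and not on the line numbers.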

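Inside map, the value is first converted to a String:

String word = value.toString();      // "Hello world"

and then split on the space character:

String[] words = word.split(" ");    // ["Hello", "world"]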

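Next, map emits <k,v> pairs to reduce: k is a word and v is the count for that single occurrence, which is always 1.

Reduce receives <k,v> pairs in which k is a word and v holds all the counts that were emitted for that word.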
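
The hand-over from map to reduce can be traced in plain Java. The sketch below makes some simplifying assumptions: a sorted in-memory map plays the role of the shuffle, and ShuffleSketch with its three methods is a hypothetical name, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {

    // map: one line in, one <word, 1> pair out for every word
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // shuffle: collect the values of equal keys into one list, keys in sorted order
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce: add up all the values that belong to one key
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : List.of("Hello world", "Hadoop spark", "Hadoop java")) {
            pairs.addAll(map(line));
        }
        Map<String, List<Integer>> grouped = shuffle(pairs);
        // prints {Hadoop=[1, 1], Hello=[1], java=[1], spark=[1], world=[1]}
        System.out.println(grouped);
        grouped.forEach((word, values) -> System.out.println(word + " " + reduce(values)));
    }
}
```

The grouped form is what a reducer sees: one call per key, with an iterable of all the 1s emitted for that word.
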
A MapReduce program consists of three modules that have to be written:

1 Mapper

2 Reducer

3 Driver

Hadoop serialization

Objects that are transferred between tasks must be serialized, so Hadoop provides its own serializable types:

long -> LongWritable

int -> IntWritable

String -> Text
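
What "serializable" means here can be shown with java.io alone. Hadoop's Writable interface declares two methods, write(DataOutput) and readFields(DataInput); the sketch below mirrors them in a stand-in interface so that it runs without Hadoop on the classpath. SimpleWritable, WordAndCount and the helper methods are hypothetical names, and the byte layout shown is that of java.io, not the layout used by Text or IntWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableSketch {

    // Stand-in with the same two methods as Hadoop's Writable interface.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // A word together with its count, serialized field by field.
    static class WordAndCount implements SimpleWritable {
        String word = "";
        int count;

        WordAndCount() { }

        WordAndCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order in which they were written
            word = in.readUTF();
            count = in.readInt();
        }
    }

    static byte[] serialize(SimpleWritable writable) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writable.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static WordAndCount deserialize(byte[] data) {
        try {
            WordAndCount result = new WordAndCount();
            result.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new WordAndCount("hadoop", 2));
        WordAndCount copy = deserialize(data);
        System.out.println(copy.word + " " + copy.count + " (" + data.length + " bytes)");
    }
}
```

The object is reduced to a compact sequence of bytes and rebuilt on the other side by reading the fields back in the same order, which is the contract the built-in Writable types follow as well.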

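The WordCount example

The job can be run locally or submitted to a cluster.

Packaging the job as a jar and running it on the cluster

bin/hdfs dfs -put 1.data /          # upload the test data to the HDFS root directory
bin/hadoop jar a.jar /1.data /g     # run the job
(a.jar is the jar exported from Eclipse; /g is a new output folder that must not exist yet)
bin/hdfs dfs -text /g/p*            # view the processed data
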

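import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 *
 * Input:  key   = byte offset of the line within the file
 *         value = content of the line
 *
 * Output: key   = a word
 *         value = the count for this occurrence (always 1)
 */
public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {

	// Reused for every record so that no new objects are allocated per word
	private final Text k = new Text();
	private final IntWritable one = new IntWritable(1);

	// map() holds the business logic and is called once per input line
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		// 1. Get one line of data
		String line = value.toString();

		// 2. Split the line on spaces
		String[] words = line.split(" ");

		for (String word : words) {
			if (word.isEmpty()) {
				continue; // blank lines and repeated spaces yield empty tokens; skip them
			}
			k.set(word); // convert the String word to a Text
			// 3. Emit <word, 1> to the reducer
			context.write(k, one);
		}
	}
}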

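import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * Example input from the map phase:
 *   hello  1
 *   hadoop 1
 *   hadoop 1
 *
 * Example output for the key hadoop:
 *   hadoop 2
 *
 * The values that share the same key are added together.
 */
public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {

		int sum = 0;

		for (IntWritable count : values) {
			sum += count.get();
		}

		// Emit <word, total>
		context.write(key, new IntWritable(sum));
	}
}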
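
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		// 1. Load the configuration
		Configuration config = new Configuration();
		// Create the Job and pass the configuration to it
		Job job = Job.getInstance(config);

		// Locate the jar that contains this class
		job.setJarByClass(Driver.class);

		// Set the map and reduce classes
		job.setMapperClass(WordCountMap.class);
		job.setReducerClass(WordCountReduce.class);

		// Set the map output types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		// Set the reduce (final) output types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Set the input and output paths: use the command-line arguments when
		// given (e.g. /1.data /g), otherwise fall back to /input and /output
		String input = args.length > 0 ? args[0] : "/input";
		String output = args.length > 1 ? args[1] : "/output";
		FileInputFormat.setInputPaths(job, new Path(input));
		FileOutputFormat.setOutputPath(job, new Path(output));

		// Submit the job and wait for it to finish
		boolean result = job.waitForCompletion(true);

		System.exit(result ? 0 : 1);
	}
}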
