Custom Word Frequency Counting with MapReduce

weixin_44928809

Categories: Big Data, java. Tags: MapReduce, hadoop

Article link: https://blog.csdn.net/weixin_44928809/article/details/102461179

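The MapReduce Programming Model

1. MapReduce is a distributed computing model for processing massive data sets.
2. MapReduce abstracts the entire parallel computation into two functions:

Map: applies a specified operation to every element of a list of independent elements, so it can be highly parallelized.
Reduce: merges the elements of a list into a result.

3. A simple MapReduce program only needs to specify map(), reduce(), the input, and the output; the framework takes care of the rest.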
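The two functions can be illustrated without Hadoop at all. The following is a minimal sketch in plain Java, assuming a small in-memory list stands in for the input; the class name `MapReduceIdea` and its methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical class, not part of Hadoop: shows the map and reduce ideas on a local list.
class MapReduceIdea {
    // Map: apply the same operation (here, squaring) to each element independently.
    static List<Integer> mapSquares(List<Integer> input) {
        return input.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce: merge a list of values into one result (here, a sum).
    static int reduceSum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> squares = mapSquares(Arrays.asList(1, 2, 3, 4));
        System.out.println(squares);            // prints [1, 4, 9, 16]
        System.out.println(reduceSum(squares)); // prints 30
    }
}
```

Because each element is mapped independently of the others, the map step is the part that can be spread across many machines.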

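Features of MapReduce

Easy to program
Good scalability
High fault tolerance
Well suited to offline processing of massive data sets at PB scale and above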

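MapReduce Concepts

1. Job: each computation request submitted by a user is called a job.
2. Task: every job is split up and handed to multiple servers for execution; each unit of execution produced by the split is called a task.
3. Tasks come in two kinds, MapTask and ReduceTask, which perform the Map and Reduce operations respectively, using the Map class and Reduce class set on the Job.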
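The relationship between a job and its tasks can be sketched in plain Java. This is only an in-memory illustration under stated assumptions: the class `JobTaskSketch` and its methods are hypothetical, input is split by a fixed number of lines (Hadoop actually splits by input split size), and the tasks run one after another in a single process rather than on separate servers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a job; this is not Hadoop code.
class JobTaskSketch {
    // Split the job's input into chunks; each chunk is handled by one map task.
    static List<List<String>> splitInput(List<String> lines, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            splits.add(lines.subList(i, Math.min(i + linesPerSplit, lines.size())));
        }
        return splits;
    }

    // One map task: emit a (word, 1) pair for every word in its split.
    static List<Map.Entry<String, Integer>> mapTask(List<String> split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : split) {
            for (String word : line.split(" ")) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // One reduce call: merge all values that share a key.
    static int reduceTask(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // The "job": run every map task, group the pairs by key, then reduce each group.
    static Map<String, Integer> runJob(List<String> lines, int linesPerSplit) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> split : splitInput(lines, linesPerSplit)) {
            for (Map.Entry<String, Integer> pair : mapTask(split)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduceTask(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop is good", "hadoop is nice", "hadoop is better");
        System.out.println(splitInput(lines, 2).size() + " map tasks"); // prints 2 map tasks
        System.out.println(runJob(lines, 2)); // prints {better=1, good=1, hadoop=3, is=3, nice=1}
    }
}
```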

MapReduce in Practice

Example: counting word frequencies

hadoop is good
hadoop is nice
hadoop is better
hadoop is best
HADOOP IS BEST
HADOOP IS BETTER
hadoopbetter

First, create a text file on Linux containing the text above, then upload it to the HDFS file system:

hdfs dfs -put /home/words /words

1. Create a Java class, MyWordCount.java
2. Put the following code in that class:

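package com.qf.mr; // matches the class name passed to `yarn jar` below

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Reused for every record so that no new objects are allocated per word
        private final Text k = new Text();
        private final IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // 1. Get the current line of the input file
            String line = value.toString();
            // 2. Split the line into words (not every job needs this step)
            String[] words = line.split(" ");
            // 3. Emit (word, 1) for each word
            for (String word : words) {
                k.set(word);
                context.write(k, v);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Running total for this key
            int counter = 0;
            // Sum every value emitted for this key
            for (IntWritable i : values) {
                counter += i.get();
            }
            // Final output of the reduce phase
            context.write(key, new IntWritable(counter));
        }
    }

    // Driver
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Create the configuration object
        Configuration conf = new Configuration();
        // 2. Adjust conf here if needed (nothing to set in this example)
        // 3. Create the job object
        Job job = Job.getInstance(conf, "mywordcount");
        // 4. Set the job's main class
        job.setJarByClass(MyWordCount.class);
        // 5. Configure the map phase
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // 6. Configure the reduce phase
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job, wait for it to finish, and print progress
        int isok = job.waitForCompletion(true) ? 0 : 1;
        // 8. Exit with 0 on success, 1 on failure
        System.exit(isok);
    }
}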
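The mapper above tokenizes with `line.split(" ")`. That is fine for the sample input, but in plain Java this call produces empty-string tokens when a line contains consecutive spaces or is blank, and those would be counted as a "word". The sketch below compares it with one possible alternative; the class `TokenizeSketch` and its methods are hypothetical, and the alternative is a suggestion rather than what the original code does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; it does not depend on Hadoop.
class TokenizeSketch {
    // Tokenize the way the mapper does: split on a single space.
    static List<String> splitOnSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // A more forgiving alternative: trim, split on runs of whitespace, drop empty tokens.
    static List<String> splitOnWhitespace(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hadoop  is good"; // note the two spaces after "hadoop"
        System.out.println(splitOnSpace(line).size());      // prints 4 (one token is empty)
        System.out.println(splitOnWhitespace(line).size()); // prints 3
        System.out.println(splitOnWhitespace(""));          // prints []
    }
}
```

If this behavior is wanted in the job, the loop body of `splitOnWhitespace` could replace the split in `MyMapper.map`.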

3. Package it as a jar

4. Copy the jar to the server

5. Use the jar to count word frequencies

yarn jar /home/QF_Online.jar com.qf.mr.MyWordCount /words /out/00

6. View the output

hdfs dfs -ls /out/00

hdfs dfs -cat /out/00/part-r-00000
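To check the result, the expected counts can be computed locally. The following plain-Java sketch assumes the input file holds exactly the seven sample lines, with single spaces and no blank lines or trailing whitespace. It also assumes the default text output (one `key<TAB>value` line per word, keys in sorted order). The class `ExpectedOutputSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local check; it mimics the job's logic without Hadoop.
class ExpectedOutputSketch {
    // Count words exactly as the mapper and reducer do together.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps keys sorted
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Render the counts as tab-separated lines, one word per line.
    static String format(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "hadoop is good", "hadoop is nice", "hadoop is better", "hadoop is best",
                "HADOOP IS BEST", "HADOOP IS BETTER", "hadoopbetter");
        System.out.print(format(count(sample)));
    }
}
```

Under those assumptions the counts are case-sensitive, so `HADOOP` (2) and `hadoop` (4) appear as separate keys, uppercase keys sort before lowercase ones, and `hadoopbetter` is counted once as its own word.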
