Required jars: log4j-core, junit, hadoop-common, hadoop-client, hadoop-hdfs
- - WCDriver class
public static void main(String[] args) throws Exception {
//Create a Job instance with the default configuration
Configuration con = new Configuration();
Job job = Job.getInstance(con);
//Tell the job which Mapper and Reducer carry the two phases
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
//Set the key and value output types of the map phase
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//Set the key and value types of the final output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//Set the input and output paths
FileInputFormat.setInputPaths(job, new Path("/local"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
//Submit the job and wait for it to finish
job.waitForCompletion(true);
}
- - WCMapper
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
//Override the map method
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] split = value.toString().split(" ");
for (String s : split) {
Text text = new Text();
text.set(s);
IntWritable intWritable = new IntWritable(1);
context.write(text, intWritable);
}
}
}
- - WCReducer
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
//Override the reduce method
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
IntWritable res = new IntWritable();
int sum = 0;
for (IntWritable one : values) {
sum += one.get();
}
res.set(sum);
context.write(key, res);
}
}
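To see what the framework does between WCMapper and WCReducer, here is a minimal plain-Java sketch (no Hadoop dependency) of the same pipeline: map each line to (word, 1) pairs, group the pairs by key (the shuffle), then sum each group. Class and variable names here are illustrative, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    public static Map<String, Integer> count(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every word, as WCMapper does
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                pairs.add(new String[]{word, "1"});
            }
        }
        // Shuffle + reduce phase: group pairs by key and sum, as WCReducer does
        Map<String, Integer> result = new TreeMap<>();
        for (String[] pair : pairs) {
            result.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("cc mm cc", "mm aa"))); // prints {aa=1, cc=2, mm=2}
    }
}
```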
Steps to run: first upload the file to be counted to HDFS.
Package the code into a jar (check the input and output paths in the driver) and upload it to the machine.
Start Hadoop, then run --------------- hadoop jar <jar path> <driver class> <input path on HDFS> <output path, which must not already exist>
Word count in Hive
1. Upload the file to the local file system.
2. Create a table in Hive: create table textlines(line string);
3. load data local inpath 'pathname' into table textlines;
4. select word, count(1) cnt from textlines lateral view explode(split(line, "\t")) alaistable as word group by word order by cnt;
Explanation: explode generates zero or more output rows for each input row, i.e. it turns one row into many (row-to-column conversion).
Lateral View is normally used together with a user-defined table-generating function (UDTF) such as explode(). Lateral View first applies the UDTF to every row of the base table, then joins the resulting output rows back to their input rows, forming a virtual table under the supplied table alias.
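The effect of the query above can be sketched in plain Java: explode(split(line, "\t")) produces one row per word from each row of textlines, and the group-by then counts each word. This is only a simulation of the query's semantics; the class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExplodeSketch {
    public static Map<String, Integer> wordCounts(List<String> textlines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : textlines) {            // one input row of the base table
            for (String word : line.split("\t")) { // explode: zero or more output rows per input row
                counts.merge(word, 1, Integer::sum); // group by word, count(1)
            }
        }
        return counts;
    }
}
```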
Word count in Scala
val lst = List("cc mm cc mm aa tt ss lal gg")
lst.map(_.split(" ").map((_, 1))).flatten.groupBy(_._1).foreach(x => println(x._1, x._2.length))
Explanation: //Idea: traverse the collection of strings and split each one, giving the words of each string; map every word to a (word, 1) tuple, producing a list of tuples per string. Flatten these lists into a single collection of tuples, group the tuples by their key (tuple._1 via x._1), then traverse the grouped result and print each key together with the number of tuples in its group, which is that word's count.
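For comparison, the same split / flatten / group-by-key / count pipeline can be written with Java streams. This is a sketch for illustration only; the class name is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StreamWordCount {
    public static Map<String, Long> count(List<String> lst) {
        return lst.stream()
                  .flatMap(s -> Arrays.stream(s.split(" ")))  // split each string and flatten
                  .collect(Collectors.groupingBy(Function.identity(),
                                                 Collectors.counting())); // groupBy + group size
    }
}
```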