Required jars: log4j-core, junit, hadoop-common, hadoop-client, hadoop-hdfs
- - WCDriver class
public static void main(String[] args) throws Exception {
//Create a Job instance with the default configuration
Configuration con = new Configuration();
Job job = Job.getInstance(con);
//Tell the job which Mapper and Reducer carry the two phases
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
//Set the key and value output types of the map phase
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//Set the key and value types of the final output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//Set the input and output paths
FileInputFormat.setInputPaths(job, new Path("/local"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
//Submit the job and wait for it to finish
job.waitForCompletion(true);
}
- - WCMapper
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
//Override the map method
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] split = value.toString().split(" ");
for (String s : split) {
Text text = new Text();
text.set(s);
IntWritable intWritable = new IntWritable(1);
context.write(text, intWritable);
}
}
}
- - WCReducer
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
//Override the reduce method
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
IntWritable res = new IntWritable();
int sum = 0;
for (IntWritable one : values) {
sum += one.get();
}
res.set(sum);
context.write(key, res);
}
}
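To see what the framework does between WCMapper and WCReducer, here is a minimal plain-Java sketch (no Hadoop dependency) of the same pipeline: map each line to (word, 1) pairs, group the pairs by key (the shuffle), then sum each group. Class and variable names here are illustrative, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    public static Map<String, Integer> count(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every word, as WCMapper does
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                pairs.add(new String[]{word, "1"});
            }
        }
        // Shuffle + reduce phase: group pairs by key and sum, as WCReducer does
        Map<String, Integer> result = new TreeMap<>();
        for (String[] pair : pairs) {
            result.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("cc mm cc", "mm aa"))); // prints {aa=1, cc=2, mm=2}
    }
}
```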
Steps to run: first upload the file to be counted to HDFS.
Package the code into a jar (check the input and output paths in the driver) and upload it to the machine.
Start Hadoop, then run --------------- hadoop jar <jar path> <driver class> <input path on HDFS> <output path, which must not already exist>
Word count in Hive
1. Upload the file to the local file system.
2. Create a table in Hive: create table textlines(line string);
3. load data local inpath 'pathname' into table textlines;
4. select word, count(1) cnt from textlines lateral view explode(split(line, "\t")) alaistable as word group by word order by cnt;
Explanation: explode generates zero or more output rows for each input row, i.e. it turns one row into many (row-to-column conversion).
Lateral View is normally used together with a user-defined table-generating function (UDTF) such as explode(). Lateral View first applies the UDTF to every row of the base table, then joins the resulting output rows back to their input rows, forming a virtual table under the supplied table alias.
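The effect of the query above can be sketched in plain Java: explode(split(line, "\t")) produces one row per word from each row of textlines, and the group-by then counts each word. This is only a simulation of the query's semantics; the class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExplodeSketch {
    public static Map<String, Integer> wordCounts(List<String> textlines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : textlines) {            // one input row of the base table
            for (String word : line.split("\t")) { // explode: zero or more output rows per input row
                counts.merge(word, 1, Integer::sum); // group by word, count(1)
            }
        }
        return counts;
    }
}
```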
Word count in Scala
val lst = List("cc mm cc mm aa tt ss lal gg")
lst.map(_.split(" ").map((_, 1))).flatten.groupBy(_._1).foreach(x => println(x._1, x._2.length))
Explanation: //Idea: traverse the collection of strings and split each one, giving the words of each string; map every word to a (word, 1) tuple, producing a list of tuples per string. Flatten these lists into a single collection of tuples, group the tuples by their key (tuple._1 via x._1), then traverse the grouped result and print each key together with the number of tuples in its group, which is that word's count.
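For comparison, the same split / flatten / group-by-key / count pipeline can be written with Java streams. This is a sketch for illustration only; the class name is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StreamWordCount {
    public static Map<String, Long> count(List<String> lst) {
        return lst.stream()
                  .flatMap(s -> Arrays.stream(s.split(" ")))  // split each string and flatten
                  .collect(Collectors.groupingBy(Function.identity(),
                                                 Collectors.counting())); // groupBy + group size
    }
}
```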