Finally, time to write some code.
Let's start by setting up the IDEA development environment.
Reference post: [ https://blog.csdn.net/u010171031/article/details/53024516 ]
Note: the following files need to be placed under the src folder. I'm testing in pseudo-distributed mode, and my machine can barely keep up.
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
If you don't include these config files, you have to set the HDFS service address manually.
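For example, a minimal sketch of doing it in code (assuming a pseudo-distributed NameNode at hdfs://localhost:9000; adjust the address to your own setup):

    Configuration conf = new Configuration();
    // only needed when the *-site.xml files are not on the classpath
    conf.set("fs.defaultFS", "hdfs://localhost:9000");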
Upload a sample file for the wordcount job to HDFS.
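One way to do it from the shell (the local file name here is just a placeholder; /tmp/wc matches the input path used in the code below):

    hdfs dfs -put wordcount_sample.txt /tmp/wc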
OK, here is my code.
1. Write the WordCount class
package com.hadoop.learn.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Uncomment to ship this jar along and submit the job to the remote server
        // conf.set("mapred.jar", "/Users/zhengyifan/app/project/bigdata-learn/hadoop/hadoop.jar");
        Job job = Job.getInstance(conf); // new Job(conf) is deprecated
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri()); // FileSystem#getName() is deprecated

        job.setReducerClass(WCReduce.class);
        // declare the reducer's output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        Path path = new Path("/tmp/wc");
        FileInputFormat.addInputPath(job, path);
        Path outpath = new Path("/tmp/out_wc");
        // the output path must not already exist, or the job refuses to start
        if (fs.exists(outpath)) {
            fs.delete(outpath, true);
        }
        FileOutputFormat.setOutputPath(job, outpath);

        boolean res = job.waitForCompletion(true);
        if (res) {
            System.out.println("Job succeeded!");
        } else {
            System.out.println("Job failed!");
        }
    }
}
2. The Mapper class
package com.hadoop.learn.wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String val = value.toString();
        String[] list = val.split(" ");
        for (String s : list) {
            System.out.println(s + "\n==="); // debug output per token
            // The IDEA setup post writes these as static fields instead,
            // which should reduce GC pressure
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
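As that comment notes, the setup post reuses the writables instead of allocating new ones for every token. A sketch of that variant (the field names are mine):

    public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // reused across map() calls, so each token no longer allocates two objects
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String s : value.toString().split(" ")) {
                word.set(s);
                context.write(word, ONE); // safe: context.write serializes immediately
            }
        }
    }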
3. The Reduce class
package com.hadoop.learn.wc;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable i : value) {
            count += i.get();
            System.out.println(key.toString() + "---" + count); // debug: running total
        }
        context.write(key, new IntWritable(count));
    }
}
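Since this reduce is a plain sum (associative and commutative), the same class could also double as a combiner to cut shuffle traffic. That's not in my code above, just an option:

    job.setCombinerClass(WCReduce.class);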
4. Run it in the test environment
Just hit Run in IDEA.
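Once it finishes, you can inspect the result straight from HDFS (part-r-00000 is the default output file name for the first reducer):

    hdfs dfs -cat /tmp/out_wc/part-r-00000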
5. Submit the job to the server from IDEA, pointing at the jar; first you need to build a jar
Reference post: [ https://www.cnblogs.com/blog5277/p/5920560.html ]
Uncomment the mapred.jar line in my WordCount class and run it.
While it's running, you can watch the job at localhost:8088.
6. Deploy the jar to the server and run it there
hadoop jar /path/to/your.jar com.your.mapreduce.class
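With the class and paths from this post, that would look something like this (the jar path depends on where your build put it):

    hadoop jar hadoop.jar com.hadoop.learn.wc.WordCount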
I ran all of this on a Mac, so I hit a problem there; I'm going to collect every issue in one place so I can look them up later. See [ https://blog.csdn.net/qq_31343581/article/details/80861790 ]