I. Hadoop Cluster Environment Setup
II. Writing the WordCount Code
1. Create a Maven project
2. Add the Hadoop and HDFS dependencies to the pom
<dependencies>
    <!-- exclude Spring Boot's default logging so it does not clash with Hadoop's own slf4j/log4j dependencies -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
        <exclusions>
            <exclusion>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-logging</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>${hadoop.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
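The ${hadoop.version} placeholder referenced above must also be defined in the pom's properties. A minimal sketch; the 2.7.7 value is only an example, match it to the version your cluster actually runs:
<properties>
    <hadoop.version>2.7.7</hadoop.version>
</properties>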
3. Mapper code
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // map is called once per line of the input file: the key is the byte offset, the value is the line of text.
    // Split each line on spaces and emit (word, 1) for every word.
    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("===========>Start Mapper");
        String line = value.toString();
        String[] words = line.split(" ");
        for (String w : words) {
            word.set(w);
            context.write(word, one);
            System.out.println("==== After Mapper: ==== " + word + "," + one);
        }
        System.out.println("===========>End Mapper");
    }
}
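For intuition: given the input line "hello world hello", this map call emits three pairs, which the framework then sorts and groups by key before handing them to the reducer:
(hello, 1)
(world, 1)
(hello, 1)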
4. Reducer code
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // reduce is called once per group (all values sharing the same key) and sums that word's counts.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println("===========>Start Reduce");
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
        System.out.println("==== After Reduce ==== " + key + ", " + result);
    }
}
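Continuing the example above: after the shuffle groups the map output by key, this reduce method receives (hello, [1, 1]) and (world, [1]) and writes (hello, 2) and (world, 1).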
5. Main method code
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // GenericOptionsParser strips generic Hadoop options (-D, -fs, ...) and leaves the job's own arguments.
            GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
            String[] remainingArgs = optionParser.getRemainingArgs();
            List<String> argList = new ArrayList<>();
            for (String arg : remainingArgs) {
                argList.add(arg);
            }
            Job job = Job.getInstance(conf, "MyWordCount");
            job.setJarByClass(WordCount.class);
            // Register our custom Mapper class and its output key/value types
            job.setMapperClass(WordCountMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            FileInputFormat.setInputPaths(job, new Path(argList.get(0)));
            // Register our custom Reducer class and the job's final output key/value types
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(argList.get(1)));
            // Submit the job and block until it completes; the true argument prints progress and details.
            job.waitForCompletion(true);
            System.out.println("Finished");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
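Optionally, because summing counts is associative and commutative, the same reducer class can also be registered as a combiner so map output is pre-aggregated locally before the shuffle. This one-line addition is an optional optimization, not part of the original job setup:
job.setCombinerClass(WordCountReducer.class);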
6. Configure a run configuration in IDEA and run
Configure the MapReduce input file and output directory
Run
Run result
Open part-r-00000
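part-r-00000 contains one tab-separated word/count pair per line, sorted by key. For the example line used earlier it would read (illustrative only; the actual content depends on your input file):
hello	2
world	1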
III. Build the jar and run it on the cluster
1. Note: never run a jar built by mvn directly on Hadoop; the corresponding classes will not be found (the exact cause is unknown)
2. The correct way to package
Set the jar output path
Set the main class
Build the jar
Upload the jar to the cluster's master node and run the following command (in is an HDFS path; the input files are stored under the in directory):
hadoop jar MapReduceDemo.jar /in /out
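A typical end-to-end sequence around that command (words.txt is a hypothetical local input file; note that /out must not exist beforehand, because FileOutputFormat refuses to write into an existing directory):
hdfs dfs -mkdir -p /in
hdfs dfs -put words.txt /in
hadoop jar MapReduceDemo.jar /in /out
hdfs dfs -cat /out/part-r-00000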