环境及工具要求:
(以笔者为例)
jdk 1.8(Windows环境下)
Maven 3.3.9(Windows环境下)
IDEA 2017+(Windows环境下)
WinScp(Windows环境下)
hadoop 2.7.3(Linux环境下)
一、配置IDEA的Maven
1.file—>new—>project
2.新窗口打开
(注意修改IDEA默认的maven路径)
修改pom.xml文件
添加下列代码:
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<!-- main()所在的类,注意修改 -->
<mainClass>com.MyWordCount.WordCountMain</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>7</source>
<target>7</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.7.3</version>
</dependency>
</dependencies>
完成后如图:
特别注意这里,com.MyWordCount为包名,WordCountMain为主类名,在后续的新建Java类时,注意修改为自己所创建的。
二、创建Java文件
1.新建包
2.新建 WordCountMapper.java
代码:
package com.MyWordCount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// 泛型 k1 v1 k2 v2
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
@Override
protected void map(LongWritable key1, Text value1, Context context)
throws IOException, InterruptedException {
//数据: I like MapReduce
String data = value1.toString();
//分词:按空格来分词
String[] words = data.split(" ");
//输出 k2 v2
for(String w:words){
context.write(new Text(w), new IntWritable(1));
}
}
}
2.新建WordCountReducer.java
代码:
package com.MyWordCount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// k3 v3 k4 v4
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text k3, Iterable<IntWritable> v3,Context context) throws IOException, InterruptedException {
//对v3求和
int total = 0;
for(IntWritable v:v3){
total += v.get();
}
//输出 k4 单词 v4 频率
context.write(k3, new IntWritable(total));
}
}
3.新建WordCountMain.java
代码:
package com.MyWordCount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountMain {
public static void main(String[] args) throws Exception {
//1.创建一个job和任务入口
Job job = Job.getInstance(new Configuration());
job.setJarByClass(WordCountMain.class); //main方法所在的class
//2.指定job的mapper和输出的类型<k2 v2>
job.setMapperClass(WordCountMapper.class);//指定Mapper类
job.setMapOutputKeyClass(Text.class); //k2的类型
job.setMapOutputValueClass(IntWritable.class); //v2的类型
//3.指定job的reducer和输出的类型<k4 v4>
job.setReducerClass(WordCountReducer.class);//指定Reducer类
job.setOutputKeyClass(Text.class); //k4的类型
job.setOutputValueClass(IntWritable.class); //v4的类型
//4.指定job的输入和输出
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//5.执行job
job.waitForCompletion(true);
}
}
三、运行Maven命令打jar包
1.使用IDEA terminal 运行Maven命令
2.打包成功
3.刷新target目录
四、上传jar包到配置有hadoop的Linux环境中
1.使用工具上传(推荐使用WinScp)
五、执行MapReduce任务
1.首先查看jar包是否上传成功
2.在Linux的~目录下新建一个ceshi.txt文件,输入测试内容
3.将文件上传至hdfs的根目录下
查看是否上传成功:
hdfs dfs -ls /
4.执行MapReduce任务
输入:
hadoop jar MyWordCount-1.0-SNAPSHOT.jar /ceshi.txt /ceshioutput
运行成功部分截图
5.查看执行命令后输出的结果
命令:
hdfs dfs -cat /ceshioutput/part-r-00000
输出结果完成了对文本中不同单词出现次数的统计
六、小节
WordCount的需求可以概括为:对程序设计语言源文件统计字符数、单词数、行数,统计结果以指定格式输出到默认文件中,以及其他扩展功能,并能够快速地处理多个文件。