WordCount example:
1. Requirement: given a text file, count the total number of occurrences of each word in it.
2. Sample data:
// file contents:
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
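Before writing any Hadoop code, the expected result for the sample data above can be checked with a plain-Java sketch (hypothetical class name, no Hadoop needed; the file contents are inlined as a string):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpectedCounts {
    // Count word occurrences exactly as the MapReduce job will:
    // split each line on spaces, add 1 per occurrence
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The twelve sample lines above, inlined
        String text = "hello world\natguigu atguigu\nhadoop\nspark\n"
                    + "hello world\natguigu atguigu\nhadoop\nspark\n"
                    + "hello world\natguigu atguigu\nhadoop\nspark";
        System.out.println(count(text));
        // prints {hello=3, world=3, atguigu=6, hadoop=3, spark=3}
    }
}
```

These are the totals the finished job should produce in its part-r-00000 output file.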
3. pom.xml configuration in IDEA:
<!-- source encoding -->
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<!-- dependencies -->
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.8.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.8.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.8.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.8.4</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.16.10</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.7</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<!-- compiler plugin: set the Java version and encoding -->
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.5.1</version>
<configuration>
<encoding>UTF-8</encoding>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
4. Analysis: following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.
The MapReduce program:
1) Write the Mapper class:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * <KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * <LongWritable, Text, Text, IntWritable>
 *
 * @author Jds
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text k = new Text();
    private final IntWritable v = new IntWritable(1);

    /**
     * In IDEA, Ctrl + O brings up the inherited methods to override.
     */
    @Override
    protected void map(LongWritable key,
                       Text value,
                       Context context) throws IOException, InterruptedException {
        // 1. Convert the Text value to a String
        String line = value.toString();
        // 2. Split the line on spaces
        String[] words = line.split(" ");
        // 3. Emit <word, 1> for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
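For a single input line such as "atguigu atguigu", the map step above emits one <word, 1> pair per token. The per-line logic can be sketched in plain Java (hypothetical class name, standing in for the Hadoop types):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapStep {
    // Mirror of the map() body: split on spaces, emit (word, 1) per token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("atguigu atguigu"));
        // prints [atguigu=1, atguigu=1]
    }
}
```

Note that the mapper does no summing at all; duplicate keys are emitted as separate pairs, and the framework groups them before the reduce step.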
2) Write the Reducer class:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * <KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 * <Text, IntWritable, Text, IntWritable>
 *
 * @author Jds
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key,
                          Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
        // 1. Initialize the count
        int count = 0;
        // 2. Sum the counts for this key
        for (IntWritable value : values) {
            count += value.get();
        }
        v.set(count);
        // 3. Write out the total count
        context.write(key, v);
    }
}
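Between map and reduce, the framework's shuffle/sort phase groups all values by key, so reduce() receives, e.g., <"atguigu", [1, 1, 1, 1, 1, 1]> with keys arriving in sorted order. That grouping-plus-summing can be sketched in plain Java (hypothetical class name; the TreeMap stands in for the framework's sorted grouping):

```java
import java.util.List;
import java.util.TreeMap;

public class ReduceStep {
    // Mirror of reduce(): sum the grouped 1s for one key
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    public static void main(String[] args) {
        // The framework delivers keys in sorted order, like a TreeMap
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("atguigu", List.of(1, 1, 1, 1, 1, 1));
        grouped.put("hello", List.of(1, 1, 1));
        grouped.forEach((k, vs) -> System.out.println(k + "\t" + reduce(vs)));
        // prints:
        // atguigu	6
        // hello	3
    }
}
```

The tab-separated key/value lines are also the format the real job writes to part-r-00000.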
3) Write the Driver class:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author Jds
 */
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Windows input path and output path, hard-coded for a local run
        args = new String[]{"C:\\Users\\Jds\\Desktop\\mapreduce\\aaa.txt",
                "C:\\Users\\Jds\\Desktop\\mapreduce\\A1"};
        // Get the configuration and create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Register the three classes by reflection
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Map output K,V types (Text, IntWritable) are the Reduce input types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Reduce output K,V types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output directories
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job: waitForCompletion() calls submit() internally, then
        // blocks until the job finishes and returns whether it succeeded
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
Running the MR jar on Linux:
1. A job run on Linux must not use the Windows paths, so comment out the code that hard-codes the Windows args.
2. Package the project: in the Maven panel on the right --> project name --> Lifecycle --> double-click clean --> double-click package.
3. When the build finishes, a target directory is created under the project, containing a project-name-version.jar file,
e.g. MapReduce-1.0-SNAPSHOT.jar.
4. Copy the .jar file to the Linux machine.
5. In the directory on Linux that holds the .jar, run:
hadoop jar MapReduce-1.0-SNAPSHOT.jar WordCount.WordCountDriver /aaa /aaa1
/aaa: the input text on HDFS to process; /aaa1: the output directory (must not already exist).
6. View the results:
hadoop fs -cat /aaa1/part-r-00000