Preface: To test MapReduce programs locally you need a local Hadoop installation. It works just like installing a JDK: configure HADOOP_HOME instead of JAVA_HOME, then run the hadoop command in cmd to verify the installation, exactly as you would verify a JDK setup.
Hadoop package: https://pan.baidu.com/s/1bPlkKnYLXsfOjMtcK1Nq8g
Password: nzqg
I. Task Analysis
1. Requirement: count the number of occurrences of each word in a given text file.
The file content is as follows:
java java hello scala java
baidu alib meituan scala baidu
alib java scala java alib
2. Writing the program
Count the occurrences of each word in the text above.
2.1 Project setup
1. Create a Maven project named mr-study
2. Add the pom dependencies
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>
- Create log4j.properties under resources with the following content, so that job run logs are printed:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
2.2 Writing the Mapper
package com.cjy.mr.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * WordCount map class: reads the input data.
 *
 * The default FileInputFormat implementation is TextInputFormat, which reads the file
 * line by line: the key is the byte offset of the line, the value is the line's text.
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Output key/value objects, i.e. the types the reducer receives
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the current line of text
        String line = value.toString();
        // Split on \t (the sample file is tab-delimited; use "\\s+" for space-delimited input)
        String[] words = line.split("\t");
        // Write out each word; the framework groups values with the same key before reduce
        for (String word : words) {
            k.set(word);
            // emits e.g. (java, 1)
            context.write(k, v);
        }
    }
}
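To see concretely what the map phase emits, here is a plain-Java sketch (no Hadoop required) that mimics the mapper's split-and-emit logic on the first sample line; the class name MapSim is made up for illustration and is not part of the project.

```java
import java.util.ArrayList;
import java.util.List;

public class MapSim {
    // Mimics WordcountMapper: split a tab-delimited line and emit a (word, 1) pair per word
    static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : line.split("\t")) {
            pairs.add(new String[]{word, "1"});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // First line of the sample file (tab-delimited)
        for (String[] p : map("java\tjava\thello\tscala\tjava")) {
            System.out.println(p[0] + "\t" + p[1]);
        }
    }
}
```

Note that the mapper emits one pair per occurrence; counting happens later, in the reduce phase.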
2.3 Writing the Reducer
package com.cjy.mr.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Reducer class: counts the occurrences of each word.
 * Data arrives from the map phase grouped by key, e.g.:
 *
 * (java, (1, 1, 1, 1))
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum the counts
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Write the result, e.g. (java, 5)
        v.set(sum);
        context.write(key, v);
    }
}
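Between map and reduce, the framework groups the emitted pairs by key and hands each key its list of values; the reducer then just sums them. That grouping-and-summing can be sketched in plain Java (the class name ReduceSim is illustrative only, not part of the project):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSim {
    // Mimics shuffle + WordcountReducer: group words by key and add 1 per occurrence
    static Map<String, Integer> reduce(List<String> words) {
        // TreeMap keeps keys sorted, matching the sorted order of MapReduce output
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // All words of the sample file, in file order
        List<String> words = Arrays.asList(
                "java", "java", "hello", "scala", "java",
                "baidu", "alib", "meituan", "scala", "baidu",
                "alib", "java", "scala", "java", "alib");
        reduce(words).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Running this prints the same counts as section 2.5 below (alib 3, baidu 2, hello 1, java 5, meituan 1, scala 3).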
2.4 Writing the Driver
package com.cjy.mr.wordcount;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and create the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar to load classes from
        job.setJarByClass(WordcountDriver.class);
        // 3. Set the mapper and reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        // Use command-line arguments instead when running on the cluster:
        // FileInputFormat.setInputPaths(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileInputFormat.setInputPaths(job, new Path("/Users/chenjunying/Downloads/wd.txt")); // change to your local input file
        FileOutputFormat.setOutputPath(job, new Path("/Users/chenjunying/Downloads/out/")); // output location (must not already exist)
        // 7. Submit the job and wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
2.5 Result
alib 3
baidu 2
hello 1
java 5
meituan 1
scala 3
2.6 Analyzing the Run Log
Only the important parts of the log are shown.
One input file to process:
2020-05-09 21:42:14,255 INFO [org.apache.hadoop.mapreduce.lib.input.FileInputFormat] - Total input paths to process : 1
One split was created: with the default TextInputFormat, input is split at the 128 MB block size, so this small file yields a single split.
2020-05-09 21:42:14,310 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - number of splits:1
The job id: this is a local run, so the id contains "local".
2020-05-09 21:42:14,390 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local751775058_0001
Merge-sorting the map output:
2020-05-09 21:42:14,766 INFO [org.apache.hadoop.mapred.Merger] - Merging 1 sorted segments
Only when you see "map 100% reduce 100%" can you be sure the job completed successfully.
2020-05-09 21:42:15,522 INFO [org.apache.hadoop.mapreduce.Job] - map 100% reduce 100%
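The "number of splits:1" line follows from the input size versus the split size, which by default equals the 128 MB block size. A simplified calculation, as a sketch only (the real FileInputFormat also takes configurable min/max split sizes and a slack factor on the last split into account):

```java
public class SplitCalc {
    // Approximate split count: ceil(fileSize / splitSize)
    static long splits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // default HDFS block size in Hadoop 2.x
        System.out.println(splits(80, blockSize));                 // tiny wd.txt -> 1 split
        System.out.println(splits(300L * 1024 * 1024, blockSize)); // a 300 MB file -> 3 splits
    }
}
```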
II. Packaging and Running on the Cluster
1. Add the build plugin dependencies
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <!-- Package the runtime dependencies into a single jar -->
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.cjy.mr.wordcount.WordcountDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
2. Build the jar with Maven and upload it to the cluster together with wd.txt
The jar stays on a cluster node's local filesystem; wd.txt must be uploaded to HDFS.
1. Upload the input file
[root@hadoop102 hadoop-2.7.2]# hadoop fs -put wd.txt /
2. Run the job
[root@hadoop102 hadoop-2.7.2]# hadoop jar mr-study-1.0-SNAPSHOT-jar-with-dependencies.jar com.cjy.mr.wordcount.WordcountDriver /wd.txt /out/
3. Viewing the cluster run log
Because the jar's manifest declares the main class, the class name can be omitted from the command:
[root@hadoop102 hadoop-2.7.2]# hadoop jar mr-study-1.0-SNAPSHOT-jar-with-dependencies.jar /wd.txt /out/
Requesting resources from YARN:
20/05/09 22:17:40 INFO client.RMProxy: Connecting to ResourceManager at hadoop103/10.211.55.103:8032
20/05/09 22:17:41 INFO mapreduce.JobSubmitter: number of splits:1
This time the job id no longer contains "local":
20/05/09 22:17:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1589033576096_0003
20/05/09 22:17:42 INFO impl.YarnClientImpl: Submitted application application_1589033576096_0003
20/05/09 22:17:42 INFO mapreduce.Job: The url to track the job: http://hadoop103:8088/proxy/application_1589033576096_0003/
20/05/09 22:17:42 INFO mapreduce.Job: Running job: job_1589033576096_0003
20/05/09 22:17:48 INFO mapreduce.Job: Job job_1589033576096_0003 running in uber mode : false
20/05/09 22:17:48 INFO mapreduce.Job: map 0% reduce 0%
20/05/09 22:17:52 INFO mapreduce.Job: map 100% reduce 0%
20/05/09 22:18:00 INFO mapreduce.Job: map 100% reduce 100%
Demo source: https://github.com/chenjy512/bigdata_study/tree/master/mr-zhdemo/mr-study