1. Deploying a Pseudo-Distributed Hadoop Cluster
- Cluster version: 2.10.1, download link: hadoop-2.10.1
- Deployment follows the official documentation: Pseudo-Distributed Operation. Points where this setup deviates from the official doc:
- Before starting HDFS, configure a real JAVA_HOME path in `hadoop-env.sh` instead of the `${JAVA_HOME}` reference, for example:
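  A minimal sketch, assuming OpenJDK 8 at the Debian/Ubuntu default location (the same path this setup uses for yarn-env.sh below):

  ```bash
  # In etc/hadoop/hadoop-env.sh, replace the ${JAVA_HOME} reference with a literal path
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  ```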
- After HDFS starts, three processes should be running: NameNode, DataNode, and SecondaryNameNode; see the `jps` check below.
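  A quick sanity check with the JDK's `jps` tool (PIDs will differ):

  ```bash
  jps
  # Expected to list, among others:
  #   NameNode
  #   DataNode
  #   SecondaryNameNode
  ```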
- `yarn-site.xml` needs an explicit ResourceManager web port, in case the default port 8088 is already occupied:

  ```xml
  <configuration>
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
      </property>
      <property>
          <name>yarn.resourcemanager.webapp.address</name>
          <value>hadoop:23030</value>
      </property>
  </configuration>
  ```
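  To check whether 8088 is actually occupied before picking another port, one option (assuming the iproute2 `ss` utility is available):

  ```bash
  # List listening TCP sockets on the default ResourceManager web port
  ss -ltn | grep 8088
  ```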
- Before starting YARN, a real JAVA_HOME path likewise needs to be configured, this time in `yarn-env.sh`. JAVA_HOME is commented out in that file, so the following can be added directly:

  ```bash
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  ```
- After YARN starts, two new Java processes appear: NodeManager and ResourceManager
- Configure the Hadoop environment variables so commands can be run from anywhere, without first changing into the installation directory:

  ```bash
  export HADOOP_HOME=/home/hadoop/hadoop-2.10.1
  export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
  ```
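  For these variables to apply to every new shell, one option (assuming bash is the login shell) is to append them to ~/.bashrc and reload it:

  ```bash
  # Reload the shell configuration after appending the exports above
  source ~/.bashrc
  # Verify: this should print the Hadoop version from any directory
  hadoop version
  ```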
2. Writing the MR Program
- Configure the Maven dependencies:

  ```xml
  <properties>
      <hadoop.version>2.10.1</hadoop.version>
  </properties>

  <dependencies>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>${hadoop.version}</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>${hadoop.version}</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
          <version>${hadoop.version}</version>
      </dependency>
  </dependencies>

  <build>
      <plugins>
          <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>3.8.1</version>
              <configuration>
                  <source>1.8</source>
                  <target>1.8</target>
              </configuration>
          </plugin>
      </plugins>
  </build>
  ```
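  With the POM in place, the job jar can be built with Maven. The jar name below assumes an artifactId of `score-mr` and version `1.0-SNAPSHOT`, matching the jar used in the run command later:

  ```bash
  # Compile and package the job jar into target/
  mvn clean package
  # The resulting artifact, given artifactId score-mr and version 1.0-SNAPSHOT
  ls target/score-mr-1.0-SNAPSHOT.jar
  ```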
- Write the mapper:

  ```java
  package com.hadoop.score.mr;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  import java.io.IOException;

  public class ScoreMap extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          // Each input value is one line; split it into subject and score
          String line = value.toString();
          String[] data = line.split(",");
          int score = Integer.parseInt(data[1]);

          // Emit <subject, score>
          context.write(new Text(data[0]), new IntWritable(score));
      }
  }
  ```
- Write the reducer:

  ```java
  package com.hadoop.score.mr;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  import java.io.IOException;

  public class ScoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          // Track the maximum score seen for this subject
          int maxScore = Integer.MIN_VALUE;
          for (IntWritable score : values) {
              maxScore = Math.max(maxScore, score.get());
          }

          // Emit <subject, max score>
          context.write(key, new IntWritable(maxScore));
      }
  }
  ```
- Write the main class that creates and configures the job:

  ```java
  package com.hadoop.score.mr;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ScoreMR {

      public static void main(String[] args) throws Exception {
          if (args.length != 2) {
              System.out.println("Usage: ScoreMR <input_path> <output_path>");
              System.exit(-1);
          }

          // Create an MR job: the client-side unit of work to be submitted
          // (Job.getInstance() replaces the deprecated new Job() constructor)
          Job job = Job.getInstance();
          job.setJarByClass(ScoreMR.class);
          job.setJobName("score mapreduce");

          // Set the input/output directories
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          // Set the mapper and reducer
          job.setMapperClass(ScoreMap.class);
          job.setReducerClass(ScoreReducer.class);

          // Set the key-value types of the output
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          // Exit once the job has finished
          System.exit(job.waitForCompletion(true) ? 0 : -1);
      }
  }
  ```
3. Running the MR Program
- First prepare the data file, score.txt, with one `subject,score` record per line:

  ```text
  math,86
  english,124
  english,80
  math,93
  math,77
  ```
- Upload the file to HDFS; the full destination path is /user/hadoop/input:

  ```bash
  hdfs dfs -put score.txt input/
  ```
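  To confirm the upload landed where expected (relative paths resolve against the user's HDFS home, here /user/hadoop):

  ```bash
  # List the uploaded file; expects /user/hadoop/input/score.txt
  hdfs dfs -ls input/
  ```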
- Run the MR program:

  ```bash
  hadoop jar score-mr-1.0-SNAPSHOT.jar com.hadoop.score.mr.ScoreMR input/score.txt output/score_2
  ```
- At first the program kept failing; recording the error here for reference:

  ```text
  [2021-04-06 01:16:31.227]Container exited with a non-zero exit code 255. Error file: prelaunch.err.
  Last 4096 bytes of prelaunch.err :
  Last 4096 bytes of stderr :
  Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
  INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
  Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
  INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
  Apr 06, 2021 1:16:26 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
  INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
  Apr 06, 2021 1:16:26 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
  INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
  Apr 06, 2021 1:16:27 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
  INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
  Apr 06, 2021 1:16:27 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
  INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
  Apr 06, 2021 1:16:28 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
  INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
  log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
  log4j:WARN Please initialize the log4j system properly.
  log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
  ```
- Cause of the error: the machine's hostname contained an underscore. Fix: change the hostname (both /etc/hostname and /etc/hosts), reboot the machine, then restart the Hadoop cluster; a sketch follows below.
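  As a sketch, on a systemd-based machine the rename can be done with hostnamectl; the name `hadoop` is an assumption here, chosen to match the host used in yarn-site.xml above:

  ```bash
  # Check the current hostname for underscores
  hostname
  # Rename the machine (assumes systemd; pick a name without underscores)
  sudo hostnamectl set-hostname hadoop
  # /etc/hosts also needs a matching entry, e.g.:
  #   127.0.0.1   hadoop
  ```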
- Rerunning the job then produces the following output:

  ```text
  21/04/06 09:50:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
  21/04/06 09:50:47 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
  21/04/06 09:50:48 INFO input.FileInputFormat: Total input files to process : 1
  21/04/06 09:50:49 INFO mapreduce.JobSubmitter: number of splits:1
  21/04/06 09:50:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1617673573380_0002
  21/04/06 09:50:49 INFO conf.Configuration: resource-types.xml not found
  21/04/06 09:50:49 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
  21/04/06 09:50:49 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
  21/04/06 09:50:49 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
  21/04/06 09:50:50 INFO impl.YarnClientImpl: Submitted application application_1617673573380_0002
  21/04/06 09:50:50 INFO mapreduce.Job: The url to track the job: http://hadoop:23030/proxy/application_1617673573380_0002/
  21/04/06 09:50:50 INFO mapreduce.Job: Running job: job_1617673573380_0002
  21/04/06 09:51:01 INFO mapreduce.Job: Job job_1617673573380_0002 running in uber mode : false
  21/04/06 09:51:01 INFO mapreduce.Job:  map 0% reduce 0%
  21/04/06 09:51:10 INFO mapreduce.Job:  map 100% reduce 0%
  21/04/06 09:51:18 INFO mapreduce.Job:  map 100% reduce 100%
  21/04/06 09:51:19 INFO mapreduce.Job: Job job_1617673573380_0002 completed successfully
  21/04/06 09:51:19 INFO mapreduce.Job: Counters: 49
      File System Counters
          FILE: Number of bytes read=67
          FILE: Number of bytes written=416357
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=158
          HDFS: Number of bytes written=20
          HDFS: Number of read operations=6
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=2
      Job Counters
          Launched map tasks=1
          Launched reduce tasks=1
          Data-local map tasks=1
          Total time spent by all maps in occupied slots (ms)=6625
          Total time spent by all reduces in occupied slots (ms)=5080
          Total time spent by all map tasks (ms)=6625
          Total time spent by all reduce tasks (ms)=5080
          Total vcore-milliseconds taken by all map tasks=6625
          Total vcore-milliseconds taken by all reduce tasks=5080
          Total megabyte-milliseconds taken by all map tasks=6784000
          Total megabyte-milliseconds taken by all reduce tasks=5201920
      Map-Reduce Framework
          Map input records=5
          Map output records=5
          Map output bytes=51
          Map output materialized bytes=67
          Input split bytes=111
          Combine input records=0
          Combine output records=0
          Reduce input groups=2
          Reduce shuffle bytes=67
          Reduce input records=5
          Reduce output records=2
          Spilled Records=10
          Shuffled Maps =1
          Failed Shuffles=0
          Merged Map outputs=1
          GC time elapsed (ms)=257
          CPU time spent (ms)=2470
          Physical memory (bytes) snapshot=469913600
          Virtual memory (bytes) snapshot=4328771584
          Total committed heap usage (bytes)=326631424
      Shuffle Errors
          BAD_ID=0
          CONNECTION=0
          IO_ERROR=0
          WRONG_LENGTH=0
          WRONG_MAP=0
          WRONG_REDUCE=0
      File Input Format Counters
          Bytes Read=47
      File Output Format Counters
          Bytes Written=20
  ```
- Analysis of the output:
  ① The input data forms a single split, so one map task is allocated
  ② The whole job consists of one map task and one reduce task
  ③ The Map-Reduce Framework counters show the record counts into and out of the map and reduce phases (Map input records=5, Reduce output records=2)
- The result matches expectations: the maximum for english is max(124, 80) = 124, and for math it is max(86, 93, 77) = 93:

  ```text
  english	124
  math	93
  ```
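  The result can be read directly from HDFS; for a single-reducer job the output file is named part-r-00000 by default:

  ```bash
  # Print the reducer output
  hdfs dfs -cat output/score_2/part-r-00000
  ```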
4. Some Notes
- When specifying the job's inputPath and outputPath, the directory at outputPath must not already exist. If it does, the job fails with an error like this:

  ```text
  21/04/06 09:47:18 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop:9000/user/hadoop/output/wordcount_1 already exists
      at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
      ...
  ```
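  A simple workaround before rerunning a job with the same outputPath is to delete the stale directory first (destructive, so only for throwaway results):

  ```bash
  # Remove the old output directory so FileOutputFormat's existence check passes
  hdfs dfs -rm -r output/score_2
  ```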
- A job's run details can be viewed in the ResourceManager web UI. The default port is 8088; this setup uses 23030, as configured in yarn-site.xml above.