Setting up an Eclipse environment for developing MapReduce programs

    Eclipse is a widely used Java IDE. Although it is not as powerful as IDEA, it is still a perfectly reasonable choice for developers who are used to it. For those developers, the goal is to make Eclipse smarter and friendlier for specific workflows. For MapReduce development in particular, the hadoop-eclipse-plugin lets Eclipse browse the HDFS file system visually and create MapReduce projects without manually adding the Hadoop jars.

    This article uses Eclipse 4.12.0 together with hadoop-eclipse-plugin-2.7.0 to develop MapReduce programs. The hadoop-eclipse-plugin is a single jar; you can download it, or build it yourself from source. Building requires Ant, a JDK, a local Hadoop installation directory, and the HADOOP_HOME environment variable; a step-by-step build tutorial is covered in a separate article.

    Eclipse version information (screenshot):

    hadoop-eclipse-plugin-2.7.0.jar (screenshot):

    1) Copy the plugin jar into the plugins directory of the Eclipse installation and restart Eclipse. If a Hadoop Map/Reduce entry appears under Window->Preferences, the installation succeeded. Note: make sure the plugin version matches your Hadoop version; a mismatch can make Eclipse report errors on startup, or leave the plugin unusable.

    2) With the plugin in place, point it at the local Hadoop installation: Window->Preferences->Hadoop Map/Reduce->Hadoop install directory. Here that is E:\apache-hadoop\hadoop-2.7.0.

    3) Once the plugin is installed, a Map/Reduce Project option appears in the New Project wizard, and a Map/Reduce Locations view is available under Window->Show View.

    4) To browse HDFS conveniently, create a new MapReduce location. Fill in the exact host name and port; for hadoop-2.7.0 you can use Host: the VM's IP, Port: 9000. Pay close attention to the port value here.

After filling this in, click save, then open the Project Explorer panel. A new DFS Locations node appears; expanding it shows the directories and files on HDFS:
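The port in step 4 must match the NameNode RPC address configured on the cluster. For the hadoop-2.7.0 setup used here (the IP is the one that appears in the job logs later in this article), the cluster's core-site.xml would contain something like:

```xml
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://192.168.56.202:9000</value>
</property>
```

If your cluster's fs.defaultFS uses a different port, enter that value in the Map/Reduce location instead.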

    5) Create a MapReduce project: File->New->Project, then choose Map/Reduce Project.

Once the project is created, the structure looks as follows. The Hadoop jars from the local Hadoop installation directory are added automatically, so there is no need to import any jars by hand, which is very convenient.

    6) Write a simple wordcount program:

package com.xxx.hadoop.mapred;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * A minimal word-frequency counting program.
 */
public class WordCountApp {

	public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
		private IntWritable one = new IntWritable(1);

		@Override
		protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
			// Emit (word, 1) for every whitespace-separated token in the line.
			StringTokenizer tokenizer = new StringTokenizer(value.toString());
			while (tokenizer.hasMoreTokens()) {
				Text word = new Text(tokenizer.nextToken());
				context.write(word, one);
			}
		}
	}

	public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable result = new IntWritable();

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
			// Sum the counts for each word.
			int sum = 0;
			for (IntWritable value : values) {
				sum += value.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	public static void main(String[] args) throws Exception {
		// Access HDFS as the root user (avoids the permission error discussed in step 8).
		System.setProperty("HADOOP_USER_NAME", "root");
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		String input = args[0];   // input path on HDFS
		String output = args[1];  // output path on HDFS (must not exist yet)

		job.setJarByClass(WordCountApp.class);

		FileInputFormat.addInputPath(job, new Path(input));
		FileOutputFormat.setOutputPath(job, new Path(output));

		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}
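The mapper and reducer logic above can be sanity-checked without any cluster: the map phase emits (word, 1) for each token, and the reduce phase sums the counts per key. A plain-JDK sketch of that same flow (the class name LocalWordCount is illustrative, not part of the project):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Local sketch of the mapper+reducer logic: tokenize each line into
// (word, 1) pairs, then sum the counts per word. Pure JDK, no Hadoop.
public class LocalWordCount {
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // Same whitespace tokenization as the mapper above.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // merge() plays the role of the reducer: sum per key.
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(new String[] {"hello hadoop", "hello eclipse"});
        System.out.println(counts); // {hello=2, hadoop=1, eclipse=1}
    }
}
```

This is only a mental model: in the real job the framework sorts and groups the pairs between map and reduce, and the work is distributed across tasks.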

    7) To run the job, choose Run As->Run Configurations and add two program arguments, representing the input path and the output path respectively.

    8) Problems encountered while running.

org.apache.hadoop.security.AccessControlException: Permission denied: user=hadoop, access=WRITE, inode="/user":root:supergroup:drwxr-xr-x
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:307)

This is a permission problem, and there are two ways to solve it. One is to modify the remote Hadoop configuration by adding the following to hdfs-site.xml:

<property>
   <name>dfs.permissions.enabled</name>
   <value>false</value>
</property>

After changing the configuration, restart Hadoop and run the program again; the permission exception no longer occurs. However, this approach is only suitable for a test environment. A production cluster would not casually let ordinary users write into other users' directories, so disabling permission checks is a workaround at best. The proper fix is actually simple: make the user running the program (HADOOP_USER_NAME) the user that owns the target directory. We can set this explicitly in code; here the error indicates the root user is needed, so it can be specified like this:

System.setProperty("HADOOP_USER_NAME", "root");

Put this line first in the main method.

Another small issue: each run creates the output directory, and if that directory already exists, the next run fails. To avoid the error, delete the directory from HDFS first, e.g. hdfs dfs -rm -r /user/output (the older -rmr form still works but is deprecated).

Log output from a successful run:

2019-08-29 23:48:41,567 [main] [INFO ] org.apache.hadoop.conf.Configuration.deprecation session.id is deprecated. Instead, use dfs.metrics.session-id
2019-08-29 23:48:41,569 [main] [INFO ] org.apache.hadoop.metrics.jvm.JvmMetrics Initializing JVM Metrics with processName=JobTracker, sessionId=
2019-08-29 23:48:41,960 [main] [WARN ] org.apache.hadoop.mapreduce.JobResourceUploader Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2019-08-29 23:48:42,014 [main] [WARN ] org.apache.hadoop.mapreduce.JobResourceUploader No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2019-08-29 23:48:42,063 [main] [INFO ] org.apache.hadoop.mapreduce.lib.input.FileInputFormat Total input paths to process : 1
2019-08-29 23:48:42,183 [main] [INFO ] org.apache.hadoop.mapreduce.JobSubmitter number of splits:1
2019-08-29 23:48:42,567 [main] [INFO ] org.apache.hadoop.mapreduce.JobSubmitter Submitting tokens for job: job_local1702410896_0001
2019-08-29 23:48:42,852 [main] [INFO ] org.apache.hadoop.mapreduce.Job The url to track the job: http://localhost:8080/
2019-08-29 23:48:42,854 [main] [INFO ] org.apache.hadoop.mapreduce.Job Running job: job_local1702410896_0001
2019-08-29 23:48:42,862 [Thread-3] [INFO ] org.apache.hadoop.mapred.LocalJobRunner OutputCommitter set in config null
2019-08-29 23:48:42,870 [Thread-3] [INFO ] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter File Output Committer Algorithm version is 1
2019-08-29 23:48:42,873 [Thread-3] [INFO ] org.apache.hadoop.mapred.LocalJobRunner OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2019-08-29 23:48:42,946 [Thread-3] [INFO ] org.apache.hadoop.mapred.LocalJobRunner Waiting for map tasks
2019-08-29 23:48:42,950 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.LocalJobRunner Starting task: attempt_local1702410896_0001_m_000000_0
2019-08-29 23:48:42,992 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter File Output Committer Algorithm version is 1
2019-08-29 23:48:43,003 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree ProcfsBasedProcessTree currently is supported only on Linux.
2019-08-29 23:48:43,053 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.Task  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@9057f5
2019-08-29 23:48:43,061 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask Processing split: hdfs://192.168.56.202:9000/user/root/wordcount.txt:0+85
2019-08-29 23:48:43,115 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask (EQUATOR) 0 kvi 26214396(104857584)
2019-08-29 23:48:43,116 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask mapreduce.task.io.sort.mb: 100
2019-08-29 23:48:43,116 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask soft limit at 83886080
2019-08-29 23:48:43,116 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask bufstart = 0; bufvoid = 104857600
2019-08-29 23:48:43,116 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask kvstart = 26214396; length = 6553600
2019-08-29 23:48:43,121 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-08-29 23:48:43,463 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.LocalJobRunner 
2019-08-29 23:48:43,466 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask Starting flush of map output
2019-08-29 23:48:43,466 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask Spilling map output
2019-08-29 23:48:43,466 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask bufstart = 0; bufend = 145; bufvoid = 104857600
2019-08-29 23:48:43,466 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask kvstart = 26214396(104857584); kvend = 26214340(104857360); length = 57/6553600
2019-08-29 23:48:43,484 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.MapTask Finished spill 0
2019-08-29 23:48:43,491 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.Task Task:attempt_local1702410896_0001_m_000000_0 is done. And is in the process of committing
2019-08-29 23:48:43,504 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.LocalJobRunner map
2019-08-29 23:48:43,505 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.Task Task 'attempt_local1702410896_0001_m_000000_0' done.
2019-08-29 23:48:43,505 [LocalJobRunner Map Task Executor #0] [INFO ] org.apache.hadoop.mapred.LocalJobRunner Finishing task: attempt_local1702410896_0001_m_000000_0
2019-08-29 23:48:43,506 [Thread-3] [INFO ] org.apache.hadoop.mapred.LocalJobRunner map task executor complete.
2019-08-29 23:48:43,509 [Thread-3] [INFO ] org.apache.hadoop.mapred.LocalJobRunner Waiting for reduce tasks
2019-08-29 23:48:43,510 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.LocalJobRunner Starting task: attempt_local1702410896_0001_r_000000_0
2019-08-29 23:48:43,520 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter File Output Committer Algorithm version is 1
2019-08-29 23:48:43,520 [pool-6-thread-1] [INFO ] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree ProcfsBasedProcessTree currently is supported only on Linux.
2019-08-29 23:48:43,554 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Task  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@597152e3
2019-08-29 23:48:43,558 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.ReduceTask Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1b9caf3f
2019-08-29 23:48:43,575 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl MergerManager: memoryLimit=1265788544, maxSingleShuffleLimit=316447136, mergeThreshold=835420480, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2019-08-29 23:48:43,578 [EventFetcher for fetching Map Completion Events] [INFO ] org.apache.hadoop.mapreduce.task.reduce.EventFetcher attempt_local1702410896_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2019-08-29 23:48:43,632 [localfetcher#1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.LocalFetcher localfetcher#1 about to shuffle output of map attempt_local1702410896_0001_m_000000_0 decomp: 177 len: 181 to MEMORY
2019-08-29 23:48:43,647 [localfetcher#1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput Read 177 bytes from map-output for attempt_local1702410896_0001_m_000000_0
2019-08-29 23:48:43,671 [localfetcher#1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl closeInMemoryFile -> map-output of size: 177, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->177
2019-08-29 23:48:43,673 [EventFetcher for fetching Map Completion Events] [INFO ] org.apache.hadoop.mapreduce.task.reduce.EventFetcher EventFetcher is interrupted.. Returning
2019-08-29 23:48:43,675 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.LocalJobRunner 1 / 1 copied.
2019-08-29 23:48:43,676 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2019-08-29 23:48:43,687 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Merger Merging 1 sorted segments
2019-08-29 23:48:43,688 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Merger Down to the last merge-pass, with 1 segments left of total size: 173 bytes
2019-08-29 23:48:43,689 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl Merged 1 segments, 177 bytes to disk to satisfy reduce memory limit
2019-08-29 23:48:43,690 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl Merging 1 files, 181 bytes from disk
2019-08-29 23:48:43,691 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl Merging 0 segments, 0 bytes from memory into reduce
2019-08-29 23:48:43,691 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Merger Merging 1 sorted segments
2019-08-29 23:48:43,693 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Merger Down to the last merge-pass, with 1 segments left of total size: 173 bytes
2019-08-29 23:48:43,693 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.LocalJobRunner 1 / 1 copied.
2019-08-29 23:48:43,723 [pool-6-thread-1] [INFO ] org.apache.hadoop.conf.Configuration.deprecation mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2019-08-29 23:48:43,857 [main] [INFO ] org.apache.hadoop.mapreduce.Job Job job_local1702410896_0001 running in uber mode : false
2019-08-29 23:48:43,859 [main] [INFO ] org.apache.hadoop.mapreduce.Job  map 100% reduce 0%
2019-08-29 23:48:44,324 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Task Task:attempt_local1702410896_0001_r_000000_0 is done. And is in the process of committing
2019-08-29 23:48:44,340 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.LocalJobRunner 1 / 1 copied.
2019-08-29 23:48:44,340 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Task Task attempt_local1702410896_0001_r_000000_0 is allowed to commit now
2019-08-29 23:48:44,379 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter Saved output of task 'attempt_local1702410896_0001_r_000000_0' to hdfs://192.168.56.202:9000/user/output/_temporary/0/task_local1702410896_0001_r_000000
2019-08-29 23:48:44,380 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.LocalJobRunner reduce > reduce
2019-08-29 23:48:44,380 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.Task Task 'attempt_local1702410896_0001_r_000000_0' done.
2019-08-29 23:48:44,380 [pool-6-thread-1] [INFO ] org.apache.hadoop.mapred.LocalJobRunner Finishing task: attempt_local1702410896_0001_r_000000_0
2019-08-29 23:48:44,380 [Thread-3] [INFO ] org.apache.hadoop.mapred.LocalJobRunner reduce task executor complete.
2019-08-29 23:48:44,860 [main] [INFO ] org.apache.hadoop.mapreduce.Job  map 100% reduce 100%
2019-08-29 23:48:44,861 [main] [INFO ] org.apache.hadoop.mapreduce.Job Job job_local1702410896_0001 completed successfully
2019-08-29 23:48:44,886 [main] [INFO ] org.apache.hadoop.mapreduce.Job Counters: 35
	File System Counters
		FILE: Number of bytes read=726
		FILE: Number of bytes written=549119
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=170
		HDFS: Number of bytes written=106
		HDFS: Number of read operations=13
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Map-Reduce Framework
		Map input records=2
		Map output records=15
		Map output bytes=145
		Map output materialized bytes=181
		Input split bytes=115
		Combine input records=0
		Combine output records=0
		Reduce input groups=13
		Reduce shuffle bytes=181
		Reduce input records=15
		Reduce output records=13
		Spilled Records=30
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=18
		Total committed heap usage (bytes)=494927872
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=85
	File Output Format Counters 
		Bytes Written=106

    9) After a successful run, inspect the results directly in Eclipse's Project Explorer view by expanding the output directory. It normally contains two files: _SUCCESS, and part-r-00000 holding the actual output. The latter can also be viewed from the command line with hdfs dfs -cat /user/output/part-r-00000.

    With that, a simple wordcount program runs successfully on the local machine through Eclipse and its plugin, with no need to repeatedly package a jar and run it on the Hadoop deployment machine.

    A few points worth noting:

    The local Hadoop "installation" can be a real installation, or simply a downloaded pre-built Hadoop distribution; it does not need to run on this machine. Just unpack it and set HADOOP_HOME and related environment variables. As I understand it, this local copy serves two purposes: it is what Window->Preferences->Hadoop Map/Reduce->Hadoop install directory points at, and its jars are the ones added automatically when a MapReduce project is created.
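On Windows, pointing the environment at the unpacked distribution is just a matter of setting the variables. A sketch matching the install path used above (adjust the path to your own; setting them permanently via System Properties works too):

```bat
set HADOOP_HOME=E:\apache-hadoop\hadoop-2.7.0
set PATH=%PATH%;%HADOOP_HOME%\bin
```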

    The other issue: after running a MapReduce program once from Eclipse, the output directory exists, and running again requires deleting it manually. To automate this, check for the output directory before the job runs and delete it if present. Taking wordcount as the example again, main can be modified like this:

// Requires an additional import: org.apache.hadoop.fs.FileSystem
public static void main(String[] args) throws Exception {
	String input = args[0];
	String output = args[1];
	System.setProperty("HADOOP_USER_NAME", "root");
	Configuration conf = new Configuration();
	conf.set("fs.defaultFS", "hdfs://192.168.56.202:9000");
	// Delete the output directory if it already exists, so reruns do not fail.
	FileSystem fs = FileSystem.get(conf);
	if (fs.exists(new Path(output))) {
		fs.delete(new Path(output), true); // true = recursive delete
	}
	Job job = Job.getInstance(conf);

	job.setJarByClass(WordCountApp.class);

	FileInputFormat.addInputPath(job, new Path(input));
	FileOutputFormat.setOutputPath(job, new Path(output));

	job.setMapperClass(WordCountMapper.class);
	job.setReducerClass(WordCountReducer.class);

	job.setMapOutputKeyClass(Text.class);
	job.setMapOutputValueClass(IntWritable.class);
	job.setOutputKeyClass(Text.class);
	job.setOutputValueClass(IntWritable.class);

	System.exit(job.waitForCompletion(true) ? 0 : 1);
}
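The exists-then-delete pattern above uses the HDFS FileSystem API. The same idea can be illustrated against the local filesystem with plain java.nio (the class and method names here are illustrative only):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Local-filesystem analogue of FileSystem#delete(path, true):
// check for the directory and remove it recursively if present.
public class CleanOutputDir {
    public static void deleteIfExists(Path dir) throws IOException {
        if (Files.exists(dir)) {
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order deletes children before their parent directories.
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("output");
        Files.writeString(out.resolve("part-r-00000"), "hello\t2\n");
        deleteIfExists(out);
        System.out.println(Files.exists(out)); // false
    }
}
```

On HDFS the single fs.delete(path, true) call does all of this server-side, which is why the job version stays so short.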

 
