My development environment:
OS: Ubuntu 12.04, with one namenode and three datanodes
Hadoop version: hadoop-1.0.1
Eclipse version: Eclipse SDK 3.8.2
Step 1: Start the Hadoop daemons
For details, see: http://www.cnblogs.com/flyoung2008/archive/2011/11/29/2268302.html
Step 2: Install the Hadoop plugin in Eclipse
1. Copy the Hadoop Eclipse plugin, hadoop-eclipse-plugin-1.2.1, into the plugins/ directory of your Eclipse installation.
Download: http://download.csdn.net/detail/wtxwd/7803427
(Note that this plugin was built for Hadoop 1.2.1 while the cluster here runs 1.0.1; a plugin matching your Hadoop version is generally the safer choice.)
2. Restart Eclipse and configure the Hadoop installation directory.
If the plugin installed successfully, open Window-->Preferences and you will find a Hadoop Map/Reduce entry; set the Hadoop installation directory there, then close the dialog.
3. Configure Map/Reduce Locations.
Open the Map/Reduce Locations view via Window-->Show View.
In that view, create a new Hadoop Location: right-click-->New Hadoop Location. In the dialog, set a Location name (e.g. Hadoop) and fill in the Map/Reduce Master and DFS Master. The Host and Port for each are the address and port you configured in mapred-site.xml and core-site.xml respectively. For example:
Map/Reduce Master: master (192.168.1.6), port 9001
DFS Master: master (192.168.1.6), port 9000
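For reference, these two addresses are what Hadoop 1.x reads from the fs.default.name key in core-site.xml and the mapred.job.tracker key in mapred-site.xml. Below is a minimal sketch showing the same mapping made programmatically; the class name is hypothetical, and the host name and ports are just the example values above, so adjust them to your cluster.

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical illustration class: the Location's two masters correspond
// to these Hadoop 1.x configuration keys.
public class MasterConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://master:9000"); // DFS Master -> core-site.xml
    conf.set("mapred.job.tracker", "master:9001");     // Map/Reduce Master -> mapred-site.xml
    System.out.println(conf.get("fs.default.name"));   // sanity check
  }
}
```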
Step 3: Create a project.
File-->New-->Other-->Map/Reduce Project
Any project name will do, e.g. WordCount.
Copy WordCount.java into the project you just created.
Source download: http://download.csdn.net/detail/wtxwd/7803385
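In case that link goes stale, here is a minimal sketch of the classic WordCount for the Hadoop 1.x mapreduce API, the same shape as the example bundled with Hadoop. It reads the input and output directories from args[0] and args[1], which is exactly what the run configuration in step 5 supplies.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in each input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // Hadoop 1.x Job constructor
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```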
Step 4: Leave Eclipse for a moment and upload the input files.
Open a terminal and create two files, file01 and file02, with the following contents (the transcript below was captured on a hadoop-0.20.203.0 install; the commands are the same on 1.0.1):
- [hadoop@localhost~]$ ls
classes Desktop file01 file02 hadoop-0.20.203.0 wordcount.jar WordCount.java
- [hadoop@localhost~]$ cat file01
Hello World Bye World
- [hadoop@localhost~]$ cat file02
Hello Hadoop Goodbye Hadoop
With Hadoop running, create an input directory in HDFS:
- [hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop dfs -ls
- [hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop dfs -mkdir input
- [hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2011-11-23 05:20 /user/hadoop/input
Upload file01 and file02 into input:
- [hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop fs -put file01 input
- [hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop fs -put file02 input
- [hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop fs -ls input
Found 2 items
-rw-r--r-- 1 hadoop supergroup 22 2011-11-23 05:22 /user/hadoop/input/file01
-rw-r--r-- 1 hadoop supergroup 28 2011-11-23 05:22 /user/hadoop/input/file02
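If you prefer to stage the input from Java rather than the shell, the same upload can be done through the HDFS API. A sketch follows; UploadInput is a hypothetical helper class, and the host, port, and file names are the example values from this walkthrough.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: upload the two local input files into HDFS via the Java API.
public class UploadInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://master:9000"); // DFS Master from step 2
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(new Path("input")); // relative path resolves under /user/<you>
    fs.copyFromLocalFile(new Path("file01"), new Path("input/file01"));
    fs.copyFromLocalFile(new Path("file02"), new Path("input/file02"));
    fs.close();
  }
}
```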
Step 5: Run the project
1. In the new project, select WordCount.java, then right-click-->Run As-->Run Configurations.
2. In the Run Configurations dialog, click Java Application, right-click-->New; this creates a new launch configuration named WordCount.
3. Configure the run arguments: on the Arguments tab, enter in Program arguments the input directory to read from and the directory where the results should be written, e.g.:
hdfs://master:9000/user/hadoop/input hdfs://master:9000/user/hadoop/output
4. If you hit java.lang.OutOfMemoryError: Java heap space at runtime, set VM arguments (below Program arguments) to:
-Xms512m -Xmx1024m -XX:MaxPermSize=256m
Once the input and output paths are filled in, click Apply, but do not click Run: Run here would launch the program standalone on the local machine, whereas we want it to run on the Hadoop cluster. Click Close instead, then do: WordCount.java -> right-click -> Run As -> Run on Hadoop.
Click Finish to run.
When you write a Hadoop program in Eclipse and then Run on Hadoop, you may see an error like:
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mango, access=WRITE, inode="hadoop":hadoop:supergroup:rwxr-xr-x (the inode may also show as "tmp")
This happens because the Eclipse Hadoop plugin submits the job as your local desktop user (mango in this trace) and tries to write into HDFS under the corresponding /user/xxx directory (mine was /user/zcf).
One fix is to open up the permissions on the hadoop user's directory:
$ cd /usr/local/hadoop
$ bin/hadoop fs -chmod 777 /user/hadoop
(Alternatively, in Hadoop 1.x permission checking can be disabled cluster-wide by setting dfs.permissions to false in hdfs-site.xml on the namenode, though loosening a single directory as above is the lighter touch.)
After a successful run, the console shows output like the following:
13/06/30 14:17:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/06/30 14:17:01 INFO input.FileInputFormat: Total input paths to process : 2
13/06/30 14:17:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/30 14:17:01 INFO mapred.JobClient: Running job: job_local_0001
13/06/30 14:17:02 INFO util.ProcessTree: setsid exited with exit code 0
13/06/30 14:17:02 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1e903d5
13/06/30 14:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/30 14:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/30 14:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/30 14:17:02 INFO mapred.MapTask: Starting flush of map output
13/06/30 14:17:02 INFO mapred.MapTask: Finished spill 0
13/06/30 14:17:02 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/06/30 14:17:02 INFO mapred.LocalJobRunner:
13/06/30 14:17:02 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/06/30 14:17:02 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@c01e99
13/06/30 14:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/30 14:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/30 14:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/30 14:17:02 INFO mapred.MapTask: Starting flush of map output
13/06/30 14:17:02 INFO mapred.MapTask: Finished spill 0
13/06/30 14:17:02 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
13/06/30 14:17:02 INFO mapred.LocalJobRunner:
13/06/30 14:17:02 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
13/06/30 14:17:02 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@31f2a7
13/06/30 14:17:02 INFO mapred.LocalJobRunner:
13/06/30 14:17:02 INFO mapred.Merger: Merging 2 sorted segments
13/06/30 14:17:02 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 73 bytes
13/06/30 14:17:02 INFO mapred.LocalJobRunner:
13/06/30 14:17:02 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/06/30 14:17:02 INFO mapred.LocalJobRunner:
13/06/30 14:17:02 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/06/30 14:17:02 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/user/hadoop/output
13/06/30 14:17:02 INFO mapred.LocalJobRunner: reduce > reduce
13/06/30 14:17:02 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/06/30 14:17:02 INFO mapred.JobClient: map 100% reduce 100%
13/06/30 14:17:02 INFO mapred.JobClient: Job complete: job_local_0001
13/06/30 14:17:02 INFO mapred.JobClient: Counters: 22
13/06/30 14:17:02 INFO mapred.JobClient: File Output Format Counters
13/06/30 14:17:02 INFO mapred.JobClient: Bytes Written=31
13/06/30 14:17:02 INFO mapred.JobClient: FileSystemCounters
13/06/30 14:17:02 INFO mapred.JobClient: FILE_BYTES_READ=18047
13/06/30 14:17:02 INFO mapred.JobClient: HDFS_BYTES_READ=116
13/06/30 14:17:02 INFO mapred.JobClient: FILE_BYTES_WRITTEN=214050
13/06/30 14:17:02 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=31
13/06/30 14:17:02 INFO mapred.JobClient: File Input Format Counters
13/06/30 14:17:02 INFO mapred.JobClient: Bytes Read=46
13/06/30 14:17:02 INFO mapred.JobClient: Map-Reduce Framework
13/06/30 14:17:02 INFO mapred.JobClient: Map output materialized bytes=81
13/06/30 14:17:02 INFO mapred.JobClient: Map input records=2
13/06/30 14:17:02 INFO mapred.JobClient: Reduce shuffle bytes=0
13/06/30 14:17:02 INFO mapred.JobClient: Spilled Records=12
13/06/30 14:17:02 INFO mapred.JobClient: Map output bytes=78
13/06/30 14:17:02 INFO mapred.JobClient: Total committed heap usage (bytes)=681639936
13/06/30 14:17:02 INFO mapred.JobClient: CPU time spent (ms)=0
13/06/30 14:17:02 INFO mapred.JobClient: SPLIT_RAW_BYTES=222
13/06/30 14:17:02 INFO mapred.JobClient: Combine input records=8
13/06/30 14:17:02 INFO mapred.JobClient: Reduce input records=6
13/06/30 14:17:02 INFO mapred.JobClient: Reduce input groups=4
13/06/30 14:17:02 INFO mapred.JobClient: Combine output records=6
13/06/30 14:17:02 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
13/06/30 14:17:02 INFO mapred.JobClient: Reduce output records=4
13/06/30 14:17:02 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
13/06/30 14:17:02 INFO mapred.JobClient: Map output records=8
When the run finishes, inspect the results in the output directory:
hadoop fs -ls /user/hadoop/output
Output:
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2014-08-22 17:13 /user/hadoop/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 41 2014-08-22 17:13 /user/hadoop/output/part-r-00000
View the file contents:
hadoop fs -cat /user/hadoop/output/part-r-00000
Output:
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
That completes the WordCount example under Eclipse. If you want to run it again, you must first delete the output directory (or change the output path in the Run Configuration arguments): to protect existing results, Hadoop refuses to start a job whose output directory already exists, throwing an exception like this:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists
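If you would rather not delete the directory by hand before every rerun, a common workaround is to have the driver remove a stale output path before submitting the job. A sketch (not part of the stock WordCount.java, and destructive by design, so use with care):

```java
// Inside WordCount.main(), before FileOutputFormat.setOutputPath(job, ...).
// Needs one extra import: org.apache.hadoop.fs.FileSystem.
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true); // true = delete the directory recursively
}
FileOutputFormat.setOutputPath(job, outputPath);
```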