Hadoop: Set up a Maven project for MapReduce in 5 minutes

I am sure I am not the only one who has struggled with the Hadoop Eclipse plugin installation. The plugin is strongly tied to your environment (Eclipse, Ant, JDK) and to your Hadoop distribution and version, and it only provides the old MapReduce API. Creating a Maven project for Hadoop is so simple that wasting time trying to build this plugin is pointless. In this article I describe how to set up a first Maven Hadoop project for Cloudera CDH4 in Eclipse.

Prerequisites

Maven 3
JDK 1.6
Eclipse with the m2eclipse plugin installed

Add Cloudera repository

Cloudera jar files are not available in the default Maven Central repository. You need to explicitly add the Cloudera repository to your settings.xml (under ${HOME}/.m2/settings.xml).

<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <profiles>
        <profile>
            <id>standard-extra-repos</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <repositories>
                <repository>
                    <!-- Central Repository -->
                    <id>central</id>
                    <url>http://repo1.maven.org/maven2/</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                </repository>
                <repository>
                    <!-- Cloudera Repository -->
                    <id>cloudera</id>
                    <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                </repository>
            </repositories>
        </profile>
    </profiles>
</settings>
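
To check that Maven actually picks up the profile, you can optionally print the effective settings (help:effective-settings is a standard Maven goal) and look for the cloudera repository in the output:

mvn help:effective-settings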

Create Maven project

In Eclipse, create a new Maven project as follows:

[Screenshots: the New Maven Project wizard steps]
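
If you prefer the command line over the wizard, an equivalent project skeleton can be generated with the standard quickstart archetype; the groupId and artifactId below are assumptions matching the package name used later in this article:

mvn archetype:generate -DgroupId=com.aamend.hadoop -DartifactId=MapReduce -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false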

Add Hadoop Nature

For the Cloudera CDH4 distribution, open the pom.xml file and add the following dependencies:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.0.0-cdh4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>2.0.0-cdh4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.0.0-cdh4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>2.0.0-mr1-cdh4.0.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit-dep</artifactId>
            <version>4.8.2</version>
        </dependency>
    </dependencies>
</dependencyManagement>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-auth</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.10</version>
        <scope>test</scope>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.1</version>
            <configuration>
                <source>1.6</source>
                <target>1.6</target>
            </configuration>
        </plugin>
    </plugins>
</build>
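
Once the pom is saved, you can optionally verify that the CDH4 artifacts resolve from the Cloudera repository using Maven's standard dependency report:

mvn dependency:tree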

Download dependencies

Now that you have added the Cloudera repository and created your project, download the dependencies. This is easily done by right-clicking on your Eclipse project and selecting "Update Maven dependencies".
All these dependencies should now be present in your local .m2 repository.

[developer@localhost ~]$ find .m2/repository/org/apache/hadoop -name "*.jar" 
.m2/repository/org/apache/hadoop/hadoop-tools/1.0.4/hadoop-tools-1.0.4.jar
.m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0-sources.jar
.m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0.jar
.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1-sources.jar
.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1.jar
.m2/repository/org/apache/hadoop/hadoop-hdfs/2.0.0-cdh4.0.0/hadoop-hdfs-2.0.0-cdh4.0.0.jar
.m2/repository/org/apache/hadoop/hadoop-streaming/1.0.4/hadoop-streaming-1.0.4.jar
.m2/repository/org/apache/hadoop/hadoop-auth/2.0.0-cdh4.0.0/hadoop-auth-2.0.0-cdh4.0.0.jar
[developer@localhost ~]$ 

Create WordCount example

Create your driver code

package com.aamend.hadoop.MapReduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCountMapper.class);

        // Setup MapReduce
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Delete output if exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}
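
Note that with a plain Configuration like this (no cluster configuration files on the classpath), Hadoop falls back to the local job runner and the local file system. You can therefore test the driver directly from Eclipse (Run As > Java Application) by passing an input file and an output directory as program arguments, for example (hypothetical paths):

data/input.csv data/output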

Create your Mapper class

package com.aamend.hadoop.MapReduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends
        Mapper<Object, Text, Text, IntWritable> {

    private final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        // Split each line on commas and emit (field, 1) for every field
        String[] csv = value.toString().split(",");
        for (String str : csv) {
            word.set(str);
            context.write(word, ONE);
        }
    }
}
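
Note that this mapper splits each input line on commas rather than on whitespace, so it counts CSV fields. For example, the line "hadoop,maven,hadoop" produces the intermediate pairs (hadoop, 1), (maven, 1) and (hadoop, 1).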

Create your Reducer class

package com.aamend.hadoop.MapReduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text text, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Sum the counts received for this word and emit the total
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(text, new IntWritable(sum));
    }
}
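
Continuing the example above, the shuffle groups the mapper output by key, so the reducer receives (hadoop, [1, 1]) and emits (hadoop, 2). If you want to unit-test this without a cluster, here is a minimal sketch using Apache MRUnit; MRUnit is not part of the pom above, so this assumes you add the org.apache.mrunit:mrunit test dependency (e.g. version 0.9.0-incubating with the hadoop2 classifier):

package com.aamend.hadoop.MapReduce;

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountReducerTest {

    @Test
    public void reducerSumsCountsPerWord() throws Exception {
        // Feed the reducer one key with two counts and expect their sum
        new ReduceDriver<Text, IntWritable, Text, IntWritable>()
                .withReducer(new WordCountReducer())
                .withInput(new Text("hadoop"),
                        Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("hadoop"), new IntWritable(2))
                .runTest();
    }
}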

Build project

Building the jar file works out of the box with Maven. Execute the following command:

mvn clean install

You should see output similar to the following:

.../...

[INFO] 
[INFO] --- maven-jar-plugin:2.3.2:jar (default-jar) @ MapReduce ---
[INFO] Building jar: /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ MapReduce ---
[INFO] Installing /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.jar
[INFO] Installing /home/developer/Workspace/hadoop/MapReduce/pom.xml to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.159s
[INFO] Finished at: Sat May 25 00:35:56 GMT+02:00 2013
[INFO] Final Memory: 16M/212M
[INFO] ------------------------------------------------------------------------

Your jar file should now be available in the project's target directory (and also in your ${HOME}/.m2 local repository).

[Screenshot: the generated jar file in the project's target directory]

This jar is ready to be executed in your Hadoop environment:

hadoop jar MapReduce-0.0.1-SNAPSHOT.jar com.aamend.hadoop.MapReduce.WordCount input output
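
Assuming the single reducer set in the driver and the default TextOutputFormat naming, the result lands in one part file, which you can inspect directly:

hadoop fs -cat output/part-r-00000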

Each time I need to create a new Hadoop project, I simply copy the pom.xml template described above, and that's it.

Reposted from: https://www.cnblogs.com/haoliansheng/p/5130234.html
