Hadoop: Add third-party libraries to MapReduce job

From: http://hadoopi.wordpress.com/2014/06/05/hadoop-add-third-party-libraries-to-mapreduce-job/

Anybody working with Hadoop has probably already faced the same common issue: how to add third-party libraries to a MapReduce job.

Add libjars option

The first solution, maybe the most common one, consists of adding libraries using the -libjars parameter on the CLI. To make it work, your class MyClass must use the GenericOptionsParser class. The easiest way is to implement the Hadoop Tool interface, as described in the post Hadoop: Implementing the Tool interface for MapReduce driver.

$ export LIBJARS=/path/jar1,/path/jar2
$ hadoop jar /path/to/my.jar com.wordpress.hadoopi.MyClass -libjars ${LIBJARS} value
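
For reference, here is a minimal driver sketch (the job name and job setup details are illustrative) of what such a Tool-based MyClass can look like, so that ToolRunner and GenericOptionsParser strip -libjars from the arguments before run() is invoked:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyClass extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries whatever -libjars / -D options
        // GenericOptionsParser extracted from the command line
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyClass.class);
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser on our behalf
        System.exit(ToolRunner.run(new Configuration(), new MyClass(), args));
    }
}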

This obviously works only when launching the job from the CLI, so how can we add such external jar files when not using the CLI?

Add jar files to Hadoop classpath

You could certainly upload the external jar files to each tasktracker and update HADOOP_CLASSPATH accordingly, but are you really willing to bother the Ops team each time you need to add a new jar? This works well on a single node, but are you going to upload such jars across all of your 10, 100 or even more Hadoop nodes? This approach does not scale at all!
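
For completeness, this is roughly what that manual approach looks like on each node (the paths below are hypothetical), typically set in hadoop-env.sh:

# hadoop-env.sh on every node (illustrative paths)
export HADOOP_CLASSPATH=/opt/external-libs/jar1.jar:/opt/external-libs/jar2.jar:${HADOOP_CLASSPATH}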

Create a fat jar

Another approach is to create a fat jar, which is a JAR that contains your classes as well as your third-party classes (see this Cloudera blog post for more details). Be aware that this JAR will not only contain your classes, but might also include all of your project dependencies (such as the Hadoop libraries) unless you explicitly exclude them (using the provided scope; a dependency example is shown after the plugin configuration below).
Here is an example of the Maven plugin you will need to set up:

            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
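
To keep the Hadoop libraries themselves out of the fat JAR, declare them with the provided scope. A minimal sketch, assuming the hadoop-client artifact and a hadoop.version property (use whatever Hadoop artifacts and version your project already declares):

            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <!-- hadoop.version is a property you would define in your own POM -->
                <version>${hadoop.version}</version>
                <!-- provided: supplied by the cluster at runtime, kept out of the fat JAR -->
                <scope>provided</scope>
            </dependency>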

Following a “mvn clean package” command, your fat JAR will be located in the Maven project’s target directory as follows:

drwxr-xr-x  2 antoine  staff        68 Jun 10 09:30 archive-tmp
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 classes
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 generated-sources
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 generated-test-sources
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 maven-archiver
drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 myproject-1.0-SNAPSHOT
-rw-r--r--  1 antoine  staff  63880020 Jun 10 09:30 myproject-1.0-SNAPSHOT-jar-with-dependencies.jar
drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 surefire-reports
drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 test-classes

In the above example, note the actual size of your JAR file (around 61 MB). Quite fat, isn’t it?
You can check that all dependencies have been added by running the command below:

$ jar -tf myproject-1.0-SNAPSHOT-jar-with-dependencies.jar

META-INF/
META-INF/MANIFEST.MF
com/aamend/hadoop/allMyClasses.class
...
com/others/allMyDependencies.class
...

Use the distributed cache

I always follow this approach when using third-party libraries in my MapReduce jobs. One could argue it is not elegant, but it lets me work without annoying anyone from the Ops team :). I first create a “lib” directory in my HDFS home directory (“/user/hadoopi/”). You could even use “/tmp”, it does not matter. I then create a static method that:

  1. Locates the jar file that contains the class I need
  2. Uploads this jar to HDFS
  3. Adds the uploaded jar file to the Hadoop distributed cache

Simply add the following lines to some Utils class.

    // Required imports for this helper:
    // import java.io.File;
    // import java.io.IOException;
    // import org.apache.hadoop.conf.Configuration;
    // import org.apache.hadoop.fs.FileSystem;
    // import org.apache.hadoop.fs.Path;
    // import org.apache.hadoop.filecache.DistributedCache;

    private static void addJarToDistributedCache(
            Class<?> classToAdd, Configuration conf)
        throws IOException {

        // Retrieve the jar file containing classToAdd
        String jar = classToAdd.getProtectionDomain().
                getCodeSource().getLocation().
                getPath();
        File jarFile = new File(jar);

        // Declare the new HDFS location
        Path hdfsJar = new Path("/user/hadoopi/lib/"
                + jarFile.getName());

        // Mount HDFS
        FileSystem hdfs = FileSystem.get(conf);

        // Copy (overwrite) the jar file to HDFS
        hdfs.copyFromLocalFile(false, true,
            new Path(jar), hdfsJar);

        // Add the jar to the distributed classpath
        DistributedCache.addFileToClassPath(hdfsJar, conf);
    }

The only thing you need to remember is to call this method prior to job submission…

    public static void main(String[] args) throws Exception {

        // Create Hadoop configuration
        Configuration conf = new Configuration();

        // Add 3rd-party libraries
        addJarToDistributedCache(MyFirstClass.class, conf);
        addJarToDistributedCache(MySecondClass.class, conf);

        // Create my job
        Job job = new Job(conf, "Hadoop-classpath");
        .../...
    }
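
Note that the DistributedCache class is deprecated on recent Hadoop 2.x/3.x releases; the same idea can be expressed through the Job API instead. A minimal sketch, assuming the jar has already been uploaded to HDFS (the path and jar name below are illustrative):

        // Recent Hadoop: add an HDFS-resident jar to the task classpath via the Job API
        Job job = Job.getInstance(conf, "Hadoop-classpath");
        job.addFileToClassPath(new Path("/user/hadoopi/lib/my-dependency.jar"));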

Here you are, your MapReduce job is now able to use any external JAR file.

Reposted from: https://www.cnblogs.com/sunxucool/p/3845113.html

Encountering an error such as “cannot resolve dependency 'org.apache.hadoop:hadoop-mapreduce-clientjobclient:3.3.6'” usually happens in Java projects built with Maven or Gradle. It means that your pom.xml (Maven) or build.gradle (Gradle) references version 3.3.6 of the Apache Hadoop MapReduce Job Client, but that JAR was not correctly added to the project classpath during compilation or installation. Steps to resolve it:

1. **Check the version**: make sure the Hadoop version declared in your Maven or Gradle configuration matches a version that is actually available. If it is not 3.3.6, try downloading the JAR for the corresponding version instead.
2. **Add the dependency**:
   - Maven: declare the correct Hadoop dependency in pom.xml:
   ```xml
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
       <version>3.3.6</version>
   </dependency>
   ```
   - Gradle: add something similar to build.gradle:
   ```groovy
   implementation 'org.apache.hadoop:hadoop-mapreduce-client-jobclient:3.3.6'
   ```
3. **Check the local repository**: make sure your local Maven or Gradle repository already contains this dependency. If it does not, download it from Maven Central (or another source) and add it to your local repository.
4. **Re-sync / rebuild**: run `mvn clean install` (Maven) or `gradle build` (Gradle) to force the project dependencies to be refreshed.
5. **Check network connectivity**: if all of the above looks correct, a network problem may be preventing the download. Confirm that your machine can reach the Maven or Gradle repository servers.
6. **Exclude conflicts**: check whether another dependency pulls in a conflicting version; you may need to adjust versions or add exclusions.

If you are working inside a corporate network, also check whether the company firewall allows access to the required external repositories.
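
For step 6, here is a sketch of excluding a conflicting transitive artifact in Maven (the excluded groupId/artifactId below are placeholders for whatever `mvn dependency:tree` reports as conflicting):

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>3.3.6</version>
    <exclusions>
        <!-- placeholder: replace with the artifact that actually conflicts -->
        <exclusion>
            <groupId>conflicting.group</groupId>
            <artifactId>conflicting-artifact</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```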