How do I submit a Cascading job from Java to a remote YARN cluster?

I know that I can submit a Cascading job by packaging it into a JAR, as detailed in the Cascading user guide. That job will then run on my cluster if I submit it manually using the hadoop jar CLI command.

However, with the original Hadoop 1 version of Cascading, it was possible to submit a job to the cluster by setting certain properties on the Hadoop JobConf: setting fs.defaultFS and mapred.job.tracker caused the local Hadoop library to automatically attempt to submit the job to the Hadoop 1 JobTracker. Setting these properties no longer seems to work in the newer version. Submitting to a CDH 5.2.1 Hadoop cluster using Cascading version 2.5.3 (which lists CDH5 as a supported platform) leads to an IPC exception when negotiating with the server, as detailed below.
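For reference, the Hadoop 1 pattern referred to above looked roughly like this (a sketch of the old mechanism, not the fix; host names and ports are placeholders):

import org.apache.hadoop.mapred.JobConf;

public class Hadoop1RemoteSubmit {
    public static void main( String[] args ) {
        // Point the client-side JobConf at the remote cluster; the local
        // Hadoop 1 library then submits to the JobTracker over the wire.
        JobConf conf = new JobConf();
        conf.set( "fs.defaultFS", "hdfs://namenode-host:8020" );  // placeholder NameNode
        conf.set( "mapred.job.tracker", "jobtracker-host:8021" ); // placeholder JobTracker
        // ... set input/output paths, mapper and reducer, then:
        // JobClient.runJob( conf );
    }
}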

I believe that this platform combination -- Cascading 2.5.6, Hadoop 2, CDH 5, YARN, and the MR1 API for submission -- is supported based on the compatibility table (see under the "Prior Releases" heading). Submitting the job with hadoop jar works fine on this same cluster, and port 8031 is open between the submitting host and the ResourceManager. An error with the same message appears in the ResourceManager logs on the server side.

I am using the cascading-hadoop2-mr1 library.

Exception in thread "main" cascading.flow.FlowException: unhandled exception
    at cascading.flow.BaseFlow.complete(BaseFlow.java:894)
    at WordCount.main(WordCount.java:91)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): Unknown rpc kind in rpc headerRPC_WRITABLE
    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
    at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
    at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1368)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:982)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:950)
    at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

The demo code is below; it is essentially the WordCount sample from the Cascading user guide.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.Aggregator;
import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {

    public static void main( String[] args ) {
        String inputPath = "/user/vagrant/wordcount/input";
        String outputPath = "/user/vagrant/wordcount/output";

        Scheme sourceScheme = new TextLine( new Fields( "line" ) );
        Tap source = new Hfs( sourceScheme, inputPath );

        Scheme sinkScheme = new TextDelimited( new Fields( "word", "count" ) );
        Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

        Pipe assembly = new Pipe( "wordcount" );

        // word-splitting regex from the Cascading user guide WordCount sample
        // (the original page truncated it at the first "<")
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?=(?<!\\pL)|$)";
        Function function = new RegexGenerator( new Fields( "word" ), regex );
        assembly = new Each( assembly, new Fields( "line" ), function );

        assembly = new GroupBy( assembly, new Fields( "word" ) );
        Aggregator count = new Count( new Fields( "count" ) );
        assembly = new Every( assembly, count );

        Properties properties = AppProps.appProps()
            .setName( "word-count-application" )
            .setJarClass( WordCount.class )
            .buildProperties();

        properties.put( "fs.defaultFS", "hdfs://192.168.30.101" );
        properties.put( "mapred.job.tracker", "192.168.30.101:8032" );

        FlowConnector flowConnector = new HadoopFlowConnector( properties );
        Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
        flow.complete();
    }
}

I've also tried setting a bunch of other properties to try to get it working:

mapreduce.jobtracker.address
mapreduce.framework.name
yarn.resourcemanager.address
yarn.resourcemanager.host
yarn.resourcemanager.hostname
yarn.resourcemanager.resourcetracker.address

None of these worked; they just caused the job to run in local mode (unless mapred.job.tracker was also set).
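That local-mode fallback is the default behavior: the MR1-style JobClient uses the LocalJobRunner when mapred.job.tracker is left at its default of "local", and the Hadoop 2 client likewise picks its submission path from mapreduce.framework.name, which also defaults to "local". A quick way to see what the client resolves (a minimal sketch, assuming the Hadoop client jars and any *-site.xml files are on the classpath):

import org.apache.hadoop.conf.Configuration;

public class SubmissionDefaults {
    public static void main( String[] args ) {
        Configuration conf = new Configuration();
        // Both resolve to "local" unless overridden in *-site.xml or in code,
        // which is why the job silently runs in local mode.
        System.out.println( conf.get( "mapred.job.tracker", "local" ) );
        System.out.println( conf.get( "mapreduce.framework.name", "local" ) );
    }
}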

Solution

I've now resolved the problem. It was caused by trying to use the older MR1 Hadoop classes that Cloudera distributes, particularly JobClient. It happens if you use hadoop-core at the provided 2.5.0-mr1-cdh5.2.1 version, or the hadoop-client dependency at that same version. Although this claims to be the MR1 version, and we are using the MR1 API for submission, that artifact only supports submission to a Hadoop 1 JobTracker; it does not support YARN.

To allow submitting to YARN, you must use the hadoop-client dependency at the non-MR1 version, 2.5.0-cdh5.2.1, which still supports submission of MR1 jobs to YARN.
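Concretely, that means depending on org.apache.hadoop:hadoop-client:2.5.0-cdh5.2.1 instead of org.apache.hadoop:hadoop-client:2.5.0-mr1-cdh5.2.1, with the cascading-hadoop2-mr1 dependency unchanged. With the correct client on the classpath, the standard Hadoop 2 property names select YARN submission; one way to wire this up (a sketch, since the answer above only names the dependency change) is to replace the two properties.put(...) lines in the WordCount example with:

// Hadoop 2 style settings; 8032 is the ResourceManager client port
// already used in the question.
properties.put( "fs.defaultFS", "hdfs://192.168.30.101" );                // remote HDFS, as before
properties.put( "mapreduce.framework.name", "yarn" );                     // choose the YARN submitter
properties.put( "yarn.resourcemanager.address", "192.168.30.101:8032" );  // remote ResourceManager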
