Hadoop MapReduce: the client-side Job submission process

MapReduce WordCount example code

package MapReduceLearn.MapReduceLearn.Art;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
	public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
		private Text word = new Text();
		private final static IntWritable one = new IntWritable(1);

		@Override
		protected void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			// Emit (token, 1) for every whitespace-separated token in the line.
			StringTokenizer itr = new StringTokenizer(value.toString());
			while (itr.hasMoreTokens()) {
				word.set(itr.nextToken());
				context.write(word, one);
			}
		}
	}

	public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable result = new IntWritable(0);

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			// Sum all the counts for this word and emit the total.
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	public static void main(String[] args) {
		Configuration conf = new Configuration();
		try {
			Job job = Job.getInstance(conf, "word count");
			// In Hadoop 2.x the job jar must be set, otherwise the tasks
			// cannot find the Mapper and Reducer classes at runtime.
			job.setJarByClass(WordCount.class);
			job.setMapperClass(TokenizerMapper.class);
			job.setReducerClass(IntSumReducer.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			FileInputFormat.addInputPath(job, new Path("hdfs://bigdatamaster:9000/data/wordcount/words/"));
			FileOutputFormat.setOutputPath(job, new Path("hdfs://bigdatamaster:9000/data/wordcount/result/"));
			System.exit(job.waitForCompletion(true) ? 0 : 1);
		} catch (ClassNotFoundException | IOException | InterruptedException e) {
			e.printStackTrace();
		}
	}
}

The Job submission process

Overview of the flow (originally a flowchart):

1. Create the Job object and call job.waitForCompletion, which calls job.submit.
2. Inside submit, job.connect creates a Cluster object.
3. The Cluster uses a ServiceLoader to find the subclasses of ClientProtocolProvider.
4. Under the YARN framework this should yield YarnClientProtocolProvider.
5. The provider is then asked to create a ClientProtocol; YarnClientProtocolProvider returns a YARNRunner.
6. A JobSubmitter object is obtained, with the YARNRunner as its submitClient.
7. submitClient (the YARNRunner) submits the job via submitJob.
8. The YARNRunner submits the application through its resMgrDelegate member.
9. resMgrDelegate in turn submits through its client member, a YarnClientImpl.
10. YarnClientImpl submits through its rmClient member (declared type ApplicationClientProtocol, actual type ApplicationClientProtocolPBClientImpl).
11. ApplicationClientProtocolPBClientImpl sends the request to the server through its proxy member.
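The provider-selection step above (ServiceLoader finds ClientProtocolProvider implementations, and the first one that yields a non-null client wins) can be sketched without Hadoop. All class and method names below are simplified stand-ins, not Hadoop's actual API; the real list of providers comes from ServiceLoader.load(ClientProtocolProvider.class).

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of how Cluster picks a ClientProtocolProvider: iterate the
// discovered providers and keep the first one that returns a non-null client.
public class ProviderSelection {
    interface ClientProtocol { String name(); }

    interface ClientProtocolProvider {
        // Returns a client if this provider can serve the configuration, else null.
        ClientProtocol create(String framework);
    }

    static class LocalProvider implements ClientProtocolProvider {
        public ClientProtocol create(String framework) {
            return "local".equals(framework) ? () -> "LocalRunner" : null;
        }
    }

    static class YarnProvider implements ClientProtocolProvider {
        public ClientProtocol create(String framework) {
            return "yarn".equals(framework) ? () -> "YARNRunner" : null;
        }
    }

    static ClientProtocol pick(List<ClientProtocolProvider> providers, String framework) {
        // In Hadoop this list comes from ServiceLoader.load(ClientProtocolProvider.class).
        for (ClientProtocolProvider p : providers) {
            ClientProtocol client = p.create(framework);
            if (client != null) {
                return client;
            }
        }
        throw new IllegalStateException("Cannot initialize Cluster for " + framework);
    }

    public static void main(String[] args) {
        ClientProtocol client = pick(
                Arrays.asList(new LocalProvider(), new YarnProvider()), "yarn");
        System.out.println(client.name()); // prints YARNRunner
    }
}
```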

How the rmClient used by YARNRunner is obtained

The acquisition flow (originally a flowchart):

1. A ResourceMgrDelegate instance is created.
2. Its constructor creates ResourceMgrDelegate.client, a YarnClientImpl object.
3. The ResourceMgrDelegate service is started; its serviceStart calls client.start().
4. YarnClientImpl.serviceStart then builds the YarnClientImpl.rmClient instance via ClientRMProxy.createRMProxy(getConfig(), ApplicationClientProtocol.class).
5. createRMProxy checks whether the cluster is running in HA mode:
   • yes: an RMFailoverProxyProvider object provider is created, by default "org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider";
   • no: a DefaultFailoverProxyProvider instance provider is created and handed the proxy object obtained from RMProxy.<T>getProxy(conf, protocol, rmAddress).
6. RetryProxy.create(protocol, provider, retryPolicy) calls Proxy.newProxyInstance (Java reflection), supplying new RetryInvocationHandler<T>(proxyProvider, retryPolicy) as the InvocationHandler.
7. The newly created proxy is returned to the calling user code as rmClient.

From then on, whenever any method is called on the proxy, the proxy class passes the method and its arguments into the invoke function of its InvocationHandler; RetryInvocationHandler ultimately calls the method of the same name on the proxied object.
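This retry-through-invoke behavior can be demonstrated with plain JDK reflection. The Service interface, the flaky target, and the retry count below are invented for the demo; Hadoop's RetryInvocationHandler adds failover and policy logic on top of the same basic shape.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Sketch of what RetryInvocationHandler does: the proxy funnels every call into
// invoke(), which calls the same-name method on the real target, retrying on failure.
public class RetryDemo {
    interface Service { String submit(); }

    // A target that fails the first N calls, then succeeds (demo stand-in).
    static Service flaky(int failuresBeforeSuccess) {
        final int[] calls = {0};
        return () -> {
            if (++calls[0] <= failuresBeforeSuccess) {
                throw new RuntimeException("transient failure");
            }
            return "accepted";
        };
    }

    static Service retryProxy(Service target, int maxAttempts) {
        InvocationHandler handler = (proxy, method, args) -> {
            Throwable last = null;
            for (int i = 0; i < maxAttempts; i++) {
                try {
                    // Dispatch to the same-name method on the underlying target.
                    return method.invoke(target, args);
                } catch (Exception e) {
                    last = e.getCause() != null ? e.getCause() : e;
                }
            }
            throw last;
        };
        return (Service) Proxy.newProxyInstance(
                Service.class.getClassLoader(), new Class<?>[]{Service.class}, handler);
    }

    public static void main(String[] args) {
        // First call fails, the handler retries, second attempt succeeds.
        System.out.println(retryProxy(flaky(1), 3).submit()); // prints accepted
    }
}
```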

How Java's reflection machinery works in the Proxy area is explained later, in the part describing ApplicationClientProtocolPBClientImpl.

How DefaultFailoverProxyProvider's proxy is generated

The generation flow (originally a flowchart):

1. YarnRPC obtains the RPC class from the key "yarn.ipc.rpc.class"; the default is "org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC".
2. HadoopYarnProtoRPC.getClientFactory obtains the client factory class from the key "yarn.ipc.client.factory.class"; the default is "org.apache.hadoop.yarn.factories.impl.pb.RpcClientFactoryPBImpl".
3. The factory (RpcClientFactoryPBImpl by default) creates the RPC client, of class "org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl".

By default, this stage ends with an instance of the class ApplicationClientProtocolPBClientImpl.
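The lookup-then-instantiate pattern used by YarnRPC and RpcClientFactoryPBImpl (read a class name from a configuration key with a default, then build it reflectively) can be sketched in isolation. The configuration key and the client class below are invented stand-ins, not Hadoop's.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Sketch of the config-driven factory pattern: resolve a class name from a
// configuration map (falling back to a default) and instantiate it reflectively.
public class ReflectiveFactory {
    public static class DefaultRpcClient {
        public DefaultRpcClient() {}
    }

    public static Object create(Map<String, String> conf, String key, String defaultClass)
            throws Exception {
        // Same shape as conf.getClass(key, default) in Hadoop's Configuration.
        String className = conf.getOrDefault(key, defaultClass);
        Constructor<?> ctor = Class.forName(className).getConstructor();
        return ctor.newInstance();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> conf = new HashMap<>(); // no override configured
        Object client = create(conf, "demo.rpc.client.class",
                DefaultRpcClient.class.getName());
        System.out.println(client.getClass().getSimpleName()); // prints DefaultRpcClient
    }
}
```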

How ApplicationClientProtocolPBClientImpl's proxy member is obtained

Starting from the source code, the proxy is obtained as follows:

public ApplicationClientProtocolPBClientImpl(long clientVersion,
      InetSocketAddress addr, Configuration conf) throws IOException {
    RPC.setProtocolEngine(conf, ApplicationClientProtocolPB.class,
      ProtobufRpcEngine.class);
    proxy = RPC.getProxy(ApplicationClientProtocolPB.class, clientVersion, addr, conf);
  }
The flow (originally a flowchart):

1. Set the RPC protocol engine to ProtobufRpcEngine.
2. Ask for a client proxy for the RPC protocol (ApplicationClientProtocolPB); the request is delegated to the configured protocol engine.
3. The engine creates a new ProtobufRpcEngine.Invoker.
4. A proxy class for ApplicationClientProtocolPB is built around that invoker and returned.

How the ApplicationClientProtocolPB protocol proxy class is generated

The code that generates the proxy class is:

public <T> ProtocolProxy<T> getProxy(Class<T> protocol, long clientVersion,
      InetSocketAddress addr, UserGroupInformation ticket, Configuration conf,
      SocketFactory factory, int rpcTimeout, RetryPolicy connectionRetryPolicy,
      AtomicBoolean fallbackToSimpleAuth) throws IOException {
      
    final Invoker invoker = new Invoker(protocol, addr, ticket, conf, factory,
        rpcTimeout, connectionRetryPolicy, fallbackToSimpleAuth);
    return new ProtocolProxy<T>(protocol, (T) Proxy.newProxyInstance(
        protocol.getClassLoader(), new Class[]{protocol}, invoker), false);
  }

The part we care about is the Proxy.newProxyInstance call:

Its three steps (originally a flowchart):

1. Class<?> cl = getProxyClass0(loader, intfs);
2. final Constructor<?> cons = cl.getConstructor(constructorParams);
3. cons.newInstance(new Object[]{h});
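These three steps can be reproduced with plain JDK reflection. getProxyClass0 is private, but the public (now deprecated) Proxy.getProxyClass exposes the same class-generation step; the Greeter interface and handler below are invented for the demo.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// The three steps of Proxy.newProxyInstance spelled out: generate/look up the
// proxy class, fetch its (InvocationHandler) constructor, then instantiate it.
public class ProxySteps {
    interface Greeter { String greet(); }

    @SuppressWarnings("deprecation")
    static Greeter build() throws Exception {
        InvocationHandler h = (proxy, method, args) -> "hello from " + method.getName();

        // Step 1: generate (or fetch from the cache) the dynamic proxy class.
        Class<?> cl = Proxy.getProxyClass(Greeter.class.getClassLoader(), Greeter.class);

        // Step 2: the generated class has a constructor taking an InvocationHandler.
        Constructor<?> cons = cl.getConstructor(InvocationHandler.class);

        // Step 3: instantiate the proxy with our handler.
        return (Greeter) cons.newInstance(h);
    }

    public static void main(String[] args) throws Exception {
        Greeter g = build();
        System.out.println(Proxy.isProxyClass(g.getClass()));              // prints true
        System.out.println(g.getClass().getSimpleName().startsWith("$Proxy")); // prints true
        System.out.println(g.greet()); // prints hello from greet
    }
}
```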

getProxyClass0

Signature: private static Class<?> getProxyClass0(ClassLoader loader, Class<?>... interfaces)
Purpose: creates a proxy class, generated dynamically, from the given loader and interfaces.
The creation process:

1. Proxy has a static member proxyClassCache, created as new WeakCache<>(new KeyFactory(), new ProxyClassFactory()).
2. A lookup key for proxyClassCache is built from the loader and interfaces (KeyFactory.apply).
3. The value is fetched from proxyClassCache; on a cache hit, the existing proxy class is returned.
4. On a miss, a new Factory object is created and ProxyClassFactory.apply is called to build the new proxy class.
  • Memory layout of proxyClassCache's valuesMap:

    | Key                           | Value                                    |
    |-------------------------------|------------------------------------------|
    | Proxy.Key1 (KeyFactory.apply) | Factory (get => ProxyClassFactory.apply) |
    | Proxy.Key2 (KeyFactory.apply) | Factory (get => ProxyClassFactory.apply) |
    | Proxy.key0 (KeyFactory.apply) | Factory (get => ProxyClassFactory.apply) |
    | Proxy.KeyX (KeyFactory.apply) | Factory (get => ProxyClassFactory.apply) |
  • ProxyClassFactory.apply (originally a flowchart):
    1. Check that every interface is visible to the class loader; if not, throw IllegalArgumentException.
    2. Check that every element really is an interface; if not, throw IllegalArgumentException.
    3. Check that there are no duplicate interfaces; if there are, throw IllegalArgumentException.
    4. Determine the package the proxy class should belong to: if any interface has a non-public modifier, assign that interface's package to proxyPkg; otherwise assign "com.sun.proxy." to proxyPkg.
    5. Build a proxy class name, roughly of the form [proxyPkg][proxyClassNamePrefix][unique number], for example com.sun.proxy.$Proxy1.
    6. Generate the proxy class dynamically: create a ProxyGenerator, emit the class, and generate all the methods declared by the interfaces.
  • Method generation in the dynamic proxy class
    This part generates the methods that the JVM will execute directly.
    Each generated method looks roughly like this:
    | #  | Instructions when the method has parameters | When it has none | Meaning |
    |----|---------------------------------------------|------------------|---------|
    | 1  | aload_0 | aload_0 | push this onto the stack |
    | 2  | getfield | getfield | fetch an instance field of the class and push its value |
    | 3  | ref | ref | reference to the h field of class Proxy { InvocationHandler h; } |
    | 4  | aload_0 | aload_0 | push this (the first argument to invoke) |
    | 5  | getstatic | getstatic | fetch a static field of the class and push its value |
    | 6  | ref | ref | reference to the Method object for this method |
    | 7  | a. push the argument-list length; b. anewarray; c. push the reference to java.lang.Object; d. dup; e. push the argument's index in the list; f. push the argument itself (one or two slots depending on its type) | aconst_null | load the arguments |
    | 8  | invokeinterface | invokeinterface | invoke an interface method |
    | 9  | ref | ref | reference to java.lang.reflect.InvocationHandler { java.lang.Object invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) } |
    | 10 | 4 | 4 | invokeinterface operand: the argument slot count |
    | 11 | 0 | 0 | invokeinterface operand: the fixed trailing zero |
    | 12 | return instruction | return instruction | return the value to the caller |

Translated back into Java, the generated code looks roughly like this, using submitApplication as the example:

  • Method signature:
public org.apache.hadoop.yarn.proto.YarnServiceProtos.SubmitApplicationResponseProto submitApplication(
    com.google.protobuf.RpcController controller,
    org.apache.hadoop.yarn.proto.YarnServiceProtos.SubmitApplicationRequestProto request)
  • The generated proxy method, approximately:
class Proxy1 {
    java.lang.reflect.InvocationHandler h;

    org.apache.hadoop.yarn.proto.YarnServiceProtos.SubmitApplicationResponseProto submitApplication(
            com.google.protobuf.RpcController controller,
            org.apache.hadoop.yarn.proto.YarnServiceProtos.SubmitApplicationRequestProto request) {
        // "m" stands for the static Method object generated for submitApplication
        return (SubmitApplicationResponseProto) h.invoke(this, m, new Object[]{controller, request});
    }
}

In summary: when a method of the proxied class is invoked on the dynamically generated Proxy class, what actually runs is the invoke function of the user-supplied InvocationHandler, which then dispatches to the corresponding method.
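This dispatch can be observed directly: the handler receives the Method object and the argument array exactly as the generated stub above passes them. The Protocol interface below is a made-up stand-in for ApplicationClientProtocolPB.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Arrays;

// Calling submitApplication(...) on the proxy turns into
// h.invoke(proxy, <Method submitApplication>, args).
public class InvokeDemo {
    interface Protocol { String submitApplication(String controller, String request); }

    static Protocol proxy() {
        // The handler just reports what it actually sees: the method name and args.
        InvocationHandler h = (proxyObj, method, args) ->
                method.getName() + Arrays.toString(args);
        return (Protocol) Proxy.newProxyInstance(
                Protocol.class.getClassLoader(), new Class<?>[]{Protocol.class}, h);
    }

    public static void main(String[] args) {
        System.out.println(proxy().submitApplication("rpc", "req"));
        // prints submitApplication[rpc, req]
    }
}
```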

The invoke method of ProtobufRpcEngine.Invoker

The invoke method does two things (originally a flowchart):

1. Build the RPC request header.
2. Send the RPC request to the server.
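Conceptually, the request that travels to the server is a small header naming the protocol and method, followed by the serialized argument. The real code encodes this with protobuf (a request-header message plus the request proto); the string framing and field names below are invented for readability.

```java
// Toy sketch of what ProtobufRpcEngine.Invoker.invoke assembles before sending:
// a header identifying the protocol and method, plus the serialized payload.
public class RpcRequestSketch {
    static String buildRequest(String protocol, String method, String payload) {
        // Stand-in for the protobuf-encoded header + body pair.
        return "header{protocol=" + protocol + ", method=" + method + "} body{" + payload + "}";
    }

    public static void main(String[] args) {
        System.out.println(buildRequest("ApplicationClientProtocolPB",
                "submitApplication", "SubmitApplicationRequestProto"));
    }
}
```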