Oryx 2 Source Code Study


CodingCatX


Categories: Study Notes, Internet. Tags: source code, spark, kafka, oryx

Article link: https://blog.csdn.net/CodingCatX/article/details/51202212


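This article walks through the Oryx 2 source code, covering deployment, the framework, and the WordCount example. It analyzes AbstractSparkLayer, SpeedLayer, BatchLayer, and ServingLayer and discusses their roles in real-time computation, such as SpeedLayer's multithreaded processing and BatchLayer's management of windowed data on HDFS.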


Introduction

Oryx official website

Code Analysis

Based on oryx-2.1.2.

Source directory layout

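+ oryx
| - app # reusable ALS, k-means, and RDF applications built on the Oryx platform, plus a WordCount example
     | - conf # sample configuration files
     | - example # WordCount code
     | - oryx-app # ALS, k-means, and RDF application code
     | - oryx-app-api # customizable, reusable interfaces for the applications
     | - oryx-app-common # code shared by the applications
     | - oryx-app-mllib # underlying algorithm implementations for the ALS, k-means, and RDF applications
     | - oryx-app-serving # Serving Layer implementations for the ALS, k-means, and RDF applications
| - deploy # code related to deployment and execution
     | - bin # launch script
     | - oryx-batch # main function of the Batch Layer binary
     | - oryx-serving # main function of the Serving Layer binary
     | - oryx-speed # main function of the Speed Layer binary
| - framework # main framework implementation
     | - kafka-util # Kafka-related utilities
     | - oryx-api # framework API interfaces
     | - oryx-common # utilities shared across the framework
     | - oryx-lambda # run, scheduling, and data distribution logic of the Batch Layer and Speed Layer; this is the core of the framework
     | - oryx-lambda-serving # core run logic of the Serving Layer
     | - oryx-ml # Batch Layer interface specialized for machine learning, implementing some common ML logic
| - src # documentation and other files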

WordCount example

To make the later explanations easier to follow, here is the official WordCount configuration file, located at app/conf/wordcount-example.conf:

# A very basic example config file configuring only the essential elements to
# run the example "word count" application
# Values are examples, appropriate for Cloudera quickstart VM:
kafka-brokers = "quickstart.cloudera:9092"
zk-servers = "quickstart.cloudera:2181"
hdfs-base = "hdfs:///user/cloudera/OryxWordCountExample"
oryx {
  id = "WordCountExample"
  input-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
  }
  update-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
  }
  batch {
    streaming {
      generation-interval-sec = 60
      num-executors = 1
      executor-cores = 2
      executor-memory = "1g"
    }
    update-class = "com.cloudera.oryx.example.batch.ExampleBatchLayerUpdate"
    storage {
      data-dir =  ${hdfs-base}"/data/"
      model-dir = ${hdfs-base}"/model/"
    }
    ui {
      port = 4040
    }
  }
  speed {
    streaming {
      num-executors = 1
      executor-cores = 2
      executor-memory = "1g"
    }
    model-manager-class = "com.cloudera.oryx.example.speed.ExampleSpeedModelManager"
    ui {
      port = 4041
    }
  }
  serving {
    memory = "1000m"
    model-manager-class = "com.cloudera.oryx.example.serving.ExampleServingModelManager"
    application-resources = "com.cloudera.oryx.example.serving"
    api {
      port = 8080
    }
  }
}

For the full configuration reference, see the Oryx 2 default configuration file.

1. deploy

1.1. bin

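The bin directory holds the framework's launch script, which is used to:
- start the Batch, Speed, or Serving Layer
- configure Kafka according to the configuration file
- push data to the Kafka input topic and follow the output of Kafka topics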

usage: oryx-run.sh command [--option value] ...
  where command is one of:
    batch        Run Batch Layer
    speed        Run Speed Layer
    serving      Run Serving Layer
    kafka-setup  Inspect ZK/Kafka config and configure Kafka topics
    kafka-tail   Follow output from Kafka topics
    kafka-input  Push data to input topic
  and options are one of:
    --layer-jar  Oryx JAR file, like oryx-{serving,speed,batch}-x.y.z.jar
                 Defaults to any oryx-*.jar in working dir
    --conf       Oryx configuration file, like oryx.conf. Defaults to 'oryx.conf'
    --app-jar    User app JAR file
    --jvm-args   Extra args to Oryx JVM processes (including drivers and executors)
    --deployment Only for Serving Layer now; can be 'yarn' or 'local', Default: local.
    --input-file Only for kafka-input. Input file to send
    --help       Display this message

1.2. oryx-batch/oryx-serving/oryx-speed

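These directories contain the main entry point of each layer. Each one simply calls the start method of the corresponding layer in the framework.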
batch

    try (BatchLayer<?,?,?> batchLayer = new BatchLayer<>(ConfigUtils.getDefault())) {
      HadoopUtils.closeAtShutdown(batchLayer);
      batchLayer.start();
      batchLayer.await();
    }

speed

    try (SpeedLayer<?,?,?> speedLayer = new SpeedLayer<>(ConfigUtils.getDefault())) {
      HadoopUtils.closeAtShutdown(speedLayer);
      speedLayer.start();
      speedLayer.await();
    }

serving

    try (ServingLayer servingLayer = new ServingLayer(ConfigUtils.getDefault())) {
      JVMUtils.closeAtShutdown(servingLayer);
      servingLayer.start();
      servingLayer.await();
    }
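
The three entry points share one lifecycle: construct the layer in a try-with-resources block, register it to be closed at JVM shutdown, start it, then block until it finishes. Below is a minimal sketch of that pattern using only the JDK. `DemoLayer` and `LayerMainSketch` are hypothetical names and not Oryx classes, and the shutdown hook is an assumption about what a `closeAtShutdown`-style helper does, not a copy of the Oryx implementation.

```java
import java.io.Closeable;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a layer; it only records the lifecycle calls it receives.
final class DemoLayer implements Closeable {
  private final List<String> events = new CopyOnWriteArrayList<>();
  private final CountDownLatch done = new CountDownLatch(1);
  private final AtomicBoolean closed = new AtomicBoolean();

  void start() {
    events.add("start");
    // A real layer would start its long-running work here; the demo finishes at once.
    done.countDown();
  }

  void await() {
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    events.add("await");
  }

  @Override
  public void close() {
    // Idempotent: may be called by try-with-resources and again by the shutdown hook
    if (closed.compareAndSet(false, true)) {
      events.add("close");
    }
  }

  List<String> events() {
    return events;
  }
}

final class LayerMainSketch {
  static List<String> runLayer() {
    DemoLayer observed;
    try (DemoLayer layer = new DemoLayer()) {
      observed = layer;
      // Assumed equivalent of closeAtShutdown: close the layer if the JVM exits early
      Runtime.getRuntime().addShutdownHook(new Thread(layer::close));
      layer.start();
      layer.await();
    }
    return observed.events();
  }

  public static void main(String[] args) {
    System.out.println(runLayer()); // prints [start, await, close]
  }
}
```

Because the layer is closed both when the try block exits and when the hook runs, `close()` has to tolerate being called twice, which is why the sketch guards it with an `AtomicBoolean`.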

2. framework

2.0. AbstractSparkLayer

AbstractSparkLayer is the base class of the batch and speed layers, so it is covered first.

2.0.1. Class definition and main methods

/**
 * Encapsulates commonality between Spark-based layer processes,
 * {@link com.cloudera.oryx.lambda.batch.BatchLayer} and
 * {@link com.cloudera.oryx.lambda.speed.SpeedLayer}
 *
 * @param <K> input topic key type
 * @param <M> input topic message type
 */
public abstract class AbstractSparkLayer<K,M> implements Closeable {
   
  protected AbstractSparkLayer(Config config);
  ...
  protected abstract String getConfigGroup();
  protected abstract String getLayerName();
  ...
  protected final JavaStreamingContext buildStreamingContext();
  protected final JavaInputDStream<MessageAndMetadata<K,M>> buildInputDStream(JavaStreamingContext streamingContext);
  private static void fillInLatestOffsets(Map<TopicAndPartition,Long> offsets, Map<String,String> kafkaParams);
}

2.0.2. Main methods of AbstractSparkLayer

AbstractSparkLayer constructor - reads the configuration and initializes the member fields.

   protected AbstractSparkLayer(Config config) {
     Objects.requireNonNull(config);
     log.info("Configuration:\n{}", ConfigUtils.prettyPrint(config));

     String group = getConfigGroup();
     this.config = config;
     String configuredID = ConfigUtils.getOptionalString(config, "oryx.id");
     this.id = configuredID == null ? generateRandomID() : configuredID;
     this.streamingMaster = config.getString("oryx." + group + ".streaming.master");
     this.inputTopic = config.getString("oryx.input-topic.message.topic");
     this.inputTopicLockMaster = config.getString("oryx.input-topic.lock.master");
     this.inputBroker = config.getString("oryx.input-topic.broker");
     this.updateTopic = ConfigUtils.getOptionalString(config, "oryx.update-topic.message.topic");
     this.updateTopicLockMaster = ConfigUtils.getOptionalString(config, "oryx.update-topic.lock.master");

     // Load the configured classes; the framework relies heavily on reflection
     this.keyClass = ClassUtils.loadClass(config.getString("oryx.input-topic.message.key-class"));
     this.messageClass = ClassUtils.loadClass(config.getString("oryx.input-topic.message.message-class"));
     this.keyDecoderClass = (Class<? extends Decoder<K>>) ClassUtils.loadClass(config.getString("oryx.input-topic.message.key-decoder-class"), Decoder.class);
     this.messageDecoderClass = (Class<? extends Decoder<M>>) ClassUtils.loadClass(config.getString("oryx.input-topic.message.message-decoder-class"), Decoder.class);

     // Generation interval of the streaming computation, in seconds
     this.generationIntervalSec = config.getInt("oryx." + group + ".streaming.generation-interval-sec");

     // Note: extra Spark settings can be added here; they are all read at this point and applied when the StreamingContext is initialized.
     this.extraSparkConfig = new HashMap<>();
     for (Map.Entry<String,ConfigValue> e : config.getConfig("oryx." + group + ".streaming.config").entrySet()) {
       extraSparkConfig.put(e.getKey(), e.getValue().unwrapped());
     }

     Preconditions.checkArgument(generationIntervalSec > 0);
   }
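
The constructor resolves the key, message, and decoder classes from class names stored in the configuration. The sketch below shows the general technique with the JDK only: resolve a name with `Class.forName` and check it against an expected supertype with `asSubclass`. `ReflectiveLoadingSketch`, `loadClass`, and `newInstance` are hypothetical helpers written for illustration. They are not the Oryx `ClassUtils` implementation, whose internals are not shown in this article.

```java
final class ReflectiveLoadingSketch {

  // Hypothetical helper: resolve a class by name and verify it is a subtype of superClass
  static <T> Class<? extends T> loadClass(String name, Class<T> superClass) {
    try {
      // asSubclass throws ClassCastException if the loaded class has the wrong type
      return Class.forName(name).asSubclass(superClass);
    } catch (ClassNotFoundException e) {
      throw new IllegalArgumentException("No such class: " + name, e);
    }
  }

  // Hypothetical helper: instantiate the named class through its no-arg constructor
  static <T> T newInstance(String name, Class<T> superClass) {
    try {
      return loadClass(name, superClass).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + name, e);
    }
  }

  public static void main(String[] args) {
    // In a real layer the class name would come from the configuration file
    Class<? extends CharSequence> clazz = loadClass("java.lang.StringBuilder", CharSequence.class);
    System.out.println(clazz.getName());

    CharSequence instance = newInstance("java.lang.StringBuilder", CharSequence.class);
    System.out.println(instance.length());
  }
}
```

Checking the supertype at load time means a misconfigured class name fails when the layer is constructed rather than later, when the first message is decoded.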

Identifier methods that subclasses must implement - the base class uses them to tell which layer it is running as

   /**
    * @return layer-specific config grouping under "oryx", like "batch" or "speed"
    */
   protected abstract String getConfigGroup();

   /**
    * @return display name for layer like "BatchLayer"
    */
   protected abstract String getLayerName();
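
These two methods form a small template-method pattern: the base-class constructor asks the subclass for its config group and uses it to build keys such as "oryx." + group + ".streaming.master". A minimal sketch follows, with a plain `Map` standing in for the `Config` object. All `Demo*` classes, `ConfigGroupSketch`, and the sample values are hypothetical, and `appName()` leaves out the id handling that the real code performs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AbstractSparkLayer; a Map replaces the Config object
abstract class DemoSparkLayer {
  private final String streamingMaster;
  private final int generationIntervalSec;

  protected DemoSparkLayer(Map<String,String> config) {
    // Called from the constructor, so implementations must not depend on subclass fields
    String prefix = "oryx." + getConfigGroup() + ".streaming.";
    this.streamingMaster = config.get(prefix + "master");
    this.generationIntervalSec = Integer.parseInt(config.get(prefix + "generation-interval-sec"));
    if (generationIntervalSec <= 0) {
      throw new IllegalArgumentException("generation-interval-sec must be positive");
    }
  }

  protected abstract String getConfigGroup();
  protected abstract String getLayerName();

  // Simplified: ignores the optional id
  final String appName() { return "Oryx" + getLayerName(); }
  final String streamingMaster() { return streamingMaster; }
  final int generationIntervalSec() { return generationIntervalSec; }
}

final class DemoBatchLayer extends DemoSparkLayer {
  DemoBatchLayer(Map<String,String> config) { super(config); }
  @Override protected String getConfigGroup() { return "batch"; }
  @Override protected String getLayerName() { return "BatchLayer"; }
}

final class DemoSpeedLayer extends DemoSparkLayer {
  DemoSpeedLayer(Map<String,String> config) { super(config); }
  @Override protected String getConfigGroup() { return "speed"; }
  @Override protected String getLayerName() { return "SpeedLayer"; }
}

final class ConfigGroupSketch {
  // Made-up values for illustration only
  static Map<String,String> sampleConfig() {
    Map<String,String> config = new HashMap<>();
    config.put("oryx.batch.streaming.master", "yarn");
    config.put("oryx.batch.streaming.generation-interval-sec", "60");
    config.put("oryx.speed.streaming.master", "yarn");
    config.put("oryx.speed.streaming.generation-interval-sec", "10");
    return config;
  }

  public static void main(String[] args) {
    Map<String,String> config = sampleConfig();
    for (DemoSparkLayer layer : new DemoSparkLayer[] { new DemoBatchLayer(config), new DemoSpeedLayer(config) }) {
      System.out.println(layer.appName() + " interval=" + layer.generationIntervalSec() + "s");
    }
  }
}
```

The same constructor code serves both layers, and only the two one-line overrides differ. The cost is that an abstract method runs before the subclass constructor body, so the overrides should return constants.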

buildStreamingContext - initializes the StreamingContext

   protected final JavaStreamingContext buildStreamingContext() {
     log.info("Starting SparkContext with interval {} seconds", generationIntervalSec);

     // Initialize the SparkConf
     SparkConf sparkConf = new SparkConf();

     // The next two blocks exist for tests; they are not triggered in a normal run
     // Only for tests, really
     if (sparkConf.getOption("spark.master").isEmpty()) {
       log.info("Overriding master to {} for tests", streamingMaster);
       sparkConf.setMaster(streamingMaster);
     }
     // Only for tests, really
     if (sparkConf.getOption("spark.app.name").isEmpty()) {
       String appName = "Oryx" + getLayerName();
       if (id !=