Introduction
Code Analysis
Code Directory Structure
+ oryx
| - app # Reusable ALS, k-means, and RDF applications built on the Oryx platform, plus a wordcount example
| - conf # Sample conf files
| - example # wordcount code
| - oryx-app # ALS, k-means, and RDF application code
| - oryx-app-api # Customizable, reusable interfaces for each application
| - oryx-app-common # Code shared by the applications
| - oryx-app-mllib # Underlying algorithm implementations for the ALS, k-means, and RDF applications
| - oryx-app-serving # Serving Layer implementations for the ALS, k-means, and RDF applications
| - deploy # Code related to deployment and running
| - bin # Startup scripts
| - oryx-batch # Binary main entry point for the Batch Layer
| - oryx-serving # Binary main entry point for the Serving Layer
| - oryx-speed # Binary main entry point for the Speed Layer
| - framework # Main framework implementation
| - kafka-util # Kafka-related utilities
| - oryx-api # Framework API interfaces
| - oryx-common # Framework common utilities
| - oryx-lambda # Batch/Speed Layer execution, scheduling, and data-distribution logic; this is the core of the framework
| - oryx-lambda-serving # Core execution logic of the framework's Serving Layer
| - oryx-ml # Batch Layer interfaces specialized for machine learning, implementing some common ML logic
| - src # Documentation and other files
The wordcount Example
To simplify the discussion that follows, here is the official wordcount configuration file, located at app/conf/wordcount-example.conf:
# A very basic example config file configuring only the essential elements to
# run the example "word count" application
# Values are examples, appropriate for Cloudera quickstart VM:
kafka-brokers = "quickstart.cloudera:9092"
zk-servers = "quickstart.cloudera:2181"
hdfs-base = "hdfs:///user/cloudera/OryxWordCountExample"
oryx {
  id = "WordCountExample"
  input-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
  }
  update-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
  }
  batch {
    streaming {
      generation-interval-sec = 60
      num-executors = 1
      executor-cores = 2
      executor-memory = "1g"
    }
    update-class = "com.cloudera.oryx.example.batch.ExampleBatchLayerUpdate"
    storage {
      data-dir = ${hdfs-base}"/data/"
      model-dir = ${hdfs-base}"/model/"
    }
    ui {
      port = 4040
    }
  }
  speed {
    streaming {
      num-executors = 1
      executor-cores = 2
      executor-memory = "1g"
    }
    model-manager-class = "com.cloudera.oryx.example.speed.ExampleSpeedModelManager"
    ui {
      port = 4041
    }
  }
  serving {
    memory = "1000m"
    model-manager-class = "com.cloudera.oryx.example.serving.ExampleServingModelManager"
    application-resources = "com.cloudera.oryx.example.serving"
    api {
      port = 8080
    }
  }
}
For a complete description of the configuration options, see the Oryx 2 default configuration file.
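The config above relies on HOCON substitutions: `${kafka-brokers}` and `${zk-servers}` are replaced by the values defined at the top of the file, and `${hdfs-base}"/data/"` concatenates a substitution with a literal. The toy resolver below is not the real Typesafe Config implementation (which also handles nesting, concatenation, and fallbacks); it is only a minimal sketch of how such `${key}` references resolve:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy resolver for ${key} references, mimicking (not reproducing) how
// HOCON substitution makes the wordcount example config work.
public class SubstitutionSketch {
  private static final Pattern REF = Pattern.compile("\\$\\{([^}]+)}");

  static String resolve(String value, Map<String,String> defs) {
    Matcher m = REF.matcher(value);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(value, last, m.start());
      // Unknown keys are left as-is; real HOCON would report an error instead
      out.append(defs.getOrDefault(m.group(1), m.group(0)));
      last = m.end();
    }
    out.append(value.substring(last));
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String,String> defs = new LinkedHashMap<>();
    defs.put("hdfs-base", "hdfs:///user/cloudera/OryxWordCountExample");
    // Corresponds to data-dir = ${hdfs-base}"/data/" in the config
    System.out.println(resolve("${hdfs-base}/data/", defs));
    // -> hdfs:///user/cloudera/OryxWordCountExample/data/
  }
}
```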
1. deploy
1.1. bin
- The bin directory contains the framework's startup scripts, which are used to:
- start the Batch/Speed/Serving Layer
- configure Kafka according to the configuration file
- push data to and read output from Kafka topics
usage: oryx-run.sh command [--option value] ...
where command is one of:
batch Run Batch Layer
speed Run Speed Layer
serving Run Serving Layer
kafka-setup Inspect ZK/Kafka config and configure Kafka topics
kafka-tail Follow output from Kafka topics
kafka-input Push data to input topic
and options are one of:
--layer-jar Oryx JAR file, like oryx-{serving,speed,batch}-x.y.z.jar
Defaults to any oryx-*.jar in working dir
--conf Oryx configuration file, like oryx.conf. Defaults to 'oryx.conf'
--app-jar User app JAR file
--jvm-args Extra args to Oryx JVM processes (including drivers and executors)
--deployment Only for Serving Layer now; can be 'yarn' or 'local', Default: local.
--input-file Only for kafka-input. Input file to send
--help Display this message
1.2. oryx-batch/oryx-serving/oryx-speed
These directories contain the main entry points for the corresponding layers; each one simply calls the framework's startup code for that layer.
- batch
try (BatchLayer<?,?,?> batchLayer = new BatchLayer<>(ConfigUtils.getDefault())) {
HadoopUtils.closeAtShutdown(batchLayer);
batchLayer.start();
batchLayer.await();
}
- speed
try (SpeedLayer<?,?,?> speedLayer = new SpeedLayer<>(ConfigUtils.getDefault())) {
HadoopUtils.closeAtShutdown(speedLayer);
speedLayer.start();
speedLayer.await();
}
- serving
try (ServingLayer servingLayer = new ServingLayer(ConfigUtils.getDefault())) {
JVMUtils.closeAtShutdown(servingLayer);
servingLayer.start();
servingLayer.await();
}
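All three mains follow the same pattern: create the layer in a try-with-resources block, register it to be closed at JVM shutdown, then start and block. The sketch below illustrates that pattern with a hypothetical MiniLayer and a closeAtShutdown helper standing in for HadoopUtils.closeAtShutdown / JVMUtils.closeAtShutdown (whose actual internals are not shown in the source):

```java
import java.io.Closeable;

// Hypothetical stand-in for a layer; the real BatchLayer/SpeedLayer/ServingLayer
// manage Spark or web-server resources in start()/await()/close().
class MiniLayer implements Closeable {
  volatile boolean running;
  void start() { running = true; }
  void await() { /* the real layers block here until stopped */ }
  @Override public void close() { running = false; }
}

public class LayerMainSketch {
  // Rough equivalent of closeAtShutdown(layer): close the resource when the JVM exits
  static void closeAtShutdown(Closeable c) {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try { c.close(); } catch (Exception e) { /* log and ignore */ }
    }));
  }

  public static void main(String[] args) {
    // try-with-resources guarantees close() even if start() throws;
    // the shutdown hook covers kill signals while await() is blocking
    try (MiniLayer layer = new MiniLayer()) {
      closeAtShutdown(layer);
      layer.start();
      layer.await();
    }
  }
}
```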
2. framework
2.0. AbstractSparkLayer
AbstractSparkLayer is the common base class of the Batch Layer and the Speed Layer, so we introduce it first.
2.0.1. Class definition and main methods
/**
* Encapsulates commonality between Spark-based layer processes,
* {@link com.cloudera.oryx.lambda.batch.BatchLayer} and
* {@link com.cloudera.oryx.lambda.speed.SpeedLayer}
*
* @param <K> input topic key type
* @param <M> input topic message type
*/
public abstract class AbstractSparkLayer<K,M> implements Closeable {
protected AbstractSparkLayer(Config config);
...
protected abstract String getConfigGroup();
protected abstract String getLayerName();
...
protected final JavaStreamingContext buildStreamingContext();
protected final JavaInputDStream<MessageAndMetadata<K,M>> buildInputDStream(JavaStreamingContext streamingContext);
private static void fillInLatestOffsets(Map<TopicAndPartition,Long> offsets, Map<String,String> kafkaParams);
}
2.0.2. Main methods of AbstractSparkLayer
- AbstractSparkLayer constructor - its main job is to read the config and initialize the member fields.
protected AbstractSparkLayer(Config config) {
Objects.requireNonNull(config);
log.info("Configuration:\n{}", ConfigUtils.prettyPrint(config));
String group = getConfigGroup();
this.config = config;
String configuredID = ConfigUtils.getOptionalString(config, "oryx.id");
this.id = configuredID == null ? generateRandomID() : configuredID;
this.streamingMaster = config.getString("oryx." + group + ".streaming.master");
this.inputTopic = config.getString("oryx.input-topic.message.topic");
this.inputTopicLockMaster = config.getString("oryx.input-topic.lock.master");
this.inputBroker = config.getString("oryx.input-topic.broker");
this.updateTopic = ConfigUtils.getOptionalString(config, "oryx.update-topic.message.topic");
this.updateTopicLockMaster = ConfigUtils.getOptionalString(config, "oryx.update-topic.lock.master");
// Load the configured classes; the framework relies heavily on reflection
this.keyClass = ClassUtils.loadClass(config.getString("oryx.input-topic.message.key-class"));
this.messageClass = ClassUtils.loadClass(config.getString("oryx.input-topic.message.message-class"));
this.keyDecoderClass = (Class<? extends Decoder<K>>) ClassUtils.loadClass(config.getString("oryx.input-topic.message.key-decoder-class"), Decoder.class);
this.messageDecoderClass = (Class<? extends Decoder<M>>) ClassUtils.loadClass(config.getString("oryx.input-topic.message.message-decoder-class"), Decoder.class);
// The streaming computation (generation) interval
this.generationIntervalSec = config.getInt("oryx." + group + ".streaming.generation-interval-sec");
// Note: extra Spark settings can be added under this config key; they are collected here and applied when the StreamingContext is initialized.
this.extraSparkConfig = new HashMap<>();
for (Map.Entry<String,ConfigValue> e : config.getConfig("oryx." + group + ".streaming.config").entrySet()) {
extraSparkConfig.put(e.getKey(), e.getValue().unwrapped());
}
Preconditions.checkArgument(generationIntervalSec > 0);
}
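The constructor resolves every key, message, and decoder class from config strings. ClassUtils.loadClass presumably wraps the standard JDK mechanism; the sketch below shows the plausible core of such a helper (the real implementation is not reproduced here): resolve the name with Class.forName and verify the subtype, so a bad config fails fast at startup rather than deep inside a Spark job.

```java
// Sketch of what a helper like ClassUtils.loadClass(name, expected) plausibly does.
public class LoadClassSketch {
  static <T> Class<? extends T> loadClass(String name, Class<T> expected) {
    try {
      // asSubclass throws ClassCastException if the class is not a subtype of 'expected'
      return Class.forName(name).asSubclass(expected);
    } catch (ClassNotFoundException e) {
      throw new IllegalArgumentException("No such class: " + name, e);
    }
  }

  public static void main(String[] args) {
    // The real config values name Kafka decoder classes; a JDK type is used
    // here purely for demonstration.
    Class<? extends CharSequence> c = loadClass("java.lang.String", CharSequence.class);
    System.out.println(c.getName());
    // -> java.lang.String
  }
}
```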
- Identification methods that subclasses must override - the base class uses them to determine which concrete layer it is running as
/**
* @return layer-specific config grouping under "oryx", like "batch" or "speed"
*/
protected abstract String getConfigGroup();
/**
* @return display name for layer like "BatchLayer"
*/
protected abstract String getLayerName();
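This is the template-method pattern: shared code in AbstractSparkLayer composes layer-specific config keys such as "oryx.batch.streaming.generation-interval-sec" and app names such as "OryxBatchLayer" from these two values. A minimal, self-contained sketch of that idea (SparkLayerSketch and its helper methods are hypothetical, not the framework's actual code):

```java
// Hypothetical sketch of the template-method pattern used by AbstractSparkLayer:
// the base class builds config keys and app names from values the concrete
// layer supplies via the abstract methods.
abstract class SparkLayerSketch {
  protected abstract String getConfigGroup(); // "batch" or "speed"
  protected abstract String getLayerName();   // "BatchLayer" or "SpeedLayer"

  // Mirrors expressions like config.getString("oryx." + group + ".streaming...")
  final String streamingConfigKey(String leaf) {
    return "oryx." + getConfigGroup() + ".streaming." + leaf;
  }

  // Mirrors the "Oryx" + getLayerName() default app name in buildStreamingContext
  final String defaultAppName() {
    return "Oryx" + getLayerName();
  }
}

public class TemplateMethodSketch {
  public static void main(String[] args) {
    SparkLayerSketch batch = new SparkLayerSketch() {
      @Override protected String getConfigGroup() { return "batch"; }
      @Override protected String getLayerName()   { return "BatchLayer"; }
    };
    System.out.println(batch.streamingConfigKey("generation-interval-sec"));
    // -> oryx.batch.streaming.generation-interval-sec
    System.out.println(batch.defaultAppName());
    // -> OryxBatchLayer
  }
}
```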
- buildStreamingContext - initializes the StreamingContext
protected final JavaStreamingContext buildStreamingContext() {
log.info("Starting SparkContext with interval {} seconds", generationIntervalSec);
// Initialize the SparkConf
SparkConf sparkConf = new SparkConf();
// The next two steps are only for tests; in normal runs these settings are already present
// Only for tests, really
if (sparkConf.getOption("spark.master").isEmpty()) {
log.info("Overriding master to {} for tests", streamingMaster);
sparkConf.setMaster(streamingMaster);
}
// Only for tests, really
if (sparkConf.getOption("spark.app.name").isEmpty()) {
String appName = "Oryx" + getLayerName();
if (id !=