Introducing One Ring: an open-source pipeline for all Spark applications

If you use Apache Spark, you probably have a few applications that consume data from external sources and produce intermediate results, which are in turn consumed by applications further down the processing chain, and so on until you get a final result.

We suspect so, because we have a similar pipeline with lots of processes like this one:

A process flowchart with more than 50 applications and about 70 datasets

Each rectangle is a Spark application with its own set of execution parameters, and each arrow is an equally parametrized dataset (externally stored ones are highlighted with a color; note the number of intermediate ones). This example is not the most complex of our processes; it's a fairly simple one. And we don't assemble such workflows manually, we generate them from Process Templates (outlined as groups on this flowchart).

So here comes One Ring, a Spark pipelining framework with very robust configuration abilities, which makes it easier to compose and execute even the most complex Process as a single large Spark job.

And we have just made it open source. Perhaps you're interested in the details.

Let us speak about Spark applications. This document explains how to use...

  • One Ring to unify them all,

  • One Ring to unite them,

  • One Ring to bind them all

  • And in the Clouds run them

...in parametrized chains.

Simply configure them in a DSL, or even generate very complex ones from shorter Template fragments.

Let us start with a description of how to build and extend One Ring, then proceed to the configuration and composition stuff.

One Ring to unify them all

To build One Ring you need Apache Maven, version 3.5 or higher.

Make sure you've cloned this repo with all submodules:

git clone --recursive https://github.com/PastorGL/OneRing.git

One Ring CLI

If you're planning to compute on an EMR cluster, just cd to the OneRing directory and execute Maven in the default profile:

mvn clean package

The ./TaskWrapper/target/one-ring-cli.jar is a fat executable JAR targeted at the Spark environment provided by EMR version 5.23.0.

If you're planning to build a local artifact with full Spark built-in, execute the build in the 'local' profile:

mvn clean package -Plocal

Make sure the resulting JAR is about 100 MB in size.

It isn't recommended to skip tests in the build process, but if you're already running Spark locally on your build machine, you could add -DskipTests to the Maven command line, because the tests would interfere with it.
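
For example, a local build with the test suite skipped would be:

mvn clean package -Plocal -DskipTests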

After you've built your CLI artifact, look into './RESTWrapper/docs' for the automatically generated documentation of available Packages and Operations (in Markdown format).

One Ring Dist

The ./DistWrapper/target/one-ring-dist.jar is a fat executable JAR that generates Hadoop's dist-cp or EMR's s3-dist-cp script to copy the source data from the external storage (namely, S3) to the cluster's internal HDFS, and the computation's result back.

It is an optional component, documented a bit later.

One Ring REST

The ./RESTWrapper/target/one-ring-rest.jar is a fat executable JAR that serves a REST-ish back-end for the not-yet-implemented but much wanted Visual Template Editor. It also serves the docs via a dedicated endpoint.

One Ring is designed with extensibility in mind.

Extend Operations

To extend One Ring's built-in set of Operations, you have to implement an Operation class according to a set of conventions described in this doc.

First off, you should create a Maven module. Place it at the same level as the root pom.xml, and include your module in the root project's <modules> section. You can freely choose the group and artifact IDs you like.

To make One Ring aware of your module, include its artifact reference in TaskWrapper's pom.xml <dependencies>. To make your module aware of One Ring, include a reference to the artifact 'ash.nazg:Commons' in your module's <dependencies> (and its test-jar scope too). For an example, look into Math's pom.xml.

Now you can proceed to create an Operation package and describe it.

By convention, the Operation package must be named your.package.name.operations and have a package-info.java annotated with @ash.nazg.config.tdl.Description. That annotation is required by One Ring to recognize the contents of your.package.name. There's an example.

Place all your Operations there.

If your module contains a number of Operations that share the same Parameter Definitions, the names of these parameters must be placed in a public final class your.package.name.config.ConfigurationParameters as public static final String constants, each with its own @Description annotation. An Operation can also define its own parameters inside its class, following the same convention.

Parameter Definitions and their default value constants have names prefixed depending on their purpose:

  • DS_INPUT_ for input DataStream references,

  • DS_OUTPUT_ for output DataStream references,

  • OP_ for the Operation's Parameters,

  • DEF_ for any default value; the rest of the name should match the corresponding OP_ or DS_,

  • GEN_ for any column generated by this Operation.

References to columns of input DataStreams must end with the _COLUMN suffix, and references to column lists with _COLUMNS.
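
As an illustration only, a hypothetical ConfigurationParameters class following these conventions might look like the sketch below. Every name and value in it is made up for the example and does not come from the real modules; the @Description usage is assumed from the conventions described above.

package your.package.name.config;

import ash.nazg.config.tdl.Description;

public final class ConfigurationParameters {
    @Description("Input DataStream with raw signals")
    public static final String DS_INPUT_SIGNALS = "signals";

    @Description("Output DataStream with filtered signals")
    public static final String DS_OUTPUT_FILTERED = "filtered_signals";

    @Description("Column of the input DataStream to filter on")
    public static final String DS_SIGNALS_FILTER_COLUMN = "filtering.column";

    @Description("Range of allowed values for the filtering column")
    public static final String OP_FILTERING_RANGE = "filtering.range";

    @Description("Default range, matching OP_FILTERING_RANGE")
    public static final String DEF_FILTERING_RANGE = "[0 100]";

    @Description("Column with a hash, generated by the Operation")
    public static final String GEN_HASH = "_hash";
}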

An Operation, in essence, is

public abstract class Operation implements Serializable {
    public abstract Map<String, JavaRDDLike> getResult(Map<String, JavaRDDLike> input) throws Exception;
}

...but enlightened with surplus metadata that allows One Ring to configure it flexibly, and to ensure the correctness and consistency of all Operation configurations in the Process chain.

It is up to you, the author, to provide all that metadata. If any required metadata is lacking, the build will be failed prematurely by One Ring Guardian, so an incomplete class won't fail your extended copy of One Ring CLI at Process execution time.

At least, you must implement the following methods:

  • abstract public String verb(), which returns a short symbolic name to reference your Operation instance in the config file, annotated with a @Description,

  • abstract public TaskDescriptionLanguage.Operation description(), which defines the Operation's entire configuration space in TDL2 (Task Description Language),

  • getResult(), which contains the entry point of your business code. It'll be fed all DataStreams accumulated by the current Process at the moment of your Operation's invocation, and should return any DataStreams your Operation should emit.

Also you must override public void setConfig(OperationConfig config) throws InvalidConfigValueException, call its super() at the beginning, and then read all parameters from the configuration space into your Operation class' fields. If any of the parameters has an invalid value, you're obliged to throw an InvalidConfigValueException with a descriptive message about the configuration mistake.

You absolutely should create a test case for your Operation. See the existing tests for a reference.

There are plenty of examples to learn from; just look into the source code of Operation's descendants. For your convenience, here's a list of the most notable ones:

  • FilterByDateOperation, with lots of parameters of different types that have defaults,

  • SplitByDateOperation, its sister Operation, which generates a lot of output DataStreams with wildcard names,

  • DummyOperation, which properly does nothing and just creates aliases for its input DataStreams,

  • SubtractOperation, which can consume and emit both RDDs and PairRDDs as DataStreams,

  • WeightedSumOperation, which generates a lot of columns that either come from input DataStreams or are created anew,

  • and the package Proximity, which contains Operations that deal with Point and Polygon RDDs in their DataStreams.

Extend Storage Adapters

To extend One Ring with a custom Storage Adapter, you have to implement a pair of InputAdapter and OutputAdapter interfaces. They're fairly straightforward; just see the existing Adapter sources for reference.

For example, you could create an Adapter for Spark 'parquet' files instead of CSV if you have your source data stored that way.

A single restriction exists: you can't set your Adapter as a fallback one, as that role is reserved for the One Ring Hadoop Adapter.

Hopefully this bit of information is enough to get you started.

One Ring to bind them

Now entering the configuration space.

There is a domain-specific language named TDL3, which stands for One Ring Task Definition Language. (There are also DSLs named TDL1 and TDL2, but they're fairly subtle.)

For the language's object model, see TaskDefinitionLanguage.java. Note that the main form intended for human use isn't JSON but a simple non-sectioned .ini (or Java's .properties) file. We refer to this file as tasks.ini, or just a config.

A recommended practice is to write keys in paragraphs grouped per Operation, each preceded by its input DataStreams' keys and followed by its output DataStreams' keys.

Namespace Layers

As you could see, each key begins with the prefix spark.meta.. One Ring can (and first tries to) read its configuration directly from the Spark context, not only from a config file, and each Spark property must start with the spark. prefix. We add another prefix, meta. (by convention; this can be any unique token of your choice), to distinguish our own properties from Spark's. Also, a single tasks.ini may contain a number of Processes if properly prefixed; just start their keys with spark.process1_name., spark.another_process. and so on.
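
For example, two Processes could coexist in a single tasks.ini like this (the Process names and the key values here are hypothetical):

spark.process1_name.task.input.sink=signals
spark.another_process.task.input.sink=scores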

If you run One Ring in Local mode, you can supply properties via an .ini file and omit all prefixes. Let us assume that we've mentally stripped all Spark prefixes and now look directly into the namespaces of keys.

The config is layered into several namespaces, and all parameter names must be unique in the corresponding namespace. These layers are distinguished, again, by some prefix.

Foreign Layers

The first namespace layer is One Ring DistWrapper's distcp., which instructs that utility to generate a script file for the dist-cp calls:

distcp.wrap=both
distcp.exe=s3-dist-cp

Let us discuss it a bit later. The CLI itself ignores all foreign layers.

Variables

If a key or a value contains a token of the form {ALL_CAPS}, it'll be treated by the CLI as a configuration Variable, and will be replaced by the value supplied via the command line or a variables file.

ds.input.part_count.signals={PARTS}

If the Variable's value wasn't supplied, no replacement will be made, unless the Variable includes a default value for itself in the form of {ALL_CAPS:any default value}. Default values may not contain the '}' symbol.

ds.input.part_count.signals={PARTS:50}

There are a few other restrictions on default values. First, each Variable occurrence has its own default and does not carry it over the entire config, so you should set it each time you use that Variable. Second, if a Variable after a replacement forms a reference to another Variable, it will not be processed recursively. We do not want to build a Turing-complete machine out of tasks.ini.

It is notable that Variables may be encountered on either side of the = in tasks.ini lines, and there is no limit to their number in a single line and/or config file.
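
For example, a single value may combine several Variables, one of them with a default (the layout of this value is hypothetical):

ds.input.path.signals={INPUT_PATH:file:/tmp/input}/{DATE}/signals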

CLI Task of the Process

The next layer is task., and it contains properties that configure the CLI itself for the current Process as a Spark job, or a CLI Task.

task.operations=range_filter,accuracy_filter,h3,timezone,center_mass_1,track_type,type_other,track_type_filter,remove_point_type,iron_glitch,slow_motion,center_mass_2,aqua2,map_by_user,map_tracks_non_pedestrian,map_pedestrian,map_aqua2,count_by_user,count_tracks_non_pedestrian,count_pedestrian,count_aqua2
task.input.sink=signals

task.tee.output=timezoned,tracks_non_pedestrian,pedestrian,aqua2,count_by_user,count_tracks_non_pedestrian,count_pedestrian,count_aqua2

task.operations (required) is a comma-separated list of Operation names to execute in the specified order. Any number of them, but not less than one. Names must be unique.

task.input.sink (required too) is the input sink. Any DataStream referred to here is considered to be sourced from outside storage, and will be created by the CLI's Storage Adapters (discussed later) for consumption by Operations.

task.tee.output (also required) is a T-connector. Any DataStream referred to here can be consumed by Operations as usual, but will also be diverted by the CLI's Storage Adapters into outside storage.

Operation Instances

Operations share the layer op., and it has quite a number of sub-layers.

An Operation of a certain name is a certain Java class, but we don't like to call Operations by fully qualified class names, so instead we ask them nicely how they would like to be called by a short name.

So, you must specify such short names for each of your Operations in the chain, for example:

op.operation.range_filter=rangeFilter
op.operation.accuracy_filter=accuracyFilter
op.operation.h3=h3
op.operation.timezone=timezone
op.operation.center_mass_1=trackCentroidFilter
op.operation.track_type=trackType
op.operation.map_by_user=mapToPair
op.operation.map_pedestrian=mapToPair
op.operation.count_pedestrian=countByKey
op.operation.count_aqua2=countByKey

You see that you may have any number of calls to the same Operation class in your Process; they'll all be initialized as independent instances with different reference names.

Operation Inputs and Outputs

Now we go down to the Operations' namespace (op.) sub-layers.

First is op.input., which defines the DataStreams an Operation is about to consume by name. Their names are assigned internally by the Operation itself. Also, an Operation could decide to process an arbitrary number of (or even wildcard) DataStreams, positioned in the order specified by the op.inputs. layer.

Examples from the config are:

op.inputs.range_filter=signals
op.input.accuracy_filter.signals=range_accurate_signals
op.inputs.h3=accurate_signals
op.inputs.timezone=AG
op.input.center_mass_1.signals=timezoned
op.input.track_type.signals=if_step1
op.inputs.type_other=tracks

Note that the keys end with just the name of an Operation in the case of positional Inputs, or 'name of an Operation' + '.' + 'its internal name of the input' for named ones. These layers are mutually exclusive for a given Operation.

The same goes for the op.output. and op.outputs. layers that describe the DataStreams an Operation is about to produce. Examples:

op.outputs.range_filter=range_accurate_signals
op.output.accuracy_filter.signals=accurate_signals
op.outputs.h3=AG
op.outputs.timezone=timezoned
op.output.center_mass_1.signals=if_step1
op.output.track_type.signals=tracks
op.outputs.type_other=tracks_non_pedestrian
op.output.track_type_filter.signals=pedestrian_typed

A wildcard DataStream reference is defined like this:

op.inputs.union=prefix*

It'll match all DataStreams with said prefix that are available at the point of execution, and will be automatically converted into a list with no particular order.

Parameters of Operations

The next sub-layer is for Operation Parameter Definitions, op.definition.. Parameter names take up the rest of the op.definition. keys, and the first prefix of a Parameter name is the name of the Operation it belongs to.

Each Parameter Definition is supplied to the CLI by the Operation itself via the TDL2 interface (Task Description Language), and they are strongly typed. They can have a value of any Number descendant, String, enum, String[] (as a comma-separated list), or Boolean type.

Some Parameters may be defined as optional, and in that case they have a default value.

Some Parameters may be dynamic; in that case they have a fixed prefix and a variable ending.

Finally, there is a variety of Parameters that refer specifically to columns of input DataStreams. By convention, their names must end in .column or .columns, and their values must refer to a valid column or list of columns, or to one of the columns generated by the Operation. By convention, generated column names start with an underscore.

Here are some examples:

op.definition.range_filter.filtering.column=signals.accuracy
op.definition.range_filter.filtering.range=[0 50]
op.definition.h3.hash.level=9
op.definition.timezone.source.timezone.default=GMT
op.definition.timezone.destination.timezone.default={TZ}
op.definition.timezone.source.timestamp.column=AG.timestamp
op.definition.type_other.match.values=car,bike,non_residential,extremely_large
op.definition.track_type_filter.target.type=pedestrian
op.definition.track_type_filter.stop.time=900
op.definition.track_type_filter.upper.boundary.stop=0.05
op.definition.map_pedestrian.key.columns=pedestrian.userid,pedestrian.dow,pedestrian.hour
op.definition.map_aqua2.key.columns=aqua2.userid,aqua2.dow,aqua2.hour

The Parameter filtering.column of the Operation named range_filter points to the column accuracy of the DataStream signals, just as source.timestamp.column of timezone is a reference to the AG column timestamp. And map_pedestrian's key.columns refers to a list of pedestrian columns.

The Parameter hash.level of h3 is of type Byte, type_other's match.values is a String[], and track_type_filter's upper.boundary.stop is a Double.

To set an optional Parameter to its default value, you may omit that key altogether or, if you like completeness, comment it out:

#op.definition.another_h3.hash.level=9

For the exhaustive table of each Operation's Parameters, look at the docs inside your 'RESTWrapper/docs' directory (assuming you've successfully built the project; otherwise it'll be empty).

Parameters of DataStreams

The next layer is ds., the configuration namespace of DataStreams, and its rules are quite different.

First off, DataStreams are always typed. The types are:

  • CSV (column-based Text RDD with freely defined, but strongly referenced columns)

  • Fixed (CSV, but the column order and format are considered fixed)

  • Point (object-based, contains Point coordinates with metadata)

  • Polygon (object-based, contains Polygon outlines with metadata)

  • KeyValue (PairRDD with an opaque key and a column-based value like CSV)

  • Plain (the RDD is generated by the CLI as just opaque Hadoop Text, or it can be a custom-typed RDD handled by an Operation)

Each DataStream can be configured as an input for a number of Operations, and as an output of only one of them.

The DataStream name is always the last part of any ds. key, and the set of DataStream Parameters is fixed.

ds.input.path. keys must point to some abstract paths for all DataStreams listed under the task.input.sink key. The format of the path must always include the protocol specification, and is validated by a Storage Adapter of the CLI (Adapters are discussed in the last section of this document).

For example, for a DataStream named 'signals' there is a path recognized by the S3 Direct Adapter:

ds.input.path.signals=s3d://{BUCKET_NAME}/key/name/with/{a,the}/mask*.spec.?

Notice the usage of glob expressions. The '{a,the}' token won't be processed as a Variable; instead, the Adapter expands it into the list of 'a' and 'the' directories inside the '/key/name/with' directory.

The same is true for ds.output.path. keys, which must be specified for all DataStreams listed under the task.tee.output key. Let us divert the DataStream 'scores' to the local filesystem:

ds.output.path.scores={OUTPUT_PATH:file:/tmp/testing}/scores

But you may cheat here. There are all-input and all-output default keys:

ds.input.path=jdbc:SELECT * FROM scheme.
ds.output.path=aero:output/

In that case, for each DataStream that doesn't have its own path, its name will be appended to the end of the corresponding cheat key value without a separator. We don't recommend the usage of these cheat keys in a production environment.
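
For instance, with the two cheat keys above, a DataStream named 'signals' that has no path of its own would resolve to the following effective paths:

jdbc:SELECT * FROM scheme.signals
aero:output/signals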

The ds.input.columns. and ds.output.columns. layers define columns for column-based DataStreams, or metadata properties for object-based ones. Column names must be unique for that particular DataStream.

Output columns must always refer to valid columns of the inputs passed to the Operation that emits said DataStream, or to its generated columns (whose names start with an underscore).

The input columns list just assigns new column names for all consuming Operations. It may contain a single underscore instead of some column name to make that column anonymous. Even if a column is 'anonymous', it still may be referenced by its number, starting from _1_.

Here is an exhaustive example of all the column definition rules:

ds.input.columns.signals=userid,lat,lon,accuracy,idtype,timestamp

ds.output.columns.AG=accurate_signals.userid,accurate_signals.lat,accurate_signals.lon,accurate_signals.accuracy,accurate_signals.idtype,accurate_signals.timestamp,_hash
ds.input.columns.AG=userid,lat,lon,accuracy,idtype,timestamp,gid

ds.output.columns.timezoned=AG.userid,AG.lat,AG.lon,AG.accuracy,AG.idtype,AG.timestamp,_output_date,_output_year_int,_output_month_int,_output_dow_int,_output_day_int,_output_hour_int,_output_minute_int,AG.gid
ds.input.columns.timezoned=userid,lat,lon,accuracy,idtype,timestamp,date,year,month,dow,day,hour,minute,gid

ds.output.columns.tracks=if_step1.userid,if_step1.lat,if_step1.lon,if_step1.accuracy,_velocity,if_step1.timestamp,if_step1.date,if_step1.year,if_step1.month,if_step1.dow,if_step1.day,if_step1.hour,if_step1.minute,if_step1.gid,_track_type
ds.input.columns.tracks=userid,lat,lon,_,_,_,_,_,_,_,_,_,_,_,track_type

ds.output.columns.pedestrian_typed=if_step1.userid,if_step1.lat,if_step1.lon,if_step1.accuracy,if_step1.velocity,if_step1.timestamp,if_step1.date,if_step1.year,if_step1.month,if_step1.dow,if_step1.day,if_step1.hour,if_step1.minute,if_step1.gid,_point_type
ds.input.columns.pedestrian_typed=_,_,_,_,_,_,_,_,_,_,_,_,_,_,point_type

ds.output.columns.pedestrian=pedestrian_typed._1_,pedestrian_typed._2_,pedestrian_typed._3_,pedestrian_typed._4_,pedestrian_typed._5_,pedestrian_typed._6_,pedestrian_typed._7_,pedestrian_typed._8_,pedestrian_typed._9_,pedestrian_typed._10_,pedestrian_typed._11_,pedestrian_typed._12_,pedestrian_typed._13_,pedestrian_typed._14_
ds.input.columns.pedestrian=userid,lat,lon,_,_,timestamp,_,_,_,_,_,_,_,_

In CSV varieties of DataStreams, columns are separated by a separator character, so there are ds.input.separator. and ds.output.separator. layers, along with the cheat keys ds.input.separator and ds.output.separator that set them globally. The super-global default value of the column separator is the tabulation (TAB, 0x09) character.

The final ds. layers control the partitioning of the DataStreams' underlying RDDs, namely ds.input.part_count. and ds.output.part_count.. These are quite important because the super-global default value for the part count is always 1 (one) part, and no cheats are allowed. You must always set them at least for the initial input DataStreams from the task.input.sink list, and you may tune the partitioning in the middle of the Process according to the further flow of the Task.

If both part_count. keys are specified for some intermediate DataStream, it will be repartitioned first to the output count (immediately after the Operation that generated it), and then to the input count (before feeding it to the first consuming Operation). Please keep that in mind.
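
For example, with the following keys (the counts are arbitrary) the intermediate DataStream 'timezoned' would be written out in 50 partitions right after the Operation that generates it, and then repartitioned into 200 before the first consuming Operation:

ds.output.part_count.timezoned=50
ds.input.part_count.timezoned=200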

Storage Adapters

Input DataStreams of an entire Process come from the outside world, and output DataStreams are stored somewhere outside. The CLI does this job via its Storage Adapters.

The following Storage Adapters are currently implemented:

  • Hadoop (fallback, uses all protocols available in your Spark environment, e.g. 'file:', 's3:')

  • HDFS (the same Hadoop, but just for the 'hdfs:' protocol)

  • S3 Direct (any S3-compatible storage with a protocol of 's3d:')

  • Aerospike ('aero:')

  • JDBC ('jdbc:')

The fallback Hadoop Adapter is called if and only if no other Adapter recognizes the protocol of the path, so the priority of the 'hdfs:' protocol is higher than that of other platform-supplied ones.

Storage Adapters share the two namesake layers input. and output., and all their Parameters are global.

The Hadoop Adapter has no explicit Parameters, and neither does the HDFS Adapter.

S3 Direct uses the standard Amazon S3 client provider and has only one Parameter, for output:

  • output.content.type with a default of 'text/csv'

JDBC Adapter Parameters are:

  • input.jdbc.driver and output.jdbc.driver for the fully qualified class name of the driver, which must be available in the classpath. No default.

  • input.jdbc.url and output.jdbc.url for the connection URLs. No default.

  • input.jdbc.user and output.jdbc.user, with no default.

  • input.jdbc.password and output.jdbc.password, with no default.

  • output.jdbc.batch.size for the output batch size; the default is '500'.

Aerospike Adapter Parameters are (a combined sketch follows this list):

  • input.aerospike.host and output.aerospike.host, which default to 'localhost'.

  • input.aerospike.port and output.aerospike.port, which default to '3000'.
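
To illustrate, a sketch of these global Adapter Parameters could look like this (the driver class, URL and credentials are placeholders, not taken from any real setup):

input.jdbc.driver=org.postgresql.Driver
input.jdbc.url=jdbc:postgresql://localhost:5432/source_db
input.jdbc.user=one_ring
input.jdbc.password={DB_PASSWORD}
output.jdbc.batch.size=500
output.aerospike.host=localhost
output.aerospike.port=3000
output.content.type=text/csv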

This concludes the configuration of the One Ring CLI for a single Process. After you've assembled a library of basic Processes, you may want to know how to Compose them into larger workflows.

One Ring to unite them

There is a utility in the CLI to merge (or Compose) two or more One Ring Process Templates into one larger Template.

This action is final and should be performed only when the pipeline of all participating Processes is fully established, as it mangles most of the named entities from the composed tasks.ini files and emits a much less readable config.

Name mangling is necessary because tasks.ini files from different Processes may contain Operations and DataStreams with the same names, and we want to avoid reference clashes. DataStreams may keep their names and carry them over into the resulting config, though.

Command line invocation of Composer is as follows (also available via REST):

java -cp ./TaskWrapper/target/one-ring-cli.jar ash.nazg.composer.Composer -X spark.meta -C "/path/to/process1.ini=alias1,/path/to/process2.ini=alias2" -o /path/to/process1and2.ini -M /path/to/mapping.file -v /path/to/variables.file -F

The -C parameter is a comma-separated list of config file path=alias pairs. The order of Operations in the resulting config follows the order of this list. Source configs may be in .ini and JSON formats, even freely mixed; just use the .json extension for JSON configs.

-X lists the task prefix(es), if they're present. If each is different, specify them all in a comma-separated list. If all are the same, specify it only once. If there are no prefixes, just omit this switch.

-V same as CLI: name=value pairs of Variables for all configs, separated by newlines and encoded to Base64.

-v same as CLI: path to a variables file, one name=value pair per line.

-M path to a DataStream mapping file used to pass DataStreams from one Process to the other(s).

The syntax of that file is like this:

alias1.name1 alias2.name1
alias1.name2 alias2.name5 alias3.name8

This example's first line means that the DataStream 'name1' from the Process 'alias2' will be replaced by the DataStream 'name1' from 'alias1', and 'name1' is retained in the resulting config. The second line replaces 'name5' in 'alias2' and 'name8' in 'alias3' with 'name2' from 'alias1', and persists 'name2' across the merged config. So the principle is simple: if you want to merge several DataStreams from different Processes, place the main one first, and then list the DataStreams to be replaced.

-o path to the composed output config file, in .ini format by default. For JSON output, use the .json extension.

-F performs a Full Compose, if this switch is given. The resulting task.tee.output will then contain only the same outputs as the very last config in the chain; otherwise it'll contain the outputs from all merged Tasks.

And in the Clouds run them

There are two supported ways to execute One Ring Tasks. It's always better to start with Local mode, debug the whole Process on smaller verified datasets, and then move it to the Cloud.

Local Execution

After you've composed the Process configuration, you definitely should test it locally with a small but reasonably representative sample of your source data. It is much easier to debug a new Process locally than on a real cluster.

After you've built the local artifact, as described above, call it like this:

java -jar ./RestWrapper/target/one-ring-cli.jar -c /path/to/tasks.ini -l -m 6g -x spark.meta -S /path/to/dist_interface.file

-c sets the path to tasks.ini.

-l means the local execution mode of the Spark context ('local[*]', to be precise).

-m sets the amount of Spark memory, like '4g' or '512m'.

-x sets the current task prefix, if needed. If you're planning to pass tasks.ini to your cluster via the Spark context, you should use a prefixed tasks.ini locally too.

-S is used to interface with One Ring Dist, discussed a bit further on.

Also, you should pass all the input and output paths via Variables, to ease the transition between your local file system storage and the cluster's storage.

For example, let us assume your Process has two source datasets and one result, stored under the paths specified by the SOURCE_SIGNALS, SOURCE_POIS and OUTPUT_SCORES Variables. Just prepare a newline-separated list of name=value pairs for them, and then you have two ways to pass them to the One Ring CLI:

  1. Encode them as a Base64 string and pass it with the -V command line key

  2. Place them into a file (or other Adapter-supported Storage) and pass its path with the -v command line key

If both keys are specified, -V has higher priority, and -v will be ignored.

For example:

cat > /path/to/variables.ini
SIGNALS_PATH=file:/path/to/signals
POIS_PATH=file:/path/to/pois
OUTPUT_PATH=file:/path/to/output
^D

base64 -w0 < /path/to/variables.ini
U0lHTkFMU19QQVRIPWZpbGU6L3BhdGgvdG8vc2lnbmFscwpQT0lTX1BBVEg9ZmlsZTovcGF0aC90by9wb2lzCk9VVFBVVF9QQVRIPWZpbGU6L3BhdGgvdG8vb3V0cHV0Cg==

java -jar ./RestWrapper/target/one-ring-cli.jar -c /path/to/tasks.ini -l -m 6g -V U0lHTkFMU19QQVRIPWZpbGU6L3BhdGgvdG8vc2lnbmFscwpQT0lTX1BBVEg9ZmlsZTovcGF0aC90by9wb2lzCk9VVFBVVF9QQVRIPWZpbGU6L3BhdGgvdG8vb3V0cHV0Cg==

java -jar ./RestWrapper/target/one-ring-cli.jar -c /path/to/tasks.ini -l -m 6g -v /path/to/variables.ini

You'll see a lot of Spark output, as well as the dump of your Task. If everything is successful, you'll see no exceptions in that output. If not, read the exception messages carefully and fix your tasks.ini and/or check the source data files.

Execution on a Compute Cluster

One Ring officially supports execution on EMR Spark clusters via TeamCity continuous deployment builds, but it could be adapted relatively easily for other clouds, continuous integration services, and automation scenarios.

We assume you're already familiar with AWS and have a utility EC2 instance in that cloud. You may or may not have to set up TeamCity or some other CI service of your preference on that instance. We like it automated, though.

First off, you need to set up some additional environment on the utility instance, starting with the latest version of PowerShell (at the very least, version 6 is required) and AWS Tools for PowerShell. Please follow the official AWS documentation, and register your AWS API access key with these Tools.

Get the scripts and CloudFormation deployment template:

git clone https://github.com/PastorGL/one-ring-emr.git

Also get a template of configuration files:

git clone https://github.com/PastorGL/one-ring-emr-settings.git

And there are TC configs you may import into your TC:

git clone https://github.com/PastorGL/one-ring-tc-builds.git

Don't forget to customize the VCS roots, and always use your own private copy of one-ring-emr-settings, because the most sensitive data will go there. In the case of another CI service, you may extract the build steps from TC's XMLs. Their structure is pretty straightforward; just dig into them.

The environment set up by the build configurations is a directory where the contents of one-ring-emr are augmented with one-ring-emr-settings and the One Ring artifacts one-ring-cli.jar and one-ring-dist.jar, so it looks like this (you may also use symlinks to place them in manually):

/common
/presets
/settings
 one-ring-cli.jar
 one-ring-dist.jar
 create-cluster.ps1
 list-jobs.ps1
 preset-params.ps1
 remove-cluster.ps1
 run-job.ps1
 set-params.ps1
 cluster.template

You place your tasks.ini into the /settings subdirectory alongside the other .ini files. Also, you must fill in all required values in all .ini files inside this directory, so that they conform to your AWS account environment.

We usually put presets for all our Processes in different branches of our copy of the one-ring-emr-settings repo, and just switch to the required branch of that repo for each Process' build configuration.

Build steps are executed in the following order:

  1. Ask for Variables on TC UI

  2. preset-params.ps1

  3. set-params.ps1

  4. create-cluster.ps1

  5. Encode Variables to Base64

  6. run-job.ps1

  7. remove-cluster.ps1

Let us explain what each step does.

TC has a feature to define 'build configuration parameters', and provides a UI to set them at build execution time (along with corresponding REST methods). We use these build parameters to set the Variables in our Process Template, and ask the user for their values. Also, we ask for any additional parameters specific to the environment, such as a preset of the cluster size.

At the next step we take one of the four cluster size presets from the /presets directory (S, M, L, XL .ini files), if one was selected on the previous step, and place its contents into the build parameters.

set-params.ps1 has the ability to override any line of any existing .ini file from the /settings subdirectory by replacing it with a custom build parameter named 'filename.ini' + '.' + 'parameter.name', which gives you another level of build parametrization flexibility. This script overwrites the .ini files with these parameters, so all further scripts receive augmented configurations.

At the next step we create a Spark cluster in EMR by deploying the CloudFormation template, augmented with all the parameters gathered up to this moment and the parameters from /settings/create.ini.

Then we encode the Variables with Base64, just as we did in Local mode.

At this moment everything is ready to run the Process on the cluster. run-job.ps1 sets up all the required environment (from the per-component .ini files in /settings), calls the Livy REST method on the cluster, and waits for completion. If tasks.ini contains more than one Task, all of them will be run in the order of definition. Its own parameters are set by /settings/run.ini.

Even if any of the previous steps fail, remove-cluster.ps1 should be called. This script does the cleanup, and is controlled by /settings/remove.ini.

All scripts that deal with the cluster also share parameters from the global /settings/aws.ini file.

It is possible to execute every PowerShell script in interactive mode and manually copy-paste their output variables between steps via command line parameters. It may be helpful to familiarize yourself with that stuff before going fully automated.

We also usually go to a higher level of automation and enqueue TC builds via their REST API.

Anyway, closely watch your CloudFormation, EMR and EC2 consoles for at least the first few tries. There may be insufficient access rights and a lot of other issues, but we assume you are already experienced with AWS and EMR, if you are here.

And if you are, you already know that the S3 object storage is not too well suited for Spark because of its architectural peculiarities, like 'eventual consistency' and the dependency of response time on the number of objects in the bucket. To avoid timeout errors, it is recommended to always copy the source data from S3 to the cluster's HDFS before the Spark invocation, and, vice versa, to copy the result back from HDFS to S3 after it has been computed.

EMR provides a utility named s3-dist-cp, but its usage is cumbersome because you must know the exact paths.

One Ring provides a Dist wrapper to automate the handling of s3-dist-cp while focusing on your Task config, so you can still use Variables for source paths and generate result paths dynamically, without bothering yourself with the s3-dist-cp command line.

One Ring Dist can also be used with other flavors of dist-cp, if that is required by your environment.

Calling One Ring Dist

The syntax is similar to the CLI:

java -jar ./DistWrapper/target/one-ring-dist.jar -c /path/to/tasks.ini -o /path/to/call_distcp.sh -S /path/to/dist_interface.file -d DIRECTION -x spark.meta

The -c, -x, and -v/-V switches have the same meaning for Dist as for the CLI.

-d specifies the direction of the copying process:

  • 'from' to copy the source data from S3 to HDFS,

  • 'to' to copy the result back.

-S specifies the path to the interface file with the list of HDFS paths of Task outputs generated by the CLI.

-o is the path where the script with the full s3-dist-cp commands will be written.

Dist Configuration

Dist has its own layer in tasks.ini, prefixed with distcp., with a small set of keys.

distcp.exe specifies which executable should be used. By default, it has a value of 's3-dist-cp'.

distcp.direction sets which copy operations are implied to be performed. In addition to the -d switch values 'from' and 'to', there are:

  • 'both' to indicate that copying in both directions is required,

  • 'nop' (default) to suppress the copying.

The Boolean distcp.move directs Dist to remove files after copying. By default it is set to true.

distcp.dir.to and distcp.dir.from specify which HDFS paths are to be used to store the files gathered from the external storage and the results, respectively, with the defaults of '/input' and '/output'. Subdirectories named after DataStreams will be automatically created for their files under these paths.

distcp.store and distcp.ini provide another way to set the -S and -o values (but command line switches always have higher priority and override these keys if set).
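
Put together, a typical Dist layer could look like this sketch (the values shown are either the defaults described above or, for the direction, one of the allowed options):

distcp.exe=s3-dist-cp
distcp.direction=both
distcp.move=true
distcp.dir.to=/input
distcp.dir.from=/output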

Dist Usage

When the CLI encounters the distcp.direction directive in the config, it transparently replaces all its S3 input and output paths with HDFS paths, according to the provided direction.

This is useful for a multi-Process tasks.ini. If the outputs from the first Task are stored in HDFS, it allows the next Tasks to consume them without a round-trip to S3, while the configured paths still point to S3:

spark.task1.distcp.direction=to
spark.task2.distcp.direction=nop
spark.task3.distcp.direction=from

...and if the same Task is executed solo, it should just use the bi-directional copy:

spark.task1.distcp.direction=both

...without further changes to the path Variables and other configuration parameters.

For any DataStream that goes to distcp.dir.from, the CLI adds a line with the resulting HDFS path to the Dist interface file (under the distcp.store / -S path). That allows Dist to gather them and generate commands for the 'from' direction.

Actually, you should never invoke Dist manually; it is the job of the automation scripts to call it before and after the execution of the CLI.

Join the Fellowship

If you want to contribute, please refer to the list of One Ring issues first.

An issue for your most desired feature may have already been created by someone, but if it has not, you're free to create one. Do not attach any assignees, labels, projects and so on; just provide a detailed explanation of its use case and write a simple initial specification.

If you have some spare programming horsepower and just want to code something, you can select an issue labeled 'Help Wanted' or 'Wishlist' and assigned a priority (one of the 'Px' labels). There are also the labels 'Okay to tackle' and 'Good first issue', which designate the complexity of an issue. Just assign it to yourself and ask for a complete specification, if it doesn't exist yet.

Do not choose issues labeled 'Caution advised' unless you have become really familiar with the entire One Ring codebase.

Note that One Ring has an established code style. Your contributions must firmly adhere to the coding patterns we use, so they don't feel alien.

Make sure your pull requests don't touch anything outside the scope of the issue, and don't add or change versions of any external dependencies. At the very least, discuss these topics with the original authors before you contribute.

We won't accept any code that has been borrowed from sources whose licenses aren't compatible with ours. Ours is New BSD with a 'do no evil' clause. Just do not use One Ring as a weapon, okay?

This project has no Code of Conduct and never will. You can be as sarcastic as you want, but stay polite and don't troll anyone if you don't want to be trolled in return.

Happy data engineering!

Translated from: https://habr.com/en/post/486466/
