序
本文主要研究一下flink的Parallel Execution
实例
Operator Level
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();DataStream text = [...]DataStream> wordCounts = text .flatMap(new LineSplitter()) .keyBy(0) .timeWindow(Time.seconds(5)) .sum(1).setParallelism(5);wordCounts.print();env.execute("Word Count Example");
- operators、data sources、data sinks都可以调用setParallelism()方法来设置parallelism
Execution Environment Level
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();env.setParallelism(3);DataStream text = [...]DataStream> wordCounts = [...]wordCounts.print();env.execute("Word Count Example");
- 在ExecutionEnvironment里头可以通过setParallelism来给operators、data sources、data sinks设置默认的parallelism;如果operators、data sources、data sinks自己有设置parallelism则会覆盖ExecutionEnvironment设置的parallelism
Client Level
./bin/flink run -p 10 ../examples/*WordCount-java*.jar
或者
try { PackagedProgram program = new PackagedProgram(file, args); InetSocketAddress jobManagerAddress = RemoteExecutor.getInetFromHostport("localhost:6123"); Configuration config = new Configuration(); Client client = new Client(jobManagerAddress, config, program.getUserCodeClassLoader()); // set the parallelism to 10 here client.run(program, 10, true);} catch (ProgramInvocationException e) { e.printStackTrace();}
- 使用CLI client,可以在命令行调用是用-p来指定,或者Java/Scala调用时在Client.run的参数中指定parallelism
System Level
# The parallelism used for programs that did not specify and other parallelism.parallelism.default: 1
- 可以在flink-conf.yaml中通过parallelism.default配置项给所有execution environments指定系统级的默认parallelism
ExecutionEnvironment
flink-java-1.7.1-sources.jar!/org/apache/flink/api/java/ExecutionEnvironment.java
@Publicpublic abstract class ExecutionEnvironment { //...... private final ExecutionConfig config = new ExecutionConfig(); /** * Sets the parallelism for operations executed through this environment. * Setting a parallelism of x here will cause all operators (such as join, map, reduce) to run with * x parallel instances. * *
This method overrides the default parallelism for this environment. * The {@link LocalEnvironment} uses by default a value equal to the number of hardware * contexts (CPU cores / threads). When executing the program via the command line client * from a JAR file, the default parallelism is the one configured for that setup. * * @param parallelism The parallelism */ public void setParallelism(int parallelism) { config.setParallelism(parallelism); } @Internal public Plan createProgramPlan(String jobName, boolean clearSinks) { if (this.sinks.isEmpty()) { if (wasExecuted) { throw new RuntimeException("No new data sinks have been defined since the " + "last execution. The last execution refers to the latest call to " + "'execute()', 'count()', 'collect()', or 'print()'."); } else { throw new RuntimeException("No data sinks have been created yet. " + "A program needs at least one sink that consumes data. " + "Examples are writing the data set or printing it."); } } if (jobName == null) { jobName = getDefaultName(); } OperatorTranslation translator = new OperatorTranslation(); Plan plan = translator.translateToPlan(this.sinks, jobName); if (getParallelism() > 0) { plan.setDefaultParallelism(getParallelism()); } plan.setExecutionConfig(getConfig()); // Check plan for GenericTypeInfo's and register the types at the serializers. if (!config.isAutoTypeRegistrationDisabled()) { plan.accept(new Visitor>() { private final Set> registeredTypes = new HashSet<>(); private final Set> visitedOperators = new HashSet<>(); @Override public boolean preVisit(org.apache.flink.api.common.operators.Operator> visitable) { if (!visitedOperators.add(visitable)) { return false; } OperatorInformation> opInfo = visitable.getOperatorInfo(); Serializers.recursivelyRegisterType(opInfo.getOutputType(), config, registeredTypes); return true; } @Override public void postVisit(org.apache.flink.api.common.operators.Operator> visitable) {} }); } try { registerCachedFilesWithPlan(plan); } catch (Exception e) { throw new RuntimeException("Error while registering cached files: " + e.getMessage(), e); } // clear all the sinks such that the next execution does not redo everything if (clearSinks) { this.sinks.clear(); wasExecuted = true; } // All types are registered now. Print information. int registeredTypes = config.getRegisteredKryoTypes().size() + config.getRegisteredPojoTypes().size() + config.getRegisteredTypesWithKryoSerializerClasses().size() + config.getRegisteredTypesWithKryoSerializers().size(); int defaultKryoSerializers = config.getDefaultKryoSerializers().size() + config.getDefaultKryoSerializerClasses().size(); LOG.info("The job has {} registered types and {} default Kryo serializers