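I'm about to start a public-opinion monitoring project that needs stream processing. Flink gets a lot of praise online, so I decided to learn it, mostly from the official docs while digging through the source along the way.
Let's get started.
1. Obtaining a StreamExecutionEnvironment
The overview describes three ways to obtain a StreamExecutionEnvironment: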
StreamExecutionEnvironment.getExecutionEnvironment();
StreamExecutionEnvironment.createLocalEnvironment();
StreamExecutionEnvironment.createRemoteEnvironment(String host, int port, String... jarFiles);
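getExecutionEnvironment()
When the program runs inside IDEA, this creates a local environment; when the jar is submitted from the command line, it returns the execution environment of the cluster the job is submitted to.
createLocalEnvironment()
Uses a local environment.
createRemoteEnvironment(String host, int port, String... jarFiles)
Uses a remote environment. The parameters are the host address, the port, and the jar files to submit.
The API docs actually list ten methods. A look at the source shows they are all public, so any of them can be called from outside, but the others are just overloads or variants of the three above. The source: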
public static StreamExecutionEnvironment getExecutionEnvironment() {
return getExecutionEnvironment(new Configuration());
}
public static StreamExecutionEnvironment getExecutionEnvironment(Configuration configuration) {
return (StreamExecutionEnvironment)Utils.resolveFactory(threadLocalContextEnvironmentFactory, contextEnvironmentFactory).map((factory) -> {
return factory.createExecutionEnvironment(configuration);
}).orElseGet(() -> {
return createLocalEnvironment(configuration);
});
}
getExecutionEnvironment() has just one overload. The difference is that you can pass in your own Configuration; for the available settings, have a look at the configure() method.
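The map(...).orElseGet(...) chain in the source above is a plain "use a registered factory if there is one, otherwise fall back to local" lookup. Here is a minimal JDK-only sketch of that shape. EnvFactoryDemo, resolve and the String stand-ins for environments are hypothetical names of mine, and the rule that the thread-local factory wins over the static one is my assumption about what Utils.resolveFactory does, not something shown in the snippet.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Stand-alone model of the lookup in getExecutionEnvironment(Configuration).
// All names here are hypothetical; this is not Flink code.
final class EnvFactoryDemo {

    // Assumption: prefer the thread-local factory, then the static one; empty if neither is set.
    static Optional<Supplier<String>> resolve(Supplier<String> threadLocalFactory,
                                              Supplier<String> staticFactory) {
        if (threadLocalFactory != null) {
            return Optional.of(threadLocalFactory);
        }
        return Optional.ofNullable(staticFactory);
    }

    // Same shape as map(factory -> factory.create...).orElseGet(() -> createLocalEnvironment(...)).
    static String getEnvironment(Supplier<String> threadLocalFactory,
                                 Supplier<String> staticFactory) {
        return resolve(threadLocalFactory, staticFactory)
                .map(Supplier::get)
                .orElseGet(() -> "local");
    }

    public static void main(String[] args) {
        System.out.println(getEnvironment(null, null));            // no factory registered
        System.out.println(getEnvironment(null, () -> "cluster")); // a factory is registered
    }
}
```

With no factory registered (the IDE case) the fallback branch runs, which matches the "local environment in the IDE" behaviour described earlier.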
public static LocalStreamEnvironment createLocalEnvironment() {
return createLocalEnvironment(defaultLocalParallelism);
}
public static LocalStreamEnvironment createLocalEnvironment(int parallelism) {
return createLocalEnvironment(parallelism, new Configuration());
}
public static LocalStreamEnvironment createLocalEnvironment(int parallelism, Configuration configuration) {
Configuration copyOfConfiguration = new Configuration();
copyOfConfiguration.addAll(configuration);
copyOfConfiguration.set(CoreOptions.DEFAULT_PARALLELISM, parallelism);
return createLocalEnvironment(copyOfConfiguration);
}
public static LocalStreamEnvironment createLocalEnvironment(Configuration configuration) {
if (configuration.getOptional(CoreOptions.DEFAULT_PARALLELISM).isPresent()) {
return new LocalStreamEnvironment(configuration);
} else {
Configuration copyOfConfiguration = new Configuration();
copyOfConfiguration.addAll(configuration);
copyOfConfiguration.set(CoreOptions.DEFAULT_PARALLELISM, defaultLocalParallelism);
return new LocalStreamEnvironment(copyOfConfiguration);
}
}
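createLocalEnvironment() has three overloads. The customizable parameters are an int parallelism, which sets the default parallelism (how many parallel subtasks each operator runs with), and a Configuration. The source shows that parallelism is really just one entry in the Configuration, CoreOptions.DEFAULT_PARALLELISM. Its default, defaultLocalParallelism, is the number of available processors, initialized in a static block: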
defaultLocalParallelism = Runtime.getRuntime().availableProcessors();
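To make the "keep the user's value, otherwise copy and fill in the default" logic of createLocalEnvironment(Configuration) concrete, here is a small JDK-only sketch. It uses a plain Map in place of Flink's Configuration; LocalParallelismDemo and effectiveConfig are hypothetical names, and "parallelism.default" is the key I understand CoreOptions.DEFAULT_PARALLELISM to map to.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone model using a Map instead of Flink's Configuration; names are hypothetical.
final class LocalParallelismDemo {
    static final String DEFAULT_PARALLELISM_KEY = "parallelism.default";
    static final int DEFAULT_LOCAL_PARALLELISM = Runtime.getRuntime().availableProcessors();

    // If the caller already set a parallelism, use the config as-is;
    // otherwise work on a copy so the caller's object is never modified.
    static Map<String, String> effectiveConfig(Map<String, String> userConfig) {
        if (userConfig.containsKey(DEFAULT_PARALLELISM_KEY)) {
            return userConfig;
        }
        Map<String, String> copy = new HashMap<>(userConfig);
        copy.put(DEFAULT_PARALLELISM_KEY, String.valueOf(DEFAULT_LOCAL_PARALLELISM));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        System.out.println(effectiveConfig(user)); // default filled in on a copy
        user.put(DEFAULT_PARALLELISM_KEY, "2");
        System.out.println(effectiveConfig(user)); // user value kept
    }
}
```

The copy is the interesting design choice: the caller's configuration object stays untouched when the default has to be filled in.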
@PublicEvolving
public static StreamExecutionEnvironment createLocalEnvironmentWithWebUI(Configuration conf) {
Preconditions.checkNotNull(conf, "conf");
if (!conf.contains(RestOptions.PORT)) {
conf.setInteger(RestOptions.PORT, (Integer)RestOptions.PORT.defaultValue());
}
return createLocalEnvironment(conf);
}
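There is one more way to create a local environment, createLocalEnvironmentWithWebUI(Configuration conf), which lets you change the web UI port (8081 by default). Judging from the code, the other settings in the Configuration should take effect as well.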
public static StreamExecutionEnvironment createRemoteEnvironment(String host, int port, String... jarFiles) {
return new RemoteStreamEnvironment(host, port, jarFiles);
}
public static StreamExecutionEnvironment createRemoteEnvironment(String host, int port, int parallelism, String... jarFiles) {
RemoteStreamEnvironment env = new RemoteStreamEnvironment(host, port, jarFiles);
env.setParallelism(parallelism);
return env;
}
public static StreamExecutionEnvironment createRemoteEnvironment(String host, int port, Configuration clientConfig, String... jarFiles) {
return new RemoteStreamEnvironment(host, port, clientConfig, jarFiles);
}
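createRemoteEnvironment(String host, int port, String... jarFiles) has two overloads, and the extra parameters are again parallelism and a Configuration. Unlike the local case, parallelism here is set directly on the ExecutionConfig of the RemoteStreamEnvironment (a subclass of StreamExecutionEnvironment), which is presumably the runtime configuration used on the server (see the source below), whereas the Configuration here is the client-side configuration.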
public StreamExecutionEnvironment setParallelism(int parallelism) {
this.config.setParallelism(parallelism);
return this;
}
To set both, call createRemoteEnvironment(String host, int port, Configuration clientConfig, String... jarFiles) and then use the setParallelism(int parallelism) method above to configure the parallelism.
2. Adding sources
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
1) A file as the source
DataStream<String> text = env.readTextFile("file:///path/to/file");
These are the file-reading source methods in the source code (deprecated ones omitted):
public DataStreamSource<String> readTextFile(String filePath) {
return this.readTextFile(filePath, "UTF-8");
}
public DataStreamSource<String> readTextFile(String filePath, String charsetName) {
Preconditions.checkArgument(!StringUtils.isNullOrWhitespaceOnly(filePath), "The file path must not be null or blank.");
TextInputFormat format = new TextInputFormat(new Path(filePath));
format.setFilesFilter(FilePathFilter.createDefaultFilter());
TypeInformation<String> typeInfo = BasicTypeInfo.STRING_TYPE_INFO;
format.setCharsetName(charsetName);
return this.readFile(format, filePath, FileProcessingMode.PROCESS_ONCE, -1L, (TypeInformation)typeInfo);
}
public <OUT> DataStreamSource<OUT> readFile(FileInputFormat<OUT> inputFormat, String filePath) {
return this.readFile(inputFormat, filePath, FileProcessingMode.PROCESS_ONCE, -1L);
}
@PublicEvolving
public <OUT> DataStreamSource<OUT> readFile(FileInputFormat<OUT> inputFormat, String filePath, FileProcessingMode watchType, long interval) {
TypeInformation typeInformation;
try {
typeInformation = TypeExtractor.getInputFormatTypes(inputFormat);
} catch (Exception var8) {
throw new InvalidProgramException("The type returned by the input format could not be automatically determined. Please specify the TypeInformation of the produced type explicitly by using the 'createInput(InputFormat, TypeInformation)' method instead.");
}
return this.readFile(inputFormat, filePath, watchType, interval, typeInformation);
}
@PublicEvolving
public <OUT> DataStreamSource<OUT> readFile(FileInputFormat<OUT> inputFormat, String filePath, FileProcessingMode watchType, long interval, TypeInformation<OUT> typeInformation) {
Preconditions.checkNotNull(inputFormat, "InputFormat must not be null.");
Preconditions.checkArgument(!StringUtils.isNullOrWhitespaceOnly(filePath), "The file path must not be null or blank.");
inputFormat.setFilePath(filePath);
return this.createFileInput(inputFormat, typeInformation, "Custom File Source", watchType, interval);
}
private <OUT> DataStreamSource<OUT> createFileInput(FileInputFormat<OUT> inputFormat, TypeInformation<OUT> typeInfo, String sourceName, FileProcessingMode monitoringMode, long interval) {
Preconditions.checkNotNull(inputFormat, "Unspecified file input format.");
Preconditions.checkNotNull(typeInfo, "Unspecified output type information.");
Preconditions.checkNotNull(sourceName, "Unspecified name for the source.");
Preconditions.checkNotNull(monitoringMode, "Unspecified monitoring mode.");
Preconditions.checkArgument(monitoringMode.equals(FileProcessingMode.PROCESS_ONCE) || interval >= 1L, "The path monitoring interval cannot be less than 1 ms.");
ContinuousFileMonitoringFunction<OUT> monitoringFunction = new ContinuousFileMonitoringFunction(inputFormat, monitoringMode, this.getParallelism(), interval);
ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory = new ContinuousFileReaderOperatorFactory(inputFormat);
Boundedness boundedness = monitoringMode == FileProcessingMode.PROCESS_ONCE ? Boundedness.BOUNDED : Boundedness.CONTINUOUS_UNBOUNDED;
SingleOutputStreamOperator<OUT> source = this.addSource(monitoringFunction, sourceName, (TypeInformation)null, boundedness).transform("Split Reader: " + sourceName, typeInfo, factory);
return new DataStreamSource(source);
}
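The two readTextFile methods are easy to follow: just a file path and a charset name. They are probably also the most commonly used.
The readFile methods take more parameters, but you can see that they call one another in order, and all of them end up in the last readFile method. So let's go straight to the parameters:
OUT
- The element type of the returned data stream. It's a generic, nothing more to say.
inputFormat
- Format information for the input, including the file path, the encoding, the timeout for opening a file, the path filter, and so on. Subclasses of FileInputFormat include AbstractCsvInputFormat, AvroInputFormat, BinaryInputFormat, DelimitedInputFormat, HiveTableFileInputFormat, OrcInputFormat and ParquetInputFormat. The TextInputFormat that appears in the second method is a subclass of DelimitedInputFormat.
filePath
- The file path (for example "file:///some/local/file" or "hdfs://host:port/file/path").
watchType
- A FileProcessingMode with only two values, PROCESS_ONCE and PROCESS_CONTINUOUSLY. They mean what they say: process the path once, or keep scanning it for new data.
typeInformation
- Type information. TypeInformation is the core class of Flink's type system and describes a type. It can be obtained with the methods in org.apache.flink.api.java.typeutils.TypeExtractor, such as TypeExtractor.getInputFormatTypes(inputFormat) in the four-argument readFile. For basic types you can use BasicTypeInfo directly, such as BasicTypeInfo.STRING_TYPE_INFO in the second method.
interval
- When the path is monitored periodically, the time between two scans, in milliseconds. If watchType is PROCESS_CONTINUOUSLY, interval must not be less than 1.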
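The relationship between watchType and interval is easiest to see in isolation. Below is a minimal JDK-only sketch of the two rules that createFileInput applies above: the interval check and the bounded/unbounded decision. FileWatchDemo and its Mode enum are hypothetical stand-ins of mine, not Flink classes.

```java
// Stand-alone model of two checks from createFileInput; names are hypothetical.
final class FileWatchDemo {
    enum Mode { PROCESS_ONCE, PROCESS_CONTINUOUSLY }

    // The interval only matters when the path is scanned repeatedly.
    static void checkInterval(Mode mode, long intervalMillis) {
        if (!(mode == Mode.PROCESS_ONCE || intervalMillis >= 1L)) {
            throw new IllegalArgumentException(
                    "The path monitoring interval cannot be less than 1 ms.");
        }
    }

    // A one-off read is a bounded source; continuous monitoring never ends.
    static boolean isBounded(Mode mode) {
        return mode == Mode.PROCESS_ONCE;
    }

    public static void main(String[] args) {
        checkInterval(Mode.PROCESS_ONCE, -1L);           // fine: interval is ignored
        checkInterval(Mode.PROCESS_CONTINUOUSLY, 1000L); // fine: scan every second
        System.out.println(isBounded(Mode.PROCESS_ONCE));
    }
}
```

This is why readTextFile can pass -1L as the interval: with PROCESS_ONCE the value is never checked.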
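2) An input format as the source
The parameters are much the same as for files, except that inputFormat is typed as the InputFormat interface. FileInputFormat is one implementation of InputFormat, and other implementations can be used here. The API docs list quite a few, so have a look:
InputFormat (Flink : 1.12-SNAPSHOT API)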
@PublicEvolving
public <OUT> DataStreamSource<OUT> createInput(InputFormat<OUT, ?> inputFormat) {
return this.createInput(inputFormat, TypeExtractor.getInputFormatTypes(inputFormat));
}
@PublicEvolving
public <OUT> DataStreamSource<OUT> createInput(InputFormat<OUT, ?> inputFormat, TypeInformation<OUT> typeInfo) {
DataStreamSource source;
if (inputFormat instanceof FileInputFormat) {
FileInputFormat<OUT> format = (FileInputFormat)inputFormat;
source = this.createFileInput(format, typeInfo, "Custom File source", FileProcessingMode.PROCESS_ONCE, -1L);
} else {
source = this.createInput(inputFormat, typeInfo, "Custom Source");
}
return source;
}
private <OUT> DataStreamSource<OUT> createInput(InputFormat<OUT, ?> inputFormat, TypeInformation<OUT> typeInfo, String sourceName) {
InputFormatSourceFunction<OUT> function = new InputFormatSourceFunction(inputFormat, typeInfo);
return this.addSource(function, sourceName, typeInfo);
}
3) A collection as the source
DataStream<Integer> myInts = env.fromElements(1, 2, 3, 4, 5);
List<Tuple2<String, Integer>> data = ...
DataStream<Tuple2<String, Integer>> myTuples = env.fromCollection(data);
Iterator<Long> longIt = ...
DataStream<Long> myLongs = env.fromCollection(longIt, Long.class);
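This one is fairly simple. Different methods expect different kinds of collections, and the extra parameter just passes in the type. Straight to the source: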
@SafeVarargs
public final <OUT> DataStreamSource<OUT> fromElements(OUT... data) {
if (data.length == 0) {
throw new IllegalArgumentException("fromElements needs at least one element as argument");
} else {
TypeInformation typeInfo;
try {
typeInfo = TypeExtractor.getForObject(data[0]);
} catch (Exception var4) {
throw new RuntimeException("Could not create TypeInformation for type " + data[0].getClass().getName() + "; please specify the TypeInformation manually via StreamExecutionEnvironment#fromElements(Collection, TypeInformation)", var4);
}
return this.fromCollection((Collection)Arrays.asList(data), (TypeInformation)typeInfo);
}
}
@SafeVarargs
public final <OUT> DataStreamSource<OUT> fromElements(Class<OUT> type, OUT... data) {
if (data.length == 0) {
throw new IllegalArgumentException("fromElements needs at least one element as argument");
} else {
TypeInformation typeInfo;
try {
typeInfo = TypeExtractor.getForClass(type);
} catch (Exception var5) {
throw new RuntimeException("Could not create TypeInformation for type " + type.getName() + "; please specify the TypeInformation manually via StreamExecutionEnvironment#fromElements(Collection, TypeInformation)", var5);
}
return this.fromCollection((Collection)Arrays.asList(data), (TypeInformation)typeInfo);
}
}
public <OUT> DataStreamSource<OUT> fromCollection(Collection<OUT> data) {
Preconditions.checkNotNull(data, "Collection must not be null");
if (data.isEmpty()) {
throw new IllegalArgumentException("Collection must not be empty");
} else {
OUT first = data.iterator().next();
if (first == null) {
throw new IllegalArgumentException("Collection must not contain null elements");
} else {
TypeInformation typeInfo;
try {
typeInfo = TypeExtractor.getForObject(first);
} catch (Exception var5) {
throw new RuntimeException("Could not create TypeInformation for type " + first.getClass() + "; please specify the TypeInformation manually via StreamExecutionEnvironment#fromElements(Collection, TypeInformation)", var5);
}
return this.fromCollection(data, typeInfo);
}
}
}
public <OUT> DataStreamSource<OUT> fromCollection(Collection<OUT> data, TypeInformation<OUT> typeInfo) {
Preconditions.checkNotNull(data, "Collection must not be null");
FromElementsFunction.checkCollection(data, typeInfo.getTypeClass());
FromElementsFunction function;
try {
function = new FromElementsFunction(typeInfo.createSerializer(this.getConfig()), data);
} catch (IOException var5) {
throw new RuntimeException(var5.getMessage(), var5);
}
return this.addSource(function, "Collection Source", typeInfo, Boundedness.BOUNDED).setParallelism(1);
}
public <OUT> DataStreamSource<OUT> fromCollection(Iterator<OUT> data, Class<OUT> type) {
return this.fromCollection(data, TypeExtractor.getForClass(type));
}
public <OUT> DataStreamSource<OUT> fromCollection(Iterator<OUT> data, TypeInformation<OUT> typeInfo) {
Preconditions.checkNotNull(data, "The iterator must not be null");
SourceFunction<OUT> function = new FromIteratorFunction(data);
return this.addSource(function, "Collection Source", typeInfo, Boundedness.BOUNDED);
}
public <OUT> DataStreamSource<OUT> fromParallelCollection(SplittableIterator<OUT> iterator, Class<OUT> type) {
return this.fromParallelCollection(iterator, TypeExtractor.getForClass(type));
}
public <OUT> DataStreamSource<OUT> fromParallelCollection(SplittableIterator<OUT> iterator, TypeInformation<OUT> typeInfo) {
return this.fromParallelCollection(iterator, typeInfo, "Parallel Collection Source");
}
private <OUT> DataStreamSource<OUT> fromParallelCollection(SplittableIterator<OUT> iterator, TypeInformation<OUT> typeInfo, String operatorName) {
return this.addSource(new FromSplittableIteratorFunction(iterator), operatorName, typeInfo, Boundedness.BOUNDED);
}
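4) A socket address as the source
DataStream<String> dataStream = env.socketTextStream("localhost", 9999);
There are three methods, taking up to four parameters. hostname and port are required. delimiter is the record separator, "\n" by default. maxRetry is the maximum time to keep retrying, in seconds; the default 0 means the connection is dropped as soon as an error occurs, and a negative value means retrying without limit.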
@PublicEvolving
public DataStreamSource<String> socketTextStream(String hostname, int port) {
return this.socketTextStream(hostname, port, "\n");
}
@PublicEvolving
public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter) {
return this.socketTextStream(hostname, port, delimiter, 0L);
}
@PublicEvolving
public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter, long maxRetry) {
return this.addSource(new SocketTextStreamFunction(hostname, port, delimiter, maxRetry), (String)"Socket Stream");
}
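To illustrate what the delimiter parameter is for, here is a JDK-only sketch that cuts a character stream into records at each delimiter, buffering across reads. It is a generic illustration and not the code of SocketTextStreamFunction; DelimitedReaderDemo and readRecords are hypothetical names, and a StringReader stands in for the socket.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Generic delimiter-based record splitting; names are hypothetical, not Flink API.
final class DelimitedReaderDemo {

    static List<String> readRecords(Reader reader, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[8]; // small on purpose, so records span several reads
        try {
            int read;
            while ((read = reader.read(chunk)) != -1) {
                buffer.append(chunk, 0, read);
                int pos;
                // Emit every complete record currently sitting in the buffer.
                while ((pos = buffer.indexOf(delimiter)) != -1) {
                    records.add(buffer.substring(0, pos));
                    buffer.delete(0, pos + delimiter.length());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Whatever is left when the stream ends is the final record.
        if (buffer.length() > 0) {
            records.add(buffer.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readRecords(new StringReader("a\nbb\nccc"), "\n")); // [a, bb, ccc]
    }
}
```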
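5) A custom source
In fact, every method above ends up calling the last addSource method.
Its first parameter, SourceFunction, is an interface, and each of the methods above ultimately creates one of its implementations: files use ContinuousFileMonitoringFunction, input formats use InputFormatSourceFunction, collections use FromElementsFunction, FromIteratorFunction and FromSplittableIteratorFunction, and sockets use SocketTextStreamFunction. API page: SourceFunction (Flink : 1.12-SNAPSHOT API)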
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function) {
return this.addSource(function, "Custom Source");
}
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName) {
return this.addSource(function, sourceName, (TypeInformation)null);
}
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, TypeInformation<OUT> typeInfo) {
return this.addSource(function, "Custom Source", typeInfo);
}
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {
return this.addSource(function, sourceName, typeInfo, Boundedness.CONTINUOUS_UNBOUNDED);
}
private <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, @Nullable TypeInformation<OUT> typeInfo, Boundedness boundedness) {
Preconditions.checkNotNull(function);
Preconditions.checkNotNull(sourceName);
Preconditions.checkNotNull(boundedness);
TypeInformation<OUT> resolvedTypeInfo = this.getTypeInfo(function, sourceName, SourceFunction.class, typeInfo);
boolean isParallel = function instanceof ParallelSourceFunction;
this.clean(function);
StreamSource<OUT, ?> sourceOperator = new StreamSource(function);
return new DataStreamSource(this, resolvedTypeInfo, sourceOperator, isParallel, sourceName, boundedness);
}
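A custom source boils down to implementing SourceFunction's two methods: run, which emits elements until told to stop, and cancel, which tells it to stop. So that the sketch runs without Flink on the classpath, it uses a hypothetical MiniSource interface with a plain Consumer in place of Flink's SourceContext; CounterSource and drain are likewise made-up names. With the real interface the structure would be the same, assuming you emit through the context's collect method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class CounterSourceDemo {

    // Hypothetical stand-in for SourceFunction: run() emits, cancel() asks it to stop.
    interface MiniSource<T> {
        void run(Consumer<T> collector);
        void cancel();
    }

    // Emits 0, 1, 2, ... up to (but excluding) limit, or until cancelled.
    static final class CounterSource implements MiniSource<Long> {
        private final long limit;
        // volatile because cancel() is typically called from another thread
        private volatile boolean running = true;

        CounterSource(long limit) {
            this.limit = limit;
        }

        @Override
        public void run(Consumer<Long> collector) {
            for (long i = 0; running && i < limit; i++) {
                collector.accept(i);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    // Runs the source to completion and gathers what it emitted.
    static List<Long> drain(MiniSource<Long> source) {
        List<Long> out = new ArrayList<>();
        source.run(out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new CounterSource(3))); // [0, 1, 2]
    }
}
```

Checking the running flag on every iteration is what lets cancel take effect promptly; a source that loops without checking it cannot be stopped cleanly.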
3. Adding operators
@Internal
public void addOperator(Transformation<?> transformation) {
Preconditions.checkNotNull(transformation, "transformation must not be null.");
this.transformations.add(transformation);
}
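This adds a Transformation operator, which runs after execute is called. map, flatMap and process in DataStream all end up calling this method. Operators deserve a closer look of their own, so I won't dig further here.
4. Execution
In attached mode, execute blocks until the job finishes and returns the result, a JobExecutionResult, directly. In detached mode (see the source below) it returns a DetachedJobExecutionResult that carries only the job ID.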
public JobExecutionResult execute() throws Exception {
return this.execute(this.getJobName());
}
public JobExecutionResult execute(String jobName) throws Exception {
Preconditions.checkNotNull(jobName, "Streaming Job name should not be null.");
return this.execute(this.getStreamGraph(jobName));
}
@Internal
public JobExecutionResult execute(StreamGraph streamGraph) throws Exception {
JobClient jobClient = this.executeAsync(streamGraph);
try {
Object jobExecutionResult;
if (this.configuration.getBoolean(DeploymentOptions.ATTACHED)) {
jobExecutionResult = (JobExecutionResult)jobClient.getJobExecutionResult().get();
} else {
jobExecutionResult = new DetachedJobExecutionResult(jobClient.getJobID());
}
this.jobListeners.forEach((jobListener) -> {
jobListener.onJobExecuted(jobExecutionResult, (Throwable)null);
});
return (JobExecutionResult)jobExecutionResult;
} catch (Throwable var5) {
Throwable strippedException = ExceptionUtils.stripExecutionException(var5);
this.jobListeners.forEach((jobListener) -> {
jobListener.onJobExecuted((JobExecutionResult)null, strippedException);
});
ExceptionUtils.rethrowException(strippedException);
return null;
}
}
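Running with executeAsync does not wait for the job to finish. It only returns a JobClient, which you can use to fetch the result when you need it: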
JobExecutionResult jobExecutionResult = (JobExecutionResult)jobClient.getJobExecutionResult().get();
@PublicEvolving
public final JobClient executeAsync() throws Exception {
return this.executeAsync(this.getJobName());
}
@PublicEvolving
public JobClient executeAsync(String jobName) throws Exception {
return this.executeAsync(this.getStreamGraph((String)Preconditions.checkNotNull(jobName)));
}
@Internal
public JobClient executeAsync(StreamGraph streamGraph) throws Exception {
Preconditions.checkNotNull(streamGraph, "StreamGraph cannot be null.");
Preconditions.checkNotNull(this.configuration.get(DeploymentOptions.TARGET), "No execution.target specified in your configuration file.");
PipelineExecutorFactory executorFactory = this.executorServiceLoader.getExecutorFactory(this.configuration);
Preconditions.checkNotNull(executorFactory, "Cannot find compatible factory for specified execution.target (=%s)", new Object[]{this.configuration.get(DeploymentOptions.TARGET)});
CompletableFuture jobClientFuture = executorFactory.getExecutor(this.configuration).execute(streamGraph, this.configuration, this.userClassloader);
try {
JobClient jobClient = (JobClient)jobClientFuture.get();
this.jobListeners.forEach((jobListener) -> {
jobListener.onJobSubmitted(jobClient, (Throwable)null);
});
return jobClient;
} catch (ExecutionException var6) {
Throwable strippedException = ExceptionUtils.stripExecutionException(var6);
this.jobListeners.forEach((jobListener) -> {
jobListener.onJobSubmitted((JobClient)null, strippedException);
});
throw new FlinkException(String.format("Failed to execute job '%s'.", streamGraph.getJobName()), strippedException);
}
}
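Look closely at the source and you'll see that execute itself ends up calling executeAsync. Digging a little further, the waiting happens at jobClient.getJobExecutionResult().get(), and the method that actually blocks is get() in CompletableFuture: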
public T get() throws InterruptedException, ExecutionException {
Object r;
return reportGet((r = result) == null ? waitingGet(true) : r);
}
This contains waitingGet(true), which keeps waiting while the result is still null.
Whichever way you run the job, take care to handle exceptions.
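The blocking and the exception handling can both be tried out with a plain CompletableFuture, no Flink required. In this JDK-only sketch, submit is a hypothetical stand-in for a job submission and await shows how get() blocks and how a failure surfaces as an ExecutionException wrapping the original cause; BlockingGetDemo and both method names are mine.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// JDK-only illustration of blocking on a future; names are hypothetical.
final class BlockingGetDemo {

    // Stand-in for submitting a job: completes on another thread, or fails.
    static CompletableFuture<String> submit(boolean fail) {
        return CompletableFuture.supplyAsync(() -> {
            if (fail) {
                throw new IllegalStateException("job failed");
            }
            return "done";
        });
    }

    // get() blocks the caller until the future completes either way.
    static String await(CompletableFuture<String> future) {
        try {
            return future.get();
        } catch (ExecutionException e) {
            // The original exception is available as the cause.
            return "failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(await(submit(false))); // done
        System.out.println(await(submit(true)));  // failed: job failed
    }
}
```

Unwrapping the cause here plays the same role as ExceptionUtils.stripExecutionException in the Flink source above.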
5. Listeners
The methods for registering and clearing listeners:
// Register a listener
@PublicEvolving
public void registerJobListener(JobListener jobListener) {
Preconditions.checkNotNull(jobListener, "JobListener cannot be null");
this.jobListeners.add(jobListener);
}
// Clear all listeners
@PublicEvolving
public void clearJobListeners() {
this.jobListeners.clear();
}
Registering a JobListener:
env.registerJobListener(new JobListener() {
public void onJobSubmitted(@Nullable JobClient jobClient, @Nullable Throwable throwable) {
// Called back when the job has been submitted
}
public void onJobExecuted(@Nullable JobExecutionResult jobExecutionResult, @Nullable Throwable throwable) {
// Called back when the job has finished executing
}
});
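Judging from the source, onJobExecuted is never called when the job is submitted with executeAsync, so the result can only be obtained with get().
6. Setting the mode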
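The difference in callbacks can be reproduced with a tiny model. This JDK-only sketch mirrors the structure of the source above (execute delegates to executeAsync, then waits, then notifies) using hypothetical MiniEnv and MiniListener types, with Strings standing in for JobClient and JobExecutionResult. It uses join() instead of get() only to avoid checked exceptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Stand-alone model of the listener callbacks; all types here are hypothetical.
final class JobListenerDemo {

    interface MiniListener {
        void onJobSubmitted(String jobId);
        void onJobExecuted(String result);
    }

    static final class MiniEnv {
        private final List<MiniListener> listeners = new ArrayList<>();

        void registerJobListener(MiniListener listener) {
            listeners.add(listener);
        }

        // Submits and notifies onJobSubmitted only; does not wait for the result.
        CompletableFuture<String> executeAsync(String jobId) {
            CompletableFuture<String> result =
                    CompletableFuture.supplyAsync(() -> jobId + " finished");
            listeners.forEach(l -> l.onJobSubmitted(jobId));
            return result;
        }

        // Delegates to executeAsync, waits, then notifies onJobExecuted.
        String execute(String jobId) {
            String result = executeAsync(jobId).join();
            listeners.forEach(l -> l.onJobExecuted(result));
            return result;
        }
    }

    // Records which callbacks fire for each way of running the job.
    static List<String> record(boolean async) {
        List<String> events = new ArrayList<>();
        MiniEnv env = new MiniEnv();
        env.registerJobListener(new MiniListener() {
            @Override
            public void onJobSubmitted(String jobId) {
                events.add("submitted");
            }

            @Override
            public void onJobExecuted(String result) {
                events.add("executed");
            }
        });
        if (async) {
            env.executeAsync("job-1").join();
        } else {
            env.execute("job-1");
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(record(true));  // [submitted]
        System.out.println(record(false)); // [submitted, executed]
    }
}
```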
@PublicEvolving
public StreamExecutionEnvironment setRuntimeMode(RuntimeExecutionMode executionMode) {
Preconditions.checkNotNull(executionMode);
this.configuration.set(ExecutionOptions.RUNTIME_MODE, executionMode);
return this;
}
RuntimeExecutionMode has three values:
STREAMING - stream processing
BATCH - batch processing
AUTOMATIC - chosen according to whether the data is bounded
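As I understand the Flink docs, AUTOMATIC resolves to BATCH when every source is bounded and to STREAMING as soon as one source is unbounded. A minimal JDK-only sketch of that rule follows; RuntimeModeDemo and resolve are hypothetical names, and the two enums only mirror the Flink ones by name.

```java
import java.util.List;

// Stand-alone model of how AUTOMATIC could be resolved; names are hypothetical.
final class RuntimeModeDemo {
    enum Boundedness { BOUNDED, CONTINUOUS_UNBOUNDED }
    enum Mode { STREAMING, BATCH, AUTOMATIC }

    static Mode resolve(Mode requested, List<Boundedness> sources) {
        if (requested != Mode.AUTOMATIC) {
            return requested; // an explicit choice is kept as-is in this model
        }
        boolean allBounded = sources.stream().allMatch(b -> b == Boundedness.BOUNDED);
        return allBounded ? Mode.BATCH : Mode.STREAMING;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Mode.AUTOMATIC,
                List.of(Boundedness.BOUNDED, Boundedness.BOUNDED)));              // BATCH
        System.out.println(resolve(Mode.AUTOMATIC,
                List.of(Boundedness.BOUNDED, Boundedness.CONTINUOUS_UNBOUNDED))); // STREAMING
    }
}
```

This ties back to the sources above: fromCollection and a PROCESS_ONCE file read are created with Boundedness.BOUNDED, while the public addSource overloads default to CONTINUOUS_UNBOUNDED.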