01_Flink Streaming env

This article introduces the core of Flink stream processing, focusing on the StreamExecutionEnvironment (env) object, which serves as the streaming context: it attaches data sources, sets runtime parameters, and launches jobs. It also covers the DataStream object, used for sinks, logical computation, and other operations, as well as the stability features provided by Flink core, such as business-time extraction, exactly-once semantics, and backpressure handling.
<pre name="code" class="java">package com.alibaba.flink.train.streaming;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class HelloWorld {
	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment
				.getExecutionEnvironment();
		// env.setParallelism(4);//并发度
		DataStream<String> dataStream = env
				.readTextFile("D:/flinkdata/helloworld"); // 1:(flink storm
															// )(hadoop hive)
		dataStream
				.flatMap(
						new FlatMapFunction<String, Tuple2<String, Integer>>() {
							@Override
							public void flatMap(String input,
									Collector<Tuple2<String, Integer>> collector)
									throws Exception {
								String[] objs = input.split(" ");
								for (String obj : objs) {
									collector
											.collect(new Tuple2<String, Integer>(
													obj, 1));// (这里很关键,表示0位置是word,1的位置是1次数)
								}
							}
						})// 2:(flink 1)(storm 1)
				.keyBy(0)// 3:以第0个位置的值,做分区。
				.sum(1)// (flink:8)(storm:5),对第1个位置的值做sum的操作。
				.printToErr();
		env.execute();//启动任务
		while (true) {

		}
	}

}

 
storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
 flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive  storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase storm flink spark
hadoop hive hbase 

The core of Flink stream processing consists of two objects:

1: The StreamExecutionEnvironment object, env for short, which is the context of a streaming program. Through env you can (a short sketch follows this list):

1: attach data sources

2: set runtime parameters such as the parallelism (indirectly, via the ExecutionConfig object it holds)

3: launch the job
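
A minimal sketch of these three responsibilities of env; the class name, socket host/port, and job name below are illustrative assumptions, not part of the original example:

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnvSketch {
	public static void main(String[] args) throws Exception {
		// the context object of the streaming program
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		// 1: attach data sources (file and socket sources are built in)
		DataStream<String> fileStream = env.readTextFile("D:/flinkdata/helloworld");
		DataStream<String> socketStream = env.socketTextStream("localhost", 9999); // placeholder host/port

		// 2: set runtime parameters; setParallelism forwards to the ExecutionConfig held by env
		env.setParallelism(4);
		ExecutionConfig config = env.getConfig(); // other runtime settings are exposed here

		// use the streams so the job graph is not empty
		fileStream.print();
		socketStream.printToErr();

		// 3: launch the job
		env.execute("env-sketch");
	}
}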

2: The DataStream object, which carries the streaming computation itself, including (a short sketch follows this list):

1: writing data out to sinks

2: logical computation

3: data partitioning

4: filtering

5: windows

6: joins
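
A minimal sketch of a few of these DataStream operations (filter, flatMap, keyBy, window, sink). The class name, socket host/port, and window size are illustrative assumptions, and the join pattern (where/equalTo over a window on two streams) is left out to keep the example short:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class DataStreamSketch {
	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		DataStream<String> lines = env.socketTextStream("localhost", 9999); // placeholder source

		lines
				// 4: filter -- drop empty lines
				.filter(line -> !line.trim().isEmpty())
				// 2: logical computation -- split each line into (word, 1) pairs
				.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
					@Override
					public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
						for (String word : line.split(" ")) {
							out.collect(new Tuple2<String, Integer>(word, 1));
						}
					}
				})
				// 3: partition the stream by the word (position 0 of the tuple)
				.keyBy(0)
				// 5: count each word inside 5-second windows
				.timeWindow(Time.seconds(5))
				.sum(1)
				// 1: sink -- print the results (addSink(...) would write to an external system)
				.print();

		env.execute("datastream-sketch");
	}
}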


Besides these business-facing streaming features, there are lower-level features for the stability of a streaming job, provided by Flink core (a short sketch of the related settings follows this list):

1: business (event) time extraction

2: exactly-once semantics

3: backpressure handling

4: fault tolerance

5: state management

6: checkpointing
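
Most of these features are enabled or tuned through env as well. A minimal sketch of the relevant settings; the class name and the interval/pause values are illustrative assumptions:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StabilitySketch {
	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		// 1: switch from the default processing time to event (business) time;
		//    per-record timestamps are then extracted with assignTimestampsAndWatermarks on a DataStream
		env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

		// 2 / 6: checkpoint every 10 seconds with exactly-once guarantees; checkpoints are also
		//        the basis of fault tolerance (4) and state recovery (5)
		env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE);

		// fine-grained checkpoint tuning goes through the CheckpointConfig shown in the source below
		env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);

		// 3: backpressure needs no configuration -- the network stack slows producers down automatically

		// ... define sources and transformations here, then call env.execute()
	}
}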


The object to look at first is env; an excerpt of the StreamExecutionEnvironment source follows.

/*

 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.streaming.api.environment;

import com.esotericsoftware.kryo.Serializer;
import com.google.common.base.Preconditions;

import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.Public;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.InvalidProgramException;
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.functions.InvalidTypesException;
import org.apache.flink.api.common.functions.StoppableFunction;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.api.common.io.InputFormat;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.ClosureCleaner;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.typeutils.MissingTypeInfo;
import org.apache.flink.api.java.typeutils.PojoTypeInfo;
import org.apache.flink.api.java.typeutils.ResultTypeQueryable;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.client.program.ContextEnvironment;
import org.apache.flink.client.program.OptimizerPlanEnvironment;
import org.apache.flink.client.program.PreviewPlanEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.functions.source.FileMonitoringFunction;
import org.apache.flink.streaming.api.functions.source.FileMonitoringFunction.WatchType;
import org.apache.flink.streaming.api.functions.source.FileReadFunction;
import org.apache.flink.streaming.api.functions.source.FileSourceFunction;
import org.apache.flink.streaming.api.functions.source.FromElementsFunction;
import org.apache.flink.streaming.api.functions.source.FromIteratorFunction;
import org.apache.flink.streaming.api.functions.source.FromSplittableIteratorFunction;
import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;
import org.apache.flink.streaming.api.functions.source.SocketTextStreamFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.source.StatefulSequenceSource;
import org.apache.flink.streaming.api.graph.StreamGraph;
import org.apache.flink.streaming.api.graph.StreamGraphGenerator;
import org.apache.flink.streaming.api.operators.StoppableStreamSource;
import org.apache.flink.streaming.api.operators.StreamSource;
import org.apache.flink.runtime.state.AbstractStateBackend;
import org.apache.flink.streaming.api.transformations.StreamTransformation;
import org.apache.flink.util.SplittableIterator;

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

import static java.util.Objects.requireNonNull;

/**
 * An ExecutionEnvironment for streaming jobs. An instance of it is
 * necessary to construct streaming topologies.
 */
/**
 * The StreamExecutionEnvironment is the context in which a streaming program is executed. A
 * {@link LocalStreamEnvironment} will cause execution in the current JVM, a
 * {@link RemoteStreamEnvironment} will cause execution on a remote setup.
 * 
 * <p>The environment provides methods to control the job execution (such as setting the parallelism
 * or the fault tolerance/checkpointing parameters) and to interact with the outside world (data access).
 *
 * @see org.apache.flink.streaming.api.environment.LocalStreamEnvironment
 * @see org.apache.flink.streaming.api.environment.RemoteStreamEnvironment
 */
@Public
public abstract class StreamExecutionEnvironment {

	/** The default name to use for a streaming job if no other name has been specified */
	public static final String DEFAULT_JOB_NAME = "Flink Streaming Job";
	
	/** The time characteristic that is used if none other is set */
	private static final TimeCharacteristic DEFAULT_TIME_CHARACTERISTIC = TimeCharacteristic.ProcessingTime;

	/** The default buffer timeout (max delay of records in the network stack) */
	private static final long DEFAULT_NETWORK_BUFFER_TIMEOUT = 100L;

	/** The environment of the context (local by default, cluster if invoked through command line) */
	private static StreamExecutionEnvironmentFactory contextEnvironmentFactory;

	/** The default parallelism used when creating a local environment */
	private static int defaultLocalParallelism = Runtime.getRuntime().availableProcessors();
	
	// ------------------------------------------------------------------------

	/** The execution configuration for this environment */
	private final ExecutionConfig config = new ExecutionConfig();
	
	/** Settings that control the checkpointing behavior */ 
	private final CheckpointConfig checkpointCfg = new CheckpointConfig();
	
	protected final List<StreamTransformation<?>> transformations = new ArrayList<>();
	
	private long bufferTimeout = DEFAULT_NETWORK_BUFFER_TIMEOUT;
	
	protected boolean isChainingEnabled = true;
	
	/** The state backend used for storing k/v state and state snapshots */
	private AbstractStateBackend defaultStateBackend;
	
	/** The time characteristic used by the data streams */
	private TimeCharacteristic timeCharacteristic = DEFAULT_TIME_CHARACTERISTIC;

	
	// --------------------------------------------------------------------------------------------
	// Constructor and Properties
	// --------------------------------------------------------------------------------------------

	/**
	 * Gets the config object.
	 */
	public ExecutionConfig getConfig() {
		return config;
	}

	/**
	 * Sets the parallelism for operations executed through this environment.
	 * Setting a parallelism of x here will cause all operators (such as map,
	 * batchReduce) to run with x parallel instances. This method overrides the
	 * default parallelism for this environment. The
	 * {@link LocalStreamEnvironment} uses by default a value equal to the
	 * number of hardware contexts (CPU cores / threads). When executing the
	 * program via the command line client from a JAR file, the default degree
	 * of parallelism is the one configured for that setup.
	 *
	 * @param parallelism The parallelism
	 */
	public StreamExecutionEnvironment setParallelism(int parallelism) {
		if (parallelism < 1) {
			throw new IllegalArgumentException("parallelism must be at least one.");
		}
		config.setParallelism(parallelism);
		return this;
	}

	/**
	 * Gets the parallelism with which operation are executed by default.
	 * Operations can individually override this value to use a specific
	 * parallelism.
	 *
	 * @return The parallelism used by operations, unless they override that
	 * value.
	 */
	public int getParallelism() {
		return config.getParallelism();
	}

	/**
	 * Sets the maximum time frequency (milliseconds) for the flushing of the
	 * output buffers. By default the output buffers flush frequently to provide
	 * low latency and to aid smooth developer experience. Setting the parameter
	 * can result in three logical modes:
	 * <p>
	 * <ul>
	 * <li>
	 * A positive integer triggers flushing periodically by that integer</li>
	 * <li>
	 * 0 triggers flushing after every record thus minimizing latency</li>
	 * <li>
	 * -1 triggers flushing only when the output buffer is full thus maximizing
	 * throughput</li>
	 * </ul>
	 *
	 * @param timeoutMillis
	 * 		The maximum time between two output flushes.
	 */
	public StreamExecutionEnvironment setBufferTimeout(long timeoutMillis) {
		if (timeoutMillis < -1) {
			throw new IllegalArgumentException("Timeout of buffer must be non-negative or -1");
		}

		this.bufferTimeout = timeoutMillis;
		return this;
	}

	/**
	 * Sets the maximum time frequency (milliseconds) for the flushing of the
	 * output buffers. For clarification on the extremal values see
	 * {@link #setBufferTimeout(long)}.
	 *
	 * @return The timeout of the buffer.
	 */
	public long getBufferTimeout() {
		return this.bufferTimeout;
	}

	/**
	 * Disables operator chaining for streaming operators. Operator chaining
	 * allows non-shuffle operations to be co-located in the same thread fully
	 * avoiding serialization and de-serialization.
	 *
	 * @return StreamExecutionEnvironment with chaining disabled.
	 */
	@PublicEvolving
	public StreamExecutionEnvironment disableOperatorChaining() {
		this.isChainingEnabled = false;
		return this;
	}

	/**
	 * Returns whether operator chaining is enabled.
	 *
	 * @return {@code true} if chaining is enabled, false otherwise.
	 */
	@PublicEvolving
	public boolean isChainingEnabled() {
		return isChainingEnabled;
	}

	// ------------------------------------------------------------------------
	//  Checkpointing Settings
	// ------------------------------------------------------------------------

	/**
	 * Gets the checkpoint config, which defines values like checkpoint interval, delay between
	 * checkpoints, etc.
	 * 
	 * @return The checkpoint config.
	 */
	public CheckpointConfig getCheckpointConfig() {
		return checkpointCfg;
	}

	// (remainder of the class omitted)