Best Practices

This page contains a collection of best practices for Flink programmers on how to solve frequently encountered problems.

Almost all Flink applications, both batch and streaming, rely on external configuration parameters. They are used, for example, to specify input and output sources (like paths or addresses), system parameters (parallelism, runtime configuration), and application-specific parameters (often used within the user functions).

Since version 0.9, Flink provides a simple utility called ParameterTool that offers at least some basic tooling for solving these problems.

Please note that you don’t have to use the ParameterTool explained here. Other frameworks such as Commons CLI and argparse4j also work well with Flink.

Getting your configuration values into the ParameterTool

The ParameterTool provides a set of predefined static methods for reading the configuration. The tool internally expects a Map<String, String>, so it’s very easy to integrate it with your own configuration style.
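
For example, if you already assemble your configuration in your own code, you can hand it over as a map. A minimal sketch (the keys and values below are made up for illustration):

// build a ParameterTool from a Map<String, String> assembled by your own configuration code
Map<String, String> config = new HashMap<>();
config.put("input", "hdfs:///mydata");
config.put("elements", "42");

ParameterTool parameter = ParameterTool.fromMap(config);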

From .properties files

The following method will read a Properties file and provide the key/value pairs:

String propertiesFile = "/home/sam/flink/myjob.properties";
ParameterTool parameter = ParameterTool.fromPropertiesFile(propertiesFile);

From the command line arguments

This allows getting arguments like --input hdfs:///mydata --elements 42 from the command line.

public static void main(String[] args) {
	ParameterTool parameter = ParameterTool.fromArgs(args);
	// .. regular code ..
}

From system properties

When starting a JVM, you can pass system properties to it: -Dinput=hdfs:///mydata. You can also initialize the ParameterTool from these system properties:

ParameterTool parameter = ParameterTool.fromSystemProperties();
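
The different sources can also be combined. As a sketch (assuming your Flink version provides ParameterTool.mergeWith), values from the command line can override values from a properties file:

ParameterTool fileParameters = ParameterTool.fromPropertiesFile("/home/sam/flink/myjob.properties");
ParameterTool argParameters = ParameterTool.fromArgs(args);

// values of the second ParameterTool override values of the first one
ParameterTool parameter = fileParameters.mergeWith(argParameters);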

Now that we’ve got the parameters from somewhere (see above) we can use them in various ways.

Directly from the ParameterTool

The ParameterTool itself has methods for accessing the values.

ParameterTool parameters = // ...
parameters.getRequired("input");
parameters.get("output", "myDefaultValue");
parameters.getLong("expectedCount", -1L);
parameters.getNumberOfParameters();
// .. there are more methods available.

You can use the return values of these methods directly in the main() method (i.e. in the client submitting the application). For example, you could set the parallelism of an operator like this:

ParameterTool parameters = ParameterTool.fromArgs(args);
int parallelism = parameters.getInt("mapParallelism", 2);
DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).setParallelism(parallelism);

Since the ParameterTool is serializable, you can pass it to the functions themselves:

ParameterTool parameters = ParameterTool.fromArgs(args);
DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer(parameters));

and then use them inside the function for getting values from the command line.
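
A minimal sketch of such a Tokenizer (the constructor, the field, and the separator parameter are assumptions for illustration, not part of the example above):

public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
	private final ParameterTool parameters;

	public Tokenizer(ParameterTool parameters) {
		// the ParameterTool is serializable, so it can be stored in the function
		this.parameters = parameters;
	}

	@Override
	public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
		// read a value that was passed on the command line, with a default
		String separator = parameters.get("separator", " ");
		for (String token : value.split(separator)) {
			out.collect(new Tuple2<>(token, 1));
		}
	}
}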

Passing it as a Configuration object to single functions

The example below shows how to pass the parameters as a Configuration object to a user defined function.

ParameterTool parameters = ParameterTool.fromArgs(args);
DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).withParameters(parameters.getConfiguration());

In the Tokenizer, the object is now accessible in the open(Configuration conf) method:

public static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {
	@Override
	public void open(Configuration parameters) throws Exception {
		parameters.getInteger("myInt", -1);
		// .. do more ..
	}
}

Register the parameters globally

Parameters registered as global job parameters in the ExecutionConfig can be accessed as configuration values from the JobManager web interface and in all functions defined by the user.

Register them at the ExecutionConfig:

ParameterTool parameters = ParameterTool.fromArgs(args);

// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(parameters);

Access them in any rich user function:

public static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

	@Override
	public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
		ParameterTool parameters = (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
		parameters.getRequired("input");
		// .. do more ..
	}
}

Naming large TupleX types

It is recommended to use POJOs (Plain Old Java Objects) instead of TupleX for data types with many fields. Also, POJOs can be used to give large Tuple types a name.

Example

Instead of using:

Tuple11<String, String, ..., String> var = new ...;

It is much easier to create a custom type extending from the large Tuple type.

CustomType var = new ...;

public static class CustomType extends Tuple11<String, String, ..., String> {
    // constructor matching super
}
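
As a sketch of the recommended POJO alternative (the class and field names are made up for illustration), a public class with a public no-argument constructor and public fields gives every field a readable name:

public static class Transaction {
	public String accountId;
	public String currency;
	public long amountCents;

	// Flink requires a public no-argument constructor to treat this class as a POJO
	public Transaction() {}
}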

Register a custom serializer for your Flink program

If you use a custom type in your Flink program which cannot be serialized by the Flink type serializer, Flink falls back to using the generic Kryo serializer. You may register your own serializer or a serialization system like Google Protobuf or Apache Thrift with Kryo. To do that, simply register the type class and the serializer in the ExecutionConfig of your Flink program.

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// register the class of the serializer as serializer for a type
env.getConfig().registerTypeWithKryoSerializer(MyCustomType.class, MyCustomSerializer.class);

// register an instance as serializer for a type
MySerializer mySerializer = new MySerializer();
env.getConfig().registerTypeWithKryoSerializer(MyCustomType.class, mySerializer);
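
For reference, a minimal sketch of what such a custom serializer could look like (MyCustomType and its single String field are assumptions for illustration):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class MyCustomSerializer extends Serializer<MyCustomType> {

	@Override
	public void write(Kryo kryo, Output output, MyCustomType object) {
		// write the fields of the object in a fixed order
		output.writeString(object.getName());
	}

	@Override
	public MyCustomType read(Kryo kryo, Input input, Class<MyCustomType> type) {
		// read the fields back in the same order they were written
		return new MyCustomType(input.readString());
	}
}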

Note that your custom serializer has to extend Kryo’s Serializer class. In the case of Google Protobuf or Apache Thrift, this has already been done for you:

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// register the Google Protobuf serializer with Kryo
env.getConfig().registerTypeWithKryoSerializer(MyCustomType.class, ProtobufSerializer.class);

// register the serializer included with Apache Thrift as the standard serializer
// TBaseSerializer states it should be initialized as a default Kryo serializer
env.getConfig().addDefaultKryoSerializer(MyCustomType.class, TBaseSerializer.class);

For the above example to work, you need to include the necessary dependencies in your Maven project file (pom.xml). In the dependency section, add the following for Apache Thrift:

<dependency>
	<groupId>com.twitter</groupId>
	<artifactId>chill-thrift</artifactId>
	<version>0.5.2</version>
</dependency>
<!-- libthrift is required by chill-thrift -->
<dependency>
	<groupId>org.apache.thrift</groupId>
	<artifactId>libthrift</artifactId>
	<version>0.6.1</version>
	<exclusions>
		<exclusion>
			<groupId>javax.servlet</groupId>
			<artifactId>servlet-api</artifactId>
		</exclusion>
		<exclusion>
			<groupId>org.apache.httpcomponents</groupId>
			<artifactId>httpclient</artifactId>
		</exclusion>
	</exclusions>
</dependency>

For Google Protobuf you need the following Maven dependency:

<dependency>
	<groupId>com.twitter</groupId>
	<artifactId>chill-protobuf</artifactId>
	<version>0.5.2</version>
</dependency>
<!-- We need protobuf for chill-protobuf -->
<dependency>
	<groupId>com.google.protobuf</groupId>
	<artifactId>protobuf-java</artifactId>
	<version>2.5.0</version>
</dependency>

Please adjust the versions of both libraries as needed.

Using Logback instead of Log4j

Note: This tutorial is applicable starting from Flink 0.10

Apache Flink uses slf4j as the logging abstraction in its code. Users are advised to use slf4j in their user functions as well.

Slf4j is a compile-time logging interface that can use different logging implementations at runtime, such as Log4j or Logback.

Flink depends on Log4j by default. This page describes how to use Flink with Logback.

To get a logger instance in the code, use the following code:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyClass implements MapFunction {
	private static final Logger LOG = LoggerFactory.getLogger(MyClass.class);
	// ...
}
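
As a small sketch of using that logger inside a user function (the type parameters and the log message are assumptions for illustration):

public static class MyMapper implements MapFunction<String, String> {
	private static final Logger LOG = LoggerFactory.getLogger(MyMapper.class);

	@Override
	public String map(String value) {
		// slf4j parameterized messages avoid string concatenation when the log level is disabled
		LOG.debug("Mapping value {}", value);
		return value;
	}
}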

Use Logback when running Flink out of the IDE / from a Java application

In all cases where classes are executed with a classpath created by a dependency manager such as Maven, Flink will pull log4j into the classpath.

Therefore, you will need to exclude log4j from Flink’s dependencies. The following description will assume a Maven project created from a Flink quickstart.

Change your project’s pom.xml file like this:

<dependencies>
	<!-- Add the two required logback dependencies -->
	<dependency>
		<groupId>ch.qos.logback</groupId>
		<artifactId>logback-core</artifactId>
		<version>1.1.3</version>
	</dependency>
	<dependency>
		<groupId>ch.qos.logback</groupId>
		<artifactId>logback-classic</artifactId>
		<version>1.1.3</version>
	</dependency>

	<!-- Add the log4j -> slf4j (-> logback) bridge into the classpath.
	 Hadoop is logging to log4j! -->
	<dependency>
		<groupId>org.slf4j</groupId>
		<artifactId>log4j-over-slf4j</artifactId>
		<version>1.7.7</version>
	</dependency>

	<dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-java</artifactId>
		<version>1.0-SNAPSHOT</version>
		<exclusions>
			<exclusion>
				<groupId>log4j</groupId>
				<artifactId>*</artifactId>
			</exclusion>
			<exclusion>
				<groupId>org.slf4j</groupId>
				<artifactId>slf4j-log4j12</artifactId>
			</exclusion>
		</exclusions>
	</dependency>
	<dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-streaming-java</artifactId>
		<version>1.0-SNAPSHOT</version>
		<exclusions>
			<exclusion>
				<groupId>log4j</groupId>
				<artifactId>*</artifactId>
			</exclusion>
			<exclusion>
				<groupId>org.slf4j</groupId>
				<artifactId>slf4j-log4j12</artifactId>
			</exclusion>
		</exclusions>
	</dependency>
	<dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-clients</artifactId>
		<version>1.0-SNAPSHOT</version>
		<exclusions>
			<exclusion>
				<groupId>log4j</groupId>
				<artifactId>*</artifactId>
			</exclusion>
			<exclusion>
				<groupId>org.slf4j</groupId>
				<artifactId>slf4j-log4j12</artifactId>
			</exclusion>
		</exclusions>
	</dependency>
</dependencies>

The following changes were done in the <dependencies> section:

  • Exclude all log4j dependencies from all Flink dependencies: This causes Maven to ignore Flink’s transitive dependencies to log4j.
  • Exclude the slf4j-log4j12 artifact from Flink’s dependencies: Since we are going to use the slf4j to logback binding, we have to remove the slf4j to log4j binding.
  • Add the Logback dependencies: logback-core and logback-classic
  • Add the log4j-over-slf4j dependency: log4j-over-slf4j is a tool which allows legacy applications that use the Log4j APIs directly to use the Slf4j interface. Flink depends on Hadoop, which uses Log4j directly for logging. Therefore, we need to redirect all logger calls from Log4j to Slf4j, which in turn logs to Logback.

Please note that you need to manually add the exclusions to all new Flink dependencies you are adding to the pom file.

You may also need to check if other dependencies (non-Flink) are pulling in log4j bindings. You can analyze the dependencies of your project with mvn dependency:tree.

Use Logback when running Flink on a cluster

This tutorial is applicable when running Flink on YARN or as a standalone cluster.

In order to use Logback instead of Log4j with Flink, you need to remove the log4j-1.2.xx.jar and slf4j-log4j12-xxx.jar from the lib/ directory.

Next, you need to put the following jar files into the lib/ folder:

  • logback-classic.jar
  • logback-core.jar
  • log4j-over-slf4j.jar: This bridge needs to be present in the classpath for redirecting logging calls from Hadoop (which is using Log4j) to Slf4j.
