DataSet API Programming

Flink's DataSet API is used to implement transformations on data sets, such as filtering, mapping, joining, and grouping. Data sets can be created from files, Java collections, and other sources, and results are written out through sinks (for example, distributed files or standard output). Flink provides a number of built-in InputFormats for reading common file formats. Transformations include Map, FlatMap, Filter, Reduce, Join, and more, all of which support user-defined functions. Data sets can be partitioned via hash partitioning, range partitioning, and so on, and also support locally sorting partitions and selecting the first n elements. Finally, data is written out through an OutputFormat, such as a local file or a custom format.

DataSet API Programming

DataSet API Development Overview

DataSet programs in Flink are regular programs that implement transformations on data sets (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources (e.g., by reading files, or from local collections). Results are returned via sinks, which may for example write the data to (distributed) files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines.

Source: the origin of the data, e.g., reading files or local collections

Source ==> Flink (transformations) ==> Sink

Sink: the destination of the results, e.g., (distributed) files or standard output
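
As a rough end-to-end illustration of this Source ==> Transformation ==> Sink flow, here is a minimal Java sketch (the data values are made up for the example):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SkeletonJob {

  public static void main(String[] args) throws Exception {
    // obtain the execution environment (local JVM or cluster)
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Source: create the initial DataSet, here from an in-memory collection
    DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);

    // Transformation: keep only the even numbers
    DataSet<Integer> evens = numbers.filter(n -> n % 2 == 0);

    // Sink: write the result to standard output; print() also triggers execution
    evens.print();
  }
}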

DataSource

Data sources create the initial data sets, such as from files or from Java collections. The general mechanism of creating data sets is abstracted behind an InputFormat. Flink comes with several built-in formats to create data sets from common file formats. Many of them have shortcut methods on the ExecutionEnvironment.

  • File-based (a combined usage sketch follows the compression table at the end of this list):

    • readTextFile(path) / TextInputFormat - Reads files line wise and returns them as Strings.
    • readTextFileWithValue(path) / TextValueInputFormat - Reads files line wise and returns them as StringValues. StringValues are mutable strings.
    • readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic java types and their Value counterparts as field types.
    • readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.
    • readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.
  • Collection-based:

    • fromCollection(Collection) - Creates a data set from a java.util.Collection. All elements in the collection must be of the same type.
    • fromCollection(Iterator, Class) - Creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.
    • fromElements(T ...) - Creates a data set from the given sequence of objects. All objects must be of the same type.
    • fromParallelCollection(SplittableIterator, Class) - Creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
    • generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.
  • Configuring CSV Parsing:

    Flink offers a number of configuration options for CSV parsing:

    • types(Class ... types) specifies the types of the fields to parse. It is mandatory to configure the types of the parsed fields. In case of the type class Boolean.class, “True” (case-insensitive), “False” (case-insensitive), “1” and “0” are treated as booleans.
    • lineDelimiter(String del) specifies the delimiter of individual records. The default line delimiter is the new-line character '\n'.
    • fieldDelimiter(String del) specifies the delimiter that separates fields of a record. The default field delimiter is the comma character ','.
    • includeFields(boolean ... flag), includeFields(String mask), or includeFields(long bitMask) defines which fields to read from the input file (and which to ignore). By default the first n fields (as defined by the number of types in the types() call) are parsed.
    • parseQuotedStrings(char quoteChar) enables quoted string parsing. Strings are parsed as quoted strings if the first character of the string field is the quote character (leading or trailing whitespace is not trimmed). Field delimiters within quoted strings are ignored. Quoted string parsing fails if the last character of a quoted string field is not the quote character or if the quote character appears at some point which is not the start or the end of the quoted string field (unless the quote character is escaped using '\'). If quoted string parsing is enabled and the first character of the field is not the quote character, the string is parsed as an unquoted string. By default, quoted string parsing is disabled.
    • ignoreComments(String commentPrefix) specifies a comment prefix. All lines that start with the specified comment prefix are not parsed and ignored. By default, no lines are ignored.
    • ignoreInvalidLines() enables lenient parsing, i.e., lines that cannot be correctly parsed are ignored. By default, lenient parsing is disabled and invalid lines raise an exception.
    • ignoreFirstLine() configures the InputFormat to ignore the first line of the input file. By default no line is ignored.
  • Recursive Traversal of the Input Path Directory (creating a DataSet from nested directories):

    For file-based inputs, when the input path is a directory, nested files are not enumerated by default. Instead, only the files inside the base directory are read, while nested files are ignored. Recursive enumeration of nested files can be enabled through the recursive.file.enumeration configuration parameter, like in the following example.

    // enable recursive enumeration of nested input files
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    
    // create a configuration object
    Configuration parameters = new Configuration();
    
    // set the recursive enumeration parameter
    parameters.setBoolean("recursive.file.enumeration", true);
    
    // pass the configuration to the data source
    DataSet<String> logs = env.readTextFile("file:///path/with.nested/files")
    			  .withParameters(parameters);
    
  • Read Compressed Files:

    Flink currently supports transparent decompression of input files if they are marked with an appropriate file extension. In particular, this means that no further configuration of the input formats is necessary, and any FileInputFormat, including custom input formats, supports compression. Please note that compressed files might not be read in parallel, which impacts job scalability.

    The following table lists the currently supported compression methods.

Compression method   File extensions   Parallelizable
DEFLATE              .deflate          no
GZip                 .gz, .gzip        no
Bzip2                .bz2              no
XZ                   .xz               no
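
To make the source options above concrete, here is a small Java sketch that combines a few of them; the file paths and the CSV layout (id;name;score with a header line) are hypothetical and only for illustration:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

import java.util.Arrays;

public class DataSourceExamples {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // File-based: read a text file line by line (the path is hypothetical)
    DataSet<String> lines = env.readTextFile("file:///tmp/input.txt");

    // Collection-based: create a DataSet from an in-memory Java collection
    DataSet<Integer> numbers = env.fromCollection(Arrays.asList(1, 2, 3));

    // Generate the numbers 1..100 in parallel
    DataSet<Long> sequence = env.generateSequence(1, 100);

    // CSV with parsing options; the layout "id;name;score" plus a header line is assumed
    DataSet<Tuple3<Integer, String, Double>> csv = env
        .readCsvFile("file:///tmp/scores.csv")
        .fieldDelimiter(";")     // fields separated by ';' instead of the default ','
        .ignoreFirstLine()       // skip the header line
        .ignoreInvalidLines()    // lenient parsing: drop lines that cannot be parsed
        .types(Integer.class, String.class, Double.class);

    // print() writes to standard output and triggers execution;
    // lines, numbers and sequence are defined above for illustration only
    csv.print();
  }
}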

Transformations (Operators)

Data transformations transform one or more DataSets into a new DataSet. Programs can combine multiple transformations into sophisticated assemblies.

This section gives a brief overview of the available transformations. The transformations documentation has a full description of all transformations with examples.

Map

The Map transformation applies a user-defined map function on each element of a DataSet. It implements a one-to-one mapping, that is, exactly one element must be returned by the function.

In other words, it is a one-to-one mapping: for every input element the function returns exactly one element, much like y = f(x).

The following code transforms a DataSet of Integer pairs into a DataSet of Integers:

// MapFunction that adds two integer values
public class IntAdder implements MapFunction<Tuple2<Integer, Integer>, Integer> {

  @Override
  public Integer map(Tuple2<Integer, Integer> in) {
    return in.f0 + in.f1;
  }
}

// [...]
DataSet<Tuple2<Integer, Integer>> intPairs = // [...]
DataSet<Integer> intSums = intPairs.map(new IntAdder());

The same transformation in Scala:

val intPairs: DataSet[(Int, Int)] = // [...]
val intSums = intPairs.map { pair => pair._1 + pair._2 }

FlatMap

The FlatMap transformation applies a user-defined flat-map function on each element of a DataSet. This variant of a map function can return arbitrary many result elements (including none) for each input element.

The following code transforms a DataSet of text lines into a DataSet of words:

// FlatMapFunction that tokenizes a String by whitespace characters and emits all String tokens.
public class Tokenizer implements FlatMapFunction<String, String> {

  @Override
  public void flatMap(String value, Collector<String> out) {
    for (String token : value.split("\\W")) {
      out.collect(token);
    }
  }
}

// [...]
DataSet<String> textLines = // [...]
DataSet<String> words = textLines.flatMap(new Tokenizer());

The same transformation in Scala:

val textLines: DataSet[String] = // [...]
val words = textLines.flatMap { _.split(" ") }

MapPartition

MapPartition transforms a parallel partition in a single function call. The map-partition function gets the partition as Iterable and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree-of-parallelism and previous operations.

The following code transforms a DataSet of text lines into a DataSet of counts per partition:

public class PartitionCounter implements MapPartitionFunction<String, Long> {

  @Override
  public void mapPartition(Iterable<String> values, Collector<Long> out) {
    long c = 0;
    for (String s : values) {
      c++;
    }
    out.collect(c);
  }
}

// [...]
DataSet<String> textLines = // [...]
DataSet<Long> counts = textLines.mapPartition(new PartitionCounter());

The same transformation in Scala:

val textLines: DataSet[String] = // [...]
// Some is required because the return value must be a Collection.
// There is an implicit conversion from Option to a Collection.
val counts = textLines.mapPartition { in => Some(in.size) }

Filter

The Filter transformation applies a user-defined filter function on each element of a DataSet and retains only those elements for which the function returns true.

The following code removes all Integers smaller than zero from a DataSet:

// FilterFunction that filters out all Integers smaller than zero.
public class NaturalNumberFilter implements FilterFunction<Integer> {

  @Override
  public boolean filter(Integer number) {
    return number >= 0;
  }
}

// [...]
DataSet<Integer> intNumbers = // [...]
DataSet<Integer> naturalNumbers = intNumbers.filter(new NaturalNumberFilter());

The same transformation in Scala:

val intNumbers: DataSet[Int] = // [...]
val naturalNumbers = intNumbers.filter { _ >= 0 }

IMPORTANT: The system assumes that the function does not modify the elements on which the predicate is applied. Violating this assumption can lead to incorrect results.

Projection of Tuple DataSet

The Project transformation removes or moves Tuple fields of a Tuple DataSet. The project(int...) method selects Tuple fields that should be retained by their index and defines their order in the output Tuple.

Projections do not require the definition of a user function.

The Project transformation is not supported in the Scala DataSet API; in the Java API it is applied with the project(int...) method, as in the sketch below.
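
A minimal Java sketch of project(int...) on a Tuple3 DataSet (the tuple contents are invented for illustration):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class ProjectionExample {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // example Tuple3 DataSet: (id, score, name) -- values invented for illustration
    DataSet<Tuple3<Integer, Double, String>> in = env.fromElements(
        Tuple3.of(1, 3.5, "alice"),
        Tuple3.of(2, 4.0, "bob"));

    // keep only fields 2 and 0, in that order: (name, id)
    DataSet<Tuple2<String, Integer>> out = in.project(2, 0);

    out.print();
  }
}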

Transformations on Grouped DataSet

The reduce operations can operate on grouped data sets. Specifying the key to be used for grouping can be done in many ways:

  • key expressions
  • a key-selector function
  • one or more field position keys (Tuple DataSet only)
  • Case Class fields (Case Classes only)

Please look at the reduce examples to see how the grouping keys are specified.

Reduce on Grouped DataSet

A Reduce transformation that is applied on a grouped DataSet reduces each group to a single element using a user-defined reduce function. For each group of input elements, a reduce function successively combines pairs of elements into one element until only a single element for each group remains.

Note that for a ReduceFunction the keyed fields of the returned object should match the input values. This is because reduce is implicitly combinable and objects emitted from the combine operator are again grouped by key when passed to the reduce operator.
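
As a Java counterpart to the Scala examples below, here is a small sketch of a grouped reduce on (word, count) tuples; the data is made up for illustration:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class GroupedReduceExample {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // (word, count) pairs; the data is invented for illustration
    DataSet<Tuple2<String, Integer>> wordCounts = env.fromElements(
        Tuple2.of("flink", 1), Tuple2.of("spark", 1), Tuple2.of("flink", 2));

    // group by field position 0 (the word) and sum the counts per group
    DataSet<Tuple2<String, Integer>> summed = wordCounts
        .groupBy(0)
        .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
          @Override
          public Tuple2<String, Integer> reduce(Tuple2<String, Integer> a,
                                                Tuple2<String, Integer> b) {
            // keep the key field unchanged, as required for a combinable reduce
            return Tuple2.of(a.f0, a.f1 + b.f1);
          }
        });

    summed.print();
  }
}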

Reduce on DataSet Grouped by Key Expression

Key expressions specify one or more fields of each element of a DataSet. Each key expression is either the name of a public field or a getter method. A dot can be used to drill down into objects. The key expression “*” selects all fields. The following code shows how to group a POJO DataSet using key expressions and to reduce it with a reduce function.

// some ordinary POJO
class WC(val word: String, val count: Int) {
  def this() {
    this(null, -1)
  }
  // [...]
}

val words: DataSet[WC] = // [...]
val wordCounts = words.groupBy("word").reduce {
  (w1, w2) => new WC(w1.word, w1.count + w2.count)
}

Reduce on DataSet Grouped by KeySelector Function

A key-selector function extracts a key value from each element of a DataSet. The extracted key value is used to group the DataSet. The following code shows how to group a POJO DataSet using a key-selector function and to reduce it with a reduce function.

// some ordinary POJO
class WC(val word: String, val count: Int) {
  def this() {
    this(null, -1)
  }
  // [...]
}

val words: DataSet[WC] = // [...]
val wordCounts = words.groupBy { _.word }.reduce {
  (w1, w2) => new WC(w1.word, w1.count + w2.count)
}

Reduce on DataSet Grouped by Field Position Keys (Tuple DataSets only)

Field position keys specify one or more fields of a Tuple DataSet that are used as grouping keys. The following code shows how to use field position keys and apply a reduce function:

val tuples: DataSet[(String, Int, Double)] = // [...]
// group on the first and second Tuple field
val reducedTuples = tuples.groupBy(0, 1).reduce { ... }

Reduce on DataSet grouped by Case Class Fields

When using Case Classes you can also specify the grouping key using the names of the fields:

case class MyClass(a: String, b: Int, c: Double)

val tuples: DataSet[MyClass] = // [...]
// group on the first and second field
val reducedTuples = tuples.groupBy("a", "b").reduce { ... }

GroupReduce on Grouped DataSet

A GroupReduce transformation that is applied on a grouped DataSet calls a user-defined group-reduce function for each group. The difference between this and Reduce is that the user defined function gets the whole group at once. The function is invoked with an Iterable over all elements of a group and can return an arbitrary number of result elements.

GroupReduce on DataSet Grouped by Field Position Keys (Tuple DataSets only)

The following code shows how duplicate strings can be removed from a DataSet grouped by Integer.

val input: DataSet[(Int, String)] = // [...]
val output = input.groupBy(0).reduceGroup {
  (in, out: Collector[(Int, String)]) =>
    in.toSet foreach (out.collect)
}

GroupReduce on DataSet Grouped by Key Expression, KeySelector Function, or Case Class Fields

Work analogous to key expressions, key-selector functions, and case class fields in Reduce transformations.
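
For reference, here is a small Java sketch of a GroupReduce grouped by a key expression ("f0", the first tuple field); the data is made up for illustration:

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class GroupReduceByKeyExpression {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // (word, count) pairs; the data is invented for illustration
    DataSet<Tuple2<String, Integer>> wordCounts = env.fromElements(
        Tuple2.of("flink", 1), Tuple2.of("flink", 2), Tuple2.of("spark", 3));

    // group by the key expression "f0" (the word field) and emit one summed tuple per group
    DataSet<Tuple2<String, Integer>> summed = wordCounts
        .groupBy("f0")
        .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
          @Override
          public void reduce(Iterable<Tuple2<String, Integer>> values,
                             Collector<Tuple2<String, Integer>> out) {
            String word = null;
            int sum = 0;
            for (Tuple2<String, Integer> t : values) {
              word = t.f0;
              sum += t.f1;
            }
            out.collect(Tuple2.of(word, sum));
          }
        });

    summed.print();
  }
}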

GroupReduce on sorted groups

A group-reduce function accesses the elements of a group using an Iterable. Optionally, the Iterable can hand out the elements of a group in a specified order. In many cases this can help to reduce the complexity of a user-defined group-reduce function and improve its efficiency.

The following code shows another example how to remove duplicate Strings in a DataSet grouped by an Integer and sorted by String.

val input: DataSet[(Int, String)] = // [...]
val output = input.groupBy(0).sortGroup(1, Order.ASCENDING).reduceGroup {
  (in, out: Collector[(Int, String)]) =>
    var prev: (Int, String) = null
    for (t <- in) {
      if (prev == null || prev != t) {
        out.collect(t)
      }
      prev = t
    }
}