Flink Application Development with the Table API & SQL (with Source Code Analysis), Part 1

Dependency Overview

Starting with Flink 1.9, Flink provides two Table Planner implementations for executing Table API and SQL programs: the Blink Planner and the Old Planner (the latter existed before 1.9). A planner's job is to translate relational operations into an executable, optimized job. The two planners differ in their optimization rules and runtime classes, and the set of features they support also differs.

  • For production use, the Blink Planner, which became the default in Flink 1.11, is recommended.
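
For reference, here is a minimal sketch (the class name PlannerSelection is illustrative) of how a program selects a planner explicitly through EnvironmentSettings:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PlannerSelection {
    public static void main(String[] args) {
        // Blink planner, the default since 1.11; .useOldPlanner() selects the legacy planner
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inStreamingMode()   // or .inBatchMode()
                .build();
        // all Table API / SQL programs start from this environment
        TableEnvironment tEnv = TableEnvironment.create(settings);
    }
}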

All Table API and SQL code lives in the flink-table or flink-table-blink Maven artifacts:

  • flink-table-common: the common module, needed e.g. for user-defined functions and formats.
  • flink-table-api-java: the Table & SQL API for pure table programs, written in Java (still in early development, not recommended).
  • flink-table-api-scala: the Table & SQL API for pure table programs, written in Scala (still in early development, not recommended).
  • flink-table-api-java-bridge: the Table & SQL API with DataStream/DataSet API support, for Java.
  • flink-table-api-scala-bridge: the Table & SQL API with DataStream/DataSet API support, for Scala.
  • flink-table-planner: the table planner and runtime. This was Flink's only planner before 1.9; it is no longer recommended since 1.11.
  • flink-table-planner-blink: the new Blink Planner, the default since 1.11.
  • flink-table-runtime-blink: the new Blink runtime.
  • flink-table-uber: bundles the modules above plus the Old Planner into a distribution that works for most Table & SQL API use cases. The uber jar flink-table-*.jar is placed in the /lib directory of a Flink distribution by default.
  • flink-table-uber-blink: bundles the modules above plus the Blink Planner into a distribution that works for most Table & SQL API use cases. The uber jar flink-table-blink-*.jar is placed in the /lib directory of a Flink distribution by default.

Dependencies for Table Programs

<!-- Either... -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-api-java-bridge_2.11</artifactId>
  <version>1.12.0</version>
  <scope>provided</scope>
</dependency>
<!-- or... -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
  <version>1.12.0</version>
  <scope>provided</scope>
</dependency>

If you want to run programs locally within your IDE, you need to add one of the following modules, depending on which planner you use:

<!-- Either... (for the old planner that was available before Flink 1.9) -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner_2.11</artifactId>
  <version>1.12.0</version>
  <scope>provided</scope>
</dependency>
<!-- or.. (for the new Blink planner) -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-planner-blink_2.11</artifactId>
  <version>1.12.0</version>
  <scope>provided</scope>
</dependency>

Internally, parts of the table functionality are implemented in Scala, so the following dependency must also be added to your program:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_2.11</artifactId>
  <version>1.12.0</version>
  <scope>provided</scope>
</dependency>

Extension Dependencies

If you want to implement a custom format for parsing Kafka data, or a user-defined function, the following dependency is sufficient; the compiled jar can be used directly by the SQL Client:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table-common</artifactId>
  <version>1.12.0</version>
  <scope>provided</scope>
</dependency>

Currently, this module provides the following extension points:

  • SerializationSchemaFactory
  • DeserializationSchemaFactory
  • ScalarFunction
  • TableFunction
  • AggregateFunction

  • SerializationSchemaFactory
/**
 * Factory for creating configured instances of {@link SerializationSchema}.
 *
 * @param <T> record type that the format produces or consumes.
 */
@PublicEvolving
public interface SerializationSchemaFactory<T> extends TableFormatFactory<T> {

    /**
     * Creates and configures a {@link SerializationSchema} using the given properties.
     *
     * @param properties normalized properties describing the format
     * @return the configured serialization schema or null if the factory cannot provide an
     *     instance of this class
     */
    SerializationSchema<T> createSerializationSchema(Map<String, String> properties);
}
  • DeserializationSchemaFactory
/**
 * Factory for creating configured instances of {@link DeserializationSchema}.
 *
 * @param <T> record type that the format produces or consumes.
 */
@PublicEvolving
public interface DeserializationSchemaFactory<T> extends TableFormatFactory<T> {

    /**
     * Creates and configures a {@link DeserializationSchema} using the given properties.
     *
     * @param properties normalized properties describing the format
     * @return the configured deserialization schema or null if the factory cannot provide an
     *     instance of this class
     */
    DeserializationSchema<T> createDeserializationSchema(Map<String, String> properties);
}
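
Both factory interfaces extend TableFormatFactory, so a format typically implements them together in one class. Below is a minimal sketch of such a factory; the "my-format" identifier and the format.charset option are invented for illustration, and both schemas simply delegate to Flink's SimpleStringSchema:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.table.factories.DeserializationSchemaFactory;
import org.apache.flink.table.factories.SerializationSchemaFactory;

public class MyFormatFactory
        implements SerializationSchemaFactory<String>, DeserializationSchemaFactory<String> {

    @Override
    public Map<String, String> requiredContext() {
        // properties that must match for this factory to be selected
        Map<String, String> context = new HashMap<>();
        context.put("format.type", "my-format");     // hypothetical format identifier
        context.put("format.property-version", "1");
        return context;
    }

    @Override
    public boolean supportsSchemaDerivation() {
        return false;
    }

    @Override
    public List<String> supportedProperties() {
        // all format options this factory understands
        List<String> properties = new ArrayList<>();
        properties.add("format.charset");            // hypothetical format option
        return properties;
    }

    @Override
    public SerializationSchema<String> createSerializationSchema(Map<String, String> properties) {
        return new SimpleStringSchema();             // a real format would parse the properties
    }

    @Override
    public DeserializationSchema<String> createDeserializationSchema(Map<String, String> properties) {
        return new SimpleStringSchema();
    }
}

To make the factory discoverable, its fully qualified class name must also be listed in a META-INF/services/org.apache.flink.table.factories.TableFactory file inside the jar.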
  • ScalarFunction
/**
 * Base class for a user-defined scalar function. A user-defined scalar function maps zero, one,
 * or multiple scalar values to a new scalar value.
 *
 * <p>The behavior of a {@link ScalarFunction} can be defined by implementing a custom evaluation
 * method. An evaluation method must be declared publicly, not static, and named <code>eval</code>.
 * Evaluation methods can also be overloaded by implementing multiple methods named
 * <code>eval</code>.
 *
 * <p>By default, input and output data types are automatically extracted using reflection. If the
 * reflective information is not sufficient, it can be supported and enriched with {@link
 * DataTypeHint} and {@link FunctionHint} annotations.
 *
 * <p>The following examples show how to specify a scalar function:
 *
 * <pre>{@code
 * // a function that accepts two INT arguments and computes a sum
 * class SumFunction extends ScalarFunction {
 *   public Integer eval(Integer a, Integer b) {
 *     return a + b;
 *   }
 * }
 *
 * // a function that accepts either INT NOT NULL or BOOLEAN NOT NULL and computes a STRING
 * class StringifyFunction extends ScalarFunction {
 *   public String eval(int i) {
 *     return String.valueOf(i);
 *   }
 *   public String eval(boolean b) {
 *     return String.valueOf(b);
 *   }
 * }
 *
 * // a function that accepts either INT or BOOLEAN and computes a STRING using function hints
 * @FunctionHint(input = [@DataTypeHint("INT")])
 * @FunctionHint(input = [@DataTypeHint("BOOLEAN")])
 * class StringifyFunction extends ScalarFunction {
 *   public String eval(Object o) {
 *     return o.toString();
 *   }
 * }
 *
 * // a function that accepts any data type as argument and computes a STRING
 * class StringifyFunction extends ScalarFunction {
 *   public String eval(@DataTypeHint(inputGroup = InputGroup.ANY) Object o) {
 *     return o.toString();
 *   }
 * }
 *
 * // a function that accepts an arbitrary number of BIGINT values and computes a DECIMAL(10, 4)
 * class SumFunction extends ScalarFunction {
 *   public @DataTypeHint("DECIMAL(10, 4)") BigDecimal eval(Long... values) {
 *     // ...
 *   }
 * }
 * }</pre>
 *
 * <p>For storing a user-defined function in a catalog, the class must have a default constructor
 * and must be instantiable during runtime.
 */
@PublicEvolving
public abstract class ScalarFunction extends UserDefinedFunction {

    /**
     * Returns the result type of the evaluation method with a given signature.
     *
     * @deprecated This method uses the old type system and is based on the old reflective
     *     extraction logic. The method will be removed in future versions and is only called when
     *     using the deprecated {@code TableEnvironment.registerFunction(...)} method. The new
     *     reflective extraction logic (possibly enriched with {@link DataTypeHint} and {@link
     *     FunctionHint}) should be powerful enough to cover most use cases. For advanced users, it
     *     is possible to override {@link UserDefinedFunction#getTypeInference(DataTypeFactory)}.
     */
    @Deprecated
    public TypeInformation<?> getResultType(Class<?>[] signature) {
        return null;
    }

    /**
     * Returns {@link TypeInformation} about the operands of the evaluation method with a given
     * signature.
     *
     * @deprecated This method uses the old type system and is based on the old reflective
     *     extraction logic. The method will be removed in future versions and is only called when
     *     using the deprecated {@code TableEnvironment.registerFunction(...)} method. The new
     *     reflective extraction logic (possibly enriched with {@link DataTypeHint} and {@link
     *     FunctionHint}) should be powerful enough to cover most use cases. For advanced users, it
     *     is possible to override {@link UserDefinedFunction#getTypeInference(DataTypeFactory)}.
     */
    @Deprecated
    public TypeInformation<?>[] getParameterTypes(Class<?>[] signature) {
        final TypeInformation<?>[] types = new TypeInformation<?>[signature.length];
        for (int i = 0; i < signature.length; i++) {
            try {
                types[i] = TypeExtractor.getForClass(signature[i]);
            } catch (InvalidTypesException e) {
                throw new ValidationException(
                        "Parameter types of scalar function "
                                + this.getClass().getCanonicalName()
                                + " cannot be automatically determined. Please provide type information manually.");
            }
        }
        return types;
    }

    @Override
    public final FunctionKind getKind() {
        return FunctionKind.SCALAR;
    }

    @Override
    public TypeInference getTypeInference(DataTypeFactory typeFactory) {
        return TypeInferenceExtractor.forScalarFunction(typeFactory, getClass());
    }
}
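
To tie this together, here is a minimal runnable sketch (class names and data are illustrative) that calls the SumFunction from the Javadoc above, both inline in the Table API and as a registered function in SQL:

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.call;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class ScalarFunctionExample {

    /** The SumFunction from the Javadoc above. */
    public static class SumFunction extends ScalarFunction {
        public Integer eval(Integer a, Integer b) {
            return a + b;
        }
    }

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        Table t = tEnv.fromValues(1, 2, 3).as("x");
        tEnv.createTemporaryView("MyTable", t);

        // call the function inline in the Table API, without prior registration
        t.select(call(SumFunction.class, $("x"), $("x"))).execute().print();

        // or register it under a name and use it from SQL
        tEnv.createTemporarySystemFunction("MySum", SumFunction.class);
        tEnv.executeSql("SELECT MySum(x, x) FROM MyTable").print();
    }
}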
  • TableFunction
/**
 * Base class for a user-defined table function. A user-defined table function maps zero, one, or
 * multiple scalar values to zero, one, or multiple rows (or structured types). If an output record
 * consists of only one field, the structured record can be omitted, and a scalar value can be
 * emitted that will be implicitly wrapped into a row by the runtime.
 *
 * <p>The behavior of a {@link TableFunction} can be defined by implementing a custom evaluation
 * method. An evaluation method must be declared publicly, not static, and named <code>eval</code>.
 * Evaluation methods can also be overloaded by implementing multiple methods named
 * <code>eval</code>.
 *
 * <p>By default, input and output data types are automatically extracted using reflection. This
 * includes the generic argument {@code T} of the class for determining an output data type. Input
 * arguments are derived from one or more {@code eval()} methods. If the reflective information is
 * not sufficient, it can be supported and enriched with {@link DataTypeHint} and {@link
 * FunctionHint} annotations.
 *
 * <p>The following examples show how to specify a table function:
 *
 * <pre>{@code
 * // a function that accepts an arbitrary number of INT arguments and emits them as implicit ROW<INT>
 * class FlattenFunction extends TableFunction<Integer> {
 *   public void eval(Integer... args) {
 *     for (Integer i : args) {
 *       collect(i);
 *     }
 *   }
 * }
 *
 * // a function that accepts either INT or STRING and emits them as implicit ROW<STRING>
 * class DuplicatorFunction extends TableFunction<String> {
 *   public void eval(Integer i) {
 *     eval(String.valueOf(i));
 *   }
 *   public void eval(String s) {
 *     collect(s);
 *     collect(s);
 *   }
 * }
 *
 * // a function that produces a ROW<i INT, s STRING> from arguments; the function hint helps in
 * // declaring the row's fields
 * @FunctionHint(output = @DataTypeHint("ROW< i INT, s STRING >"))
 * class DuplicatorFunction extends TableFunction<Row> {
 *   public void eval(Integer i, String s) {
 *     collect(Row.of(i, s));
 *     collect(Row.of(i, s));
 *   }
 * }
 *
 * // a function that accepts either INT or DECIMAL(10, 4) and emits them as implicit ROW<INT> or
 * // ROW<DECIMAL(10, 4)> using function hints that declare the output type
 * class DuplicatorFunction extends TableFunction<Object> {
 *   @FunctionHint(output = @DataTypeHint("INT"))
 *   public void eval(Integer i) {
 *     collect(i);
 *     collect(i);
 *   }
 *   @FunctionHint(output = @DataTypeHint("DECIMAL(10, 4)"))
 *   public void eval(@DataTypeHint("DECIMAL(10, 4)") BigDecimal d) {
 *     collect(d);
 *     collect(d);
 *   }
 * }
 * }</pre>
 *
 * <p>For storing a user-defined function in a catalog, the class must have a default constructor
 * and must be instantiable during runtime.
 *
 * <p>In the API, a table function can be used as follows:
 *
 * <pre>{@code
 * public class Split extends TableFunction<String> {
 *
 *   // implement an "eval" method with as many parameters as you want
 *   public void eval(String str) {
 *     for (String s : str.split(" ")) {
 *       collect(s);   // use collect(...) to emit an output row
 *     }
 *   }
 *
 *   // you can overload the eval method here ...
 * }
 *
 * TableEnvironment tEnv = ...
 * Table table = ...    // schema: ROW< a VARCHAR >
 *
 * // for Scala users
 * table.joinLateral(call(classOf[Split], $"a") as ("s")).select($"a", $"s")
 *
 * // for Java users
 * table.joinLateral(call(Split.class, $("a")).as("s")).select($("a"), $("s"));
 *
 * // for SQL users
 * tEnv.createTemporarySystemFunction("split", Split.class); // register table function first
 * tEnv.sqlQuery("SELECT a, s FROM MyTable, LATERAL TABLE(split(a)) as T(s)");
 * }</pre>
 *
 * @param <T> The type of the output row. Either an explicit composite type or an atomic type that
 *     is implicitly wrapped into a row consisting of one field.
 */
@PublicEvolving
public abstract class TableFunction<T> extends UserDefinedFunction {

    /** The code generated collector used to emit rows. */
    private Collector<T> collector;

    /** Internal use. Sets the current collector. */
    public final void setCollector(Collector<T> collector) {
        this.collector = collector;
    }

    /**
     * Returns the result type of the evaluation method.
     *
     * @deprecated This method uses the old type system and is based on the old reflective
     *     extraction logic. The method will be removed in future versions and is only called when
     *     using the deprecated {@code TableEnvironment.registerFunction(...)} method. The new
     *     reflective extraction logic (possibly enriched with {@link DataTypeHint} and {@link
     *     FunctionHint}) should be powerful enough to cover most use cases. For advanced users, it
     *     is possible to override {@link UserDefinedFunction#getTypeInference(DataTypeFactory)}.
     */
    @Deprecated
    public TypeInformation<T> getResultType() {
        return null;
    }

    /**
     * Returns {@link TypeInformation} about the operands of the evaluation method with a given
     * signature.
     *
     * @deprecated This method uses the old type system and is based on the old reflective
     *     extraction logic. The method will be removed in future versions and is only called when
     *     using the deprecated {@code TableEnvironment.registerFunction(...)} method. The new
     *     reflective extraction logic (possibly enriched with {@link DataTypeHint} and {@link
     *     FunctionHint}) should be powerful enough to cover most use cases. For advanced users, it
     *     is possible to override {@link UserDefinedFunction#getTypeInference(DataTypeFactory)}.
     */
    @Deprecated
    public TypeInformation<?>[] getParameterTypes(Class<?>[] signature) {
        final TypeInformation<?>[] types = new TypeInformation<?>[signature.length];
        for (int i = 0; i < signature.length; i++) {
            try {
                types[i] = TypeExtractor.getForClass(signature[i]);
            } catch (InvalidTypesException e) {
                throw new ValidationException(
                        "Parameter types of table function "
                                + this.getClass().getCanonicalName()
                                + " cannot be automatically determined. Please provide type information manually.");
            }
        }
        return types;
    }

    /**
     * Emits an (implicit or explicit) output row.
     *
     * <p>If null is emitted as an explicit row, it will be skipped by the runtime. For implicit
     * rows, the row's field will be null.
     *
     * @param row the output row
     */
    protected final void collect(T row) {
        collector.collect(row);
    }

    @Override
    public final FunctionKind getKind() {
        return FunctionKind.TABLE;
    }

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public TypeInference getTypeInference(DataTypeFactory typeFactory) {
        return TypeInferenceExtractor.forTableFunction(typeFactory, (Class) getClass());
    }
}
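
As a runnable counterpart to the API fragment in the Javadoc, the following sketch (input rows are made up for illustration) exercises the Split function from both the Table API and SQL:

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.call;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.TableFunction;

public class TableFunctionExample {

    /** The Split function from the Javadoc above. */
    public static class Split extends TableFunction<String> {
        public void eval(String str) {
            for (String s : str.split(" ")) {
                collect(s);   // emit one row per word
            }
        }
    }

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        Table table = tEnv.fromValues("hello world", "flink sql").as("a");
        tEnv.createTemporaryView("MyTable", table);

        // Table API: join each input row with the rows produced by Split
        table.joinLateral(call(Split.class, $("a")).as("s"))
                .select($("a"), $("s"))
                .execute()
                .print();

        // SQL: register the function first, then use LATERAL TABLE
        tEnv.createTemporarySystemFunction("split", Split.class);
        tEnv.executeSql("SELECT a, s FROM MyTable, LATERAL TABLE(split(a)) AS T(s)").print();
    }
}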

  • AggregateFunction
/**
 * Base class for a user-defined aggregate function. A user-defined aggregate function maps scalar
 * values of multiple rows to a new scalar value.
 *
 * <p>The behavior of an {@link AggregateFunction} is centered around the concept of an accumulator.
 * The accumulator is an intermediate data structure that stores the aggregated values until a final
 * aggregation result is computed.
 *
 * <p>For each set of rows that needs to be aggregated, the runtime will create an empty accumulator
 * by calling {@link #createAccumulator()}. Subsequently, the {@code accumulate()} method of the
 * function is called for each input row to update the accumulator. Once all rows have been
 * processed, the {@link #getValue(Object)} method of the function is called to compute and return
 * the final result.
 *
 * <p>The main behavior of an {@link AggregateFunction} can be defined by implementing a custom
 * accumulate method. An accumulate method must be declared publicly, not static, and named <code>
 * accumulate</code>. Accumulate methods can also be overloaded by implementing multiple methods
 * named <code>accumulate</code>.
 *
 * <p>By default, input, accumulator, and output data types are automatically extracted using
 * reflection. This includes the generic argument {@code ACC} of the class for determining an
 * accumulator data type and the generic argument {@code T} for determining an output data
 * type. Input arguments are derived from one or more {@code accumulate()} methods. If the
 * reflective information is not sufficient, it can be supported and enriched with {@link
 * DataTypeHint} and {@link FunctionHint} annotations.
 *
 * <p>An {@link AggregateFunction} needs at least three methods:
 *
 * <ul>
 *   <li>{@code createAccumulator}
 *   <li>{@code accumulate}
 *   <li>{@code getValue}
 * </ul>
 *
 * <p>There are a few other methods that are optional:
 *
 * <ul>
 *   <li>{@code retract}
 *   <li>{@code merge}
 * </ul>
 *
 * <p>All these methods must be declared publicly, not static, and named exactly as the names
 * mentioned above to be called by generated code.
 *
 * <p>For storing a user-defined function in a catalog, the class must have a default constructor
 * and must be instantiable during runtime.
 *
 * <pre>{@code
 * Processes the input values and updates the provided accumulator instance. The method
 * accumulate can be overloaded with different custom types and arguments. An aggregate function
 * requires at least one accumulate() method.
 *
 * param: accumulator           the accumulator which contains the current aggregated results
 * param: [user defined inputs] the input value (usually obtained from new arrived data).
 *
 * public void accumulate(ACC accumulator, [user defined inputs])
 * }</pre>
 *
 * <pre>{@code
 * Retracts the input values from the accumulator instance. The current design assumes the
 * inputs are the values that have been previously accumulated. The method retract can be
 * overloaded with different custom types and arguments. This method must be implemented for
 * bounded OVER aggregates over unbounded tables.
 *
 * param: accumulator           the accumulator which contains the current aggregated results
 * param: [user defined inputs] the input value (usually obtained from new arrived data).
 *
 * public void retract(ACC accumulator, [user defined inputs])
 * }</pre>
 *
 * <pre>{@code
 * Merges a group of accumulator instances into one accumulator instance. This method must be
 * implemented for unbounded session window and hop window grouping aggregates and
 * bounded grouping aggregates. Besides, implementing this method will be helpful for optimizations.
 * For example, two phase aggregation optimization requires all the {@link AggregateFunction}s
 * support "merge" method.
 *
 * param: accumulator the accumulator which will keep the merged aggregate results. It should
 *                    be noted that the accumulator may contain the previous aggregated
 *                    results. Therefore user should not replace or clean this instance in the
 *                    custom merge method.
 * param: iterable    an java.lang.Iterable pointed to a group of accumulators that will be
 *                    merged.
 *
 * public void merge(ACC accumulator, java.lang.Iterable<ACC> iterable)
 * }</pre>
 *
 * <p>If this aggregate function can only be applied in an OVER window, this can be declared by
 * returning the requirement {@link FunctionRequirement#OVER_WINDOW_ONLY} in {@link
 * #getRequirements()}.
 *
 * <p>If an accumulator needs to store large amounts of data, {@link ListView} and {@link MapView}
 * provide advanced features for leveraging Flink's state backends in unbounded data scenarios.
 *
 * <p>The following examples show how to specify an aggregate function:
 *
 * <pre>{@code
 * // a function that counts STRING arguments that are not null and emits them as STRING
 * // the accumulator is BIGINT
 * public static class CountFunction extends AggregateFunction<String, CountFunction.MyAccumulator> {
 *   public static class MyAccumulator {
 *     public long count = 0L;
 *   }
 *
 *   {@literal @}Override
 *   public MyAccumulator createAccumulator() {
 *     return new MyAccumulator();
 *   }
 *
 *   public void accumulate(MyAccumulator accumulator, String s) {
 *     if (s != null) {
 *       accumulator.count += 1;
 *     }
 *   }
 *
 *   {@literal @}Override
 *   public String getValue(MyAccumulator accumulator) {
 *     return "Result: " + accumulator.count;
 *   }
 * }
 *
 * // a function that determines the maximum of either BIGINT or STRING arguments
 * // the accumulator and the output is either BIGINT or STRING
 * public static class MaxFunction extends AggregateFunction<Object, Row> {
 *   {@literal @}Override
 *   public Row createAccumulator() {
 *     return new Row(1);
 *   }
 *
 *   {@literal @}FunctionHint(
 *     accumulator = {@literal @}DataTypeHint("ROW<max BIGINT>"),
 *     output = {@literal @}DataTypeHint("BIGINT")
 *   )
 *   public void accumulate(Row accumulator, Long l) {
 *     final Long max = (Long) accumulator.getField(0);
 *     if (max == null || l > max) {
 *       accumulator.setField(0, l);
 *     }
 *   }
 *
 *   {@literal @}FunctionHint(
 *     accumulator = {@literal @}DataTypeHint("ROW<max STRING>"),
 *     output = {@literal @}DataTypeHint("STRING")
 *   )
 *   public void accumulate(Row accumulator, String s) {
 *     final String max = (String) accumulator.getField(0);
 *     if (max == null || s.compareTo(max) > 0) {
 *       accumulator.setField(0, s);
 *     }
 *   }
 *
 *   {@literal @}Override
 *   public Object getValue(Row accumulator) {
 *     return accumulator.getField(0);
 *   }
 * }
 * }</pre>
 *
 * @param <T> final result type of the aggregation
 * @param <ACC> intermediate result type during the aggregation
 */
@PublicEvolving
public abstract class AggregateFunction<T, ACC> extends ImperativeAggregateFunction<T, ACC> {

    /**
     * Called every time when an aggregation result should be materialized. The returned value could
     * be either an early and incomplete result (periodically emitted as data arrives) or the final
     * result of the aggregation.
     *
     * @param accumulator the accumulator which contains the current intermediate results
     * @return the aggregation result
     */
    public abstract T getValue(ACC accumulator);

    @Override
    public final FunctionKind getKind() {
        return FunctionKind.AGGREGATE;
    }

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public TypeInference getTypeInference(DataTypeFactory typeFactory) {
        return TypeInferenceExtractor.forAggregateFunction(typeFactory, (Class) getClass());
    }
}
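
Finally, a minimal runnable sketch of the three mandatory methods in practice. The AvgFunction below is an invented example (not from the Flink source) that computes an integer average over BIGINT values; the optional merge() method is included because the planner can use it for two-phase aggregation:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.AggregateFunction;

public class AggregateFunctionExample {

    /** Accumulator: the running sum and count. */
    public static class AvgAccumulator {
        public long sum = 0L;
        public long count = 0L;
    }

    /** A simple integer average over BIGINT values. */
    public static class AvgFunction extends AggregateFunction<Long, AvgAccumulator> {

        @Override
        public AvgAccumulator createAccumulator() {
            return new AvgAccumulator();
        }

        public void accumulate(AvgAccumulator acc, Long value) {
            if (value != null) {
                acc.sum += value;
                acc.count += 1;
            }
        }

        public void merge(AvgAccumulator acc, Iterable<AvgAccumulator> others) {
            // combine partial accumulators without replacing the given instance
            for (AvgAccumulator other : others) {
                acc.sum += other.sum;
                acc.count += other.count;
            }
        }

        @Override
        public Long getValue(AvgAccumulator acc) {
            return acc.count == 0 ? null : acc.sum / acc.count;
        }
    }

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build());

        Table t = tEnv.fromValues(1L, 2L, 3L, 4L).as("v");
        tEnv.createTemporaryView("MyTable", t);

        tEnv.createTemporarySystemFunction("MyAvg", AvgFunction.class);
        tEnv.executeSql("SELECT MyAvg(v) FROM MyTable").print();
    }
}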
