FlinkSQL系列07-表查询


Flink SQL除了支持基本查询外,还支持一些复杂的高阶聚合和关联。

高阶聚合

语法

group by cube(维度 1,维度 2,维度 3)
group by grouping sets( (维度 1,维度 2) ,(维度 1,维度 3), (维度 2)())
group by rollup(省,市,区)

示例

select
province, city, region, count(distinct uid) as u_cnt
from t
group by cube(province, city, region)
select
province, city, region, count(distinct uid) as u_cnt
from t
group by rollup(province, city, region)
select
province, city, region, count(distinct uid) as u_cnt
from t
group by grouping sets((province, city),(province, city, region))

时间窗口 TVF(表值函数)

Windowing table-valued functions (Windowing TVFs)
flink1.13 开始,提供了时间窗口聚合计算的 TVF 语法
表值函数的使用约束:

  1. 在窗口上做分组聚合,必须带上 window_start 和 window_end 作为分组 key。
  2. 在窗口上做 topN 计算 ,必须带上 window_start 和 window_end 作为 partition 的 key。
  3. 带条件的 join,必须包含 2 个输入表的 window start 和 window end 等值条件。

语法示例

select 
  ......
from table(
  tumble (table t, descriptor(rt), interval '10' minutes) 
)

支持的时间窗口类型

  • 滚动窗口 (Tumble Windows)
TUMBLE(TABLE t_action,DESCRIPTOR(时间属性字段), INTERVAL '10' SECONDS[窗口长度]
  • 滑动窗口(Hop Windows)
HOP (TABLE t_action, DESCRIPTOR(时间属性字段), INTERVAL '5' SECONDS[滑动步长], INTERVAL '10' SECONDS[窗口长度] )
  • 累计窗口(Cumulate Windows)
CUMULATE (TABLE t_action, DESCRIPTOR(时间属性字段), INTERVAL '5' SECONDS[更新最大步长] , INTERVAL '10' SECONDS[窗口最大长度] )
  • 会话窗口(Session Windows)
    暂不支持

window 聚合

案例

public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    StreamTableEnvironment tenv = StreamTableEnvironment.create(env);

    // bidtime | price | item | supplier_id |
    DataStreamSource<String> s1 = env.socketTextStream("hdp01", 9999);
    SingleOutputStreamOperator<Bid> s2 = s1.map(s -> {
        String[] split = s.split(",");
        return new Bid(split[0], Double.parseDouble(split[1]), split[2], split[3]);
    });

    // 把流变成表
    tenv.createTemporaryView("t_bid", s2, Schema.newBuilder()
            .column("bidtime", DataTypes.STRING())
            .column("price", DataTypes.DOUBLE())
            .column("item", DataTypes.STRING())
            .column("supplier_id", DataTypes.STRING())
            .columnByExpression("rt", $("bidtime").toTimestamp())
            .watermark("rt", "rt - interval '1' second")
            .build());

    // 查询
    // tenv.executeSql("select bidtime,price,item,supplier_id,current_watermark(rt) as wm from t_bid").print();

    // 每分钟,计算最近5分钟的交易总额
    tenv.executeSql(
            "select" +
            "  window_start," +
            "  window_end," +
            "  sum(price) as price_amt" +
            "from table(" +
            " hop(table t_bid,descriptor(rt), interval '1' minutes, interval '5' minutes)" +
            ")" +
            "group by window_start,window_end"
    )/*.print()*/;


    // 每2分钟计算最近2分钟的交易总额
    tenv.executeSql(
            "select" +
            "  window_start," +
            "  window_end," +
            "  sum(price) as price_amt" +
            "from table(" +
            " tumble(table t_bid, descriptor(rt), interval '2' minutes)" +
            ")" +
            "group by window_start, window_end"
    )/*.print()*/;


    // 每2分钟计算今天以来的总交易额
    tenv.executeSql(
            "select" +
            "  window_start," +
            "  window_end," +
            "  sum(price) as price_amt" +
            "from table(" +
            " cumulate(table t_bid,descriptor(rt),interval '2' minutes, interval '24' hour)" +
            ")" +
            "group by window_start,window_end"
    )/*.print()*/;

    // 时间窗口上聚合后求topN
    // 每10分钟计算一次,最近10分钟内交易总额最大的前2个供应商及其交易单数
    tenv.executeSql(
            "select * from" +
            "(select" +
            "     window_start,window_end," +
            "     supplier_id," +
            "     price_amt," +
            "     bid_cnt," +
            "     row_number() over(partition by window_start, window_end order by price_amt desc) as rn" +
            "   from (" +
            "      select" +
            "        window_start," +
            "        window_end," +
            "        supplier_id," +
            "        sum(price) as price_amt," +
            "        count(1) as bid_cnt" +
            "      from table(tumble(table t_bid, descriptor(rt), interval '10' minutes) ) " +
            "      group by window_start,window_end,supplier_id" +
            "   )" +
            ") " +
            "where rn<=2"
    )/*.print()*/;

    // 时间窗口上直接求topN
    // 每10分钟窗口中,交易金额最大的前2单
    tenv.executeSql("SELECT * from" +
            "(SELECT" +
            "   bidtime," +
            "   price," +
            "   item," +
            "   supplier_id," +
            "   row_number() over(partition by window_start, window_end order by price desc) as rn" +
            "FROM TABLE(TUMBLE(table t_bid, descriptor(rt), interval '10' minute))" +
            ")" +
            "WHERE rn<=2")
        .print();

    env.execute();
}

window join

在 TVF 上使用 join 要点

  1. 参与 join 的两个表都需要定义时间窗口
  2. join 的条件中必须包含两表的 window_start 和 window_end 的等值条件

支持的 join 方式

  • inner/left/right/full
  • semi(即:where id in … )
  • anti( 即:where id not in … )

示例

public static void main(String[] args) {

       StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
       env.setParallelism(1);

       StreamTableEnvironment tenv = StreamTableEnvironment.create(env);


       /**
        * 1,a,1000
        * 2,b,2000
        * 3,c,2500
        * 4,d,3000
        * 5,e,12000
        */
       DataStreamSource<String> s1 = env.socketTextStream("hdp01", 9998);
       SingleOutputStreamOperator<Tuple3<String, String,Long>> ss1 = s1.map(s -> {
           String[] arr = s.split(",");
           return Tuple3.of(arr[0], arr[1],Long.parseLong(arr[2]));
       }).returns(new TypeHint<Tuple3<String, String,Long>>() {
       });

       /**
        * 1,bj,1000
        * 2,sh,2000
        * 4,xa,2600
        * 5,yn,12000
        */
       DataStreamSource<String> s2 = env.socketTextStream("hdp01", 9999);
       SingleOutputStreamOperator<Tuple3<String, String,Long>> ss2 = s2.map(s -> {
           String[] arr = s.split(",");
           return Tuple3.of(arr[0], arr[1],Long.parseLong(arr[2]));
       }).returns(new TypeHint<Tuple3<String, String,Long>>() {
       });


       // 创建两个表
       tenv.createTemporaryView("t_left",ss1, Schema.newBuilder()
                       .column("f0", DataTypes.STRING())
                       .column("f1", DataTypes.STRING())
                       .column("f2", DataTypes.BIGINT())
                       .columnByExpression("rt","to_timestamp_ltz(f2,3)")
                       .watermark("rt","rt - interval '0' second")
               .build());

       tenv.createTemporaryView("t_right",ss2, Schema.newBuilder()
               .column("f0", DataTypes.STRING())
               .column("f1", DataTypes.STRING())
               .column("f2", DataTypes.BIGINT())
               .columnByExpression("rt","to_timestamp_ltz(f2,3)")
               .watermark("rt","rt - interval '0' second")
               .build());

       // 各类窗口join示例
       // INNER
       tenv.executeSql(
            "SELECT" +
            "  a.f0,a.f1,a.f2,b.f0,b.f1  " +
            "from " +
            "(  select * from table(tumble(table t_left,descriptor(rt),interval '10' second))  ) a" +
            "join " +
            "( select * from table(tumble(table t_right,descriptor(rt),interval '10' second))  ) b" +
            "on a.window_start=b.window_start and a.window_end = b.window_end and a.f0=b.f0"
       )/*.print()*/;


       // left / right / full outer
       tenv.executeSql(
           "SELECT" +
           "  a.f0,a.f1,a.f2,b.f0,b.f1" +
           "from \n" +
           "(  select * from table(tumble(table t_left,descriptor(rt),interval '10' second))  ) a" +
           "full join" +
           "( select * from table(tumble(table t_right,descriptor(rt),interval '10' second))  ) b" +
           "on a.window_start=b.window_start and a.window_end = b.window_end and a.f0=b.f0"
       )/*.print()*/;


       // semi Join ==> where  ... in  ....
       tenv.executeSql(
            "SELECT" +
            "  a.f0,a.f1,a.f2" +
            "from" +
            "(  select * from table(tumble(table t_left,descriptor(rt),interval '10' second))  ) a" +
            "where f0 in " +
            "(" +
            "  select f0 from " +
            "  (select * from table(tumble(table t_right,descriptor(rt),interval '10' second))) b" +
            "  where a.window_start=b.window_start and a.window_end=b.window_end " +
            ")"
       ).print();

       // semi Join ==> where ... not in  ....
       tenv.executeSql(
           "SELECT" +
             "  a.f0,a.f1,a.f2" +
             "from " +
             "(  select * from table(tumble(table t_left,descriptor(rt),interval '10' second))  ) a" +
             "where f0 not in " +
             "(" +
             "  select f0 from" +
             "  (select * from table(tumble(table t_right,descriptor(rt),interval '10' second))) b" +
             "  where a.window_start=b.window_start and a.window_end=b.window_end" +
             ")"
       ).print();
   }

FlinkSql 提供了非常丰富的 join 功能,为实现各类关联场景提供强大的功能支撑
在这里插入图片描述

regular join

常规 join,flink 底层是会对两个参与 join 的输入流中的数据进行状态存储的;
所以,随着时间的推进,状态中的数据量会持续膨胀,可能会导致过于庞大,从而降低系统的整体效
率;
如何去缓解呢:自己根据自己业务系统数据特性(估算能产生关联的左表数据和右表数据到达的最
大时间差),根据这个最大时间差,可以设置状态的 ttl 时长,来控制状态的体积上限。

StreamTableEnvironment tenv = StreamTableEnvironment.create(env);
// 设置 table 环境中的状态 ttl 时长
tenv.getConfig().getConfiguration().setLong("table.exec.state.ttl", 60*60*1000L);

interval join

tenv.executeSql("select a.f0, a.f1, a.f2, b.f0, b.f1 from t_left a join t_right b " +
        "on a.f0 = b.f0  " +
        "and a.rt between b.rt - interval '2' second and b.rt")
        .print();

lookup join

lookup join 为了提高性能,lookup 的连接器会将查询过的维表数据进行缓存(默认未开启此机制),
可以通过参数开启,比如 jdbc-connector 的 lookup 模式下,有如下参数:

  • lookup.cache.max-rows = (none) 未开启
  • lookup.cache.ttl = (none) ttl 缓存清除的时长
public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    StreamTableEnvironment tenv = StreamTableEnvironment.create(env);
    // 设置table环境中的状态ttl时长
    tenv.getConfig().getConfiguration().setLong("table.exec.state.ttl", 60 * 60 * 1000L);

    /**
     * 1,a
     * 2,b
     * 3,c
     * 4,d
     * 5,e
     */
    DataStreamSource<String> s1 = env.socketTextStream("hdp01", 9998);
    SingleOutputStreamOperator<Tuple2<Integer, String>> ss1 = s1.map(s -> {
        String[] arr = s.split(",");
        return Tuple2.of(Integer.parseInt(arr[0]), arr[1]);
    }).returns(new TypeHint<Tuple2<Integer, String>>() {
    });

    // 创建主表(需要声明处理时间属性字段)
    tenv.createTemporaryView("a", ss1, Schema.newBuilder()
            .column("f0", DataTypes.INT())
            .column("f1", DataTypes.STRING())
            .columnByExpression("pt", "proctime()")  // 定义处理时间属性字段
            .build());

    // 创建lookup维表(jdbc connector表)
    tenv.executeSql(
        "create table b(   " +
        "   id  int, " +
        "   name string, " +
        "   gender STRING, " +
        "   primary key(id) not enforced  " +
        ") with (" +
        "  'connector' = 'jdbc'," +
        "  'url' = 'jdbc:mysql://hdp01:3306/flinktest'," +
        "  'table-name' = 'stu'," +
        "  'username' = 'root'," +
        "  'password' = 'root' " +
        ")"
    );

    // lookup join 查询
    tenv.executeSql("select a.*, c.* from a JOIN b FOR SYSTEM_TIME AS OF a.pt AS c ON a.f0 = c.id").print();

    env.execute();
}

temporal join

左表的数据永远去关联右表数据的对应时间上的最新版本。

public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    StreamTableEnvironment tenv = StreamTableEnvironment.create(env);


    /**
     * 订单Id,币种,金额,订单时间
     * 1,a,100,167438436400
     */
    DataStreamSource<String> s1 = env.socketTextStream("hdp01", 9998);

    SingleOutputStreamOperator<Order> ss1 = s1.map(s -> {
        String[] arr = s.split(",");
        return new Order(Integer.parseInt(arr[0]), arr[1], Double.parseDouble(arr[2]), Long.parseLong(arr[3]));
    });


    // 创建主表(需要声明处理时间属性字段)
    tenv.createTemporaryView("orders", ss1, Schema.newBuilder()
            .column("orderId", DataTypes.INT())
            .column("currency", DataTypes.STRING())
            .column("price", DataTypes.DOUBLE())
            .column("orderTime", DataTypes.BIGINT())
            .columnByExpression("rt", "to_timestamp_ltz(orderTime,3)")  // 定义处理时间属性字段
            .watermark("rt","rt")
            .build());


    //tenv.executeSql("select orderId,currency,price,orderTime,rt from orders").print();

    // 创建 temporal 表
    tenv.executeSql("CREATE TABLE currency_rate (\n" +
            "      currency STRING, \n" +
            "      rate double ,  \n" +
            "      update_time bigint ,  \n" +
            "      rt  as to_timestamp_ltz(update_time,3)   ," +
            "      watermark for rt as rt - interval '0' second ," +
            "      PRIMARY KEY(currency) NOT ENFORCED\n" +
            "     ) WITH (  \n" +
            "     'connector' = 'mysql-cdc',\n" +
            "     'hostname' = 'hdp01',\n" +
            "     'port' = '3306',\n" +
            "     'username' = 'root',\n" +
            "     'password' = 'root',\n" +
            "     'database-name' = 'flinktest',\n" +
            "     'table-name' = 'currency_rate'\n" +
            ")");

    //tenv.executeSql("select * from currency_rate").print();

    // temporal 关联查询
    tenv.executeSql(
            "SELECT                  \n" +
            "     orders.orderId,      \n" +
            "     orders.currency,     \n" +
            "     orders.price,        \n" +
            "     orders.orderTime,    \n" +
            "     rate  \n" +
            "FROM orders   \n" +
            "LEFT JOIN currency_rate FOR SYSTEM_TIME AS OF orders.rt  \n" +
            "ON orders.currency = currency_rate.currency"
    ).print();

    env.execute();
}

array join

public static void main(String[] args) {
    TableEnvironment tenv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    Table table = tenv.fromValues(DataTypes.ROW(
                    DataTypes.FIELD("id", DataTypes.INT()),
                    DataTypes.FIELD("name", DataTypes.STRING()),
                    DataTypes.FIELD("tags", DataTypes.ARRAY(DataTypes.STRING())))
            , row("1", "zs", array("stu", "child"))
            , row("2", "bb", array("miss"))
    );

    tenv.createTemporaryView("t",table);

    tenv.executeSql("select t.id,t.name,x.tag from t cross join unnest(tags) as x(tag)")/*.print()*/;

    tenv.createTemporarySystemFunction("mysplit",MySplit.class);
    tenv.executeSql("select t.id, t.name, tag    from t, lateral table(mysplit(tags)) ")/*.print()*/;
    tenv.executeSql("select t.id, t.name, x.tag2 from t, lateral table(mysplit(tags)) x(tag2)")/*.print()*/;
    tenv.executeSql("select t.id, t.name, tag    from t left join lateral table(mysplit(tags)) on true")/*.print()*/;
    tenv.executeSql("select t.id, t.name, x.tag2 from t left join lateral table(mysplit(tags)) x(tag2) on true").print();
}

@FunctionHint(output = @DataTypeHint("ROW<tag string>"))
public static class MySplit extends TableFunction<Row> {

    public void eval(String[] arr){
        for (String s : arr) {
            collect(Row.of(s));
        }
    }
}

over 窗口聚合

row_number( ) over ( )
flinksql 中,over 聚合时,指定聚合数据区间有两种方式

  • 方式 1,带时间设定区间

    RANGE BETWEEN INTERVAL '30' MINUTE PRECEDING AND CURRENT ROW
    
  • 方式 2,按行设定区间

    ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
    
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值