JOIN
JOIN combines two tables. The difference from batch processing is that realtime compute joins two dynamic tables, and the join result is itself updated dynamically so that the final result stays consistent with the batch result.
Syntax:
tableReference [, tableReference ]* | tableexpression
[ LEFT ] JOIN tableexpression [ joinCondition ];
- tableReference: the table name.
- tableexpression: the table expression.
- joinCondition: the JOIN condition.
Note:
Only equi-joins are supported; non-equi joins are not.
Only INNER JOIN and LEFT OUTER JOIN are supported (a LEFT OUTER JOIN sketch follows the test results below).
Sample data for Orders JOIN Products:
Test data
Table 1. Orders
rowtime productId orderId units
10:17:00 30 5 4
10:17:05 10 6 1
10:18:05 20 7 2
10:18:07 30 8 20
11:02:00 10 9 6
11:04:00 10 10 1
11:09:30 40 11 12
11:24:11 10 12 4
Table 2. Products
productId name unitPrice
30 Cheese 17
10 Beer 0.25
20 Wine 6
40 Bread 100
Test statement:
SELECT o.rowtime, o.productId, o.orderId, o.units, p.name, p.unitPrice
FROM Orders AS o
JOIN Products AS p
ON o.productId = p.productId;
Test result:
o.rowtime o.productId o.orderId o.units p.name p.unitPrice
10:17:00 30 5 4 Cheese 17
10:17:05 10 6 1 Beer 0.25
10:18:05 20 7 2 Wine 6
10:18:07 30 8 20 Cheese 17
11:02:00 10 9 6 Beer 0.25
11:04:00 10 10 1 Beer 0.25
11:09:30 40 11 12 Bread 100
11:24:11 10 12 4 Beer 0.25
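The query above is an INNER JOIN, so an order whose productId has no matching row in Products is simply dropped from the output. A minimal sketch of the LEFT OUTER JOIN variant over the same tables:
SELECT o.rowtime, o.productId, o.orderId, o.units, p.name, p.unitPrice
FROM Orders AS o
LEFT OUTER JOIN Products AS p
ON o.productId = p.productId;
With this variant an order referencing a productId that is not (yet) in Products is still emitted, with p.name and p.unitPrice set to NULL; because both inputs are dynamic tables, that row is retracted and replaced with the fully joined row if a matching Products record arrives later.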
Dimension table JOIN statement
Dimension table JOIN syntax:
SELECT column-names
FROM table1 [AS <alias1>]
[LEFT] JOIN table2 FOR SYSTEM_TIME AS OF PROCTIME() [AS <alias2>]
ON table1.column-name1 = table2.key-name1;
Notes:
1. Dimension tables support INNER JOIN and LEFT JOIN; RIGHT JOIN and FULL JOIN are not supported (a LEFT JOIN sketch follows the example results below).
2. FOR SYSTEM_TIME AS OF PROCTIME() is mandatory; it means each input row is joined against the dimension table data visible at that moment.
3. Source rows that arrive later are joined only against the dimension table's latest data at that time, i.e. the JOIN happens in processing time. If the dimension table data changes (insert, update, or delete) after a row has been joined, the already-produced join result is not updated to reflect the change.
4. The ON condition must contain equality conditions on all PRIMARY KEY columns of the dimension table (and the key must match the real table definition). Additional equality conditions are also allowed in the ON clause.
5. For a one-to-many JOIN, declare the join key with INDEX in the dimension table DDL; see the INDEX syntax for details (a sketch follows this list).
6. A dimension table cannot be joined with another dimension table.
7. Dimension table fields in the ON condition cannot be wrapped in type conversion functions such as CAST. If you need a conversion, apply it to the source table field instead.
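A minimal sketch of the one-to-many case mentioned in note 5, assuming the Blink dimension-table INDEX syntax; the table name user_phone and its columns are hypothetical, and the WITH options are elided as in the example below:
CREATE TABLE user_phone (
name VARCHAR,
phoneNumber BIGINT,
INDEX(name),
PERIOD FOR SYSTEM_TIME
) WITH (
……
);
Declaring INDEX(name) instead of PRIMARY KEY(name) allows one source row to match several dimension rows that share the same name.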
Example:
Test data
Table 1. datahub_input1
id(bigint) name(varchar) age(bigint)
1 lilei 22
2 hanmeimei 20
3 libai 28
Table 2. phoneNumber
name(varchar) phoneNumber(bigint)
dufu 13900001111
baijuyi 13900002222
libai 13900003333
lilei 13900004444
Test statements:
CREATE TABLE datahub_input1 (
id BIGINT,
name VARCHAR,
age BIGINT
) WITH (
……
……
);
CREATE TABLE phoneNumber (
name VARCHAR,
phoneNumber BIGINT,
PRIMARY KEY (name),
PERIOD FOR SYSTEM_TIME
) WITH (
……
……
);
CREATE TABLE result_infor (
id BIGINT,
phoneNumber BIGINT,
name VARCHAR
) WITH (
……
……
);
INSERT INTO result_infor
SELECT
t.id,
w.phoneNumber,
t.name
FROM datahub_input1 as t
JOIN phoneNumber FOR SYSTEM_TIME AS OF PROCTIME() as w
ON t.name = w.name;
Test result:
id(bigint) phoneNumber(bigint) name(varchar)
1 13900004444 lilei
3 13900003333 libai
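The INNER JOIN above drops source rows that have no match in the dimension table, which is why hanmeimei (id 2) is missing from the result. A minimal sketch of the LEFT JOIN variant, which keeps such rows and fills the dimension columns with NULL:
INSERT INTO result_infor
SELECT
t.id,
w.phoneNumber,
t.name
FROM datahub_input1 AS t
LEFT JOIN phoneNumber FOR SYSTEM_TIME AS OF PROCTIME() AS w
ON t.name = w.name;
With this variant the output would also contain the row (2, NULL, hanmeimei).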
Java application demo: an inner join of two Oracle tables with the result sunk to Elasticsearch:
package FlinkTableApi;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableResult;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class CP_USER_JOIN_AC_SUB_REGIST_INFO {
private static final Logger log = LoggerFactory.getLogger(CP_USER_JOIN_AC_SUB_REGIST_INFO.class);
public static void main(String[] args) throws Exception {
EnvironmentSettings fsSettings = EnvironmentSettings.newInstance() // build the environment settings
.useBlinkPlanner()
.inStreamingMode()
.build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // set the stream parallelism
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, fsSettings); // create the stream table environment
tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
log.info("-----------------> start"); // log job start
// configure checkpointing
env.enableCheckpointing(180000); // enable checkpointing, one checkpoint every 180000 ms
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(50000); // at least 50000 ms between the end of one checkpoint and the start of the next
env.getCheckpointConfig().setCheckpointTimeout(600000); // checkpoint timeout: a checkpoint must finish within this time or it is discarded
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE); // exactly-once mode; EXACTLY_ONCE and AT_LEAST_ONCE are supported
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); // number of checkpoints allowed to run concurrently
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink-checkpoints/oracle/CP_USER_JOIN_AC_SUB_REGIST_INFO"); // store checkpoints in this HDFS directory
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // retain externalized checkpoints after the job is cancelled
env.getCheckpointConfig().enableUnalignedCheckpoints(); // enable experimental unaligned checkpoints
String sourceDDL = "CREATE TABLE Oracle_Source (\n" +
" ID DECIMAL(12,0), \n" +
" HR_CODE STRING, \n" +
" MOBILE STRING, \n" +
" IDENTITY_NO STRING, \n" +
" STATUS STRING, \n" +
" COMPANY_CODE STRING, \n" +
" AREA STRING, \n" +
" SERVICE_AREA STRING, \n" +
" USER_NAME STRING, \n" +
" USER_TYPE STRING, \n" +
" EMAIL STRING, \n" +
" CHECK_TIME STRING, \n" +
" PASSWORD STRING, \n" +
" IS_SEND_MSG STRING, \n" +
" BACK_REASON STRING, \n" +
" WECHAT STRING, \n" +
" POINTS DECIMAL(10,2), \n" +
" REGISTER_TYPE STRING, \n" +
" REGISTER_TIME STRING, \n" +
" VERSION DOUBLE, \n" +
" PRIMARY KEY (ID) NOT ENFORCED \n" +
" ) WITH (\n" +
" 'connector' = 'oracle-cdc',\n" +
" 'hostname' = 'Oracle_IP_address',\n" +
" 'port' = '1521',\n" +
" 'username' = 'name',\n" +
" 'password' = 'passwd',\n" +
" 'database-name' = 'ORCL',\n" +
" 'schema-name' = 'Oracle_schema_name',\n" + // note: the schema name must be uppercase here
" 'table-name' = 'CP_USER',\n" +
" 'debezium.log.mining.continuous.mine'='true',\n" + // valid for Oracle 11g; causes an error on 19c
" 'debezium.log.mining.strategy'='online_catalog',\n" + // read the redo log via the online catalog only, without generating new archived log files
" 'debezium.log.mining.sleep.time.increment.ms'='5000',\n" + // a longer sleep time slows down memory growth of the Oracle connection
" 'debezium.log.mining.batch.size.max'='50000000000000',\n" + // if this value is too small the SCN cannot keep up and the job fails
" 'debezium.log.mining.batch.size.min'='10000',\n" +
" 'debezium.log.mining.session.max.ms'='1200000',\n" + // maximum mining session duration; if the redo log does not switch often, limiting the session length avoids ORA-04036 errors
" 'scan.startup.mode' = 'initial' \n" + // initial mode: full snapshot first, then automatic incremental capture
" )";
String sourceDDL2 = "CREATE TABLE Oracle_Source2 (\n" +
" SUB_REGIST_ID BIGINT, \n" +
" REGIST_ID BIGINT, \n" +
" SUB_REGIST_TYPE STRING, \n" +
" CUST_ID STRING, \n" +
" SCHEDULE_TASK_ID STRING, \n" +
" TASK_ID STRING, \n" +
" VERSION DOUBLE, \n" +
" INSERT_TIME_HIS STRING, \n" +
" UPDATE_TIME_HIS STRING, \n" +
" STAFF_ID STRING, \n" +
" SUBMIT_COMCODE STRING, \n" +
" SUBMIT_USER_ID STRING, \n" +
" TASK_NO STRING, \n" +
" REGIST_NO STRING, \n" +
" TASK_SOURCE STRING, \n" +
" SOURCE_FLAG STRING, \n" +
" APP_STATUS STRING, \n" +
" REMARK STRING, \n" +
" BACK_REASON STRING, \n" +
" QUALITY DOUBLE, \n" +
" EFFICIENCY DOUBLE, \n" +
" SERVICE DOUBLE, \n" +
" APPRAISE STRING, \n" +
" READ_FLAG STRING, \n" +
" CONTENT STRING, \n" +
" STATUS STRING, \n" +
" PRIMARY KEY (SUB_REGIST_ID) NOT ENFORCED \n" +
" ) WITH (\n" +
" 'connector' = 'oracle-cdc',\n" +
" 'hostname' = 'Oracle_IP_address',\n" +
" 'port' = '1521',\n" +
" 'username' = 'name',\n" +
" 'password' = 'passwd',\n" +
" 'database-name' = 'ORCL',\n" +
" 'schema-name' = 'Oracle_schema_name',\n" + // note: the schema name must be uppercase here
" 'table-name' = 'AC_SUB_REGIST_INFO',\n" +
" 'debezium.log.mining.continuous.mine'='true',\n" + // valid for Oracle 11g; causes an error on 19c
" 'debezium.log.mining.strategy'='online_catalog',\n" + // read the redo log via the online catalog only, without generating new archived log files
" 'debezium.log.mining.sleep.time.increment.ms'='5000',\n" + // a longer sleep time slows down memory growth of the Oracle connection
" 'debezium.log.mining.batch.size.max'='50000000000000',\n" + // if this value is too small the SCN cannot keep up and the job fails
" 'debezium.log.mining.batch.size.min'='10000',\n" +
" 'debezium.log.mining.session.max.ms'='1200000',\n" + // maximum mining session duration; if the redo log does not switch often, limiting the session length avoids ORA-04036 errors
" 'scan.startup.mode' = 'initial' \n" + // initial mode: full snapshot first, then automatic incremental capture
" )";
// create the output (sink) table
String sinkDDL = "CREATE TABLE SinkTable (\n" +
" ID DECIMAL(12,0),\n" +
" HR_CODE STRING,\n" +
" MOBILE STRING,\n" +
" IDENTITY_NO STRING,\n" +
" STATUS STRING,\n" +
" COMPANY_CODE STRING,\n" +
" AREA STRING,\n" +
" SERVICE_AREA STRING,\n" +
" USER_NAME STRING,\n" +
" USER_TYPE STRING,\n" +
" EMAIL STRING,\n" +
" CHECK_TIME BIGINT,\n" +
" PASSWORD STRING,\n" +
" IS_SEND_MSG STRING,\n" +
" BACK_REASON STRING,\n" +
" WECHAT STRING,\n" +
" POINTS DECIMAL(10,2),\n" +
" REGISTER_TYPE STRING,\n" +
" REGISTER_TIME BIGINT,\n" +
" VERSION STRING,\n" +
" SUB_REGIST_ID DECIMAL(12,0),\n" +
" REGIST_ID BIGINT,\n" +
" SUB_REGIST_TYPE STRING,\n" +
" CUST_ID STRING,\n" +
" SCHEDULE_TASK_ID STRING,\n" +
" TASK_ID STRING,\n" +
" VERSION0 STRING,\n" +
" INSERT_TIME_HIS0 BIGINT,\n" +
" UPDATE_TIME_HIS0 BIGINT,\n" +
" STAFF_ID STRING,\n" +
" SUBMIT_COMCODE STRING,\n" +
" SUBMIT_USER_ID STRING,\n" +
" TASK_NO STRING,\n" +
" REGIST_NO STRING,\n" +
" TASK_SOURCE STRING,\n" +
" SOURCE_FLAG STRING,\n" +
" APP_STATUS STRING,\n" +
" REMARK STRING,\n" +
" BACK_REASON0 STRING,\n" +
" QUALITY STRING,\n" +
" EFFICIENCY STRING,\n" +
" SERVICE STRING,\n" +
" APPRAISE STRING,\n" +
" READ_FLAG STRING,\n" +
" CONTENT STRING,\n" +
" STATUS0 STRING,\n" +
" PRIMARY KEY (SUB_REGIST_ID) NOT ENFORCED \n" +
") WITH (\n" +
" 'connector' = 'elasticsearch-7',\n" +
" 'hosts' = 'http://ES_IP_address:9200',\n" +
" 'format' = 'json',\n" +
" 'index' = 'cp_user_join_ac_sub_regist_info_ES_inner',\n" +
" 'username' = 'userxxx',\n" +
" 'password' = 'pwdXXXX',\n" +
" 'failure-handler' = 'ignore',\n" +
" 'sink.flush-on-checkpoint' = 'true' ,\n"+
" 'sink.bulk-flush.max-actions' = '20000' ,\n"+
" 'sink.bulk-flush.max-size' = '2mb' ,\n"+
" 'sink.bulk-flush.interval' = '1000ms' ,\n"+
" 'sink.bulk-flush.backoff.strategy' = 'CONSTANT',\n"+
" 'sink.bulk-flush.backoff.max-retries' = '3',\n"+
" 'connection.max-retry-timeout' = '3153600000000',\n"+ // Elasticsearch connection retry timeout; if it is too short the connection is dropped
" 'sink.bulk-flush.backoff.delay' = '100ms'\n"+
" )";
String transformSQL =
" INSERT INTO SinkTable SELECT\n" +
"u.ID,\n" +
"u.HR_CODE,\n" +
"u.MOBILE,\n" +
"u.IDENTITY_NO,\n" +
"u.STATUS,\n" +
"u.COMPANY_CODE,\n" +
"u.AREA,\n" +
"u.SERVICE_AREA,\n" +
"u.USER_NAME,\n" +
"u.USER_TYPE,\n" +
"u.EMAIL,\n" +
"(CAST(u.CHECK_TIME AS BIGINT) - 8 * 60 * 60 * 1000 ) as CHECK_TIME,\n" +
"u.PASSWORD,\n" +
"u.IS_SEND_MSG,\n" +
"u.BACK_REASON,\n" +
"u.WECHAT,\n" +
"u.POINTS,\n" +
"u.REGISTER_TYPE,\n" +
"(CAST(u.REGISTER_TIME AS BIGINT) - 8 * 60 * 60 * 1000 ) as REGISTER_TIME,\n" +
"CAST(u.VERSION AS STRING) as VERSION,\n" +
"a.SUB_REGIST_ID,\n" +
"a.REGIST_ID,\n" +
"a.SUB_REGIST_TYPE,\n" +
"a.CUST_ID,\n" +
"a.SCHEDULE_TASK_ID,\n" +
"a.TASK_ID,\n" +
"CAST(a.VERSION AS STRING) as VERSION0,\n" +
"(CAST(a.INSERT_TIME_HIS AS BIGINT) - 8 * 60 * 60 * 1000 ) as INSERT_TIME_HIS0,\n" +
"(CAST(a.UPDATE_TIME_HIS AS BIGINT) - 8 * 60 * 60 * 1000 ) as UPDATE_TIME_HIS0,\n" +
"a.STAFF_ID,\n" +
"a.SUBMIT_COMCODE,\n" +
"a.SUBMIT_USER_ID,\n" +
"a.TASK_NO,\n" +
"a.REGIST_NO,\n" +
"a.TASK_SOURCE,\n" +
"a.SOURCE_FLAG,\n" +
"a.APP_STATUS,\n" +
"a.REMARK,\n" +
"a.BACK_REASON,\n" +
"CAST(a.QUALITY AS STRING) as QUALITY,\n" +
"CAST(a.EFFICIENCY AS STRING) as EFFICIENCY,\n" +
"CAST(a.SERVICE AS STRING) as SERVICE,\n" +
"a.APPRAISE,\n" +
"a.READ_FLAG,\n" +
"a.CONTENT,\n" +
"a.STATUS\n" +
"FROM Oracle_Source u\n" +
"inner join Oracle_Source2 a on u.USER_CODE_YH = a.USER_CODE" ;
// NOTE: the join key columns USER_CODE_YH and USER_CODE are not declared in the source DDLs above;
// replace them with the actual key columns shared by CP_USER and AC_SUB_REGIST_INFO before running.
// execute the source table DDLs
tableEnv.executeSql(sourceDDL);
tableEnv.executeSql(sourceDDL2);
// execute the sink table DDL
tableEnv.executeSql(sinkDDL);
// execute the transformation SQL
TableResult tableResult = tableEnv.executeSql(transformSQL);
tableResult.print();
// executeSql() above already submits the INSERT job, so env.execute() is not needed here;
// calling it would fail because no DataStream operators were defined on env.
}
}