1. Experiment Environment
Environment & Software | Version
---|---
Linux OS | CentOS 7
Flink | 1.13.6
Kafka | 0.11.0
ZooKeeper | 3.4.5
2. Experiment Principle
Using the Flink SQL client shipped with Flink 1.13, we create two virtual tables:
- son
- father
Each maps to the Kafka topic of the same name:
- son
- father
A simple join across the two tables then drives a continuously updating result view, which is a minimal simulation of the join step in a real-time report or real-time data warehouse.
The Flink SQL table schemas are as follows:
Flink SQL> desc son;
+-----------+--------+------+-----+--------+-----------+
| name | type | null | key | extras | watermark |
+-----------+--------+------+-----+--------+-----------+
| id | INT | true | | | |
| name | STRING | true | | | |
| age | INT | true | | | |
| address | STRING | true | | | |
| father_id | INT | true | | | |
+-----------+--------+------+-----+--------+-----------+
5 rows in set
Flink SQL> desc father;
+------+--------+------+-----+--------+-----------+
| name | type | null | key | extras | watermark |
+------+--------+------+-----+--------+-----------+
| id | INT | true | | | |
| name | STRING | true | | | |
+------+--------+------+-----+--------+-----------+
2 rows in set
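Before moving to the DDL, the join this experiment performs can be sketched in plain Python over the sample rows used below. This is only a batch-style sketch of the final relational result; Flink evaluates the same join continuously as records arrive on the Kafka topics.

```python
# Sample rows matching the CSV data prepared in section 3.
son = [
    (1, "张三", 21, "shanghai", 1001),
    (2, "李四", 22, "beijing", 1002),
    (3, "王五", 23, "guangzhou", 1003),
    (4, "赵六", 24, "shenzhen", 1004),
]
father = [(1001, "张杰"), (1003, "王杰"), (1005, "钱杰"), (1007, "李杰")]

def inner_join(sons, fathers):
    """Match each son row to a father row on son.father_id == father.id."""
    fathers_by_id = dict(fathers)
    return [
        (sid, sname, age, addr, fid, fathers_by_id[fid])
        for sid, sname, age, addr, fid in sons
        if fid in fathers_by_id
    ]

for row in inner_join(son, father):
    print(row)
# Only father_id 1001 and 1003 have a matching father row,
# so two joined rows come out — the same result section 5.1 shows.
```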
3. Table Creation Statements
3.1 Creating the son table
1) Flink SQL DDL for the virtual table
DROP TABLE IF EXISTS `son`;
CREATE TABLE `son` (
`id` INT,
`name` STRING,
`age` INT ,
`address` STRING,
`father_id` INT
) WITH (
'connector' = 'kafka',
'topic' = 'son',
'properties.bootstrap.servers' = 'dn5:9092',
'properties.group.id' = 'testGroup',
'format' = 'csv' ,
'scan.startup.mode' = 'earliest-offset'
);
2) Sample data in CSV format
1,张三,21,shanghai,1001
2,李四,22,beijing,1002
3,王五,23,guangzhou,1003
4,赵六,24,shenzhen,1004
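The 'format' = 'csv' option makes Flink parse each Kafka record positionally into the table's declared columns. A rough Python equivalent of that mapping (the helper name and dict output are illustrative only, not Flink API):

```python
import csv
from io import StringIO

# Column names and order as declared in the `son` DDL above.
COLUMNS = ("id", "name", "age", "address", "father_id")

def parse_son_record(line: str) -> dict:
    """Split one CSV record and map the fields onto the declared columns."""
    fields = next(csv.reader(StringIO(line)))
    row = dict(zip(COLUMNS, fields))
    # INT columns are converted; STRING columns stay as-is.
    row["id"] = int(row["id"])
    row["age"] = int(row["age"])
    row["father_id"] = int(row["father_id"])
    return row

print(parse_son_record("1,张三,21,shanghai,1001"))
```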
3.2 Creating the father table
1) Flink SQL DDL for the virtual table
DROP TABLE IF EXISTS `father`;
CREATE TABLE `father` (
`id` INT,
`name` STRING
) WITH (
'connector' = 'kafka',
'topic' = 'father',
'properties.bootstrap.servers' = 'dn5:9092',
'properties.group.id' = 'testGroup',
'format' = 'csv' ,
'scan.startup.mode' = 'earliest-offset'
);
2) Sample data in CSV format
1001,张杰
1003,王杰
1005,钱杰
1007,李杰
4. Data Verification
4.1 Creating the Flink SQL virtual tables
Start the Flink SQL client:
[bigdata_admin@dn5 bin]$ ./sql-client.sh embedded
Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR or HADOOP_CLASSPATH was set.
Setting HBASE_CONF_DIR=/etc/hbase/conf because no HBASE_CONF_DIR was set.
No default environment specified.
Searching for '/data/flink-1.13.6/conf/sql-client-defaults.yaml'...not found.
Command history file path: /home/bigdata_admin/.flink-sql-history
Welcome! Enter 'HELP;' to list all available commands. 'QUIT;' to exit.
Flink SQL>
Flink SQL> DROP TABLE IF EXISTS `son`;
[INFO] Execute statement succeed.
Flink SQL> CREATE TABLE `son` (
> `id` INT,
> `name` STRING,
> `age` INT ,
> `address` STRING,
> `father_id` INT
> ) WITH (
> 'connector' = 'kafka',
> 'topic' = 'son',
> 'properties.bootstrap.servers' = 'dn5:9092',
> 'properties.group.id' = 'testGroup',
> 'format' = 'csv' ,
> 'scan.startup.mode' = 'earliest-offset'
> );
[INFO] Execute statement succeed.
Flink SQL> DROP TABLE IF EXISTS `father`;
[INFO] Execute statement succeed.
Flink SQL> CREATE TABLE `father` (
> `id` INT,
> `name` STRING
> ) WITH (
> 'connector' = 'kafka',
> 'topic' = 'father',
> 'properties.bootstrap.servers' = 'dn5:9092',
> 'properties.group.id' = 'testGroup',
> 'format' = 'csv' ,
> 'scan.startup.mode' = 'earliest-offset'
> );
[INFO] Execute statement succeed.
4.2 Running the Flink SQL join query
This continuous query lets us watch the joined data change in real time:
Flink SQL>
> select
> t1.id as `子ID`,
> t1.name as `子名称`,
> t1.age as `子年龄`,
> t1.address as `子地址`,
> t2.id as `父ID`,
> t2.name as `父名称`
> from son t1
> join father t2 on t1.father_id = t2.id;
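Conceptually, a streaming inner join buffers rows from both sides and emits a joined row as soon as a record finds a match on the other side, in whatever order records arrive. A toy Python sketch of that idea (Flink's actual join operator keeps this state in managed keyed state; the function names here are made up for illustration):

```python
# State: rows seen so far on each side, keyed by the join key.
son_by_fid = {}      # father_id -> list of buffered son rows
father_by_id = {}    # id -> father name
results = []         # joined rows emitted so far

def on_son(row):
    """Handle one record from the `son` topic."""
    sid, sname, age, addr, fid = row
    son_by_fid.setdefault(fid, []).append(row)
    if fid in father_by_id:                      # match already buffered
        results.append((sid, sname, age, addr, fid, father_by_id[fid]))

def on_father(fid, fname):
    """Handle one record from the `father` topic."""
    father_by_id[fid] = fname
    for sid, sname, age, addr, _ in son_by_fid.get(fid, []):
        results.append((sid, sname, age, addr, fid, fname))

# Records may interleave in any order; matches are emitted
# as soon as both sides of a key have been seen.
on_son((1, "张三", 21, "shanghai", 1001))
on_father(1001, "张杰")                       # emits the first joined row
on_father(1003, "王杰")
on_son((3, "王五", 23, "guangzhou", 1003))    # emits immediately
on_son((2, "李四", 22, "beijing", 1002))      # no match yet, stays buffered
```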
4.3 Producing test data to Kafka
1) Producing the son data:
[bigdata_admin@dn5 ~]$ kafka-console-producer --broker-list dn3:9092,dn4:9092,dn5:9092 --topic son
>1,张三,21,shanghai,1001
>2,李四,22,beijing,1002
>3,王五,23,guangzhou,1003
>4,赵六,24,shenzhen,1004
2) Producing the father data:
[bigdata_admin@dn5 ~]$ kafka-console-producer --broker-list dn3:9092,dn4:9092,dn5:9092 --topic father
>1001,张杰
>1003,王杰
>1005,钱杰
>1007,李杰
5. Results
5.1 Inner join of the two virtual tables
SQL Query Result (Table)
Refresh: 1 s Page: Last of 1 Updated: 23:27:07.649
子ID 子名称 子年龄 子地址 父ID 父名称
1 张三 21 shanghai 1001 张杰
3 王五 23 guangzhou 1003 王杰
5.2 Left join of the two virtual tables
Changing `join` to `left join` in the query from section 4.2 keeps son rows that have no matching father:
子ID 子名称 子年龄 子地址 父ID 父名称
1 张三 21 shanghai 1001 张杰
3 王五 23 guangzhou 1003 王杰
2 李四 22 beijing (NULL) (NULL)
4 赵六 24 shenzhen (NULL) (NULL)
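The left-join semantics shown above can be sketched in plain Python over the same sample rows: every son row is kept, and the father-side columns become NULL (None here) when no match exists.

```python
son = [
    (1, "张三", 21, "shanghai", 1001),
    (2, "李四", 22, "beijing", 1002),
    (3, "王五", 23, "guangzhou", 1003),
    (4, "赵六", 24, "shenzhen", 1004),
]
father = [(1001, "张杰"), (1003, "王杰"), (1005, "钱杰"), (1007, "李杰")]

def left_join(sons, fathers):
    """Keep every son row; fill father columns with None when unmatched."""
    fathers_by_id = dict(fathers)
    return [
        (sid, sname, age, addr,
         fid if fid in fathers_by_id else None,   # t2.id
         fathers_by_id.get(fid))                  # t2.name
        for sid, sname, age, addr, fid in sons
    ]

for row in left_join(son, father):
    print(row)
# All 4 son rows survive; rows 2 and 4 carry (None, None)
# for the father columns, matching the (NULL) cells above.
```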
5.3 Right join of the two virtual tables
With `right join`, father rows without a matching son are kept instead:
子ID 子名称 子年龄 子地址 父ID 父名称
(NULL) (NULL) (NULL) (NULL) 1005 钱杰
(NULL) (NULL) (NULL) (NULL) 1007 李杰
1 张三 21 shanghai 1001 张杰
3 王五 23 guangzhou 1003 王杰
5.4 Aggregation
select count(1) as `总量` from (
select
t1.id as `子ID`,
t1.name as `子名称`,
t1.age as `子年龄`,
t1.address as `子地址`,
t2.id as `父ID`,
t2.name as `父名称`
from son t1
left join father t2 on t1.father_id = t2.id
);
SQL Query Result (Table)
Refresh: 1 s Page: Last of 1 Updated: 23:51:55.149
总量
4
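As a sanity check, the count shown above can be reproduced in plain Python with the sample data: a left join keeps every son row regardless of whether a father matches, so `count(1)` over it equals the number of son rows.

```python
# father_id values of the 4 son rows, and id values of the father rows.
son_father_ids = [1001, 1002, 1003, 1004]
father_ids = {1001, 1003, 1005, 1007}

# A left join preserves all son rows, so the row count is just len(son rows).
total = len(son_father_ids)
# For comparison: only these rows carry non-NULL father columns.
matched = sum(1 for fid in son_father_ids if fid in father_ids)

print(total)    # 4, the `总量` the query converges to
print(matched)  # 2 rows actually join to a father
```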