Problem: clicking the ApplicationMaster link of a job on YARN gets no response.
First add the following configuration to yarn-site.xml:
<property>
<name>yarn.web-proxy.address</name>
<value>master01:8089</value>
</property>
cd /opt/installs/hadoop3.1.4/
./sbin/yarn-daemon.sh start proxyserver
Commands to keep Hive running in the background:
nohup /opt/installs/hive3.1.2/bin/hive --service metastore >> /opt/installs/hive3.1.2/logs/metastore.log 2>&1 &
nohup /opt/installs/hive3.1.2/bin/hive --service hiveserver2 >> /opt/installs/hive3.1.2/logs/hiveserver2.log 2>&1 &
nohup hiveserver2 > /tmp/hivesql2.log 2>&1 &
Run a Flink yarn-session in the background:
cd /opt/installs/flink-1.17.0
./bin/yarn-session.sh -d
Start the Flink SQL client for the Paimon demo:
./bin/sql-client.sh -s yarn-session
Switch the result display mode to tableau:
SET 'sql-client.execution.result-mode' = 'tableau';
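Besides tableau, the SQL client also offers the table and changelog result modes; switching is just another SET:
SET 'sql-client.execution.result-mode' = 'table';
SET 'sql-client.execution.result-mode' = 'changelog';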
Using the filesystem as the catalog's storage location:
create catalog fs_catalog with(
'type'='paimon',
'warehouse'='hdfs://192.168.130.200:8020/paimon/fs'
);
To use a Hive catalog for metadata management, the Hive metastore must be running first.
A self-installed Hive usually uses the embedded metastore; in production it is normally switched to a remote one, as follows:
cd /opt/installs/hive/conf/
vim hive-site.xml
<!-- Remote (external) metastore -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://pa:9083</value>
</property>
Start the Hive metastore in the background:
nohup hive --service metastore &
Check whether it is listening:
[root@pa flink-1.17.0]# netstat -anp| grep 9083
tcp6 0 0 :::9083 :::* LISTEN 16327/java
Create the Hive catalog:
create catalog hive_catalog with(
'type'='paimon'
,'metastore'='hive'
,'uri'='thrift://pa:9083'
,'hive-conf-dir'='/opt/installs/hive/conf'
,'warehouse'='hdfs://pa:8020/paimon/hive'
);
use catalog hive_catalog;
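A typical next step after switching to the catalog is to create and use a database; the name test below is illustrative, but it matches the database used by the merge-into examples later:
create database if not exists test; -- illustrative database name
use test;
show tables;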
The catalogs created above disappear from the session once the SQL client is closed; on the next login they can no longer be queried, even though the real data still exists on the filesystem. To avoid recreating them every time, create an initialization file:
cd /opt/installs/flink-1.17.0/conf
vim sql-client-init.sql
Put the catalog creation statements into this file; every time the client starts, the catalogs will be loaded first:
create catalog fs_catalog with(
'type'='paimon',
'warehouse'='hdfs://192.168.130.200:8020/paimon/fs'
);
create catalog hive_catalog with(
'type'='paimon'
,'metastore'='hive'
,'uri'='thrift://pa:9083'
,'hive-conf-dir'='/opt/installs/hive/conf'
,'warehouse'='hdfs://pa:8020/paimon/hive'
);
use catalog fs_catalog;
SET 'sql-client.execution.result-mode' = 'tableau';
The command to start the SQL client again, with the init file, is:
./bin/sql-client.sh -s yarn-session -i conf/sql-client-init.sql
Paimon DDL
Flink SQL> show current catalog;
+----------------------+
| current catalog name |
+----------------------+
| fs_catalog |
+----------------------+
Tables are divided into managed tables (similar to Hive internal tables), external tables, and temporary tables.
CTAS: the created table copies only the schema and the data; information such as the primary key and partitioning is not copied and has to be declared again via the WITH properties.
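For reference, a managed-table DDL consistent with the test table described below could look like this (a sketch; the original CREATE statement is not included in these notes):
create table test (
    user_id BIGINT,
    item_id BIGINT,
    behavior STRING,
    dt STRING,
    hh STRING,
    PRIMARY KEY (dt, hh, user_id) NOT ENFORCED -- columns and key taken from the desc output below
);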
Flink SQL> desc test;
+----------+--------+-------+----------------------+--------+-----------+
| name | type | null | key | extras | watermark |
+----------+--------+-------+----------------------+--------+-----------+
| user_id | BIGINT | FALSE | PRI(dt, hh, user_id) | | |
| item_id | BIGINT | TRUE | | | |
| behavior | STRING | TRUE | | | |
| dt | STRING | FALSE | PRI(dt, hh, user_id) | | |
| hh | STRING | FALSE | PRI(dt, hh, user_id) | | |
+----------+--------+-------+----------------------+--------+-----------+
create table test_n1 with ('primary-key'='user_id,dt','partition'='dt','file.format'='parquet') as select * from test;
Flink SQL> desc test_n1;
+----------+--------+-------+------------------+--------+-----------+
| name | type | null | key | extras | watermark |
+----------+--------+-------+------------------+--------+-----------+
| user_id | BIGINT | FALSE | PRI(user_id, dt) | | |
| item_id | BIGINT | TRUE | | | |
| behavior | STRING | TRUE | | | |
| dt | STRING | FALSE | PRI(user_id, dt) | | |
| hh | STRING | FALSE | | | |
+----------+--------+-------+------------------+--------+-----------+
The CTAS above copies the data and the basic table structure, but the primary key and other information are not copied. CREATE TABLE ... LIKE copies all of the table's structural information but none of the data.
Flink SQL> create table test_like1 LIKE test;
[INFO] Execute statement succeed.
Flink SQL> desc test_like1;
+------+--------+-------+-----------------+--------+-----------+
| name | type | null | key | extras | watermark |
+------+--------+-------+-----------------+--------+-----------+
| id | BIGINT | FALSE | PRI(dt, hh, id) | | |
| name | STRING | TRUE | | | |
| age | BIGINT | TRUE | | | |
| dt | STRING | FALSE | PRI(dt, hh, id) | | |
| hh | STRING | FALSE | PRI(dt, hh, id) | | |
+------+--------+-------+-----------------+--------+-----------+
5 rows in set
Creating a Paimon external table (same idea as in Hive):
CREATE TABLE ex (
user_id BIGINT,
item_id BIGINT,
behavior STRING,
dt STRING,
hh STRING,
PRIMARY KEY (dt, hh, user_id) NOT ENFORCED
) WITH (
'connector' = 'paimon',
'path' = 'hdfs://pa:8020/paimon/external/ex',
'auto-create' = 'true' -- create the table path automatically if it does not exist
);
insert into ex VALUES(1001,1001,'搭建','2023/03/18','21:56:23');
Creating a temporary table:
CREATE TEMPORARY TABLE temp (
k INT,
v STRING
) WITH (
'connector' = 'filesystem',
'path' = 'hdfs://hadoop102:8020/temp.csv',
'format' = 'csv'
);
INSERT OVERWRITE only works in batch mode:
RESET 'execution.checkpointing.interval'; -- reset the checkpoint interval
SET 'execution.runtime-mode' = 'batch'; -- set the runtime mode to batch
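To go back to streaming inserts later, checkpointing has to be re-enabled, since Paimon streaming writes commit on checkpoints (the interval below is only an illustrative value):
SET 'execution.runtime-mode' = 'streaming';
SET 'execution.checkpointing.interval' = '10s'; -- illustrative interval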
Overwriting a table without partitions:
select * from test;
+---------+---------+----------+------------+----------+
| user_id | item_id | behavior | dt | hh |
+---------+---------+----------+------------+----------+
| 1001 | 1001 | 喝酒 | 2023-12-21 | 12:30:21 |
| 1001 | 1001 | 喝酒 | 2023-12-22 | 12:30:21 |
| 1004 | 1004 | 喝酒 | 2024-12-22 | 12:30:21 |
+---------+---------+----------+------------+----------+
insert OVERWRITE test values(2001,2001,'搭建','2023-12-31','12:32:45');
+---------+---------+----------+------------+----------+
| user_id | item_id | behavior | dt | hh |
+---------+---------+----------+------------+----------+
| 2001 | 2001 | 搭建 | 2023-12-31 | 12:32:45 |
+---------+---------+----------+------------+----------+
Overwriting a partitioned table:
select * from test_p;
+---------+---------+----------+------------+----------+
| user_id | item_id | behavior | dt | hh |
+---------+---------+----------+------------+----------+
| 1004 | 1004 | 喝酒 | 2024-12-22 | 12:30:21 |
| 1001 | 1001 | 喝酒 | 2023-12-22 | 12:30:21 |
| 1001 | 1001 | 喝酒 | 2023-12-21 | 12:30:21 |
+---------+---------+----------+------------+----------+
The partition fields do not need to be written explicitly; they are derived from the inserted values automatically:
insert OVERWRITE test values(2001,2001,'搭建','2023-12-22','12:32:45');
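Only the partition that the new row falls into is replaced (dynamic partition overwrite); rows in the other partitions survive, which can be checked with:
select * from test_p;
-- expected: the 2023-12-21 and 2024-12-22 rows are still there, only the overwritten partition now holds the new row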
Paimon's merge-into action
CREATE TABLE ws1 (
id INT,
ts BIGINT,
vc INT,
PRIMARY KEY (id) NOT ENFORCED
);
INSERT INTO ws1 VALUES(1,1,1),(2,2,2),(3,3,3);
CREATE TABLE ws_t (
id INT,
ts BIGINT,
vc INT,
PRIMARY KEY (id) NOT ENFORCED
);
INSERT INTO ws_t VALUES(2,2,2),(3,3,3),(4,4,4),(5,5,5);
Case 1: match ws_t against ws1 on id; for the matched rows in ws_t, change vc to 10 where ts > 2 and delete the rows where ts <= 2.
bin/flink run \
/opt/jar/paimon-flink-action-0.5-20230703.002437-53.jar \
merge-into \
--warehouse hdfs://pa:8020/paimon/hive \
--database test \
--table ws_t \
--source-table test.ws1 \
--on "ws_t.id = ws1.id" \
--merge-actions matched-upsert,matched-delete \
--matched-upsert-condition "ws_t.ts > 2" \
--matched-upsert-set "vc = 10" \
--matched-delete-condition "ws_t.ts<=2"
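A quick way to verify the result (based on the rows inserted above, the matched row with id = 2 should be deleted and id = 3 should now have vc = 10, while ids 4 and 5 are untouched):
select * from ws_t;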
Case 2: match ws_t against ws1 on id; for the matched rows, add 10 to vc in ws_t, and insert the rows from ws1 that did not match into ws_t.
bin/flink run \
/opt/jar/paimon-flink-action-0.5-20230703.002437-53.jar \
merge-into \
--warehouse hdfs://pa:8020/paimon/hive \
--database test \
--table ws_t \
--source-table test.ws1 \
--on "ws_t.id = ws1.id" \
--merge-actions matched-upsert,not-matched-insert \
--matched-upsert-set "vc = ws_t.vc + 10" \
--not-matched-insert-values "*"
Case 3: match ws_t against ws1 on id; for the rows in ws_t that did not match, add 20 to vc where ts > 4 and delete the row where ts = 4.
bin/flink run \
/opt/jar/paimon-flink-action-0.5-20230703.002437-53.jar \
merge-into \
--warehouse hdfs://pa:8020/paimon/hive \
--database test \
--table ws_t \
--source-table test.ws1 \
--on "ws_t.id = ws1.id" \
--merge-actions not-matched-by-source-upsert,not-matched-by-source-delete \
--not-matched-by-source-upsert-condition "ws_t.id > 4" \
--not-matched-by-source-upsert-set "vc = ws_t.vc + 20" \
--not-matched-by-source-delete-condition "ws_t.id = 4"
Case 4: use --source-sql to create a source table under a new catalog, match it against ws_t on id, and insert the rows that did not match into ws_t.
bin/flink run \
/opt/jar/paimon-flink-action-0.5-20230703.002437-53.jar \
merge-into \
--warehouse hdfs://pa:8020/paimon/hive \
--database test \
--table ws_t \
--source-sql "CREATE CATALOG fs2 WITH ('type' = 'paimon','warehouse' = 'hdfs://pa:8020/paimon/fs2')" \
--source-sql "CREATE DATABASE IF NOT EXISTS fs2.test" \
--source-sql "CREATE TEMPORARY VIEW fs2.test.ws2 AS SELECT id+10 as id,ts,vc FROM test.ws1" \
--source-table fs2.test.ws2 \
--on "ws_t.id = ws2. id" \
--merge-actions not-matched-insert\
--not-matched-insert-values "*"
Time travel (batch queries)
Select the corresponding snapshot and open it to get its timestamp.
Consuming the data of a specified version
Read snapshot data by snapshot id:
select * from ws_t /*+ OPTIONS('scan.snapshot-id'='1') */;
Read snapshot data by timestamp:
select * from ws_t /*+ OPTIONS('scan.timestamp-millis'='1711379413843') */;
Query the system tables:
select * from ws_t$snapshots;
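The commit time used for scan.timestamp-millis above can also be taken from this table; assuming the snapshot_id, commit_kind and commit_time columns of the snapshots table:
select snapshot_id, commit_kind, commit_time from ws_t$snapshots; -- column names assumed from Paimon's snapshots table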
Query the tags system table:
select * from ws_t$tags;
Incremental queries (not yet mastered)
Read the incremental changes between (start snapshot, end snapshot]; for example, '1,8' below reads the data that changed between snapshot 1 and snapshot 8:
select * from ws_t /*+ OPTIONS('incremental-between'='1,8') */; -- the rows returned are only the changed data
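There is also a timestamp-based variant; assuming Paimon's incremental-between-timestamp option (the millisecond values are illustrative):
select * from ws_t /*+ OPTIONS('incremental-between-timestamp'='1711379413000,1711379500000') */; -- option and values are assumptions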
Streaming queries (switch the runtime mode back to streaming first)
Read the latest snapshot data:
select * from sxc /*+ OPTIONS('scan.mode'='latest') */;
Read from the latest snapshot onward, set via the scan parameter:
select * from ws_t /*+ OPTIONS('scan.mode'='latest') */;
Streaming queries also support time travel; the statements are the same as for batch time travel.
Read the full data of a specified snapshot and then keep reading subsequent changes:
select * from ws_t /*+ OPTIONS('scan.mode'='from-snapshot-full','scan.snapshot-id'='3') */;
-- the result contains the full data of snapshot 3 plus all subsequent changes
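For contrast, a sketch assuming the from-snapshot scan mode, which starts reading changes from the given snapshot without first emitting that snapshot's full data:
select * from ws_t /*+ OPTIONS('scan.mode'='from-snapshot','scan.snapshot-id'='3') */;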
consumer-id (streaming): the consumer-id feature of streaming queries
Specifying a consumer-id works much like a Kafka consumer group: when the query is run again with the same consumer-id, it resumes consuming from where it stopped last time.
Demo:
select * from ws_t /*+ OPTIONS('consumer-id'='aa') */;
+----+-------------+----------------------+-------------+
| op | id | ts | vc |
+----+-------------+----------------------+-------------+
| +I | 1 | 1 | 1 |
| +I | 2 | 2 | 2 |
| +I | 3 | 3 | 20 |
| +I | 5 | 5 | 25 |
| +I | 11 | 1 | 1 |
| +I | 12 | 2 | 2 |
| +I | 13 | 3 | 3 |
insert into ws_t values(9,9,9);
+----+-------------+----------------------+-------------+
| op | id | ts | vc |
+----+-------------+----------------------+-------------+
| +I | 9 | 9 | 9 |
Viewing the system tables
-- view snapshot information
select * from ws_t$snapshots;
-- view schema information
select * from ws_t$schemas;
-- view the table options
select * from ws_t$options; -- options that are not shown are at their default values
-- view the audit log (the change history of the data; batch read)
select * from ws_t$audit_log;
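The audit log exposes the change type in a rowkind column, so specific operations can be filtered; assuming that column name, for example:
select rowkind, id, ts, vc from ws_t$audit_log where rowkind <> '+I'; -- rowkind column assumed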
-- query the files of the latest snapshot
select * from ws_t$files;
-- query the files of a specified snapshot
select * from ws_t$files /*+ OPTIONS('scan.snapshot-id'='1') */;
Lookup join (dimension table join)
CREATE TABLE customers (
id INT PRIMARY KEY NOT ENFORCED,
name STRING,
country STRING,
zip STRING
);
INSERT INTO customers VALUES(1,'zs','ch','123'),(2,'ls','ch','456'), (3,'ww','ch','789');
CREATE TEMPORARY TABLE Orders (
order_id INT,
total INT,
customer_id INT,
proc_time AS PROCTIME()
) WITH (
'connector' = 'datagen',
'rows-per-second'='1',
'fields.order_id.kind'='sequence',
'fields.order_id.start'='1',
'fields.order_id.end'='1000000',
'fields.total.kind'='random',
'fields.total.min'='1',
'fields.total.max'='1000',
'fields.customer_id.kind'='random',
'fields.customer_id.min'='1',
'fields.customer_id.max'='3'
);
lookup join
select * from Orders as o
JOIN customers
for SYSTEM_TIME AS OF o.proc_time as c on o.customer_id = c.id;
CDC
MySQL CDC
Copy the Flink CDC jar into Flink's lib directory:
Paimon official docs on altering tables:
https://paimon.apache.org/docs/master/how-to/altering-tables/
Configuring the Hive metastore
Nothing special is required; once Hive is installed successfully, add the following two settings to hive-site.xml:
<!-- Configure the metastore -->
<property>
<name>hive.metastore.local</name><!-- whether to use local mode; false enables the remote metastore -->
<value>false</value>
</property>
<property>
<name>hive.metastore.uris</name><!-- address and port of the Hive metastore service; separate multiple entries with commas -->
<value>thrift://master:9083</value>
</property>