Flink 1.14.4 + Iceberg 0.14 + Hive 2.3.9 Lakehouse
1. Environment
Environment preparation and download locations (only the jar paths are given here). A working Hadoop cluster is assumed, and it is best to run the Hive metastore service.
| Hardware | Software | Jar | Download URL |
|---|---|---|---|
| CentOS 6.8 (3 nodes) | Hadoop 2.7.2 | iceberg-flink-runtime-1.14-0.14.0.jar | https://iceberg.apache.org/releases/ |
| hadoop100 | Hive 2.3.9 | iceberg-hive-runtime-0.14.0.jar | https://iceberg.apache.org/releases/ |
| hadoop101 | Flink 1.14.4 | flink-sql-connector-kafka_2.11-1.14.4.jar | https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-kafka_2.11/1.14.4/ |
| hadoop102 | Kafka 2.4.1 | flink-sql-connector-hive-2.3.6_2.11-1.14.4.jar | https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.6_2.11/1.14.4/ |
2. Component Tests
2.1 Flink demo
2.1.1 Adjust the Flink configuration
- This is a test environment running a single standalone node, so the taskmanager.numberOfTaskSlots parameter in $FLINK_HOME/conf/flink-conf.yaml was increased to avoid running out of task slots and taking the TaskManager down.
taskmanager.numberOfTaskSlots: 30
- Also add checkpoint-related parameters to $FLINK_HOME/conf/flink-conf.yaml:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 30s
execution.checkpointing.interval: 1min
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
state.checkpoints.dir: hdfs:///flink/checkpoints
state.backend: filesystem
Without checkpointing enabled, writes from Kafka into Iceberg are never committed, yet no error is reported. Thanks to https://blog.csdn.net/spark_dev/article/details/122581841 for documenting this; otherwise tracking the bug down would have taken much longer.
2.1.2 Upload the jar packages
Upload the following jars to $FLINK_HOME/lib:
-rw-rw-r-- 1 liuser liuser 27213997 Sep 12 10:46 iceberg-flink-runtime-1.14-0.14.0.jar
-rw-rw-r-- 1 liuser liuser 42147220 Sep 14 05:11 flink-sql-connector-hive-2.3.6_2.11-1.14.4.jar
-rw-rw-r-- 1 liuser liuser 3704559 Sep 14 05:11 flink-sql-connector-kafka_2.11-1.14.4.jar
2.1.3 Start Flink and create an Iceberg table
- Start the cluster and enter the SQL client
$FLINK_HOME/bin/start-cluster.sh
$FLINK_HOME/bin/sql-client.sh
- Create a catalog
-- Create a hadoop catalog
CREATE CATALOG hadoop_catalog WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://hadoop101:38020/user/iceberg/hadoop_catalog',
'property-version'='1'
);
-- List catalogs
show catalogs;
-- Create a database
CREATE DATABASE IF NOT EXISTS hadoop_catalog.iceberg_db;
- Create an Iceberg table
CREATE TABLE hadoop_catalog.iceberg_db.t_iceberg_sample_1 (
id BIGINT COMMENT 'unique id',
data STRING
)WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://hadoop101:38020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1',
'property-version'='1'
);
Drop the Iceberg table. Note that because tables in hadoop_catalog are managed (physical) tables, dropping one also removes the corresponding table directory under iceberg_db on HDFS.
DROP TABLE hadoop_catalog.iceberg_db.t_iceberg_sample_1;
Insert data with INSERT ... VALUES
insert into hadoop_catalog.iceberg_db.t_iceberg_sample_1(id, data) values
(10, 'ff'),(13, 'ff'),(12, 'ff');
- Query the data
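To verify the rows, a plain query in the same SQL client session is enough (a minimal check; all names below are the ones already created above):
select * from hadoop_catalog.iceberg_db.t_iceberg_sample_1;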
Insert data with INSERT ... SELECT
drop table hadoop_catalog.iceberg_db.sample_iceberg_partition;
CREATE TABLE hadoop_catalog.iceberg_db.sample_iceberg_partition (
id BIGINT COMMENT 'unique id',
data STRING
) PARTITIONED BY (data)
WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://hadoop100:9000/user/iceberg/hadoop_catalog/iceberg_db/sample_iceberg_partition',
'property-version'='1'
);
INSERT into hadoop_catalog.iceberg_db.sample_iceberg_partition PARTITION(data='aa') SELECT id FROM hadoop_catalog.iceberg_db.t_iceberg_sample_1 WHERE data = 'aa';
- Query the data. The source table had been dropped and re-created earlier and not updated since, so only the rows from the initial insert are present.
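The corresponding check for the partitioned table (again a minimal sketch, using only names defined above):
-- only the rows copied from the source table into the data='aa' partition should come back
select * from hadoop_catalog.iceberg_db.sample_iceberg_partition;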
2.2 Hive demo
2.2.1 Modify hive-site.xml
Add the following property to enable Iceberg support:
<property>
<name>iceberg.engine.hive.enabled</name>
<value>true</value>
<description>Whether Hive enables Iceberg support</description>
</property>
2.2.2 Upload the jar packages
Upload the following jars to $HIVE_HOME/lib:
-rw-r--r-- 1 liuser liuser 872303 Sep 13 05:35 mysql-connector-java-5.1.27-bin.jar
-rw-rw-r-- 1 liuser liuser 27304277 Sep 13 06:15 iceberg-hive-runtime-0.14.0.jar
2.2.3 Start Hive and define the catalogs
# Add the jar
add jar /opt/module/hive/lib/iceberg-hive-runtime-0.14.0.jar;
# Create the hive-catalog
set iceberg.catalog.hive_catalog.type=hive;
set iceberg.catalog.hive_catalog.uri=thrift://hadoop100:9083;
set iceberg.catalog.hive_catalog.clients=5;
set iceberg.catalog.hive_catalog.warehouse=hdfs://hadoop100:9000/user/iceberg/hive_catalog;
# Create the hadoop-catalog
set iceberg.engine.hive.enabled=true;
set iceberg.catalog.hadoop_catalog.type=hadoop;
set iceberg.catalog.hadoop_catalog.warehouse=hdfs://hadoop100:9000/user/iceberg/hadoop_catalog;
# Create the database (schema) and specify its location
create schema iceberg_db location 'hdfs://hadoop100:9000/user/iceberg/hadoop_catalog/iceberg_db';
# Create an external table over the Iceberg table created by Flink SQL (the hadoop_catalog location must be specified)
create external table iceberg_db.t_iceberg_sample_1(
id bigint, data string
)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://hadoop100:9000/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1'
tblproperties('iceberg.catalog'='hadoop_catalog');
Querying the Hive table now reads the Iceberg table data:
0: jdbc:hive2://hadoop100:10000> select * from t_iceberg_sample_1;
+------------------------+--------------------------+
| t_iceberg_sample_1.id | t_iceberg_sample_1.data |
+------------------------+--------------------------+
| 12 | ff |
| 13 | ff |
+------------------------+--------------------------+
2 rows selected (0.491 seconds)
Create an Iceberg table from Hive and insert data. The default location is the hadoop_catalog path, as mentioned above when the iceberg_db schema was created.
create table iceberg_db.ods_cust(
cust_id string,
cust_name string
) partitioned by (channel string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
insert into iceberg_db.ods_cust(cust_id, cust_name, channel)
values(1001, 'z3', 'beijing'),(1002, 'l4', 'shandong');
HDFS directory contents (screenshot omitted).
Query results:
0: jdbc:hive2://hadoop100:10000> select * from ods_cust;
+-------------------+---------------------+-------------------+
| ods_cust.cust_id | ods_cust.cust_name | ods_cust.channel |
+-------------------+---------------------+-------------------+
| 1002 | l4 | shandong |
| 1001 | z3 | beijing |
+-------------------+---------------------+-------------------+
2 rows selected (2.882 seconds)
0: jdbc:hive2://hadoop100:10000>
3. Real-Time Ingestion into the Lake
3.1 Kafka test
- Commands to sanity-check that Kafka works (version 2.4.1)
cd $KAFKA_HOME
## Create a topic
bin/kafka-topics.sh --bootstrap-server hadoop100:9092 --create --replication-factor 2 --partitions 12 --topic source_kafka01
## List topics
bin/kafka-topics.sh --bootstrap-server hadoop100:9092 --list
## Console producer
bin/kafka-console-producer.sh --broker-list hadoop100:9092 --topic source_kafka01
## Console consumer
bin/kafka-console-consumer.sh --bootstrap-server hadoop100:9092 --from-beginning --topic source_kafka01
Older versions, taking Kafka 0.11.1 as an example:
1. Create a topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
2. List topics
bin/kafka-topics.sh --list --zookeeper localhost:2181
3. Produce
bin/kafka-console-producer.sh --broker-list 192.168.91.231:9092,192.168.91.231:9093,192.168.91.231:9094 --topic test
4. Consume
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
- Test data
{"user_id":"a1111","order_amount":11.0,"log_ts":"2020-06-29 12:15:00"}
{"user_id":"a1111","order_amount":11.0,"log_ts":"2020-06-29 12:20:00"}
{"user_id":"a1111","order_amount":11.0,"log_ts":"2020-06-29 12:30:00"}
{"user_id":"a1111","order_amount":13.0,"log_ts":"2020-06-29 12:32:00"}
{"user_id":"a1112","order_amount":15.0,"log_ts":"2020-11-26 12:12:12"}
{"user_id":"a1113","order_amount":20.0,"log_ts":"2020-12-26 12:12:12"}
{"user_id":"a1113","order_amount":21.0,"log_ts":"2021-12-26 12:12:12"}
{"user_id":"liu123","order_amount":21.0,"log_ts":"2022-09-16 12:12:12"}
{"user_id":"xin123","order_amount":21.0,"log_ts":"2022-09-16 12:12:12"}
{"user_id":"l3","order_amount":21.0,"log_ts":"2022-09-16 12:12:12"}
{"user_id":"mingyu","order_amount":21.0,"log_ts":"2022-09-16 12:12:12"}
{"user_id":"pinyang","order_amount":21.0,"log_ts":"2022-09-16 16:12:12"}
{"user_id":"yaxin","order_amount":21.0,"log_ts":"2022-09-16 16:12:12"}
3.2 Test reading Kafka data with Flink SQL
- Verify that Flink SQL can consume the Kafka data. Use the default catalog here; with hadoop_catalog the data may not be readable (a pitfall I hit myself).
create table kafka_log_test
(
log STRING
) WITH (
'connector' = 'kafka',
'topic' = 'source_kafka01',
'properties.bootstrap.servers' = 'hadoop100:9092,hadoop101:9092,hadoop102:9092',
'properties.group.id' = 'testgroup',
'scan.startup.mode' = 'earliest-offset',
'format' = 'raw'
);
select * from kafka_log_test;
The Flink SQL query results are shown in the screenshot (omitted here).
3.3 Create the streaming job
Note: 1. make sure Flink SQL can read the Kafka data normally;
2. make sure checkpointing is configured, see the settings in 2.1.1.
-- SET execution.checkpointing.interval = 60000;
CREATE CATALOG hadoop_catalog WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='hdfs://hadoop100:9000/user/iceberg/hadoop_catalog',
'property-version'='1'
);
-------------------- Enable streaming result mode, otherwise the result set shows no data --------------------
SET execution.result-mode=stream;
-----------------------------------------------------------------------
show catalogs;
CREATE DATABASE IF NOT EXISTS hadoop_catalog.iceberg_db;
drop table hadoop_catalog.iceberg_db.iceberg01;
-- Create the sink table
CREATE TABLE hadoop_catalog.iceberg_db.iceberg01 (
user_id STRING COMMENT 'user_id',
order_amount DOUBLE COMMENT 'order_amount',
log_ts STRING
);
-- Create the source table
CREATE TABLE source_kafka02 (
user_id STRING,
order_amount DOUBLE,
log_ts TIMESTAMP(3)
) WITH (
'connector'='kafka',
'topic'='source_kafka01',
'scan.startup.mode'='latest-offset',
'properties.bootstrap.servers'='hadoop100:9092,hadoop101:9092,hadoop102:9092',
'properties.group.id' = 'testGroup',
'format'='json'
);
-- Write to the sink; type conversion works much like in Hive
insert into hadoop_catalog.iceberg_db.iceberg01 select user_id,order_amount,cast(log_ts as STRING) as log_ts from source_kafka02;
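Optionally, the sink table can also be watched from the Flink SQL client. This is a sketch based on the Iceberg Flink documentation rather than something from the original run; it assumes dynamic table options are enabled so that the streaming-read hint takes effect.
SET table.dynamic-table-options.enabled=true;
-- keep emitting rows as new snapshots are committed by the streaming insert
SELECT * FROM hadoop_catalog.iceberg_db.iceberg01 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */;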
3.4 Check the job status in the Flink web UI
3.5 Watch the HDFS data directory for new data files
3.6 Query the Hive table (I fed the test records in one at a time)
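For Hive to see this Flink-created table, an external mapping table is needed, just as in section 2.2.3. The original notes do not show that step, so the following is only a sketch: the column types mirror the Flink sink table, and the location is inferred from the hadoop_catalog warehouse layout.
add jar /opt/module/hive/lib/iceberg-hive-runtime-0.14.0.jar;
set iceberg.catalog.hadoop_catalog.type=hadoop;
set iceberg.catalog.hadoop_catalog.warehouse=hdfs://hadoop100:9000/user/iceberg/hadoop_catalog;
create external table iceberg_db.iceberg01(
user_id string,
order_amount double,
log_ts string
)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://hadoop100:9000/user/iceberg/hadoop_catalog/iceberg_db/iceberg01'
tblproperties('iceberg.catalog'='hadoop_catalog');
With a mapping like that in place, querying repeatedly while feeding records into Kafka one at a time gave: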
+--------------------+-------------------------+--------------------------+
| iceberg01.user_id | iceberg01.order_amount | iceberg01.log_ts |
+--------------------+-------------------------+--------------------------+
| z3 | 21.0 | 2022-09-16 12:12:12.000 |
| mingyu | 21.0 | 2022-09-16 12:12:12.000 |
| a1111 | 11.0 | 2020-06-29 12:20:00.000 |
| a1112 | 15.0 | 2020-11-26 12:12:12.000 |
| a1111 | 13.0 | 2020-06-29 12:32:00.000 |
| a1111 | 11.0 | 2020-06-29 12:12:12.000 |
| a1111 | 11.0 | 2020-06-29 12:15:00.000 |
| a1113 | 20.0 | 2020-12-26 12:12:12.000 |
| a1113 | 21.0 | 2021-12-26 12:12:12.000 |
| l3 | 21.0 | 2022-09-16 12:12:12.000 |
+--------------------+-------------------------+--------------------------+
10 rows selected (0.489 seconds)
0: jdbc:hive2://hadoop100:10000> select * from iceberg_db.iceberg01;
+--------------------+-------------------------+--------------------------+
| iceberg01.user_id | iceberg01.order_amount | iceberg01.log_ts |
+--------------------+-------------------------+--------------------------+
| mingyu | 21.0 | 2022-09-16 12:12:12.000 |
| mingyu | 21.0 | 2022-09-16 16:12:12.000 |
| pinyang | 21.0 | 2022-09-16 16:12:12.000 |
| a1111 | 11.0 | 2020-06-29 12:20:00.000 |
| a1112 | 15.0 | 2020-11-26 12:12:12.000 |
| a1111 | 13.0 | 2020-06-29 12:32:00.000 |
| a1111 | 11.0 | 2020-06-29 12:12:12.000 |
| a1111 | 11.0 | 2020-06-29 12:15:00.000 |
| a1113 | 20.0 | 2020-12-26 12:12:12.000 |
| a1113 | 21.0 | 2021-12-26 12:12:12.000 |
| z3 | 21.0 | 2022-09-16 12:12:12.000 |
| l3 | 21.0 | 2022-09-16 12:12:12.000 |
+--------------------+-------------------------+--------------------------+
12 rows selected (0.502 seconds)
0: jdbc:hive2://hadoop100:10000> select * from iceberg_db.iceberg01;
+--------------------+-------------------------+--------------------------+
| iceberg01.user_id | iceberg01.order_amount | iceberg01.log_ts |
+--------------------+-------------------------+--------------------------+
| mingyu | 21.0 | 2022-09-16 12:12:12.000 |
| mingyu | 21.0 | 2022-09-16 16:12:12.000 |
| pinyang | 21.0 | 2022-09-16 16:12:12.000 |
| yaxin | 21.0 | 2022-09-16 16:12:12.000 |
| z3 | 21.0 | 2022-09-16 12:12:12.000 |
| a1111 | 11.0 | 2020-06-29 12:20:00.000 |
| a1112 | 15.0 | 2020-11-26 12:12:12.000 |
| a1111 | 13.0 | 2020-06-29 12:32:00.000 |
| a1111 | 11.0 | 2020-06-29 12:12:12.000 |
| a1111 | 11.0 | 2020-06-29 12:15:00.000 |
| a1113 | 20.0 | 2020-12-26 12:12:12.000 |
| a1113 | 21.0 | 2021-12-26 12:12:12.000 |
| l3 | 21.0 | 2022-09-16 12:12:12.000 |
+--------------------+-------------------------+--------------------------+
13 rows selected (0.694 seconds)
4. Pitfalls Encountered
4.1 Environment
The first environment was Hadoop 3.1.3 + Hive 3.1.2 + Flink 1.14.4. As this write-up suggests, it failed: after several days of stepping into pits I still could not get it working and gave up, since my coding skills were not up to patching the source.
4.1.1 Hadoop 3 environment
1) Flink + Hive error: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
guava-27.0-jre.jar conflicts with flink-sql-connector-hive-3.1.2_2.11-1.14.4.jar, because the latter bundles its own Guava and Flink cannot tolerate two Guava versions on the classpath. Many posts blame a Guava mismatch between Hive and Hadoop, but both of mine were already guava-27.0-jre.jar. First attempt: download the Flink 1.14 and Hive 3.1.2 sources, comment out the Guava dependency, rebuild hive-exec.jar, install it into the local Maven repository, rebuild flink-sql-connector-hive-3.1.2_2.11-1.14.4.jar, restart Flink, and rerun the Flink SQL. Result: a new error, the Iceberg .mr input format could not be loaded, i.e. the Iceberg files could not be read.
Second attempt: suspecting that commenting out Guava had broken Hadoop's ability to read the Iceberg files (which in hindsight was silly), I copied /opt/module/hadoop-3.1.3/share/hadoop/common/lib/guava-27.0-jre.jar into Flink's lib directory. That only made things worse: the serialization encodings no longer even matched.
Takeaway: recompiling gets rid of the conflict error, but it does not solve the Hadoop InputFormat problem.
2) Kept searching for related documentation.
Suggested fix:
1. Remove all guava-*.jar from $HADOOP_HOME/share/hadoop/common/lib and $HIVE_HOME/lib.
2. Put guava-27.0-jre.jar into both $HADOOP_HOME/share/hadoop/common/lib and $HIVE_HOME/lib.
3. Put guava-27.0-jre.jar into $FLINK_HOME/lib and rename it to a_guava-27.0.jre.jar (the rename matters).
Reference: https://blog.csdn.net/young_0609/article/details/122120690
Takeaway: this likewise fixed the (ZLjava/lang/String;Ljava/lang/Object;)V error, but still did not solve the mr file InputFormat problem.
3) Suspecting the Hadoop environment itself, and given my limited ability to patch the source, I simply switched environments.
4.1.2 Hadoop 2 environment
centos6.8+hadoop2.7.2+hive1.2.1+kafka0.11+flink1.14.4
- Flink test: dropped iceberg-flink-runtime-1.14-0.14.0.jar into flink/lib, then created a catalog, a database, and a table, inserted data, and queried it back. It went surprisingly smoothly; after a full day of fighting the Hadoop 3 environment, the Hadoop 2 setup had no issues at all, which almost felt suspicious.
- Hive test: even after add jar, creating an Iceberg-backed table still failed with:
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnNames(Lorg/apache/hadoop/conf/Configuration;)[Ljava/lang/String;
The Iceberg site only documents Hive 2 and Hive 3, so with no other workaround in sight the only option left was to upgrade Hive.
- Hive upgrade
Hive package download: the Alibaba Cloud mirror (a domestic mirror of the Apache releases).
Step 1: back up the existing Hive 1.2.1 installation (a VM snapshot also works).
cd /opt/module
mv hive hive_back
Step 2: extract Hive 2.3.9 into the target directory.
tar -zxvf "apache-hive-2.3.9-bin (1).tar.gz" -C /opt/module/
mv apache-hive-2.3.9-bin hive
Step 3: copy hive-site.xml from the Hive 1.2.1 conf directory into the Hive 2.3.9 conf directory.
cp /opt/module/hive_back/conf/hive-site.xml /opt/module/hive/conf
Step 4: copy the MySQL JDBC driver from the Hive 1.2.1 lib directory into the Hive 2.3.9 lib directory.
cp /opt/module/hive_back/lib/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib
Step 5: upload iceberg-hive-runtime-0.14.0.jar to /opt/module/hive/lib.
Step 6: go to scripts/metastore/upgrade/mysql under the Hive 2.3.9 directory.
At the bottom of the listing are the upgrade scripts to run (partial listing shown):
-rw-r--r-- 1 liuser liuser 243 Aug 18 2020 upgrade-1.1.0-to-1.2.0.mysql.sql
-rw-r--r-- 1 liuser liuser 703 Jun  2 2021 upgrade-1.2.0-to-1.3.0.mysql.sql
-rw-r--r-- 1 liuser liuser 638 Jun  2 2021 upgrade-1.2.0-to-2.0.0.mysql.sql
-rw-r--r-- 1 liuser liuser 343 Jun  2 2021 upgrade-2.0.0-to-2.1.0.mysql.sql
-rw-r--r-- 1 liuser liuser 343 Jun  2 2021 upgrade-2.1.0-to-2.2.0.mysql.sql
-rw-r--r-- 1 liuser liuser 277 Jun  2 2021 upgrade-2.2.0-to-2.3.0.mysql.sql
Step 7: in /opt/module/hive/scripts/metastore/upgrade/mysql, run the scripts in order from the MySQL client connected to the metastore database:
source upgrade-1.2.0-to-2.0.0.mysql.sql
source upgrade-2.0.0-to-2.1.0.mysql.sql
source upgrade-2.1.0-to-2.2.0.mysql.sql
source upgrade-2.2.0-to-2.3.0.mysql.sql
In my case MySQL threw errors on the first upgrade script, probably because my Hive version was so old that some tables were missing. I ended up re-initializing the metastore instead: dropping the metastore database and sourcing hive-schema-2.3.0.mysql.sql. This is not recommended, as it discards all existing metadata; I did not have time to look into a cleaner approach, but the following scripts look worth studying:
-rw-r--r-- 1 liuser liuser 2845 Aug 18 2020 hive-txn-schema-0.13.0.mysql.sql
-rw-r--r-- 1 liuser liuser 2845 Aug 18 2020 hive-txn-schema-0.14.0.mysql.sql
-rw-r--r-- 1 liuser liuser 4139 Jun  2 2021 hive-txn-schema-1.3.0.mysql.sql
-rw-r--r-- 1 liuser liuser 3770 Jun  2 2021 hive-txn-schema-2.0.0.mysql.sql
-rw-r--r-- 1 liuser liuser 4165 Jun  2 2021 hive-txn-schema-2.1.0.mysql.sql
-rw-r--r-- 1 liuser liuser 4165 Jun  2 2021 hive-txn-schema-2.2.0.mysql.sql
-rw-r--r-- 1 liuser liuser 4224 Jun  2 2021 hive-txn-schema-2.3.0.mysql.sql
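For completeness, a rough sketch of the re-initialization described above. It assumes the MySQL database backing the metastore is named metastore (check javax.jdo.option.ConnectionURL in hive-site.xml), and it destroys all existing Hive metadata, so only do this on a throwaway environment:
-- run in the MySQL client on the metastore host
DROP DATABASE IF EXISTS metastore;
CREATE DATABASE metastore;
USE metastore;
SOURCE /opt/module/hive/scripts/metastore/upgrade/mysql/hive-schema-2.3.0.mysql.sql;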
- Creating the table now succeeds. Finally solved!
add jar /opt/module/hive/lib/iceberg-hive-runtime-0.14.0.jar;
set iceberg.engine.hive.enabled=true;
set iceberg.catalog.hadoop_catalog.type=hadoop;
set iceberg.catalog.hadoop_catalog.warehouse=hdfs://hadoop100:9000/user/iceberg/hadoop_catalog;
create schema iceberg_db location 'hdfs://hadoop100:9000/user/iceberg/hadoop_catalog/iceberg_db';
create external table iceberg_db.t_iceberg_sample_1(
id bigint, data string
)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://hadoop100:9000/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1'
tblproperties('iceberg.catalog'='hadoop_catalog');
For the partitioned table, the partitions are not visible from Hive (a quick check is sketched after the DDL below).
create table iceberg_db.employee(
id bigint,
name string
) partitioned by (birthday date, country string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
location 'hdfs://hadoop100:9000/user/iceberg/hadoop_catalog/iceberg_db/employee'
tblproperties('iceberg.catalog'='hadoop_catalog');
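A quick way to see that behavior (not from the original notes, just a hedged check): Iceberg keeps the partition spec in its own metadata, so the Hive metastore has nothing to list for this table.
-- either returns no partitions or errors out, since Hive does not manage this table's partitioning
show partitions iceberg_db.employee;
-- the partition columns are still queryable as ordinary columns
select id, name, birthday, country from iceberg_db.employee;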
end