Environment variables on the CDH servers (/etc/profile):
export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera/
export JRE_HOME=$JAVA_HOME/jre
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export HBASE_CONF_DIR=/etc/hbase/conf
export CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JRE_HOME/lib:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:${HIVE_HOME}/bin:$PATH
export PATH HADOOP_HOME HADOOP_CLASSPATH JAVA_HOME HIVE_HOME
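After editing /etc/profile, reload it in the current shell so the variables take effect:
source /etc/profile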
Flink is installed on any one of the CDH servers.
I. Compile Hudi 0.10.1
1. Clone the Hudi project yourself and check out the 0.10.1 branch.
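For example, a minimal sketch of that step (assuming the Apache Hudi GitHub repository and its release-0.10.1 branch name; adjust if you use a different mirror or tag):
git clone https://github.com/apache/hudi.git
cd hudi
git checkout release-0.10.1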
2. I modified two pom files. In hudi/pom.xml, change the following:
Comment out these two module lines:
<!-- <module>hudi-integ-test</module>
<module>packaging/hudi-integ-test-bundle</module>-->
Change the Flink version:
<flink.version>1.13.3</flink.version>
Modify hudi\packaging\hudi-flink-bundle\pom.xml as follows:
Comment out the following entries (one include and three relocation blocks):
<!-- <include>flink-sql-connector-hive-2.3.6_${scala.binary.version}</include>-->
<!-- <relocation>
<pattern>org.apache.hadoop.hive.metastore.</pattern>
<shadedPattern>${flink.bundle.shade.prefix}org.apache.hadoop.hive.metastore.</shadedPattern>
</relocation>-->
<!-- <relocation>
<pattern>org.apache.hadoop.hive.conf.</pattern>
<shadedPattern>${flink.bundle.shade.prefix}org.apache.hadoop.hive.conf.</shadedPattern>
</relocation>-->
<!-- <relocation>
<pattern>org.apache.hadoop.hive.ql.metadata.</pattern>
<shadedPattern>${flink.bundle.shade.prefix}org.apache.hadoop.hive.ql.metadata.</shadedPattern>
</relocation>-->
Change one line: set hive.version below to the actual Hive version in your CDH:
<profile>
<id>flink-bundle-shade-hive2</id>
<properties>
<hive.version>2.1.1-cdh6.3.2</hive.version>
<flink.bundle.hive.scope>compile</flink.bundle.hive.scope>
</properties>
<dependencies>
<dependency>
<groupId>${hive.groupid}</groupId>
<artifactId>hive-service-rpc</artifactId>
<version>${hive.version}</version>
<scope>${flink.bundle.hive.scope}</scope>
</dependency>
</dependencies>
</profile>
Run the Maven build from the following directory: \hudi\packaging\hudi-flink-bundle
\hudi\packaging\hudi-flink-bundle>mvn clean package -DskipITs -Dmaven.test.skip=true -Dhadoop.version=3.0.0 -Pflink-bundle-shade-hive2
The finished bundle should be named hudi-flink-bundle_2.11-0.10.1.jar. Upload this jar to the ${FLINK_HOME}/lib directory.
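For example, a minimal sketch of locating and copying the bundle (the target path follows the standard Maven layout; adjust ${FLINK_HOME} to your installation):
ls packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.10.1.jar
cp packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.10.1.jar ${FLINK_HOME}/lib/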
These are all the jars in my lib directory, for reference:
[root@lo-t-bd-nn lib]# ll
total 245808
-rw-r--r-- 1 root root 92313 Oct 13 2021 flink-csv-1.13.3.jar
-rw-r--r-- 1 root root 115418686 Oct 13 2021 flink-dist_2.11-1.13.3.jar
-rw-r--r-- 1 root root 148127 Oct 13 2021 flink-json-1.13.3.jar
-rwxrwxrwx 1 root root 7709740 Jun 8 2021 flink-shaded-zookeeper-3.4.14.jar
-rw-rw-r-- 1 stack stack 3674114 Sep 27 09:46 flink-sql-connector-kafka_2.11-1.13.3.jar
-rw-r--r-- 1 root root 36453353 Oct 13 2021 flink-table_2.11-1.13.3.jar
-rw-r--r-- 1 root root 41061738 Oct 13 2021 flink-table-blink_2.11-1.13.3.jar
-rw-r--r-- 1 root root 9927052 Sep 26 11:16 hadoop-hdfs-3.0.0-cdh6.3.2.jar
-rw-r--r-- 1 root root 770504 Sep 26 11:12 hadoop-mapreduce-client-common-3.0.0-cdh6.3.2.jar
-rw-r--r-- 1 root root 1644597 Sep 26 11:12 hadoop-mapreduce-client-core-3.0.0-cdh6.3.2.jar
-rw-r--r-- 1 root root 51302 Sep 26 11:12 hadoop-mapreduce-client-jobclient-3.0.0-cdh6.3.2.jar
-rw-rw-r-- 1 stack stack 32441255 Sep 26 11:54 hive-exec-2.1.1.jar
-rw-rw-r-- 1 stack stack 234121 Sep 26 15:30 libthrift-0.9.3-1.jar
-rwxrwxrwx 1 root root 67114 Mar 31 2021 log4j-1.2-api-2.12.1.jar
-rwxrwxrwx 1 root root 276771 Mar 31 2021 log4j-api-2.12.1.jar
-rwxrwxrwx 1 root root 1674433 Mar 31 2021 log4j-core-2.12.1.jar
-rwxrwxrwx 1 root root 23518 Mar 31 2021 log4j-slf4j-impl-2.12.1.jar
II. Run Flink SQL (everything is executed as the root user)
If you later get errors about missing Hadoop classes, you may need to add this line:
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
1. Start Flink: ${FLINK_HOME}/bin/start-cluster.sh
2. Then start the SQL client: ./sql-client.sh embedded -j ../hudi-flink-bundle_2.11-0.10.1.jar shell
If hudi-flink-bundle_2.11-0.10.1.jar has been placed in the lib directory, you do not need the -j ../hudi-flink-bundle_2.11-0.10.1.jar option; it is loaded automatically when the cluster starts in step 1.
3. Execute the SQL:
CREATE TABLE paat_hudi_flink_test(
id bigint ,
name string,
birthday TIMESTAMP(3),
ts TIMESTAMP(3),
`partition` VARCHAR(20),
primary key(id) not enforced -- a primary key (record key) must be specified
)
PARTITIONED BY (`partition`)
with(
'connector'='hudi',
'path' = 'hdfs://paat-dev/user/hive/hudi/warehouse/paat_ods_hudi1.db/' -- storage path, change to your own
, 'hoodie.datasource.write.recordkey.field' = 'id'
, 'write.precombine.field' = 'ts'
, 'write.tasks' = '1'
, 'compaction.tasks' = '1'
, 'write.rate.limit' = '2000'
, 'table.type' = 'MERGE_ON_READ'
, 'compaction.async.enable' = 'true'
, 'compaction.trigger.strategy' = 'num_commits'
, 'compaction.max_memory' = '1024'
, 'changelog.enable' = 'true'
, 'read.streaming.enable' = 'true'
, 'read.streaming.check-interval' = '4'
, 'hive_sync.enable' = 'true'
, 'hive_sync.mode'= 'hms'
, 'hive_sync.metastore.uris' = 'thrift://****:9083'
, 'hive_sync.jdbc_url' = 'jdbc:hive2://*****:10000'
, 'hive_sync.table' = 'paat_hudi_flink_test'
, 'hive_sync.db' = 'paat_ods_hudi'
, 'hive_sync.username' = 'root'
, 'hive_sync.password' = '*****'
, 'hive_sync.support_timestamp' = 'true'
);
1. About 'path' = 'hdfs://paat-dev/user/hive/hudi/warehouse/paat_ods_hudi1.db/' above: look up the details if you need them; if you are not using HDFS high availability, replace paat-dev with your NameNode address:port. Either way, it is simply an HDFS storage path, the same idea as the location shown by show create table in Hive.
2. 'hive_sync.mode' = 'hms': there is also another mode, jdbc.
3. The other options are easy to look up. 'hive_sync.username' = 'root' and 'hive_sync.password' = '*****' are the username and password for the Hive metadata; Hive's metadata is usually stored in MySQL, so these are that user's credentials.
At this point the path hdfs://paat-dev/user/hive/hudi/warehouse/paat_ods_hudi1.db/ does not exist yet; you have to insert a row first: insert into paat_hudi_flink_test select 184,'test12',TIMESTAMP '1970-01-01 00:00:01',TIMESTAMP '1974-01-01 00:00:01','part2';
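To sanity-check the write, you can read the table back from the same Flink SQL client (because read.streaming.enable is 'true' this runs as a streaming query; it is only a quick check):
select * from paat_hudi_flink_test;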
The sync will create Hudi external tables in Hive.
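A sketch of verifying the sync on the Hive side (for a MERGE_ON_READ table, Hudi's Hive sync typically creates read-optimized and real-time tables suffixed _ro and _rt; the exact names may differ in your setup):
use paat_ods_hudi;
show tables;
show create table paat_hudi_flink_test_ro;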
4. Create the Kafka table
Put flink-sql-connector-kafka.jar into the Flink lib directory;
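For example (the jar name matches the one in the lib listing above; adjust the source path to wherever you downloaded it):
cp flink-sql-connector-kafka_2.11-1.13.3.jar ${FLINK_HOME}/lib/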
CREATE TABLE t_source (
id bigint ,
name string,
birthday TIMESTAMP(3),
ts TIMESTAMP(3),
`partition` string
) WITH (
'connector' = 'kafka', -- use the kafka connector
'topic' = 'paat_sync_flink_hudi_test', -- kafka topic name
'scan.startup.mode' = 'earliest-offset', -- read from the earliest offset
'properties.bootstrap.servers' = '****:9092,****:9092,****:9092', -- kafka broker addresses
'properties.group.id' = 'paat_sync_flink_hudi',
'value.format' = 'json',
'value.json.fail-on-missing-field' = 'true',
'value.fields-include' = 'ALL'
);
insert into paat_hudi_flink_test select id,name,birthday,ts,`partition` from t_source;
The message body written to Kafka:
{"id":0,"name":"我是测试210","birthday":"2022-09-27 10:10:44","ts":"2022-09-27 10:10:44","partition":"part3"}
That's it; there should be no further issues.