Summary
1. This article integrates HBase with the latest Flink 1.14.3, in which many of the older APIs are deprecated.
2. Integrating Flink 1.14.3 with HBase raises a number of compatibility issues.
3. Working example code for Flink 1.14.3 with HBase is scarce; even the official documentation leaves many details unexplained, and there are plenty of pitfalls.
1 Requirement
Requirement: read table data from the HBase NoSQL database with the Flink Table API.
2 Add Maven Dependencies
Integrating Flink Table with HBase requires the following dependencies:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-math3</artifactId>
<version>3.5</version>
</dependency>
<!-- If the HBase cluster is a 1.x release, use the 1.4 suffix -->
<!-- If the HBase cluster is a 2.x release, use the 2.2 suffix -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-hbase-1.4_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<artifactId>commons-math3</artifactId>
<groupId>org.apache.commons</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
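The snippet above references the ${flink.version} and ${hbase.version} properties without defining them. A minimal <properties> block might look like the following; 1.14.3 matches the version used in this article, while the HBase version shown is an assumption that should be replaced with your cluster's actual release:

```xml
<properties>
    <!-- Flink version used throughout this article -->
    <flink.version>1.14.3</flink.version>
    <!-- Assumed HBase 1.x release; use your cluster's actual version -->
    <hbase.version>1.4.13</hbase.version>
</properties>
```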
Note: the commons-math3 dependency must be excluded from hbase-server to avoid a version conflict.
3 Prepare the HBase Data Source
# Drop the clicklog table if it already exists
disable 'clicklog'
drop 'clicklog'
# Create the clicklog table with a single column family 'f'
create 'clicklog','f'
# Insert the sample data set
put 'clicklog','001','f:username','Mary'
put 'clicklog','001','f:url','./home'
put 'clicklog','001','f:cTime','2022-02-02 12:00:00'
put 'clicklog','002','f:username','Bob'
put 'clicklog','002','f:url','./prod?id=1'
put 'clicklog','002','f:cTime','2022-02-02 12:00:05'
put 'clicklog','003','f:username','Liz'
put 'clicklog','003','f:url','./prod?id=7'
put 'clicklog','003','f:cTime','2022-02-02 12:01:45'
Note: in HBase the field user has to be replaced with username, because user is a reserved keyword and queries against it would fail.
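If renaming the column in HBase is not an option, Flink SQL (like most Calcite-based dialects) also allows escaping a reserved word by wrapping it in backticks. The sketch below only shows how such a query string could be built in plain Java; quoteIdentifier is a hypothetical helper for illustration, not part of any Flink API:

```java
public class ReservedWordEscape {

    // Wrap an identifier in backticks so Flink SQL treats it as a plain
    // column name even if it collides with a reserved keyword such as user.
    // A literal backtick inside the name is doubled, per SQL quoting rules.
    static String quoteIdentifier(String name) {
        return "`" + name.replace("`", "``") + "`";
    }

    public static void main(String[] args) {
        String query = "SELECT rowkey, f." + quoteIdentifier("user")
                + " FROM sourceTable";
        System.out.println(query);
        // prints: SELECT rowkey, f.`user` FROM sourceTable
    }
}
```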
4 Code Implementation
The complete code for reading HBase with the Flink Table API is shown below.
package com.bigdata.chap02;
import org.apache.flink.table.api.*;
public class FlinkTableAPIFromHBase {
public static void main(String[] args) {
//1. Create the TableEnvironment
EnvironmentSettings settings = EnvironmentSettings
.newInstance()
.inBatchMode()
.build();
TableEnvironment tEnv = TableEnvironment.create(settings);
//2. Create the source table
String sql = "CREATE TABLE sourceTable (" +
" rowkey STRING," +
" f ROW<username STRING,url STRING,cTime STRING>," +
" PRIMARY KEY (rowkey) NOT ENFORCED" +
" ) WITH (" +
" 'connector' = 'hbase-1.4' ," +
" 'table-name' = 'clicklog' ," +
" 'zookeeper.quorum' = 'hadoop1:2181'" +
" )";
tEnv.executeSql(sql);
//3. Create the sink table
final Schema schema = Schema.newBuilder()
.column("rowkey", DataTypes.STRING())
.column("username", DataTypes.STRING())
.column("url", DataTypes.STRING())
.column("cTime", DataTypes.STRING())
.build();
tEnv.createTemporaryTable("sinkTable", TableDescriptor.forConnector("print")
.schema(schema)
.build());
//4. Run the Flink SQL query
Table resultTable = tEnv.sqlQuery("select rowkey,f.username,f.url, f.cTime from sourceTable limit 10");
//5. Output (executeInsert also triggers job execution; no separate tEnv.execute() call is needed)
resultTable.executeInsert("sinkTable");
}
}
5 Key Code Explanation
# Create the source table
String sql = "CREATE TABLE sourceTable (" +
" rowkey STRING," +
" f ROW<username STRING,url STRING,cTime STRING>," +
" PRIMARY KEY (rowkey) NOT ENFORCED" +
" ) WITH (" +
" 'connector' = 'hbase-1.4' ," +
" 'table-name' = 'clicklog' ," +
" 'zookeeper.quorum' = 'hadoop1:2181'" +
" )";
tEnv.executeSql(sql);
Explanation:
(1) rowkey is the HBase row key, mapped as the primary key.
(2) f is the column family of the clicklog table.
(3) username, url, and cTime are columns within the f column family.
(4) connector takes one of two values: hbase-1.4 or hbase-2.2. For an HBase 1.x deployment use hbase-1.4; for HBase 2.x use hbase-2.2.
(5) table-name specifies the HBase table to query, here clicklog.
(6) zookeeper.quorum specifies the ZooKeeper cluster address.
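Hand-concatenating the DDL string is error-prone; the options above can instead be assembled with String.format. This is just a plain-Java sketch (buildHBaseDdl is a hypothetical helper for illustration, not part of Flink):

```java
public class HBaseDdlBuilder {

    // Assemble the CREATE TABLE statement from the three connector options.
    // connector must be "hbase-1.4" (HBase 1.x) or "hbase-2.2" (HBase 2.x).
    static String buildHBaseDdl(String connector, String tableName, String zkQuorum) {
        return String.format(
            "CREATE TABLE sourceTable (%n"
                + "  rowkey STRING,%n"
                + "  f ROW<username STRING, url STRING, cTime STRING>,%n"
                + "  PRIMARY KEY (rowkey) NOT ENFORCED%n"
                + ") WITH (%n"
                + "  'connector' = '%s',%n"
                + "  'table-name' = '%s',%n"
                + "  'zookeeper.quorum' = '%s'%n"
                + ")",
            connector, tableName, zkQuorum);
    }

    public static void main(String[] args) {
        // Same options as in the article's example
        System.out.println(buildHBaseDdl("hbase-1.4", "clicklog", "hadoop1:2181"));
    }
}
```

The resulting string can be passed to tEnv.executeSql() exactly as in the example above.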
# Flink SQL query
Table resultTable = tEnv.sqlQuery("select rowkey,f.username,f.url, f.cTime from sourceTable limit 10");
Explanation:
(1) sourceTable is the table name registered earlier with Flink Table.
(2) To query a column inside an HBase column family, prefix the column name with the family name, e.g. f.username, f.url, f.cTime.
6 Test Run
In IDEA, right-click the project and choose Run to launch the Flink Table job. If the console prints the following result, the Flink Table API has successfully read the data from HBase:
+I[001, Mary, ./home, 2022-02-02 12:00:00]
+I[002, Bob, ./prod?id=1, 2022-02-02 12:00:05]
+I[003, Liz, ./prod?id=7, 2022-02-02 12:01:45]