Flink HiveCatalog

The Hive Metastore has become the de facto standard for metadata management in the Hadoop ecosystem, so Flink uses the HiveCatalog as its medium for persisting table metadata. For companies that run both Hive and Flink, this keeps metadata management in one place; for companies that run only Flink, the HiveCatalog is currently the only persistent metadata store Flink supports. Without persisted metadata, every step of the development workflow has to re-register Kafka and other sources into a temporary catalog via DDL, which wastes a lot of time and effort.

1. To use Hive as the metadata store, you must specify the location of the hive-site.xml configuration file, which contains the connection information for the MySQL (or other relational) database that stores the Hive metadata.
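
For reference, the relevant part of hive-site.xml usually looks like the sketch below; the host names, database name, user and password are placeholders, not values taken from this article:

<configuration>
  <!-- JDBC connection to the relational database backing the Hive Metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-db-host:3306/hive_metadata?useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
  <!-- Thrift URI of the metastore service, used by clients such as Flink -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>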

2. Configure the SQL client to access the HiveCatalog. Edit conf/sql-client-defaults.yaml under the Flink installation directory and change the catalogs section to:

#==============================================================================
# Catalogs
#==============================================================================

# Define catalogs here.

catalogs:
# A typical catalog definition looks like:
 - name: myhive
   type: hive
   hive-conf-dir: /apps/hive/conf
   hive-version: 2.3.5
   default-database: gmall

The execution section also needs to be modified:

execution:
  # select the implementation responsible for planning table programs
  # possible values are 'blink' (used by default) or 'old'
  planner: blink
  # 'batch' or 'streaming' execution
  type: streaming
  # allow 'event-time' or only 'processing-time' in sources
  time-characteristic: event-time
  # interval in ms for emitting periodic watermarks
  periodic-watermarks-interval: 200
  # 'changelog' or 'table' presentation of results
  result-mode: table
  # maximum number of maintained rows in 'table' presentation of results
  max-table-result-rows: 1000
  # parallelism of the program
  parallelism: 4
  # maximum parallelism
  max-parallelism: 128
  # minimum idle state retention in ms
  min-idle-state-retention: 0
  # maximum idle state retention in ms
  max-idle-state-retention: 0
  # current catalog ('default_catalog' by default)
  current-catalog: myhive
  # current database of the current catalog (default database of the catalog by default)
  current-database: flink
  # controls how table programs are restarted in case of a failures
  restart-strategy:
    # strategy type
    # possible values are "fixed-delay", "failure-rate", "none", or "fallback" (default)
    type: fallback

Set hive-conf-dir to the Hive configuration directory of your actual environment.
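
These defaults can also be overridden per session from inside the SQL CLI with the SET command; for example, switching the result presentation to changelog mode:

Flink SQL> SET execution.result-mode=changelog;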

Now start the SQL client with sql-client.sh embedded:


Flink SQL> use catalog myhive;    # switch to the HiveCatalog

Flink SQL> show databases;
default
gmall
test

Flink SQL> create database flink;
[INFO] Database has been created.

Flink SQL> use flink;

Flink SQL> CREATE TABLE log_user_cart (name STRING, logtime BIGINT, etime STRING, prod_url STRING) WITH (
>    'connector.type' = 'kafka',
>    'connector.version' = 'universal',
>    'connector.topic' = 'test_log_user_cart',
>    'connector.properties.zookeeper.connect' = 'm.hadoop.com:2181,s1.hadoop.com:2181,s2.hadoop.com:2181',
>    'connector.properties.bootstrap.servers' = 'm.hadoop.com:9092,s1.hadoop.com:9092,s2.hadoop.com:9092',
>    'format.type' = 'csv'
> );
[INFO] Table has been created.

Flink SQL> 

Open a Hive CLI in another terminal window; the database created above is visible there:

hive (flink)> show databases;
OK
database_name
default
flink
gmall
test
Time taken: 0.019 seconds, Fetched: 4 row(s)
hive (flink)> use flink;
OK
Time taken: 0.025 seconds
hive (flink)> show tables;
OK
tab_name
log_user_cart
Time taken: 0.026 seconds, Fetched: 1 row(s)
hive (flink)> select * from log_user_cart;
FAILED: SemanticException Line 0:-1 Invalid column reference 'TOK_ALLCOLREF'
hive (flink)> 

The database and table created from the sql-client are visible in Hive, but the table cannot be queried from Hive: it is a Kafka-backed Flink table, whose schema and connector options are stored in the metastore as table properties that Hive itself cannot interpret.

From the sql-client, on the other hand, the table can be queried:
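
For example, a plain scan reads records from the Kafka topic as they arrive, and in the default table result mode the view keeps updating:

Flink SQL> SELECT * FROM log_user_cart;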

3. Accessing the HiveCatalog from code:

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
import org.apache.flink.table.catalog.hive.HiveCatalog

// Blink planner; batch mode is sufficient for registering and browsing the catalog
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val tableEnv = TableEnvironment.create(settings)

val name            = "myhive"
val defaultDatabase = "gmall"
val hiveConfDir     = "/apps/hive/conf" // a local path containing hive-site.xml
val version         = "2.3.5"

val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
tableEnv.registerCatalog("myhive", hive)
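
Once registered, the HiveCatalog behaves like any other catalog. The lines below are a minimal continuation of the snippet above; they assume the flink database and the log_user_cart table created from the sql-client earlier still exist:

tableEnv.useCatalog("myhive")   // make the HiveCatalog the current catalog
tableEnv.useDatabase("flink")   // switch to the database created earlier
// the Kafka-backed table registered from the sql-client is visible here as well
val cartLog = tableEnv.sqlQuery("SELECT name, prod_url FROM log_user_cart")
cartLog.printSchema()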

 

Note: to make this work, jars have to be copied from the Hive installation's lib directory (and from other sources, see below) into Flink's lib directory. The required jars include:

    167761 Mar 19 20:50 antlr-runtime-3.5.2.jar
    366748 Mar 19 20:48 datanucleus-api-jdo-4.2.4.jar
   2016766 Mar 19 20:48 datanucleus-core-4.1.17.jar
   1908681 Mar 19 20:48 datanucleus-rdbms-4.1.19.jar
     74416 Mar 20 20:57 flink-avro-1.10.0.jar
    357555 Mar 19 20:33 flink-connector-hive_2.11-1.10.0.jar
     81579 Mar 20 20:54 flink-connector-kafka_2.11-1.10.0.jar
    107027 Mar 20 21:01 flink-connector-kafka-base_2.11-1.10.0.jar
     37311 Mar 20 20:55 flink-csv-1.10.0.jar
 110055308 Feb  8 02:54 flink-dist_2.11-1.10.0.jar
     37966 Mar 20 20:56 flink-hadoop-fs-1.10.0.jar
     43007 Mar 18 23:45 flink-json-1.10.0.jar
  43317025 Mar 18 23:45 flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
  19301237 Feb  8 02:54 flink-table_2.11-1.10.0.jar
     55339 Mar 19 20:34 flink-table-api-scala-bridge_2.11-1.10.0.jar
  22520058 Feb  8 02:54 flink-table-blink_2.11-1.10.0.jar
    241622 Mar 19 00:28 gson-2.8.5.jar
  34271938 Mar 19 20:32 hive-exec-2.3.5.jar
    348625 Mar 19 22:27 jackson-core-2.10.1.jar
    249790 Mar 19 20:50 javax.jdo-3.2.0-m3.jar
   2736313 Mar 18 23:45 kafka-clients-2.3.0.jar
    489884 Sep  2  2019 log4j-1.2.17.jar
   1007502 Mar 19 20:49 mysql-connector-java-5.1.47.jar
      9931 Sep  2  2019 slf4j-log4j12-1.7.15.jar

Some jars have to be downloaded manually from Maven, e.g. gson-2.8.5.jar; searching for "maven gson" on Baidu (or any search engine) leads straight to the artifact.
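
As a sketch, such a jar can also be pulled straight from Maven Central into the lib directory (the Flink installation path below is only an assumption):

cd /opt/flink-1.10.0/lib    # adjust to your Flink installation
wget https://repo1.maven.org/maven2/com/google/code/gson/gson/2.8.5/gson-2.8.5.jar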

