Flink HiveCatalog

The Hive Metastore has become the de facto standard for metadata management in the Hadoop ecosystem, so Flink uses the HiveCatalog as its medium for persisting table metadata. For companies that run both Hive and Flink, this keeps metadata management in one place; for companies that run only Flink, the HiveCatalog is currently the only persistent metadata store Flink supports. Without persisted metadata, every step of the development workflow has to re-register Kafka and other sources into a temporary catalog via DDL, which wastes a lot of time and effort.

1. To use Hive as the metadata store, you must specify the location of the hive-site.xml configuration file, which contains the connection information for the MySQL (or other relational) database that stores the Hive metadata.
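
For reference, the relevant part of hive-site.xml usually looks like the sketch below; the host names, database name, user and password are placeholders, not values taken from this article:

<configuration>
  <!-- JDBC connection to the relational database backing the Hive Metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-db-host:3306/hive_metadata?useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
  <!-- Thrift URI of the metastore service, used by clients such as Flink -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>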

2. Configure the SQL client to access the HiveCatalog. Edit conf/sql-client-defaults.yaml under the Flink installation directory and change the catalogs section to:

#==============================================================================
# Catalogs
#==============================================================================

# Define catalogs here.

catalogs:
# A typical catalog definition looks like:
 - name: myhive
   type: hive
   hive-conf-dir: /apps/hive/conf
   hive-version: 2.3.5
   default-database: gmall

The execution section also needs to be modified:

execution:
  # select the implementation responsible for planning table programs
  # possible values are 'blink' (used by default) or 'old'
  planner: blink
  # 'batch' or 'streaming' execution
  type: streaming
  # allow 'event-time' or only 'processing-time' in sources
  time-characteristic: event-time
  # interval in ms for emitting periodic watermarks
  periodic-watermarks-interval: 200
  # 'changelog' or 'table' presentation of results
  result-mode: table
  # maximum number of maintained rows in 'table' presentation of results
  max-table-result-rows: 1000
  # parallelism of the program
  parallelism: 4
  # maximum parallelism
  max-parallelism: 128
  # minimum idle state retention in ms
  min-idle-state-retention: 0
  # maximum idle state retention in ms
  max-idle-state-retention: 0
  # current catalog ('default_catalog' by default)
  current-catalog: myhive
  # current database of the current catalog (default database of the catalog by default)
  current-database: flink
  # controls how table programs are restarted in case of a failures
  restart-strategy:
    # strategy type
    # possible values are "fixed-delay", "failure-rate", "none", or "fallback" (default)
    type: fallback

Set hive-conf-dir to the Hive configuration directory of your actual environment.
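
These defaults can also be overridden per session from inside the SQL CLI with the SET command; for example, switching the result presentation to changelog mode:

Flink SQL> SET execution.result-mode=changelog;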

Now start the SQL client with sql-client.sh embedded:


Flink SQL> use catalog myhive;    # switch to the HiveCatalog

Flink SQL> show databases;
default
gmall
test

Flink SQL> create database flink;
[INFO] Database has been created.

Flink SQL> use flink;

Flink SQL> CREATE TABLE log_user_cart (name STRING, logtime BIGINT, etime STRING, prod_url STRING) WITH (
>    'connector.type' = 'kafka',
>    'connector.version' = 'universal',
>    'connector.topic' = 'test_log_user_cart',
>    'connector.properties.zookeeper.connect' = 'm.hadoop.com:2181,s1.hadoop.com:2181,s2.hadoop.com:2181',
>    'connector.properties.bootstrap.servers' = 'm.hadoop.com:9092,s1.hadoop.com:9092,s2.hadoop.com:9092',
>    'format.type' = 'csv'
> );
[INFO] Table has been created.

Flink SQL> 

Open a Hive CLI in another terminal window; the database created above is visible there:

hive (flink)> show databases;
OK
database_name
default
flink
gmall
test
Time taken: 0.019 seconds, Fetched: 4 row(s)
hive (flink)> use flink;
OK
Time taken: 0.025 seconds
hive (flink)> show tables;
OK
tab_name
log_user_cart
Time taken: 0.026 seconds, Fetched: 1 row(s)
hive (flink)> select * from log_user_cart;
FAILED: SemanticException Line 0:-1 Invalid column reference 'TOK_ALLCOLREF'
hive (flink)> 

The database and table created from the sql-client are visible in Hive, but the table cannot be queried from Hive: it is a Kafka-backed Flink table, whose schema and connector options are stored in the metastore as table properties that Hive itself cannot interpret.

From the sql-client, on the other hand, the table can be queried:
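
For example, a plain scan reads records from the Kafka topic as they arrive, and in the default table result mode the view keeps updating:

Flink SQL> SELECT * FROM log_user_cart;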

3. Accessing the HiveCatalog from code:

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
import org.apache.flink.table.catalog.hive.HiveCatalog

// Blink planner; batch mode is sufficient for registering and browsing the catalog
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val tableEnv = TableEnvironment.create(settings)

val name            = "myhive"
val defaultDatabase = "gmall"
val hiveConfDir     = "/apps/hive/conf" // a local path containing hive-site.xml
val version         = "2.3.5"

val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
tableEnv.registerCatalog("myhive", hive)
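
Once registered, the HiveCatalog behaves like any other catalog. The lines below are a minimal continuation of the snippet above; they assume the flink database and the log_user_cart table created from the sql-client earlier still exist:

tableEnv.useCatalog("myhive")   // make the HiveCatalog the current catalog
tableEnv.useDatabase("flink")   // switch to the database created earlier
// the Kafka-backed table registered from the sql-client is visible here as well
val cartLog = tableEnv.sqlQuery("SELECT name, prod_url FROM log_user_cart")
cartLog.printSchema()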

 

Note: to make this work, jars have to be copied from the Hive installation's lib directory (and from other sources, see below) into Flink's lib directory. The required jars include:

    167761 Mar 19 20:50 antlr-runtime-3.5.2.jar
    366748 Mar 19 20:48 datanucleus-api-jdo-4.2.4.jar
   2016766 Mar 19 20:48 datanucleus-core-4.1.17.jar
   1908681 Mar 19 20:48 datanucleus-rdbms-4.1.19.jar
     74416 Mar 20 20:57 flink-avro-1.10.0.jar
    357555 Mar 19 20:33 flink-connector-hive_2.11-1.10.0.jar
     81579 Mar 20 20:54 flink-connector-kafka_2.11-1.10.0.jar
    107027 Mar 20 21:01 flink-connector-kafka-base_2.11-1.10.0.jar
     37311 Mar 20 20:55 flink-csv-1.10.0.jar
 110055308 Feb  8 02:54 flink-dist_2.11-1.10.0.jar
     37966 Mar 20 20:56 flink-hadoop-fs-1.10.0.jar
     43007 Mar 18 23:45 flink-json-1.10.0.jar
  43317025 Mar 18 23:45 flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
  19301237 Feb  8 02:54 flink-table_2.11-1.10.0.jar
     55339 Mar 19 20:34 flink-table-api-scala-bridge_2.11-1.10.0.jar
  22520058 Feb  8 02:54 flink-table-blink_2.11-1.10.0.jar
    241622 Mar 19 00:28 gson-2.8.5.jar
  34271938 Mar 19 20:32 hive-exec-2.3.5.jar
    348625 Mar 19 22:27 jackson-core-2.10.1.jar
    249790 Mar 19 20:50 javax.jdo-3.2.0-m3.jar
   2736313 Mar 18 23:45 kafka-clients-2.3.0.jar
    489884 Sep  2  2019 log4j-1.2.17.jar
   1007502 Mar 19 20:49 mysql-connector-java-5.1.47.jar
      9931 Sep  2  2019 slf4j-log4j12-1.7.15.jar

Some jars have to be downloaded manually from Maven, e.g. gson-2.8.5.jar; searching for "maven gson" on Baidu (or any search engine) leads straight to the artifact.
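
As a sketch, such a jar can also be pulled straight from Maven Central into the lib directory (the Flink installation path below is only an assumption):

cd /opt/flink-1.10.0/lib    # adjust to your Flink installation
wget https://repo1.maven.org/maven2/com/google/code/gson/gson/2.8.5/gson-2.8.5.jar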

