Demo: Writing Data into Hive with Flink 1.11.2

  • Environment preparation
1. Start the Hadoop cluster and the Hive metastore service.
2. Configure flink-conf.yaml and sql-client-defaults.yaml (a sample catalog entry is sketched after this list).
    Note: checkpointing must be enabled, otherwise Flink cannot commit partitions.
3. Start the Flink cluster.
    Start: yarn-session.sh -n 3 -s 3 -nm flink-session -d  (note: -n is deprecated/removed in recent Flink versions; TaskManagers are allocated on demand)
    Stop:  yarn application -kill <applicationId>
4. Start the Kafka cluster.
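A minimal Hive catalog entry for sql-client-defaults.yaml, following the Flink 1.11 SQL client format, might look like the sketch below (the hive-conf-dir path and default-database are assumptions; point them at your own installation):

```yaml
catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /opt/hive/conf    # assumed path; must contain hive-site.xml
    default-database: default        # assumed database name
```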
    
  • flink-conf.yaml configuration
# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# <class-name-of-factory>.
#
state.backend: filesystem

# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
state.checkpoints.dir: hdfs://hadoop001:9000/flink-checkpoints

# Default target directory for savepoints, optional.
#
state.savepoints.dir: hdfs://hadoop001:9000/flink-savepoints

# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend).
#
# state.backend.incremental: false

# Retain the checkpoint files on HDFS after the job is cancelled
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# Checkpoint interval in milliseconds
execution.checkpointing.interval: 60000

execution.checkpointing.mode: EXACTLY_ONCE
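After a job has been running for at least one checkpoint interval, checkpoint data should appear under the configured directory. A quick sanity check (path taken from the configuration above):

```shell
hdfs dfs -ls hdfs://hadoop001:9000/flink-checkpoints
```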

# Start the SQL client
./bin/sql-client.sh embedded

-- Switch to the Hive catalog
USE CATALOG myhive;
-- Use the Hive dialect
SET table.sql-dialect=hive;
CREATE TABLE hive_table (
  user_id STRING,
  order_amount DOUBLE
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
  -- extract the partition timestamp at hour granularity
  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
  -- commit a partition based on the time extracted from it plus the watermark
  'sink.partition-commit.trigger'='partition-time',
  -- one-hour delay: the partition is committed once watermark > partition time + 1 hour
  'sink.partition-commit.delay'='1 h',
  -- commit policy: first update the metastore (addPartition), then write a SUCCESS file
  'sink.partition-commit.policy.kind'='metastore,success-file'
);
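To make the commit trigger concrete, here is a worked example under the pattern and delay configured above:

```sql
-- A record with log_ts = 2020-05-20 12:34:56 lands in partition dt='2020-05-20', hr='12'.
-- The extractor maps that partition to the timestamp 2020-05-20 12:00:00 ('$dt $hr:00:00').
-- With a delay of '1 h', the partition is committed (added to the metastore and
-- marked with a success file) once the watermark passes 2020-05-20 13:00:00.
```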

-- Switch back to the default dialect
SET table.sql-dialect=default;
CREATE TABLE kafka_table (
  user_id STRING,
  order_amount DOUBLE,
  log_ts TIMESTAMP(3),
  WATERMARK FOR log_ts AS log_ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'kafka_table',
  'scan.startup.mode' = 'earliest-offset',
  'properties.group.id' = 'group1',
  'properties.bootstrap.servers' = 'hadoop001:9092,hadoop002:9092,hadoop003:9092',
  'format' = 'json',
  'json.fail-on-missing-field' = 'true',
  'json.ignore-parse-errors' = 'false'
);
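For a quick end-to-end test, a record matching this schema can be pushed with Kafka's console producer (the script name and flags depend on your Kafka version; topic and broker are taken from the DDL above). With the json format's default 'SQL' timestamp standard, log_ts is expected as 'yyyy-MM-dd HH:mm:ss':

```shell
kafka-console-producer.sh --broker-list hadoop001:9092 --topic kafka_table
> {"user_id": "u_001", "order_amount": 12.5, "log_ts": "2020-05-20 12:34:56"}
```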

-- streaming sql: insert into the hive table as a continuous streaming job
INSERT INTO hive_table SELECT user_id, order_amount, DATE_FORMAT(log_ts, 'yyyy-MM-dd'), DATE_FORMAT(log_ts, 'HH') FROM kafka_table;

-- batch sql: select with partition pruning
SELECT * FROM hive_table WHERE dt='2020-05-20' and hr='12';
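Committed partitions can also be verified from the Hive side (beeline or the Hive CLI):

```sql
SHOW PARTITIONS hive_table;
-- e.g. dt=2020-05-20/hr=12, once that partition has been committed
```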
Below is sample code for writing data into Hive with Apache Flink 1.16 via the Table API (the insert is submitted with `executeSql`, which also launches the streaming job, so no separate `env.execute()` call is needed):

```java
// Imports
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

// Create the Flink streaming environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(60_000); // checkpointing is required for partition commits

// Create the Flink table environment
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

// Create and register a HiveCatalog
String catalogName = "my_catalog";
String defaultDatabase = "my_database";
String hiveConfDir = "path/to/hive/conf/dir"; // directory containing hive-site.xml
HiveCatalog hiveCatalog = new HiveCatalog(catalogName, defaultDatabase, hiveConfDir);
tableEnv.registerCatalog(catalogName, hiveCatalog);
tableEnv.useCatalog(catalogName);

// Create the target table in the HiveCatalog using the Hive dialect
tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
tableEnv.executeSql(
    "CREATE TABLE IF NOT EXISTS my_table (id INT, name STRING) "
        + "PARTITIONED BY (dt STRING) STORED AS PARQUET");

// Write data into the Hive table; the partition column dt is derived at insert time
tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
tableEnv.executeSql(
    "INSERT INTO my_table "
        + "SELECT id, name, DATE_FORMAT(NOW(), 'yyyy-MM-dd') AS dt "
        + "FROM my_source");
```

Here `my_catalog`, `my_database`, `my_table`, and `my_source` must be replaced with your own catalog, database, table, and source names, and `hiveConfDir` must point to the directory containing your Hive configuration files.
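Running this example also requires the Hive connector on the classpath; a minimal sketch of the Maven dependencies (versions are assumptions and must match your Flink and Hive installations):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-hive_2.12</artifactId>
  <version>1.16.0</version> <!-- assumed; match your Flink version -->
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>3.1.2</version> <!-- assumed; match your Hive version -->
</dependency>
```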
