flink - sink - hive

Dependencies

All of the following dependencies can be put into Flink's lib directory and then declared as provided in the pom.

flink-connector-hive

Flink's core dependency for Hive integration.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-hive_${scala.version}</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
flink-shaded-hadoop

When no Hadoop environment is available, this dependency can be used as a substitute.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-hadoop-3</artifactId>
    <version>${hadoop_version}</version>
    <scope>provided</scope>
</dependency>
hive-exec

The Hive dependency. It should be placed after flink-shaded-hadoop so that the project resolves classes from flink-shaded-hadoop first.

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
    <exclusions>
        <!-- remove conflicting dependencies -->
        <exclusion>
            <groupId>org.apache.calcite</groupId>
            <artifactId>calcite-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.calcite</groupId>
            <artifactId>calcite-druid</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.calcite.avatica</groupId>
            <artifactId>avatica</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.calcite</groupId>
            <artifactId>calcite-avatica</artifactId>
        </exclusion>
    </exclusions>
    <scope>provided</scope>
</dependency>
Approach

Convert the DataStream into a Flink Table, then write it into the Hive table through the Hive catalog.

Writing to a non-partitioned Hive table

val streamEnv = ...
val dataStream = ...
val streamTableEnv = ...
streamTableEnv.createTemporaryView("自定义catalog表名", dataStream, *fields) # 当前flink存在bug,转换时必须指定fields或者schema,否则watermark无法流入table
val catalog = ...
streamTableEnv.registerCatalog("hive", catalog)
streamTableEnv.useCatalog("hive")
streamTableEnv.executeSql("insert sql").print()
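
For reference, here is a more complete sketch of this flow in Scala. The catalog name, hive-site.xml directory, view name, target table and INSERT statement are illustrative assumptions, not values from the original setup.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.catalog.hive.HiveCatalog

case class UserCnt(name: String, cnt: Long)

val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
val dataStream: DataStream[UserCnt] = streamEnv.fromElements(UserCnt("alice", 1L))

val streamTableEnv = StreamTableEnvironment.create(streamEnv)
// the view is registered in the default catalog (default_catalog.default_database)
streamTableEnv.createTemporaryView("user_cnt_view", dataStream)

// register and switch to the Hive catalog ("/etc/hive/conf" must contain hive-site.xml)
val catalog = new HiveCatalog("hive", "default", "/etc/hive/conf")
streamTableEnv.registerCatalog("hive", catalog)
streamTableEnv.useCatalog("hive")

// write into an existing non-partitioned Hive table, e.g. default.user_cnt
streamTableEnv.executeSql(
  "insert into user_cnt select name, cnt from default_catalog.default_database.user_cnt_view"
).print()
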
Writing to a partitioned Hive table
  1. Checkpointing must be enabled on streamEnv to guarantee write consistency when Flink writes into the partitioned Hive table.
  2. The Hive table DDL needs to declare the following TBLPROPERTIES:
  • sink.partition-commit.trigger: the partition commit trigger; single choice, valid values are partition-time and process-time (default). ==partition-time decides from the watermark of the incoming data whether a partition can be committed: once the watermark passes the time extracted from the partition plus the delay (watermark >= partition time + delay), the partition metadata is committed==; process-time decides from the current system time: once the system time passes the partition's creation time plus the delay, the partition metadata is committed.
  • partition.time-extractor.timestamp-pattern: used with the partition-time trigger. It describes how to assemble the partition time from the table's partition fields; ==the extracted value must be in the format yyyy-MM-dd HH:mm:ss==. For example, if the field dt has the format yyyy-MM-dd, configuring '$dt 00:00:00' makes the partition time equal to dt's value at 00:00:00; several partition fields can be combined (see the DDL sketch after this list). When no conforming value can be assembled from the partition fields, use a custom extractor via partition.time-extractor.class.
  • sink.partition-commit.delay: how long to wait after the partition time before committing the partition, i.e. the maximum event-time out-of-orderness tolerated by the watermark; used with the partition-time trigger; default 0s.
  • sink.partition-commit.policy.kind: the partition commit policy; multiple values allowed, valid values are metastore, success-file and custom. metastore adds the partition to the metastore, success-file writes a marker file into the partition's HDFS directory, and custom uses a user-defined commit policy; the combination metastore,success-file is the most common.
  • partition.time-extractor.kind: required when a custom partition time extractor is used; set the value to custom.
  • partition.time-extractor.class: required when a custom partition time extractor is used; set the value to the fully qualified class name of the extractor. When running on a cluster, package this class into a jar and put it into Flink's lib directory.
  • After a partition's commit has been triggered, data for that partition that arrives later is still written into that Hive partition.
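
As a concrete illustration of the properties above, here is a hedged DDL sketch for a table partitioned by day and hour; the table name, columns and the 1 h delay are assumptions, not values from the original setup. Such a DDL is typically executed in Hive itself or through Flink with the Hive dialect enabled.

# illustrative ddl, assuming partition fields dt (format yyyy-MM-dd) and hr (format HH)
create table ods.user_cnt (
    name string,
    cnt bigint
)
partitioned by (dt string, hr string)
stored as orc
tblproperties (
    'sink.partition-commit.trigger' = 'partition-time',
    'partition.time-extractor.timestamp-pattern' = '$dt $hr:00:00',
    'sink.partition-commit.delay' = '1 h',
    'sink.partition-commit.policy.kind' = 'metastore,success-file'
)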

# trigger partition commit by process-time
# hive table ddl
create table xxx(...)
partitioned by(...)
stored as orc
tblproperties(
    'sink.partition-commit.trigger' = 'process-time',
    'sink.partition-commit.policy.kind'='metastore,success-file'
)

# flink code
val streamEnv = ...
streamEnv.enableCheckpointing(...)
val dataStream = ...
val streamTableEnv = ...
streamTableEnv.createTemporaryView("custom view name", dataStream)
val catalog = ...
streamTableEnv.registerCatalog("hive", catalog)
streamTableEnv.useCatalog("hive")
streamTableEnv.executeSql("insert sql").print()

# trigger partition commit by partition-time
# hive table ddl, using the default partition time extractor
create table xxx(...)
partitioned by(...)
stored as orc
tblproperties(
    'sink.partition-commit.trigger' = 'partition-time',
    'sink.partition-commit.policy.kind'='metastore,success-file',
    'partition.time-extractor.timestamp-pattern' = ...,
    'sink.partition-commit.delay' = ...
)

# flink code
val streamEnv = ...
streamEnv.enableCheckpointing(...)
val dataStream = ...  // watermark already assigned
val streamTableEnv = ...
streamTableEnv.createTemporaryView("custom view name", dataStream)
val catalog = ...
streamTableEnv.registerCatalog("hive", catalog)
streamTableEnv.useCatalog("hive")
streamTableEnv.executeSql("insert sql").print()
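
Since the partition-time trigger works off the watermark, the stream must carry a watermark and, because of the bug mentioned earlier, the fields (including a rowtime attribute) should be passed explicitly when creating the view. A minimal sketch, assuming an Event(name: String, ts: Long) payload with ts in epoch milliseconds and 5 s of allowed out-of-orderness (all names here are illustrative):

import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

case class Event(name: String, ts: Long)

val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
streamEnv.enableCheckpointing(60000)

val dataStream: DataStream[Event] = streamEnv
  .fromElements(Event("a", System.currentTimeMillis()))
  .assignTimestampsAndWatermarks(
    WatermarkStrategy
      .forBoundedOutOfOrderness[Event](Duration.ofSeconds(5))
      .withTimestampAssigner(new SerializableTimestampAssigner[Event] {
        override def extractTimestamp(e: Event, recordTs: Long): Long = e.ts
      })
  )

val streamTableEnv = StreamTableEnvironment.create(streamEnv)
// passing the fields explicitly and marking ts as rowtime lets the watermark flow into the table
streamTableEnv.createTemporaryView("event_view", dataStream, $"name", $"ts".rowtime)
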

# trigger partition commit by partition-time
# hive table ddl, using a custom partition time extractor
create table xxx(...)
partitioned by(...)
stored as orc
tblproperties(
    'sink.partition-commit.trigger' = 'partition-time',
    'sink.partition-commit.policy.kind'='metastore,success-file',
    'partition.time-extractor.timestamp-pattern' = ...,
    'sink.partition-commit.delay' = ...,
    'partition.time-extractor.kind' = 'custom',
    'partition.time-extractor.class' = 'fully qualified class name of the custom extractor'
)

# flink code
// The custom extractor below derives a LocalDateTime from a timestamp, matching the
// yyyy-MM-dd HH:mm:ss format that partition.time-extractor requires.
import java.util
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
// in newer Flink versions the interface lives in org.apache.flink.connector.file.table
import org.apache.flink.table.filesystem.PartitionTimeExtractor

class ExtractPartitionTimeFromUnixTimeStamp extends PartitionTimeExtractor {
    // "yyyy-MM-dd HH:mm:ss" is not ISO-8601, so parsing needs an explicit formatter
    private val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

    // curPartitionValueList holds the partition-field values of one concrete partition; this method is
    // called once per partition to extract its time and decide whether the partition can be committed.
    // ConvertTimestampToDateTimeFunc (user code, sketched after this example) returns a string in
    // yyyy-MM-dd HH:mm:ss format; the returned LocalDateTime must match the partition granularity,
    // e.g. yyyy-MM-dd 00:00:00 for daily partitions.
    override def extract(curPartitionKeyList: util.List[String], curPartitionValueList: util.List[String]): LocalDateTime = {
      LocalDateTime.parse(ConvertTimestampToDateTimeFunc(curPartitionValueList.get(0)), formatter)
    }
}

val streamEnv = ...
streamEnv.enableCheckpointing(...)
val dataStream = ...  // watermark already assigned
val streamTableEnv = ...
streamTableEnv.createTemporaryView("custom view name", dataStream)
val catalog = ...
streamTableEnv.registerCatalog("hive", catalog)
streamTableEnv.useCatalog("hive")
streamTableEnv.executeSql("insert sql").print()
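
ConvertTimestampToDateTimeFunc is not defined in the original post. A minimal sketch of such a helper, assuming the partition value is an epoch-millisecond timestamp string rendered in the system time zone, could look like this:

// hypothetical helper, assuming the partition value is an epoch-millisecond string
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object ConvertTimestampToDateTimeFunc {
  private val formatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.systemDefault())

  // formats e.g. "1620000000000" as a "yyyy-MM-dd HH:mm:ss" string
  def apply(epochMillis: String): String =
    formatter.format(Instant.ofEpochMilli(epochMillis.trim.toLong))
}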



 

Below is an example of writing data into a Hive table with Flink. The catalog name, table, fields and the hard-coded dt partition value are illustrative and should be adapted to the concrete scenario:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class FlinkHiveSinkDemo {

    public static void main(String[] args) throws Exception {
        // create the streaming environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.enableCheckpointing(60_000); // needed for committing Hive partitions

        // create the table environment (blink planner, streaming mode)
        EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

        // register the Hive catalog and make it the current catalog
        String catalogName = "myhive";
        String defaultDatabase = "default";
        String hiveConfDir = "/path/to/hive/conf";
        HiveCatalog hiveCatalog = new HiveCatalog(catalogName, defaultDatabase, hiveConfDir);
        tableEnv.registerCatalog(catalogName, hiveCatalog);
        tableEnv.useCatalog(catalogName);

        // create the partitioned Hive table (Hive dialect) if it does not exist yet
        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        tableEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS mytable (name STRING, age INT, gender STRING) "
                        + "PARTITIONED BY (dt STRING) STORED AS ORC TBLPROPERTIES ("
                        + "  'sink.partition-commit.trigger' = 'process-time',"
                        + "  'sink.partition-commit.policy.kind' = 'metastore,success-file')");
        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);

        // turn the DataStream into a view and write it into the Hive table
        DataStream<Person> stream = env.fromElements(
                new Person("Alice", 18, "F"),
                new Person("Bob", 20, "M"));
        tableEnv.createTemporaryView("people", stream);
        tableEnv.executeSql(
                "INSERT INTO mytable SELECT name, age, gender, '2021-05-01' AS dt FROM people").print();
    }

    public static class Person {
        public String name;
        public int age;
        public String gender;

        public Person() {}

        public Person(String name, int age, String gender) {
            this.name = name;
            this.age = age;
            this.gender = gender;
        }
    }
}
```

In this example a Hive catalog is registered first and a partitioned Hive table is created through the Hive dialect; a DataStream is then converted into a temporary view and written into the Hive table with an INSERT statement. Adjust table names, fields and the partition value to the concrete business scenario.