Flink: Writing to HDFS in Buckets with BucketingSink for Easy Hive Queries

Requirement:

Use BucketingSink as a bucketed sink, with one bucket (i.e. one folder) per hour of event time, so the data can be queried conveniently from Hive.

How the bucketed files work

  • Each bucket folder contains a number of files named like _part-8-0.in-progress, _part-8-0.pending, and part-8-0, corresponding to the in-progress, pending, and finished states respectively (see the example listing after this list).
  • A file moves from in-progress to pending when it is closed. A file is closed once nothing has been written to it for a while; that threshold is set with setInactiveBucketThreshold() (default 1 minute), and the interval at which this condition is checked is set with setInactiveBucketCheckInterval().
  • In part-8-0, the second dash-separated number is determined by the parallelism: it is the subtask index, so with a parallelism of 100 it ranges from 0 to 99.
  • In part-8-0, the third dash-separated number counts how many files this parallel subtask (the one identified by the 8 in the middle) has written so far. A new file is rolled over when either of two conditions is met:
    • The file reaches a certain size, set with setBatchSize(1024 * 1024 * 500); // this is 500 MB
    • The file has been written to for a certain amount of time, set with setBatchRolloverInterval(20 * 60 * 1000); // this is 20 mins
  • The transition from pending to finished is triggered by checkpoints.
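
For illustration, one hourly bucket folder on HDFS might look like the listing below at some point in time. The folder name format depends on the Bucketer and the exact file names are hypothetical; only the suffixes and the states they represent are what the connector actually produces.

    sink-location/2019-05-20--13/part-8-0                finished: a checkpoint completed, safe for Hive to read
    sink-location/2019-05-20--13/_part-8-1.pending       closed (inactive or rolled over), waiting for the next checkpoint
    sink-location/2019-05-20--13/_part-8-2.in-progress   currently being written by subtask 8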

If no location for checkpoint state is configured, it is kept in the JobManager's memory by default, which fills up quickly, so move it to HDFS via the configuration file. Old checkpoint files are deleted automatically once new ones are created.

Configuring checkpoints

flink-conf.yaml

state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-host/tmp/flink-checkpoints
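
Besides the state backend settings, checkpointing also has to be enabled in the job itself, since the pending-to-finished transition described above only happens when a checkpoint completes. A minimal sketch (the 60-second interval here is an illustrative choice, not from the original):

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60 * 1000); // checkpoint every 60 s; each completed checkpoint moves pending files to finished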

Code to set up the BucketingSink

    DataStream<Tuple4<String, String, String, String>> input = ...

    BucketingSink<Tuple4<String, String, String, String>> bucketingSink
            = new BucketingSink<>("hdfs://hdfs-host/sink-location");
    bucketingSink.setBucketer(new HourBucketer());        // custom Bucketer: one folder per hour of event time
    bucketingSink.setWriter(new Tuple4Writer());          // custom Writer: how each Tuple4 is serialized
    bucketingSink.setBatchSize(1024 * 1024 * 500);        // roll a new file every 500 MB
    bucketingSink.setInactiveBucketCheckInterval(1000);   // check once per second for inactive files
    bucketingSink.setInactiveBucketThreshold(1000);       // a file idle this long (ms) moves from in-progress to pending

    input.addSink(bucketingSink);
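
The HourBucketer and Tuple4Writer used above are custom classes that this post does not show. The following is a minimal sketch of what HourBucketer could look like, assuming the event timestamp is carried as epoch milliseconds in the first tuple field; that field choice and the folder name format are assumptions, not part of the original.

    import org.apache.flink.api.java.tuple.Tuple4;
    import org.apache.flink.streaming.connectors.fs.Clock;
    import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
    import org.apache.hadoop.fs.Path;

    import java.text.SimpleDateFormat;
    import java.util.Date;

    // Buckets each record into one folder per hour of event time.
    // Assumption: the event timestamp (epoch milliseconds) is in the first tuple field.
    public class HourBucketer implements Bucketer<Tuple4<String, String, String, String>> {

        private static final long serialVersionUID = 1L;

        @Override
        public Path getBucketPath(Clock clock, Path basePath,
                                  Tuple4<String, String, String, String> element) {
            long eventTime = Long.parseLong(element.f0);
            // One folder per hour, e.g. .../2019-05-20--13
            String folder = new SimpleDateFormat("yyyy-MM-dd--HH").format(new Date(eventTime));
            return new Path(basePath, folder);
        }
    }

Tuple4Writer would similarly be a custom implementation of org.apache.flink.streaming.connectors.fs.Writer (for example a subclass of StringWriter) that formats each Tuple4 into one output line.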