Flink DataStream: Writing Kafka Data to HDFS and Partitioning into Hive

Because of a business requirement, we read data from Kafka, transform it, and finally sink it to the downstream message queue. To guarantee data reliability, we also persist the sink results, and we chose to write the stream to HDFS, for which Flink provides an HDFS connector. This post shows how to write streaming data to HDFS and then load the data into a Hive table.

1. pom.xml Configuration
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-filesystem_2.11</artifactId>
  <version>1.8.0</version>
</dependency>
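
The job below also uses the Kafka 0.10 connector, the RabbitMQ connector, and Gson. If they are not already on your classpath, dependencies along the following lines are needed (artifact versions are illustrative and should match your Flink and Scala versions):

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
  <version>1.8.0</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-rabbitmq_2.11</artifactId>
  <version>1.8.0</version>
</dependency>
<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.8.5</version>
</dependency>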
2. Flink DataStream Code

The code concatenates the final result records into CSV lines and writes them to HDFS. Much of the real business logic has been stripped from the example below; only the parts useful to readers are kept.

public class RMQAndBucketFileConnectSink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);


        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");
        SingleOutputStreamOperator<String> ds = env.addSource(new FlinkKafkaConsumer010<String>("user", new SimpleStringSchema(), p))
                .map(new MapFunction<String, User>() {
                    @Override
                    public User map(String value) throws Exception {
                        return new Gson().fromJson(value, User.class);
                    }
                })
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<User>() {
                    @Override
                    public long extractAscendingTimestamp(User element) {
                        return element.createTime;
                    }
                })
                .map(new MapFunction<User, String>() {
                    @Override
                    public String map(User value) throws Exception {
                        return value.userId + "," + value.name + "," + value.age + "," + value.sex + "," + value.createTime + "," + value.updateTime;
                    }
                });


        // Sink to RabbitMQ
        RMQConnectionConfig rmqConnectionConfig = new RMQConnectionConfig.Builder()
                .setHost("localhost")
                .setVirtualHost("/")
                .setPort(5672)
                .setUserName("admin")
                .setPassword("admin")
                .build();

        // Sink to RabbitMQ. If the queue is durable, override RMQSink's setupQueue method.
        RMQSink<String> rmqSink = new RMQSink<>(rmqConnectionConfig, "queue_name", new SimpleStringSchema());
        ds.addSink(rmqSink);


        // Sink to HDFS
        BucketingSink<String> bucketingSink = new BucketingSink<>("/apps/hive/warehouse/users");
        // Bucket output into yyyyMMdd directories, similar to Hive date partitions
        bucketingSink.setBucketer(new DateTimeBucketer<>("yyyyMMdd", ZoneId.of("Asia/Shanghai")));
        // Set the batch size to 128 MB; once a part file exceeds 128 MB it is closed and a new one is opened
        bucketingSink.setBatchSize(1024 * 1024 * 128L);
        // Roll over to a new part file every hour
        bucketingSink.setBatchRolloverInterval(60 * 60 * 1000L);
        // Pending file prefix, default is _
        bucketingSink.setPendingPrefix("");
        // Pending file suffix, default is .pending
        bucketingSink.setPendingSuffix("");
        // In-progress file prefix, default is _
        bucketingSink.setInProgressPrefix(".");

        ds.addSink(bucketingSink);


        env.execute("RMQAndBucketFileConnectSink");
    }
}
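
For completeness, here is a minimal sketch of the User POJO assumed by the job above. The field names are inferred from the JSON read from Kafka and the CSV mapping; the types are assumptions (in particular, createTime as epoch milliseconds, since it is used as the event timestamp), so adapt them to your actual schema:

public class User {
    // Field names inferred from the CSV mapping above; types are assumptions.
    public String userId;
    public String name;
    public int age;
    public int sex;
    public long createTime;  // epoch milliseconds, used as the event timestamp
    public long updateTime;
}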

The resulting HDFS directories look like this:

/apps/hive/warehouse/users/20190708
/apps/hive/warehouse/users/20190709
/apps/hive/warehouse/users/20190710
...
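
Within each date directory, BucketingSink writes part files named from the parallel subtask index and a rolling counter; with parallelism 1 and the empty pending prefix/suffix configured above, the finalized files typically look like this (names are illustrative):

/apps/hive/warehouse/users/20190708/part-0-0
/apps/hive/warehouse/users/20190708/part-0-1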
3. Creating the Hive Table and Loading the Data

Create the Hive table:

create external table default.users(
	`userId` string,
	`name` string,
	`age` int,
	`sex` int,
	`ctime` string,
	`utime` string
)
partitioned by(dt string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
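
Once a day's partition has been added (see the script below), the data can be queried directly, for example:

select userId, name, age from default.users where dt = '20190708' limit 10;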

Create a scheduled job that runs every day in the early morning to add the previous day's HDFS directory as a Hive partition. The load script is shown below.

load_hive.sh:

#!/usr/bin/env bash

d=`date -d "-1 day" +%Y%m%d`

# Add the previous day's HDFS directory as a Hive partition
/usr/hdp/2.6.3.0-235/hive/bin/hive -e "alter table default.users add partition (dt='${d}') location '/apps/hive/warehouse/users/${d}'"

Schedule it with crontab to run every day in the early morning, as in the example below.
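
For example, a crontab entry that runs the script shortly after midnight (the script path is illustrative):

30 0 * * * /path/to/load_hive.sh >> /var/log/load_hive.log 2>&1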
