Reading Kafka data in real time with Flink and writing it to HDFS and ES

I. Preface

The software stack is as follows: Flink 1.7.2, Elasticsearch 5.2.2, Hadoop 2.7.2, Kafka 0.10.2.1, Hive, Scala 2.11.11, and Spring Boot. These need to be installed in advance.
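
For reference, the build definition for the Flink job part of the project might look roughly like the sbt sketch below. This is not taken from the repo (which may well use Maven instead); the artifact names follow the Flink 1.7 connector naming, and the fastjson version is only an example:

    // build.sbt (sketch; not the repo's actual build file)
    scalaVersion := "2.11.11"

    val flinkVersion = "1.7.2"

    libraryDependencies ++= Seq(
      "org.apache.flink" %% "flink-streaming-scala"          % flinkVersion,
      "org.apache.flink" %% "flink-connector-kafka-0.11"     % flinkVersion, // FlinkKafkaConsumer011 used below
      "org.apache.flink" %% "flink-connector-elasticsearch5" % flinkVersion, // Elasticsearch 5.x sink
      "com.alibaba"      %  "fastjson"                       % "1.2.47"      // JSON parsing/serialization (version is an example)
    )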

The complete code has been pushed to GitHub: https://github.com/tianyadaochupao/crawler/tree/master

(1) Flow diagram:

The flow diagram is similar to the one in the previous article (https://blog.csdn.net/m0_37592814/article/details/105027815); the main change is that Spark Streaming is replaced with Flink.

(2) Flink official documentation

The Flink code below is mainly based on the official Flink documentation: https://flink.apache.org/

From the home page, click through to the documentation, select the corresponding Flink version, and you end up at the Flink 1.7.* docs: https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/

This code mainly references the Connectors, Operators, and Event Time sections.

II. Main workflow

1. Data collection

Data collection works as follows: data from e-commerce pages is forwarded to a Java program, which processes it and sends it to Kafka; Flink then consumes the data from Kafka. The data format is as follows:

{"bulletin":"1.请提前半小时或一小时订餐;若遇下雨天气,送餐时间会延迟哦","couponList":[],"createTime":"2020-05-24 21:12:30","deliveryFee":8,"deliveryMsg":"","deliveryTime":"0","deliveryType":0,"dpShopId":0,"itemList":[{"activityTag":"","activityType":0,"bigImageUrl":"http://p1.meituan.net/wmproduct/fb97ab1e25a18c05eb85ddf2dd085d33462196.jpg","cActivityTag":"","cTag":"131843184","categoryName":"火锅锅底","categoryType":"0","createTime":"2020-05-24 21:12:30","currentPrice":20.0,"dpShopId":0,"iconUrl":"","littleImageUrl":"http://p1.meituan.net/wmproduct/fb97ab1e25a18c05eb85ddf2dd085d33462196.jpg","mtWmPoiId":"97226154**","originPrice":20.0,"praiseNum":0,"sellStatus":0,"shopName":"八合里海记","skuList":[{"activityStock":0,"boxFee":0.0,"count":0,"currentPrice":20.0,"minPurchaseNum":-1,"originPrice":20.0,"realStock":-1,"restrict":-1,"skuId":2418651365,"skuPromotionInfo":"","soldStatus":0,"spec":""}],"spuDesc":"玉米80g、白萝卜100g、1.55L怡宝纯净水2瓶","spuId":2122835340,"spuName":"怡宝纯净水汤底","spuPromotionInfo":"","statusDesc":"","tag":"131843184","unit":""}],"minFee":20.0,"mtWmPoiId":"972261542993051","onlinePay":1,"shipping_time":"","shopName":"八合里海记","shopPic":"http://p0.meituan.net/waimaipoi/720fd28a978fa675c524c1ba4caa471673621.jpg","shopStatus":0}
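
Downstream this JSON is mapped onto a Shop case class (referenced below as Shop, with a create_time field used as the event time). The exact definition lives in the GitHub repo; a minimal sketch, with field names and types inferred from the sample JSON above and the Hive table created later, could look like this:

    // Sketch of the Shop case class; field names/types are assumptions based on
    // the sample JSON and the ods_shop_flink Hive columns, not the repo's exact code.
    case class Shop(
      dp_shop_id: Long,       // dpShopId
      wm_poi_id: String,      // mtWmPoiId
      shop_name: String,      // shopName
      shop_status: String,    // shopStatus
      shop_pic: String,       // shopPic
      delivery_fee: String,   // deliveryFee
      delivery_time: String,  // deliveryTime
      delivery_type: String,  // deliveryType
      min_fee: Double,        // minFee
      online_pay: Int,        // onlinePay
      shipping_time: String,  // shipping_time
      bulletin: String,       // bulletin
      create_time: String     // createTime, "yyyy-MM-dd HH:mm:ss", used as event time
    )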

2. Flink consumes the Kafka data and saves it to HDFS and ES

A Flink job is generally made up of a source, operators, and a sink.

2.1 Source: consuming data from Kafka

    // Imports used by this snippet (Scala API)
    import java.util.Properties
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

    // Get the Flink execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Kafka properties
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "ELK01:9092")
    properties.setProperty("group.id", "consumer-group")
    properties.setProperty("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("auto.offset.reset", "latest")

    // A Flink job generally has three parts: 1. source 2. operators 3. sink

    // 1. Source input: Kafka as the source
    // Arguments: topic, SimpleStringSchema (reads Kafka messages as strings), and the Kafka properties
    val inputStream = env.addSource(new FlinkKafkaConsumer011[String]("shop1", new SimpleStringSchema(), properties))
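
Not shown in this snippet: in practice you would normally also enable checkpointing on the environment. With checkpointing on, the Kafka consumer commits its offsets as part of each completed checkpoint, and the StreamingFileSink used in 2.3.2 only rolls its in-progress part files into finished files when a checkpoint completes. A minimal sketch (the interval is just an example):

    // Checkpoint every 60 seconds (interval is illustrative; tune for your job)
    env.enableCheckpointing(60 * 1000L)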

2.2 Operators

2.2.1 Converting the Kafka data into a case class (the computation shared by the HDFS and ES paths)

    // 2.1 Operators: process the data
    val stream = inputStream.map(new MapFunction[String, JSONObject] {
      override def map(value: String): JSONObject = {
        val jsonObject: JSONObject = JSON.parseObject(value)
        jsonObject
      }
    }).filter(new FilterFunction[JSONObject]() {
      @throws[Exception]
      override def filter(value: JSONObject): Boolean = {
        value.containsKey("mtWmPoiId")
      }
    }).map(new MapFunction[JSONObject, Shop] {
      override def map(value: JSONObject): Shop = {
        var shopBean: Shop = null
        try {
          shopBean = dealShop(value)
        } catch {
          case e: JSONException =>
            // records that fail to parse stay null and are dropped by the filter below
        }
        shopBean
      }
    }).filter(new FilterFunction[Shop] {
      override def filter(t: Shop): Boolean = {
        null != t
      }
    }).assignTimestampsAndWatermarks(new MyCustomerAssigner())
MyCustomerAssigner implements AssignerWithPeriodicWatermarks; its main job is to designate the create_time field of the case class as the event time:
  // Custom event-time assigner
  class MyCustomerAssigner() extends AssignerWithPeriodicWatermarks[Shop] {

    val maxOutOfOrderness = 3500L // 3.5 seconds

    var currentMaxTimestamp: Long = _

    val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    override def getCurrentWatermark: Watermark = {
      // Return the watermark as the current highest timestamp minus the out-of-orderness bound
      new Watermark(currentMaxTimestamp - maxOutOfOrderness)
    }

    def max(timestamp: Long, currentMaxTimestamp: Long): Long = {
      math.max(timestamp, currentMaxTimestamp)
    }

    override def extractTimestamp(shop: Shop, previousElementTimestamp: Long): Long = {
      var create_time = shop.create_time
      if (null == create_time) {
        create_time = getNowDate("yyyy-MM-dd HH:mm:ss")
      }
      val timestamp = dateFormat.parse(create_time).getTime
      currentMaxTimestamp = max(timestamp, currentMaxTimestamp)
      timestamp
    }
  }
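
Since the logic above is the standard "max timestamp minus a fixed bound" pattern, the same behaviour could also be written with Flink's built-in BoundedOutOfOrdernessTimestampExtractor. A sketch of this alternative (not what the repo uses, and it omits the null create_time fallback):

    import java.text.SimpleDateFormat
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.windowing.time.Time

    // Equivalent alternative: bounded out-of-orderness of 3.5 seconds
    class ShopTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor[Shop](Time.milliseconds(3500)) {
      private val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      override def extractTimestamp(shop: Shop): Long =
        dateFormat.parse(shop.create_time).getTime
    }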

2.2.2 Converting to the JSON format required for saving to ES

    // ES computation: turn the stream into the format that can be written to ES
    val esStream = stream.map(new MapFunction[Shop, String] {
      override def map(shop: Shop): String = {
        // SerializeConfig(true) enables fastjson's field-based serialization,
        // which is needed because Scala case classes do not expose JavaBean-style getters
        val conf = new SerializeConfig(true)
        val shopJson = JSON.toJSONString(shop, conf)
        shopJson
      }
    })

2.2.3 Converting to line-oriented records for HDFS

    // HDFS computation: turn the stream into the line format that can be written to HDFS
    val hdfsStream = stream.map(new MapFunction[Shop, String] {
      override def map(shopBean: Shop): String = {
        val shopLine: String = dealShopToLineString(shopBean)
        shopLine
      }
    })
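
dealShopToLineString is a helper from the repo that flattens a Shop into a single tab-separated line, matching the '\t'-delimited Hive table defined in 2.3.2. A minimal sketch of what such a helper could look like, based on the hypothetical Shop fields sketched earlier (the real field list and order are in the repo):

    // Hypothetical sketch: concatenate the fields in Hive column order, tab-separated
    def dealShopToLineString(shop: Shop): String = {
      Seq(
        shop.dp_shop_id, shop.wm_poi_id, shop.shop_name, shop.shop_status,
        shop.shop_pic, shop.delivery_fee, shop.delivery_time, shop.delivery_type,
        shop.min_fee, shop.online_pay, shop.shipping_time, shop.bulletin,
        shop.create_time
      ).mkString("\t")
    }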

2.3 Sink: sinking into HDFS and ES

2.3.1 Sinking into ES:

    // 3. Sink output: Elasticsearch as the sink
    // ES configuration
    val config = new java.util.HashMap[String, String]()
    // Cluster name
    config.put("cluster.name", "elk")
    // This instructs the sink to emit after every element; otherwise requests would be buffered
    config.put("bulk.flush.max.actions", "1")
    // Transport addresses
    val transportAddresses = new java.util.ArrayList[InetSocketAddress]()
    transportAddresses.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300))
    transportAddresses.add(new InetSocketAddress(InetAddress.getByName("localhost"), 9300))
    val currentDate = getNowDate("yyyy-MM-dd")
    esStream.addSink(new ElasticsearchSink[String](config, transportAddresses, new ElasticsearchSinkFunction[String] {
      override def process(shopJson: String, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
        // Index into the "shop" index, using the current date as the document type
        val request = Requests.indexRequest().index("shop").`type`(currentDate).source(shopJson)
        requestIndexer.add(request)
      }
    }))
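
Flushing after every element (bulk.flush.max.actions = 1) is fine for a demo but slow under real load. The Elasticsearch sink also exposes size- and time-based flush settings; a sketch with purely illustrative values:

    // Illustrative bulk settings for heavier traffic (values are examples, not from the repo)
    config.put("bulk.flush.max.actions", "1000")  // flush after 1000 buffered actions
    config.put("bulk.flush.max.size.mb", "5")     // ...or after 5 MB of buffered data
    config.put("bulk.flush.interval.ms", "5000")  // ...or at least every 5 seconds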

The ES result is shown in the figure:

2.3.2 Sinking into HDFS

The idea is to sink the data into the Hive table: by customizing the directory the sink writes to, the files land directly under the Hive table's data directory on HDFS, and the data on HDFS is then loaded into the Hive table (by registering the partitions; see the statements after the DDL below).

Create the Hive table:

drop table if exists ods_shop_flink;
create table ods_shop_flink(
dp_shop_id bigint,
wm_poi_id string,
shop_name string,
shop_status string,
shop_pic string,
delivery_fee string,
delivery_time string,
delivery_type string,
min_fee double,
online_pay int,
shipping_time string,
bulletin string,
create_time string
)
partitioned by (batch_date string)
row format delimited fields terminated by '\t';
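
Because the sink writes into batch_date=... subdirectories under the table location, loading the data into Hive only means registering those partitions. A sketch of that step (the partition value is just an example):

-- register a single day's partition once its files are finalized
alter table ods_shop_flink add if not exists partition (batch_date='2020-05-24');
-- or discover all new partitions at once
msck repair table ods_shop_flink;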

Find the HDFS directory where the Hive table stores its data:

Sink the data to files on HDFS:

    // 3. Sink output: HDFS as the sink
    val outputPath = "hdfs://ELK01:9000/user/hive/warehouse/wm.db/ods_shop_flink"

    // BucketAssigner  -- bucketing strategy; the default creates one directory per hour
    // RollingPolicy   -- file rolling policy:
    //   withRolloverInterval   -- roll when the part file has been open for this long
    //   withInactivityInterval -- roll when no new records have arrived for this long
    //   withMaxPartSize        -- roll when the part file reaches this size
    val sink = StreamingFileSink.forRowFormat(new Path(outputPath), new SimpleStringEncoder[String]("UTF-8"))
      // Use a custom bucket assigner so the files are written under the table directory,
      // with bucket (directory) names that follow the Hive partition naming convention
      .withBucketAssigner(new BatchDateBucketAssigner[String])
      .withRollingPolicy(DefaultRollingPolicy.create()
        .withRolloverInterval(TimeUnit.MINUTES.toMillis(1))
        .withInactivityInterval(TimeUnit.MINUTES.toMillis(1))
        .withMaxPartSize(1024 * 1024)
        .build())
      .build()
    hdfsStream.addSink(sink)
BatchDateBucketAssigner is a custom bucket assigner. The default assigner creates one directory per hour; this custom one creates one directory per day instead, so each directory corresponds to one daily Hive partition, and it names the directories in the Hive partition style, e.g. batch_date=2020-05-24:
package com.tang.crawler.flink;

import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.util.Preconditions;

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/**
 * Bucket assigner that creates one bucket (directory) per day, named in the
 * Hive partition style, e.g. "batch_date=2020-05-24".
 */
@PublicEvolving
public class BatchDateBucketAssigner<IN> implements BucketAssigner<IN, String> {

    private static final long serialVersionUID = 1L;
    private static final String DEFAULT_FORMAT_STRING = "yyyy-MM-dd";
    private final String formatString;
    private final ZoneId zoneId;
    private transient DateTimeFormatter dateTimeFormatter;

    public BatchDateBucketAssigner() {
        this(DEFAULT_FORMAT_STRING);
    }

    public BatchDateBucketAssigner(String formatString) {
        this(formatString, ZoneId.systemDefault());
    }

    public BatchDateBucketAssigner(ZoneId zoneId) {
        this(DEFAULT_FORMAT_STRING, zoneId);
    }

    public BatchDateBucketAssigner(String formatString, ZoneId zoneId) {
        this.formatString = Preconditions.checkNotNull(formatString);
        this.zoneId = Preconditions.checkNotNull(zoneId);
    }

    @Override
    public String getBucketId(IN in, Context context) {
        if (this.dateTimeFormatter == null) {
            this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(this.zoneId);
        }
        // context.timestamp() is the element's timestamp, i.e. the event time assigned in 2.2.1
        return "batch_date=" + this.dateTimeFormatter.format(Instant.ofEpochMilli(context.timestamp()));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }

    @Override
    public String toString() {
        return "BatchDateBucketAssigner{" +
                "formatString='" + formatString + '\'' +
                ", zoneId=" + zoneId +
                ", dateTimeFormatter=" + dateTimeFormatter +
                '}';
    }
}

Result: the files generated on HDFS and their contents are shown in the figure:

 
