I. Introduction
Required software: Flink 1.7.2, Elasticsearch 5.2.2, Hadoop 2.7.2, Kafka 0.10.2.1, Hive, Scala 2.11.11, and Spring Boot; install these in advance.
The complete code is available on GitHub: https://github.com/tianyadaochupao/crawler/tree/master
(1) Flow diagram:
The flow is essentially the same as in the previous post (https://blog.csdn.net/m0_37592814/article/details/105027815); the main change is replacing Spark Streaming with Flink.
(2) Flink documentation
The Flink code below is based mainly on the official documentation at https://flink.apache.org/: from the home page, click through to the documentation, select the matching Flink version, and you arrive at the 1.7.* docs: https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/
This post draws mainly on the Connectors, Operators, and Event Time sections.
II. Main workflow
1. Data collection
Data from e-commerce pages is forwarded to a Java program, processed, and sent to Kafka; Flink then consumes it from Kafka. A sample message looks like this:
{"bulletin":"1.请提前半小时或一小时订餐;若遇下雨天气,送餐时间会延迟哦","couponList":[],"createTime":"2020-05-24 21:12:30","deliveryFee":8,"deliveryMsg":"","deliveryTime":"0","deliveryType":0,"dpShopId":0,"itemList":[{"activityTag":"","activityType":0,"bigImageUrl":"http://p1.meituan.net/wmproduct/fb97ab1e25a18c05eb85ddf2dd085d33462196.jpg","cActivityTag":"","cTag":"131843184","categoryName":"火锅锅底","categoryType":"0","createTime":"2020-05-24 21:12:30","currentPrice":20.0,"dpShopId":0,"iconUrl":"","littleImageUrl":"http://p1.meituan.net/wmproduct/fb97ab1e25a18c05eb85ddf2dd085d33462196.jpg","mtWmPoiId":"97226154**","originPrice":20.0,"praiseNum":0,"sellStatus":0,"shopName":"八合里海记","skuList":[{"activityStock":0,"boxFee":0.0,"count":0,"currentPrice":20.0,"minPurchaseNum":-1,"originPrice":20.0,"realStock":-1,"restrict":-1,"skuId":2418651365,"skuPromotionInfo":"","soldStatus":0,"spec":""}],"spuDesc":"玉米80g、白萝卜100g、1.55L怡宝纯净水2瓶","spuId":2122835340,"spuName":"怡宝纯净水汤底","spuPromotionInfo":"","statusDesc":"","tag":"131843184","unit":""}],"minFee":20.0,"mtWmPoiId":"972261542993051","onlinePay":1,"shipping_time":"","shopName":"八合里海记","shopPic":"http://p0.meituan.net/waimaipoi/720fd28a978fa675c524c1ba4caa471673621.jpg","shopStatus":0}
2. Flink consumes the Kafka data and writes it to HDFS and ES
A Flink job is composed of sources, operators, and sinks.
2.1 Source: consuming data from Kafka
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

// obtain the Flink execution environment and switch to event time
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Kafka consumer properties
val properties = new Properties()
properties.setProperty("bootstrap.servers", "ELK01:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("auto.offset.reset", "latest")

// a Flink job has three parts: 1. source  2. operators  3. sink
// 1. source: Kafka as the source
// args: topic name; SimpleStringSchema (each Kafka message is deserialized as a String); the Kafka properties
val inputStream = env.addSource(new FlinkKafkaConsumer011[String]("shop1", new SimpleStringSchema(), properties))
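Note that auto.offset.reset only takes effect when the consumer group has no committed offsets. The Flink consumer also exposes explicit start positions; the following is equivalent to the addSource line above but states the starting point directly:

val consumer = new FlinkKafkaConsumer011[String]("shop1", new SimpleStringSchema(), properties)
consumer.setStartFromLatest() // or setStartFromEarliest() / setStartFromGroupOffsets()
val inputStream = env.addSource(consumer)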
2.2 Operators
2.2.1 Converting Kafka records into a case class (shared by the HDFS and ES paths)
import com.alibaba.fastjson.{JSON, JSONException, JSONObject}
import org.apache.flink.api.common.functions.{FilterFunction, MapFunction}

// 2. operators: parse, filter and convert the data
val stream = inputStream.map(new MapFunction[String, JSONObject] {
    override def map(value: String): JSONObject = {
      JSON.parseObject(value)
    }
  }).filter(new FilterFunction[JSONObject]() {
    @throws[Exception]
    override def filter(value: JSONObject): Boolean = {
      value.containsKey("mtWmPoiId")
    }
  }).map(new MapFunction[JSONObject, Shop] {
    override def map(value: JSONObject): Shop = {
      var shopBean: Shop = null
      try {
        shopBean = dealShop(value)
      } catch {
        case e: JSONException => // leave shopBean null; the filter below drops it
      }
      shopBean
    }
  }).filter(new FilterFunction[Shop] {
    override def filter(t: Shop): Boolean = {
      null != t
    }
  }).assignTimestampsAndWatermarks(new MyCustomerAssigner())
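dealShop is a helper in the repo that copies fields from the JSONObject into the case class. A minimal sketch, assuming the Shop fields above and the JSON keys shown in section 1 (the real helper also handles the nested itemList):

def dealShop(json: JSONObject): Shop = Shop(
  dp_shop_id    = json.getLongValue("dpShopId"),
  wm_poi_id     = json.getString("mtWmPoiId"),
  shop_name     = json.getString("shopName"),
  shop_status   = json.getString("shopStatus"),
  shop_pic      = json.getString("shopPic"),
  delivery_fee  = json.getString("deliveryFee"),
  delivery_time = json.getString("deliveryTime"),
  delivery_type = json.getString("deliveryType"),
  min_fee       = json.getDoubleValue("minFee"),
  online_pay    = json.getIntValue("onlinePay"),
  shipping_time = json.getString("shipping_time"),
  bulletin      = json.getString("bulletin"),
  create_time   = json.getString("createTime")
)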
MyCustomerAssigner implements AssignerWithPeriodicWatermarks and designates the create_time field of the case class as the event time:
import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// custom event-time assigner
class MyCustomerAssigner() extends AssignerWithPeriodicWatermarks[Shop] {
  val maxOutOfOrderness = 3500L // tolerate events arriving up to 3.5 seconds late
  var currentMaxTimestamp: Long = _
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  override def getCurrentWatermark: Watermark = {
    // the watermark trails the highest timestamp seen so far by the out-of-orderness bound
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
  }

  def max(timestamp: Long, currentMaxTimestamp: Long): Long = {
    math.max(timestamp, currentMaxTimestamp)
  }

  override def extractTimestamp(shop: Shop, previousElementTimestamp: Long): Long = {
    var create_time = shop.create_time
    if (null == create_time) {
      // getNowDate is a project helper that returns the current time as a formatted string
      create_time = getNowDate("yyyy-MM-dd HH:mm:ss")
    }
    val timestamp = dateFormat.parse(create_time).getTime
    currentMaxTimestamp = max(timestamp, currentMaxTimestamp)
    timestamp
  }
}
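With a periodic assigner, getCurrentWatermark is polled on a timer rather than per record. Once the time characteristic is set to EventTime, Flink uses a 200 ms interval by default; it can be tuned on the environment:

// how often Flink asks the assigner for a new watermark, in milliseconds
env.getConfig.setAutoWatermarkInterval(200L)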
2.2.2 Producing the JSON format needed to store into ES
import com.alibaba.fastjson.serializer.SerializeConfig

// ES path: convert each Shop into a JSON string that can be written to ES
val esStream = stream.map(new MapFunction[Shop, String] {
    override def map(shop: Shop): String = {
      // fieldBase = true makes fastjson serialize by fields; Scala case classes
      // need this because they expose fields rather than JavaBean getters
      val conf = new SerializeConfig(true)
      JSON.toJSONString(shop, conf)
    }
  })
2.2.3 Producing line-oriented records for HDFS
// HDFS path: convert each Shop into one tab-delimited line (see the sketch below)
val hdfsStream = stream.map(new MapFunction[Shop, String] {
    override def map(shopBean: Shop): String = {
      dealShopToLineString(shopBean)
    }
  })
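dealShopToLineString is the repo's helper that flattens a Shop into a single output line. A minimal sketch, assuming the Shop fields above: it joins the fields with tabs in the same order as the Hive columns in 2.3.2, matching row format delimited fields terminated by '\t':

def dealShopToLineString(s: Shop): String =
  Seq(s.dp_shop_id, s.wm_poi_id, s.shop_name, s.shop_status, s.shop_pic,
    s.delivery_fee, s.delivery_time, s.delivery_type, s.min_fee,
    s.online_pay, s.shipping_time, s.bulletin, s.create_time).mkString("\t")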
2.3 Sinks: writing to HDFS and ES
2.3.1 Sinking into ES:
import java.net.{InetAddress, InetSocketAddress}

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch5.ElasticsearchSink
import org.elasticsearch.client.Requests

// 3. sink: ES as the sink
// ES client configuration
val config = new java.util.HashMap[String, String]()
// cluster name
config.put("cluster.name", "elk")
// flush after every element; without this, requests would be buffered
// (convenient for a demo, but raise it in production to get real bulk requests)
config.put("bulk.flush.max.actions", "1")
// transport addresses of the ES nodes
val transportAddresses = new java.util.ArrayList[InetSocketAddress]()
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300))
val currentDate = getNowDate("yyyy-MM-dd")
esStream.addSink(new ElasticsearchSink[String](config, transportAddresses, new ElasticsearchSinkFunction[String] {
  override def process(shopJson: String, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
    // index each document into "shop", using the current date as the type
    val request = Requests.indexRequest().index("shop").`type`(currentDate).source(shopJson)
    requestIndexer.add(request)
  }
}))
The result in ES looks like this:
2.3.2 Sinking into HDFS
The data is written to HDFS through a StreamingFileSink. By pointing the sink at the Hive table's data directory and customizing the directory layout it produces, the files land exactly where Hive expects partition data, and the data on HDFS can then be loaded into the Hive table.
Create the Hive table:
drop table if exists ods_shop_flink;
create table ods_shop_flink(
dp_shop_id bigint,
wm_poi_id string,
shop_name string,
shop_status string,
shop_pic string,
delivery_fee string,
delivery_time string,
delivery_type string,
min_fee double,
online_pay int,
shipping_time string,
bulletin string,
create_time string
)
partitioned by (batch_date string)
row format delimited fields terminated by '\t';
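Writing files into the table directory is not enough on its own: each day's new batch_date folder still has to be registered as a partition before Hive can query it. A hedged example (the date is illustrative):

alter table ods_shop_flink add if not exists partition (batch_date='2020-05-24');
-- or discover all partition folders in one pass:
msck repair table ods_shop_flink;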
Locate the HDFS directory where the Hive table stores its data:
Sink the data into HDFS files:
import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

// 3. sink: HDFS as the sink
val outputPath = "hdfs://ELK01:9000/user/hive/warehouse/wm.db/ods_shop_flink"
// BucketAssigner: bucketing strategy; the default creates one folder per hour
// RollingPolicy: file rolling strategy
//   withRolloverInterval:   roll a part file once it has been open this long
//   withInactivityInterval: roll once no new records have arrived for this long
//   withMaxPartSize:        roll once the part file reaches this size
val sink = StreamingFileSink.forRowFormat(new Path(outputPath), new SimpleStringEncoder[String]("UTF-8"))
  // custom bucket assigner: write into the table's directory, naming folders like Hive partitions
  .withBucketAssigner(new BatchDateBucketAssigner[String])
  .withRollingPolicy(DefaultRollingPolicy.create()
    .withRolloverInterval(TimeUnit.MINUTES.toMillis(1))
    .withInactivityInterval(TimeUnit.MINUTES.toMillis(1))
    .withMaxPartSize(1024 * 1024)
    .build())
  .build()
hdfsStream.addSink(sink)
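One step the snippets above leave out: nothing runs until the job graph is submitted, so the main method has to end with env.execute. The job name below is illustrative:

env.execute("shop-kafka-to-hdfs-and-es") // job name is arbitrary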
BatchDateBucketAssigner is a custom bucket assigner. By default the sink creates one folder per hour; this assigner creates one folder per day instead, which corresponds to one Hive partition per day, and names each folder in Hive partition style, e.g. batch_date=2020-05-24:
package com.tang.crawler.flink;

import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.util.Preconditions;

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

@PublicEvolving
public class BatchDateBucketAssigner<IN> implements BucketAssigner<IN, String> {

    private static final long serialVersionUID = 1L;
    private static final String DEFAULT_FORMAT_STRING = "yyyy-MM-dd";

    private final String formatString;
    private final ZoneId zoneId;
    private transient DateTimeFormatter dateTimeFormatter;

    public BatchDateBucketAssigner() {
        this(DEFAULT_FORMAT_STRING);
    }

    public BatchDateBucketAssigner(String formatString) {
        this(formatString, ZoneId.systemDefault());
    }

    public BatchDateBucketAssigner(ZoneId zoneId) {
        this(DEFAULT_FORMAT_STRING, zoneId);
    }

    public BatchDateBucketAssigner(String formatString, ZoneId zoneId) {
        this.formatString = Preconditions.checkNotNull(formatString);
        this.zoneId = Preconditions.checkNotNull(zoneId);
    }

    @Override
    public String getBucketId(IN in, Context context) {
        if (this.dateTimeFormatter == null) {
            this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(this.zoneId);
        }
        // format the record timestamp (event time here) as a Hive-style partition folder name
        return "batch_date=" + this.dateTimeFormatter.format(Instant.ofEpochMilli(context.timestamp()));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }

    @Override
    public String toString() {
        return "BatchDateBucketAssigner{"
                + "formatString='" + formatString + '\''
                + ", zoneId=" + zoneId
                + ", dateTimeFormatter=" + dateTimeFormatter
                + '}';
    }
}
Result: the files generated on HDFS and their contents.