I. Introduction
Required software: Flink 1.7.2, Elasticsearch 5.2.2, Hadoop 2.7.2, Kafka 0.10.2.1, Hive, Scala 2.11.11, and Spring Boot; install these in advance.
The complete code is available on GitHub: https://github.com/tianyadaochupao/crawler/tree/master
(1) Flow diagram:
The flow is essentially the same as in the previous post (https://blog.csdn.net/m0_37592814/article/details/105027815); the main change is replacing Spark Streaming with Flink.
(2) Flink documentation
The Flink code below is based mainly on the official documentation at https://flink.apache.org/: from the home page, click through to the documentation, select the matching Flink version, and you arrive at the 1.7.* docs: https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/
This post draws mainly on the Connectors, Operators, and Event Time sections.
II. Main workflow
1. Data collection
Data from e-commerce pages is forwarded to a Java program, processed, and sent to Kafka; Flink then consumes it from Kafka. A sample message looks like this:
{"bulletin":"1.请提前半小时或一小时订餐;若遇下雨天气,送餐时间会延迟哦","couponList":[],"createTime":"2020-05-24 21:12:30","deliveryFee":8,"deliveryMsg":"","deliveryTime":"0","deliveryType":0,"dpShopId":0,"itemList":[{"activityTag":"","activityType":0,"bigImageUrl":"http://p1.meituan.net/wmproduct/fb97ab1e25a18c05eb85ddf2dd085d33462196.jpg","cActivityTag":"","cTag":"131843184","categoryName":"火锅锅底","categoryType":"0","createTime":"2020-05-24 21:12:30","currentPrice":20.0,"dpShopId":0,"iconUrl":"","littleImageUrl":"http://p1.meituan.net/wmproduct/fb97ab1e25a18c05eb85ddf2dd085d33462196.jpg","mtWmPoiId":"97226154**","originPrice":20.0,"praiseNum":0,"sellStatus":0,"shopName":"八合里海记","skuList":[{"activityStock":0,"boxFee":0.0,"count":0,"currentPrice":20.0,"minPurchaseNum":-1,"originPrice":20.0,"realStock":-1,"restrict":-1,"skuId":2418651365,"skuPromotionInfo":"","soldStatus":0,"spec":""}],"spuDesc":"玉米80g、白萝卜100g、1.55L怡宝纯净水2瓶","spuId":2122835340,"spuName":"怡宝纯净水汤底","spuPromotionInfo":"","statusDesc":"","tag":"131843184","unit":""}],"minFee":20.0,"mtWmPoiId":"972261542993051","onlinePay":1,"shipping_time":"","shopName":"八合里海记","shopPic":"http://p0.meituan.net/waimaipoi/720fd28a978fa675c524c1ba4caa471673621.jpg","shopStatus":0}
2. Flink consumes the Kafka data and writes it to HDFS and ES
A Flink job is composed of sources, operators, and sinks.
2.1 Source: consuming data from Kafka
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

// obtain the Flink execution environment and switch to event time
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Kafka consumer properties
val properties = new Properties()
properties.setProperty("bootstrap.servers", "ELK01:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("auto.offset.reset", "latest")

// a Flink job has three parts: 1. source  2. operators  3. sink
// 1. source: Kafka as the source
// args: topic name; SimpleStringSchema (each Kafka message is deserialized as a String); the Kafka properties
val inputStream = env.addSource(new FlinkKafkaConsumer011[String]("shop1", new SimpleStringSchema(), properties))
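Note that auto.offset.reset only takes effect when the consumer group has no committed offsets. The Flink consumer also exposes explicit start positions; the following is equivalent to the addSource line above but states the starting point directly:

val consumer = new FlinkKafkaConsumer011[String]("shop1", new SimpleStringSchema(), properties)
consumer.setStartFromLatest() // or setStartFromEarliest() / setStartFromGroupOffsets()
val inputStream = env.addSource(consumer)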
2.2 Operators
2.2.1 Converting Kafka records into a case class (shared by the HDFS and ES paths)
import com.alibaba.fastjson.{JSON, JSONException, JSONObject}
import org.apache.flink.api.common.functions.{FilterFunction, MapFunction}

// 2. operators: parse, filter and convert the data
val stream = inputStream.map(new MapFunction[String, JSONObject] {
    override def map(value: String): JSONObject = {
      JSON.parseObject(value)
    }
  }).filter(new FilterFunction[JSONObject]() {
    @throws[Exception]
    override def filter(value: JSONObject): Boolean = {
      value.containsKey("mtWmPoiId")
    }
  }).map(new MapFunction[JSONObject, Shop] {
    override def map(value: JSONObject): Shop = {
      var shopBean: Shop = null
      try {
        shopBean = dealShop(value)
      } catch {
        case e: JSONException => // leave shopBean null; the filter below drops it
      }
      shopBean
    }
  }).filter(new FilterFunction[Shop] {
    override def filter(t: Shop): Boolean = {
      null != t
    }
  }).assignTimestampsAndWatermarks(new MyCustomerAssigner())
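dealShop is a helper in the repo that copies fields from the JSONObject into the case class. A minimal sketch, assuming the Shop fields above and the JSON keys shown in section 1 (the real helper also handles the nested itemList):

def dealShop(json: JSONObject): Shop = Shop(
  dp_shop_id    = json.getLongValue("dpShopId"),
  wm_poi_id     = json.getString("mtWmPoiId"),
  shop_name     = json.getString("shopName"),
  shop_status   = json.getString("shopStatus"),
  shop_pic      = json.getString("shopPic"),
  delivery_fee  = json.getString("deliveryFee"),
  delivery_time = json.getString("deliveryTime"),
  delivery_type = json.getString("deliveryType"),
  min_fee       = json.getDoubleValue("minFee"),
  online_pay    = json.getIntValue("onlinePay"),
  shipping_time = json.getString("shipping_time"),
  bulletin      = json.getString("bulletin"),
  create_time   = json.getString("createTime")
)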
MyCustomerAssigner implements AssignerWithPeriodicWatermarks and designates the create_time field of the case class as the event time:
import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// custom event-time assigner
class MyCustomerAssigner() extends AssignerWithPeriodicWatermarks[Shop] {
  val maxOutOfOrderness = 3500L // tolerate events arriving up to 3.5 seconds late
  var currentMaxTimestamp: Long = _
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  override def getCurrentWatermark: Watermark = {
    // the watermark trails the highest timestamp seen so far by the out-of-orderness bound
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
  }

  def max(timestamp: Long, currentMaxTimestamp: Long): Long = {
    math.max(timestamp, currentMaxTimestamp)
  }

  override def extractTimestamp(shop: Shop, previousElementTimestamp: Long): Long = {
    var create_time = shop.create_time
    if (null == create_time) {
      // getNowDate is a project helper that returns the current time as a formatted string
      create_time = getNowDate("yyyy-MM-dd HH:mm:ss")
    }
    val timestamp = dateFormat.parse(create_time).getTime
    currentMaxTimestamp = max(timestamp, currentMaxTimestamp)
    timestamp
  }
}
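With a periodic assigner, getCurrentWatermark is polled on a timer rather than per record. Once the time characteristic is set to EventTime, Flink uses a 200 ms interval by default; it can be tuned on the environment:

// how often Flink asks the assigner for a new watermark, in milliseconds
env.getConfig.setAutoWatermarkInterval(200L)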
2.2.2 Producing the JSON format needed to store into ES
import com.alibaba.fastjson.serializer.SerializeConfig

// ES path: convert each Shop into a JSON string that can be written to ES
val esStream = stream.map(new MapFunction[Shop, String] {
    override def map(shop: Shop): String = {
      // fieldBase = true makes fastjson serialize by fields; Scala case classes
      // need this because they expose fields rather than JavaBean getters
      val conf = new SerializeConfig(true)
      JSON.toJSONString(shop, conf)
    }
  })
2.2.3 Producing line-oriented records for HDFS
// HDFS path: convert each Shop into one tab-delimited line (see the sketch below)
val hdfsStream = stream.map(new MapFunction[Shop, String] {
    override def map(shopBean: Shop): String = {
      dealShopToLineString(shopBean)
    }
  })
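dealShopToLineString is the repo's helper that flattens a Shop into a single output line. A minimal sketch, assuming the Shop fields above: it joins the fields with tabs in the same order as the Hive columns in 2.3.2, matching row format delimited fields terminated by '\t':

def dealShopToLineString(s: Shop): String =
  Seq(s.dp_shop_id, s.wm_poi_id, s.shop_name, s.shop_status, s.shop_pic,
    s.delivery_fee, s.delivery_time, s.delivery_type, s.min_fee,
    s.online_pay, s.shipping_time, s.bulletin, s.create_time).mkString("\t")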
2.3 Sinks: writing to HDFS and ES
2.3.1 Sinking into ES:
import java.net.{InetAddress, InetSocketAddress}

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch5.ElasticsearchSink
import org.elasticsearch.client.Requests

// 3. sink: ES as the sink
// ES client configuration
val config = new java.util.HashMap[String, String]()
// cluster name
config.put("cluster.name", "elk")
// flush after every element; without this, requests would be buffered
// (convenient for a demo, but raise it in production to get real bulk requests)
config.put("bulk.flush.max.actions", "1")
// transport addresses of the ES nodes
val transportAddresses = new java.util.ArrayList[InetSocketAddress]()
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300))
val currentDate = getNowDate("yyyy-MM-dd")
esStream.addSink(new ElasticsearchSink[String](config, transportAddresses, new ElasticsearchSinkFunction[String] {
  override def process(shopJson: String, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
    // index each document into "shop", using the current date as the type
    val request = Requests.indexRequest().index("shop").`type`(currentDate).source(shopJson)
    requestIndexer.add(request)
  }
}))
The result in ES looks like this:
2.3.2 Sinking into HDFS
The data is written to HDFS through a StreamingFileSink. By pointing the sink at the Hive table's data directory and customizing the directory layout it produces, the files land exactly where Hive expects partition data, and the data on HDFS can then be loaded into the Hive table.
Create the Hive table:
drop table if exists ods_shop_flink;
create table ods_shop_flink(
dp_shop_id bigint,
wm_poi_id string,
shop_name string,
shop_status string,
shop_pic string,
delivery_fee string,
delivery_time string,
delivery_type string,
min_fee double,
online_pay int,
shipping_time string,
bulletin string,
create_time string
)
partitioned by (batch_date string)
row format delimited fields terminated by '\t';
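Writing files into the table directory is not enough on its own: each day's new batch_date folder still has to be registered as a partition before Hive can query it. A hedged example (the date is illustrative):

alter table ods_shop_flink add if not exists partition (batch_date='2020-05-24');
-- or discover all partition folders in one pass:
msck repair table ods_shop_flink;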
Locate the HDFS directory where the Hive table stores its data:
Sink the data into HDFS files:
import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

// 3. sink: HDFS as the sink
val outputPath = "hdfs://ELK01:9000/user/hive/warehouse/wm.db/ods_shop_flink"
// BucketAssigner: bucketing strategy; the default creates one folder per hour
// RollingPolicy: file rolling strategy
//   withRolloverInterval:   roll a part file once it has been open this long
//   withInactivityInterval: roll once no new records have arrived for this long
//   withMaxPartSize:        roll once the part file reaches this size
val sink = StreamingFileSink.forRowFormat(new Path(outputPath), new SimpleStringEncoder[String]("UTF-8"))
  // custom bucket assigner: write into the table's directory, naming folders like Hive partitions
  .withBucketAssigner(new BatchDateBucketAssigner[String])
  .withRollingPolicy(DefaultRollingPolicy.create()
    .withRolloverInterval(TimeUnit.MINUTES.toMillis(1))
    .withInactivityInterval(TimeUnit.MINUTES.toMillis(1))
    .withMaxPartSize(1024 * 1024)
    .build())
  .build()
hdfsStream.addSink(sink)
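One step the snippets above leave out: nothing runs until the job graph is submitted, so the main method has to end with env.execute. The job name below is illustrative:

env.execute("shop-kafka-to-hdfs-and-es") // job name is arbitrary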
BatchDateBucketAssigner is a custom bucket assigner. By default the sink creates one folder per hour; this assigner creates one folder per day instead, which corresponds to one Hive partition per day, and names each folder in Hive partition style, e.g. batch_date=2020-05-24:
package com.tang.crawler.flink;

import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.util.Preconditions;

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

@PublicEvolving
public class BatchDateBucketAssigner<IN> implements BucketAssigner<IN, String> {

    private static final long serialVersionUID = 1L;
    private static final String DEFAULT_FORMAT_STRING = "yyyy-MM-dd";

    private final String formatString;
    private final ZoneId zoneId;
    private transient DateTimeFormatter dateTimeFormatter;

    public BatchDateBucketAssigner() {
        this(DEFAULT_FORMAT_STRING);
    }

    public BatchDateBucketAssigner(String formatString) {
        this(formatString, ZoneId.systemDefault());
    }

    public BatchDateBucketAssigner(ZoneId zoneId) {
        this(DEFAULT_FORMAT_STRING, zoneId);
    }

    public BatchDateBucketAssigner(String formatString, ZoneId zoneId) {
        this.formatString = Preconditions.checkNotNull(formatString);
        this.zoneId = Preconditions.checkNotNull(zoneId);
    }

    @Override
    public String getBucketId(IN in, Context context) {
        if (this.dateTimeFormatter == null) {
            this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(this.zoneId);
        }
        // format the record timestamp (event time here) as a Hive-style partition folder name
        return "batch_date=" + this.dateTimeFormatter.format(Instant.ofEpochMilli(context.timestamp()));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }

    @Override
    public String toString() {
        return "BatchDateBucketAssigner{"
                + "formatString='" + formatString + '\''
                + ", zoneId=" + zoneId
                + ", dateTimeFormatter=" + dateTimeFormatter
                + '}';
    }
}
Result: the files generated on HDFS and their contents.