Exploring Real-Time OLAP with Flink + Druid

Scenario

In a K-12 online-education company, a number of business scenarios call for real-time statistics and analysis: the number of teachers and students currently in class, real-time sales, the classroom crash rate, and so on. These metrics need to reflect class quality as it happens, so that we have an up-to-date picture of how the business as a whole is doing.

Comparing the Options

We compared quite a few solutions; several are listed below for reference.

| Solution | Real-time ingestion | SQL support |
|---|---|---|
| Spark + CarbonData | Supported | Rich Spark SQL syntax |
| Kylin | Not supported | Supports joins |
| Flink + Druid | Supported | No SQL before 0.15; joins not supported |

As shown in the previous article, Spark + CarbonData is also a workable solution, but its drawback is fairly obvious: it cannot be combined with Flink. The broad direction of our big-data plan is Spark for offline computation and Flink for real-time computation, and that direction will not change in the short term.

Kylin is a long-established OLAP engine, but it had one shortcoming that ruled it out for us: at the time we made our technology choice, Kylin did not yet support real-time ingestion (a later 2.0 release added it), so we passed on it.

Flink + Druid, the option we chose, rides the current wave: Flink is at its peak and in use at all the major companies, while Druid is the rising star of OLAP. Plenty has been written about both, so I will not repeat it here; other posts on my blog include installation tutorials for the latest versions, with hands-on walkthroughs.

Design

Real-time processing is done with Flink SQL; real-time ingestion into Druid is handled by a Druid supervisor consuming from Kafka (the Kafka indexing service). End to end, the pipeline is: tracked events → Kafka → Flink (parse and project) → Kafka → Druid (rollup) → SQL queries.

Example Scenario

Computing the classroom disconnect rate in real time. The metric involves two tracked events, each reported separately: entering the classroom and disconnecting. The fields the Druid schema is designed around are described below.

Flink Processing

Flink parses the reported data, which arrives as JSON, extracts the required fields, and sends them on to Kafka (a sketch of the job follows the list). The fields are:

- sysTime, in DateTime format
- pt, in yyyy-MM-dd format
- eventId, the event type (enterRoom | disconnect)
- lessonId, the lesson ID
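
The processing itself was written in Flink SQL per the design above; purely as an illustration, here is a sketch of the same parse-and-forward step in Flink's DataStream API. The source topic sac_raw_events, the consumer group id, and the exact shape of the raw JSON are assumptions for the sketch; the sink topic is the one the Druid supervisor consumes in the next section.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.util.Collector;

import java.util.Properties;

public class TrackingEventForwardJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers",
                "bd-prod-kafka01:9092,bd-prod-kafka02:9092,bd-prod-kafka03:9092");
        props.setProperty("group.id", "sac-event-parser"); // hypothetical group id

        // "sac_raw_events" is a hypothetical topic carrying the raw JSON reports.
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer<>("sac_raw_events", new SimpleStringSchema(), props));

        DataStream<String> parsed = raw.flatMap(new FlatMapFunction<String, String>() {
            private final ObjectMapper mapper = new ObjectMapper();

            @Override
            public void flatMap(String line, Collector<String> out) throws Exception {
                JsonNode node = mapper.readTree(line);
                String eventId = node.path("eventId").asText();
                // Keep only the two events this metric needs; drop everything else.
                if (!"enterRoom".equals(eventId) && !"disconnect".equals(eventId)) {
                    return;
                }
                // Project the four fields Druid ingests.
                ObjectNode projected = mapper.createObjectNode();
                projected.put("sysTime", node.path("sysTime").asText());
                projected.put("pt", node.path("pt").asText());
                projected.put("eventId", eventId);
                projected.put("lessonId", node.path("lessonId").asText());
                out.collect(mapper.writeValueAsString(projected));
            }
        });

        // The Druid supervisor (next section) consumes this topic.
        parsed.addSink(new FlinkKafkaProducer<>("sac_druid_analyze_v2",
                new SimpleStringSchema(), props));

        env.execute("tracking-event-forward");
    }
}
```

Doing the filtering in Flink keeps the Druid topic lean: Druid only ever ingests rows that actually contribute to the metric.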

Druid Processing

Start a Druid supervisor that consumes the data from Kafka, with rollup (pre-aggregation) enabled. The configuration:

{"type": "kafka","dataSchema": {"dataSource": "sac_core_analyze_v1","parser": {"parseSpec": {"dimensionsSpec": {"spatialDimensions": [],"dimensions": ["eventId","pt"]

},"format": "json","timestampSpec": {"column": "sysTime","format": "auto"}

},"type": "string"},"metricsSpec": [

{"filter": {"type": "selector","dimension": "msg_type","value": "disconnect"},"aggregator": {"name": "lesson_offline_molecule_id","type": "cardinality","fields": ["lesson_id"]

},"type": "filtered"}, {"filter": {"type": "selector","dimension": "msg_type","value": "enterRoom"},"aggregator": {"name": "lesson_offline_denominator_id","type": "cardinality","fields": ["lesson_id"]

},"type": "filtered"}

],"granularitySpec": {"type": "uniform","segmentGranularity": "DAY","queryGranularity": {"type": "none"},"rollup": true,"intervals": null},"transformSpec": {"filter": null,"transforms": []

}

},"tuningConfig": {"type": "kafka","maxRowsInMemory": 1000000,"maxBytesInMemory": 0,"maxRowsPerSegment": 5000000,"maxTotalRows": null,"intermediatePersistPeriod": "PT10M","basePersistDirectory": "/tmp/1564535441619-2","maxPendingPersists": 0,"indexSpec": {"bitmap": {"type": "concise"},"dimensionCompression": "lz4","metricCompression": "lz4","longEncoding": "longs"},"buildV9Directly": true,"reportParseExceptions": false,"handoffConditionTimeout": 0,"resetOffsetAutomatically": false,"segmentWriteOutMediumFactory": null,"workerThreads": null,"chatThreads": null,"chatRetries": 8,"httpTimeout": "PT10S","shutdownTimeout": "PT80S","offsetFetchPeriod": "PT30S","intermediateHandoffPeriod": "P2147483647D","logParseExceptions": false,"maxParseExceptions": 2147483647,"maxSavedParseExceptions": 0,"skipSequenceNumberAvailabilityCheck": false},"ioConfig": {"topic": "sac_druid_analyze_v2","replicas": 2,"taskCount": 1,"taskDuration": "PT600S","consumerProperties": {"bootstrap.servers": "bd-prod-kafka01:9092,bd-prod-kafka02:9092,bd-prod-kafka03:9092"},"pollTimeout": 100,"startDelay": "PT5S","period": "PT30S","useEarliestOffset": false,"completionTimeout": "PT1200S","lateMessageRejectionPeriod": null,"earlyMessageRejectionPeriod": null,"stream": "sac_druid_analyze_v2","useEarliestSequenceNumber": false},"context": null,"suspended": false}


The most important part is metricsSpec, which defines the fields and conditions for pre-aggregation. Here it declares two filtered cardinality aggregators: lesson_offline_molecule_id counts distinct lessonId values among disconnect events (the numerator), and lesson_offline_denominator_id counts distinct lessonId values among enterRoom events (the denominator).
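
For completeness, here is a minimal sketch of submitting the spec above to Druid's supervisor API (POST /druid/indexer/v1/supervisor on the Overlord). The Overlord host bd-prod-druid01:8090 and the local spec file name are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class SubmitSupervisor {
    public static void main(String[] args) throws Exception {
        // Read the supervisor spec shown above from a local file (hypothetical name).
        String spec = Files.readString(Path.of("supervisor-spec.json"));

        HttpRequest request = HttpRequest.newBuilder()
                // Hypothetical Overlord host; the path is Druid's standard
                // supervisor-creation endpoint.
                .uri(URI.create("http://bd-prod-druid01:8090/druid/indexer/v1/supervisor"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(spec))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // On success Druid echoes back the supervisor id, e.g. {"id":"sac_core_analyze_v1"}.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```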

Querying the Data

After rollup, the data looks like this. The distinct-count metrics are stored as binary HyperLogLog-style sketches (shown here base64-encoded), which is why the query below reads them with APPROX_COUNT_DISTINCT rather than as plain numbers.

| pt | eventId | lesson_offline_molecule_id | lesson_offline_denominator_id |
|---|---|---|---|
| 2019-08-09 | enterRoom | "AQAAAAAAAA==" | "AQAAAAAAAA==" |
| 2019-08-09 | disconnect | "AQAAAAAAAA==" | "AQAAAAAAAA==" |

The disconnect rate can then be produced with SQL like this:

```sql
SELECT
  pt,
  CAST(APPROX_COUNT_DISTINCT(lesson_offline_molecule_id) AS DOUBLE)
    / CAST(APPROX_COUNT_DISTINCT(lesson_offline_denominator_id) AS DOUBLE) AS offline_rate
FROM sac_core_analyze_v1
GROUP BY pt
```

You can fetch the result through Druid's HTTP API, which is extremely convenient.
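
As a sketch of that convenience, the query above can be POSTed to Druid's SQL endpoint (/druid/v2/sql/ on a Broker). The Broker host bd-prod-druid01:8082 is an assumption.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryOfflineRate {
    public static void main(String[] args) throws Exception {
        // JSON request body for Druid's SQL API, wrapping the query shown above.
        String body = "{\"query\": \"SELECT pt, "
                + "CAST(APPROX_COUNT_DISTINCT(lesson_offline_molecule_id) AS DOUBLE) "
                + "/ CAST(APPROX_COUNT_DISTINCT(lesson_offline_denominator_id) AS DOUBLE) AS offline_rate "
                + "FROM sac_core_analyze_v1 GROUP BY pt\"}";

        HttpRequest request = HttpRequest.newBuilder()
                // Hypothetical Broker host; the path is Druid's standard SQL endpoint.
                .uri(URI.create("http://bd-prod-druid01:8082/druid/v2/sql/"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The result arrives as a JSON array of rows,
        // e.g. [{"pt":"2019-08-09","offline_rate":0.5}].
        System.out.println(response.body());
    }
}
```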
