Druid at Pulsar

ebay

Article link: https://blog.csdn.net/ebay/article/details/50205611

Author: Xiaoming Zhang

A glance at Pulsar and Druid

Pulsar is an open source project from eBay with two parts: the Pulsar pipeline and Pulsar reporting. The Pulsar pipeline is a streaming framework that distributes more than 8 billion events every day, and Pulsar reporting is responsible for storing, querying, and visualizing that data. Druid is part of Pulsar reporting.

This article introduces Druid, takes a short deep dive into it, and shows the role it plays in Pulsar reporting.

Introduction to Druid components

Druid is an open source analytics data store designed for business intelligence (OLAP, online analytical processing) queries on event data.

Druid features (from the official website):

1. Sub-Second Queries.

Supports multidimensional filtering and aggregation, and can target exactly the data a query needs.

2. Real time Ingestion

Supports streaming data ingestion and offers insights on events immediately after they occur.

3. Scalable

Able to handle trillions of events in total and millions of events per second.

4. Highly Available

Built for SaaS (software as a service) deployments that need to be up all the time; scaling up or down does not lose data.

5. Designed for Analytics

Supports many filters, aggregators, and query types, and allows new functionality to be plugged in.

Supports approximate algorithms for cardinality estimation, and for histogram and quantile calculations.
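To make the filtering and aggregation features above more concrete, here is a minimal sketch of a Druid native timeseries query, built as a Python dictionary and serialized to JSON. The data source `pulsar_event`, the `country` dimension, and the `count` metric come from the segment metadata shown later in this article; the filter value `"US"`, the output name `events`, and the helper `build_hourly_count_query` are assumptions made only for illustration. A query like this would normally be sent as an HTTP POST to a broker node's `/druid/v2/` endpoint; the sketch only builds the request body.

```python
import json


def build_hourly_count_query(country: str, interval: str) -> dict:
    """Build a Druid native timeseries query (hypothetical helper).

    The query sums the rolled-up `count` metric per hour for events
    whose `country` dimension equals the given value.
    """
    return {
        "queryType": "timeseries",
        "dataSource": "pulsar_event",
        "granularity": "hour",
        # Selector filter: keep only rows where dimension == value.
        "filter": {"type": "selector", "dimension": "country", "value": country},
        # longSum over the ingestion-time `count` metric gives the event total.
        "aggregations": [
            {"type": "longSum", "name": "events", "fieldName": "count"}
        ],
        "intervals": [interval],
    }


query = build_hourly_count_query(
    country="US",  # assumed example value
    interval="2014-09-15T05:00:00.000-07:00/2014-09-15T06:00:00.000-07:00",
)
body = json.dumps(query, indent=2)
print(body)
```

Because the interval is part of the query, Druid only needs to look at the segments that cover that time range, which is one reason a query can target exactly the data it needs.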

A glance at the Druid structure in Pulsar reporting:

The cluster receives about 10 billion events per day, with peak traffic of about 200k events per second.
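As a quick sanity check on these figures, the short calculation below converts the daily volume into an average per-second rate and compares it with the stated peak. It uses only the two numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the article.
EVENTS_PER_DAY = 10_000_000_000      # about 10 billion events per day
PEAK_EVENTS_PER_SECOND = 200_000     # about 200k events per second at peak

SECONDS_PER_DAY = 24 * 60 * 60

# Average ingestion rate if traffic were spread evenly over the day.
average_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# How much higher the peak is than the average.
peak_to_average = PEAK_EVENTS_PER_SECOND / average_events_per_second

print(f"average: {average_events_per_second:,.0f} events/s")
print(f"peak / average: {peak_to_average:.2f}")
```

The average works out to roughly 115,741 events per second, so the quoted peak is about 1.73 times the average rate.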

Each machine in our cluster has 128 GB of memory, and each historical node has more than 6 TB of disk.

Druid at a glance:

Brief introduction to all nodes:

Real-time

A real-time node indexes incoming data, and the indexed data can be queried immediately. Real-time nodes build the data up into segments, and after a period of time each segment is handed over to a historical node.
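The flow described above can be sketched with a toy model. This is not Druid code: `ToyRealtimeNode` and its methods are hypothetical names, the segment granularity is assumed to be one hour, and handover is simplified to "the segment's interval has ended", whereas a real deployment also applies settings such as a window period before handing a segment over.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

SEGMENT_GRANULARITY = timedelta(hours=1)  # assumption: hourly segments


def segment_start(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hourly segment."""
    return ts.replace(minute=0, second=0, microsecond=0)


class ToyRealtimeNode:
    """Toy model of a real-time node (illustration only, not Druid's API)."""

    def __init__(self) -> None:
        self.open_segments = defaultdict(list)  # segment start -> events
        self.handed_over = {}                   # stands in for historical nodes

    def ingest(self, ts: datetime, event: dict) -> None:
        """Index an event into the segment that covers its timestamp."""
        self.open_segments[segment_start(ts)].append(event)

    def query_count(self, start: datetime) -> int:
        """Ingested events are visible to queries right away."""
        return len(self.open_segments.get(start, []))

    def hand_over(self, now: datetime) -> list:
        """Hand over every segment whose interval ended at or before `now`."""
        done = sorted(
            s for s in self.open_segments if s + SEGMENT_GRANULARITY <= now
        )
        for start in done:
            self.handed_over[start] = self.open_segments.pop(start)
        return done


def at(hour: int, minute: int) -> datetime:
    return datetime(2015, 11, 18, hour, minute, tzinfo=timezone.utc)


node = ToyRealtimeNode()
node.ingest(at(6, 10), {"page": "home"})
node.ingest(at(6, 50), {"page": "search"})
node.ingest(at(7, 5), {"page": "home"})

queryable_now = node.query_count(at(6, 0))  # events in the 06:00 segment
finished = node.hand_over(now=at(7, 30))    # the 06:00 segment is complete
```

In this run the 06:00 to 07:00 segment holds two events that are queryable immediately, and it is handed over once its hour has passed, while the 07:00 segment stays on the real-time node.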

An example of a real-time segment: 2015-11-18T06:00:00.000Z_2015-11-18T07:00:00.000Z, which is stored in the folder of the schema you defined. All segments are stored in the format shown above.
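The segment name above is simply the start and end of the segment's interval joined by an underscore. As a small sketch, assuming UTC timestamps in exactly this format, the name can be split back into its interval with the standard library; `parse_segment_name` is a hypothetical helper, not part of Druid.

```python
from datetime import datetime, timezone

# Timestamp layout used in the example name, with a literal trailing "Z".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_segment_name(name: str) -> tuple:
    """Split '<start>_<end>' into two timezone-aware UTC datetimes."""
    start_text, end_text = name.split("_")
    start = datetime.strptime(start_text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return start, end


start, end = parse_segment_name("2015-11-18T06:00:00.000Z_2015-11-18T07:00:00.000Z")
duration = end - start
print(start.isoformat(), end.isoformat(), duration)
```

For the example name, the interval is exactly one hour long, which matches an hourly segment granularity.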

Here is the segment information in MySQL:

Id | dataSource | created_date | start | end | partitioned | version | used | payload
pulsar_event_2014-09-15T05:00:00.000-07:00_2014-09-15T06:00:00.000-07:00_2014-09-15T05:00:00.000-07:00_1 | pulsar_event | 2014-09-15T09:37:30.231-07:00 | 2014-09-15T05:00:00.000-07:00 | 2014-09-15T06:00:00.000-07:00 | 1 | 2014-09-15T05:00:00.000-07:00 | 0 | {"dataSource":"pulsar_event","interval":"2014-09-15T05:00:00.000-07:00/2014-09-15T06:00:00.000-07:00","version":"2014-09-15T05:00:00.000-07:00","loadSpec":{"type":"hdfs","path":"hdfs://xxxx/20140915T050000.000-0700_20140915T060000.000-0700/2014-09-15T05_00_00.000-07_00/1/index.zip"},"dimensions":"browserfamily,browserversion,city,continent,country,deviceclass,devicefamily,eventtype,guid,js_ev_type,linespeed,osfamily,osversion,page,region,sessionid,site,tenant,timestamp,uid","metrics":"count","shardSpec":{"type":