Druid Concepts
Druid is an open source data store designed for OLAP queries on event data. This page is meant to provide readers with a high-level overview of how Druid stores data and of the architecture of a Druid cluster.
The Data
To frame our discussion, let's begin with an example data set (from online advertising).
timestamp             publisher          advertiser  gender  country  click  price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    USA      0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53
This data set is composed of three distinct components. If you are acquainted with OLAP terminology, the following concepts should be familiar.
- Timestamp column: We treat the timestamp separately because all of our queries center around the time axis.
- Dimension columns: Dimensions are string attributes of an event, and the columns most commonly used in filtering the data. We have four dimensions in our example data set: publisher, advertiser, gender, and country. They each represent an axis of the data that we've chosen to slice across.
- Metric columns: Metrics are columns used in aggregations and computations. In our example, the metrics are clicks and price. Metrics are usually numeric values, and computations include operations such as count, sum, and mean. They are also known as measures in standard OLAP terminology.
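The three column roles can be illustrated with a small Python sketch. This is not Druid code: the `aggregate` helper and its parameters are hypothetical names chosen for illustration, and the rows are copied from the example table. It filters on one dimension, groups by another, and computes a count and sums over the metrics.

```python
# Column roles, taken from the example data set.
TIMESTAMP = "timestamp"
DIMENSIONS = ["publisher", "advertiser", "gender", "country"]
METRICS = ["click", "price"]

# Three rows copied from the example table.
events = [
    {"timestamp": "2011-01-01T01:01:35Z", "publisher": "bieberfever.com",
     "advertiser": "google.com", "gender": "Male", "country": "USA",
     "click": 0, "price": 0.65},
    {"timestamp": "2011-01-01T01:04:51Z", "publisher": "bieberfever.com",
     "advertiser": "google.com", "gender": "Male", "country": "USA",
     "click": 1, "price": 0.45},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "ultratrimfast.com",
     "advertiser": "google.com", "gender": "Female", "country": "UK",
     "click": 1, "price": 1.53},
]


def aggregate(rows, filters, group_by):
    """Keep rows whose dimensions match `filters`, then count and sum metrics per group.

    Hypothetical helper for illustration only; not part of any Druid API.
    """
    totals = {}
    for row in rows:
        # Dimensions are the columns we filter on.
        if any(row[dim] != value for dim, value in filters.items()):
            continue
        bucket = totals.setdefault(row[group_by], {"count": 0, "click": 0, "price": 0.0})
        bucket["count"] += 1
        # Metrics are the columns we aggregate.
        for metric in METRICS:
            bucket[metric] += row[metric]
    return totals


result = aggregate(events, filters={"advertiser": "google.com"}, group_by="publisher")
for publisher, stats in sorted(result.items()):
    print(publisher, stats)
```

Here `publisher` acts as the axis we slice across, while `click` and `price` are only ever summed, which mirrors the dimension/metric split described above.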
Sharding the Data
Druid shards are called segments, and Druid always shards data by time first. In our compacted data set, we can create two segments, one for each hour of data.
For example:
Segment sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0
contains
2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70
2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18
Segment sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0
contains
2011-01-01T02:00:00Z ultratrimfast.com google.com Male UK 1953 17 17.31
2011-01-01T02:00:00Z bieberfever.com google.com Male UK 3194 170 34.01
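As a rough illustration of time-based sharding, the following Python sketch groups the four rows listed above into hourly buckets. It is only a sketch under stated assumptions: `hour_bucket` and `fmt` are hypothetical helpers, the one-hour granularity is taken from the example, and the trailing numeric values are copied as-is without interpreting them. It does not reflect how Druid itself builds segments.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hour_bucket(ts):
    """Return the (start, end) hour interval that contains a UTC timestamp string."""
    t = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    start = t.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


def fmt(t):
    """Format a datetime back into the timestamp style used in the example."""
    return t.strftime(TS_FORMAT)


# Rows copied from the two example segments (timestamp first, other values untouched).
rows = [
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Male", "USA", 1800, 25, 15.70),
    ("2011-01-01T01:00:00Z", "bieberfever.com", "google.com", "Male", "USA", 2912, 42, 29.18),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Male", "UK", 1953, 17, 17.31),
    ("2011-01-01T02:00:00Z", "bieberfever.com", "google.com", "Male", "UK", 3194, 170, 34.01),
]

# Shard by time: every row goes into the bucket for its hour.
buckets = defaultdict(list)
for row in rows:
    buckets[hour_bucket(row[0])].append(row)

# Label each bucket with a readable "start/end" interval string.
segments = {f"{fmt(start)}/{fmt(end)}": members for (start, end), members in buckets.items()}

for interval, members in sorted(segments.items()):
    print(interval, len(members), "rows")
```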
Segments are self-contained containers for the time interval of data they hold. Segments contain data stored in compressed, column-oriented form, along with the indexes for those columns. Druid queries only understand how to scan segments.
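To make "column-oriented data plus indexes" more concrete, here is a toy Python sketch. It is an assumption-laden illustration, not Druid's actual segment format: compression is omitted, the index is a plain value-to-row-positions map, and `build_index` and `sum_where` are hypothetical helpers. The values come from the example data set.

```python
from collections import defaultdict

# Row-oriented input: (publisher, country, price), copied from the example table.
rows = [
    ("bieberfever.com", "USA", 0.65),
    ("bieberfever.com", "USA", 0.62),
    ("bieberfever.com", "USA", 0.45),
    ("ultratrimfast.com", "UK", 0.87),
    ("ultratrimfast.com", "UK", 0.99),
    ("ultratrimfast.com", "UK", 1.53),
]

# Column-oriented layout: one list per column instead of one tuple per row.
columns = {
    "publisher": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}


def build_index(values):
    """Map each distinct value to the row positions where it occurs."""
    index = defaultdict(list)
    for position, value in enumerate(values):
        index[value].append(position)
    return dict(index)


# Index only the dimension columns, since those are the ones used for filtering.
indexes = {name: build_index(columns[name]) for name in ("publisher", "country")}


def sum_where(metric, dimension, value):
    """Sum a metric column over the rows where `dimension` equals `value`."""
    positions = indexes[dimension].get(value, [])
    return sum(columns[metric][p] for p in positions)


total = sum_where("price", "publisher", "ultratrimfast.com")
print(round(total, 2))
```

The point of the layout is that a filtered aggregation touches only the index for one dimension and the values of one metric column, rather than every field of every row.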
Segments are uniquely identified by a datasource, interval, version, and an optional partition number. Examining our example segments, the segments are named following this convention:
dataSource_interval_version_partitionNumber
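The naming convention can be sketched in Python as follows. This is a minimal illustration, not Druid's own parser: the `SegmentId` type and both helper functions are hypothetical, the interval is assumed to be written as start and end joined by an underscore (as in the example names), and the data source name is assumed to contain no underscores.

```python
from typing import NamedTuple, Optional


class SegmentId(NamedTuple):
    """Hypothetical container for the parts of a segment name."""
    data_source: str
    interval_start: str
    interval_end: str
    version: str
    partition_number: Optional[int]


def format_segment_id(seg):
    """Join the parts as dataSource_interval_version_partitionNumber."""
    parts = [seg.data_source, seg.interval_start, seg.interval_end, seg.version]
    if seg.partition_number is not None:
        parts.append(str(seg.partition_number))
    return "_".join(parts)


def parse_segment_id(text):
    """Split a segment name into its parts.

    Assumes no part contains an underscore, which is a simplification.
    """
    parts = text.split("_")
    if len(parts) == 5:
        return SegmentId(parts[0], parts[1], parts[2], parts[3], int(parts[4]))
    if len(parts) == 4:
        # The partition number is optional.
        return SegmentId(parts[0], parts[1], parts[2], parts[3], None)
    raise ValueError(f"unexpected segment name: {text!r}")


# One of the example segment names, copied verbatim.
name = "sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0"
seg = parse_segment_id(name)
print(seg.data_source, seg.version, seg.partition_number)
```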