druid-概念翻译

最新推荐文章于 2020-04-14 15:56:37 发布

一个理性的疯子

最新推荐文章于 2020-04-14 15:56:37 发布

阅读量242

点赞数

文章标签： Druid Druid概念

Druid Concepts

Druid is an open source data store designed for OLAP queries on event data,this page is meant to provide readers with a high level overview of how Druid stores data, and the architecture of a Druid cluster.

druid 是一个对事件数据设计的开源OLAP查询存储引擎，本页旨在提高读者对Druid存储的高级特性以及集群的结构。

The data

to frame our discussion ,let's begin with an example data set(from online advertising).

数据

先框定我们的讨论范围，我们从事例数据（在线广告）开始

timestamp             publisher          advertiser  gender  country  click  price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    USA      0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53

This data set is composed of three distinct components, if you are acquainted with OLAP terminology,the following concepts should be familiar.

数据由三个不同组件组成，如果你熟悉OLAP术语，这些概念对你来说很熟悉。

Timestamp column: We treat timestamp separately because all of our queries center around the time axis
Dimension column:Dimensions are string attributes of an event,and the column most commonly used in filtering the data. We have four dimensions in our example data set: publisher, advertiser,gender,and country.They each represent and axis of the data that we've chosen to slice across.
Metric column:Metrics are columns used in aggregations and computations,In our example,the metric are clicks and price.Metrics are usually numeric values,and computations include operations such as count,sum and mean.Also know as measures in standard OLAP terminology.
时间列：我们区别对待时间，是因为我们所有查询都围绕着时间轴。
纬度列：在事件中纬度是字符串属性，纬度列通常用在过滤数据上，实例中我们有四个纬度：发布者，广告主，性别，国家/地区。他们每个代表我们选择分片的数据轴。
指标列：指标列用在聚合和计算上，在我们的事例中，指标是点击量和价格，指标使用数字值，计算包括诸如计数，求和和平均值，也是总所周知OLAP标准术语中的度量。

Sharding the Data

Druid shards are called segments and Druid always first shards data by time,In our compacted data set,we can create two segments,one for each hour of data.

分片数据

Druid shards 叫做segment，Druid总是以时间先分解数据，在我们的压缩数据中，我们可以创建两个segment，每小时一个数据。

For example:

Segment sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0 contains

 2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male   USA     1800        25     15.70
 2011-01-01T01:00:00Z  bieberfever.com    google.com  Male   USA     2912        42     29.18

Segment sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0 contains

 2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male   UK      1953        17     17.31
 2011-01-01T02:00:00Z  bieberfever.com    google.com  Male   UK      3194        170    34.01

Segments are self-contained containers for the time interval of data they hold.Segments contain data stored in compressed column orientations.along with the indexes for those columns.Druid queries only understand how to scan segments.

Segments are uniquely identified by a datasource,interval,version,and a optional partition number.Examining our example segments.the segments are named following this convention:

dataSource_interval_version_partitionNumber