4 - Druid Data Ingestion - 1

1. Data Formats

[1] Data Formats
http://druid.io/docs/0.10.1/ingestion/data-formats.html
(1) Ingesting denormalized data: JSON, CSV, TSV
(2) Custom formats: parsed with the Regex parser or the JavaScript parser (see the sketch after this list)
(3) Other formats:
http://druid.io/docs/0.10.1/development/extensions.html
[2] Configuration
The data format is configured in the parseSpec field of the dataSchema.
For details see: http://druid.io/docs/0.10.1/ingestion/data-formats.html
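As an illustration of the custom-format case, here is a minimal sketch of a Regex ParseSpec; the pattern and column names are hypothetical and would need to match your actual data:

"parser" : {
  "type" : "string",
  "parseSpec" : {
    "format" : "regex",
    "pattern" : "^(\\d+),(\\w+)$",
    "columns" : [ "timestamp", "page" ],
    "timestampSpec" : { "column" : "timestamp", "format" : "millis" },
    "dimensionsSpec" : { "dimensions" : [ "page" ] }
  }
}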

2. Data Schema

This section covers the ingestion spec, i.e. the rules that govern ingestion.
An ingestion spec consists of three main parts:


{
  "dataSchema" : {...},
  "ioConfig" : {...},
  "tuningConfig" : {...}
}
Field        | Type        | Description                                                                        | Required
dataSchema   | JSON Object | Specifies the schema of the incoming data; can be shared by different specs       | yes
ioConfig     | JSON Object | Specifies where the data comes from and where it goes; varies by ingestion method | yes
tuningConfig | JSON Object | Specifies how to tune various ingestion parameters; varies by ingestion method    | no
DataSchema
 
"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions": [
          "page",
          "language",
          "user",
          "unpatrolled",
          "newPage",
          "robot",
          "anonymous",
          "namespace",
          "continent",
          "country",
          "region",
          "city",
          {
            "type": "long",
            "name": "countryNum"
          },
          {
            "type": "float",
            "name": "userLatitude"
          },
          {
            "type": "float",
            "name": "userLongitude"
          }
        ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [{
    "type" : "count",
    "name" : "count"
  }, {
    "type" : "doubleSum",
    "name" : "added",
    "fieldName" : "added"
  }, {
    "type" : "doubleSum",
    "name" : "deleted",
    "fieldName" : "deleted"
  }, {
    "type" : "doubleSum",
    "name" : "delta",
    "fieldName" : "delta"
  }],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  }
}
Field           | Type              | Description                                                                          | Required
dataSource      | String            | The name of the datasource to ingest into; a datasource can be thought of as a table | yes
parser          | JSON Object       | Specifies how the ingested data is parsed                                            | yes
metricsSpec     | JSON Object array | A list of aggregators                                                                | yes
granularitySpec | JSON Object       | Specifies how to create segments and roll up data                                    | yes
Parser
Parser
"parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions": [
          "page",
          "language",
          "user",
          "unpatrolled",
          "newPage",
          "robot",
          "anonymous",
          "namespace",
          "continent",
          "country",
          "region",
          "city",
          {
            "type": "long",
            "name": "countryNum"
          },
          {
            "type": "float",
            "name": "userLatitude"
          },
          {
            "type": "float",
            "name": "userLongitude"
          }
        ],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  }
The type defaults to string; for other data formats see the extensions list.
String Parser
Field     | Type        | Description                                                            | Required
type      | String      | Generally string, or hadoopyString when used in a Hadoop indexing job | no
parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data           | yes
ParseSpec

ParseSpecs serve two purposes:

  • The String Parser uses the parseSpec to determine the format (JSON, CSV, TSV) of the rows to be processed.
  • All Parsers use the parseSpec to determine the timestamp and dimensions of the rows to be processed.

The format field defaults to tsv.
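Since tsv is the default format, a minimal sketch of a TSV parseSpec may help; the column names below are borrowed from the wikipedia example above:

"parseSpec" : {
  "format" : "tsv",
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "columns" : [ "timestamp", "page", "language", "user" ],
  "delimiter" : "\t",
  "dimensionsSpec" : { "dimensions" : [ "page", "language", "user" ] }
}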

JSON ParseSpec
Field          | Type        | Description                                                                  | Required
format         | String      | json                                                                         | no
timestampSpec  | JSON Object | Specifies the column and format of the timestamp                             | yes
dimensionsSpec | JSON Object | Specifies the dimensions of the data                                         | yes
flattenSpec    | JSON Object | Specifies flattening configuration for nested JSON data; see Flattening JSON | no
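A minimal sketch of a flattenSpec, assuming a nested input field foo.bar that should become the top-level column foo_bar (the same example used in the Nested dimensions section below):

"parseSpec" : {
  "format" : "json",
  "flattenSpec" : {
    "useFieldDiscovery" : true,
    "fields" : [
      { "type" : "path", "name" : "foo_bar", "expr" : "$.foo.bar" }
    ]
  },
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "dimensionsSpec" : { "dimensions" : [ "foo_bar" ] }
}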
JSON Lowercase ParseSpec

Lowercases all the column names in the incoming JSON data.

Field          | Type        | Description                                      | Required
format         | String      | jsonLowercase                                    | yes
timestampSpec  | JSON Object | Specifies the column and format of the timestamp | yes
dimensionsSpec | JSON Object | Specifies the dimensions of the data             | yes
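A minimal sketch, differing from the JSON parseSpec only in the format value:

"parseSpec" : {
  "format" : "jsonLowercase",
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "dimensionsSpec" : { "dimensions" : [ "page", "user" ] }
}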
CSV ParseSpec

Use this with the String Parser to load CSV. Strings are parsed using the net.sf.opencsv library.

Field          | Type        | Description                                      | Required
format         | String      | csv                                              | yes
timestampSpec  | JSON Object | Specifies the column and format of the timestamp | yes
dimensionsSpec | JSON Object | Specifies the dimensions of the data             | yes
listDelimiter  | String      | A custom delimiter for multi-value dimensions    | no (default == ctrl+A)
columns        | JSON array  | Specifies the columns of the data                | yes
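A minimal sketch of a CSV parseSpec; note that columns must list every column in the data, in order:

"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
  "columns" : [ "timestamp", "page", "language", "user", "added" ],
  "dimensionsSpec" : { "dimensions" : [ "page", "language", "user" ] }
}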
TimestampSpec
Field  | Type   | Description                                      | Required
column | String | The column of the timestamp                      | yes
format | String | iso, millis, posix, auto or any Joda time format | no (default == 'auto')
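For example, a sketch of a timestampSpec using an explicit Joda time format rather than auto (the column name event_time is hypothetical):

"timestampSpec" : {
  "column" : "event_time",
  "format" : "yyyy-MM-dd HH:mm:ss"
}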
DimensionsSpec
Field               | Type              | Description | Required
dimensions          | JSON array        | A list of dimension schema objects or dimension names, identifying the dimension columns; if left empty, all string columns other than the timestamp are treated as dimensions | yes
dimensionExclusions | JSON String array | The names of dimensions to exclude from ingestion | no (default == [])
spatialDimensions   | JSON Object array | Spatial dimensions                                | no (default == [])
Dimension Schema

A dimension schema specifies the type and name of a dimension to be ingested. If no type is specified explicitly, the dimension is treated as a string.

 "dimensionsSpec" : {
  "dimensions": [
    "page",
    "language",
    "user",
    "unpatrolled",
    "newPage",
    "robot",
    "anonymous",
    "namespace",
    "continent",
    "country",
    "region",
    "city",
    {
      "type": "long",
      "name": "countryNum"
    },
    {
      "type": "float",
      "name": "userLatitude"
    },
    {
      "type": "float",
      "name": "userLongitude"
    }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}
 
GranularitySpec
 "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  }

The default granularity spec is uniform; this can be configured via the type field. Currently uniform and arbitrary types are supported.

 
Uniform Granularity Spec

This spec is used to generate segments with uniform intervals.

 
Field              | Type    | Description | Required
segmentGranularity | string  | The granularity at which to create segments | no (default == 'DAY')
queryGranularity   | string  | The minimum granularity at which results can be queried; data is rolled up to this granularity within each segment. E.g. "minute" means data is aggregated at minute granularity: when there are collisions on the (minute(timestamp), dimensions) tuple, the values are combined using the aggregators instead of storing individual rows | no (default == 'NONE')
rollup             | boolean | Whether or not to roll up data | no (default == true)
intervals          | string  | A list of intervals over which the raw data is ingested; ignored for real-time ingestion | yes for batch, no for real-time
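A sketch of a uniform spec with the type written out explicitly and minute-level roll-up (the specific granularities and interval are illustrative):

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "HOUR",
  "queryGranularity" : "MINUTE",
  "rollup" : true,
  "intervals" : [ "2013-08-31/2013-09-01" ]
}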
 
Arbitrary Granularity Spec

This spec determines intervals based on segment size; it is not supported for real-time ingestion.

Field            | Type    | Description                    | Required
queryGranularity | string  | Same as above                  | no (default == 'NONE')
rollup           | boolean | Whether or not to roll up data | no (default == true)
intervals        | string  | Same as above                  | yes for batch, no for real-time
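A minimal sketch of an arbitrary spec; segmentGranularity is omitted because intervals are derived from segment size:

"granularitySpec" : {
  "type" : "arbitrary",
  "queryGranularity" : "NONE",
  "intervals" : [ "2013-08-31/2013-09-01" ]
}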

3. Schema Design

Druid classifies each column of normalized data into one of three types: a timestamp, a dimension, or a measure (known in Druid as a metric/aggregator).

In more detail:

  • Every row must have a timestamp. Data is partitioned by time, every query carries a time filter, and query results can be bucketed by time (minutes, hours, days, and so on).
  • Dimensions can be filtered on or grouped by. They are typically single Strings, arrays of Strings, single Longs, or single Floats.
  • Metrics can be aggregated and sorted.

Production tables (datasources) typically have fewer than 100 dimension columns and fewer than 100 metrics.

Numeric dimensions

Numeric dimensions (Long or Float) must be declared in the dimensionsSpec; otherwise a column defaults to string. Numeric columns are faster to group on, but slower to filter on because they carry no indexes. See Dimension Schema.

High cardinality dimensions (e.g. unique IDs)

In practice, exact count-distinct is often unnecessary. Storing raw unique IDs as a column kills roll-up and hurts compression. Instead, aggregate the IDs at ingestion time (see the sketch below), which improves performance and reduces storage. Druid's hyperUnique aggregator is based on HyperLogLog.
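A minimal sketch of such a metricsSpec entry, assuming a raw ID field named user_id (both user_id and unique_users are hypothetical names; see also the device_id example later in this section):

"metricsSpec" : [
  { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_id" }
]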

Nested dimensions

Nested dimensions are not supported, so data such as:

 {"foo":{"bar": 3}}

should be flattened before indexing (for example with the flattenSpec sketched earlier) into:

 {"foo_bar": 3}
Counting the number of ingested events

A count aggregator at ingestion time counts the number of ingested rows; at query time, sum it with a longSum aggregator. Comparing that sum with the number of raw input events gives the roll-up rate.

In the ingestion spec:

 ...
"metricsSpec" : [
      {
        "type" : "count",
        "name" : "count"
      },
...

Query the number of ingested events as follows:

 ...
"aggregations": [
    { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },
...
Schema-less dimensions

If dimensions is left empty in the spec, all columns other than the timestamp are treated as string dimensions.
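A sketch of a schema-less dimensionsSpec, where dimensions is deliberately left empty:

"dimensionsSpec" : {
  "dimensions" : [],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}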


Including the same column as a dimension and a metric

A column sometimes needs to serve as a dimension while, for unique-count purposes, also feeding a hyperUnique metric. To make this possible, the column must be duplicated during ETL.


Duplicate the column in the ETL:

 {"device_id_dim":123, "device_id_met":123}
In the metricsSpec:
 { "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }

device_id_dim is automatically picked up as a dimension.


4. Schema Changes

The schema of a datasource can be changed at any time, and different schemas among its segments are supported.

Replacing Segments

Druid identifies a segment by datasource, interval, version, and partition number. The partition number appears in the segment ID only when multiple segments are created for the same granularity interval: for example, with hourly segments, if one hour holds more data than fits in a single segment, multiple segments are created for that hour and distinguished by partition number.

foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-01/2015-01-02_v1_1
foo_2015-01-01/2015-01-02_v1_2
Here dataSource = foo, interval = 2015-01-01/2015-01-02, version = v1, partitionNum = 0. If the data is later reindexed with a new schema, the newly created segments get a higher version id:
foo_2015-01-01/2015-01-02_v2_0
foo_2015-01-01/2015-01-02_v2_1
foo_2015-01-01/2015-01-02_v2_2
For batch indexing (either Hadoop-based or IndexTask-based), Druid guarantees atomic updates on an interval-by-interval basis. In this example, queries keep using the v1 segments until all v2 segments for the 2015-01-01/2015-01-02 interval are loaded into the cluster; only then do queries switch to v2, and the v1 segments are unloaded from the cluster.

Updates that span multiple segment intervals are therefore atomic only within each interval, not as a whole. For example, given these segments:

foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v1_1
foo_2015-01-03/2015-01-04_v1_2
until all v2 segments are loaded, the cluster may hold a mixture:
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v2_1
foo_2015-01-03/2015-01-04_v1_2

In this case, queries may hit a mixture of v1 and v2 segments.

Different Schemas Among Segments

Segments of the same datasource may have different schemas. If a string column (dimension) exists in segment A but not in segment B, queries touching B behave as if the dimension were null in B. For a numeric column (metric) missing from B, aggregations over B simply skip it.
