5 - Druid Data Ingestion - 2

  • 1. Stream Ingestion

    Streams can be ingested through Tranquility (a Druid-aware client that provides load balancing and other services, at the cost of extra complexity), through the indexing service, or through standalone Realtime nodes (which have limitations).

    Stream push

    Tranquility can be embedded into your own application. It integrates with the Storm and Samza stream processors, and it also exposes APIs that can be used directly from any JVM-based program (such as Spark Streaming or a Kafka consumer). Tranquility handles partitioning, replication, service discovery, and schema rollover; the user only needs to design the schema (see the Tranquility README).

    Stream push works mainly through the Tranquility server and the indexing service.

    Stream pull

    A Realtime Node pulls data from a Firehose, which connects directly to the data source. Druid ships with built-in Firehoses for different sources (Kafka, RabbitMQ, etc.).

    Stream pull works mainly through Realtime nodes; note that standalone Realtime nodes have limitations compared to the indexing service.

    More information

    For the push-based approach, see here.

    For the pull-based approach, see here.

    2. Stream Push

     See the stream loading tutorial to get started.

    Druid receives pushed streams through Tranquility, which must be downloaded separately (GitHub: Tranquility).

    The incoming data must be recent enough (within a configurable windowPeriod of the current time). Older data is not handled by streaming ingestion; batch ingestion is the better choice for it.

    Stream push relies mainly on the Tranquility server and the indexing service.

    Server

    Druid can use Tranquility Server, which lets you send data to Druid without developing a JVM app. Tranquility Server can be run colocated with Druid middleManagers and historical processes.

    Start the Tranquility server:

     bin/tranquility server -configFile <path_to_config_file>/server.json
    

     To customize Tranquility Server:

    • In server.json, customize the properties and dataSources (a minimal sketch appears below).
    • If a Tranquility instance is already running, stop it with Ctrl-C and restart it so the new configuration takes effect.

    For customizing server.json, refer to the Loading your own streams tutorial and the Tranquility Server documentation.
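
    A minimal server.json sketch is shown below. It follows the general Tranquility Server layout (a top-level dataSources map plus global properties); the "pageviews" dataSource, its dimensions and metric, and the port are illustrative assumptions borrowed from the Loading your own streams tutorial, so verify the exact property names against the Tranquility Server documentation:

     {
       "dataSources" : {
         "pageviews" : {
           "spec" : {
             "dataSchema" : {
               "dataSource" : "pageviews",
               "parser" : {
                 "type" : "string",
                 "parseSpec" : {
                   "format" : "json",
                   "timestampSpec" : { "column" : "time", "format" : "auto" },
                   "dimensionsSpec" : { "dimensions" : ["url", "user"] }
                 }
               },
               "metricsSpec" : [ { "type" : "count", "name" : "views" } ],
               "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "hour", "queryGranularity" : "none" }
             },
             "tuningConfig" : {
               "type" : "realtime",
               "windowPeriod" : "PT10M",
               "intermediatePersistPeriod" : "PT10M",
               "maxRowsInMemory" : 75000
             }
           },
           "properties" : { "task.partitions" : "1", "task.replicants" : "1" }
         }
       },
       "properties" : {
         "zookeeper.connect" : "localhost:2181",
         "http.port" : "8200",
         "http.threads" : "8"
       }
     }

    The timestamp column ("time"), dimensions, and metric above are placeholders; replace them with your own schema.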

    Kafka

    Tranquility Kafka is used to load data from Kafka; no coding is required, only a configuration file.

    Start Tranquility Kafka:

     bin/tranquility kafka -configFile <path_to_config_file>/kafka.json
    

    Following the configuration used in the single-machine quickstart:

    • In kafka.json, define the properties and dataSources.
    • If a Tranquility instance is already running, restart it to pick up the changes.

    For more details, see the Tranquility Kafka documentation.
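
    A rough kafka.json sketch follows. The layout mirrors server.json (a dataSources map plus global properties), with the Kafka topic mapped via topicPattern and the Kafka connection settings in the global properties. Property names such as kafka.zookeeper.connect, kafka.group.id, and topicPattern are recalled from the Tranquility Kafka documentation and should be double-checked there:

     {
       "dataSources" : {
         "pageviews-kafka" : {
           "spec" : {
             "dataSchema" : {
               "dataSource" : "pageviews-kafka",
               "parser" : {
                 "type" : "string",
                 "parseSpec" : {
                   "format" : "json",
                   "timestampSpec" : { "column" : "time", "format" : "auto" },
                   "dimensionsSpec" : { "dimensions" : ["url", "user"] }
                 }
               },
               "metricsSpec" : [ { "type" : "count", "name" : "views" } ],
               "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "hour", "queryGranularity" : "none" }
             },
             "tuningConfig" : {
               "type" : "realtime",
               "windowPeriod" : "PT10M",
               "intermediatePersistPeriod" : "PT10M",
               "maxRowsInMemory" : 75000
             }
           },
           "properties" : {
             "task.partitions" : "1",
             "task.replicants" : "1",
             "topicPattern" : "pageviews"
           }
         }
       },
       "properties" : {
         "zookeeper.connect" : "localhost:2181",
         "kafka.zookeeper.connect" : "localhost:2181",
         "kafka.group.id" : "tranquility-kafka",
         "consumer.numThreads" : "2",
         "commit.periodMillis" : "15000"
       }
     }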

    JVM apps and stream processors

    Tranquility can be embedded into JVM-based applications as a library (GitHub: Tranquility), either through its Core API or through its built-in connectors for stream processors such as Storm, Samza, Spark Streaming, and Flink.

    Concepts

    Task creation

    Tranquility automatically creates Druid realtime indexing tasks and handles partitioning, replication, service discovery, and schema rollover. Specifically, Tranquility periodically spawns relatively short-lived tasks, each of which handles a small set of Druid segments, and coordinates these tasks through ZooKeeper. For more details on how tasks are managed, see the Tranquility overview.

    segmentGranularity and windowPeriod

    segmentGranularity

    The time granularity covered by each segment. For example, with segmentGranularity "hour", tasks create segments that each cover one hour of data.

    windowPeriod 

    The slack time allowed for events. For example, with the default windowPeriod of ten minutes, events whose timestamps are more than ten minutes in the past or more than ten minutes in the future are dropped and not processed.

    Together, these settings determine how long tasks stay alive and how long data lingers in the realtime system before being handed off to the historical nodes. For example, with segmentGranularity "hour" and a windowPeriod of ten minutes, a task listens for events for an hour and ten minutes.
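
    As a fragment only (the placement mirrors the specFile example later in this section), segmentGranularity is set in the granularitySpec while windowPeriod is set in the tuningConfig:

     "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "hour", "queryGranularity" : "none" },
     "tuningConfig" : { "type" : "realtime", "windowPeriod" : "PT10M" }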

    Append only

    Druid streaming ingestion is append-only; updates and deletes of already-ingested data are not supported. If you need them, use the batch reindexing process with batch ingestion.

    Guarantees

    Tranquility does not guarantee exactly-once delivery. In some scenarios it will drop or duplicate events:

    • Events outside the windowPeriod are dropped.
    • If Druid MiddleManager failures exceed the configured limit, partially indexed data may be lost.
    • If connectivity with the Druid indexing service is lost for long enough (retries are exhausted, or the outage exceeds the windowPeriod), some events will be lost.
    • If Tranquility does not receive an acknowledgement from the indexing service, it retries the batch, which can duplicate events.
    • When Tranquility is embedded in Storm or Samza, the at-least-once design of those architectures can also lead to duplicated events.

    These cases are rare, but if you need a 100% guarantee, a hybrid batch/streaming architecture is recommended.

    Tranquility documentation and Configuration

    Tranquility documentation: here. Tranquility configuration: here. Tranquility's tuningConfig: here.

    3. Stream Pull Ingestion

    A Realtime Node pulls data from a Firehose, which connects directly to the data source; Druid has built-in Firehoses for different sources (Kafka, RabbitMQ, etc.). The quickstart does not cover setting up standalone realtime nodes, but they can be used in place of the Tranquility server and the indexing service; note that they have limitations compared to the indexing service.

    Realtime Node Ingestion

    For Real-time Node information, see here.

    For Real-time Node configuration, see Realtime Configuration.

    For how to write your own plugins to feed a real-time node, see Firehose.

    Realtime "specFile"

    A sample druid.realtime.specFile:

     [
      {
        "dataSchema" : {
          "dataSource" : "wikipedia",
          "parser" : {
            "type" : "string",
            "parseSpec" : {
              "format" : "json",
              "timestampSpec" : {
                "column" : "timestamp",
                "format" : "auto"
              },
              "dimensionsSpec" : {
                "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
                "dimensionExclusions" : [],
                "spatialDimensions" : []
              }
            }
          },
          "metricsSpec" : [{
            "type" : "count",
            "name" : "count"
          }, {
            "type" : "doubleSum",
            "name" : "added",
            "fieldName" : "added"
          }, {
            "type" : "doubleSum",
            "name" : "deleted",
            "fieldName" : "deleted"
          }, {
            "type" : "doubleSum",
            "name" : "delta",
            "fieldName" : "delta"
          }],
          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "DAY",
            "queryGranularity" : "NONE"
          }
        },
        "ioConfig" : {
          "type" : "realtime",
          "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
              "zookeeper.connect": "localhost:2181",
              "zookeeper.connection.timeout.ms" : "15000",
              "zookeeper.session.timeout.ms" : "15000",
              "zookeeper.sync.time.ms" : "5000",
              "group.id": "druid-example",
              "fetch.message.max.bytes" : "1048586",
              "auto.offset.reset": "largest",
              "auto.commit.enable": "false"
            },
            "feed": "wikipedia"
          },
          "plumber": {
            "type": "realtime"
          }
        },
        "tuningConfig": {
          "type" : "realtime",
          "maxRowsInMemory": 75000,
          "intermediatePersistPeriod": "PT10m",
          "windowPeriod": "PT10m",
          "basePersistDirectory": "\/tmp\/realtime\/basePersist",
          "rejectionPolicy": {
            "type": "serverTime"
          }
        }
      }
    ]

    Multiple realtime streams can be configured (the specFile is a JSON array). Each realtime stream generally needs two threads: one thread for data consumption and aggregation, and one thread for incremental persists and other background tasks.

    The configuration file consists of three main parts: dataSchema, IOConfig, and tuningConfig.

    DataSchema

    This field is required.

    See Ingestion

    IOConfig

    This field is required.

    Field | Type | Description | Required
    type | String | This should always be 'realtime'. | yes
    firehose | JSON Object | Where the data is coming from (the data source). | yes
    plumber | JSON Object | Where the data is going. | yes
    Firehose

    See Firehose for more information on various firehoses.

    Plumber
    Field | Type | Description | Required
    type | String | This should always be 'realtime'. | no

    TuningConfig

    Defaults are provided for all tuning options; you can override them to tune performance.

    Field | Type | Description | Required
    type | String | This should always be 'realtime'. | no
    maxRowsInMemory | Integer | The number of rows to aggregate before persisting; this is used to manage the required JVM heap size. Maximum heap memory used for indexing = maxRowsInMemory * (2 + maxPendingPersists). | no (default == 75000)
    windowPeriod | ISO 8601 Period String | The amount of slack time to allow events; defaults to ten minutes. See "segmentGranularity and windowPeriod" above. | no (default == PT10m)
    intermediatePersistPeriod | ISO 8601 Period String | The period between intermediate persists. | no (default == PT10m)
    basePersistDirectory | String | The directory to put things that need persistence. The plumber is responsible for the actual intermediate persists. | no (default == java tmp dir)
    versioningPolicy | Object | How to version segments. | no (default == based on segment start time)
    rejectionPolicy | Object | Controls how data sets the data acceptance policy for creating and handing off segments. More on this below. | no (default == 'serverTime')
    maxPendingPersists | Integer | Maximum number of persists that can be pending, but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. | no (default == 0; meaning one persist can be running concurrently with ingestion, and none can be queued up)
    shardSpec | Object | Describes the shard that is represented by this server, for data ingested in a sharded fashion. | no (default == 'NoneShardSpec')
    buildV9Directly | Boolean | Whether to build a v9 index directly instead of first building a v8 index and then converting it to v9 format. | no (default == true)
    persistThreadPriority | int | If -XX:+UseThreadPriorities is properly enabled, this will set the thread priority of the persisting thread to Thread.NORM_PRIORITY plus this value within the bounds of Thread.MIN_PRIORITY and Thread.MAX_PRIORITY. A value of 0 indicates to not change the thread priority. | no (default == 0; inherit and do not override)
    mergeThreadPriority | int | If -XX:+UseThreadPriorities is properly enabled, this will set the thread priority of the merging thread to Thread.NORM_PRIORITY plus this value within the bounds of Thread.MIN_PRIORITY and Thread.MAX_PRIORITY. A value of 0 indicates to not change the thread priority. Before enabling thread priority settings, users are highly encouraged to read the original pull request and other documentation about proper use of -XX:+UseThreadPriorities. | no (default == 0; inherit and do not override)
    reportParseExceptions | Boolean | If true, exceptions encountered during parsing will be thrown and will halt ingestion. If false, unparseable rows and fields will be skipped. If an entire row is skipped, the "unparseable" counter will be incremented. If some fields in a row were parseable and some were not, the parseable fields will be indexed and the "unparseable" counter will not be incremented. | no (default == false)
    handoffConditionTimeout | long | Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever. | no (default == 0)
    alertTimeout | long | Milliseconds timeout after which an alert is created if the task isn't finished by then. This allows users to monitor tasks that are failing to finish and give up the worker slot for any unexpected errors. | no (default == 0)
    indexSpec | Object | Tune how data is indexed. See below for more information. | no

     

    Rejection Policy
    • serverTime – The recommended policy for "current time" data, it is optimal for current data that is generated and ingested in real time. Uses windowPeriod to accept only those events that are inside the window looking forward and back.
    • messageTime – Can be used for non-"current time" as long as that data is relatively in sequence. Events are rejected if they are less than windowPeriod from the event with the latest timestamp. Hand off only occurs if an event is seen after the segmentGranularity and windowPeriod (hand off will not periodically occur unless you have a constant stream of data).
    • none – All events are accepted. Never hands off data unless shutdown() is called on the configured firehose.
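
    For example, switching the tuningConfig in the specFile above from the default serverTime policy to messageTime would look like the following (the rest of the tuningConfig stays unchanged):

     "rejectionPolicy": {
       "type": "messageTime"
     }
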
    IndexSpec
    Field | Type | Description | Required
    bitmap | Object | Compression format for bitmap indexes. Should be a JSON object; see below for options. | no (defaults to Concise)
    dimensionCompression | String | Compression format for dimension columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4)
    metricCompression | String | Compression format for metric columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4)
    Bitmap types

    For Concise bitmaps:

    Field | Type | Description | Required
    type | String | Must be concise. | yes

    For Roaring bitmaps:

    Field | Type | Description | Required
    type | String | Must be roaring. | yes
    compressRunOnSerialization | Boolean | Use a run-length encoding where it is estimated as more space efficient. | no (default == true)
    Sharding

    Druid uses shards, or segments with partition numbers, to more efficiently handle large amounts of incoming data. In Druid, shards represent the segments that together cover a time interval based on the value of segmentGranularity. If, for example, segmentGranularity is set to "hour", then a number of shards may be used to store the data for that hour. Sharding along dimensions may also occur to optimize efficiency.

    Segments are identified by datasource, time interval, and version. With sharding, a segment is also identified by a partition number. Typically, each shard will have the same version but a different partition number to uniquely identify it.

    In small-data scenarios, sharding is unnecessary and can be set to none (the default):

         "shardSpec": {"type": "none"}
    

    However, in scenarios with multiple realtime nodes, none is less useful as it cannot help with scaling data volume (see below). Note that for the batch indexing service, no explicit configuration is required; sharding is provided automatically.

    Druid uses sharding based on the shardSpec setting you configure. The recommended choices, linear and numbered, are discussed below; other types have been useful for internal Druid development but are not appropriate for production setups.

    Keep in mind that the sharding configuration has nothing to do with the configured firehose. For example, setting the partition number to 0 does not mean that the Kafka firehose will consume only from topic partition 0.

    Linear

    This strategy provides following advantages:

    • There is no need to update the fileSpec configurations of existing nodes when adding new nodes.
    • All unique shards are queried, regardless of whether the partition numbering is sequential or not (it allows querying of partitions 0 and 2, even if partition 1 is missing).

    Configure linear under schema:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 0
        }
    
    Numbered

    This strategy is similar to linear except that it does not tolerate non-sequential partition numbering (it will not allow querying of partitions 0 and 2 if partition 1 is missing). It also requires explicitly setting the total number of partitions.

    Configure numbered under schema:

         "shardSpec": {
            "type": "numbered",
            "partitionNum": 0,
            "partitions": 2
        }
    
    Scale and Redundancy

    The shardSpec configuration can be used to create redundancy by having the same partitionNum values on different nodes.

    For example, if RealTimeNode1 has:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 0
        }
    

    and RealTimeNode2 has:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 0
        }
    

    then two realtime nodes can store segments with the same datasource, version, time interval, and partition number. Brokers that query for data in such segments will assume that they hold the same data, and the query will target only one of the segments.

    shardSpec can also help achieve scale. For this, add nodes with a different partitionNum. Continuing with the example, if RealTimeNode3 has:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 1
        }
    

    then it can store segments with the same datasource, time interval, and version as in the first two nodes, but with a different partition number. Brokers that query for data in such segments will assume that a segment from RealTimeNode3 holds different data, and the query will target it along with a segment from the first two nodes.

    You can use type numbered similarly. Note that type none is essentially type linear with all shards having a fixed partitionNum of 0.

    Constraints

    intermediatePersistPeriod ≤ windowPeriod < segmentGranularity and queryGranularity ≤ segmentGranularity

    Name | Effect | Minimum | Recommended
    windowPeriod | the tolerated time window around the current time for event timestamps | time jitter tolerance | use this to reject outliers
    segmentGranularity | time granularity (minute, hour, day, week, month) for loading data at query time | equal to indexGranularity | more than queryGranularity
    queryGranularity | time granularity (minute, hour, day, week, month) for rollup | less than segmentGranularity | minute, hour, day, week, month
    intermediatePersistPeriod | the period at which in-memory data is pushed to disk | avoid excessive flushing | the number of rows that can accumulate between persists is bounded by maxRowsInMemory
    maxRowsInMemory | the maximum number of rows held in memory before data is pushed to disk | the number of rows expected within one intermediatePersistPeriod | use together with intermediatePersistPeriod to avoid running out of heap
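
    As an illustrative combination that satisfies these constraints (values chosen only as an example): intermediatePersistPeriod PT10M ≤ windowPeriod PT10M < segmentGranularity HOUR, and queryGranularity MINUTE ≤ segmentGranularity HOUR:

     "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "HOUR", "queryGranularity" : "MINUTE" },
     "tuningConfig" : {
       "type" : "realtime",
       "intermediatePersistPeriod" : "PT10M",
       "windowPeriod" : "PT10M",
       "maxRowsInMemory" : 75000
     }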

     

    Kafka

    Standalone realtime nodes use the Kafka high level consumer, which imposes a few restrictions.

    Druid replicates segment data across N nodes; if N–1 of those nodes go down, the data is still queryable. However, this replication breaks down when a Kafka topic must be read by multiple standard Kafka consumer groups (because consumers in different consumer groups will split up the data differently), so Druid can no longer guarantee identical replicas across the N nodes.

    In more detail:

    For example, let's say your topic is split across Kafka partitions 1, 2, & 3 and you have 2 real-time nodes with linear shard specs 1 & 2. Both of the real-time nodes are in the same consumer group. Real-time node 1 may consume data from partitions 1 & 3, and real-time node 2 may consume data from partition 2. Querying for your data through the broker will yield correct results.

    The problem arises if you want to replicate your data by creating real-time nodes 3 & 4. These new real-time nodes also have linear shard specs 1 & 2, and they will consume data from Kafka using a different consumer group. In this case, real-time node 3 may consume data from partitions 1 & 2, and real-time node 4 may consume data from partition 2. From Druid's perspective, the segments hosted by real-time nodes 1 and 3 are the same, and the data hosted by real-time nodes 2 and 4 are the same, although they are reading from different Kafka partitions. Querying for the data will yield inconsistent results.

    Is this always a problem? No. If your data is small enough to fit on a single Kafka partition, you can replicate without issues. Otherwise, you can run real-time nodes without replication.

    Locking

    Using stream pull ingestion together with batch ingestion can lead to data override issues.

    For example, suppose realtime ingestion is producing hourly segments for today's data while a daily batch job is also processing today's data. The batch job will produce segments with a higher version than the realtime ingestion, so if the batch job indexes today's data before the day is complete, the daily segment it produces will override the recent segments produced by the realtime nodes, and part of that data will be lost.

    Schema changes

    Standalone realtime nodes must be restarted to pick up schema changes.

    Log management

    Each standalone realtime node keeps its own logs, so diagnosing issues across many partitions spread over many servers can be difficult.


  • 4. Batch Data Ingestion

    Data can be ingested from static files.

    Hadoop-based Batch Ingestion

    Hadoop-based batch ingestion is performed by submitting a Hadoop-ingestion task to the Overlord. A sample task:


     {
      "type" : "index_hadoop",
      "spec" : {
        "dataSchema" : {
          "dataSource" : "wikipedia",
          "parser" : {
            "type" : "hadoopyString",
            "parseSpec" : {
              "format" : "json",
              "timestampSpec" : {
                "column" : "timestamp",
                "format" : "auto"
              },
              "dimensionsSpec" : {
                "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
                "dimensionExclusions" : [],
                "spatialDimensions" : []
              }
            }
          },
          "metricsSpec" : [
            {
              "type" : "count",
              "name" : "count"
            },
            {
              "type" : "doubleSum",
              "name" : "added",
              "fieldName" : "added"
            },
            {
              "type" : "doubleSum",
              "name" : "deleted",
              "fieldName" : "deleted"
            },
            {
              "type" : "doubleSum",
              "name" : "delta",
              "fieldName" : "delta"
            }
          ],
          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "DAY",
            "queryGranularity" : "NONE",
            "intervals" : [ "2013-08-31/2013-09-01" ]
          }
        },
        "ioConfig" : {
          "type" : "hadoop",
          "inputSpec" : {
            "type" : "static",
            "paths" : "/MyDirectory/example/wikipedia_data.json"
          }
        },
        "tuningConfig" : {
          "type": "hadoop"
        }
      },
      "hadoopDependencyCoordinates": <my_hadoop_version>
    }
    
    property | description | required?
    type | "index_hadoop". | yes
    spec | A Hadoop Index Spec. See Batch Ingestion. | yes
    hadoopDependencyCoordinates | A JSON array of Hadoop dependency coordinates that will override the default Hadoop coordinates. Once specified, Druid looks for the Hadoop dependencies in the location specified by druid.extensions.hadoopDependenciesDir. | no
    classpathPrefix | Classpath that will be prepended for the peon process. Druid computes the classpath of the Hadoop job containers automatically; if Hadoop's dependencies conflict with Druid's, specify the classpath manually by setting druid.extensions.hadoopContainerDruidClasspath in the base Druid configuration. | no


    DataSchema

    This field is required. See Ingestion.

    IOConfig

    This field is required.

    Field | Type | Description | Required
    type | String | This should always be 'hadoop'. | yes
    inputSpec | Object | A specification of where to pull the data in from. See below. | yes
    segmentOutputPath | String | The path to dump segments into. | yes
    metadataUpdateSpec | Object | A specification of how to update the metadata for the Druid cluster these segments belong to. | yes
    InputSpec specification

    There are multiple types of inputSpecs:

    static

    A type of inputSpec where a static path to the data files is provided.

    Field | Type | Description | Required
    paths | Array of String | A String of input paths indicating where the raw data is located. | yes

    For example, static input paths:

     "paths" : "s3n://billy-bucket/the/data/is/here/data.gz, s3n://billy-bucket/the/data/is/here/moredata.gz, s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
    
    granularity

    A type of inputSpec that expects data to be organized in directories according to datetime using the path format: y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX (where date is represented by lowercase and time is represented by uppercase).

    Field | Type | Description | Required
    dataGranularity | String | Specifies the granularity to expect the data at, e.g. hour means to expect directories y=XXXX/m=XX/d=XX/H=XX. | yes
    inputPath | String | Base path to append the datetime path to. | yes
    filePattern | String | Pattern that files should match to be included. | yes
    pathFormat | String | Joda datetime format for each directory. Default value is "'y'=yyyy/'m'=MM/'d'=dd/'H'=HH", or see Joda documentation. | no

    For example, if the sample config were run with the interval 2012-06-01/2012-06-02, it would expect data at the paths:

     s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
    s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
    ...
    s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
    
    dataSource

    Read Druid segments. See here for more information.

    multi

    Read multiple sources of data. See here for more information.

    TuningConfig

    The tuningConfig is optional and default parameters will be used if no tuningConfig is specified.

    Field | Type | Description | Required
    workingPath | String | The working path to use for intermediate results (results between Hadoop jobs). | no (default == '/tmp/druid-indexing')
    version | String | The version of created segments. Ignored for HadoopIndexTask unless useExplicitVersion is set to true. | no (default == datetime that indexing starts at)
    partitionsSpec | Object | A specification of how to partition each time bucket into segments. Absence of this property means no partitioning will occur. See 'Partitioning specification' below. | no (default == 'hashed')
    maxRowsInMemory | Integer | The number of rows to aggregate before persisting. Note that this is the number of post-aggregation rows which may not be equal to the number of input events due to roll-up. This is used to manage the required JVM heap size. | no (default == 75000)
    leaveIntermediate | Boolean | Leave behind intermediate files (for debugging) in the workingPath when a job completes, whether it passes or fails. | no (default == false)
    cleanupOnFailure | Boolean | Clean up intermediate files when a job fails (unless leaveIntermediate is on). | no (default == true)
    overwriteFiles | Boolean | Override existing files found during indexing. | no (default == false)
    ignoreInvalidRows | Boolean | Ignore rows found to have problems. | no (default == false)
    combineText | Boolean | Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files. | no (default == false)
    useCombiner | Boolean | Use Hadoop combiner to merge rows at mapper if possible. | no (default == false)
    jobProperties | Object | A map of properties to add to the Hadoop job configuration, see below for details. | no (default == null)
    indexSpec | Object | Tune how data is indexed. See below for more information. | no
    buildV9Directly | Boolean | Whether to build a v9 index directly instead of first building a v8 index and then converting it to v9 format. | no (default == true)
    numBackgroundPersistThreads | Integer | The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and cpu usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1. | no (default == 0)
    forceExtendableShardSpecs | Boolean | Forces use of extendable shardSpecs. Experimental feature intended for use with the Kafka indexing service extension. | no (default = false)
    useExplicitVersion | Boolean | Forces HadoopIndexTask to use version. | no (default = false)
    jobProperties field of TuningConfig
        "tuningConfig" : {
         "type": "hadoop",
         "jobProperties": {
           "<hadoop-property-a>": "<value-a>",
           "<hadoop-property-b>": "<value-b>"
         }
       }
    

    Hadoop's MapReduce documentation lists the possible configuration parameters.

    With some Hadoop distributions, it may be necessary to set mapreduce.job.classpath or mapreduce.job.user.classpath.first to avoid class loading issues. See the working with different Hadoop versions documentation for more details.
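
    For instance, such a switch goes into the jobProperties map shown above. Whether you need it, and which of the two properties to use, depends on your Hadoop distribution, so treat this fragment as an assumption to verify:

         "tuningConfig" : {
           "type": "hadoop",
           "jobProperties": {
             "mapreduce.job.user.classpath.first": "true"
           }
         }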

    IndexSpec
    Field | Type | Description | Required
    bitmap | Object | Compression format for bitmap indexes. Should be a JSON object; see below for options. | no (defaults to Concise)
    dimensionCompression | String | Compression format for dimension columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4)
    metricCompression | String | Compression format for metric columns. Choose from LZ4, LZF, uncompressed, or none. | no (default == LZ4)
    longEncoding | String | Encoding format for metric and dimension columns with type long. Choose from auto or longs. auto encodes the values using offset or lookup table depending on column cardinality, and stores them with variable size. longs stores the value as is with 8 bytes each. | no (default == longs)
    Bitmap types

    For Concise bitmaps:

    Field | Type | Description | Required
    type | String | Must be concise. | yes

    For Roaring bitmaps:

    Field | Type | Description | Required
    type | String | Must be roaring. | yes
    compressRunOnSerialization | Boolean | Use a run-length encoding where it is estimated as more space efficient. | no (default == true)

    Partitioning specification

    Segments are always partitioned by timestamp (via the granularitySpec), and may be further partitioned by another partition type. For example:

     "hashed" (based on the hash of all dimensions in each row),

    "dimension" (based on ranges of a single dimension).

    Hashed partitioning is recommended over dimension partitioning: it improves indexing performance and produces more uniformly sized segments than single-dimension partitioning.

    Hash-based partitioning
       "partitionsSpec": {
         "type": "hashed",
         "targetPartitionSize": 5000000
       }
    

    Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments according to the hash of all dimensions in each row. The number of segments is determined automatically based on the cardinality of the input set and a target partition size.

    The configuration options are:

    Field | Description | Required
    type | Type of partitionSpec to be used. | "hashed"
    targetPartitionSize | Target number of rows to include in a partition, should be a number that targets segments of 500MB~1GB. | either this or numShards
    numShards | Specify the number of partitions directly, instead of a target partition size. Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically. | either this or targetPartitionSize
    partitionDimensions | The dimensions to partition on. Leave blank to select all dimensions. Only used with numShards; will be ignored when targetPartitionSize is set. | no
    Single-dimension partitioning
       "partitionsSpec": {
         "type": "dimension",
         "targetPartitionSize": 5000000
       }
    

    Single-dimension partitioning works by first selecting a dimension to partition on, and then separating that dimension into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example, your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and "f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can override it with a specific dimension.

    The configuration options are:

    Field | Description | Required
    type | Type of partitionSpec to be used. | "dimension"
    targetPartitionSize | Target number of rows to include in a partition, should be a number that targets segments of 500MB~1GB. | yes
    maxPartitionSize | Maximum number of rows to include in a partition. Defaults to 50% larger than the targetPartitionSize. | no
    partitionDimension | The dimension to partition on. Leave blank to select a dimension automatically. | no
    assumeGrouped | Assume that input data has already been grouped on time and dimensions. Ingestion will run faster, but may choose sub-optimal partitions if this assumption is violated. | no

    Remote Hadoop Cluster

    To use a remote Hadoop cluster, place the Hadoop configuration *.xml files in the Druid _common configuration folder. If you run into dependency or version conflicts, see these docs.
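
    For example (the source and destination paths below are assumptions; adjust them to your Hadoop distribution and Druid installation layout):

      cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/yarn-site.xml /etc/hadoop/conf/mapred-site.xml conf/druid/_common/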

    Having Problems?

    Questions from first-time users have often already been answered on the Druid google groups page.

