5 - Druid Data Ingestion - 2

  • 1. Stream Ingestion

    Streams can be ingested through Tranquility (a Druid-aware client that provides load balancing and other services, at the cost of extra complexity), through the indexing service, or through standalone Realtime nodes (which have limitations).

    Stream push

    Tranquility can be embedded into your own application. It integrates with the Storm and Samza stream processors, and it also exposes APIs that can be used directly from any JVM-based program (such as Spark Streaming or a Kafka consumer). Tranquility handles partitioning, replication, service discovery, and schema rollover; the user only needs to design the schema (see the Tranquility README).

    Stream push works mainly through the Tranquility server and the indexing service.

    Stream pull

    A Realtime Node pulls data from a Firehose, which connects directly to the data source. Druid ships with built-in Firehoses for different sources (Kafka, RabbitMQ, etc.).

    Stream pull works mainly through Realtime nodes; note that standalone Realtime nodes have limitations compared to the indexing service.

    More information

    For the push-based approach, see here.

    For the pull-based approach, see here.

    2. Stream Push

     See the stream loading tutorial to get started.

    Druid receives pushed streams through Tranquility, which must be downloaded separately (GitHub: Tranquility).

    The incoming data must be recent enough (within a configurable windowPeriod of the current time). Older data is not handled by streaming ingestion; batch ingestion is the better choice for it.

    Stream push relies mainly on the Tranquility server and the indexing service.

    Server

    Druid can use Tranquility Server, which lets you send data to Druid without developing a JVM app. Tranquility Server can be run colocated with Druid middleManagers and historical processes.

    Start the Tranquility server:

     bin/tranquility server -configFile <path_to_config_file>/server.json
    

     To customize Tranquility Server:

    • In server.json, customize the properties and dataSources (a minimal sketch appears below).
    • If a Tranquility instance is already running, stop it with Ctrl-C and restart it so the new configuration takes effect.

    For customizing server.json, refer to the Loading your own streams tutorial and the Tranquility Server documentation.
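
    A minimal server.json sketch is shown below. It follows the general Tranquility Server layout (a top-level dataSources map plus global properties); the "pageviews" dataSource, its dimensions and metric, and the port are illustrative assumptions borrowed from the Loading your own streams tutorial, so verify the exact property names against the Tranquility Server documentation:

     {
       "dataSources" : {
         "pageviews" : {
           "spec" : {
             "dataSchema" : {
               "dataSource" : "pageviews",
               "parser" : {
                 "type" : "string",
                 "parseSpec" : {
                   "format" : "json",
                   "timestampSpec" : { "column" : "time", "format" : "auto" },
                   "dimensionsSpec" : { "dimensions" : ["url", "user"] }
                 }
               },
               "metricsSpec" : [ { "type" : "count", "name" : "views" } ],
               "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "hour", "queryGranularity" : "none" }
             },
             "tuningConfig" : {
               "type" : "realtime",
               "windowPeriod" : "PT10M",
               "intermediatePersistPeriod" : "PT10M",
               "maxRowsInMemory" : 75000
             }
           },
           "properties" : { "task.partitions" : "1", "task.replicants" : "1" }
         }
       },
       "properties" : {
         "zookeeper.connect" : "localhost:2181",
         "http.port" : "8200",
         "http.threads" : "8"
       }
     }

    The timestamp column ("time"), dimensions, and metric above are placeholders; replace them with your own schema.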

    Kafka

    Tranquility Kafka is used to load data from Kafka; no coding is required, only a configuration file.

    Start Tranquility Kafka:

     bin/tranquility kafka -configFile <path_to_config_file>/kafka.json
    

    Following the configuration used in the single-machine quickstart:

    • In kafka.json, define the properties and dataSources.
    • If a Tranquility instance is already running, restart it to pick up the changes.

    For more details, see the Tranquility Kafka documentation.
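
    A rough kafka.json sketch follows. The layout mirrors server.json (a dataSources map plus global properties), with the Kafka topic mapped via topicPattern and the Kafka connection settings in the global properties. Property names such as kafka.zookeeper.connect, kafka.group.id, and topicPattern are recalled from the Tranquility Kafka documentation and should be double-checked there:

     {
       "dataSources" : {
         "pageviews-kafka" : {
           "spec" : {
             "dataSchema" : {
               "dataSource" : "pageviews-kafka",
               "parser" : {
                 "type" : "string",
                 "parseSpec" : {
                   "format" : "json",
                   "timestampSpec" : { "column" : "time", "format" : "auto" },
                   "dimensionsSpec" : { "dimensions" : ["url", "user"] }
                 }
               },
               "metricsSpec" : [ { "type" : "count", "name" : "views" } ],
               "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "hour", "queryGranularity" : "none" }
             },
             "tuningConfig" : {
               "type" : "realtime",
               "windowPeriod" : "PT10M",
               "intermediatePersistPeriod" : "PT10M",
               "maxRowsInMemory" : 75000
             }
           },
           "properties" : {
             "task.partitions" : "1",
             "task.replicants" : "1",
             "topicPattern" : "pageviews"
           }
         }
       },
       "properties" : {
         "zookeeper.connect" : "localhost:2181",
         "kafka.zookeeper.connect" : "localhost:2181",
         "kafka.group.id" : "tranquility-kafka",
         "consumer.numThreads" : "2",
         "commit.periodMillis" : "15000"
       }
     }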

    JVM apps and stream processors

    Tranquility can be embedded into JVM-based applications as a library (GitHub: Tranquility), either through its Core API or through its built-in connectors for stream processors such as Storm, Samza, Spark Streaming, and Flink.

    Concepts

    Task creation

    Tranquility automatically creates Druid realtime indexing tasks and handles partitioning, replication, service discovery, and schema rollover. Specifically, Tranquility periodically spawns relatively short-lived tasks, each of which handles a small set of Druid segments, and coordinates these tasks through ZooKeeper. For more details on how tasks are managed, see the Tranquility overview.

    segmentGranularity and windowPeriod

    segmentGranularity

    The time granularity covered by each segment. For example, with segmentGranularity "hour", tasks create segments that each cover one hour of data.

    windowPeriod 

    The slack time allowed for events. For example, with the default windowPeriod of ten minutes, events whose timestamps are more than ten minutes in the past or more than ten minutes in the future are dropped and not processed.

    Together, these settings determine how long tasks stay alive and how long data lingers in the realtime system before being handed off to the historical nodes. For example, with segmentGranularity "hour" and a windowPeriod of ten minutes, a task listens for events for an hour and ten minutes.
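
    As a fragment only (the placement mirrors the specFile example later in this section), segmentGranularity is set in the granularitySpec while windowPeriod is set in the tuningConfig:

     "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "hour", "queryGranularity" : "none" },
     "tuningConfig" : { "type" : "realtime", "windowPeriod" : "PT10M" }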

    Append only

    Druid streaming ingestion is append-only; updates and deletes of already-ingested data are not supported. If you need them, use the batch reindexing process with batch ingestion.

    Guarantees

    Tranquility does not guarantee exactly-once delivery. In some scenarios it will drop or duplicate events:

    • Events outside the windowPeriod are dropped.
    • If Druid MiddleManager failures exceed the configured limit, partially indexed data may be lost.
    • If connectivity with the Druid indexing service is lost for long enough (retries are exhausted, or the outage exceeds the windowPeriod), some events will be lost.
    • If Tranquility does not receive an acknowledgement from the indexing service, it retries the batch, which can duplicate events.
    • When Tranquility is embedded in Storm or Samza, the at-least-once design of those architectures can also lead to duplicated events.

    These cases are rare, but if you need a 100% guarantee, a hybrid batch/streaming architecture is recommended.

    Tranquility documentation and Configuration

    Tranquility documentation: here. Tranquility configuration: here. Tranquility's tuningConfig: here.

    3. Stream Pull Ingestion

    A Realtime Node pulls data from a Firehose, which connects directly to the data source; Druid has built-in Firehoses for different sources (Kafka, RabbitMQ, etc.). The quickstart does not cover setting up standalone realtime nodes, but they can be used in place of the Tranquility server and the indexing service; note that they have limitations compared to the indexing service.

    Realtime Node Ingestion

    For Real-time Node information, see here.

    For Real-time Node configuration, see Realtime Configuration.

    For how to write your own plugins to feed a real-time node, see Firehose.

    Realtime "specFile"

    A sample druid.realtime.specFile:

     [
      {
        "dataSchema" : {
          "dataSource" : "wikipedia",
          "parser" : {
            "type" : "string",
            "parseSpec" : {
              "format" : "json",
              "timestampSpec" : {
                "column" : "timestamp",
                "format" : "auto"
              },
              "dimensionsSpec" : {
                "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
                "dimensionExclusions" : [],
                "spatialDimensions" : []
              }
            }
          },
          "metricsSpec" : [{
            "type" : "count",
            "name" : "count"
          }, {
            "type" : "doubleSum",
            "name" : "added",
            "fieldName" : "added"
          }, {
            "type" : "doubleSum",
            "name" : "deleted",
            "fieldName" : "deleted"
          }, {
            "type" : "doubleSum",
            "name" : "delta",
            "fieldName" : "delta"
          }],
          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "DAY",
            "queryGranularity" : "NONE"
          }
        },
        "ioConfig" : {
          "type" : "realtime",
          "firehose": {
            "type": "kafka-0.8",
            "consumerProps": {
              "zookeeper.connect": "localhost:2181",
              "zookeeper.connection.timeout.ms" : "15000",
              "zookeeper.session.timeout.ms" : "15000",
              "zookeeper.sync.time.ms" : "5000",
              "group.id": "druid-example",
              "fetch.message.max.bytes" : "1048586",
              "auto.offset.reset": "largest",
              "auto.commit.enable": "false"
            },
            "feed": "wikipedia"
          },
          "plumber": {
            "type": "realtime"
          }
        },
        "tuningConfig": {
          "type" : "realtime",
          "maxRowsInMemory": 75000,
          "intermediatePersistPeriod": "PT10m",
          "windowPeriod": "PT10m",
          "basePersistDirectory": "\/tmp\/realtime\/basePersist",
          "rejectionPolicy": {
            "type": "serverTime"
          }
        }
      }
    ]

    Multiple realtime streams can be configured (the specFile is a JSON array). Each realtime stream generally needs two threads: one thread for data consumption and aggregation, and one thread for incremental persists and other background tasks.

    The configuration file consists of three main parts: dataSchema, IOConfig, and tuningConfig.

    DataSchema

    This field is required.

    See Ingestion

    IOConfig

    This field is required.

    Field | Type | Description | Required
    type | String | This should always be 'realtime'. | yes
    firehose | JSON Object | Where the data is coming from (the data source). | yes
    plumber | JSON Object | Where the data is going. | yes
    Firehose

    See Firehose for more information on various firehoses.

    Plumber
    Field | Type | Description | Required
    type | String | This should always be 'realtime'. | no

    TuningConfig

    Defaults are provided for all tuning options; you can override them to tune performance.

    Field | Type | Description | Required
    type | String | This should always be 'realtime'. | no
    maxRowsInMemory | Integer | The number of rows to aggregate before persisting; this is used to manage the required JVM heap size. Maximum heap memory used for indexing = maxRowsInMemory * (2 + maxPendingPersists). | no (default == 75000)
    windowPeriod | ISO 8601 Period String | The amount of slack time to allow events; defaults to ten minutes. See "segmentGranularity and windowPeriod" above. | no (default == PT10m)
    intermediatePersistPeriod | ISO 8601 Period String | The period between intermediate persists. | no (default == PT10m)
    basePersistDirectory | String | The directory to put things that need persistence. The plumber is responsible for the actual intermediate persists. | no (default == java tmp dir)
    versioningPolicy | Object | How to version segments. | no (default == based on segment start time)
    rejectionPolicy | Object | Controls how data sets the data acceptance policy for creating and handing off segments. More on this below. | no (default == 'serverTime')
    maxPendingPersists | Integer | Maximum number of persists that can be pending, but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. | no (default == 0; meaning one persist can be running concurrently with ingestion, and none can be queued up)
    shardSpec | Object | Describes the shard that is represented by this server, for data ingested in a sharded fashion. | no (default == 'NoneShardSpec')
    buildV9Directly | Boolean | Whether to build a v9 index directly instead of first building a v8 index and then converting it to v9 format. | no (default == true)
    persistThreadPriority | int | If -XX:+UseThreadPriorities is properly enabled, this will set the thread priority of the persisting thread to Thread.NORM_PRIORITY plus this value within the bounds of Thread.MIN_PRIORITY and Thread.MAX_PRIORITY. A value of 0 indicates to not change the thread priority. | no (default == 0; inherit and do not override)
    mergeThreadPriority | int | If -XX:+UseThreadPriorities is properly enabled, this will set the thread priority of the merging thread to Thread.NORM_PRIORITY plus this value within the bounds of Thread.MIN_PRIORITY and Thread.MAX_PRIORITY. A value of 0 indicates to not change the thread priority. Before enabling thread priority settings, users are highly encouraged to read the original pull request and other documentation about proper use of -XX:+UseThreadPriorities. | no (default == 0; inherit and do not override)
    reportParseExceptions | Boolean | If true, exceptions encountered during parsing will be thrown and will halt ingestion. If false, unparseable rows and fields will be skipped. If an entire row is skipped, the "unparseable" counter will be incremented. If some fields in a row were parseable and some were not, the parseable fields will be indexed and the "unparseable" counter will not be incremented. | no (default == false)
    handoffConditionTimeout | long | Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever. | no (default == 0)
    alertTimeout | long | Milliseconds timeout after which an alert is created if the task isn't finished by then. This allows users to monitor tasks that are failing to finish and give up the worker slot for any unexpected errors. | no (default == 0)
    indexSpec | Object | Tune how data is indexed. See below for more information. | no

     

    Rejection Policy
    • serverTime – The recommended policy for "current time" data, it is optimal for current data that is generated and ingested in real time. Uses windowPeriod to accept only those events that are inside the window looking forward and back.
    • messageTime – Can be used for non-"current time" as long as that data is relatively in sequence. Events are rejected if they are less than windowPeriod from the event with the latest timestamp. Hand off only occurs if an event is seen after the segmentGranularity and windowPeriod (hand off will not periodically occur unless you have a constant stream of data).
    • none – All events are accepted. Never hands off data unless shutdown() is called on the configured firehose.
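
    For example, switching the tuningConfig in the specFile above from the default serverTime policy to messageTime would look like the following (the rest of the tuningConfig stays unchanged):

     "rejectionPolicy": {
       "type": "messageTime"
     }
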
    IndexSpec
    Field | Type | Description | Required
    bitmap | Object | Compression format for bitmap indexes. Should be a JSON object; see below for options. | no (defaults to Concise)
    dimensionCompression | String | Compression format for dimension columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4)
    metricCompression | String | Compression format for metric columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4)
    Bitmap types

    For Concise bitmaps:

    Field | Type | Description | Required
    type | String | Must be concise. | yes

    For Roaring bitmaps:

    Field | Type | Description | Required
    type | String | Must be roaring. | yes
    compressRunOnSerialization | Boolean | Use a run-length encoding where it is estimated as more space efficient. | no (default == true)
    Sharding

    Druid uses shards, or segments with partition numbers, to more efficiently handle large amounts of incoming data. In Druid, shards represent the segments that together cover a time interval based on the value of segmentGranularity. If, for example, segmentGranularity is set to "hour", then a number of shards may be used to store the data for that hour. Sharding along dimensions may also occur to optimize efficiency.

    Segments are identified by datasource, time interval, and version. With sharding, a segment is also identified by a partition number. Typically, each shard will have the same version but a different partition number to uniquely identify it.

    In small-data scenarios, sharding is unnecessary and can be set to none (the default):

         "shardSpec": {"type": "none"}
    

    However, in scenarios with multiple realtime nodes, none is less useful as it cannot help with scaling data volume (see below). Note that for the batch indexing service, no explicit configuration is required; sharding is provided automatically.

    Druid uses sharding based on the shardSpec setting you configure. The recommended choices, linear and numbered, are discussed below; other types have been useful for internal Druid development but are not appropriate for production setups.

    Keep in mind that the sharding configuration has nothing to do with the configured firehose. For example, setting the partition number to 0 does not mean that the Kafka firehose will consume only from topic partition 0.

    Linear

    This strategy provides following advantages:

    • There is no need to update the fileSpec configurations of existing nodes when adding new nodes.
    • All unique shards are queried, regardless of whether the partition numbering is sequential or not (it allows querying of partitions 0 and 2, even if partition 1 is missing).

    Configure linear under schema:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 0
        }
    
    Numbered

    This strategy is similar to linear except that it does not tolerate non-sequential partition numbering (it will not allow querying of partitions 0 and 2 if partition 1 is missing). It also requires explicitly setting the total number of partitions.

    Configure numbered under schema:

         "shardSpec": {
            "type": "numbered",
            "partitionNum": 0,
            "partitions": 2
        }
    
    Scale and Redundancy

    The shardSpec configuration can be used to create redundancy by having the same partitionNum values on different nodes.

    For example, if RealTimeNode1 has:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 0
        }
    

    and RealTimeNode2 has:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 0
        }
    

    then two realtime nodes can store segments with the same datasource, version, time interval, and partition number. Brokers that query for data in such segments will assume that they hold the same data, and the query will target only one of the segments.

    shardSpec can also help achieve scale. For this, add nodes with a different partitionNum. Continuing with the example, if RealTimeNode3 has:

         "shardSpec": {
            "type": "linear",
            "partitionNum": 1
        }
    

    then it can store segments with the same datasource, time interval, and version as in the first two nodes, but with a different partition number. Brokers that query for data in such segments will assume that a segment from RealTimeNode3 holds different data, and the query will target it along with a segment from the first two nodes.

    You can use type numbered similarly. Note that type none is essentially type linear with all shards having a fixed partitionNum of 0.

    Constraints

    intermediatePersistPeriod ≤ windowPeriod < segmentGranularity and queryGranularity ≤ segmentGranularity

    Name | Effect | Minimum | Recommended
    windowPeriod | the tolerated time window around the current time for event timestamps | time jitter tolerance | use this to reject outliers
    segmentGranularity | time granularity (minute, hour, day, week, month) for loading data at query time | equal to indexGranularity | more than queryGranularity
    queryGranularity | time granularity (minute, hour, day, week, month) for rollup | less than segmentGranularity | minute, hour, day, week, month
    intermediatePersistPeriod | the period at which in-memory data is pushed to disk | avoid excessive flushing | the number of rows that can accumulate between persists is bounded by maxRowsInMemory
    maxRowsInMemory | the maximum number of rows held in memory before data is pushed to disk | the number of rows expected within one intermediatePersistPeriod | use together with intermediatePersistPeriod to avoid running out of heap
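
    As an illustrative combination that satisfies these constraints (values chosen only as an example): intermediatePersistPeriod PT10M ≤ windowPeriod PT10M < segmentGranularity HOUR, and queryGranularity MINUTE ≤ segmentGranularity HOUR:

     "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "HOUR", "queryGranularity" : "MINUTE" },
     "tuningConfig" : {
       "type" : "realtime",
       "intermediatePersistPeriod" : "PT10M",
       "windowPeriod" : "PT10M",
       "maxRowsInMemory" : 75000
     }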

     

    Kafka

    Standalone realtime nodes use the Kafka high level consumer, which imposes a few restrictions.

    Druid replicates segment data across N nodes; if N–1 of those nodes go down, the data is still queryable. However, this replication breaks down when a Kafka topic must be read by multiple standard Kafka consumer groups (because consumers in different consumer groups will split up the data differently), so Druid can no longer guarantee identical replicas across the N nodes.

    In more detail:

    For example, let's say your topic is split across Kafka partitions 1, 2, & 3 and you have 2 real-time nodes with linear shard specs 1 & 2. Both of the real-time nodes are in the same consumer group. Real-time node 1 may consume data from partitions 1 & 3, and real-time node 2 may consume data from partition 2. Querying for your data through the broker will yield correct results.

    The problem arises if you want to replicate your data by creating real-time nodes 3 & 4. These new real-time nodes also have linear shard specs 1 & 2, and they will consume data from Kafka using a different consumer group. In this case, real-time node 3 may consume data from partitions 1 & 2, and real-time node 4 may consume data from partition 2. From Druid's perspective, the segments hosted by real-time nodes 1 and 3 are the same, and the data hosted by real-time nodes 2 and 4 are the same, although they are reading from different Kafka partitions. Querying for the data will yield inconsistent results.

    Is this always a problem? No. If your data is small enough to fit on a single Kafka partition, you can replicate without issues. Otherwise, you can run real-time nodes without replication.

    Locking

    Using stream pull ingestion together with batch ingestion can lead to data override issues.

    For example, suppose realtime ingestion is producing hourly segments for today's data while a daily batch job is also processing today's data. The batch job will produce segments with a higher version than the realtime ingestion, so if the batch job indexes today's data before the day is complete, the daily segment it produces will override the recent segments produced by the realtime nodes, and part of that data will be lost.

    Schema changes

    Standalone realtime nodes must be restarted to pick up schema changes.

    Log management

    Each standalone realtime node keeps its own logs, so diagnosing issues across many partitions spread over many servers can be difficult.


  • 4. Batch Data Ingestion

    Data can be ingested from static files.

    Hadoop-based Batch Ingestion

    Hadoop-based batch ingestion is performed by submitting a Hadoop-ingestion task to the Overlord. A sample task:


     {
      "type" : "index_hadoop",
      "spec" : {
        "dataSchema" : {
          "dataSource" : "wikipedia",
          "parser" : {
            "type" : "hadoopyString",
            "parseSpec" : {
              "format" : "json",
              "timestampSpec" : {
                "column" : "timestamp",
                "format" : "auto"
              },
              "dimensionsSpec" : {
                "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
                "dimensionExclusions" : [],
                "spatialDimensions" : []
              }
            }
          },
          "metricsSpec" : [
            {
              "type" : "count",
              "name" : "count"
            },
            {
              "type" : "doubleSum",
              "name" : "added",
              "fieldName" : "added"
            },
            {
              "type" : "doubleSum",
              "name" : "deleted",
              "fieldName" : "deleted"
            },
            {
              "type" : "doubleSum",
              "name" : "delta",
              "fieldName" : "delta"
            }
          ],
          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "DAY",
            "queryGranularity" : "NONE",
            "intervals" : [ "2013-08-31/2013-09-01" ]
          }
        },
        "ioConfig" : {
          "type" : "hadoop",
          "inputSpec" : {
            "type" : "static",
            "paths" : "/MyDirectory/example/wikipedia_data.json"
          }
        },
        "tuningConfig" : {
          "type": "hadoop"
        }
      },
      "hadoopDependencyCoordinates": <my_hadoop_version>
    }
    
    property | description | required?
    type | "index_hadoop". | yes
    spec | A Hadoop Index Spec. See Batch Ingestion. | yes
    hadoopDependencyCoordinates | A JSON array of Hadoop dependency coordinates that will override the default Hadoop coordinates. Once specified, Druid looks for the Hadoop dependencies in the location specified by druid.extensions.hadoopDependenciesDir. | no
    classpathPrefix | Classpath that will be prepended for the peon process. Druid computes the classpath of the Hadoop job containers automatically; if Hadoop's dependencies conflict with Druid's, specify the classpath manually by setting druid.extensions.hadoopContainerDruidClasspath in the base Druid configuration. | no


    DataSchema

    This field is required. See Ingestion.

    IOConfig

    This field is required.

    Field | Type | Description | Required
    type | String | This should always be 'hadoop'. | yes
    inputSpec | Object | A specification of where to pull the data in from. See below. | yes
    segmentOutputPath | String | The path to dump segments into. | yes
    metadataUpdateSpec | Object | A specification of how to update the metadata for the Druid cluster these segments belong to. | yes
    InputSpec specification

    There are multiple types of inputSpecs:

    static

    A type of inputSpec where a static path to the data files is provided.

    Field | Type | Description | Required
    paths | Array of String | A String of input paths indicating where the raw data is located. | yes

    For example, static input paths:

     "paths" : "s3n://billy-bucket/the/data/is/here/data.gz, s3n://billy-bucket/the/data/is/here/moredata.gz, s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
    
    granularity

    A type of inputSpec that expects data to be organized in directories according to datetime using the path format: y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX (where date is represented by lowercase and time is represented by uppercase).

    Field | Type | Description | Required
    dataGranularity | String | Specifies the granularity to expect the data at, e.g. hour means to expect directories y=XXXX/m=XX/d=XX/H=XX. | yes
    inputPath | String | Base path to append the datetime path to. | yes
    filePattern | String | Pattern that files should match to be included. | yes
    pathFormat | String | Joda datetime format for each directory. Default value is "'y'=yyyy/'m'=MM/'d'=dd/'H'=HH", or see Joda documentation. | no

    For example, if the sample config were run with the interval 2012-06-01/2012-06-02, it would expect data at the paths:

     s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
    s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
    ...
    s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
    
    dataSource

    Read Druid segments. See here for more information.

    multi

    Read multiple sources of data. See here for more information.

    TuningConfig

    The tuningConfig is optional and default parameters will be used if no tuningConfig is specified.

    Field | Type | Description | Required
    workingPath | String | The working path to use for intermediate results (results between Hadoop jobs). | no (default == '/tmp/druid-indexing')
    version | String | The version of created segments. Ignored for HadoopIndexTask unless useExplicitVersion is set to true. | no (default == datetime that indexing starts at)
    partitionsSpec | Object | A specification of how to partition each time bucket into segments. Absence of this property means no partitioning will occur. See 'Partitioning specification' below. | no (default == 'hashed')
    maxRowsInMemory | Integer | The number of rows to aggregate before persisting. Note that this is the number of post-aggregation rows which may not be equal to the number of input events due to roll-up. This is used to manage the required JVM heap size. | no (default == 75000)
    leaveIntermediate | Boolean | Leave behind intermediate files (for debugging) in the workingPath when a job completes, whether it passes or fails. | no (default == false)
    cleanupOnFailure | Boolean | Clean up intermediate files when a job fails (unless leaveIntermediate is on). | no (default == true)
    overwriteFiles | Boolean | Override existing files found during indexing. | no (default == false)
    ignoreInvalidRows | Boolean | Ignore rows found to have problems. | no (default == false)
    combineText | Boolean | Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files. | no (default == false)
    useCombiner | Boolean | Use Hadoop combiner to merge rows at mapper if possible. | no (default == false)
    jobProperties | Object | A map of properties to add to the Hadoop job configuration, see below for details. | no (default == null)
    indexSpec | Object | Tune how data is indexed. See below for more information. | no
    buildV9Directly | Boolean | Whether to build a v9 index directly instead of first building a v8 index and then converting it to v9 format. | no (default == true)
    numBackgroundPersistThreads | Integer | The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and cpu usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1. | no (default == 0)
    forceExtendableShardSpecs | Boolean | Forces use of extendable shardSpecs. Experimental feature intended for use with the Kafka indexing service extension. | no (default = false)
    useExplicitVersion | Boolean | Forces HadoopIndexTask to use version. | no (default = false)
    jobProperties field of TuningConfig
        "tuningConfig" : {
         "type": "hadoop",
         "jobProperties": {
           "<hadoop-property-a>": "<value-a>",
           "<hadoop-property-b>": "<value-b>"
         }
       }
    

    Hadoop's MapReduce documentation lists the possible configuration parameters.

    With some Hadoop distributions, it may be necessary to set mapreduce.job.classpath or mapreduce.job.user.classpath.first to avoid class loading issues. See the working with different Hadoop versions documentation for more details.
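
    For instance, such a switch goes into the jobProperties map shown above. Whether you need it, and which of the two properties to use, depends on your Hadoop distribution, so treat this fragment as an assumption to verify:

         "tuningConfig" : {
           "type": "hadoop",
           "jobProperties": {
             "mapreduce.job.user.classpath.first": "true"
           }
         }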

    IndexSpec
    Field | Type | Description | Required
    bitmap | Object | Compression format for bitmap indexes. Should be a JSON object; see below for options. | no (defaults to Concise)
    dimensionCompression | String | Compression format for dimension columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4)
    metricCompression | String | Compression format for metric columns. Choose from LZ4, LZF, uncompressed, or none. | no (default == LZ4)
    longEncoding | String | Encoding format for metric and dimension columns with type long. Choose from auto or longs. auto encodes the values using offset or lookup table depending on column cardinality, and stores them with variable size. longs stores the value as is with 8 bytes each. | no (default == longs)
    Bitmap types

    For Concise bitmaps:

    Field | Type | Description | Required
    type | String | Must be concise. | yes

    For Roaring bitmaps:

    Field | Type | Description | Required
    type | String | Must be roaring. | yes
    compressRunOnSerialization | Boolean | Use a run-length encoding where it is estimated as more space efficient. | no (default == true)

    Partitioning specification

    Segments are always partitioned by timestamp (via the granularitySpec), and may be further partitioned by another partition type. For example:

     "hashed" (based on the hash of all dimensions in each row),

    "dimension" (based on ranges of a single dimension).

    Hashed partitioning is recommended over dimension partitioning: it improves indexing performance and produces more uniformly sized segments than single-dimension partitioning.

    Hash-based partitioning
       "partitionsSpec": {
         "type": "hashed",
         "targetPartitionSize": 5000000
       }
    

    Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments according to the hash of all dimensions in each row. The number of segments is determined automatically based on the cardinality of the input set and a target partition size.

    The configuration options are:

    Field | Description | Required
    type | Type of partitionSpec to be used. | "hashed"
    targetPartitionSize | Target number of rows to include in a partition, should be a number that targets segments of 500MB~1GB. | either this or numShards
    numShards | Specify the number of partitions directly, instead of a target partition size. Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically. | either this or targetPartitionSize
    partitionDimensions | The dimensions to partition on. Leave blank to select all dimensions. Only used with numShards; will be ignored when targetPartitionSize is set. | no
    Single-dimension partitioning
       "partitionsSpec": {
         "type": "dimension",
         "targetPartitionSize": 5000000
       }
    

    Single-dimension partitioning works by first selecting a dimension to partition on, and then separating that dimension into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example, your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and "f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can override it with a specific dimension.

    The configuration options are:

    Field | Description | Required
    type | Type of partitionSpec to be used. | "dimension"
    targetPartitionSize | Target number of rows to include in a partition, should be a number that targets segments of 500MB~1GB. | yes
    maxPartitionSize | Maximum number of rows to include in a partition. Defaults to 50% larger than the targetPartitionSize. | no
    partitionDimension | The dimension to partition on. Leave blank to select a dimension automatically. | no
    assumeGrouped | Assume that input data has already been grouped on time and dimensions. Ingestion will run faster, but may choose sub-optimal partitions if this assumption is violated. | no

    Remote Hadoop Cluster

    To use a remote Hadoop cluster, place the Hadoop configuration *.xml files in the Druid _common configuration folder. If you run into dependency or version conflicts, see these docs.
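
    For example (the source and destination paths below are assumptions; adjust them to your Hadoop distribution and Druid installation layout):

      cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/yarn-site.xml /etc/hadoop/conf/mapred-site.xml conf/druid/_common/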

    Having Problems?

    Questions from first-time users have often already been answered on the Druid google groups page.

