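1. Stream Ingestion
Streaming data can be ingested using Tranquility (a Druid-aware client that provides load balancing and other services, at the cost of a more complex setup) together with the indexing service, or through standalone Realtime nodes (which have limitations).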
Stream push
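With this approach, Tranquility is embedded in your data-producing application. Tranquility ships with bindings for the Storm and Samza stream processors, and it also has a direct API that can be used from any JVM-based program (such as Spark Streaming or a Kafka consumer). Tranquility handles partitioning, replication, service discovery, and schema rollover; you only need to define your Druid schema (see the Tranquility README).
This approach relies mainly on Tranquility Server and the indexing service.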
Stream pull
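A Realtime Node pulls data from a Firehose, and the Firehose connects directly to the data source. Druid has built-in firehoses for several sources (Kafka, RabbitMQ, and others).
This approach relies mainly on Realtime Nodes. Note that Realtime nodes have different properties and roles than the indexing service.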
More information
For the push-based approach, see the Stream Push section below.
For the pull-based approach, see the Stream Pull Ingestion section below.
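2. Stream Push
Druid can receive streams pushed through Tranquility. Tranquility must be downloaded separately (GitHub: Tranquility).
Incoming data must be recent enough, that is, within a configurable windowPeriod of the current time. Older data is not processed in real time and is better handled with batch ingestion.
Stream push relies mainly on Tranquility Server and the indexing service.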
Server
Druid can use Tranquility Server, which sends data to Druid on your behalf so you do not need to develop a JVM app. Tranquility Server can be run colocated with Druid middleManagers and historical processes.
To start Tranquility Server:
bin/tranquility server -configFile <path_to_config_file>/server.json
To customize Tranquility Server:
- In server.json, customize the properties and dataSources.
- If Tranquility is already running, stop it (CTRL-C) and start it again.
For tips on customizing server.json, see the Loading your own streams tutorial and the Tranquility Server documentation.
Kafka
Tranquility Kafka loads data from Kafka without any code; it only needs a configuration file.
To start Tranquility Kafka:
bin/tranquility kafka -configFile <path_to_config_file>/kafka.json
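To customize Tranquility Kafka in the single-machine quickstart configuration:
- In kafka.json, customize the properties and dataSources.
- If Tranquility Kafka is already running, stop it and start it again.
For more information, see the Tranquility Kafka documentation.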
JVM apps and stream processors
Tranquility can be embedded as a library (GitHub: Tranquility) in JVM-based applications, either through its Core API or through its bundled connectors (for example Storm, Samza, Spark Streaming, and Flink).
Concepts
Task creation
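Tranquility automatically creates Druid realtime indexing tasks and handles partitioning, replication, service discovery, and schema rollover. Specifically, it periodically spawns relatively short-lived tasks, each of which handles a small number of Druid segments, and coordinates these tasks through ZooKeeper. For more detail on how tasks are managed, see the Tranquility overview.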
segmentGranularity and windowPeriod
segmentGranularity
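The time period covered by each segment. For example, with a segmentGranularity of "hour", tasks create segments that each cover one hour of data.
windowPeriod
The slack time permitted for events. For example, a windowPeriod of ten minutes (the default) means that events with a timestamp more than ten minutes in the past or more than ten minutes in the future are dropped.
These two settings determine how long tasks stay alive and how long data stays in the realtime system before being handed off to the historical nodes. For example, with segmentGranularity "hour" and a windowPeriod of ten minutes, tasks keep listening for events for an hour and ten minutes.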
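The interaction of the two settings can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Tranquility code; the function names `within_window` and `task_listen_duration` are hypothetical.

```python
from datetime import datetime, timedelta


def within_window(event_time, now, window_period):
    """True if the event timestamp is no more than window_period away
    from now, in either direction (too old and too far ahead both fail)."""
    return abs(now - event_time) <= window_period


def task_listen_duration(segment_granularity, window_period):
    """How long a task keeps listening for events for one segment interval."""
    return segment_granularity + window_period


now = datetime(2016, 1, 1, 12, 0)
window = timedelta(minutes=10)

print(within_window(datetime(2016, 1, 1, 11, 55), now, window))  # True
print(within_window(datetime(2016, 1, 1, 11, 45), now, window))  # False
print(within_window(datetime(2016, 1, 1, 12, 15), now, window))  # False
print(task_listen_duration(timedelta(hours=1), window))          # 1:10:00
```

Whether an event exactly on the window boundary is accepted is an assumption of this sketch.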
Append only
Druid streaming ingestion is append-only: it does not support updating or deleting ingested data. If you need updates or deletes, use a batch reindexing process (see batch ingestion).
Guarantees
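Tranquility does not guarantee exactly-once delivery. In some scenarios it will drop or duplicate events:
- Events with timestamps outside the windowPeriod are dropped.
- If the number of Druid Middle Manager failures exceeds the configured replica count, some partially indexed data may be lost.
- If communication with the Druid indexing service is lost persistently, and either the retries are exhausted or the outage lasts longer than windowPeriod, some events are dropped.
- If Tranquility never receives an acknowledgement from the indexing service, it retries the batch, which can lead to duplicated events.
- When Tranquility is used inside Storm or Samza, parts of those architectures have at-least-once designs, which can also lead to duplicated events.
These are edge cases and should be rare. If you need a 100% guarantee, a hybrid batch/streaming architecture is recommended.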
Tranquility documentation and Configuration
Tranquility documentation: here.
Tranquility configuration: here.
Tranquility's tuningConfig: here.
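3. Stream Pull Ingestion
With stream pull, a Realtime Node pulls data from a Firehose, and the Firehose connects directly to the data source. Druid has built-in firehoses for several sources (Kafka, RabbitMQ, and others). The quickstart does not cover how to set up standalone realtime nodes, but they can be used in place of Tranquility Server and the indexing service. Note that Realtime nodes have different properties and roles than the indexing service.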
Realtime Node Ingestion
Real-time Node information, see here.
Real-time Node Configuration, see Realtime Configuration.
For writing your own plugins for the real-time node, see Firehose.
Realtime "specFile"
An example druid.realtime.specFile:
[
  {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "doubleSum", "name" : "added", "fieldName" : "added" },
        { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" },
        { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" }
      ],
      "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "DAY", "queryGranularity" : "NONE" }
    },
    "ioConfig" : {
      "type" : "realtime",
      "firehose": {
        "type": "kafka-0.8",
        "consumerProps": {
          "zookeeper.connect": "localhost:2181",
          "zookeeper.connection.timeout.ms" : "15000",
          "zookeeper.session.timeout.ms" : "15000",
          "zookeeper.sync.time.ms" : "5000",
          "group.id": "druid-example",
          "fetch.message.max.bytes" : "1048586",
          "auto.offset.reset": "largest",
          "auto.commit.enable": "false"
        },
        "feed": "wikipedia"
      },
      "plumber": { "type": "realtime" }
    },
    "tuningConfig": {
      "type" : "realtime",
      "maxRowsInMemory": 75000,
      "intermediatePersistPeriod": "PT10m",
      "windowPeriod": "PT10m",
      "basePersistDirectory": "\/tmp\/realtime\/basePersist",
      "rejectionPolicy": { "type": "serverTime" }
    }
  }
]
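The spec file is a JSON array, so more than one realtime stream can be configured on a node. As a rule of thumb, each realtime stream needs 2 threads: 1 thread for data consumption and aggregation, and 1 thread for incremental persists and other background tasks.
Each entry has three main parts: dataSchema, ioConfig, and tuningConfig.
DataSchema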
This field is required.
See Ingestion
IOConfig
This field is required.
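| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should always be 'realtime'. | yes |
| firehose | JSON Object | Where the data is coming from (the data source). | yes |
| plumber | JSON Object | Where the data is going. | yes |
Firehose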
See Firehose for more information on various firehoses.
Plumber
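| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should always be 'realtime'. | no |
TuningConfig
The tuningConfig is optional; defaults are used when it is omitted, and you can override them to tune ingestion.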
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should always be 'realtime'. | no |
| maxRowsInMemory | Integer | The number of post-aggregation rows to hold before persisting. Used to manage the required JVM heap size. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists). | no (default == 75000) |
| windowPeriod | ISO 8601 Period String | Defaults to 10 minutes; see "segmentGranularity and windowPeriod" above. | no (default == PT10m) |
| intermediatePersistPeriod | ISO 8601 Period String | The period between intermediate persists. | no (default == PT10m) |
| basePersistDirectory | String | The directory to put things that need persistence. The plumber is responsible for the actual intermediate persists. | no (default == java tmp dir) |
| versioningPolicy | Object | How to version segments. | no (default == based on segment start time) |
| rejectionPolicy | Object | Sets the data acceptance policy for creating and handing off segments. More on this below. | no (default == 'serverTime') |
| maxPendingPersists | Integer | Maximum number of persists that can be pending, but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. | no (default == 0; meaning one persist can be running concurrently with ingestion, and none can be queued up) |
| shardSpec | Object | Describes the shard that is represented by this server. Needed when multiple realtime nodes index the same stream in a sharded fashion. | no (default == 'NoneShardSpec') |
| buildV9Directly | Boolean | Whether to build a v9 index directly instead of first building a v8 index and then converting it to v9 format. | no (default == true) |
| persistThreadPriority | int | If -XX:+UseThreadPriorities is properly enabled, this will set the thread priority of the persisting thread to Thread.NORM_PRIORITY plus this value within the bounds of Thread.MIN_PRIORITY and Thread.MAX_PRIORITY. A value of 0 indicates to not change the thread priority. | no (default == 0; inherit and do not override) |
| mergeThreadPriority | int | If -XX:+UseThreadPriorities is properly enabled, this will set the thread priority of the merging thread to Thread.NORM_PRIORITY plus this value within the bounds of Thread.MIN_PRIORITY and Thread.MAX_PRIORITY. A value of 0 indicates to not change the thread priority. | no (default == 0; inherit and do not override) |
| reportParseExceptions | Boolean | If true, exceptions encountered during parsing will be thrown and will halt ingestion. If false, unparseable rows and fields will be skipped. If an entire row is skipped, the "unparseable" counter will be incremented. If some fields in a row were parseable and some were not, the parseable fields will be indexed and the "unparseable" counter will not be incremented. | no (default == false) |
| handoffConditionTimeout | long | Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever. | no (default == 0) |
| alertTimeout | long | Milliseconds timeout after which an alert is created if the task isn't finished by then. This allows users to monitor tasks that are failing to finish and give up the worker slot for any unexpected errors. | no (default == 0) |
| indexSpec | Object | Tune how data is indexed. See below for more information. | no |
Before enabling thread priority settings, users are highly encouraged to read the original pull request and other documentation about proper use of -XX:+UseThreadPriorities.
Rejection Policy
- serverTime: The recommended policy for "current time" data; it is optimal for current data that is generated and ingested in real time. Uses windowPeriod to accept only those events that are inside the window looking forward and back.
- messageTime: Can be used for non-"current time" data as long as that data is relatively in sequence. Events are rejected if they are more than windowPeriod older than the event with the latest timestamp. Hand off only occurs if an event is seen after the segmentGranularity and windowPeriod (hand off will not periodically occur unless you have a constant stream of data).
- none: All events are accepted. Never hands off data unless shutdown() is called on the configured firehose.
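The three policies can be compared with a small Python model. This is a simplified sketch of the behavior described above, not Druid's implementation; the class names and the millisecond-based `accept` signature are hypothetical, and the handling of events exactly on the window boundary is an assumption.

```python
class ServerTimePolicy:
    """Accept events within window_ms of the server clock, forward or back."""

    def __init__(self, window_ms):
        self.window_ms = window_ms

    def accept(self, event_ms, now_ms):
        return abs(now_ms - event_ms) <= self.window_ms


class MessageTimePolicy:
    """Accept events no more than window_ms older than the newest event seen."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.max_seen_ms = None

    def accept(self, event_ms, now_ms=None):
        # Track the latest event timestamp; the server clock is ignored.
        if self.max_seen_ms is None or event_ms > self.max_seen_ms:
            self.max_seen_ms = event_ms
        return event_ms >= self.max_seen_ms - self.window_ms


class NonePolicy:
    """Accept everything."""

    def accept(self, event_ms, now_ms=None):
        return True


TEN_MINUTES_MS = 10 * 60 * 1000
policy = MessageTimePolicy(TEN_MINUTES_MS)
print(policy.accept(1_000_000))  # True: first event sets the high-water mark
print(policy.accept(2_000_000))  # True: newer event moves the mark forward
print(policy.accept(1_000_000))  # False: more than 10 minutes behind the mark
```

Because messageTime only moves forward when new events arrive, a stalled stream never advances the window, which matches the note above that hand off needs a constant stream of data.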
IndexSpec
| Field | Type | Description | Required |
|---|---|---|---|
| bitmap | Object | Compression format for bitmap indexes. Should be a JSON object; see below for options. | no (defaults to Concise) |
| dimensionCompression | String | Compression format for dimension columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4) |
| metricCompression | String | Compression format for metric columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4) |
Bitmap types
For Concise bitmaps:
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Must be concise. | yes |
For Roaring bitmaps:
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Must be roaring. | yes |
| compressRunOnSerialization | Boolean | Use a run-length encoding where it is estimated as more space efficient. | no (default == true) |
Sharding
Druid uses shards, or segments with partition numbers, to more efficiently handle large amounts of incoming data. In Druid, shards represent the segments that together cover a time interval based on the value of segmentGranularity. If, for example, segmentGranularity is set to "hour", then a number of shards may be used to store the data for that hour. Sharding along dimensions may also occur to optimize efficiency.
Segments are identified by datasource, time interval, and version. With sharding, a segment is also identified by a partition number. Typically, each shard will have the same version but a different partition number to uniquely identify it.
In small-data scenarios, sharding is unnecessary and can be set to none (the default):
"shardSpec": {"type": "none"}
However, in scenarios with multiple realtime nodes, none is less useful as it cannot help with scaling data volume (see below). Note that for the batch indexing service, no explicit configuration is required; sharding is provided automatically.
Druid uses sharding based on the shardSpec setting you configure. The recommended choices, linear and numbered, are discussed below; other types have been useful for internal Druid development but are not appropriate for production setups.
Keep in mind that the sharding configuration has nothing to do with the configured firehose. For example, if you set the partition number to 0, it doesn't mean that the Kafka firehose will consume only from topic partition 0.
Linear
This strategy provides the following advantages:
- There is no need to update the fileSpec configurations of existing nodes when adding new nodes.
- All unique shards are queried, regardless of whether the partition numbering is sequential or not (it allows querying of partitions 0 and 2, even if partition 1 is missing).
Configure linear under schema:
"shardSpec": { "type": "linear", "partitionNum": 0 }
Numbered
This strategy is similar to linear except that it does not tolerate non-sequential partition numbering (it will not allow querying of partitions 0 and 2 if partition 1 is missing). It also requires explicitly setting the total number of partitions.
Configure numbered under schema:
"shardSpec": { "type": "numbered", "partitionNum": 0, "partitions": 2 }
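The difference between the two strategies when a partition is missing can be sketched as follows. This is a simplified Python model of the rule stated above, not Druid's query planner; the function name `queryable_partitions` is hypothetical.

```python
def queryable_partitions(shard_type, present, total=None):
    """Return the partition numbers a query may use.

    shard_type: "linear" or "numbered"
    present:    partition numbers that currently exist for the interval
    total:      declared number of partitions (required for "numbered")
    """
    present = sorted(set(present))
    if shard_type == "linear":
        # Gaps are tolerated: every partition that exists is queried.
        return present
    if shard_type == "numbered":
        if total is None:
            raise ValueError("numbered requires the total number of partitions")
        # Simplified rule: nothing is queryable until 0..total-1 all exist.
        complete = set(range(total)) <= set(present)
        return present if complete else []
    raise ValueError(f"unsupported shard type: {shard_type}")


print(queryable_partitions("linear", [0, 2]))             # [0, 2]
print(queryable_partitions("numbered", [0, 2], total=3))  # []
print(queryable_partitions("numbered", [0, 1, 2], total=3))  # [0, 1, 2]
```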
Scale and Redundancy
The shardSpec configuration can be used to create redundancy by having the same partitionNum values on different nodes.
For example, if RealTimeNode1 has:
"shardSpec": { "type": "linear", "partitionNum": 0 }
and RealTimeNode2 has:
"shardSpec": { "type": "linear", "partitionNum": 0 }
then two realtime nodes can store segments with the same datasource, version, time interval, and partition number. Brokers that query for data in such segments will assume that they hold the same data, and the query will target only one of the segments.
shardSpec can also help achieve scale. For this, add nodes with a different partitionNum. Continuing with the example, if RealTimeNode3 has:
"shardSpec": { "type": "linear", "partitionNum": 1 }
then it can store segments with the same datasource, time interval, and version as in the first two nodes, but with a different partition number. Brokers that query for data in such segments will assume that a segment from RealTimeNode3 holds different data, and the query will target it along with a segment from the first two nodes.
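The three-node example can be modeled in Python to show which nodes a query touches. This is a minimal sketch: picking the alphabetically first replica is an assumption made to keep the output deterministic, and the function name `plan_query` is hypothetical; it is not how a broker actually selects replicas.

```python
from collections import defaultdict


def plan_query(node_partitions):
    """Pick one node per distinct partitionNum.

    Nodes sharing a partitionNum are treated as replicas of the same data,
    so only one of them is queried; different partitionNums are all queried.
    """
    replicas = defaultdict(list)
    for node, partition_num in node_partitions.items():
        replicas[partition_num].append(node)
    return {p: sorted(nodes)[0] for p, nodes in sorted(replicas.items())}


nodes = {"RealTimeNode1": 0, "RealTimeNode2": 0, "RealTimeNode3": 1}
print(plan_query(nodes))  # {0: 'RealTimeNode1', 1: 'RealTimeNode3'}
```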
You can use type numbered similarly. Note that type none is essentially type linear with all shards having a fixed partitionNum of 0.
Constraints
The following constraints must hold: intermediatePersistPeriod ≤ windowPeriod < segmentGranularity, and queryGranularity ≤ segmentGranularity.
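These constraints can be checked mechanically before a spec is deployed. The following Python sketch is an illustration only: the helper names are hypothetical, the period parser handles just the simple PT..H..M..S form used in this document, and treating queryGranularity NONE as the finest granularity (0 seconds) is an assumption.

```python
import re

# Only hours/minutes/seconds periods such as "PT10m" or "PT1H30M" are handled.
_PERIOD_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$", re.IGNORECASE)

# Fixed-length granularities only; "month" is omitted because it varies.
GRANULARITY_SECONDS = {"NONE": 0, "MINUTE": 60, "HOUR": 3600,
                       "DAY": 86400, "WEEK": 604800}


def period_seconds(period):
    """Convert a simple ISO 8601 time period to seconds."""
    match = _PERIOD_RE.match(period)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported period: {period!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def check_constraints(intermediate_persist_period, window_period,
                      segment_granularity, query_granularity):
    """Return a list of violated constraints (empty if the spec is fine)."""
    persist = period_seconds(intermediate_persist_period)
    window = period_seconds(window_period)
    segment = GRANULARITY_SECONDS[segment_granularity.upper()]
    query = GRANULARITY_SECONDS[query_granularity.upper()]
    problems = []
    if persist > window:
        problems.append("intermediatePersistPeriod must be <= windowPeriod")
    if window >= segment:
        problems.append("windowPeriod must be < segmentGranularity")
    if query > segment:
        problems.append("queryGranularity must be <= segmentGranularity")
    return problems


# Values taken from the sample specFile earlier in this section.
print(check_constraints("PT10m", "PT10m", "DAY", "NONE"))  # []
print(check_constraints("PT20M", "PT10M", "HOUR", "MINUTE"))
```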
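| Name | Effect | Minimum | Recommended |
|---|---|---|---|
| windowPeriod | The time window within which events are accepted | time jitter tolerance | use this to reject outliers |
| segmentGranularity | Time granularity (minute, hour, day, week, month) for loading data at query time | equal to indexGranularity | more than queryGranularity |
| queryGranularity | Time granularity (minute, hour, day, week, month) for rollup | less than segmentGranularity | minute, hour, day, week, month |
| intermediatePersistPeriod | How often data is flushed from memory to disk | avoid excessive flushing | the number of rows held in memory during the period is also bounded by maxRowsInMemory |
| maxRowsInMemory | The maximum number of rows held in memory before a flush to disk | the number of rows in memory is also bounded by intermediatePersistPeriod | use this to avoid running out of heap within an intermediatePersistPeriod |
Kafka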
Standalone realtime nodes use the Kafka high level consumer, which imposes a few restrictions.
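Druid replicates segments so that logically equivalent data is hosted on N nodes; if N-1 nodes go down, the data can still be queried. On realtime nodes this requires each of the N nodes to hold logically equivalent segments, which is not possible with standard Kafka consumer groups if your Kafka topic requires more than one consumer (because consumers in different consumer groups will split up the data differently).
In detail: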
For example, let's say your topic is split across Kafka partitions 1, 2, & 3 and you have 2 real-time nodes with linear shard specs 1 & 2. Both of the real-time nodes are in the same consumer group. Real-time node 1 may consume data from partitions 1 & 3, and real-time node 2 may consume data from partition 2. Querying for your data through the broker will yield correct results.
The problem arises if you want to replicate your data by creating real-time nodes 3 & 4. These new real-time nodes also have linear shard specs 1 & 2, and they will consume data from Kafka using a different consumer group. In this case, real-time node 3 may consume data from partitions 1 & 2, and real-time node 4 may consume data from partition 3. From Druid's perspective, the segments hosted by real-time nodes 1 and 3 are the same, and the segments hosted by real-time nodes 2 and 4 are the same, although they are reading from different Kafka partitions. Querying for the data will yield inconsistent results.
Is this always a problem? No. If your data is small enough to fit on a single Kafka partition, you can replicate without issues. Otherwise, you can run real-time nodes without replication.
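The inconsistency can be reproduced with a toy model in Python. The event counts per Kafka partition are made up for illustration, node 4 is assumed to read partition 3 (within one consumer group each partition goes to exactly one consumer), and the names are hypothetical.

```python
from itertools import product

# Hypothetical number of events in each Kafka partition (6 in total).
KAFKA_PARTITIONS = {1: 2, 2: 1, 3: 3}

# Consumer group A: nodes 1 and 2. Consumer group B: nodes 3 and 4.
ASSIGNMENT = {"node1": [1, 3], "node2": [2], "node3": [1, 2], "node4": [3]}

# Druid linear shard number configured on each node.
SHARD = {"node1": 1, "node2": 2, "node3": 1, "node4": 2}


def rows_held(node):
    """Events a node has ingested from its assigned Kafka partitions."""
    return sum(KAFKA_PARTITIONS[p] for p in ASSIGNMENT[node])


def possible_query_totals():
    """Row count seen by a query for every way of picking one replica per shard."""
    replicas = {}
    for node, shard in SHARD.items():
        replicas.setdefault(shard, []).append(node)
    totals = {}
    for choice in product(*(replicas[s] for s in sorted(replicas))):
        totals[choice] = sum(rows_held(n) for n in choice)
    return totals


for choice, total in possible_query_totals().items():
    print(choice, total)
# ('node1', 'node2') 6
# ('node1', 'node4') 8
# ('node3', 'node2') 4
# ('node3', 'node4') 6
```

Only the pairings that stay inside one consumer group return the true total of 6; mixing replicas across groups double-counts or misses a Kafka partition.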
Locking
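Using stream pull ingestion together with batch ingestion may introduce data override issues.
For example, suppose you generate hourly segments for the current day and also run a daily batch job over the current day's data. The segments created by the batch job will have a more recent version than most of the segments created by realtime ingestion. If the batch job indexes data that is not yet complete for the day, the daily segment it creates can override recent segments created by realtime nodes, and a portion of the data will appear to be lost.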
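The override effect can be illustrated with a simplified Python model in which, for each hour, the covering segment with the highest version wins. This is a sketch under stated assumptions: versions are plain timestamp strings, realtime segment versions are derived from the segment start time (the versioningPolicy default listed above), and the row counts and the `visible_rows` helper are hypothetical.

```python
def hour_version(hour):
    """Version string for a realtime segment, based on its start time."""
    return f"2016-01-01T{hour:02d}:00"


def visible_rows(segments, hour):
    """Rows a query sees for one hour: the highest-version covering segment wins."""
    covering = [s for s in segments if s["start"] <= hour < s["end"]]
    if not covering:
        return 0
    winner = max(covering, key=lambda s: s["version"])
    return winner["rows_by_hour"].get(hour, 0)


# Realtime ingestion: hourly segments for hours 0..15, 100 rows each.
realtime = [
    {"start": h, "end": h + 1, "version": hour_version(h), "rows_by_hour": {h: 100}}
    for h in range(16)
]

# Daily batch job started at 14:30, but its input only contained hours 0..12.
batch = {
    "start": 0, "end": 24, "version": "2016-01-01T14:30",
    "rows_by_hour": {h: 100 for h in range(13)},
}

all_segments = realtime + [batch]
for hour in (12, 13, 14, 15):
    print(hour, visible_rows(all_segments, hour))
# 12 100
# 13 0
# 14 0
# 15 100
```

Hours 13 and 14 were ingested by the realtime nodes, but the newer daily segment hides them without containing their data.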
Schema changes
Standalone realtime nodes must be restarted to pick up a schema change.
Log management
Each standalone realtime node keeps its own log, which makes it difficult to diagnose problems that span multiple partitions on multiple servers.
4. Batch Data Ingestion
Druid can ingest data from static files.
Hadoop-based Batch Ingestion
Hadoop-based batch ingestion is performed by submitting a Hadoop ingestion task to a running Druid overlord. A sample task:
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : { "column" : "timestamp", "format" : "auto" },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "doubleSum", "name" : "added", "fieldName" : "added" },
        { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" },
        { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : { "type" : "static", "paths" : "/MyDirectory/example/wikipedia_data.json" }
    },
    "tuningConfig" : { "type": "hadoop" }
  },
  "hadoopDependencyCoordinates": <my_hadoop_version>
}
| property | description | required? |
|---|---|---|
| type | The task type; this should always be "index_hadoop". | yes |
| spec | A Hadoop Index Spec. See Batch Ingestion. | yes |
| hadoopDependencyCoordinates | A JSON array of Hadoop dependency coordinates. It overrides the default Hadoop coordinates. Once specified, Druid looks for those Hadoop dependencies in the location specified by druid.extensions.hadoopDependenciesDir. | no |
| classpathPrefix | Classpath that will be pre-appended for the peon process. | no |
Druid automatically computes the classpath for the Hadoop job containers. If Hadoop's and Druid's dependencies conflict, you can specify the classpath manually by setting druid.extensions.hadoopContainerDruidClasspath (see the base Druid configuration).
DataSchema
This field is required. See Ingestion.
IOConfig
This field is required.
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | This should always be 'hadoop'. | yes |
| inputSpec | Object | A specification of where to pull the data in from. See below. | yes |
| segmentOutputPath | String | The path to dump segments into. | yes |
| metadataUpdateSpec | Object | A specification of how to update the metadata for the druid cluster these segments belong to. | yes |
InputSpec specification
There are multiple types of inputSpecs:
static
A type of inputSpec where a static path to the data files is provided.
| Field | Type | Description | Required |
|---|---|---|---|
| paths | Array of String | A String of input paths indicating where the raw data is located. | yes |
For example, static input paths:
"paths" : "s3n://billy-bucket/the/data/is/here/data.gz, s3n://billy-bucket/the/data/is/here/moredata.gz, s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
granularity
A type of inputSpec that expects data to be organized in directories according to datetime using the path format y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX (where date is represented by lowercase and time is represented by uppercase).
| Field | Type | Description | Required |
|---|---|---|---|
| dataGranularity | String | Specifies the granularity to expect the data at, e.g. hour means to expect directories y=XXXX/m=XX/d=XX/H=XX. | yes |
| inputPath | String | Base path to append the datetime path to. | yes |
| filePattern | String | Pattern that files should match to be included. | yes |
| pathFormat | String | Joda datetime format for each directory. Default value is "'y'=yyyy/'m'=MM/'d'=dd/'H'=HH", or see Joda documentation. | no |
For example, if the sample config were run with the interval 2012-06-01/2012-06-02, it would expect data at the paths:
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
...
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
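The expected directory list for an interval can be generated with a short Python sketch. It assumes hourly dataGranularity and the default pathFormat layout shown above; the function name `hourly_paths` is hypothetical and this is not Druid code.

```python
from datetime import datetime, timedelta


def hourly_paths(input_path, start, end):
    """List the hourly directories expected for the interval [start, end)."""
    paths = []
    current = start
    while current < end:
        paths.append(
            f"{input_path}/y={current:%Y}/m={current:%m}/d={current:%d}/H={current:%H}"
        )
        current += timedelta(hours=1)
    return paths


paths = hourly_paths(
    "s3n://billy-bucket/the/data/is/here",
    datetime(2012, 6, 1),
    datetime(2012, 6, 2),
)
print(len(paths))  # 24
print(paths[0])    # s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
print(paths[-1])   # s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
```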
dataSource
Read Druid segments. See here for more information.
multi
Read multiple sources of data. See here for more information.
TuningConfig
The tuningConfig is optional and default parameters will be used if no tuningConfig is specified.
| Field | Type | Description | Required |
|---|---|---|---|
| workingPath | String | The working path to use for intermediate results (results between Hadoop jobs). | no (default == '/tmp/druid-indexing') |
| version | String | The version of created segments. Ignored for HadoopIndexTask unless useExplicitVersion is set to true. | no (default == datetime that indexing starts at) |
| partitionsSpec | Object | A specification of how to partition each time bucket into segments. Absence of this property means no partitioning will occur. See 'Partitioning specification' below. | no (default == 'hashed') |
| maxRowsInMemory | Integer | The number of rows to aggregate before persisting. Note that this is the number of post-aggregation rows which may not be equal to the number of input events due to roll-up. This is used to manage the required JVM heap size. | no (default == 75000) |
| leaveIntermediate | Boolean | Leave behind intermediate files (for debugging) in the workingPath when a job completes, whether it passes or fails. | no (default == false) |
| cleanupOnFailure | Boolean | Clean up intermediate files when a job fails (unless leaveIntermediate is on). | no (default == true) |
| overwriteFiles | Boolean | Override existing files found during indexing. | no (default == false) |
| ignoreInvalidRows | Boolean | Ignore rows found to have problems. | no (default == false) |
| combineText | Boolean | Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files. | no (default == false) |
| useCombiner | Boolean | Use Hadoop combiner to merge rows at mapper if possible. | no (default == false) |
| jobProperties | Object | A map of properties to add to the Hadoop job configuration, see below for details. | no (default == null) |
| indexSpec | Object | Tune how data is indexed. See below for more information. | no |
| buildV9Directly | Boolean | Whether to build a v9 index directly instead of first building a v8 index and then converting it to v9 format. | no (default == true) |
| numBackgroundPersistThreads | Integer | The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and cpu usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1. | no (default == 0) |
| forceExtendableShardSpecs | Boolean | Forces use of extendable shardSpecs. Experimental feature intended for use with the Kafka indexing service extension. | no (default = false) |
| useExplicitVersion | Boolean | Forces HadoopIndexTask to use version. | no (default = false) |
jobProperties field of TuningConfig
"tuningConfig" : { "type": "hadoop", "jobProperties": { "<hadoop-property-a>": "<value-a>", "<hadoop-property-b>": "<value-b>" } }
Hadoop's MapReduce documentation lists the possible configuration parameters.
With some Hadoop distributions, it may be necessary to set mapreduce.job.classpath or mapreduce.job.user.classpath.first to avoid class loading issues. See the working with different Hadoop versions documentation for more details.
IndexSpec
| Field | Type | Description | Required |
|---|---|---|---|
| bitmap | Object | Compression format for bitmap indexes. Should be a JSON object; see below for options. | no (defaults to Concise) |
| dimensionCompression | String | Compression format for dimension columns. Choose from LZ4, LZF, or uncompressed. | no (default == LZ4) |
| metricCompression | String | Compression format for metric columns. Choose from LZ4, LZF, uncompressed, or none. | no (default == LZ4) |
| longEncoding | String | Encoding format for metric and dimension columns with type long. Choose from auto or longs. auto encodes the values using offset or lookup table depending on column cardinality, and stores them with variable size. longs stores the value as is with 8 bytes each. | no (default == longs) |
Bitmap types
For Concise bitmaps:
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Must be concise. | yes |
For Roaring bitmaps:
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Must be roaring. | yes |
| compressRunOnSerialization | Boolean | Use a run-length encoding where it is estimated as more space efficient. | no (default == true) |
Partitioning specification
Segments are always partitioned by timestamp (according to the granularitySpec) and may be further partitioned with another partition type, such as:
"hashed" (based on the hash of all dimensions in each row),
"dimension" (based on ranges of a single dimension).
Hashed partitioning is recommended over single-dimension partitioning: it improves indexing performance and produces segments that are more uniform in size.
Hash-based partitioning
"partitionsSpec": { "type": "hashed", "targetPartitionSize": 5000000 }
Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments according to the hash of all dimensions in each row. The number of segments is determined automatically based on the cardinality of the input set and a target partition size.
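The two steps (choose a segment count, then route each row by a hash of its dimensions) can be sketched in Python. This only illustrates the idea: SHA-1 is a stand-in for whatever hash Druid uses internally, the row estimate is supplied by hand rather than computed from the input, and the function names are hypothetical.

```python
import hashlib
import math


def choose_num_shards(estimated_rows, target_partition_size):
    """Number of segments needed so each holds about target_partition_size rows."""
    return max(1, math.ceil(estimated_rows / target_partition_size))


def shard_for_row(row, dimensions, num_shards):
    """Route a row to a shard using a hash over all of its dimension values."""
    key = "\x01".join(str(row.get(d, "")) for d in dimensions)
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


num_shards = choose_num_shards(estimated_rows=12_000_000, target_partition_size=5_000_000)
print(num_shards)  # 3

row = {"page": "Druid", "language": "en", "user": "alice"}
shard = shard_for_row(row, ["page", "language", "user"], num_shards)
print(0 <= shard < num_shards)  # True
```

Setting numShards directly corresponds to skipping `choose_num_shards`, which is why the options below note that ingestion runs faster in that case.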
The configuration options are:
| Field | Description | Required |
|---|---|---|
| type | Type of partitionSpec to be used. | "hashed" |
| targetPartitionSize | Target number of rows to include in a partition, should be a number that targets segments of 500MB~1GB. | either this or numShards |
| numShards | Specify the number of partitions directly, instead of a target partition size. Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically. | either this or targetPartitionSize |
| partitionDimensions | The dimensions to partition on. Leave blank to select all dimensions. Only used with numShards, will be ignored when targetPartitionSize is set. | no |
Single-dimension partitioning
"partitionsSpec": { "type": "dimension", "targetPartitionSize": 5000000 }
Single-dimension partitioning works by first selecting a dimension to partition on, and then separating that dimension into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example, your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and "f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can override it with a specific dimension.
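Range lookup on a single dimension can be sketched with a binary search in Python. This is an illustration under assumptions: the split point comes from the "host" example above, range starts are treated as inclusive, and the function name `segment_for_value` is hypothetical.

```python
import bisect


def segment_for_value(split_points, value):
    """Index of the contiguous range that contains value.

    split_points is a sorted list of boundaries; n boundaries define n + 1
    ranges, and a value equal to a boundary falls in the range starting there.
    """
    return bisect.bisect_right(split_points, value)


split_points = ["f.example.com"]  # two ranges: below it, and from it upward

print(segment_for_value(split_points, "b.example.com"))  # 0
print(segment_for_value(split_points, "f.example.com"))  # 1
print(segment_for_value(split_points, "m.example.com"))  # 1
```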
The configuration options are:
| Field | Description | Required |
|---|---|---|
| type | Type of partitionSpec to be used. | "dimension" |
| targetPartitionSize | Target number of rows to include in a partition, should be a number that targets segments of 500MB~1GB. | yes |
| maxPartitionSize | Maximum number of rows to include in a partition. Defaults to 50% larger than the targetPartitionSize. | no |
| partitionDimension | The dimension to partition on. Leave blank to select a dimension automatically. | no |
| assumeGrouped | Assume that input data has already been grouped on time and dimensions. Ingestion will run faster, but may choose sub-optimal partitions if this assumption is violated. | no |
Remote Hadoop Cluster
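If you have a remote Hadoop cluster, make sure the Hadoop configuration *.xml files are included in your Druid _common configuration folder. If there are dependency version conflicts between your Hadoop version and Druid, see these docs.
Having Problems?
If you run into problems when getting started, you can find help on the google groups page.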