{"time":1529209115078,"product_type":"unknown","model":"Other","log_type":"timeon.behavior","api_no":"81","data":{"page":"POPUP:confirmAutorecPack","title":"22222222","page_session":"12","category":"ForumSubscribeAndAutorecFlow","action":"ToggleSubscribeFromWizard","label":"Unsubscribe pack with autorec disabled","area_code":25,"operation_time":"20180611044635","time_zone":"+0900","pdid":"0741968309533296"},"pdid":"0741968309533296","uid":"no_uid"}
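Using the nested JSON record above as an example of data that needs flattening, the JSON configuration file for reading it from Kafka is as follows: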
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "timeon.behavior",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            { "type": "root", "name": "product_type" },
            { "type": "root", "name": "model" },
            { "type": "root", "name": "log_type" },
            { "type": "root", "name": "api_no" },
            { "type": "path", "name": "page", "expr": "$.data.page" },
            { "type": "path", "name": "title", "expr": "$.data.title" },
            { "type": "path", "name": "page_session", "expr": "$.data.page_session" },
            { "type": "path", "name": "category", "expr": "$.data.category" },
            { "type": "path", "name": "action", "expr": "$.data.action" },
            { "type": "path", "name": "label", "expr": "$.data.label" },
            { "type": "path", "name": "value", "expr": "$.data.value" },
            { "type": "path", "name": "area_code", "expr": "$.data.area_code" },
            { "type": "path", "name": "operation_time", "expr": "$.data.operation_time" },
            { "type": "path", "name": "time_zone", "expr": "$.data.time_zone" },
            { "type": "path", "name": "data_pdid", "expr": "$.data.pdid" },
            { "type": "root", "name": "pdid" },
            { "type": "root", "name": "uid" }
          ]
        },
        "dimensionsSpec": {
          "dimensions": [],
          "dimensionExclusions": [],
          "spatialDimensions": []
        },
        "timestampSpec": {
          "column": "time",
          "format": "millis"
        }
      }
    },
    "metricsSpec": [
      {
        "name": "count",
        "type": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  },
  "ioConfig": {
    "topic": "timeon.behavior",
    "consumerProperties": {
      "bootstrap.servers": "kafka-a:9092,kafka-b:9092,kafka-c:9092"
    },
    "taskCount": 3,
    "replicas": 1,
    "taskDuration": "PT1H10M"
  }
}
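To make the effect of the root and path fields in the spec above more concrete, here is a minimal Python sketch that applies a subset of those fields to an abbreviated copy of the sample record. It is only an illustration, not Druid's implementation: `resolve_path` and `flatten` are hypothetical helpers, only simple `$.a.b` object lookups are handled, and returning `None` for a missing path (such as `$.data.value`, which the sample record does not contain) is an assumption of the sketch.

```python
import json

# Abbreviated copy of the sample record from the top of this section.
record = json.loads("""
{"time": 1529209115078, "product_type": "unknown", "model": "Other",
 "data": {"page": "POPUP:confirmAutorecPack", "area_code": 25,
          "pdid": "0741968309533296"},
 "pdid": "0741968309533296", "uid": "no_uid"}
""")

# A subset of the flattenSpec fields from the spec above.
fields = [
    {"type": "root", "name": "product_type"},
    {"type": "path", "name": "page", "expr": "$.data.page"},
    {"type": "path", "name": "value", "expr": "$.data.value"},
    {"type": "path", "name": "data_pdid", "expr": "$.data.pdid"},
    {"type": "root", "name": "pdid"},
]


def resolve_path(obj, expr):
    """Hypothetical helper: resolve a dotted path such as "$.data.page".

    Only plain object lookups are supported, not full JSONPath.
    """
    current = obj
    for key in expr.lstrip("$").strip(".").split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # assumption of this sketch: missing path -> None
        current = current[key]
    return current


def flatten(obj, field_specs):
    """Hypothetical helper: build one flat row from root and path fields."""
    row = {}
    for field in field_specs:
        if field["type"] == "root":
            # A root field is read directly from the top level of the record.
            row[field["name"]] = obj.get(field["name"])
        elif field["type"] == "path":
            # A path field is read from the nested location given by expr.
            row[field["name"]] = resolve_path(obj, field["expr"])
    return row


flat_row = flatten(record, fields)
print(flat_row)
```

Note that the nested `data.pdid` is given the output name `data_pdid` so that it does not collide with the root-level `pdid`.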
A few points are worth noting:
(1) parseSpec mainly consists of flattenSpec, dimensionsSpec, and timestampSpec.
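(2) flattenSpec configures the fields of the data. Nested data can be read with field types such as root, path, and jq. In addition, with useFieldDiscovery=true, root-level fields that hold singular values (and flat lists of singular values) are discovered automatically; nested objects are not, so they need explicit path or jq fields.
The timestamp column should not be configured in fields.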
(3) dimensionsSpec configures the dimension columns. If dimensions=[], the fields are used directly as dimensions, which saves configuring each dimension one by one.
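(4) timestampSpec configures the timestamp column, and the key setting is format. Druid supports two kinds of timestamp columns, string and numeric, and the string kind also accepts numeric values. posix means seconds since the epoch, millis means milliseconds since the epoch, and iso means ISO 8601 time; for example, a 13-digit value such as 1529209115078 is in milliseconds and calls for millis. For Joda time patterns, see http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html. format only defines the format of the timestamp column in the source data, not the format of the data once it is stored in Druid. It must match the actual data, otherwise the timestamps will be parsed incorrectly.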
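As a quick sanity check on the choice of format, the following Python sketch interprets the sample record's `time` value both as milliseconds and as seconds. It uses only the Python standard library and does not call Druid; the year estimate for the seconds interpretation is a rough average-year approximation. The last lines parse the sample `operation_time` string with `%Y%m%d%H%M%S`, which is the Python counterpart of the Joda pattern `yyyyMMddHHmmss`.

```python
from datetime import datetime, timezone

raw_time = 1529209115078  # "time" value from the sample record

# Interpreted as milliseconds since the epoch (format "millis").
as_millis = datetime.fromtimestamp(raw_time / 1000, tz=timezone.utc)
print("as millis:", as_millis.isoformat())

# Interpreted as seconds since the epoch (format "posix"), the same number
# lands tens of thousands of years in the future. The year is estimated
# arithmetically because datetime cannot represent years beyond 9999.
SECONDS_PER_YEAR = 365.2425 * 24 * 3600
approx_year_if_seconds = 1970 + raw_time / SECONDS_PER_YEAR
print("as seconds, approximate year:", int(approx_year_if_seconds))

# A string timestamp such as the sample operation_time needs an explicit
# pattern; "%Y%m%d%H%M%S" here corresponds to Joda's "yyyyMMddHHmmss".
operation_time = datetime.strptime("20180611044635", "%Y%m%d%H%M%S")
print("operation_time:", operation_time.isoformat())
```

If the reading under one interpretation is implausible for the data, as it is for the seconds interpretation here, the format in timestampSpec probably does not match the source column.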