Druid -- Scheduled Incremental Import of HDFS Data into Druid

Latest recommended article published 2021-08-26 17:05:33

TheBiiigBlue

Category: Druid. Tags: big data, druid, hdfs, hadoop

Article link: https://blog.csdn.net/Aeve_imp/article/details/107764890

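Contents

Integrating Druid with HDFS
Writing Spec.json
Writing the date-substitution shell script
Submitting the task to Druid on a schedule
The complete shell script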

Integrating Druid with HDFS

For installing Druid and integrating it with HDFS, see: Druid -- Imply-based cluster deployment

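Writing Spec.json

For a first-time import you can simply click through the Druid web console. A scheduled incremental import, however, has to be submitted by a script run from crontab, so you need to prepare the JSON spec file that the task submission requires. Below is one of our templates. For the official example, see the Druid documentation: Hadoop-based ingestion. (The template below uses the native index_parallel task with an HDFS input source.)
Because this is a recurring batch import, the date has to change on every run. The submitted file is plain JSON written in advance, and JSON cannot contain shell variables, so the template uses custom placeholders instead: #date1# in the HDFS path and #date2# in the timestamp. Both are substituted before the task is submitted.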

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "hdfs",
        "paths": "/spark/struct_data/#date1#/PubLisErrTake2"
      },
      "inputFormat": {
        "type": "tsv",
        "findColumnsFromHeader": true
      },
      "appendToExisting": true
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    },
    "dataSchema": {
      "dataSource": "PubLisErrTake2",
      "granularitySpec": {
        "type": "uniform",
        "queryGranularity": "DAY",
        "rollup": true,
        "segmentGranularity": "DAY"
      },
      "timestampSpec": {
        "column": "!!!_no_such_column_!!!",
        "missingValue": "#date2#T00:00:00Z"
      },
      "dimensionsSpec": {
        "dimensions": [
          "样例url前五条",
          "listing",
          "pubcode",
          {
            "type": "long",
            "name": "漏文数量"
          }
        ]
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        },
        {
          "name": "sum_miss",
          "type": "longSum",
          "fieldName": "漏文数量"
        }
      ]
    }
  }
}

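Writing the date-substitution shell script

Since JSON cannot hold variables, we put unique custom strings in the template and replace them with a shell script. The script below replaces #date1# (format YYYYMMDD) and #date2# (format YYYY-MM-DD) with the current date and writes the result to a new file.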

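#!/bin/bash
# Today's date in the two formats the template needs
date1=$(date +%Y%m%d)      # replaces #date1# in the HDFS path
date2=$(date +%Y-%m-%d)    # replaces #date2# in the timestamp

# Replace the date placeholders in the PubLisErrTake2 template
sed "s/#date1#/$date1/g;s/#date2#/$date2/g" spec/Template_PubLisErrTake2Spec.json > spec/PubLisErrTake2Spec.json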
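
The substitution above always stamps the current date. To re-run a missed day, the same sed call can be wrapped in a function that takes an optional date. This is only a minimal sketch: render_spec and its arguments are hypothetical names, not part of the original script or of Druid.

```shell
# Hypothetical helper: render a spec template for a given day.
# Usage: render_spec TEMPLATE OUTPUT [YYYY-MM-DD]   (date defaults to today)
render_spec() {
  template="$1"
  output="$2"
  # Dashed form fills #date2#; fall back to today's date if none is given
  date2="${3:-$(date +%Y-%m-%d)}"
  # Compact form fills #date1#; derived by dropping the dashes
  date1=$(printf '%s' "$date2" | tr -d '-')
  sed "s/#date1#/$date1/g;s/#date2#/$date2/g" "$template" > "$output"
}
```

For example, `render_spec spec/Template_PubLisErrTake2Spec.json spec/PubLisErrTake2Spec.json 2020-08-03` would render the spec for 3 August 2020, and omitting the last argument reproduces the daily behaviour.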

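Submitting the task to Druid on a schedule

Use the script bundled with Druid, bin/post-index-task, to submit the generated spec file to Druid; run on a schedule, it loads each day's HDFS data into Druid.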

/services/imply-3.4.0/bin/post-index-task \
--file spec/PubLisErrTake2Spec.json \
--url http://xxx:8081
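
The scheduling itself can be handled by a crontab entry that runs the script once a day. The sketch below only prints a candidate entry. The script directory, script name, log path, and the 01:00 run time are placeholders chosen for illustration, not values from our setup.

```shell
# Hypothetical locations; adjust to your deployment.
script_dir=/path/to/import       # directory containing the script and spec/
script_name=import_druid.sh      # the complete script from the last section
log_file=/tmp/druid_import.log

# cron starts jobs in the user's home directory, so cd first:
# the script refers to spec/ with relative paths.
cron_line="0 1 * * * cd $script_dir && ./$script_name >> $log_file 2>&1"
echo "$cron_line"

# To install it, append the line to the current crontab:
#   (crontab -l 2>/dev/null; echo "$cron_line") | crontab -
```

Keeping the date commands inside the script also avoids a common pitfall: cron treats an unescaped % in the crontab entry itself as a newline, so a date format written directly in the entry would need \% escapes.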

The complete shell script

#!/bin/bash
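# Today's date in the two formats the template needs
date1=$(date +%Y%m%d)      # replaces #date1# in the HDFS path
date2=$(date +%Y-%m-%d)    # replaces #date2# in the timestamp

# Replace the date placeholders in the template
sed "s/#date1#/$date1/g;s/#date2#/$date2/g" spec/TemplateSpec.json > spec/Spec.json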

binPath=/services/imply-3.4.0/bin
$binPath/post-index-task --file spec/Spec.json \
--url http://xxx:8081
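
Since the script runs unattended, it may be worth refusing to submit a spec that still contains an unreplaced placeholder, for example after a new marker is added to the template without a matching sed rule. A minimal sketch, assuming placeholders always look like # followed by lowercase letters, digits, or underscores and a closing #; check_rendered is a hypothetical helper, not part of Druid.

```shell
# Hypothetical guard: return non-zero if FILE still contains a #name# placeholder.
check_rendered() {
  # grep prints the offending lines (with line numbers) to stderr when it finds any
  if grep -n '#[a-z0-9_][a-z0-9_]*#' "$1" >&2; then
    echo "unreplaced placeholder(s) in $1" >&2
    return 1
  fi
}
```

It could be chained in front of the submission, e.g. `check_rendered spec/Spec.json && $binPath/post-index-task --file spec/Spec.json --url http://xxx:8081`, so that a broken spec never reaches Druid.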