Loading Local Data into Apache Druid

1. How data loading works

Loading data requires submitting an ingestion spec to the Overlord. An ingestion spec is simply a JSON document describing the ingestion task, and it can be produced in two ways (a skeleton of the result is sketched after this list):

  1. Write it by hand
  2. Use the data loader built into the Druid web console: load a small sample of the data, configure the parameters, and let it generate the spec for us
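
Either way, the result has the same top-level shape: a dataSchema (datasource name, timestamp, dimensions, granularity), an ioConfig (where and how to read the input), and a tuningConfig (segment and memory tuning). The skeleton looks like this (details elided; a full working example appears in section 4):

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": { ... },
    "tuningConfig": { ... }
  }
}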

2. Loading data with the Data Loader

The steps are as follows:

Choose the local input source and select the file to load.
Click Apply to preview the file's data, then click Next: Parse data. The input format is auto-detected as json. Then click Next: Parse time.

Timestamp handling: Druid requires a primary time column named __time. If the data has no time field, you can choose Constant value instead.
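
Behind the scenes, this step only edits the timestampSpec in the generated spec. A minimal sketch of a timestampSpec that parses an ISO time column but falls back to a constant for rows with no usable value (the missingValue timestamp below is an arbitrary example):

"timestampSpec": {
  "column": "time",
  "format": "iso",
  "missingValue": "2015-09-12T00:00:00Z"
}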

Click Next: Transform, then Next: Filter, then Next: Configure schema. On the Configure schema step you can select the columns you need; then click Next: Partition.

Configure the partitioning, then click Next: Publish.

On the Publish step, set the datasource name, then continue to the Spec step.
You can edit the spec here directly; if you then go back to earlier steps, you will see the corresponding options updated to match.

After the task is submitted, you can view the new datasource and query the data.
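
For the query step, besides the console's Query view, you can issue Druid SQL over HTTP. A minimal sketch against the SQL endpoint (the host bigdata001 and port 8888 are assumptions, substitute your Router's or Broker's address, and replace wikipedia with the datasource name you published):

curl -X POST -H 'Content-Type: application/json' \
  -d '{"query": "SELECT * FROM wikipedia LIMIT 5"}' \
  http://bigdata001:8888/druid/v2/sql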

3. Loading data with a spec (via the web console)

Submit a JSON task: paste the contents of /opt/apache-druid-0.22.1/quickstart/tutorial/wikipedia-index.json into the text box, then click Submit to submit the task.

This JSON file defines a task that automatically reads /opt/apache-druid-0.22.1/quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz and creates a datasource named wikipedia.
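
To see what the raw input looks like before ingesting it, you can peek at the first record of the gzipped sample file (output omitted here; each line is one JSON event):

[root@bigdata001 apache-druid-0.22.1]# zcat quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz | head -n 1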

4. Loading data with a spec (via the command line)

First, reinitialize the cluster to wipe all existing data. The cleanup steps are as follows (a command sketch follows the list):

  1. Stop the cluster
  2. Delete the /druid directory from HDFS
  3. Delete the /druid node from ZooKeeper
  4. Drop the druid database in MySQL, then recreate it
  5. On every server in the Druid cluster, delete var_master/*, var_data/*, var_query/*, var/druid, and var/tmp/*
  6. Start the cluster
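
A sketch of steps 2 through 5 as shell commands. The ZooKeeper address and MySQL credentials are assumptions for this cluster; deleteall requires ZooKeeper 3.5+ (use rmr /druid on older versions):

# step 2: remove Druid deep storage from HDFS
hdfs dfs -rm -r /druid
# step 3: remove the Druid znode from ZooKeeper
zkCli.sh -server bigdata001:2181 deleteall /druid
# step 4: recreate the metadata database
mysql -uroot -p -e "DROP DATABASE druid; CREATE DATABASE druid DEFAULT CHARACTER SET utf8mb4"
# step 5: run on EVERY Druid server, from the install directory
rm -rf var_master/* var_data/* var_query/* var/druid var/tmp/*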

The contents of wikipedia-index.json are as follows:

[root@bigdata001 apache-druid-0.22.1]# 
[root@bigdata001 apache-druid-0.22.1]# cat quickstart/tutorial/wikipedia-index.json 
{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec" : {
        "dimensions" : [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user",
          { "name": "added", "type": "long" },
          { "name": "deleted", "type": "long" },
          { "name": "delta", "type": "long" }
        ]
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}
[root@bigdata001 apache-druid-0.22.1]#

Run the task from the command line:

[root@bigdata001 apache-druid-0.22.1]# 
[root@bigdata001 apache-druid-0.22.1]# pwd
/opt/apache-druid-0.22.1
[root@bigdata001 apache-druid-0.22.1]# 
[root@bigdata001 apache-druid-0.22.1]# bin/post-index-task --file quickstart/tutorial/wikipedia-index.json --url http://bigdata003:9081
Beginning indexing data for wikipedia
Redirect response received, setting url to [http://bigdata002:9081]
Task started: index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z
Task log:     http://bigdata002:9081/druid/indexer/v1/task/index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z/log
Task status:  http://bigdata002:9081/druid/indexer/v1/task/index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z/status
Task index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z still running...
Task index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z still running...
Task index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
Traceback (most recent call last):
  File "/opt/apache-druid-0.22.1/bin/post-index-task-main", line 174, in <module>
    main()
  File "/opt/apache-druid-0.22.1/bin/post-index-task-main", line 171, in main
    await_load_completion(args, datasource, load_timeout_at)
  File "/opt/apache-druid-0.22.1/bin/post-index-task-main", line 119, in await_load_completion
    response = urllib2.urlopen(req, None, response_timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
[root@bigdata001 apache-druid-0.22.1]#
  • Port 9081 is the Coordinator's plain port
  • The urllib2.HTTPError can be ignored, since the output above already shows the data was indexed successfully
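
To check on a task yourself instead of relying on the script, you can query the status URL printed in the task log (the task ID below is the one from this run):

[root@bigdata001 apache-druid-0.22.1]# curl http://bigdata002:9081/druid/indexer/v1/task/index_parallel_wikipedia_jiklodcc_2022-03-28T09:52:50.915Z/status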

5. Loading data with curl

Change to the apache-druid-0.22.1 directory:

[root@bigdata001 apache-druid-0.22.1]# 
[root@bigdata001 apache-druid-0.22.1]# pwd
/opt/apache-druid-0.22.1
[root@bigdata001 apache-druid-0.22.1]# 
[root@bigdata001 apache-druid-0.22.1]# curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://bigdata002:9081/druid/indexer/v1/task
{"task":"index_parallel_wikipedia_ffpapmhp_2022-03-28T15:02:43.737Z"}[root@bigdata001 apache-druid-0.22.1]# 
[root@bigdata001 apache-druid-0.22.1]# 
  • The Coordinator you connect to must be the leader Coordinator
  • The JSON string is returned only when the task is submitted successfully
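
Once the task finishes, you can confirm that the datasource exists with the Coordinator's datasource listing (same host and port as above):

[root@bigdata001 apache-druid-0.22.1]# curl http://bigdata002:9081/druid/coordinator/v1/datasources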