Deploying Gobblin in mapreduce mode

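The prebuilt Gobblin tar.gz releases are currently compiled against Hadoop 2.3 by default. If your Hadoop version is higher than 2.3, you need to download the source and build it yourself; otherwise the Job fails with an error when run in mapreduce mode.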

The FAQ covers this as well:

What Hadoop versions can Gobblin run on?

Gobblin can only be run on Hadoop 2.x. By default, Gobblin compiles against Hadoop 2.3.0.

How do I fix UnsupportedFileSystemException: No AbstractFileSystem for scheme: null?

This error typically occurs due to Hadoop version conflict issues. If Gobblin is compiled against a specific Hadoop version, but then deployed on a different Hadoop version or installation, this error may be thrown. For example, if you simply compile Gobblin using ./gradlew clean build, but deploy Gobblin to a cluster with CDH installed, you may hit this error.
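
Before choosing between the prebuilt tarball and a source build, it helps to compare the cluster's Hadoop version with the one Gobblin was compiled against. The following is a minimal sketch, assuming `hadoop version` prints a first line of the form `Hadoop 2.6.0`; the helper names `parse_hadoop_version` and `needs_rebuild` are made up for this example, and the check flags any version that differs from 2.3.0 (following the FAQ wording) rather than only higher ones.

```shell
#!/bin/bash
# Version the prebuilt tarball targets, per the FAQ quoted above.
BUILT_AGAINST="2.3.0"

# Extract "2.6.0" from a first line such as "Hadoop 2.6.0" (read from stdin).
parse_hadoop_version() {
    awk 'NR == 1 && $1 == "Hadoop" { print $2 }'
}

# Exit status 0 when the given version differs from BUILT_AGAINST.
needs_rebuild() {
    [ "$1" != "$BUILT_AGAINST" ]
}

# Only query the cluster when the hadoop command is actually available.
if command -v hadoop >/dev/null 2>&1; then
    cluster_version="$(hadoop version | parse_hadoop_version)"
    if needs_rebuild "$cluster_version"; then
        echo "Hadoop $cluster_version differs from $BUILT_AGAINST: build Gobblin with -PhadoopVersion=$cluster_version"
    else
        echo "Hadoop $cluster_version matches the prebuilt tarball"
    fi
fi
```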

Let's walk through deploying a Gobblin mapreduce job from scratch:

1. Download the Gobblin source

This walkthrough uses the 0.8.0 source: Gobblin Source Code 0.8.0, Github Gobblin 0.8.0

[flink@cninfo ~]$ unzip software/gobblin-gobblin_0.8.0.zip
[flink@cninfo ~]$ ls | grep gobblin
gobblin-gobblin_0.8.0

2. Download and install Gradle, then set the environment variables

DOWNLOAD GRADLE 3.2.1

Set the environment variables:

1. vim ~/.bashrc

2. Add the following lines:
export GRADLE_HOME=/home/flink/gradle-3.2.1
export PATH=$PATH:$GRADLE_HOME/bin

3. source ~/.bashrc
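
If this setup is scripted rather than done by hand in vim, the two export lines should be added only once, even if the script runs again. Here is a minimal sketch; `append_once` is a hypothetical helper written for this example and is not part of Gradle or Gobblin.

```shell
#!/bin/bash
# Append a line to a file only if that exact line is not already present.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage with the paths from this article:
# append_once 'export GRADLE_HOME=/home/flink/gradle-3.2.1' ~/.bashrc
# append_once 'export PATH=$PATH:$GRADLE_HOME/bin' ~/.bashrc
# source ~/.bashrc && gradle -v
```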

3. Build Gobblin

First make sure Gradle is installed correctly (for example, by running gradle -v).

Build Gobblin against your own Hadoop version:

[flink@cninfo gobblin-gobblin_0.8.0]$ ./gradlew clean build -PhadoopVersion=2.6.0 -Pversion=gobblin_0.8.0 -x test

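This step downloads many jars and takes a long time; a first build may run for several hours. When it finishes, a "BUILD SUCCESSFUL" message is printed.

On success, a built tar.gz file appears in the gobblin-gobblin_0.8.0 directory: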

gobblin-distribution-gobblin_0.8.0.tar.gz

4. Configure the Gobblin mapreduce Job

After unpacking the tarball, edit bin/gobblin-env.sh:

[flink@cninfo gobblin]$ cat bin/gobblin-env.sh 
#!/bin/bash

# Set Gobblin specific environment variables here.

export GOBBLIN_JOB_CONFIG_DIR=/home/flink/gobblin/gobblin-config-dir
export GOBBLIN_WORK_DIR=/home/flink/gobblin/gobblin-work-dir
export HADOOP_BIN_DIR=/home/flink/hadoop/hadoop-2.6.0/bin

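Create the two directories above (GOBBLIN_JOB_CONFIG_DIR and GOBBLIN_WORK_DIR).

In the directory that GOBBLIN_JOB_CONFIG_DIR points to, create the mapreduce Job file: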
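
The two local directories can be created straight from the values in bin/gobblin-env.sh, so the paths are never typed twice. A minimal sketch, assuming the env file exports GOBBLIN_JOB_CONFIG_DIR and GOBBLIN_WORK_DIR as shown above; `ensure_gobblin_dirs` is a hypothetical helper name.

```shell
#!/bin/bash
# Source the given env file, then create the two directories it names.
ensure_gobblin_dirs() {
    local env_file="$1"
    # Load GOBBLIN_JOB_CONFIG_DIR and GOBBLIN_WORK_DIR into this shell.
    . "$env_file"
    mkdir -p "$GOBBLIN_JOB_CONFIG_DIR" "$GOBBLIN_WORK_DIR"
}

# Example usage from the Gobblin install directory:
# ensure_gobblin_dirs bin/gobblin-env.sh
```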

[flink@cninfo gobblin]$ cat gobblin-config-dir/gobblin-mapreduce.pull 
job.name=GobblinKafkaMapreduce
job.group=GobblinKafkaForMapreduce
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers=flink:9092,data0:9092,mf:9092

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

topic.whitelist=test

simple.writer.delimiter=\n

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=1

metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt

bootstrap.with.offset=earliest

fs.uri=hdfs://flink:9000
writer.fs.uri=hdfs://flink:9000
state.store.fs.uri=hdfs://flink:9000

mr.job.root.dir=/gobblin-kafka/working
state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/gobblintest/job-output

This job file differs from the standalone-mode one only in a few extra settings: it adds HDFS information such as fs.uri, so that the paths point to HDFS.
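
Since the HDFS settings are what separate this job file from a standalone one, a quick check that they are all present can catch a job file copied over without them. This is a minimal sketch: the key list simply mirrors the HDFS-related entries in the job file above and is not an official list of required Gobblin properties, and `check_pull_keys` is a hypothetical helper.

```shell
#!/bin/bash
# HDFS-related keys taken from the job file in this article (an assumption,
# not an authoritative list of what Gobblin requires).
HDFS_KEYS="fs.uri writer.fs.uri state.store.fs.uri mr.job.root.dir state.store.dir data.publisher.final.dir"

# Print "missing: <key>" for each absent key; return 1 if any is missing.
check_pull_keys() {
    local file="$1" key missing=0
    for key in $HDFS_KEYS; do
        # Compare the text before the first "=" exactly against the key.
        if ! awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$file"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage:
# check_pull_keys gobblin-config-dir/gobblin-mapreduce.pull
```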

5. Run the mapreduce Job

[flink@flink bin]$ gobblin-mapreduce.sh --conf ../gobblin-config-dir/gobblin-mapreduce.pull --workdir /home/flink/gobblin/gobblin-work-dir

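Note that the workdir argument here refers to the gobblin-work-dir path on HDFS, resolved from the HDFS root, rather than to a directory on the local file system.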

6. Output

When the mr program finishes, two directories appear under / on HDFS:

gobblin-kafka
gobblintest

Of these, gobblin-kafka holds state and related information, while gobblintest is the final output path, as specified in the .pull file.

The contents of the test topic are written to gobblintest.
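
To inspect the result from the command line, the output path can be read from the job file and listed with `hdfs dfs -ls -R`. A minimal sketch, assuming the hdfs client is on the PATH; `pull_value` is a hypothetical helper, and PULL_FILE defaults to the job file path used in this article.

```shell
#!/bin/bash
# Print the value of one key from a properties-style .pull file.
pull_value() {
    awk -F= -v k="$2" '$1 == k { print substr($0, length(k) + 2) }' "$1"
}

PULL_FILE="${PULL_FILE:-gobblin-config-dir/gobblin-mapreduce.pull}"

# Only list the output when the hdfs client and the job file are both available.
if command -v hdfs >/dev/null 2>&1 && [ -f "$PULL_FILE" ]; then
    out_dir="$(pull_value "$PULL_FILE" data.publisher.final.dir)"
    hdfs dfs -ls -R "$out_dir"
fi
```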

References

FAQ
Building Gobblin
Setting up a Gobblin environment
Building Gobblin with CDH 5.4.0 support
Deployment
Downloading and installing Gradle
Collecting Kafka data with Gobblin
