Loading Data into Hive with Flume's HDFS Sink

Overall flow: an Avro source receives the data, which is buffered in a Spillable Memory channel; an HDFS sink then lands it in HDFS, and finally the scheduling system runs a script that loads it into Hive.

The original plan was to use the Hive sink:

logger.sources = r1
logger.sinks = k1
logger.channels = c1

# Describe/configure the source
logger.sources.r1.type = Avro
logger.sources.r1.bind = 0.0.0.0
logger.sources.r1.port = 6666

#Spillable Memory Channel
logger.channels.c1.type=SPILLABLEMEMORY
logger.channels.c1.checkpointDir = /data/flume/checkpoint
logger.channels.c1.dataDirs = /data/flume

# Describe the sink
logger.sinks.k1.type = hive
logger.sinks.k1.hive.metastore = thrift://hadoop01.com:9083
logger.sinks.k1.hive.database = tmp
logger.sinks.k1.hive.table = app_log
logger.sinks.k1.hive.partition = %y-%m-%d-%H-%M
logger.sinks.k1.batchSize = 10000
logger.sinks.k1.useLocalTimeStamp = true
logger.sinks.k1.round = true
logger.sinks.k1.roundValue = 10
logger.sinks.k1.roundUnit = minute
logger.sinks.k1.serializer = DELIMITED
logger.sinks.k1.serializer.delimiter = "\n"
logger.sinks.k1.serializer.serdeSeparator = '\t'
logger.sinks.k1.serializer.fieldnames =log

# Bind the source and sink to the channel
logger.sources.r1.channels = c1
logger.sinks.k1.channel=c1

However, during development this sink turned out to be full of pitfalls and inexplicable errors, and I eventually gave up on it and switched to the HDFS sink.
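
For reference, the Hive sink writes through Hive's streaming transaction API, so the table it targets must be bucketed, stored as ORC, and have transactions enabled. A minimal sketch of what tmp.app_log would have had to look like for the configuration above (the partition column, bucket column, and bucket count here are assumptions for illustration, not taken from the original setup):

-- Sketch only: a transactional, bucketed ORC table as required by the Hive sink
CREATE TABLE tmp.app_log (
log STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (log) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

Keeping this table definition, the metastore, and Hive's transaction settings aligned is a large part of the extra setup the Hive sink demands compared with the HDFS sink.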

1. flume.conf

logger.sources = r1
logger.sinks = k1
logger.channels = c1

# Describe/configure the source
logger.sources.r1.type = Avro
logger.sources.r1.bind = 0.0.0.0
logger.sources.r1.port = 6666

#Spillable Memory Channel
logger.channels.c1.type=SPILLABLEMEMORY
logger.channels.c1.checkpointDir = /data/flume/checkpoint
logger.channels.c1.dataDirs = /data/flume

# Describe the sink
logger.sinks.k1.type = hdfs
logger.sinks.k1.hdfs.path = hdfs://zsCluster/collection-logs/buried-logs/dt=%Y-%m-%d/
logger.sinks.k1.hdfs.filePrefix = collection-%Y-%m-%d_%H
logger.sinks.k1.hdfs.fileSuffix = .log
logger.sinks.k1.hdfs.useLocalTimeStamp = true
logger.sinks.k1.hdfs.round = false
logger.sinks.k1.hdfs.roundValue = 10
logger.sinks.k1.hdfs.roundUnit = minute
logger.sinks.k1.hdfs.batchSize = 1000
logger.sinks.k1.hdfs.minBlockReplicas=1
# fileType: defaults to SequenceFile.
# With DataStream the output file is not compressed and hdfs.codeC is not needed; with CompressedStream a valid hdfs.codeC must be set.
logger.sinks.k1.hdfs.fileType=DataStream
logger.sinks.k1.hdfs.writeFormat=Text
logger.sinks.k1.hdfs.rollSize=0
logger.sinks.k1.hdfs.rollInterval=600
logger.sinks.k1.hdfs.rollCount=0
logger.sinks.k1.hdfs.callTimeout = 60000

# Bind the source and sink to the channel
logger.sources.r1.channels = c1
logger.sinks.k1.channel=c1

A few points to note here:

1) In logger.sinks.k1.hdfs.path, the path ends with dt=%Y-%m-%d; dt is the partition column of the Hive table created below.

2) logger.sinks.k1.hdfs.writeFormat is set to Text (the default is Writable).

The meaning of the other properties is described in the official Flume 1.9.0 User Guide: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#hdfs-sink

A Chinese translation (probably the most complete one available) can be found at https://flume.liyifeng.org/#hdfs-sink

2. Create a temporary Hive table

CREATE EXTERNAL TABLE tmp.app_log (
log STRING
)
COMMENT 'log info'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/collection-logs/buried-logs/';

Since my data is JSON, the table has only one column for now; the individual fields are split out later in the business layer (a hedged sketch of that split follows below). The partition column name must match the one used in hdfs.path in the previous step, i.e. dt in both places. Because writeFormat was set to Text in the previous step, STORED AS must use TextInputFormat here. LOCATION must point to the same directory as hdfs.path.
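
As a rough illustration of that later split (the JSON field names user_id, event_type, and event_time are made up for this sketch and are not part of the original pipeline), the business layer could pull columns out of the raw JSON with get_json_object:

-- Sketch only: the JSON paths below are hypothetical, adjust them to the real log schema
SELECT
get_json_object(log, '$.user_id')    AS user_id,
get_json_object(log, '$.event_type') AS event_type,
get_json_object(log, '$.event_time') AS event_time
FROM tmp.app_log
WHERE dt = '${last_date}';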

3. Add the partition

ALTER TABLE tmp.app_log ADD PARTITION(dt='${last_date}');
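
Since the HDFS sink already writes directories named dt=YYYY-MM-DD directly under the table's LOCATION, an alternative (a sketch, not what the original schedule uses) is to let Hive discover any missing partitions in one go; adding IF NOT EXISTS to the ALTER TABLE statement above likewise keeps the step safe to re-run:

MSCK REPAIR TABLE tmp.app_log;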

4. Load the temporary table into the ods layer

INSERT OVERWRITE TABLE ods.app_log PARTITION(dt='${last_date}')
SELECT log FROM tmp.app_log WHERE dt='${last_date}' AND log <> '';
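
The statement above assumes that ods.app_log already exists with the same single-column layout and a dt partition. A minimal sketch of such a table (the ORC storage format is an assumption, not taken from the original setup):

CREATE TABLE IF NOT EXISTS ods.app_log (
log STRING COMMENT 'raw JSON log line'
)
PARTITIONED BY (dt STRING)
STORED AS ORC;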
