Flume Case Study 01 (Kafka -> Hive)

1. Background

Our company is building a big data ETL platform, which involves a fair amount of testing. During the tests I recorded the steps that proved most reusable, so they can be copied directly when developing jobs later.

2. Environment

  • Job flow: consume data from Kafka and write it directly into a Hive table
  • Environment: Linux (7.9), Hadoop (3.1.3), Hive (3.1.2), Kafka (2.11-2.4.1)
  • Cluster nodes: 3 (node01, node02, node03)

3. Task preparation

One point from the Flume documentation about the Hive Sink matters most here: the target Hive table has to be a transactional table.

This sink streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions. As soon as a set of events are committed to Hive, they become immediately visible to Hive queries. Partitions to which flume will stream to can either be pre-created or, optionally, Flume can create them if they are missing. Fields from incoming event data are mapped to corresponding columns in the Hive table.

As we know, Hive transactional tables come with quite a few restrictions; the Hive documentation lists them:

Limitations:
    1. BEGIN, COMMIT, and ROLLBACK are not yet supported. All language operations are auto-commit. The plan is to support these in a future release.
    2. Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
    3. By default transactions are configured to be off. See the Hive Transactions#Configuration section below for a discussion of which values need to be set to configure it.
    4. Tables must be bucketed to make use of these features. Tables in the same system not using transactions and ACID do not need to be bucketed. External tables cannot be made ACID tables since the changes on external tables are beyond the control of the compactor (HIVE-13175).
    5. Reading/writing to an ACID table from a non-ACID session is not allowed. In other words, the Hive transaction manager must be set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager in order to work with ACID tables.
    6. At this time only snapshot level isolation is supported. When a given query starts it will be provided with a consistent snapshot of the data. There is no support for dirty read, read committed, repeatable read, or serializable. With the introduction of BEGIN the intention is to support snapshot isolation for the duration of transaction rather than just a single query. Other isolation levels may be added depending on user requests.
    7. The existing ZooKeeper and in-memory lock managers are not compatible with transactions. There is no intention to address this issue. See Hive Transactions#Basic Design below for a discussion of how locks are stored for transactions.
    8. Schema changes using ALTER TABLE is NOT supported for ACID tables. HIVE-11421 is tracking it. Fixed in 1.3.0/2.0.0.
    9. Using Oracle as the Metastore DB and "datanucleus.connectionPoolingType=BONECP" may generate intermittent "No such lock.." and "No such transaction..." errors. Setting "datanucleus.connectionPoolingType=DBCP" is recommended in this case.
    10. LOAD DATA... statement is not supported with transactional tables. (This was not properly enforced until HIVE-16732)

So there is no way around it: the table has to be a transactional table.
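
Per limitation 3 above, transactions are off by default. Besides the session-level settings shown in 3.1 below, the Hive Transactions documentation also calls for a few server-side properties (typically in hive-site.xml on the HiveServer2/metastore side), especially the compactor ones. A minimal sketch, to be checked against your own Hive 3.1.2 deployment:

hive.support.concurrency=true
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on=true
hive.compactor.worker.threads=1

Without a compactor, the delta files produced by streaming writes will simply keep accumulating.
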
3.1 Create the table

-- Enable transaction support for this session
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Create the table: it must be bucketed, and only the ORC format is supported
create table dwd_mhs_opt_register (
     global_id                 string
    ,patient_id                string
    ,med_org_id                string
    ,med_org_name              string
    ,register_dept_id          string
    ,register_dept_name        string
    ,register_date             date
    ,visit_id                  string
    ,register_id               string
    ,register_time             timestamp
    ,fee_register              decimal(18,2)
    ,patient_name              string
    ,id_card                   string
    ,id_card_type              string
    ,gender                    string
    ,birthday                  string
    ,patient_phone             string
    ,visit_card_no             string
    ,visit_card_type           string
    ,ds_date                   date
) clustered by(global_id) into 2 buckets stored as orc tblproperties('transactional'='true');
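
As a quick sanity check (optional), the table properties can be inspected to confirm that the table really came out as a transactional one:

-- 'transactional'='true' should appear in the output
show tblproperties dwd_mhs_opt_register;
-- or look at the full table description
describe formatted dwd_mhs_opt_register;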

3.2 Create the Kafka topic

# Create the topic k2h_01
kafka-topics.sh --bootstrap-server node01:9092 --create --replication-factor 3 --partitions 1 --topic k2h_01
# List the topics
kafka-topics.sh --bootstrap-server node01:9092 --list
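
Optionally, the partition and replica assignment of the new topic can be confirmed before wiring up Flume:

# Describe the topic to check partitions and replication
kafka-topics.sh --bootstrap-server node01:9092 --describe --topic k2h_01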

3.3 Create the Flume configuration file (kafka_to_hive_01.conf)

a.sources=source_from_kafka
a.channels=file_channel
a.sinks=hive_sink

#Kafka source configuration
a.sources.source_from_kafka.type=org.apache.flume.source.kafka.KafkaSource
a.sources.source_from_kafka.batchSize=10
a.sources.source_from_kafka.kafka.bootstrap.servers=node01:9092
a.sources.source_from_kafka.topic=k2h_01
a.sources.source_from_kafka.channels=file_channel
a.sources.source_from_kafka.consumer.timeout.ms=1000

#Hive sink configuration
a.sinks.hive_sink.type=hive
a.sinks.hive_sink.hive.metastore=thrift://node01:9083
a.sinks.hive_sink.hive.database=test
a.sinks.hive_sink.hive.table=dwd_mhs_opt_register
a.sinks.hive_sink.hive.txnsPerBatchAsk=2
a.sinks.hive_sink.batchSize=10
a.sinks.hive_sink.serializer=JSON
a.sinks.hive_sink.serializer.fieldnames=global_id,patient_id,med_org_id,med_org_name,register_dept_id,register_dept_name,register_date,visit_id,register_id,register_time,fee_register,patient_name,id_card,id_card_type,gender,birthday,patient_phone,visit_card_no,visit_card_type,ds_date

#Channel configuration
a.channels.file_channel.type=file
a.channels.file_channel.capacity=1000
a.channels.file_channel.transactionCapacity=100

#Bind the source and sink to the channel
a.sources.source_from_kafka.channels=file_channel
a.sinks.hive_sink.channel=file_channel
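
Note that the file channel above falls back to Flume's default checkpoint and data directories (under ~/.flume). If explicit locations are preferred, something like the following can be added; the paths here are just placeholders:

#optional: explicit file channel directories (example paths)
a.channels.file_channel.checkpointDir=/data/flume/checkpoint
a.channels.file_channel.dataDirs=/data/flume/data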

4. Testing the job

4.1 Start the Flume agent

# Assumes Hadoop, Kafka, and the other cluster services are already running
flume-ng agent -n a -c $FLUME_HOME/conf -f $FLUME_HOME/conf/kafka_to_hive_01.conf -Dflume.root.logger=INFO,console

(screenshot: Flume agent startup output)
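
The command above runs the agent in the foreground with console logging, which is handy while testing. For a longer-running test it can also be pushed to the background, for example:

# Run the agent in the background and keep the log in a file (log path is just an example)
nohup flume-ng agent -n a -c $FLUME_HOME/conf -f $FLUME_HOME/conf/kafka_to_hive_01.conf -Dflume.root.logger=INFO,console > /tmp/flume_k2h_01.log 2>&1 &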

4.2 Start the Kafka producer

# Start a Kafka console producer and send test data
kafka-console-producer.sh --broker-list node01:9092 --topic k2h_01

(screenshot: console producer with test messages)
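
As a reference for the message format, a record like the following can be pasted into the producer as a single JSON line; the keys match serializer.fieldnames from the Flume config, and all values are made up:

{"global_id":"G000001","patient_id":"P000001","med_org_id":"ORG01","med_org_name":"Test Hospital","register_dept_id":"D001","register_dept_name":"Internal Medicine","register_date":"2022-05-01","visit_id":"V000001","register_id":"R000001","register_time":"2022-05-01 08:30:00","fee_register":15.00,"patient_name":"Zhang San","id_card":"110101199001010011","id_card_type":"01","gender":"M","birthday":"1990-01-01","patient_phone":"13800000000","visit_card_no":"C000001","visit_card_type":"01","ds_date":"2022-05-01"}
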
4.3 Check the data in the Hive table

# Query through Hive; here I log in with beeline
# Start the hiveserver2 service (skip if it is already running)
hive --service hiveserver2 >/dev/null 2>&1 &
# Start the metastore service (skip if already running; the Flume Hive sink needs it anyway)
hive --service metastore >/dev/null 2>&1 &
# Log in with beeline
beeline -u jdbc:hive2://node01:10000 -n root
# Query the data
use test;
select i.* from dwd_mhs_opt_register i;

(screenshot: query results in beeline)
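
If the query fails with a complaint about the transaction manager (limitation 5 above), the same session-level settings from 3.1 may need to be applied inside the beeline session as well; a quick sketch:

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
select count(*) from dwd_mhs_opt_register;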

5. Summary

  • Flume can indeed read from Kafka and write into Hive, but the restrictions are significant (the target table must be transactional)
  • Timestamps imported from Kafka into Hive carry milliseconds by default (see the screenshots above); a query sketch for trimming them follows this list
  • Scheduling still needs to be added; the test platform uses DolphinScheduler, which I plan to cover in a follow-up post
  • Throughput and the like need monitoring; Ganglia looks like a good choice, and I will look into it when time allows
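
Regarding the milliseconds on imported timestamps, one way to trim them at query time is Hive's date_format; a small sketch against the table from 3.1:

-- Render register_time without the millisecond part
select register_id,
       date_format(register_time, 'yyyy-MM-dd HH:mm:ss') as register_time_s
from   dwd_mhs_opt_register;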