This post introduces the Docker environment that Hudi provides for a quick hands-on tour of Hudi.
0. Related Article Links
1. Building a CentOS 7 Environment
First you need a server running CentOS 7; a cloud server or a self-installed virtual machine both work. Here I used VMware on Windows 10 to install a CentOS 7.9 VM.
With the server ready, install the JDK and related tooling, then install Maven on the server; see my other post: Maven的下载安装和使用_yang_shibiao的博客-CSDN博客. After installation, configure the environment variables and symlinks (a sketch follows the mirror configuration below). The recommended Maven mirror settings are:
<localRepository>/opt/module/apache-maven/repository</localRepository>
<mirrors>
<!-- mirror
| Specifies a repository mirror site to use instead of a given repository. The repository that
| this mirror serves has an ID that matches the mirrorOf element of this mirror. IDs are used
| for inheritance and direct lookup purposes, and must be unique across the set of mirrors.
|
<mirror>
<id>mirrorId</id>
<mirrorOf>repositoryId</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://my.repository.com/repo/path</url>
</mirror>
-->
<mirror>
<id>aliyunCentralMaven</id>
<name>aliyun central maven</name>
<url>https://maven.aliyun.com/repository/central/</url>
<mirrorOf>central</mirrorOf>
</mirror>
<mirror>
<id>centralMaven</id>
<name>central maven</name>
<url>http://mvnrepository.com/</url>
<mirrorOf>central</mirrorOf>
</mirror>
</mirrors>
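The environment-variable and symlink setup mentioned above is roughly as follows. This is a minimal sketch: the /opt/module layout and the exact JDK/Maven version numbers are my assumptions, so adjust them to your own install.
# Symlinks to version-neutral paths (assumed layout under /opt/module)
ln -s /opt/module/jdk1.8.0_212 /opt/module/jdk
ln -s /opt/module/apache-maven-3.6.3 /opt/module/apache-maven
# Append to /etc/profile, then apply with: source /etc/profile
export JAVA_HOME=/opt/module/jdk
export MAVEN_HOME=/opt/module/apache-maven
export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
# Verify
java -version
mvn -version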
Since we will be integrating with the Docker environment, configure the corresponding mappings in /etc/hosts as follows:
192.168.12.122 hudi
127.0.0.1 adhoc-1
127.0.0.1 adhoc-2
127.0.0.1 namenode
127.0.0.1 datanode1
127.0.0.1 hiveserver
127.0.0.1 hivemetastore
127.0.0.1 kafkabroker
127.0.0.1 sparkmaster
127.0.0.1 zookeeper
2. Setting Up Docker
2.1. Building the Hudi Source
Download the Hudi source package, upload it to the server, extract it, and create a symlink.
Change into the hudi directory and run the build. Note that the Docker environment uses the default versions (Spark 2.x, Scala 2.11, etc.), so build with the defaults as well: mvn package -DskipTests (a sketch of these steps follows).
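A minimal sketch of the download-to-build steps. The source package name here is a placeholder for whatever release you downloaded, and the /opt/module layout is the same assumption as above:
# Extract the uploaded source package and symlink it as /opt/module/hudi
tar -zxvf hudi-x.y.z.src.tgz -C /opt/module/
ln -s /opt/module/hudi-x.y.z /opt/module/hudi
# Build with the default profiles (Spark 2.x, Scala 2.11)
cd /opt/module/hudi
mvn package -DskipTests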
2.2. Installing Docker and docker-compose
Install Docker:
# Install Docker via yum
yum -y install docker
# Start the Docker service
systemctl start docker.service
# Enable Docker at boot
systemctl enable docker.service
Install docker-compose:
# Download docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
# Make it executable
sudo chmod +x /usr/local/bin/docker-compose
# Check the docker-compose version
docker-compose --version
2.3. Running the docker compose Script to Launch the Cluster
Change into the docker directory under the Hudi source tree and run the setup script (shown below). It starts the required docker containers, pulling the images first if they are not present locally; at that point it also helps to switch the native CentOS yum repos to a domestic mirror, see CentOS7 原生 yum 源修改为阿里 yum 源_yang_shibiao的博客-CSDN博客.
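The script itself is setup_demo.sh in the docker directory of the source tree (the /opt/module/hudi path assumes the symlink created in section 2.1):
cd /opt/module/hudi/docker
./setup_demo.sh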
Once started, docker ps lists the running containers:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f7d16a39436 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4:latest "entrypoint.sh /bi..." 2 minutes ago Up 2 minutes 0-1024/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:4040->4040/tcp adhoc-1
4bc2df0b6c29 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4:latest "entrypoint.sh /bi..." 2 minutes ago Up 2 minutes 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp adhoc-2
afec3de0144a apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkworker_2.4.4:latest "entrypoint.sh /bi..." 2 minutes ago Up 2 minutes 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8080/tcp, 8082-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8081->8081/tcp spark-worker-1
9b4fec3f1257 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkmaster_2.4.4:latest "entrypoint.sh /bi..." 2 minutes ago Up 2 minutes 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 6066/tcp, 7000-7076/tcp, 0.0.0.0:7077->7077/tcp, 7078-8079/tcp, 8081-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8080->8080/tcp sparkmaster
2a56a1883d41 apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest "entrypoint.sh worker" 2 minutes ago Up 2 minutes 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp presto-worker-1
9c86afc17d52 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest "entrypoint.sh /bi..." 2 minutes ago Up 2 minutes 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-9999/tcp, 10001-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:10000->10000/tcp hiveserver
e64409b161c9 apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest "entrypoint.sh coo..." 2 minutes ago Up 2 minutes 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8089/tcp, 8091-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8090->8090/tcp presto-coordinator-1
075fe1cc0c2c apachehudi/hudi-hadoop_2.8.4-datanode:latest "/bin/bash /entryp..." 2 minutes ago Up 2 minutes (healthy) 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50009/tcp, 0.0.0.0:50010->50010/tcp, 50011-50074/tcp, 50076-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:50075->50075/tcp datanode1
ca52b9ef6763 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest "entrypoint.sh /op..." 2 minutes ago Up 2 minutes (healthy) 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-9082/tcp, 9084-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:9083->9083/tcp hivemetastore
69d3491d3cbb apachehudi/hudi-hadoop_2.8.4-history:latest "/bin/bash /entryp..." 2 minutes ago Up 2 minutes (healthy) 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8187/tcp, 8189-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:58188->8188/tcp historyserver
d595ab93d0e6 apachehudi/hudi-hadoop_2.8.4-namenode:latest "/bin/bash /entryp..." 2 minutes ago Up 2 minutes (healthy) 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8019/tcp, 8021-10100/tcp, 0.0.0.0:8020->8020/tcp, 50000-50069/tcp, 50071-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:50070->50070/tcp namenode
7a6f67ad55af graphiteapp/graphite-statsd "/entrypoint" 2 minutes ago Up 2 minutes 0.0.0.0:80->80/tcp, 2013-2014/tcp, 2023-2024/tcp, 8080/tcp, 0.0.0.0:2003-2004->2003-2004/tcp, 0.0.0.0:8126->8126/tcp, 8125/tcp, 8125/udp graphite
d13c4bc5f63a bde2020/hive-metastore-postgresql:2.3.0 "/docker-entrypoin..." 2 minutes ago Up 2 minutes 5432/tcp hive-metastore-postgresql
ebda232b3bef bitnami/zookeeper:3.4.12-r68 "/app-entrypoint.s..." 2 minutes ago Up 2 minutes 2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp zookeeper
d412faca8800 bitnami/kafka:2.0.0 "/app-entrypoint.s..." 2 minutes ago Up 2 minutes 0.0.0.0:9092->9092/tcp kafkabroker
This brings up the following services in the cluster:
- HDFS services (NameNode, DataNode)
- Spark master and worker
- Hive services (Metastore, HiveServer2, backed by a PostgresDB)
- Kafka broker and a Zookeeper node (Kafka serves as the upstream source for the demo)
- Containers for the Presto setup (Presto coordinator and worker)
- Containers for the Trino setup (Trino coordinator and worker)
- An adhoc container for running Hudi/Hive CLI commands
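As a quick sanity check that the cluster came up, you can probe a couple of the web UIs exposed by the port mappings above; this relies on the /etc/hosts entries from section 1 and is only a sketch:
# NameNode UI should answer on 50070
curl -s -o /dev/null -w "%{http_code}\n" http://namenode:50070
# Spark master UI should answer on 8080
curl -s -o /dev/null -w "%{http_code}\n" http://sparkmaster:8080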
3. Running the Demo
3.1. Sending the Sample Data to Kafka
Because this deployment runs inside Docker containers, there are no Kafka scripts on the host machine. Download a Kafka package from the Kafka website; since the Docker setup runs Kafka 2.0.0, it is best to download the same version. Upload and extract the package and create a symlink (a sketch follows).
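A minimal sketch of fetching a matching client package. The URL points at the official Apache archive for Kafka 2.0.0; the /opt/module layout is the same assumption as before:
# Download and extract the Kafka 2.0.0 distribution (Scala 2.11 build)
wget https://archive.apache.org/dist/kafka/2.0.0/kafka_2.11-2.0.0.tgz
tar -zxvf kafka_2.11-2.0.0.tgz -C /opt/module/
ln -s /opt/module/kafka_2.11-2.0.0 /opt/module/kafka
# Put the Kafka scripts on the PATH
export PATH=$PATH:/opt/module/kafka/bin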
You can then call the Kafka scripts from that package directly to send data into the Kafka service running in the Docker containers. The producer command:
kafka-console-producer.sh --broker-list hudi:9092 --topic stock_ticks < /opt/module/hudi/docker/demo/data/batch_1.json
Also start a consumer to monitor whether data is arriving.
Start command:
kafka-console-consumer.sh --bootstrap-server hudi:9092 --topic stock_ticks
The consumer should print the batch_1 JSON records as they arrive.
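If you prefer a one-shot check over tailing a console consumer, the GetOffsetShell tool that ships with the same Kafka package prints the latest offset per partition, i.e. how many records the producer wrote; a sketch:
# Latest offsets (--time -1) for the stock_ticks topic
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list hudi:9092 --topic stock_ticks --time -1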
3.2. Ingesting the Kafka Data into Hudi
# Enter the corresponding container
docker exec -it adhoc-2 /bin/bash
# Run the following spark-submit command to execute DeltaStreamer and ingest into the stock_ticks_cow table in HDFS
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path /user/hive/warehouse/stock_ticks_cow \
--target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
# Run the following spark-submit command to execute DeltaStreamer and ingest into the stock_ticks_mor table in HDFS
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path /user/hive/warehouse/stock_ticks_mor \
--target-table stock_ticks_mor \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--disable-compaction
The first spark-submit writes files like the following into /user/hive/warehouse/stock_ticks_cow on HDFS:
The second spark-submit writes files like the following into /user/hive/warehouse/stock_ticks_mor on HDFS:
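You can inspect those files from the host with the hadoop client baked into the demo containers; a sketch using the paths from the spark-submit commands above:
# COW table: partition directories with parquet files, plus .hoodie metadata
docker exec adhoc-1 hadoop fs -ls -R /user/hive/warehouse/stock_ticks_cow
# MOR table: same layout after this first (parquet-producing) ingestion
docker exec adhoc-1 hadoop fs -ls -R /user/hive/warehouse/stock_ticks_mor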
3.3. Syncing the Hudi Tables to Hive
Running the first script below creates a stock_ticks_cow table in Hive, supporting snapshot and incremental queries on the Copy On Write table.
Running the second script below creates two new tables, stock_ticks_mor_rt and stock_ticks_mor_ro: the former supports snapshot and incremental queries (providing near-real-time data), while the latter supports ReadOptimized queries.
# Enter the corresponding docker container
docker exec -it adhoc-2 /bin/bash
# This command takes the HiveServer URL and the COW Hudi table path in HDFS, and syncs the HDFS state to Hive
/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
--partitioned-by dt \
--base-path /user/hive/warehouse/stock_ticks_cow \
--database default \
--table stock_ticks_cow
# Now run hive-sync for the second data set in HDFS, using the Merge-On-Read (MOR) table type
/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
--partitioned-by dt \
--base-path /user/hive/warehouse/stock_ticks_mor \
--database default \
--table stock_ticks_mor
3.4. Running Hive Queries
Run a Hive query to find the latest timestamp ingested for the stock symbol 'GOOG'. You will notice that both the snapshot queries (for the COW and MOR _rt tables) and the read-optimized query (for the MOR _ro table) give the same value, '10:29 a.m.', since Hudi created a parquet file for the first batch of data.
# Enter the container
docker exec -it adhoc-2 /bin/bash
# Connect to Hive
beeline -u jdbc:hive2://hiveserver:10000 \
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
--hiveconf hive.stats.autogather=false
# Show the tables in Hive
0: jdbc:hive2://hiveserver:10000> show tables;
+---------------------+--+
| tab_name |
+---------------------+--+
| stock_ticks_cow |
| stock_ticks_mor_ro |
| stock_ticks_mor_rt |
+---------------------+--+
3 rows selected (1.199 seconds)
0: jdbc:hive2://hiveserver:10000>
# Look at the partitions that were added
0: jdbc:hive2://hiveserver:10000> show partitions stock_ticks_mor_rt;
+----------------+--+
| partition |
+----------------+--+
| dt=2018-08-31 |
+----------------+--+
1 row selected (0.24 seconds)
# Run a query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:29:00 |
+---------+----------------------+--+
Now, run a projection query:
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924221953 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924221953 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
# Merge-On-Read Queries:
==========================
Let's run similar queries against the M-O-R table. Let's look at both
ReadOptimized and Snapshot (realtime data) queries supported by the M-O-R table
# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:29:00 |
+---------+----------------------+--+
1 row selected (6.326 seconds)
# Run Snapshot Query. Notice that the latest timestamp is again 10:29
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:29:00 |
+---------+----------------------+--+
1 row selected (1.606 seconds)
# Run Read Optimized and Snapshot project queries
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924222155 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924222155 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
3.5. Querying with Spark SQL
Hudi supports Spark as a query engine, just as it does Hive. Here are the same Hive queries run via Spark SQL:
docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
--jars $HUDI_SPARK_BUNDLE \
--master local[2] \
--driver-class-path $HADOOP_CONF_DIR \
--conf spark.sql.hive.convertMetastoreParquet=false \
--deploy-mode client \
--driver-memory 1G \
--executor-memory 3G \
--num-executors 1 \
--packages org.apache.spark:spark-avro_2.11:2.4.4
...
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.sql("show tables").show(100, false)
+--------+------------------+-----------+
|database|tableName |isTemporary|
+--------+------------------+-----------+
|default |stock_ticks_cow |false |
|default |stock_ticks_mor_ro|false |
|default |stock_ticks_mor_rt|false |
+--------+------------------+-----------+
# Copy-On-Write Table
## Run max timestamp query against COW table
scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
[Stage 0:> (0 + 1) / 1]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes#StaticLoggerBinder for further details.
+------+-------------------+
|symbol|max(ts) |
+------+-------------------+
|GOOG |2018-08-31 10:29:00|
+------+-------------------+
## Projection Query
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts |volume|open |close |
+-------------------+------+-------------------+------+---------+--------+
|20180924221953 |GOOG |2018-08-31 09:59:00|6330 |1230.5 |1230.02 |
|20180924221953 |GOOG |2018-08-31 10:29:00|3391 |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+
# Merge-On-Read Queries:
==========================
Let's run similar queries against the M-O-R table. Let's look at both
ReadOptimized and Snapshot queries supported by the M-O-R table
# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts) |
+------+-------------------+
|GOOG |2018-08-31 10:29:00|
+------+-------------------+
# Run Snapshot Query. Notice that the latest timestamp is again 10:29
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts) |
+------+-------------------+
|GOOG |2018-08-31 10:29:00|
+------+-------------------+
# Run Read Optimized and Snapshot project queries
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts |volume|open |close |
+-------------------+------+-------------------+------+---------+--------+
|20180924222155 |GOOG |2018-08-31 09:59:00|6330 |1230.5 |1230.02 |
|20180924222155 |GOOG |2018-08-31 10:29:00|3391 |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts |volume|open |close |
+-------------------+------+-------------------+------+---------+--------+
|20180924222155 |GOOG |2018-08-31 09:59:00|6330 |1230.5 |1230.02 |
|20180924222155 |GOOG |2018-08-31 10:29:00|3391 |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+
3.6. Querying with Presto
Here are the Presto equivalents of the Hive and Spark queries above. Presto currently does not support snapshot or incremental queries on Hudi tables.
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
presto> show catalogs;
Catalog
-----------
hive
jmx
localfile
system
(4 rows)
Query 20190817_134851_00000_j8rcz, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:04 [0 rows, 0B] [0 rows/s, 0B/s]
presto> use hive.default;
USE
presto:default> show tables;
Table
--------------------
stock_ticks_cow
stock_ticks_mor_ro
stock_ticks_mor_rt
(3 rows)
Query 20190822_181000_00001_segyw, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0:05 [3 rows, 99B] [0 rows/s, 18B/s]
# COPY-ON-WRITE Queries:
=========================
presto:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
(1 row)
Query 20190822_181011_00002_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:12 [197 rows, 613B] [16 rows/s, 50B/s]
presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20190822180221 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20190822180221 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085
(2 rows)
Query 20190822_181141_00003_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [109 rows/s, 341B/s]
# Merge-On-Read Queries:
==========================
Let's run similar queries against the M-O-R table.
# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
(1 row)
Query 20190822_181158_00004_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:02 [197 rows, 613B] [110 rows/s, 343B/s]
presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20190822180250 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20190822180250 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085
(2 rows)
Query 20190822_181256_00006_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [92 rows/s, 286B/s]
presto:default> exit
3.7. Querying with Trino
Here are the same queries on Trino. Trino currently does not support snapshot or incremental queries on Hudi tables.
docker exec -it adhoc-2 trino --server trino-coordinator-1:8091
trino> show catalogs;
Catalog
---------
hive
system
(2 rows)
Query 20220112_055038_00000_sac73, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
3.74 [0 rows, 0B] [0 rows/s, 0B/s]
trino> use hive.default;
USE
trino:default> show tables;
Table
--------------------
stock_ticks_cow
stock_ticks_mor_ro
stock_ticks_mor_rt
(3 rows)
Query 20220112_055050_00003_sac73, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
1.84 [3 rows, 102B] [1 rows/s, 55B/s]
# COPY-ON-WRITE Queries:
=========================
trino:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
(1 row)
Query 20220112_055101_00005_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
4.08 [197 rows, 442KB] [48 rows/s, 108KB/s]
trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20220112054822108 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20220112054822108 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085
(2 rows)
Query 20220112_055113_00006_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.40 [197 rows, 450KB] [487 rows/s, 1.09MB/s]
# Merge-On-Read Queries:
==========================
Let's run similar queries against the MOR table.
# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
trino:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
(1 row)
Query 20220112_055125_00007_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0.50 [197 rows, 442KB] [395 rows/s, 888KB/s]
trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20220112054844841 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20220112054844841 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085
(2 rows)
Query 20220112_055136_00008_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.49 [197 rows, 450KB] [404 rows/s, 924KB/s]
trino:default> exit
3.8. Uploading the Second Batch to Kafka and Running DeltaStreamer to Ingest It
Upload the second batch of data and ingest it with DeltaStreamer. Since this batch does not introduce any new partitions, there is no need to run hive-sync.
The Kafka command:
kafka-console-producer.sh --broker-list hudi:9092 --topic stock_ticks < /opt/module/hudi/docker/demo/data/batch_2.json
The Spark jobs:
# Enter the container
docker exec -it adhoc-2 /bin/bash
# Run the following spark-submit command to execute DeltaStreamer and ingest into the stock_ticks_cow table in HDFS
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path /user/hive/warehouse/stock_ticks_cow \
--target-table stock_ticks_cow \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
# Run the following spark-submit command to execute DeltaStreamer and ingest into the stock_ticks_mor table in HDFS
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path /user/hive/warehouse/stock_ticks_mor \
--target-table stock_ticks_mor \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--disable-compaction
The stock_ticks_cow files generated:
The stock_ticks_mor files generated:
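To see the difference between the two table types after this batch, list the files again (same assumptions as the earlier listing): the MOR table gains Avro delta log files next to its batch-one parquet, while the COW table simply holds a newer parquet file version.
# MOR: the second ingestion appended to .log files instead of rewriting parquet
docker exec adhoc-1 hadoop fs -ls -R /user/hive/warehouse/stock_ticks_mor | grep "\.log\."
# COW: a newer parquet file version was written
docker exec adhoc-1 hadoop fs -ls -R /user/hive/warehouse/stock_ticks_cow | grep "\.parquet"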
3.9. Running Hive
With a Copy-On-Write table, a snapshot query sees the changes from the second batch as soon as it is committed, since each ingestion creates newer versions of the parquet files.
With a Merge-On-Read table, the second ingestion merely appends the batch to unmerged delta (log) files. This is when ReadOptimized and Snapshot queries return different results: the ReadOptimized query still returns '10:29 a.m.' since it reads only the parquet files, while the Snapshot query merges on the fly and returns the latest committed data, '10:59 a.m.'.
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
--hiveconf hive.stats.autogather=false
# Copy On Write Table:
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:59:00 |
+---------+----------------------+--+
1 row selected (1.932 seconds)
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924221953 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924224524 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
As you can see, the above queries now reflect the changes from ingesting the second batch.
# Merge On Read Table:
# Read Optimized Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:29:00 |
+---------+----------------------+--+
1 row selected (1.6 seconds)
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924222155 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
# Snapshot Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:59:00 |
+---------+----------------------+--+
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924224537 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
exit
3.10. Running Spark SQL
Run the same queries in Spark SQL:
docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
--jars $HUDI_SPARK_BUNDLE \
--driver-class-path $HADOOP_CONF_DIR \
--conf spark.sql.hive.convertMetastoreParquet=false \
--deploy-mode client \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
--num-executors 1 \
--packages org.apache.spark:spark-avro_2.11:2.4.4
# Copy On Write Table:
scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts) |
+------+-------------------+
|GOOG |2018-08-31 10:59:00|
+------+-------------------+
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924221953 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924224524 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
As you can see, the above queries now reflect the changes from ingesting the second batch.
# Merge On Read Table:
# Read Optimized Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol | _c1 |
+---------+----------------------+
| GOOG | 2018-08-31 10:29:00 |
+---------+----------------------+
1 row selected (1.6 seconds)
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924222155 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924222155 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085 |
+----------------------+---------+----------------------+---------+------------+-----------+
# Snapshot Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol | _c1 |
+---------+----------------------+
| GOOG | 2018-08-31 10:59:00 |
+---------+----------------------+
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924222155 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924224537 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+
exit
3.11. Running Presto
Run the same read-optimized queries on Presto.
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
presto> use hive.default;
USE
# Copy On Write Table:
presto:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:59:00
(1 row)
Query 20190822_181530_00007_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:02 [197 rows, 613B] [125 rows/s, 389B/s]
presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20190822180221 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20190822181433 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215
(2 rows)
Query 20190822_181545_00008_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [106 rows/s, 332B/s]
As you can see, the above queries now reflect the changes from ingesting the second batch.
# Merge On Read Table:
# Read Optimized Query
presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
(1 row)
Query 20190822_181602_00009_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:01 [197 rows, 613B] [139 rows/s, 435B/s]
presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20190822180250 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20190822180250 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085
(2 rows)
Query 20190822_181615_00010_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [197 rows, 613B] [154 rows/s, 480B/s]
presto:default> exit
3.12. Running Trino
Run the same read-optimized queries on Trino.
docker exec -it adhoc-2 trino --server trino-coordinator-1:8091
trino> use hive.default;
USE
# Copy On Write Table:
trino:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:59:00
(1 row)
Query 20220112_055443_00012_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0.63 [197 rows, 442KB] [310 rows/s, 697KB/s]
trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20220112054822108 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20220112055352654 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215
(2 rows)
Query 20220112_055450_00013_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.65 [197 rows, 450KB] [303 rows/s, 692KB/s]
As you can see, the above queries now reflect the changes from ingesting the second batch.
# Merge On Read Table:
# Read Optimized Query
trino:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:29:00
(1 row)
Query 20220112_055500_00014_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0.59 [197 rows, 442KB] [336 rows/s, 756KB/s]
trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20220112054844841 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20220112054844841 | GOOG | 2018-08-31 10:29:00 | 3391 | 1230.1899 | 1230.085
(2 rows)
Query 20220112_055506_00015_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.35 [197 rows, 450KB] [556 rows/s, 1.24MB/s]
trino:default> exit
3.13. Incremental Queries on the COPY-ON-WRITE Table
With two batches of data ingested, let's showcase support for incremental queries on the Hudi Copy-On-Write table.
Let's take the same projection query as an example.
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
--hiveconf hive.stats.autogather=false
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924064621 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924065039 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
As you can see from the query above, there are two commits in timeline order: 20180924064621 and 20180924065039. When you follow these steps yourself you will get different commit timestamps; substitute yours for the ones above.
To show the effect of incremental queries, assume a reader has already seen the changes from ingesting the first batch. Now, for the reader to see the effect of the second batch only, he or she must set the start timestamp to the commit time of the first batch (20180924064621) and run an incremental query.
Hudi's incremental mode provides efficient scanning for incremental queries by using Hudi-managed metadata to filter out files that do not contain any candidate rows.
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
--hiveconf hive.stats.autogather=false
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
No rows affected (0.009 seconds)
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.max.commits=3;
No rows affected (0.009 seconds)
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;
With the above settings, file IDs that did not receive any updates in commit 20180924065039 are filtered out without being scanned. Here is the incremental query:
0: jdbc:hive2://hiveserver:10000>
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924065039 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
1 row selected (0.83 seconds)
0: jdbc:hive2://hiveserver:10000>
3.14. Incremental Queries with Spark SQL
docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
--jars $HUDI_SPARK_BUNDLE \
--driver-class-path $HADOOP_CONF_DIR \
--conf spark.sql.hive.convertMetastoreParquet=false \
--deploy-mode client \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
--num-executors 1 \
--packages org.apache.spark:spark-avro_2.11:2.4.4
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.DataSourceReadOptions
# In the below query, 20180924064621 is the first commit's timestamp
scala> val hoodieIncViewDF = spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes#StaticLoggerBinder for further details.
hoodieIncViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 15 more fields]
scala> hoodieIncViewDF.registerTempTable("stock_ticks_cow_incr_tmp1")
warning: there was one deprecation warning; re-run with -deprecation for details
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow_incr_tmp1 where symbol = 'GOOG'").show(100, false);
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924065039 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+
3.15. Scheduling and Running Compaction on the Merge-On-Read Table
Let's schedule and run a compaction to create a new version of the columnar files, so that read-optimized readers see fresher data. Once again, the compaction can be scheduled and run manually through the Hudi CLI.
docker exec -it adhoc-1 /bin/bash
root@adhoc-1:/opt# /var/hoodie/ws/hudi-cli/hudi-cli.sh
...
Table command getting loaded
HoodieSplashScreen loaded
===================================================================
* ___ ___ *
* /\__\ ___ /\ \ ___ *
* / / / /\__\ / \ \ /\ \ *
* / /__/ / / / / /\ \ \ \ \ \ *
* / \ \ ___ / / / / / \ \__\ / \__\ *
* / /\ \ /\__\ / /__/ ___ / /__/ \ |__| / /\/__/ *
* \/ \ \/ / / \ \ \ /\__\ \ \ \ / / / /\/ / / *
* \ / / \ \ / / / \ \ / / / \ /__/ *
* / / / \ \/ / / \ \/ / / \ \__\ *
* / / / \ / / \ / / \/__/ *
* \/__/ \/__/ \/__/ Apache Hudi CLI *
* *
===================================================================
Welcome to Apache Hudi CLI. Please type help if you are looking for help.
hudi->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
hoodie:stock_ticks_mor->compactions show all
20/02/10 03:41:32 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200210015059__clean__COMPLETED], [20200210015059__deltacommit__COMPLETED], [20200210022758__clean__COMPLETED], [20200210022758__deltacommit__COMPLETED], [==>20200210023843__compaction__REQUESTED]]
___________________________________________________________________
| Compaction Instant Time| State | Total FileIds to be Compacted|
|==================================================================|
# Schedule a compaction. This will use Spark Launcher to schedule compaction
hoodie:stock_ticks_mor->compaction schedule
....
Compaction successfully completed for 20180924070031
# Now refresh and check again. You will see that there is a new compaction requested
hoodie:stock_ticks->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
hoodie:stock_ticks_mor->compactions show all
18/09/24 06:34:12 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924041125__clean__COMPLETED], [20180924041125__deltacommit__COMPLETED], [20180924042735__clean__COMPLETED], [20180924042735__deltacommit__COMPLETED], [==>20180924063245__compaction__REQUESTED]]
___________________________________________________________________
| Compaction Instant Time| State | Total FileIds to be Compacted|
|==================================================================|
| 20180924070031 | REQUESTED| 1 |
# Execute the compaction. The compaction instant value passed below must be the one displayed in the above "compactions show all" query
hoodie:stock_ticks_mor->compaction run --compactionInstant 20180924070031 --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1
....
Compaction successfully completed for 20180924070031
## Now check if compaction is completed
hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
hoodie:stock_ticks->compactions show all
18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED], [20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED], [20180924070031__commit__COMPLETED]]
___________________________________________________________________
| Compaction Instant Time| State | Total FileIds to be Compacted|
|==================================================================|
| 20180924070031 | COMPLETED| 1 |
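You can also confirm the compaction from outside the CLI by looking at the table's timeline in HDFS, where the compaction shows up as a completed commit instant (a sketch, same container assumptions as the earlier listings):
# The .hoodie directory holds the timeline; look for the 20180924070031 commit file
docker exec adhoc-1 hadoop fs -ls /user/hive/warehouse/stock_ticks_mor/.hoodie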
3.16. Incremental Queries with Hive
After compaction, both the ReadOptimized and Snapshot queries show the latest committed data. Let's also run an incremental query on the MOR table. From the query output below, it is clear that the first commit time on the MOR table was 20180924064636 and the second commit time was 20180924070031.
docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
--hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
--hiveconf hive.stats.autogather=false
# Read Optimized Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:59:00 |
+---------+----------------------+--+
1 row selected (1.6 seconds)
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924064636 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
# Snapshot Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol | _c1 |
+---------+----------------------+--+
| GOOG | 2018-08-31 10:59:00 |
+---------+----------------------+--+
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924064636 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
# Incremental Query:
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.mode=INCREMENTAL;
No rows affected (0.008 seconds)
# Max-Commits covers both second batch and compaction commit
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.max.commits=3;
No rows affected (0.007 seconds)
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.start.timestamp=20180924064636;
No rows affected (0.013 seconds)
# Query:
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064636';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+--+
exit
3.17. Querying the Compacted Table with Spark SQL
docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
--jars $HUDI_SPARK_BUNDLE \
--driver-class-path $HADOOP_CONF_DIR \
--conf spark.sql.hive.convertMetastoreParquet=false \
--deploy-mode client \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
--num-executors 1 \
--packages org.apache.spark:spark-avro_2.11:2.4.4
# Read Optimized Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol | max(ts) |
+---------+----------------------+
| GOOG | 2018-08-31 10:59:00 |
+---------+----------------------+
1 row selected (1.6 seconds)
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924064636 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+
# Snapshot Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol | max(ts) |
+---------+----------------------+
| GOOG | 2018-08-31 10:59:00 |
+---------+----------------------+
scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time | symbol | ts | volume | open | close |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924064636 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02 |
| 20180924070031 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215 |
+----------------------+---------+----------------------+---------+------------+-----------+
3.18. Reading the Compacted Data with Presto
docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
presto> use hive.default;
USE
# Read Optimized Query
presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
symbol | _col1
--------+---------------------
GOOG | 2018-08-31 10:59:00
(1 row)
Query 20190822_182319_00011_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:01 [197 rows, 613B] [133 rows/s, 414B/s]
presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor_ro where symbol = 'GOOG';
_hoodie_commit_time | symbol | ts | volume | open | close
---------------------+--------+---------------------+--------+-----------+----------
20190822180250 | GOOG | 2018-08-31 09:59:00 | 6330 | 1230.5 | 1230.02
20190822181944 | GOOG | 2018-08-31 10:59:00 | 9021 | 1227.1993 | 1227.215
(2 rows)
Query 20190822_182333_00012_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [98 rows/s, 307B/s]
presto:default>
Note: This Hudi series of posts was written while studying the official Hudi documentation, with some personal understanding added; please forgive any shortcomings. ☺☺☺
Note: Links to other related articles (data-lake posts including this Hudi series) can be found here -> 数据湖 文章汇总