Data Lakes with Hudi (7): A Quick Hands-On with Hudi Using Docker

Table of Contents

0. Related Articles

1. Setting Up a CentOS 7 Environment

2. Setting Up Docker

2.1. Building Hudi from Source

2.2. Installing Docker and docker-compose

2.3. Running the Docker Compose Script to Start the Cluster

3. Running the Demo

3.1. Using a Script to Pull Sample Data and Send It to Kafka

3.2. Writing Data from Kafka into Hudi

3.3. Integrating Hudi Data with Hive

3.4. Running Queries with Hive

3.5. Querying Hudi with Spark SQL

3.6. Querying Hudi with Presto

3.7. Querying Hudi with Trino

3.8. Uploading a Second Batch to Kafka and Running DeltaStreamer to Ingest It

3.9. Running Hive Queries

3.10. Running Spark SQL Queries

3.11. Running Presto Queries

3.12. Running Trino Queries

3.13. Copy-On-Write Table: Incremental Queries

3.14. Incremental Queries with Spark SQL

3.15. Merge-On-Read Table: Scheduling and Running Compaction

3.16. Incremental Queries with Hive

3.17. Running Spark SQL Queries After Compaction

3.18. Reading the Compacted Table with Presto


This post walks through the Docker-based demo environment provided by Hudi, for a quick hands-on experience with Hudi.

0. Related Articles

数据湖 文章汇总 (Data Lake Article Index)

1. Setting Up a CentOS 7 Environment

First, you need a server running CentOS 7. You can use a cloud server or install a virtual machine yourself; in this post, a CentOS 7.9 VM was created with VMware on Windows 10.

Once the server is ready, install the JDK and then Maven (see the author's other post: Maven的下载安装和使用_yang_shibiao的博客-CSDN博客). After installing Maven, configure the environment variables and a soft link (a sketch follows the mirror configuration below). The recommended local repository and mirror settings are:

<localRepository>/opt/module/apache-maven/repository</localRepository>
 
 
  <mirrors>
    <!-- mirror
     | Specifies a repository mirror site to use instead of a given repository. The repository that
     | this mirror serves has an ID that matches the mirrorOf element of this mirror. IDs are used
     | for inheritance and direct lookup purposes, and must be unique across the set of mirrors.
     |
    <mirror>
      <id>mirrorId</id>
      <mirrorOf>repositoryId</mirrorOf>
      <name>Human Readable Name for this Mirror.</name>
      <url>http://my.repository.com/repo/path</url>
    </mirror>
     -->
    <mirror>
      <id>aliyunCentralMaven</id>
      <name>aliyun central maven</name>
      <url>https://maven.aliyun.com/repository/central/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>centralMaven</id>
      <name>central maven</name>
      <url>http://mvnrepository.com/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
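The soft link and environment variables mentioned above might look like the following sketch; the paths (such as /opt/module/apache-maven-3.8.4) are assumptions, so adjust them to wherever you unpacked Maven:

# Soft link so the settings above can refer to a stable path
ln -s /opt/module/apache-maven-3.8.4 /opt/module/apache-maven

# Environment variables for Maven
cat >> /etc/profile.d/maven.sh << 'EOF'
export MAVEN_HOME=/opt/module/apache-maven
export PATH=$PATH:$MAVEN_HOME/bin
EOF
source /etc/profile.d/maven.sh

# Verify the installation
mvn -version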

Because the demo integrates with the Docker environment, the corresponding host-name mappings need to be added to /etc/hosts, as shown below:

192.168.12.122 hudi

127.0.0.1 adhoc-1
127.0.0.1 adhoc-2
127.0.0.1 namenode
127.0.0.1 datanode1
127.0.0.1 hiveserver
127.0.0.1 hivemetastore
127.0.0.1 kafkabroker
127.0.0.1 sparkmaster
127.0.0.1 zookeeper

2. Setting Up Docker

2.1. Building Hudi from Source

Download the Hudi source package, upload it to the server, extract it, and set up a soft link to the extracted directory.

Enter the hudi directory and run the build. Note that the Docker demo uses the default component versions (Spark 2.x, Scala 2.11, and so on), so build with the default profiles. The build command is mvn package -DskipTests; a sketch of the full sequence follows.
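A minimal sketch of the build steps, assuming the source tree was extracted and soft-linked to /opt/module/hudi (the path is an assumption; adjust it to your layout):

# Enter the Hudi source directory
cd /opt/module/hudi
# Build with the default profiles (Spark 2.x / Scala 2.11), skipping tests
mvn clean package -DskipTests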

2.2. Installing Docker and docker-compose

Install Docker:

# Install Docker with yum
yum -y install docker
# Start the Docker service
systemctl start docker.service
# Enable Docker on boot
systemctl enable docker.service

Install docker-compose:

# Download docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
# Add execute permission to the binary
sudo chmod +x /usr/local/bin/docker-compose
# Check the docker-compose version
docker-compose --version

2.3. Running the Docker Compose Script to Start the Cluster

Go into the docker directory of the Hudi source tree and run the startup script; it starts the corresponding Docker containers, pulling the images first if they are not available locally. At this point it also helps to switch the CentOS yum repositories to a domestic mirror (see CentOS7 原生 yum 源修改为阿里 yum 源_yang_shibiao的博客-CSDN博客). A sketch of the startup command follows.
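A minimal sketch of starting the demo cluster, assuming the source tree is soft-linked to /opt/module/hudi and the demo script shipped in the source's docker directory is used:

# Enter the docker directory of the Hudi source tree
cd /opt/module/hudi/docker
# Start (and, if necessary, pull) all demo containers
./setup_demo.sh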

Once the cluster is up, listing the running containers with the docker command shows the following:

CONTAINER ID        IMAGE                                                              COMMAND                  CREATED             STATUS                   PORTS                                                                                                                                                                                           NAMES
0f7d16a39436        apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4:latest    "entrypoint.sh /bi..."   2 minutes ago       Up 2 minutes             0-1024/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:4040->4040/tcp                                                                             adhoc-1
4bc2df0b6c29        apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4:latest    "entrypoint.sh /bi..."   2 minutes ago       Up 2 minutes             0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp                                                                                           adhoc-2
afec3de0144a        apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkworker_2.4.4:latest   "entrypoint.sh /bi..."   2 minutes ago       Up 2 minutes             0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8080/tcp, 8082-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8081->8081/tcp                                                    spark-worker-1
9b4fec3f1257        apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkmaster_2.4.4:latest   "entrypoint.sh /bi..."   2 minutes ago       Up 2 minutes             0-1024/tcp, 4040/tcp, 5000-5100/tcp, 6066/tcp, 7000-7076/tcp, 0.0.0.0:7077->7077/tcp, 7078-8079/tcp, 8081-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8080->8080/tcp   sparkmaster
2a56a1883d41        apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest               "entrypoint.sh worker"   2 minutes ago       Up 2 minutes             0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp                                                                                           presto-worker-1
9c86afc17d52        apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest                     "entrypoint.sh /bi..."   2 minutes ago       Up 2 minutes             0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-9999/tcp, 10001-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:10000->10000/tcp                                                 hiveserver
e64409b161c9        apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest               "entrypoint.sh coo..."   2 minutes ago       Up 2 minutes             0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8089/tcp, 8091-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8090->8090/tcp                                                    presto-coordinator-1
075fe1cc0c2c        apachehudi/hudi-hadoop_2.8.4-datanode:latest                       "/bin/bash /entryp..."   2 minutes ago       Up 2 minutes (healthy)   0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50009/tcp, 0.0.0.0:50010->50010/tcp, 50011-50074/tcp, 50076-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:50075->50075/tcp     datanode1
ca52b9ef6763        apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest                     "entrypoint.sh /op..."   2 minutes ago       Up 2 minutes (healthy)   0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-9082/tcp, 9084-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:9083->9083/tcp                                                    hivemetastore
69d3491d3cbb        apachehudi/hudi-hadoop_2.8.4-history:latest                        "/bin/bash /entryp..."   2 minutes ago       Up 2 minutes (healthy)   0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8187/tcp, 8189-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:58188->8188/tcp                                                   historyserver
d595ab93d0e6        apachehudi/hudi-hadoop_2.8.4-namenode:latest                       "/bin/bash /entryp..."   2 minutes ago       Up 2 minutes (healthy)   0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8019/tcp, 8021-10100/tcp, 0.0.0.0:8020->8020/tcp, 50000-50069/tcp, 50071-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:50070->50070/tcp         namenode
7a6f67ad55af        graphiteapp/graphite-statsd                                        "/entrypoint"            2 minutes ago       Up 2 minutes             0.0.0.0:80->80/tcp, 2013-2014/tcp, 2023-2024/tcp, 8080/tcp, 0.0.0.0:2003-2004->2003-2004/tcp, 0.0.0.0:8126->8126/tcp, 8125/tcp, 8125/udp                                                        graphite
d13c4bc5f63a        bde2020/hive-metastore-postgresql:2.3.0                            "/docker-entrypoin..."   2 minutes ago       Up 2 minutes             5432/tcp                                                                                                                                                                                        hive-metastore-postgresql
ebda232b3bef        bitnami/zookeeper:3.4.12-r68                                       "/app-entrypoint.s..."   2 minutes ago       Up 2 minutes             2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp                                                                                                                                                      zookeeper
d412faca8800        bitnami/kafka:2.0.0                                                "/app-entrypoint.s..."   2 minutes ago       Up 2 minutes             0.0.0.0:9092->9092/tcp                                                                                                                                                                          kafkabroker

This brings up the following services in the cluster:

  • HDFS services (NameNode, DataNode)
  • Spark Master and Worker
  • Hive services (Metastore, HiveServer2, backed by PostgresDB)
  • Kafka broker and a ZooKeeper node (Kafka is used as the upstream source for the demo)
  • Containers for the Presto setup (Presto coordinator and worker)
  • Containers for the Trino setup (Trino coordinator and worker)
  • An ad-hoc container for running Hudi/Hive CLI commands

3. Running the Demo

3.1. Using a Script to Pull Sample Data and Send It to Kafka

Because everything is deployed in Docker containers, the Kafka scripts are not available on the host. Download a Kafka distribution from the Kafka website; the Docker setup uses version 2.0.0, so it is best to download the same version. Upload and extract the package and set up a soft link; a sketch of these steps follows.
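A minimal sketch of preparing the host-side Kafka client scripts; the download URL and install paths are assumptions, so adjust them as needed:

# Download the same Kafka version as the one used in the Docker setup (2.0.0)
wget https://archive.apache.org/dist/kafka/2.0.0/kafka_2.11-2.0.0.tgz
tar -zxvf kafka_2.11-2.0.0.tgz -C /opt/module/
ln -s /opt/module/kafka_2.11-2.0.0 /opt/module/kafka

# Put the Kafka scripts on the PATH
echo 'export PATH=$PATH:/opt/module/kafka/bin' >> /etc/profile.d/kafka.sh
source /etc/profile.d/kafka.sh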

With that in place, the Kafka scripts from the package above can be used to send data to the Kafka service running inside Docker. The producer command is:

kafka-console-producer.sh --broker-list hudi:9092 --topic stock_ticks  < /opt/module/hudi/docker/demo/data/batch_1.json

Also start a consumer to check whether data is arriving:

Consumer command:

kafka-console-consumer.sh --bootstrap-server hudi:9092  --topic stock_ticks

The consumer should print the records it receives.

3.2. Writing Data from Kafka into Hudi

# Enter the corresponding container
docker exec -it adhoc-2 /bin/bash

# Run the following spark-submit command to execute DeltaStreamer and ingest the first batch into the stock_ticks_cow table on HDFS
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts  \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

# Run the following spark-submit command to execute DeltaStreamer and ingest the first batch into the stock_ticks_mor table on HDFS
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_mor \
  --target-table stock_ticks_mor \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --disable-compaction

After the first spark-submit completes, a new set of files is written under /user/hive/warehouse/stock_ticks_cow on HDFS; after the second, under /user/hive/warehouse/stock_ticks_mor.
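As a quick sanity check (a minimal sketch, run from inside one of the adhoc containers), the table directories and the .hoodie timeline can be listed with plain HDFS commands:

# List the data files of both tables
hdfs dfs -ls -R /user/hive/warehouse/stock_ticks_cow
hdfs dfs -ls -R /user/hive/warehouse/stock_ticks_mor
# The .hoodie directory holds the table's timeline (commits, delta commits, etc.)
hdfs dfs -ls /user/hive/warehouse/stock_ticks_cow/.hoodie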

3.3. Integrating Hudi Data with Hive

Running the first script below creates a stock_ticks_cow table in Hive that supports snapshot and incremental queries on the Copy-On-Write table.

Running the second script creates two new tables, stock_ticks_mor_rt and stock_ticks_mor_ro: the former supports snapshot and incremental queries (providing near-real-time data), while the latter supports read-optimized queries.

# Enter the corresponding Docker container
docker exec -it adhoc-2 /bin/bash

# This command takes the HiveServer URL and the COW Hudi table location on HDFS and syncs the HDFS state to Hive
/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive \
  --pass hive \
  --partitioned-by dt \
  --base-path /user/hive/warehouse/stock_ticks_cow \
  --database default \
  --table stock_ticks_cow


# Run hive-sync for the second dataset on HDFS (Merge-On-Read table type)
/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive \
  --pass hive \
  --partitioned-by dt \
  --base-path /user/hive/warehouse/stock_ticks_mor \
  --database default \
  --table stock_ticks_mor

3.4. Running Queries with Hive

Run Hive queries to find the latest timestamp ingested for the stock symbol 'GOOG'. Notice that both the snapshot queries (on the COW table and the MOR _rt table) and the read-optimized query (on the MOR _ro table) return the same value, '10:29 a.m.', because Hudi created parquet files for the first batch of data.

# Enter the corresponding container
docker exec -it adhoc-2 /bin/bash

# Connect to Hive via beeline
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

# List the tables in Hive
0: jdbc:hive2://hiveserver:10000> show tables;
+---------------------+--+
|      tab_name       |
+---------------------+--+
| stock_ticks_cow     |
| stock_ticks_mor_ro  |
| stock_ticks_mor_rt  |
+---------------------+--+
3 rows selected (1.199 seconds)
0: jdbc:hive2://hiveserver:10000>


# Show the partitions that have been added
0: jdbc:hive2://hiveserver:10000> show partitions stock_ticks_mor_rt;
+----------------+--+
|   partition    |
+----------------+--+
| dt=2018-08-31  |
+----------------+--+
1 row selected (0.24 seconds)

# Run the query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:29:00  |
+---------+----------------------+--+

Now, run a projection query:

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924221953       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+--+


# Merge-On-Read Queries:
==========================

Let's run similar queries against the M-O-R table, looking at both the
ReadOptimized and Snapshot (realtime data) queries it supports.

# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:29:00  |
+---------+----------------------+--+
1 row selected (6.326 seconds)


# Run Snapshot Query. Notice that the latest timestamp is again 10:29

0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:29:00  |
+---------+----------------------+--+
1 row selected (1.606 seconds)


# Run Read Optimized and Snapshot project queries

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

3.5. Querying Hudi with Spark SQL

Hudi supports Spark as a query engine, just as it does Hive. Here are the same Hive queries run in spark-shell:

docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
  --jars $HUDI_SPARK_BUNDLE \
  --master local[2] \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4
...

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("show tables").show(100, false)
+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |stock_ticks_cow   |false      |
|default |stock_ticks_mor_ro|false      |
|default |stock_ticks_mor_rt|false      |
+--------+------------------+-----------+

# Copy-On-Write Table

## Run max timestamp query against COW table

scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
[Stage 0:>                                                          (0 + 1) / 1]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes#StaticLoggerBinder for further details.
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+

## Projection Query

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20180924221953     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20180924221953     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+

# Merge-On-Read Queries:
==========================

Let's run similar queries against the M-O-R table, looking at both the
ReadOptimized and Snapshot queries it supports.

# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


# Run Snapshot Query. Notice that the latest timestamp is again 10:29

scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+

# Run Read Optimized and Snapshot project queries

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+

3.6. Querying Hudi with Presto

Here are Presto queries corresponding to the Hive and Spark queries above. At present, Presto does not support snapshot or incremental queries on Hudi tables.

docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
presto> show catalogs;
  Catalog
-----------
 hive
 jmx
 localfile
 system
(4 rows)

Query 20190817_134851_00000_j8rcz, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:04 [0 rows, 0B] [0 rows/s, 0B/s]

presto> use hive.default;
USE
presto:default> show tables;
       Table
--------------------
 stock_ticks_cow
 stock_ticks_mor_ro
 stock_ticks_mor_rt
(3 rows)

Query 20190822_181000_00001_segyw, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0:05 [3 rows, 99B] [0 rows/s, 18B/s]


# COPY-ON-WRITE Queries:
=========================


presto:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:29:00
(1 row)

Query 20190822_181011_00002_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:12 [197 rows, 613B] [16 rows/s, 50B/s]

presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20190822180221      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20190822180221      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
(2 rows)

Query 20190822_181141_00003_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [109 rows/s, 341B/s]


# Merge-On-Read Queries:
==========================

Let's run similar queries against the M-O-R table.

# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
    presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:29:00
(1 row)

Query 20190822_181158_00004_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:02 [197 rows, 613B] [110 rows/s, 343B/s]


presto:default>  select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20190822180250      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
(2 rows)

Query 20190822_181256_00006_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [92 rows/s, 286B/s]

presto:default> exit

3.7. Querying Hudi with Trino

Here are the corresponding queries for Trino. At present, Trino does not support snapshot or incremental queries on Hudi tables.

docker exec -it adhoc-2 trino --server trino-coordinator-1:8091
trino> show catalogs;
 Catalog 
---------
 hive    
 system  
(2 rows)

Query 20220112_055038_00000_sac73, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
3.74 [0 rows, 0B] [0 rows/s, 0B/s]

trino> use hive.default;
USE
trino:default> show tables;
       Table        
--------------------
 stock_ticks_cow    
 stock_ticks_mor_ro 
 stock_ticks_mor_rt 
(3 rows)

Query 20220112_055050_00003_sac73, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
1.84 [3 rows, 102B] [1 rows/s, 55B/s]

# COPY-ON-WRITE Queries:
=========================
    
trino:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1        
--------+---------------------
 GOOG   | 2018-08-31 10:29:00 
(1 row)

Query 20220112_055101_00005_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
4.08 [197 rows, 442KB] [48 rows/s, 108KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close   
---------------------+--------+---------------------+--------+-----------+----------
 20220112054822108   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02 
 20220112054822108   | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085 
(2 rows)

Query 20220112_055113_00006_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.40 [197 rows, 450KB] [487 rows/s, 1.09MB/s]

# Merge-On-Read Queries:
==========================

Let's run similar queries against the MOR table.

# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
    
trino:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1        
--------+---------------------
 GOOG   | 2018-08-31 10:29:00 
(1 row)

Query 20220112_055125_00007_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0.50 [197 rows, 442KB] [395 rows/s, 888KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close   
---------------------+--------+---------------------+--------+-----------+----------
 20220112054844841   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02 
 20220112054844841   | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085 
(2 rows)

Query 20220112_055136_00008_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.49 [197 rows, 450KB] [404 rows/s, 924KB/s]

trino:default> exit

3.8. Uploading a Second Batch to Kafka and Running DeltaStreamer to Ingest It

Upload the second batch of data and ingest it with DeltaStreamer. Since this batch does not introduce any new partitions, there is no need to run hive-sync.

Kafka producer command:

kafka-console-producer.sh --broker-list hudi:9092 --topic stock_ticks  < /opt/module/hudi/docker/demo/data/batch_2.json

Spark jobs to run:

# Enter the container
docker exec -it adhoc-2 /bin/bash

# Run the following spark-submit command to execute DeltaStreamer and ingest the second batch into the stock_ticks_cow table on HDFS
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

# Run the following spark-submit command to execute DeltaStreamer and ingest the second batch into the stock_ticks_mor table on HDFS
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_mor \
  --target-table stock_ticks_mor \
  --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --disable-compaction

New files are generated under the stock_ticks_cow table directory, and likewise under stock_ticks_mor; they can be inspected with the same hdfs dfs -ls commands as in section 3.2.

3.9. Running Hive Queries

With the Copy-On-Write table, snapshot queries see the changes as soon as the batch commits, because each ingestion creates newer versions of the parquet files.

With the Merge-On-Read table, the second ingestion merely appends the batch to an unmerged delta (log) file. This is where ReadOptimized and Snapshot queries start to differ: the ReadOptimized query still returns '10:29 am' because it only reads the parquet files, while the Snapshot query merges on the fly and returns the latest committed data, '10:59 am'.

docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

# Copy On Write Table:

0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:59:00  |
+---------+----------------------+--+
1 row selected (1.932 seconds)

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

As you can see, the above queries now reflect the changes that came as part of ingesting the second batch.


# Merge On Read Table:

# Read Optimized Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:29:00  |
+---------+----------------------+--+
1 row selected (1.6 seconds)

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

# Snapshot Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:59:00  |
+---------+----------------------+--+

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

exit

3.10. Running Spark SQL Queries

Run the same queries in spark-shell:

docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
  --jars $HUDI_SPARK_BUNDLE \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --master local[2] \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4

# Copy On Write Table:

scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:59:00|
+------+-------------------+

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)

+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

As you can see, the above queries now reflect the changes that came as part of ingesting the second batch.


# Merge On Read Table:

# Read Optimized Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol  |         _c1          |
+---------+----------------------+
| GOOG    | 2018-08-31 10:29:00  |
+---------+----------------------+
1 row selected (1.6 seconds)

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
+----------------------+---------+----------------------+---------+------------+-----------+

# Snapshot Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol  |         _c1          |
+---------+----------------------+
| GOOG    | 2018-08-31 10:59:00  |
+---------+----------------------+

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+

exit

3.11. Running Presto Queries

Run the same read-optimized queries on Presto.

docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
presto> use hive.default;
USE

# Copy On Write Table:

presto:default>select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:59:00
(1 row)

Query 20190822_181530_00007_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:02 [197 rows, 613B] [125 rows/s, 389B/s]

presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20190822180221      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20190822181433      | GOOG   | 2018-08-31 10:59:00 |   9021 | 1227.1993 | 1227.215
(2 rows)

Query 20190822_181545_00008_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [106 rows/s, 332B/s]

As you can see, the above queries now reflect the changes that came as part of ingesting the second batch.


# Merge On Read Table:

# Read Optimized Query
presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:29:00
(1 row)

Query 20190822_181602_00009_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:01 [197 rows, 613B] [139 rows/s, 435B/s]

presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20190822180250      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
(2 rows)

Query 20190822_181615_00010_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [197 rows, 613B] [154 rows/s, 480B/s]

presto:default> exit

3.12. Running Trino Queries

Run the same read-optimized queries on Trino.

docker exec -it adhoc-2 trino --server trino-coordinator-1:8091
trino> use hive.default;
USE
    
# Copy On Write Table:

trino:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1        
--------+---------------------
 GOOG   | 2018-08-31 10:59:00 
(1 row)

Query 20220112_055443_00012_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0.63 [197 rows, 442KB] [310 rows/s, 697KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close   
---------------------+--------+---------------------+--------+-----------+----------
 20220112054822108   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02 
 20220112055352654   | GOOG   | 2018-08-31 10:59:00 |   9021 | 1227.1993 | 1227.215 
(2 rows)

Query 20220112_055450_00013_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.65 [197 rows, 450KB] [303 rows/s, 692KB/s]

As you can see, the above queries now reflect the changes that came as part of ingesting the second batch.

# Merge On Read Table:
# Read Optimized Query
    
trino:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1        
--------+---------------------
 GOOG   | 2018-08-31 10:29:00 
(1 row)

Query 20220112_055500_00014_sac73, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0.59 [197 rows, 442KB] [336 rows/s, 756KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close   
---------------------+--------+---------------------+--------+-----------+----------
 20220112054844841   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02 
 20220112054844841   | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085 
(2 rows)

Query 20220112_055506_00015_sac73, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0.35 [197 rows, 450KB] [556 rows/s, 1.24MB/s]

trino:default> exit

3.13. Copy-On-Write Table: Incremental Queries

With two batches of data ingested, let's demonstrate Hudi's support for incremental queries on the Copy-On-Write table.

Start from the same projection query as before:

docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924064621       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

As the query above shows, there are two commits, 20180924064621 and 20180924065039, in timeline order. When you follow these steps yourself you will get different commit timestamps; substitute them for the ones used here.

To show the effect of an incremental query, assume a reader has already seen the changes that came with the first batch. For the reader to see only the effect of the second batch, they set the start timestamp to the commit time of the first batch (20180924064621) and run an incremental query.

Hudi's incremental mode provides an efficient scan for incremental queries by using Hudi-managed metadata to filter out files that contain no candidate rows.

docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
No rows affected (0.009 seconds)
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.max.commits=3;
No rows affected (0.009 seconds)
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;

With the settings above, file IDs that have no updates from commit 20180924065039 are filtered out without being scanned. Here is the incremental query:

0: jdbc:hive2://hiveserver:10000>
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+
1 row selected (0.83 seconds)
0: jdbc:hive2://hiveserver:10000>

3.14. Incremental Queries with Spark SQL

docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
  --jars $HUDI_SPARK_BUNDLE \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --master local[2] \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.DataSourceReadOptions

# In the query below, 20180924064621 is the first commit's timestamp
scala> val hoodieIncViewDF =  spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes#StaticLoggerBinder for further details.
hoodieIncViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 15 more fields]

scala> hoodieIncViewDF.registerTempTable("stock_ticks_cow_incr_tmp1")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow_incr_tmp1 where  symbol = 'GOOG'").show(100, false);
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+

3.15. Merge-On-Read Table: Scheduling and Running Compaction

Let's schedule and run compaction to create a new version of the columnar files, so that read-optimized readers see fresher data. Compaction can be scheduled and run manually with the Hudi CLI:

docker exec -it adhoc-1 /bin/bash
root@adhoc-1:/opt# /var/hoodie/ws/hudi-cli/hudi-cli.sh
...
Table command getting loaded
HoodieSplashScreen loaded
===================================================================
*         ___                          ___                        *
*        /\__\          ___           /\  \           ___         *
*       / /  /         /\__\         /  \  \         /\  \        *
*      / /__/         / /  /        / /\ \  \        \ \  \       *
*     /  \  \ ___    / /  /        / /  \ \__\       /  \__\      *
*    / /\ \  /\__\  / /__/  ___   / /__/ \ |__|     / /\/__/      *
*    \/  \ \/ /  /  \ \  \ /\__\  \ \  \ / /  /  /\/ /  /         *
*         \  /  /    \ \  / /  /   \ \  / /  /   \  /__/          *
*         / /  /      \ \/ /  /     \ \/ /  /     \ \__\          *
*        / /  /        \  /  /       \  /  /       \/__/          *
*        \/__/          \/__/         \/__/    Apache Hudi CLI    *
*                                                                 *
===================================================================

Welcome to Apache Hudi CLI. Please type help if you are looking for help.
hudi->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded
hoodie:stock_ticks_mor->compactions show all
20/02/10 03:41:32 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200210015059__clean__COMPLETED], [20200210015059__deltacommit__COMPLETED], [20200210022758__clean__COMPLETED], [20200210022758__deltacommit__COMPLETED], [==>20200210023843__compaction__REQUESTED]]
___________________________________________________________________
| Compaction Instant Time| State    | Total FileIds to be Compacted|
|==================================================================|

# Schedule a compaction. This will use Spark Launcher to schedule compaction
hoodie:stock_ticks_mor->compaction schedule
....
Compaction successfully completed for 20180924070031

# Now refresh and check again. You will see that there is a new compaction requested

hoodie:stock_ticks->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 07:01:16 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded

hoodie:stock_ticks_mor->compactions show all
18/09/24 06:34:12 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924041125__clean__COMPLETED], [20180924041125__deltacommit__COMPLETED], [20180924042735__clean__COMPLETED], [20180924042735__deltacommit__COMPLETED], [==>20180924063245__compaction__REQUESTED]]
___________________________________________________________________
| Compaction Instant Time| State    | Total FileIds to be Compacted|
|==================================================================|
| 20180924070031         | REQUESTED| 1                            |

# Execute the compaction. The compaction instant value passed below must be the one displayed in the above "compactions show all" query
hoodie:stock_ticks_mor->compaction run --compactionInstant  20180924070031 --parallelism 2 --sparkMemory 1G  --schemaFilePath /var/demo/config/schema.avsc --retry 1  
....
Compaction successfully completed for 20180924070031

## Now check if compaction is completed

hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
18/09/24 07:03:00 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
Metadata for table stock_ticks_mor loaded

hoodie:stock_ticks->compactions show all
18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED], [20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED], [20180924070031__commit__COMPLETED]]
___________________________________________________________________
| Compaction Instant Time| State    | Total FileIds to be Compacted|
|==================================================================|
| 20180924070031         | COMPLETED| 1                            |

3.16. Incremental Queries with Hive

Both the ReadOptimized and Snapshot queries now show the latest committed data. Let's also run an incremental query against the MOR table. From the query output below it is clear that the first commit on the MOR table happened at 20180924064636 and the second at 20180924070031.

docker exec -it adhoc-2 /bin/bash
beeline -u jdbc:hive2://hiveserver:10000 \
  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
  --hiveconf hive.stats.autogather=false

# Read Optimized Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:59:00  |
+---------+----------------------+--+
1 row selected (1.6 seconds)

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

# Snapshot Query
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------------------+--+
| symbol  |         _c1          |
+---------+----------------------+--+
| GOOG    | 2018-08-31 10:59:00  |
+---------+----------------------+--+

0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

# Incremental Query:

0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.mode=INCREMENTAL;
No rows affected (0.008 seconds)
# Max-Commits covers both second batch and compaction commit
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.max.commits=3;
No rows affected (0.007 seconds)
0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.start.timestamp=20180924064636;
No rows affected (0.013 seconds)
# Query:
0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064636';
+----------------------+---------+----------------------+---------+------------+-----------+--+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+--+
| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+--+

exit

3.17. Running Spark SQL Queries After Compaction

docker exec -it adhoc-1 /bin/bash
$SPARK_INSTALL/bin/spark-shell \
  --jars $HUDI_SPARK_BUNDLE \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --deploy-mode client \
  --driver-memory 1G \
  --master local[2] \
  --executor-memory 3G \
  --num-executors 1 \
  --packages org.apache.spark:spark-avro_2.11:2.4.4

# Read Optimized Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol  |        max(ts)       |
+---------+----------------------+
| GOOG    | 2018-08-31 10:59:00  |
+---------+----------------------+
1 row selected (1.6 seconds)

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+

# Snapshot Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+---------+----------------------+
| symbol  |     max(ts)          |
+---------+----------------------+
| GOOG    | 2018-08-31 10:59:00  |
+---------+----------------------+

scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
+----------------------+---------+----------------------+---------+------------+-----------+
| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
+----------------------+---------+----------------------+---------+------------+-----------+
| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
+----------------------+---------+----------------------+---------+------------+-----------+

3.18. Reading the Compacted Table with Presto

docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
presto> use hive.default;
USE

# Read Optimized Query
presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:59:00
(1 row)

Query 20190822_182319_00011_segyw, FINISHED, 1 node
Splits: 49 total, 49 done (100.00%)
0:01 [197 rows, 613B] [133 rows/s, 414B/s]

presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20190822181944      | GOOG   | 2018-08-31 10:59:00 |   9021 | 1227.1993 | 1227.215
(2 rows)

Query 20190822_182333_00012_segyw, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [197 rows, 613B] [98 rows/s, 307B/s]

presto:default>

Note: the posts in this Hudi series are study notes based on the official Hudi documentation, with some personal understanding added; please bear with any shortcomings ☺

Note: links to other related articles (data-lake posts, including Hudi) can be found here -> 数据湖 文章汇总 (Data Lake Article Index)

