1. Environment and Data Preparation
The environment and data preparation for Hudi are covered in another of the author's posts and will not be repeated here. See: Data Lake with Hudi (9): Inserting Data into Hudi with Spark.
2. Maven Dependencies
The Maven dependencies also appear in the post linked above, but they are repeated here for completeness:
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.12.10</scala.version>
    <scala.binary.version>2.12</scala.binary.version>
    <spark.version>3.0.0</spark.version>
    <hadoop.version>3.0.0</hadoop.version>
    <hudi.version>0.9.0</hudi.version>
</properties>

<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark SQL dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Hadoop Client dependencies -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <!-- hudi-spark3 -->
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-spark3-bundle_2.12</artifactId>
        <version>${hudi.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven compilation plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
3. Core Code
Step 1: Load all data from the Hudi table and obtain a commit time to use as the incremental query threshold.
Step 2: Set the Hudi commit-time threshold and run the incremental query.
Step 3: Register the incremental result as a temporary view and query records with fare greater than 20.
package com.ouyang.hudi.crud

import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

/**
 * @ date: 2022/2/23
 * @ author: yangshibiao
 * @ desc: Incremental Query of data, using the SQL approach
 */
object Demo04_IncrementalQuery {

    def main(args: Array[String]): Unit = {

        System.setProperty("HADOOP_USER_NAME", "root")

        // Create a SparkSession instance and set its properties
        val spark: SparkSession = {
            SparkSession.builder()
                .appName(this.getClass.getSimpleName.stripSuffix("$"))
                .master("local[4]")
                // Use Kryo serialization
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate()
        }

        // Define the table name and storage path
        val tableName: String = "tbl_trips_cow"
        val tablePath: String = "/hudi-warehouse/tbl_trips_cow"

        // Import implicit conversions and related methods
        import spark.implicits._

        // Step 1: load all data from the Hudi table and derive a commit time
        // to use as the incremental query threshold
        spark.read
            .format("hudi")
            .load(tablePath)
            .createOrReplaceTempView("view_temp_hudi_trips")
        val allDF: DataFrame = spark.sql(
            """
              |select * from view_temp_hudi_trips
              |""".stripMargin)
        println("Total number of records in Hudi: " + allDF.count())

        val commits: Array[String] = spark
            .sql(
                """
                  |select
                  |  distinct(_hoodie_commit_time) as commitTime
                  |from
                  |  view_temp_hudi_trips
                  |order by
                  |  commitTime DESC
                  |""".stripMargin
            )
            .map((row: Row) => row.getString(0))
            .collect()
        println("Number of distinct _hoodie_commit_time values: " + commits.length)

        // The commits are sorted in descending order, so the last element is the
        // earliest commit time; subtracting 1 ensures that commit itself is
        // included in the incremental read
        val beginTime: Long = commits(commits.length - 1).toLong - 1
        println(s"Earliest _hoodie_commit_time minus 1: ${beginTime}")

        // Step 2: set the commit-time threshold and run the incremental query
        println("Reading from Hudi again, this time with a begin time set")
        val tripsIncrementalDF: DataFrame = spark.read
            .format("hudi")
            // Set the query type to incremental read
            .option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL)
            // Set the start time for the incremental read
            .option(BEGIN_INSTANTTIME.key(), beginTime)
            .load(tablePath)

        // Step 3: register the incremental result as a temporary view
        // and query records with fare > 20
        tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
        val resultDF: DataFrame = spark
            .sql(
                """
                  |select
                  |  `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts
                  |from
                  |  hudi_trips_incremental
                  |where
                  |  fare > 20.0
                  |""".stripMargin
            )
        println("Number of records matching the filter: " + resultDF.count())
        println("The data is printed below:")
        resultDF.show(100, truncate = false)
    }
}
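As an aside, the SQL filter in step 3 can be expressed equivalently with the DataFrame API. A minimal sketch, assuming the same `tripsIncrementalDF` and the `spark.implicits._` import from the code above (the name `resultDF2` is only for illustration):

```scala
// Equivalent to the SQL query in step 3, written with the DataFrame API.
// Assumes tripsIncrementalDF and spark.implicits._ are in scope.
val resultDF2: DataFrame = tripsIncrementalDF
    .select($"_hoodie_commit_time", $"fare", $"begin_lon", $"begin_lat", $"ts")
    .filter($"fare" > 20.0)
resultDF2.show(100, truncate = false)
```

Both forms produce the same plan after analysis; which one to use is a matter of taste.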
Running the program produces the output below; note that the incremental results are filtered by the query:
Total number of records in Hudi: 100
Number of distinct _hoodie_commit_time values: 1
Earliest _hoodie_commit_time minus 1: 20220224124828
Reading from Hudi again, this time with a begin time set
Number of records matching the filter: 76
The data is printed below:
+-------------------+------------------+-------------------+--------------------+-------------+
|_hoodie_commit_time|fare |begin_lon |begin_lat |ts |
+-------------------+------------------+-------------------+--------------------+-------------+
|20220224124829 |24.65031205441023 |0.7150696027624646 |9.544772278234914E-4|1645508871779|
|20220224124829 |84.28575558796736 |0.6627849637400387 |0.7985867991529113 |1645494731336|
|20220224124829 |46.65992353549729 |0.3157934820865995 |0.9924142645535157 |1645076258368|
|20220224124829 |96.25194167049236 |0.2350569142085449 |0.826183030502974 |1645233146143|
|20220224124829 |93.56018115236618 |0.14285051259466197|0.21624150367601136 |1645372155865|
|20220224124829 |28.874644702723472|0.49689215534636744|0.04316839215753254 |1645540197461|
|20220224124829 |64.28985520906711 |0.7701561399054763 |0.6995789249723998 |1645279339568|
|20220224124829 |56.69486815784974 |0.8047885824928995 |0.4461749593405654 |1645349002597|
|20220224124829 |51.85760883596304 |0.40231135692262376|0.45812904555684386 |1645131514017|
|20220224124829 |92.0536330577404 |0.8032800489802543 |0.5655712287397079 |1645353419405|
|20220224124829 |25.216729525590676|0.03482702091010481|0.48687190581855855 |1645317257338|
|20220224124829 |42.76921664939422 |0.41452263884832685|0.20404106962358204 |1645142380107|
|20220224124829 |64.12151064878266 |0.9563153782052657 |0.8675932789048282 |1645137535487|
|20220224124829 |64.27696295884016 |0.4923479652912024 |0.5731835407930634 |1645137067428|
|20220224124829 |33.92216483948643 |0.9694586417848392 |0.1856488085068272 |1645405916135|
|20220224124829 |57.62896261799536 |0.3883212395069259 |0.30057620949299213 |1645474978593|
|20220224124829 |66.64889106258252 |0.09632451474505643|0.47805950282725407 |1645613061572|
|20220224124829 |96.35314017496283 |0.3643791915968686 |0.776244653745167 |1645163496324|
|20220224124829 |93.09855584709396 |0.8015679659795738 |0.952306583683483 |1645345997528|
|20220224124829 |77.05976291070496 |0.610843492129245 |0.5692544178629111 |1645646023496|
|20220224124829 |27.79478688582596 |0.6273212202489661 |0.11488393157088261 |1645146955615|
|20220224124829 |72.86514373710996 |0.4805634684323683 |0.2697207272566471 |1645553761536|
|20220224124829 |38.697902072535484|0.2895800693712469 |0.9199515909032545 |1645575245660|
|20220224124829 |49.899171213436844|0.8716474406347761 |0.49054633351061006 |1645078211377|
|20220224124829 |84.9600214569341 |0.8039197581711358 |0.2947661370147079 |1645082674593|
|20220224124829 |64.14546157902316 |0.2285420562988809 |0.4008802745410629 |1645199107465|
|20220224124829 |52.69712318306616 |0.37272120488128546|0.3748535764638379 |1645389790539|
|20220224124829 |95.96221628238303 |0.5824868069725256 |0.27967030157708683 |1645614102866|
|20220224124829 |49.121690071563506|0.8750494376540229 |0.3880100101379198 |1645518390136|
|20220224124829 |30.24821012722806 |0.3259549255934986 |0.6437496229932878 |1645392427568|
|20220224124829 |66.62084366450246 |0.03844104444445928|0.0750588760043035 |1645462801205|
|20220224124829 |22.85729206746916 |0.14011059922351543|0.5378950285504629 |1645518360450|
|20220224124829 |34.158284716382845|0.46157858450465483|0.4726905879569653 |1645337084089|
|20220224124829 |54.54006969282713 |0.06150601978071968|0.41321106258416285 |1645265444224|
|20220224124829 |88.49896596590881 |0.9989772163510318 |0.19873758263401708 |1645671912869|
|20220224124829 |66.21616968017035 |0.4404703912280492 |0.8354158487065114 |1645620067715|
|20220224124829 |87.08158608552242 |0.2693250504574297 |0.9025710109008239 |1645183842286|
|20220224124829 |43.4923811219014 |0.8779402295427752 |0.6100070562136587 |1645526524405|
|20220224124829 |72.75824843751782 |0.28393433672984614|0.67243450582925 |1645571404222|
|20220224124829 |46.971815642308016|0.7723215898397776 |0.6325393869124881 |1645517561808|
|20220224124829 |31.32477949501916 |0.2202009625132143 |0.7267793086410466 |1645523732370|
|20220224124829 |88.57635938164037 |0.25472131435987666|0.5045582154226707 |1645456462268|
|20220224124829 |49.57985534250222 |0.2365242449257826 |0.13036108279724024 |1645396953932|
|20220224124829 |53.57146284741064 |0.7976793493421773 |0.9084944020139248 |1645139138687|
|20220224124829 |70.7187160490572 |0.17206379614150713|0.29234574995144014 |1645091090945|
|20220224124829 |94.06345130502677 |0.04322400348102218|0.9668843983075005 |1645245827293|
|20220224124829 |42.46412330377599 |0.11580010866153201|0.8918316400031095 |1645146252031|
|20220224124829 |44.839244944180244|0.04241635032425073|0.6372504913279929 |1645498541192|
|20220224124829 |60.047501243947934|0.3961523475372767 |0.983428192817987 |1645241465099|
|20220224124829 |41.076686078636236|0.4559336764388273 |0.5712378196458244 |1645427900899|
|20220224124829 |22.991770617403628|0.8105360506582145 |0.699025398548803 |1645110471240|
|20220224124829 |27.911375263393268|0.07097928915812768|0.9461601725825765 |1645399666782|
|20220224124829 |27.66236301605771 |0.7525032121800279 |0.7527035644196625 |1645445303070|
|20220224124829 |81.50991077375751 |0.9486805724237938 |0.7107035158051175 |1645117968327|
|20220224124829 |57.62049570298873 |0.747341018556108 |0.030241465200331774|1645408896521|
|20220224124829 |30.47844781909017 |0.07682825311613706|0.10509642405359532 |1645430260244|
|20220224124829 |80.5491504148736 |0.4159896734134194 |0.7548568723276352 |1645327117577|
|20220224124829 |30.80177695413958 |0.8750683366449247 |0.3613216010259426 |1645286544570|
|20220224124829 |93.00604432281203 |0.28072552620450797|0.49527694252432053 |1645199689631|
|20220224124829 |71.65259045568622 |0.731314927888718 |0.572693797369734 |1645202406745|
|20220224124829 |55.191161039724676|0.7826771915638148 |0.42204161309648225 |1645109711884|
|20220224124829 |89.45662517846566 |0.15358646185072777|0.2536616844442684 |1645565664004|
|20220224124829 |70.59591659793207 |0.17992665967365185|0.8679173655153939 |1645633279179|
|20220224124829 |61.202928373030986|0.8708563161108613 |0.7696704530645273 |1645399715549|
|20220224124829 |38.61457381408665 |0.5761097193536119 |0.39253605282983284 |1645144087129|
|20220224124829 |40.211140833035394|0.8801105093619153 |0.9090538095331541 |1645355610612|
|20220224124829 |99.46343958295148 |0.8630157667444018 |0.4805271604136475 |1645338532512|
|20220224124829 |61.47361832518315 |0.15209476758450546|0.9475737219843783 |1645426757250|
|20220224124829 |28.53709038726113 |0.2370254092732652 |0.132849613764075 |1645388731719|
|20220224124829 |99.75501233740296 |0.392670629542598 |0.5561568349082263 |1645549642116|
|20220224124829 |86.92639065900747 |0.2887009329948117 |0.03154543220118411 |1645530212011|
|20220224124829 |39.31163975206524 |0.9049457113019617 |0.7548086309564753 |1645216808248|
|20220224124829 |97.1099231460059 |0.5077348257408091 |0.3625677215882801 |1645537665593|
|20220224124829 |52.18544099844657 |0.7797963278409558 |0.4527914447326259 |1645125921687|
|20220224124829 |41.06290929046368 |0.8192868687714224 |0.651058505660742 |1645114163811|
|20220224124829 |53.69977335639399 |0.9623582692596406 |0.09384124531808036 |1645675210280|
+-------------------+------------------+-------------------+--------------------+-------------+
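Besides setting only a begin time, Hudi also supports point-in-time queries, which bound the incremental read on both ends via the `END_INSTANTTIME` option from `DataSourceReadOptions`. A minimal sketch under the same setup; the instant values below are placeholders, not output of the program above:

```scala
import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, END_INSTANTTIME, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}

// Point-in-time query: read only the commits in the interval (beginTime, endTime].
// The values here are placeholders; substitute real commit times from your table.
val beginTime = "000"            // "000" means from the beginning of the table
val endTime = "20220224124829"   // upper bound, e.g. a commit time observed earlier
val pointInTimeDF: DataFrame = spark.read
    .format("hudi")
    .option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL)
    .option(BEGIN_INSTANTTIME.key(), beginTime)
    .option(END_INSTANTTIME.key(), endTime)
    .load(tablePath)
```

This is useful for replaying the table state between two known commits rather than tailing everything after a single threshold.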
Note: The posts in this Hudi series were written while studying the official Hudi documentation, with the author's own understanding mixed in; please bear with any shortcomings ☺☺☺
Note: Links to other related articles (including the various data-lake posts, Hudi among them) can be found here -> Data Lake Article Index