Data Lake - Hudi (12): Using Spark to Run Incremental Queries on Hudi Data (Incremental Query)

Table of Contents

0. Related Articles

1. Environment and Data Preparation

2. Maven Dependencies

3. Core Code


0. Related Articles

Data Lake Article Index

1. Environment and Data Preparation

For setting up the Hudi environment and preparing the data, refer to my earlier post; it is not repeated here. Post link: Data Lake - Hudi (9): Inserting Data into Hudi with Spark. A rough sketch of that preparation follows below.
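
The sketch below is only a reminder of roughly what that preparation looks like; it is modeled on the Hudi quickstart rather than copied from the linked post, and the object name Demo_PrepareData and the exact write options are illustrative assumptions. It generates sample trip records and writes them into the tbl_trips_cow table so that the incremental query in this post has at least one commit to read from.

package com.ouyang.hudi.crud

import org.apache.hudi.QuickstartUtils._
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

import scala.collection.JavaConverters._

/**
 * Sketch only (modeled on the Hudi quickstart): generate sample trip records
 * and write them to the Copy-on-Write table used by the incremental query demo.
 */
object Demo_PrepareData {

    def main(args: Array[String]): Unit = {

        System.setProperty("HADOOP_USER_NAME", "root")

        val spark: SparkSession = SparkSession.builder()
            .appName(this.getClass.getSimpleName.stripSuffix("$"))
            .master("local[4]")
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .getOrCreate()

        import spark.implicits._

        val tableName: String = "tbl_trips_cow"
        val tablePath: String = "/hudi-warehouse/tbl_trips_cow"

        // Generate 100 sample trip records as JSON strings using the quickstart data generator
        val dataGen = new DataGenerator()
        val inserts = convertToStringList(dataGen.generateInserts(100)).asScala
        val insertDF: DataFrame = spark.read.json(spark.createDataset(inserts))

        // Write the records into the Hudi table (COW by default)
        insertDF.write
            .format("hudi")
            .options(getQuickstartWriteConfigs)
            .option("hoodie.datasource.write.precombine.field", "ts")
            .option("hoodie.datasource.write.recordkey.field", "uuid")
            .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
            .option("hoodie.table.name", tableName)
            .mode(SaveMode.Overwrite)
            .save(tablePath)

        spark.stop()
    }
}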

2. Maven Dependencies

The Maven dependencies are also listed in the other post, but they are repeated here for completeness:

    <repositories>
        <repository>
            <id>aliyun</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>jboss</id>
            <url>http://repository.jboss.com/nexus/content/groups/public</url>
        </repository>
    </repositories>
 
    <properties>
        <scala.version>2.12.10</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
        <spark.version>3.0.0</spark.version>
        <hadoop.version>3.0.0</hadoop.version>
        <hudi.version>0.9.0</hudi.version>
    </properties>
 
    <dependencies>
 
        <!-- Scala language dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
 
        <!-- Spark Core dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark SQL dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
 
        <!-- Hadoop client dependencies -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
 
        <!-- Hudi Spark3 bundle and spark-avro -->
        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-spark3-bundle_2.12</artifactId>
            <version>${hudi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
 
    </dependencies>
 
    <build>
        <outputDirectory>target/classes</outputDirectory>
        <testOutputDirectory>target/test-classes</testOutputDirectory>
        <resources>
            <resource>
                <directory>${project.basedir}/src/main/resources</directory>
            </resource>
        </resources>
        <!-- Maven compiler plugins -->
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

3. Core Code

Step 1: Load all data from the Hudi table and obtain the commit time, which serves as the threshold for the incremental query.
Step 2: Set the Hudi commit-time threshold and run the incremental query.
Step 3: Register the incremental query result as a temporary view and query the records whose fare is greater than 20.

package com.ouyang.hudi.crud

import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

/**
 * @ date: 2022/2/23
 * @ author: yangshibiao
 * @ desc: Incremental query of Hudi data, using the Spark SQL approach
 */
object Demo04_IncrementalQuery {

    def main(args: Array[String]): Unit = {

        System.setProperty("HADOOP_USER_NAME", "root")

        // Create the SparkSession instance and configure it
        val spark: SparkSession = {
            SparkSession.builder()
                .appName(this.getClass.getSimpleName.stripSuffix("$"))
                .master("local[4]")
                // Use Kryo serialization
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate()
        }

        // Define the table name and its storage path
        val tableName: String = "tbl_trips_cow"
        val tablePath: String = "/hudi-warehouse/tbl_trips_cow"

        // Import implicit conversions and related helpers
        import spark.implicits._

        // Step 1: load all data from the Hudi table and obtain the commit time to use as the incremental query threshold
        spark.read
            .format("hudi")
            .load(tablePath)
            .createOrReplaceTempView("view_temp_hudi_trips")
        val allDF: DataFrame = spark.sql(
            """
              |select * from view_temp_hudi_trips
              |""".stripMargin)
        println("Hudi中所有数据的条数为:" + allDF.count())

        val commits: Array[String] = spark
            .sql(
                """
                  |select
                  |  distinct(_hoodie_commit_time) as commitTime
                  |from
                  |  view_temp_hudi_trips
                  |order by
                  |  commitTime DESC
                  |""".stripMargin
            )
            .map((row: Row) => row.getString(0))
            .collect()
        println("Number of distinct _hoodie_commit_time values: " + commits.length)

        // The commits are sorted in descending order, so the last element is the earliest commit time;
        // subtracting 1 ensures the begin instant lies before every commit in the table
        val beginTime: Long = commits(commits.length - 1).toLong - 1
        println(s"Incremental query begin time (earliest commit time minus 1): ${beginTime}")

        // Step 2: set the commit-time threshold and run the incremental query against Hudi
        println("Reading from Hudi again, this time with a begin instant set")
        val tripsIncrementalDF: DataFrame = spark.read
            .format("hudi")
            // Set the query type to incremental read
            .option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL)
            // Begin instant (commit-time threshold) for the incremental read
            .option(BEGIN_INSTANTTIME.key(), beginTime)
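            // Per the Hudi docs, only records written by commits with an instant time strictly
            // greater than the begin instant are returned, which is why 1 is subtracted above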
            .load(tablePath)

        // Step 3: register the incremental result as a temporary view and query records with fare > 20
        tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
        val resultDF: DataFrame = spark
            .sql(
                """
                  |select
                  |  `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts
                  |from
                  |  hudi_trips_incremental
                  |where
                  |  fare > 20.0
                  |""".stripMargin
            )
        println("Number of records matching the filter: " + resultDF.count())
        println("The matching records are shown below:")
        resultDF.show(100, truncate = false)

    }
}

The final output is shown below. Note that the query result has been filtered: only the 76 records with fare greater than 20 are returned.

Total number of records in the Hudi table: 100
Number of distinct _hoodie_commit_time values: 1
Incremental query begin time (earliest commit time minus 1): 20220224124828
Reading from Hudi again, this time with a begin instant set
Number of records matching the filter: 76
The matching records are shown below:
+-------------------+------------------+-------------------+--------------------+-------------+
|_hoodie_commit_time|fare              |begin_lon          |begin_lat           |ts           |
+-------------------+------------------+-------------------+--------------------+-------------+
|20220224124829     |24.65031205441023 |0.7150696027624646 |9.544772278234914E-4|1645508871779|
|20220224124829     |84.28575558796736 |0.6627849637400387 |0.7985867991529113  |1645494731336|
|20220224124829     |46.65992353549729 |0.3157934820865995 |0.9924142645535157  |1645076258368|
|20220224124829     |96.25194167049236 |0.2350569142085449 |0.826183030502974   |1645233146143|
|20220224124829     |93.56018115236618 |0.14285051259466197|0.21624150367601136 |1645372155865|
|20220224124829     |28.874644702723472|0.49689215534636744|0.04316839215753254 |1645540197461|
|20220224124829     |64.28985520906711 |0.7701561399054763 |0.6995789249723998  |1645279339568|
|20220224124829     |56.69486815784974 |0.8047885824928995 |0.4461749593405654  |1645349002597|
|20220224124829     |51.85760883596304 |0.40231135692262376|0.45812904555684386 |1645131514017|
|20220224124829     |92.0536330577404  |0.8032800489802543 |0.5655712287397079  |1645353419405|
|20220224124829     |25.216729525590676|0.03482702091010481|0.48687190581855855 |1645317257338|
|20220224124829     |42.76921664939422 |0.41452263884832685|0.20404106962358204 |1645142380107|
|20220224124829     |64.12151064878266 |0.9563153782052657 |0.8675932789048282  |1645137535487|
|20220224124829     |64.27696295884016 |0.4923479652912024 |0.5731835407930634  |1645137067428|
|20220224124829     |33.92216483948643 |0.9694586417848392 |0.1856488085068272  |1645405916135|
|20220224124829     |57.62896261799536 |0.3883212395069259 |0.30057620949299213 |1645474978593|
|20220224124829     |66.64889106258252 |0.09632451474505643|0.47805950282725407 |1645613061572|
|20220224124829     |96.35314017496283 |0.3643791915968686 |0.776244653745167   |1645163496324|
|20220224124829     |93.09855584709396 |0.8015679659795738 |0.952306583683483   |1645345997528|
|20220224124829     |77.05976291070496 |0.610843492129245  |0.5692544178629111  |1645646023496|
|20220224124829     |27.79478688582596 |0.6273212202489661 |0.11488393157088261 |1645146955615|
|20220224124829     |72.86514373710996 |0.4805634684323683 |0.2697207272566471  |1645553761536|
|20220224124829     |38.697902072535484|0.2895800693712469 |0.9199515909032545  |1645575245660|
|20220224124829     |49.899171213436844|0.8716474406347761 |0.49054633351061006 |1645078211377|
|20220224124829     |84.9600214569341  |0.8039197581711358 |0.2947661370147079  |1645082674593|
|20220224124829     |64.14546157902316 |0.2285420562988809 |0.4008802745410629  |1645199107465|
|20220224124829     |52.69712318306616 |0.37272120488128546|0.3748535764638379  |1645389790539|
|20220224124829     |95.96221628238303 |0.5824868069725256 |0.27967030157708683 |1645614102866|
|20220224124829     |49.121690071563506|0.8750494376540229 |0.3880100101379198  |1645518390136|
|20220224124829     |30.24821012722806 |0.3259549255934986 |0.6437496229932878  |1645392427568|
|20220224124829     |66.62084366450246 |0.03844104444445928|0.0750588760043035  |1645462801205|
|20220224124829     |22.85729206746916 |0.14011059922351543|0.5378950285504629  |1645518360450|
|20220224124829     |34.158284716382845|0.46157858450465483|0.4726905879569653  |1645337084089|
|20220224124829     |54.54006969282713 |0.06150601978071968|0.41321106258416285 |1645265444224|
|20220224124829     |88.49896596590881 |0.9989772163510318 |0.19873758263401708 |1645671912869|
|20220224124829     |66.21616968017035 |0.4404703912280492 |0.8354158487065114  |1645620067715|
|20220224124829     |87.08158608552242 |0.2693250504574297 |0.9025710109008239  |1645183842286|
|20220224124829     |43.4923811219014  |0.8779402295427752 |0.6100070562136587  |1645526524405|
|20220224124829     |72.75824843751782 |0.28393433672984614|0.67243450582925    |1645571404222|
|20220224124829     |46.971815642308016|0.7723215898397776 |0.6325393869124881  |1645517561808|
|20220224124829     |31.32477949501916 |0.2202009625132143 |0.7267793086410466  |1645523732370|
|20220224124829     |88.57635938164037 |0.25472131435987666|0.5045582154226707  |1645456462268|
|20220224124829     |49.57985534250222 |0.2365242449257826 |0.13036108279724024 |1645396953932|
|20220224124829     |53.57146284741064 |0.7976793493421773 |0.9084944020139248  |1645139138687|
|20220224124829     |70.7187160490572  |0.17206379614150713|0.29234574995144014 |1645091090945|
|20220224124829     |94.06345130502677 |0.04322400348102218|0.9668843983075005  |1645245827293|
|20220224124829     |42.46412330377599 |0.11580010866153201|0.8918316400031095  |1645146252031|
|20220224124829     |44.839244944180244|0.04241635032425073|0.6372504913279929  |1645498541192|
|20220224124829     |60.047501243947934|0.3961523475372767 |0.983428192817987   |1645241465099|
|20220224124829     |41.076686078636236|0.4559336764388273 |0.5712378196458244  |1645427900899|
|20220224124829     |22.991770617403628|0.8105360506582145 |0.699025398548803   |1645110471240|
|20220224124829     |27.911375263393268|0.07097928915812768|0.9461601725825765  |1645399666782|
|20220224124829     |27.66236301605771 |0.7525032121800279 |0.7527035644196625  |1645445303070|
|20220224124829     |81.50991077375751 |0.9486805724237938 |0.7107035158051175  |1645117968327|
|20220224124829     |57.62049570298873 |0.747341018556108  |0.030241465200331774|1645408896521|
|20220224124829     |30.47844781909017 |0.07682825311613706|0.10509642405359532 |1645430260244|
|20220224124829     |80.5491504148736  |0.4159896734134194 |0.7548568723276352  |1645327117577|
|20220224124829     |30.80177695413958 |0.8750683366449247 |0.3613216010259426  |1645286544570|
|20220224124829     |93.00604432281203 |0.28072552620450797|0.49527694252432053 |1645199689631|
|20220224124829     |71.65259045568622 |0.731314927888718  |0.572693797369734   |1645202406745|
|20220224124829     |55.191161039724676|0.7826771915638148 |0.42204161309648225 |1645109711884|
|20220224124829     |89.45662517846566 |0.15358646185072777|0.2536616844442684  |1645565664004|
|20220224124829     |70.59591659793207 |0.17992665967365185|0.8679173655153939  |1645633279179|
|20220224124829     |61.202928373030986|0.8708563161108613 |0.7696704530645273  |1645399715549|
|20220224124829     |38.61457381408665 |0.5761097193536119 |0.39253605282983284 |1645144087129|
|20220224124829     |40.211140833035394|0.8801105093619153 |0.9090538095331541  |1645355610612|
|20220224124829     |99.46343958295148 |0.8630157667444018 |0.4805271604136475  |1645338532512|
|20220224124829     |61.47361832518315 |0.15209476758450546|0.9475737219843783  |1645426757250|
|20220224124829     |28.53709038726113 |0.2370254092732652 |0.132849613764075   |1645388731719|
|20220224124829     |99.75501233740296 |0.392670629542598  |0.5561568349082263  |1645549642116|
|20220224124829     |86.92639065900747 |0.2887009329948117 |0.03154543220118411 |1645530212011|
|20220224124829     |39.31163975206524 |0.9049457113019617 |0.7548086309564753  |1645216808248|
|20220224124829     |97.1099231460059  |0.5077348257408091 |0.3625677215882801  |1645537665593|
|20220224124829     |52.18544099844657 |0.7797963278409558 |0.4527914447326259  |1645125921687|
|20220224124829     |41.06290929046368 |0.8192868687714224 |0.651058505660742   |1645114163811|
|20220224124829     |53.69977335639399 |0.9623582692596406 |0.09384124531808036 |1645675210280|
+-------------------+------------------+-------------------+--------------------+-------------+
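
As a follow-up, an incremental read can also be bounded on both ends. The sketch below is not from the original demo: it assumes the END_INSTANTTIME option exposed by org.apache.hudi.DataSourceReadOptions in Hudi 0.9.0 (key hoodie.datasource.read.end.instanttime), and the begin/end values shown are illustrative. According to the Hudi documentation, records from commits with an instant time greater than the begin instant and no later than the end instant are returned.

package com.ouyang.hudi.crud

import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, END_INSTANTTIME, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
 * Sketch only: a bounded (point-in-time) incremental query, assuming the
 * END_INSTANTTIME option available in DataSourceReadOptions of Hudi 0.9.0.
 */
object Demo05_BoundedIncrementalQuery {

    def main(args: Array[String]): Unit = {

        System.setProperty("HADOOP_USER_NAME", "root")

        val spark: SparkSession = SparkSession.builder()
            .appName(this.getClass.getSimpleName.stripSuffix("$"))
            .master("local[4]")
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .getOrCreate()

        val tablePath: String = "/hudi-warehouse/tbl_trips_cow"

        // Illustrative commit-time bounds in Hudi's yyyyMMddHHmmss format
        val beginTime: String = "20220224000000"
        val endTime: String = "20220224235959"

        // Incremental read limited to commits with beginTime < instant time <= endTime
        val boundedDF: DataFrame = spark.read
            .format("hudi")
            .option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL)
            .option(BEGIN_INSTANTTIME.key(), beginTime)
            .option(END_INSTANTTIME.key(), endTime)
            .load(tablePath)

        boundedDF.createOrReplaceTempView("hudi_trips_bounded")
        spark
            .sql("select `_hoodie_commit_time`, fare, ts from hudi_trips_bounded where fare > 20.0")
            .show(20, truncate = false)

        spark.stop()
    }
}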

Note: The posts in this Hudi series are study notes based on the Hudi official documentation, with some of my own understanding added; please forgive any shortcomings. ☺☺☺

Note: Other related articles (on Hudi and other data-lake topics) can be found here -> Data Lake Article Index

