Apache Hudi - 1 - quick-start-guide: Quick Start

This post walks through data operations with Apache Hudi: configuring Hudi in a pyspark environment; inserting, querying, updating, time-travel querying, incrementally querying, and deleting data; and handling common errors. Examples demonstrate Hudi's upsert and append modes, and an FAQ answers common questions.

Preface

This post follows the quick-start-guide section of the official Hudi documentation: https://hudi.apache.org/docs/quick-start-guide (following the docs verbatim did not work for me, and I hit a few small pitfalls along the way).

Please strictly observe the jar version constraints:

Hudi              Supported Spark 3 version
0.10.0            3.1.x (default build), 3.0.x
0.7.0 - 0.9.0     3.0.x
0.6.0 and prior   not supported
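These coordinates are exactly what the --packages flag resolves in the shell sessions below; the same dependencies can also be declared when building the session from Python. A minimal sketch, assuming the 0.10.1 bundle for Spark 3.1.x and letting Spark resolve the artifacts itself:

import pyspark

# equivalent to `pyspark --packages ...` -- Spark resolves the coordinates at startup
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,"
            "org.apache.spark:spark-avro_2.12:3.1.2") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark = builder.getOrCreate()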

spark shell configuration

This step follows the official description; the only catch was that I had not downloaded the jars beforehand, which cost me a small detour.

Error when launching the Hudi-enabled pyspark environment

ERROR INFO:

(venv) gavin@GavindeMacBook-Pro test % which python3.8
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python3.8
(venv) gavin@GavindeMacBook-Pro test % export PYSPARK_PYTHON=$(which python3.8)
(venv) gavin@GavindeMacBook-Pro test % echo ${PYSPARK_PYTHON}
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python3.8
(venv) gavin@GavindeMacBook-Pro test % pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Python 3.8.9 (default, Oct 26 2021, 07:25:54) 
[Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gavin/.ivy2/cache
The jars for the packages stored in: /Users/gavin/.ivy2/jars
org.apache.hudi#hudi-spark3.1.2-bundle_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-311b5800-4168-498b-987f-f714233cf50c;1.0
        confs: [default]
:: resolution report :: resolve 614199ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   2   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
                module not found: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1

        ==== local-m2-cache: tried

          file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar

        ==== local-ivy-cache: tried

          /Users/gavin/.ivy2/local/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/ivys/ivy.xml

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          /Users/gavin/.ivy2/local/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/jars/hudi-spark3.1.2-bundle_2.12.jar

        ==== central: tried

          https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar

        ==== spark-packages: tried

          https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar

                module not found: org.apache.spark#spark-avro_2.12;3.1.2

        ==== local-m2-cache: tried

          file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar

        ==== local-ivy-cache: tried

          /Users/gavin/.ivy2/local/org.apache.spark/spark-avro_2.12/3.1.2/ivys/ivy.xml

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          /Users/gavin/.ivy2/local/org.apache.spark/spark-avro_2.12/3.1.2/jars/spark-avro_2.12.jar

        ==== central: tried

          https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar

        ==== spark-packages: tried

          https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar

                ::::::::::::::::::::::::::::::::::::::::::::::

                ::          UNRESOLVED DEPENDENCIES         ::

                ::::::::::::::::::::::::::::::::::::::::::::::

                :: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1: not found

                :: org.apache.spark#spark-avro_2.12;3.1.2: not found

                ::::::::::::::::::::::::::::::::::::::::::::::


:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar (java.net.ConnectException: Operation timed out (Connection timed out))


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1: not found, unresolved dependency: org.apache.spark#spark-avro_2.12;3.1.2: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1429)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/python/pyspark/shell.py", line 35, in <module>
    SparkContext._ensure_initialized()  # type: ignore
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/context.py", line 331, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
>>> exit()

The log shows that the resolver also checks the local Maven repository, so I used Maven to download the two required jars (for Spark 3.1.2) by declaring them as dependencies:

        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-spark3.1.2-bundle_2.12</artifactId>
            <version>0.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>

Alternatively, download the jars directly from the repository:

Hudi jar download:

https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/

https://repo1.maven.org/maven2/org/apache/hudi/hudi/

Once Maven has downloaded them, the jars can be seen in the local repository: /Users/gavin/.m2/repository/org/apache/hudi/…
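Before relaunching pyspark, it is worth confirming that the artifacts actually landed where Ivy will look. A throwaway check, with the paths copied from the resolver log above:

import os

jars = [
    os.path.expanduser("~/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar"),
    os.path.expanduser("~/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar"),
]
for jar in jars:
    # prints MISSING if Maven put the jar somewhere else (or the download failed)
    print(jar, "->", "ok" if os.path.exists(jar) else "MISSING")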

pyspark with Hudi now starts successfully

(venv) gavin@GavindeMacBook-Pro test % pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Python 3.8.9 (default, Oct 26 2021, 07:25:54) 
[Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
22/03/01 10:20:28 WARN Utils: Your hostname, GavindeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.24.227 instead (on interface en0)
22/03/01 10:20:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gavin/.ivy2/cache
The jars for the packages stored in: /Users/gavin/.ivy2/jars
org.apache.hudi#hudi-spark3.1.2-bundle_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9a87dae7-3c6a-4133-838b-c7050b1d8b89;1.0
        confs: [default]
        found org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1 in local-m2-cache
        found org.apache.spark#spark-avro_2.12;3.1.2 in local-m2-cache
        found org.spark-project.spark#unused;1.0.0 in local-m2-cache
downloading file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar ...
        [SUCCESSFUL ] org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar (54ms)
downloading file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar ...
        [SUCCESSFUL ] org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar (2ms)
downloading file:/Users/gavin/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar ...
        [SUCCESSFUL ] org.spark-project.spark#unused;1.0.0!unused.jar (2ms)
:: resolution report :: resolve 6622ms :: artifacts dl 62ms
        :: modules in use:
        org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1 from local-m2-cache in [default]
        org.apache.spark#spark-avro_2.12;3.1.2 from local-m2-cache in [default]
        org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-9a87dae7-3c6a-4133-838b-c7050b1d8b89
        confs: [default]
        3 artifacts copied, 0 already retrieved (38092kB/67ms)
22/03/01 10:20:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.8.9 (default, Oct 26 2021 07:25:54)
Spark context Web UI available at http://192.168.24.227:4040
Spark context available as 'sc' (master = local[*], app id = local-1646101237379).
SparkSession available as 'spark'.
>>> 

Running the code from IDEA

Inserting data (creates the table if it does not exist; upsert)

This example performs an "upsert" style write: records whose keys already exist are updated, and the rest are inserted as new rows. Which write type is used is controlled by the option "hoodie.datasource.write.operation" (other values include "insert" and "bulk_insert"); see https://hudi.apache.org/docs/configurations for details.

Following the official document (https://hudi.apache.org/docs/quick-start-guide), we get the code below:

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars", "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                              "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    # table name, base path, and the sample trip-data generator from Hudi's quick-start utilities
    tableName = "hudi_trips_cow"
    basePath = "file:///tmp/hudi_trips_cow"
    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

    # generate 10 sample trip records and load them into a DataFrame
    inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    df.write.format("hudi"). \
        options(**hudi_options). \
        mode("overwrite"). \
        save(basePath)

ps: the inserted rows are sample data produced by "org.apache.hudi.QuickstartUtils.DataGenerator()" (this is exactly what the official code does; see the Query section for what the data looks like).
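With the table in place, updating it is the same write with mode("append"): the new commit is upserted into the existing table instead of recreating it (this is the append mode mentioned in the summary). A minimal sketch following the official quick-start, reusing spark, sc, dataGen, hudi_options and basePath from the code above:

# generate updates for some of the previously inserted record keys, then upsert them
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(basePath)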

Running the insert code above produces the following directory structure (a textbook partitioned-table layout):

gavin@GavindeMacBook-Pro apache % tree /tmp/hudi_trips_cow 
/tmp/hudi_trips_cow
├── americas
│   ├── brazil
│   │   └── sao_paulo
│   │       └── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-27-31_20220301105108074.parquet
│   └── united_states
│       └── san_francisco
│           └── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-33-32_20220301105108074.parquet
└── asia
    └── india
        └── chennai
            └── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-33-33_20220301105108074.parquet

8 directories, 3 files
gavin@GavindeMacBook-Pro apache % 

Extras

Besides the data files themselves, Hudi also keeps hidden metadata under ".hoodie"; its contents deserve a separate write-up:

gavin@GavindeMacBook-Pro hudi_trips_cow % ll -a
total 0
drwxr-xr-x   5 gavin  wheel  160 Mar  1 10:51 .
drwxrwxrwt  10 root   wheel  320 Mar  1 11:26 ..
drwxr-xr-x  13 gavin  wheel  416 Mar  1 10:51 .hoodie
drwxr-xr-x   4 gavin  wheel  128 Mar  1 10:51 americas
drwxr-xr-x   3 gavin  wheel   96 Mar  1 10:51 asia
gavin@GavindeMacBook-Pro hudi_trips_cow % tree .hoodie 
.hoodie
├── 20220301105108074.commit
├── 20220301105108074.commit.requested
├── 20220301105108074.inflight
├── archived
└── hoodie.properties
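In the meantime, the timeline can already be inspected with plain Python: every successful write leaves a <timestamp>.commit file under .hoodie, while .commit.requested and .inflight mark in-progress states. A quick sketch, assuming the local path from this example:

import os

hoodie_dir = '/tmp/hudi_trips_cow/.hoodie'
# one instant per completed commit
commits = sorted(f[:-len('.commit')] for f in os.listdir(hoodie_dir) if f.endswith('.commit'))
print(commits)  # e.g. ['20220301105108074']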

Querying data (the current snapshot)

Query code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext
    basePath = "file:///tmp/hudi_trips_cow"

    # snapshot query: read the current version of the table
    tripsSnapshotDF = spark. \
        read. \
        format("hudi"). \
        load(basePath)
    # load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery

    count = tripsSnapshotDF.count()
    print(f'========hudi_trips_snapshot contains [{count}] rows')
    print('Schema:')
    tripsSnapshotDF.printSchema()

    tripsSnapshotDF.show()

    tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

    spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
    spark.sql(
        "select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()

Query results

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py
22/03/01 11:18:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 11:18:27 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
========hudi_trips_snapshot contains [10] rows
Schema:
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- begin_lat: double (nullable = true)
 |-- begin_lon: double (nullable = true)
 |-- driver: string (nullable = true)
 |-- end_lat: double (nullable = true)
 |-- end_lon: double (nullable = true)
 |-- fare: double (nullable = true)
 |-- rider: string (nullable = true)
 |-- ts: long (nullable = true)
 |-- uuid: string (nullable = true)
 |-- partitionpath: string (nullable = true)

+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|          begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|  20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...|  americas/united_s...|52a5ee08-9376-495...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
|  20220301105108074|20220301105108074...|67ee90ec-d7b8-477...|  americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
|  20220301105108074|20220301105108074...|91703076-f580-49f...|  americas/united_s...|52a5ee08-9376-495...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
|  20220301105108074|20220301105108074...|96a7571e-1e54-4bc...|  americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
|  20220301105108074|20220301105108074...|3723b4ac-8841-4cd...|  americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
|  20220301105108074|20220301105108074...|b3bf0b93-768d-4be...|  americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...|  americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|3409ecd2-02c2-40c...|  americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|60903a00-3fdc-45d...|    asia/india/chennai|2f5b659d-3738-48c...|  0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...|  asia/india/chennai|
|  20220301105108074|20220301105108074...|22d1507b-7d02-402...|    asia/india/chennai|2f5b659d-3738-48c...|   0.40613510977307| 0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...|  asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
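The snapshot read above always reflects the latest commit. Since 0.9, Hudi can also read the table as of an earlier instant (the time-travel query mentioned in the overview). A minimal sketch, reusing the commit timestamp 20220301105108074 visible in the results above:

# read the table as it looked at a past commit instant
spark.read. \
    format("hudi"). \
    option("as.of.instant", "20220301105108074"). \
    load(basePath). \
    show()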