Apache Hudi - 3 - Hands-On with AWS Glue

Hands-On with AWS Glue

Glue Job Configuration

Note: you will have to work out the IAM permissions yourself; in practice, just add whatever permission turns out to be missing. If this is only for testing, granting everything (action and resource both set to *) solves it once and for all.
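
For a quick test, a wildcard inline policy on the Glue job's role is enough. A minimal boto3 sketch of that "allow everything" setup; the role name and policy name below are hypothetical placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

# Testing only: allow every action on every resource, as suggested above.
# "GlueHudiTestRole" and the policy name are hypothetical placeholders.
iam.put_role_policy(
    RoleName="GlueHudiTestRole",
    PolicyName="allow-everything-for-testing",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
    }),
)
```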

Job Parameters

| Key | Value |
| --- | --- |
| --conf | spark.serializer=org.apache.spark.serializer.KryoSerializer |
| --enable-glue-datacatalog | |

Dependent jars path

s3://gavin-test2/dependency_jars/hudi/spark-avro_2.11-2.4.3.jar,s3://gavin-test2/dependency_jars/hudi/hudi-spark-bundle_2.11-0.8.0.jar

Download links for the jars:

| Jar | Download link |
| --- | --- |
| hudi-spark-bundle_2.11-0.8.0.jar | https://search.maven.org/remotecontent?filepath=org/apache/hudi/hudi-spark-bundle_2.11/0.8.0/hudi-spark-bundle_2.11-0.8.0.jar |
| spark-avro_2.11-2.4.3.jar | https://search.maven.org/remotecontent?filepath=org/apache/spark/spark-avro_2.11/2.4.3/spark-avro_2.11-2.4.3.jar |
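
The same configuration can also be applied when creating the job with boto3 instead of the console. A minimal sketch reproducing the job parameters and dependent jars above; the job name, role, and script location are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job definition; adjust Name, Role and ScriptLocation to your setup.
glue.create_job(
    Name="hudi-glue-demo",
    Role="GlueHudiTestRole",
    GlueVersion="2.0",  # Spark 2.4.x runtime, matching the Spark 2.4.3 / Scala 2.11 jars above
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://gavin-test2/scripts/hudi_demo.py",  # hypothetical path
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--enable-glue-datacatalog": "",  # the console leaves this value empty
        "--extra-jars": (
            "s3://gavin-test2/dependency_jars/hudi/spark-avro_2.11-2.4.3.jar,"
            "s3://gavin-test2/dependency_jars/hudi/hudi-spark-bundle_2.11-0.8.0.jar"
        ),
    },
)
```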

Writing a Non-Partitioned Table

Python Code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import *


args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

basePath = 's3://gavin-test2/tables/hudi/table1/'
table_name = 'table1'
database = 'default'
data = [('Alice', 1, '2022/02/28'), ('Jhone', 2, '2022/03/01')]
rdd = sc.parallelize(data)
schema = StructType(
    [
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("partitioin_path", StringType(), True),
    ]
)
src_df = spark.createDataFrame(rdd, schema)

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.recordkey.field': 'name',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.hive_sync.partition_fields': '',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
}

src_df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
AWS Glue Catalog

(screenshot omitted: table1 registered in the AWS Glue Data Catalog)

Query in Athena

(screenshot omitted: Athena query results for table1, reproduced below)

| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | name | age | partitioin_path |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 20220307080836 | 20220307080836_0_1 | Jhone | | 8965ef34-4048-4420-8e69-562a478c3989-0_0-13-481_20220307080836.parquet | Jhone | 2 | 2022/03/01 |
| 20220307080836 | 20220307080836_0_2 | Alice | | 8965ef34-4048-4420-8e69-562a478c3989-0_0-13-481_20220307080836.parquet | Alice | 1 | 2022/02/28 |

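Besides Athena, the table can also be read back directly in the Glue job through the Hudi Spark datasource. A minimal sketch, reusing the spark session and basePath from the script above and assuming the Hive sync succeeded and the Glue Data Catalog is enabled for the job:

```python
# Load the Hudi table from S3; the _hoodie_* metadata columns from the
# result table above are available alongside the data columns.
hudi_df = spark.read.format("hudi").load(basePath)
hudi_df.select("_hoodie_commit_time", "_hoodie_record_key", "name", "age").show(truncate=False)

# Or query the Hive-synced table through the Glue Data Catalog.
spark.sql("SELECT name, age FROM default.table1").show()
```
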
Files in S3 Bucket

(screenshot omitted: Hudi data files for table1 under s3://gavin-test2/tables/hudi/table1/)

Writing a Partitioned Table

The partition field can be either a time field or a non-time field; this article uses a time string of the form "yyyy/mm/dd" as the partition field.

Python Code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import *


args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

basePath = 's3://gavin-test2/tables/hudi/table2/'
table_name = 'table2'
database = 'default'
data = [('Alice', 1, '2022/02/28'), ('Jhone', 2, '2022/03/01')]
rdd = sc.parallelize(data)
schema = StructType(
    [
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("partitioin_path", StringType(), True),
    ]
)
src_df = spark.createDataFrame(rdd, schema)

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.recordkey.field': 'name',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.write.partitionpath.field': 'partitioin_path',
    'hoodie.datasource.hive_sync.partition_fields': 'partitioin_path',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor'
}

src_df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
AWS Glue Catalog

(screenshot omitted: table2 registered in the AWS Glue Data Catalog)

Query in Athena

(screenshot omitted: Athena query results for table2, reproduced below)

| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | name | age | partitioin_path |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 20220307084134 | 20220307084134_0_1 | Jhone | 2022/03/01 | afb8d8e3-5f3e-4420-b390-1eeb20a59165-0_0-8-322_20220307084134.parquet | Jhone | 2 | 2022-03-01 |
| 20220307084134 | 20220307084134_1_1 | Alice | 2022/02/28 | 351c3c19-3c69-4661-8051-11a90c031112-0_1-8-323_20220307084134.parquet | Alice | 1 | 2022-02-28 |

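Because table2 is partitioned on partitioin_path, the data is laid out under yyyy/mm/dd folders in S3 and can be filtered on that column when reading it back. A small sketch, again reusing spark and basePath from the script above; with Hudi 0.8 the load path typically needs a glob over the partition folders, while newer releases can load basePath directly:

```python
# Glob over the yyyy/mm/dd partition folders under basePath.
hudi_df = spark.read.format("hudi").load(basePath + "*/*/*/*")

# Filter on the partition column to keep only the 2022/03/01 rows.
hudi_df.where("partitioin_path = '2022/03/01'") \
       .select("_hoodie_partition_path", "name", "age") \
       .show(truncate=False)
```
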
Files in S3 Bucket

(screenshots omitted: the yyyy/mm/dd partition folders for table2 under s3://gavin-test2/tables/hudi/table2/)

FAQ

ClassNotFoundException: org.apache.calcite.rel.type.RelDataTypeSystem

Error Info

With DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY -> "false" set in Hudi, the job fails with:

java.lang.ClassNotFoundException: org.apache.calcite.rel.type.RelDataTypeSystem

This is caused by Hive 3 / Spark 3 having dropped the dependency on the calcite package.

Solution:

I took the lazy route and downgraded Spark to 2.x.
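
In Glue terms, "downgrading Spark to 2.x" means running the job on a Glue version that still ships Spark 2.4 rather than 3.x. A boto3 sketch, assuming the job already exists and reusing the same hypothetical names and paths as earlier:

```python
import boto3

glue = boto3.client("glue")

# Move the job back to the Glue 2.0 runtime (Spark 2.4.x) to avoid the
# missing calcite classes seen with the Spark 3 runtime.
glue.update_job(
    JobName="hudi-glue-demo",  # hypothetical job name
    JobUpdate={
        "Role": "GlueHudiTestRole",  # Role and Command are required in JobUpdate
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://gavin-test2/scripts/hudi_demo.py",
            "PythonVersion": "3",
        },
        "GlueVersion": "2.0",
    },
)
```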

IllegalArgumentException: Partition path default is not in the form yyyy/mm/dd

Error Info

Caused by: java.lang.IllegalArgumentException: Partition path default is not in the form yyyy/mm/dd 
	at org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
	at org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
	... 42 more

Solution

This happens because 'hoodie.datasource.hive_sync.partition_extractor_class' is set to 'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor', and that extractor requires the partition path to match exactly this date format; see:

org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor#extractPartitionValuesInPath

(screenshot omitted: source of the extractPartitionValuesInPath check)
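
In other words, the extractor splits the relative partition path on "/" and expects exactly three components (year, month, day); anything else, including the "default" path of a non-partitioned table, raises the error above. A rough Python sketch of that check, written as a paraphrase for illustration rather than the actual Java implementation:

```python
def extract_partition_values(partition_path: str):
    # Paths must look like "2022/03/01"; everything else is rejected.
    parts = partition_path.split("/")
    if len(parts) != 3:
        raise ValueError(
            f"Partition path {partition_path} is not in the form yyyy/mm/dd"
        )
    year, month, day = parts
    # The extractor hands Hive a single yyyy-mm-dd partition value, which is
    # why the Athena results above show 2022-03-01 rather than 2022/03/01.
    return [f"{year}-{month}-{day}"]
```

The practical takeaway is to keep the key generator, the partition layout, and the extractor consistent: NonpartitionedKeyGenerator with NonPartitionedExtractor for non-partitioned tables (as in the first example), and a yyyy/mm/dd partition string with SlashEncodedDayPartitionValueExtractor for partitioned tables (as in the second).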

References

[1] Hudi 实践 | 在 AWS Glue 中使用 Apache Hudi: https://jishuin.proginn.com/p/763bfbd56de6

[2] 详解Apache Hudi如何配置各种类型分区: https://www.cnblogs.com/leesf456/p/13521694.html

[3] EMR + Hudi报ClassNotFoundException: RelDataTypeSystem错误的解决方法: https://blog.csdn.net/bluishglc/article/details/117441071
