Source: Hudi (10): Hudi Integration with Spark - Concurrency Control (CSDN Blog)
Table of Contents
0. Related Article Links
1. Concurrency Control Supported by Hudi
1.1. MVCC
For Hudi table operations such as compaction, cleaning, and commit, Hudi uses multi-version concurrency control (MVCC) to provide snapshot isolation between these table-service writes and queries. With the MVCC model, Hudi supports any number of concurrently running table-service jobs and guarantees that no conflicts occur; this is Hudi's default model. In the MVCC approach, all table services use the same writer, which guarantees there are no conflicts and avoids race conditions.
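In other words, the default mode needs no external lock provider. As a rough sketch, the corresponding default settings look like the following (assuming the lowercase value single_writer is accepted, mirroring the lowercase optimistic_concurrency_control used below):
# single-writer mode (the default): no lock provider required
hoodie.write.concurrency.mode=single_writer
# failed writes are rolled back eagerly by the single writer (the default policy)
hoodie.cleaner.policy.failed.writes=EAGER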
1.2. OPTIMISTIC CONCURRENCY
For write operations (upsert, insert, etc.), optimistic concurrency control (OCC) allows multiple writers to write data into the same table. Hudi supports file-level optimistic concurrency: for any two commits (writes) happening on the same table, if they do not write to overlapping files that are being changed, both writes are allowed to succeed. This feature is still experimental and requires ZooKeeper or HiveMetastore to acquire locks.
2. Enabling Concurrent Writes
- To enable optimistic concurrent writes, the following properties need to be set:
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=<lock-provider-classname>
- Hudi's lock-acquisition service offers two modes, using either ZooKeeper or HiveMetastore:
Related ZooKeeper parameters:
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url
hoodie.write.lock.zookeeper.port
hoodie.write.lock.zookeeper.lock_key
hoodie.write.lock.zookeeper.base_path
Related HiveMetastore parameters; the HiveMetastore URI is extracted from the Hadoop configuration files loaded at runtime:
hoodie.write.lock.provider=org.apache.hudi.hive.HiveMetastoreBasedLockProvider
hoodie.write.lock.hivemetastore.database
hoodie.write.lock.hivemetastore.table
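Putting the pieces together, a HiveMetastore-based lock configuration could look like the following sketch; the database and table values here are placeholders chosen for illustration and should point at an existing Hive database and table:
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.hive.HiveMetastoreBasedLockProvider
# placeholder values - use a database/table that actually exists in the metastore
hoodie.write.lock.hivemetastore.database=default
hoodie.write.lock.hivemetastore.table=hudi_lock_table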
3. Concurrent Writes with Spark DataFrame
(1) Start spark-shell
spark-shell \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
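Note: this assumes the Hudi Spark bundle is already available to Spark (for example, copied into Spark's jars directory). If it is not, it can be added when launching spark-shell; the artifact coordinates below are an assumption based on the Spark 3.2.2 and Hudi 0.12.0 versions used later in this article:
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.0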
(2) Write the code (the key part is the hoodie-related options passed on write)
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

// Generate sample trip records and load them into a DataFrame
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

// Write to Hudi with optimistic concurrency control enabled,
// using the ZooKeeper-based lock provider
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  option("hoodie.write.lock.zookeeper.url", "hadoop1,hadoop2,hadoop3").
  option("hoodie.write.lock.zookeeper.port", "2181").
  option("hoodie.write.lock.zookeeper.lock_key", "test_table").
  option("hoodie.write.lock.zookeeper.base_path", "/multiwriter_test").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
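After the write completes, the table can be read back as a quick check that the data landed; this follows the Hudi quickstart pattern, and the column names come from the quickstart data generator:
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.select("uuid", "partitionpath", "ts").show(false)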
(3) Use the zk client to verify that ZooKeeper was actually used
/opt/module/apache-zookeeper-3.5.7/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
(4) A corresponding directory is created in ZooKeeper; the node under /multiwriter_test is the lock_key specified in the code
[zk: localhost:2181(CONNECTED) 1] ls /multiwriter_test
4. Concurrent Writes with Delta Streamer
Building on the earlier DeltaStreamer example, use Delta Streamer to consume data from Kafka and write it into Hudi, this time adding the concurrent-write parameters.
1) Go to the configuration directory, copy and edit the configuration file to add the corresponding parameters, then upload it to HDFS
cd /opt/module/hudi-props/
cp kafka-source.properties kafka-multiwriter-source.propertis
vim kafka-multiwriter-source.propertis

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=hadoop1,hadoop2,hadoop3
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=test_table2
hoodie.write.lock.zookeeper.base_path=/multiwriter_test2

hadoop fs -put /opt/module/hudi-props/kafka-multiwriter-source.propertis /hudi-props
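As a quick sanity check, the uploaded properties file can be inspected on HDFS before running the job:
hadoop fs -cat /hudi-props/kafka-multiwriter-source.propertis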
2) Run Delta Streamer
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/opt/module/spark-3.2.2/jars/hudi-utilities-bundle_2.12-0.12.0.jar \
--props hdfs://hadoop1:8020/hudi-props/kafka-multiwriter-source.propertis \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field userid \
--target-base-path hdfs://hadoop1:8020/tmp/hudi/hudi_test_multi \
--target-table hudi_test_multi \
--op INSERT \
--table-type MERGE_ON_READ
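Once the job is running, the target table can be queried from spark-shell to confirm that data is arriving; this is a minimal sketch, where the path matches the --target-base-path above and the read returns the snapshot view of the MERGE_ON_READ table:
val multiDF = spark.read.format("hudi").load("hdfs://hadoop1:8020/tmp/hudi/hudi_test_multi")
multiDF.show(false)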
3) Check whether a new directory was created in ZooKeeper
/opt/module/apache-zookeeper-3.5.7-bin/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[zk: localhost:2181(CONNECTED) 1] ls /multiwriter_test2
Note: Links to other Hudi-related articles can be found here -> Hudi Article Index