Hudi Setup and Usage
1. Pull the Hudi source
git clone --branch release-0.11.0 https://gitee.com/apache/Hudi.git
or
git clone --branch release-0.11.0 https://github.com/apache/hudi.git
2. Switch to the Aliyun mirror repository
Add the following repository to the <repositories> section of the project's root pom.xml (or to your Maven settings.xml) so dependencies are resolved through the Aliyun mirror:
<repository>
  <id>nexus-aliyun</id>
  <name>nexus-aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
3. Build
Build against Spark 2.4, Scala 2.11, and Flink 1.13:
mvn clean package -DskipTests -Dspark2.4 -Dscala-2.11 -Dflink1.13
4. Startup and Usage
1. Start spark-shell
spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --jars /usr/local/Hudi/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.11.0.jar
1.1 Import dependencies and set the table name, base path, and data generator used to generate the records for this guide
// spark-shell
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
1.2 Generate some new trips, load them into a DataFrame, and write the DataFrame to the Hudi table, as shown below.
// spark-shell
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode
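To verify the write, you can read the table back with a snapshot query. This is a minimal sketch following the Hudi quickstart, assuming the same spark-shell session and basePath from the steps above (with Hudi 0.11 the table can be loaded directly from basePath without partition globs).
// spark-shell
// Load the table as a snapshot query and register it as a temporary view
val tripsSnapshotDF = spark.read.format("hudi").load(basePath)
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
// Query a few of the fields produced by the DataGenerator
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()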