Environment and Dependencies
Add the dependencies:
<dependencies>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.2.2</version>
    </dependency>
    <!-- Spark SQL dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.2.2</version>
    </dependency>
    <!-- Spark-Hive integration -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.12</artifactId>
        <version>3.2.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>
    <!-- MySQL JDBC driver -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.27</version>
    </dependency>
    <dependency>
        <groupId>org.codehaus.janino</groupId>
        <artifactId>janino</artifactId>
        <version>3.0.12</version>
    </dependency>
    <!-- Hadoop client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.4</version>
    </dependency>
    <!-- Hudi bundle for Spark 3.2 -->
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-spark3.2-bundle_2.12</artifactId>
        <version>0.12.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- assembly packaging plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                    <configuration>
                        <archive>
                            <manifest>
                                <addClasspath>true</addClasspath>
                                <!-- set this to your own entry class, e.g. spark_hudi.hudi below -->
                                <mainClass>com.example.MainClass</mainClass>
                            </manifest>
                        </archive>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- plugin required for Maven to compile Scala -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Operating on Tables
By default, if a preCombineKey is provided, insert into uses upsert as the write operation type; otherwise a plain insert is used.
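To make that concrete, here is a minimal sketch of the behavior, assuming a SparkSession named spark that is already configured with the Hudi catalog and extensions (as in the example in the next section); the table spark_hudi.hudi_mor_tbl and its columns are hypothetical names used only for illustration:
// Hypothetical table with a preCombineField, so insert into acts as upsert
spark.sql(
  """create table if not exists spark_hudi.hudi_mor_tbl (
    |  id int, name string, price double, ts bigint
    |) using hudi
    |tblproperties (type = 'mor', primaryKey = 'id', preCombineField = 'ts')""".stripMargin)
spark.sql("insert into spark_hudi.hudi_mor_tbl select 1, 'a1', 20, 1000")
// Re-inserting the same primary key updates the existing row instead of duplicating it
spark.sql("insert into spark_hudi.hudi_mor_tbl select 1, 'a1_new', 30, 2000")
spark.sql("select * from spark_hudi.hudi_mor_tbl where id = 1").show() // one row, price 30.0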
Inserting Data into a Non-Partitioned Table
package spark_hudi

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object hudi {
  def main(args: Array[String]): Unit = {
    // Set the HDFS user name
    System.setProperty("HADOOP_USER_NAME", "root")
    // Create the SparkConf object
    val sparkConf: SparkConf = new SparkConf()
      .setAppName("onHudi")
      // local[*] is for local testing; remove it (or override with --master) when submitting to a cluster
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
      .set("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    // Create the SparkSession object
    val spark: SparkSession = SparkSession
      .builder()
      .config(sparkConf)
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("insert into spark_hudi.hudi_cow_nonpcf_tbl select 3, 'a1', 20")
    spark.sql("select * from spark_hudi.hudi_cow_nonpcf_tbl where uuid = 3").show()
    spark.close()
  }
}
As you can probably see by now, the operations are all just SQL: whatever SQL statement you need to run, execute it inside spark.sql. The specific SQL statements for operating on Hudi tables are covered in the previous post: SQL方式对hudi表进行操作_open_test01的博客-CSDN博客.
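For instance, updates and deletes run through spark.sql in exactly the same way. A brief sketch, reusing the hypothetical spark_hudi.hudi_mor_tbl table from the earlier example (the column names are assumptions):
// Hudi Spark SQL also supports update and delete on tables with a primary key
spark.sql("update spark_hudi.hudi_mor_tbl set price = price * 2 where id = 1")
spark.sql("delete from spark_hudi.hudi_mor_tbl where id = 1")
spark.sql("select * from spark_hudi.hudi_mor_tbl").show()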
Packaging and Submitting to the Cluster
Add the Maven jar plugin for packaging:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <addClasspath>true</addClasspath>
                <useUniqueVersions>false</useUniqueVersions>
                <classpathPrefix>lib/</classpathPrefix>
            </manifest>
        </archive>
    </configuration>
</plugin>
Run Maven's package operation and wait for packaging to complete.
Submit and run (remember that setMaster("local[*]") in the code should be removed, or overridden with --master, when running on a cluster):
spark-submit \
--class spark_hudi.hudi \
/opt/jars/sparks-1.0-SNAPSHOT.jar