Demo
Overview
This is a demo: Spark reads data from Hive and writes it to Doris.
Build the fat JAR, then submit the job with spark-submit:
/usr/local/spark3/bin/spark-submit --class com.ctyun.bigdata.sql.spark_hive_doris \
/home/doris/spark-write2doris-demo-1.0-SNAPSHOT-jar-with-dependencies.jar
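With the POM below, the maven-assembly-plugin is bound to the package phase, so a plain Maven build is enough to produce the jar-with-dependencies artifact under target/:
mvn clean package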
Create the test tables
First, create a table in Hive:
create database if not exists test;
create table test.person(
id int,
name string,
age int
)
stored as orc;
Insert some test data:
insert into table test.person values(1, 'Haimeimei', 10);
insert into table test.person values(2, 'David', 11);
insert into table test.person values(3, 'Json', 12);
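Before wiring up Doris, it is worth confirming that Spark can actually see this data. A minimal check, assuming a spark-shell started with Hive support configured:
// Run in spark-shell: should list the three rows inserted above
spark.sql("select * from test.person").show()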
A matching output table must also be created in Doris:
CREATE TABLE IF NOT EXISTS test.person
(
id int,
name string,
age int
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH (id) BUCKETS 1
PROPERTIES(
"replication_allocation" = "tag.location.default: 1"
);
Configure the Doris settings in the demo's main class
The main class is com.ctyun.bigdata.sql.spark_hive_doris.
Adjust the following variables to match your cluster (example values below; an alternative sketch follows the snippet):
dorisFeNodes: Doris FE HTTP address; multiple addresses are supported, separated by commas
dorisTable: table identifier, e.g. db1.tbl1
dorisUser: username for accessing Doris
dorisPwd: password for accessing Doris
// Write to Doris
val dorisFeNodes = "127.0.0.1:8030"
val dorisUser = "xxxx"
val dorisPwd = "xxxxxxxxxx"
val dorisTable = "test.person"
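Hard-coding credentials is fine for a demo; in a real job the same values would usually be passed in at submit time instead. A minimal sketch (this argument handling is an illustration, not part of the demo code):
// Hypothetical variant: take the Doris settings from the spark-submit program
// arguments, e.g. ... demo.jar 127.0.0.1:8030 user pwd test.person
val Array(dorisFeNodes, dorisUser, dorisPwd, dorisTable) = args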
Check the result in Doris
After the job has run, query the Doris output table:
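Doris speaks the MySQL protocol, so any MySQL client works; connect to the FE query port (9030 by default), for example:
mysql -h 127.0.0.1 -P 9030 -uroot -p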
mysql> select * from test.person;
+------+-----------+------+
| id | name | age |
+------+-----------+------+
| 2 | David | 11 |
| 1 | Haimeimei | 10 |
| 3 | Json | 12 |
+------+-----------+------+
3 rows in set (0.02 sec)
If you see the three rows above, the write succeeded.
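The write can also be verified from Spark itself by reading the table back through the connector's documented read options (same fenodes and credentials as on the write path):
// Read the Doris table back through the connector as a sanity check
val checkDf = spark.read.format("doris")
  .option("doris.fenodes", dorisFeNodes)
  .option("doris.table.identifier", dorisTable)
  .option("user", dorisUser)
  .option("password", dorisPwd)
  .load()
checkDf.show()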
Core parts of the POM
<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.2.2</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.2.2</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.12</artifactId>
        <version>3.2.2</version>
        <scope>provided</scope>
    </dependency>
    <!-- Doris Spark connector: the "3.2_2.12" suffix must match the Spark (3.2) and Scala (2.12) versions above -->
    <dependency>
        <groupId>org.apache.doris</groupId>
        <artifactId>spark-doris-connector-3.2_2.12</artifactId>
        <version>1.1.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- Compiles the Java sources -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.8.1</version>
            <configuration>
                <source>8</source>
                <target>8</target>
                <encoding>UTF-8</encoding>
            </configuration>
            <executions>
                <execution>
                    <phase>compile</phase>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- Compiles the Scala sources -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- maven-assembly-plugin cannot package Spring Framework projects; use maven-shade-plugin for those -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.2.0</version>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.ctyun.bigdata.sql.spark_hive_doris</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Scala code
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Reads data from Hive and writes it to Doris; this is a demo.
 * Build the fat JAR and submit the job with spark-submit:
 * $SPARK_HOME/bin/spark-submit --class com.ctyun.bigdata.sql.spark_hive_doris \
 *   /home/doris/spark-write2doris-demo-1.0-SNAPSHOT-jar-with-dependencies.jar
 */
object spark_hive_doris {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("write2doris")
    val spark: SparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate

    /**
     * The Hive table must be created first:
     * create table test.person(
     *   id int,
     *   name string,
     *   age int
     * )
     * stored as orc;
     *
     * insert into table test.person values(1, 'Haimeimei', 10);
     * insert into table test.person values(2, 'David', 11);
     * insert into table test.person values(3, 'Json', 12);
     */
    // Read from Hive
    val sql =
      """
        | select
        |   *
        | from
        |   test.person
        |""".stripMargin
    val df = spark.sql(sql)
    df.show()

    // Write to Doris
    val dorisFeNodes = "xxx:8030"
    val dorisUser = "root"
    val dorisPwd = "xxx"
    val dorisTable = "test.person"

    /**
     * The matching table must be created in Doris first;
     * only then can Spark trigger the write.
     * CREATE TABLE IF NOT EXISTS test.person
     * (
     *   id int,
     *   name string,
     *   age int
     * )
     * DUPLICATE KEY(id)
     * DISTRIBUTED BY HASH (id) BUCKETS 1
     * PROPERTIES(
     *   "replication_allocation" = "tag.location.default: 1"
     * );
     */
    df.write
      .format("doris")
      .option("doris.fenodes", dorisFeNodes)
      .option("doris.table.identifier", dorisTable)
      .option("user", dorisUser)
      .option("password", dorisPwd)
      .option("sink.batch.size", 2) // rows per flushed batch; a tiny value, just for this demo
      .option("sink.max-retries", 2) // retries for a failed batch write
      .save()

    spark.stop()
  }
}