Background
With Sqoop no longer actively maintained, and Spark supporting JDBC data sources natively, data import tooling has been shifting to Spark-native imports. This article walks through some hands-on practice with that feature.
Code
The Java code is as follows:
import lombok.extern.slf4j.Slf4j;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import java.io.File;
import java.util.Properties;

@Slf4j
public class ImportMysqlTest {
    public static void main(String[] args) {
        SparkSession spark = null;
        long start = System.currentTimeMillis();
        try {
            String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
            spark = SparkSession
                    .builder()
                    .master("yarn")
                    .appName("Java Spark Hive Example")
                    .config("spark.sql.warehouse.dir", warehouseLocation)
                    .enableHiveSupport()
                    .getOrCreate();
            Properties connectionProperties = new Properties();
            connectionProperties.put("user", "*userName*");
            connectionProperties.put("password", "*****");
            connectionProperties.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver");
            System.out.println("---- read from sqlserver begin ----");
            // Fetch the min and max of the table's auto-increment column (here: auto_index) beforehand
            int lowerBound = 0;            // minimum of auto_index
            int upperBound = 588_578_233;  // maximum of auto_index
            // Targeting about 70,000 rows per task, numPartitions = upperBound / 70,000
            int numPartitions = 8400;
            // Read from SQL Server and write the Dataset into Hive;
            // saveAsTable creates the table if it does not yet exist in Hive
            spark.read().jdbc("jdbc:sqlserver://${ip}:1433;DatabaseName=${dbName}", "${tableName}",
                            "auto_index", lowerBound, upperBound, numPartitions, connectionProperties)
                    .write().mode(SaveMode.Overwrite).saveAsTable("caoyong_test.ship_ib");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (spark != null) {
                spark.close();
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("---- read from sqlserver end, " + (end - start) / 1000.0 + "s ----");
    }
}
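As a sanity check on the numbers chosen above, the per-partition stride that Spark derives from lowerBound, upperBound and numPartitions can be computed directly. This is a minimal sketch; Spark's JDBC reader (see JDBCRelation.columnPartition) uses, to my understanding, the same integer division:

```java
public class StrideCheck {
    public static void main(String[] args) {
        long lowerBound = 0L;
        long upperBound = 588_578_233L;
        long numPartitions = 8_400L;
        // Spark walks the [lowerBound, upperBound) range of the partition
        // column in fixed-size strides, one stride per partition:
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        System.out.println(stride);  // close to the ~70,000 rows-per-task target
    }
}
```

With 8,400 partitions the stride lands at 70,068 ids per partition, which matches the roughly-70,000-rows-per-task goal stated in the code comment (assuming auto_index values are dense).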
Launching the Spark job
# spark.executor.cores=6: each executor runs up to 6 tasks in parallel (threads)
# spark.default.parallelism=10: this setting did not appear to take effect here
# num-executors=3: the job starts 3 executors on the cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.cores=6 \
  --conf spark.default.parallelism=10 \
  --driver-memory 2g \
  --executor-memory 6g \
  --num-executors 3 \
  --name import_ship_ibase_data \
  --class com.lenovo.ai.bigdata.spark.data.mysql.ImportMysqlTest bigdata-0.0.1-SNAPSHOT.jar
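From these settings, the expected ceiling on simultaneously running tasks is num-executors times spark.executor.cores. A quick check (a trivial sketch of the arithmetic, useful when comparing against what the ApplicationMaster view reports):

```java
public class ParallelismCheck {
    public static void main(String[] args) {
        int numExecutors = 3;   // from --num-executors
        int executorCores = 6;  // from spark.executor.cores
        // Maximum number of tasks the cluster can run at once for this job
        System.out.println(numExecutors * executorCores);  // prints 18
    }
}
```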
Verification
While the job runs, the ApplicationMaster view shows the following:
- In the figure below you can see the total number of tasks in the cluster, and that 19 tasks are running in parallel (I would expect 3 × 6 = 18; the extra one may be a misreport, or perhaps a task handling the write into Hive?)
- The three parameters lowerBound, upperBound and numPartitions exist purely for partitioning. For example, with 1,000,000 rows, upperBound = 1,000, lowerBound = 0 and numPartitions = 5, the 0 to 1,000 range is split across 5 tasks, and every row above upperBound (from 1,000 up to 1,000,000) falls into the last of those partition tasks. Take special care here: the last partition can easily fail because it ends up holding far too much data.
- The detail list shows that the cluster started three executors, each running roughly 6 tasks in parallel
- The whole import covered 184,826,995 rows and took 25 minutes in total
- The data in SQL Server is shown in the figure below
- The data in Hive:
[hadoop@node2 conf]$ beeline
Beeline version 2.3.7 by Apache Hive
beeline> !connect jdbc:hive2://node2:11240
Connecting to jdbc:hive2://node2:11240
Enter username for jdbc:hive2://node2:11240: hadoop
Enter password for jdbc:hive2://node2:11240: **********
2021-08-26 16:52:41,542 INFO jdbc.Utils: Supplied authorities: node2:11240
2021-08-26 16:52:41,543 INFO jdbc.Utils: Resolved authority: node2:11240
Connected to: Spark SQL (version 3.1.2)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://node2:11240> use caoyong_test;
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.349 seconds)
0: jdbc:hive2://node2:11240> show tables;
+---------------+------------+--------------+
|   database    | tableName  | isTemporary  |
+---------------+------------+--------------+
| caoyong_test  | ship_ib    | false        |
| caoyong_test  | ship_ib2   | false        |
| caoyong_test  | student    | false        |
+---------------+------------+--------------+
3 rows selected (0.49 seconds)
0: jdbc:hive2://node2:11240> select count(*) from ship_ib2;
+------------+
|  count(1)  |
+------------+
| 184826995  |
+------------+
1 row selected (37.696 seconds)
0: jdbc:hive2://node2:11240>
- Timing: 25 minutes in total (start: 16:21:59, end: 16:47:07)
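The partitioning pitfall noted above can be sketched by generating the WHERE clauses for each partition, using the 1,000,000-rows / upperBound=1,000 example from the text. This is only an approximation of what Spark's JDBCRelation.columnPartition produces (the exact SQL text may differ, e.g. Spark also adds an IS NULL check to the first clause), but the boundary behavior is the point:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionClauses {
    // Approximates how Spark splits [lowerBound, upperBound) on a partition
    // column into numPartitions WHERE clauses. Note that the first clause is
    // unbounded below and the last is unbounded above: rows outside the
    // bounds are not skipped, they pile into the two edge partitions.
    static List<String> clauses(String col, long lower, long upper, int parts) {
        long stride = upper / parts - lower / parts;
        List<String> result = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < parts; i++) {
            String lo = (i == 0) ? null : col + " >= " + current;
            current += stride;
            String hi = (i == parts - 1) ? null : col + " < " + current;
            if (lo == null) {
                result.add(hi);
            } else if (hi == null) {
                result.add(lo);
            } else {
                result.add(lo + " AND " + hi);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The example from the text: upperBound=1,000, lowerBound=0, 5 partitions
        for (String where : clauses("auto_index", 0, 1_000, 5)) {
            System.out.println(where);
        }
    }
}
```

The last clause printed is `auto_index >= 800`, with no upper limit, which is why all rows from 800 up to 1,000,000 land in one task when upperBound is set far below the real maximum, and why that task is the one most likely to fail.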