Spark applications can be submitted in two deploy modes: client and cluster.
The key difference is where the driver process runs. In cluster mode the master picks a worker node and launches the driver there; in client mode the driver runs inside the submitting process on the local machine.
Client mode is effectively interactive: after submission the client stays connected and receives progress information in real time.
Cluster mode is non-interactive: after submission the job runs in the background, closing the client does not affect the job, and progress has to be checked through the log files.
PS: spark-submit parameters
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
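For example, the same application could be submitted in either mode just by switching --deploy-mode; the class name, jar, and master URL below are placeholders:
# client mode (default): the driver runs inside the submitting process
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://127.0.0.1:7077 \
  --deploy-mode client \
  my-app.jar
# cluster mode: the master launches the driver on one of the workers
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://127.0.0.1:7077 \
  --deploy-mode cluster \
  my-app.jar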
Impact on temporary tables created from RDDs
Because a temporary table lives in the driver's memory, in client mode it consumes local (client-side) memory, whereas in cluster mode it effectively ends up stored on the Spark cluster, since that is where the driver runs.
The example below runs on the client side: it loads data from ODPS into Spark and then creates temporary tables for later queries. Since the application is not packaged into a jar and submitted to the cluster, only client mode can be used.
pom configuration
...
<java.version>1.8</java.version>
<spark.version>2.2.1</spark.version>
<scala.version>2.11</scala.version>
...
<!--spark-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.codehaus.janino</groupId>
<artifactId>commons-compiler</artifactId>
<version>2.7.8</version>
</dependency>
Spring configuration
<bean id="sparkConf" class="org.apache.spark.SparkConf">
<property name="AppName" value="${spark.appName}"/>
<property name="Master" value="${spark.master}"/>
</bean>
<bean id="mySparkSession" class="com.***.***.service.impl.MySparkSession" init-method="init">
<property name="sparkConf" ref="sparkConf"/>
<property name="env" value="${env}"/>
</bean>
MySparkSession.java
...
logger.info("env:{}", env);
if ("product".equalsIgnoreCase(env)) {
    logger.warn("env is product, skip spark initialization, env:{}", env);
    return;
}
if (sparkConf == null) {
    throw new IllegalArgumentException("error sparkConf cannot be null");
}
if (!isLocalhost()) {
    sparkConf.set("spark.executor.cores", "12");
    sparkConf.set("spark.cores.max", "12");
    sparkConf.set("spark.executor.memory", "1g"); // same setting as --executor-memory / spark.executor.memory at submit time
    // raised from 16 to ~100 and beyond; with more tasks the scheduler delay dropped from tens of seconds to a few seconds
    sparkConf.set("spark.default.parallelism", "108");
} else {
    sparkConf.set("spark.executor.cores", "6");
    sparkConf.set("spark.executor.memory", "1g");
    sparkConf.set("spark.default.parallelism", "54");
}
sparkConf.set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
sparkConf.set("spark.logConf", "true"); // log the effective SparkConf at INFO when the SparkContext starts (default: false)
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); // use Kryo instead of the default Java serializer
// sparkConf.set("spark.deploy.mode", "cluster"); // setting the deploy mode this way has no effect
sparkContext = new SparkContext(sparkConf);
sparkSession = SparkSession.builder().sparkContext(sparkContext).getOrCreate();
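The snippet above is only the body of init(). A minimal sketch of the surrounding class that the Spring wiring assumes is shown below; the field names, setters, session() accessor, and the isLocalhost() heuristic are assumptions, not the original source.
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical skeleton: only the members that the Spring bean definition
// (properties "sparkConf" and "env", init-method "init") and the later
// spark.session() calls rely on.
public class MySparkSession {

    private static final Logger logger = LoggerFactory.getLogger(MySparkSession.class);

    private SparkConf sparkConf;   // injected via <property name="sparkConf" .../>
    private String env;            // injected via <property name="env" .../>

    private SparkContext sparkContext;
    private SparkSession sparkSession;

    public void setSparkConf(SparkConf sparkConf) { this.sparkConf = sparkConf; }

    public void setEnv(String env) { this.env = env; }

    // Called by Spring through init-method="init"; the body shown above goes here.
    public void init() {
        // ... see the init logic above ...
    }

    // Accessor used by the loading and query code below (spark.session()).
    public SparkSession session() {
        return sparkSession;
    }

    private boolean isLocalhost() {
        // assumption: decides between local and remote tuning, e.g. by inspecting the master URL
        return sparkConf.get("spark.master", "").startsWith("local");
    }
}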
properties configuration
spark.appName = xx-application
spark.master = spark://127.0.0.1:7077
Loading data from ODPS
public void loadTables(boolean reload,boolean isTest) {
for(TableMapInfo entry : tableInfos){
String odpsTableName = entry.getOdpsTableName();
String sparkTableName = entry.getSparkTableName();
Class theClass = entry.getBeanClass();
boolean hasPartition = entry.isHasPartition();
String partition = null;
if(hasPartition){
partition = DateUtils.format(DateUtils.addDays(new Date(),-1),DateUtils.PATTERN_YEAR2DAY);
partition = entry.getPartitionPrefix()+"="+partition;
}
loadTableByPage(odpsTableName,sparkTableName,theClass,partition,isTest?11000:Integer.MAX_VALUE);
}
}
private <T> T loadTableByPage(String odpsTableName,String sparkTableName,Class<T> theClass,String partition,int limit) {
long startTime;
long pageSize = 10000;
long pageNo = 1;
long startNo = 0;
try {
// Load the table into Spark page by page to avoid OOM from one oversized in-memory batch; pages are appended with Dataset.unionAll
TableTunnel tableTunnel = new TableTunnel(aliyunOdpsClient.getOdps());
PartitionSpec partitionSpec = StringUtils.isBlank(partition) ? null : new PartitionSpec(partition);
TableTunnel.DownloadSession downloadSession = partitionSpec != null ? tableTunnel.createDownloadSession(aliyunOdpsClient.getAliyunOdpsProject(), odpsTableName, partitionSpec)
: tableTunnel.createDownloadSession(aliyunOdpsClient.getAliyunOdpsProject(), odpsTableName);
long recordCount = downloadSession.getRecordCount();
recordCount = Math.min(recordCount,limit);
Dataset<Row> userAll = null;
long totalPages = recordCount%pageSize==0?recordCount/pageSize:recordCount/pageSize +1;
List<T> beans = null;
for(long i=0;i<totalPages;++i){
startTime = System.currentTimeMillis();
startNo = i*pageSize;
beans = OdpsReaderUtil.readTableByPage(downloadSession,odpsTableName, partition, theClass,startNo,pageSize,recordCount);
if(i==0){
userAll = spark.session().createDataFrame(beans, theClass);
}else{
userAll = userAll.unionAll(spark.session().createDataFrame(beans, theClass));
}
beans = null;
logger.warn("{} {} recordTotal:{} page:{}/{} cost:{} ms ", odpsTableName, partition, recordCount, i+1,totalPages,(System.currentTimeMillis() - startTime));
}
startTime = System.currentTimeMillis();
userAll.createOrReplaceGlobalTempView(sparkTableName);
userAll.persist();
userAll = null;
System.gc();
Thread.sleep(10*1000);
logger.info("create table cost:{} ms", (System.currentTimeMillis() - startTime));
return null;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
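OdpsReaderUtil.readTableByPage is not shown above. A minimal sketch of what it might do, assuming the standard ODPS tunnel record-reader API; mapRecordToBean is a hypothetical helper, not part of the original code:
import com.aliyun.odps.data.Record;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.io.TunnelRecordReader;
import java.util.ArrayList;
import java.util.List;

public class OdpsReaderUtil {

    // Read one page [startNo, startNo + pageSize) from an open download session
    // and map each ODPS Record onto a bean of the given class.
    // tableName and partition are only needed for logging; the session already targets them.
    public static <T> List<T> readTableByPage(TableTunnel.DownloadSession session,
                                              String tableName, String partition,
                                              Class<T> beanClass,
                                              long startNo, long pageSize,
                                              long recordCount) throws Exception {
        long count = Math.min(pageSize, recordCount - startNo); // the last page may be shorter
        List<T> beans = new ArrayList<>((int) count);
        TunnelRecordReader reader = session.openRecordReader(startNo, count);
        try {
            Record record;
            while ((record = reader.read()) != null) {
                beans.add(mapRecordToBean(record, beanClass)); // hypothetical record-to-bean mapping
            }
        } finally {
            reader.close();
        }
        return beans;
    }

    private static <T> T mapRecordToBean(Record record, Class<T> beanClass) {
        // assumption: copies record columns onto same-named bean properties, e.g. via reflection
        throw new UnsupportedOperationException("sketch only");
    }
}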
Querying the temporary table
@Override
public QueryResult<Patient> queryPatientsByRule(UserGroupExecRequest request) {
QueryResult<Patient> queryResult = new QueryResult<>();
Encoder<Patient> encoder = Encoders.bean(Patient.class);
Encoder<Total> encoderTotal = Encoders.bean(Total.class);
Map<String,Object> params = new HashMap<>();
params.put("limit",500000000);
params.put("sex",request.getSex());
params.put("beginAge",request.getBeginAge());
params.put("endAge",request.getEndAge());
params.put("isInHospital",request.getIsInHospital());
params.put("startTime",parseDate(request.getStartTime()) );
params.put("endTime",parseDate(request.getEndTime()) );
String sql = sqlMapBuilder.getSql(TABLE_PATIENT,"selectPatients",params);
String sqlCount = sqlMapBuilder.getSql(TABLE_PATIENT,"countPatients",params);
Dataset<Patient> patientDataset = spark.session().sql(sql).as(encoder);
queryResult.setList(patientDataset.collectAsList());
return queryResult;
}
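encoderTotal and sqlCount are declared in the method above but never used there; presumably the total count is filled in the same style. A minimal sketch of that step, placed just before return queryResult, assuming a Total bean with getTotal() and a QueryResult.setTotal(...) setter (both assumptions):
// Run the count statement against the same global temp view and copy the result
// onto the query result; Total.getTotal() and QueryResult.setTotal() are assumed accessors.
Dataset<Total> totalDataset = spark.session().sql(sqlCount).as(encoderTotal);
List<Total> totals = totalDataset.collectAsList();
if (!totals.isEmpty()) {
    queryResult.setTotal(totals.get(0).getTotal());
}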
sqlmap:
<sparkSqlMap namespace="patient">
<sql id="selectPatients">
<![CDATA[
select *
from global_temp.patient a
where 1=1
#if(${sex} && ${sex}!='')
and a.sex = '${sex}'
#end
#if(${beginAge} && ${beginAge}>=0)
and a.age >= '${beginAge}'
#end
#if(${endAge} && ${endAge}>=0)
and a.age <= '${endAge}'
#end
#if(${isInHospital})
and a.is_hospital = '${isInHospital}'
#end
#if(${startTime})
and a.bind_time >= ${startTime}
#end
#if(${endTime})
and a.bind_time <= ${endTime}
#end
limit ${limit}
]]>
</sql>
</sparkSqlMap>
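The Java code also requests a "countPatients" statement from the same namespace; it is not shown above. A sketch using the same template syntax (hypothetical, mirroring selectPatients) might look like:
<sql id="countPatients">
<![CDATA[
select count(*) as total
from global_temp.patient a
where 1=1
#if(${sex} && ${sex}!='')
and a.sex = '${sex}'
#end
-- remaining age / hospital / time filters identical to selectPatients
]]>
</sql>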
PS: Spark application concepts
https://blog.csdn.net/zhujianlin1990/article/details/79977560
Limitations
The driver runs inside the local application, so the temporary tables it creates consume a large amount of local JVM memory; if the driver is allocated too little memory there is a risk of OOM.