The problem
1. In Spark Streaming, when the message fetch rate outpaces the consumption rate, a backlog builds up in the queue; shutting down the Spark app at that point loses the queued data.
2. spark.streaming.stopGracefullyOnShutdown works in local mode, but it still does not help on a YARN cluster, even though yarn kill sends SIGTERM (kill -15).
1. Initial idea
Have the application watch a shared variable itself and, when that variable signals shutdown, call JavaStreamingContext#stop(true, true) for a graceful stop.
Start and stop the Spark app with a purpose-built script instead of yarn kill.
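The core of the idea can be sketched with an in-memory flag before bringing HDFS into the picture. Below, an AtomicBoolean stands in for the shared variable and a CountDownLatch stands in for the graceful-stop call (in the real app that callback would be jssc.stop(true, true)); the class and method names are illustrative, not part of any Spark API:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class StopFlagDemo {

    /** Returns true once the monitor thread has observed the cleared flag. */
    static boolean runDemo() throws InterruptedException {
        AtomicBoolean keepRunning = new AtomicBoolean(true); // the "shared variable"
        CountDownLatch stopCalled = new CountDownLatch(1);

        Thread monitor = new Thread(() -> {
            while (keepRunning.get()) {           // poll the shared flag
                try {
                    Thread.sleep(50L);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
            stopCalled.countDown();               // stands in for jssc.stop(true, true)
        }, "stop-flag-monitor");
        monitor.setDaemon(true);                  // must not keep the JVM alive
        monitor.start();

        keepRunning.set(false);                   // the stop script would clear the flag
        return stopCalled.await(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("graceful stop signalled: " + runDemo());
    }
}
```

The monitor must be a daemon thread: once the streaming context has stopped, a non-daemon polling loop would keep the JVM from exiting.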
2. Choosing the shared variable
a. Redis
b. ZooKeeper
c. HDFS
3. Implementation
HDFS is used here: the app polls for a flag file under a known path, and deleting that file triggers a graceful stop. The call site (jssc being the JavaStreamingContext instance):

monitor(groupId, path, () -> jssc.stop(true, true)).start();
// Requires java.io.BufferedReader / java.io.InputStreamReader and an SLF4J logger.
private static Thread monitor(String groupId, String dirPath, Runnable task) {
    Thread t = new Thread(() -> {
        // Build the flag-file path: <dirPath>/<groupId>
        String path = dirPath.endsWith("/") ? dirPath + groupId : dirPath + "/" + groupId;
        while (true) {
            String cmd = "hdfs dfs -ls " + path;
            try {
                Thread.sleep(5000L);
                Process process = Runtime.getRuntime().exec(cmd);
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(process.getInputStream(), "utf-8"))) {
                    String line = br.readLine();
                    // Flag file is gone: trigger the graceful stop, then exit the loop.
                    if (line == null || !line.contains(path)) {
                        logger.warn("flag file not found ({}) | stopping Spark App from monitor thread!", line);
                        task.run();
                        return;
                    }
                }
            } catch (InterruptedException e) {
                logger.error("monitor thread has been interrupted | {}", e.getMessage());
                Thread.currentThread().interrupt();
                return;
            } catch (Exception e) {
                logger.error("failed to check flag file on hdfs | {}", cmd);
            }
        }
    }, groupId + "_monitor_thread");
    t.setDaemon(true);
    return t;
}
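Shelling out to hdfs dfs -ls works, but spawning a process every five seconds is heavyweight and fragile: it assumes the hdfs CLI is on the PATH and parses its human-readable output. A lighter option is to check for the flag file through a filesystem API. Below is the same polling pattern against the local filesystem with java.nio so it is self-contained; in a real deployment the Files.exists check would be replaced by org.apache.hadoop.fs.FileSystem#exists on the HDFS path (that substitution is an assumption here, not shown):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicBoolean;

public class FlagFileMonitor {

    /** Polls for the flag file; runs the stop task once when it disappears. */
    static Thread monitor(Path flagFile, long pollMillis, Runnable stopTask) {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // On HDFS this check would be fs.exists(new org.apache.hadoop.fs.Path(...)).
                if (!Files.exists(flagFile)) {
                    stopTask.run();   // e.g. jssc.stop(true, true)
                    return;           // stop exactly once, then exit the monitor
                }
            }
        }, "flag-file-monitor");
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) throws Exception {
        Path flag = Files.createTempFile("groupId-", ".flag");
        AtomicBoolean stopped = new AtomicBoolean(false);

        Thread m = monitor(flag, 50L, () -> stopped.set(true));
        m.start();

        Files.delete(flag);   // the stop script would do: hdfs dfs -rm <path>
        m.join(5_000L);
        System.out.println("stopped=" + stopped.get());
    }
}
```

With either variant, the stop script reduces to deleting the flag file and waiting for the YARN application to finish on its own.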