Background
Due to the particular nature of our project, we need to fetch the logs of tasks that Spark submits to YARN programmatically.
What we need to retrieve are the stderr and stdout logs (the two shown on the container log page of the YARN web UI).
Some may ask: what if the logs are only retained for a short time and you can't get them? For that case, I only fetch whatever YARN still has; if the logs can no longer be found on YARN, then of course I can't retrieve them either (there are other remedies, such as enabling log aggregation and then reading the logs with yarn logs -applicationId <appId>).
This article only covers fetching the corresponding logs programmatically after a YARN task has finished; other scenarios are out of scope for now.
By the way, the log-fetching code is written in Scala, and the snippets pasted here have been trimmed (copying them verbatim probably won't work); they are only meant to show one possible approach.
Step 1
Fetch the app info from the ResourceManager REST API using the appId
import org.apache.http.client.methods.{CloseableHttpResponse, HttpGet}
import org.apache.http.impl.client.HttpClients
import org.apache.http.util.EntityUtils

//query the ResourceManager REST API for the application's info
def getFlowLog(appID: String): String = {
  val url = "http://10.0.0.0:8088/ws/v1/cluster/apps/" + appID
  val client = HttpClients.createDefault()
  val get: HttpGet = new HttpGet(url)
  val response: CloseableHttpResponse = client.execute(get)
  val str = EntityUtils.toString(response.getEntity, "UTF-8")
  response.close()
  client.close()
  str
}
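One detail worth making explicit: the ResourceManager REST API chooses its response format from the HTTP Accept header. A browser typically receives XML, while a client that sends no Accept header (like the one above) gets JSON by default, which is what Step 2 parses. If you want to rule out surprises, you can pin the format yourself; a minimal sketch:

//Sketch: explicitly ask the RM for JSON so the parser in Step 2 always
//gets the format it expects (assumption: your RM honors the Accept header,
//which stock Hadoop does)
val get = new HttpGet("http://10.0.0.0:8088/ws/v1/cluster/apps/" + appID)
get.setHeader("Accept", "application/json")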
The returned str contains data like the following (rendered as XML here; the JSON form carries the same fields). The amContainerLogs field is the path where the logs are stored; the URL is incomplete and has to be completed by hand:
<app>
<id>application_1652178373342_4200</id>
<user>root</user>
<name>leiTestFlow_ChildFlow_3</name>
<queue>default</queue>
<state>FINISHED</state>
<finalStatus>SUCCEEDED</finalStatus>
<progress>100.0</progress>
<trackingUI>History</trackingUI>
<trackingUrl>http://master:8088/proxy/application_1652178373342_4200/A</trackingUrl>
<diagnostics/>
<clusterId>1652178373342</clusterId>
<applicationType>SPARK</applicationType>
<applicationTags/>
<startedTime>1656999960145</startedTime>
<finishedTime>1656999976012</finishedTime>
<elapsedTime>15867</elapsedTime>
<amContainerLogs>http://slave2:8042/node/containerlogs/container_1652178373342_4200_01_000001/root</amContainerLogs>
<amHostHttpAddress>slave2:8042</amHostHttpAddress>
<allocatedMB>-1</allocatedMB>
<allocatedVCores>-1</allocatedVCores>
<runningContainers>-1</runningContainers>
<memorySeconds>59446</memorySeconds>
<vcoreSeconds>29</vcoreSeconds>
<preemptedResourceMB>0</preemptedResourceMB>
<preemptedResourceVCores>0</preemptedResourceVCores>
<numNonAMContainerPreempted>0</numNonAMContainerPreempted>
<numAMContainerPreempted>0</numAMContainerPreempted>
</app>
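For reference, the JSON form of the same response, which is what the code in Step 2 actually parses, is shaped roughly like this (abridged to the fields we use):

{
  "app": {
    "id": "application_1652178373342_4200",
    "name": "leiTestFlow_ChildFlow_3",
    "amContainerLogs": "http://slave2:8042/node/containerlogs/container_1652178373342_4200_01_000001/root"
  }
}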
Step 2
Parse the data obtained in Step 1 (mainly to extract the amContainerLogs field)
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}
import scala.collection.mutable
import scala.util.parsing.json.JSON
import org.apache.http.client.methods.{CloseableHttpResponse, HttpGet}
import org.apache.http.impl.client.HttpClients

//appInfo: the app info obtained in Step 1; the method returns the input stream of a zip archive
//Note: it returns an input stream because my Scala endpoint builds its akka response from one;
//adjust the return type to whatever technology you are using
def getFlowLogStream(appInfo: String): ByteArrayInputStream = {
  val jsonMap: Map[String, Any] = JSON.parseFull(appInfo).get.asInstanceOf[Map[String, Any]]
  val appMap: Map[String, Any] = jsonMap("app").asInstanceOf[Map[String, Any]]
  val logAddress = appMap.getOrElse("amContainerLogs", "").toString //base URL of the task's logs
  //complete the two log URLs by hand
  val logStdoutAddress = logAddress + "/stdout/?start=0" //first log URL
  val logStderrAddress = logAddress + "/stderr/?start=0" //second log URL
  val client = HttpClients.createDefault()
  //request the stdout log
  val logStdoutGet: HttpGet = new HttpGet(logStdoutAddress)
  val logStdoutResponse: CloseableHttpResponse = client.execute(logStdoutGet)
  val logStdoutInputStream = logStdoutResponse.getEntity.getContent //stdout stream
  //request the stderr log
  val logStderrGet: HttpGet = new HttpGet(logStderrAddress)
  val logStderrResponse: CloseableHttpResponse = client.execute(logStderrGet)
  val logStderrInputStream = logStderrResponse.getEntity.getContent //stderr stream
  val map = mutable.Map[String, InputStream]()
  map += ("stdoutLog" -> logStdoutInputStream)
  map += ("stderrLog" -> logStderrInputStream)
  //bundle both logs into one in-memory zip archive
  val byteArrayOutputStream = new ByteArrayOutputStream()
  val zos = new ZipOutputStream(byteArrayOutputStream)
  val buffer: Array[Byte] = new Array[Byte](1024 * 1024) //reusable copy buffer, allocated once
  for (elem <- map) {
    zos.putNextEntry(new ZipEntry(elem._1))
    val inputStream: InputStream = elem._2
    var read = inputStream.read(buffer)
    while (read != -1) {
      zos.write(buffer, 0, read)
      read = inputStream.read(buffer)
    }
    inputStream.close()
    zos.closeEntry()
  }
  //a third entry carrying the task name
  zos.putNextEntry(new ZipEntry("flowName"))
  zos.write(appMap.getOrElse("name", "").toString.getBytes)
  zos.closeEntry()
  zos.close()
  logStdoutResponse.close()
  logStderrResponse.close()
  client.close()
  //the final archive holds 3 entries: flowName (the task name), stdoutLog and stderrLog (the log files)
  new ByteArrayInputStream(byteArrayOutputStream.toByteArray)
}
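Before wiring this into the HTTP layer, it helps to sanity-check the archive locally. A quick sketch that reads the three entries back out of the zip (my assumptions: commons-io is on the classpath for IOUtils, and appID is the same one used in Step 1):

import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream
import org.apache.commons.io.IOUtils

//Sketch: list the zip entries and the size of each one's content
val zis = new ZipInputStream(getFlowLogStream(getFlowLog(appID)))
var entry = zis.getNextEntry
while (entry != null) {
  val content = IOUtils.toString(zis, StandardCharsets.UTF_8)
  println(entry.getName + " -> " + content.length + " chars")
  entry = zis.getNextEntry
}
zis.close()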
//calling the methods and building the return value
val appInfo = getFlowLog(appID)
val byteArrayInputStream = getFlowLogStream(appInfo)
//Because I return the data through akka, getFlowLogStream has to hand me an input stream.
//I later noticed StreamConverters.fromOutputStream(() => OutputStream), but it materializes a
//Sink[ByteString, Future[IOResult]] (something you write bytes into), not a Source, so its
//result cannot be used as the entity of an akka HttpResponse.
//If you are not returning data through an akka response, fromOutputStream may be worth a look.
val returnValue: scaladsl.Source[ByteString, Future[IOResult]] = StreamConverters.fromInputStream(() => byteArrayInputStream)
Future.successful(HttpResponse(SUCCESS_CODE, entity = HttpEntity(ContentTypes.`application/octet-stream`, returnValue)))
Step 3
On the caller side, parse the returned file stream
CloseableHttpResponse response = HttpUtils.doGet("API address", "API params", 5 * 1000);
if (response == null) {
    logger.info("call failed, return is null");
    return null;
} else {
    if (response.getStatusLine().getStatusCode() == 200) {
        try {
            InputStream inputStream = response.getEntity().getContent();
            ZipInputStream zipInputStream = new ZipInputStream(inputStream);
            ZipEntry zipEntry = null;
            Map<String, String> tempMap = new HashMap<>();
            //zipInputStream holds 3 entries: the two log streams plus flowName
            while ((zipEntry = zipInputStream.getNextEntry()) != null) {
                String key = zipEntry.getName();
                String content = IOUtils.toString(zipInputStream, StandardCharsets.UTF_8);
                tempMap.put(key, content);
            }
            zipInputStream.close();
            data.add(tempMap); //"data" comes from the surrounding (trimmed) code
        } catch (IOException e) {
            logger.error("Interface call error", e);
        }
    } else {
        logger.error("Interface call error, status is " + response.getStatusLine().getStatusCode());
        return null;
    }
}
The final result: each map ends up holding the flowName plus the full stdout and stderr contents (the screenshot I originally attached here showed the logs of more than one task).
Note that what you get this way is the whole log page (HTML tags and all). That was already enough for my feature, so if you only want the raw log content, you'll have to dig into that yourself; one idea is sketched below.
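A minimal sketch of that idea, assuming Jsoup is on the classpath and that the NodeManager log page wraps the raw log text in pre tags (true for the Hadoop 2.x web UI I was using; verify on your version):

import org.jsoup.Jsoup

//Sketch: strip the HTML wrapper and keep only the raw log text.
//Assumes the log content lives inside <pre> elements on the page.
def extractPlainLog(html: String): String =
  Jsoup.parse(html).select("pre").text()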
One more small issue: the log path seems to have changed in Hadoop 3.0+ (I haven't had time to check; if anyone knows the details, please leave a comment, thanks). So adjust the log URLs assembled in Step 2 to match the Hadoop version you are using.