解决SkyWalking在gateway组件中logback日志链路丢失
契机
吴晟大佬:Why ping me? The plugin is not written by me. I don’t know anything about agent running in Spring cloud.
一年半以前,公司就从单体应用逐步迁移到微服务管理。奈何业务绑定太深,一直在两边混合开发。由于迁移到微服务的进程完成了一大半了,之前在单体项目中打的链路追踪就不太好用了。随即想到了SkyWalking,随即发现了一个大坑,那就是在gateway组件中logback日志会丢失TraceId,网上搜了一圈都没有解决问题,我又跑到SkyWalking的issue中搜了一大圈,发现大佬以下回答,还是得自己想办法。
场景复现
版本
- spring-cloud-gateway-dependencies = 3.1.3
- skywalking-oap-server:9.3.0
- skywalking-ui:9.3.0
- apache-skywalking-java-agent:9.0.0
- 假设有gateway+user两个模块
skyWalking环境搭建
#先搞个elasticsearch 略
#docker搭建skywalking-oap-server
docker run -d \
--name skywalking-oap \
--privileged=true \
-p 11800:11800 \
-p 12800:12800 \
--restart always \
-e SW_CORE_RECORD_DATA_TTL=15 \
-e SW_CORE_METRICS_DATA_TTL=15 \
-e SW_ENABLE_UPDATE_UI_TEMPLATE="true" \
-e SW_HEALTH_CHECKER=default \
-e TZ=Asia/Shanghai \
-e SW_STORAGE=elasticsearch \
-e JAVA_OPTS="-Xms512m -Xmx1024m" \
-e SW_STORAGE_ES_CLUSTER_NODES=host.docker.internal:9200 \
-e SW_ES_PASSWORD=guest \
-e SW_ES_USER=elastic [apache/skywalking-oap-server:9.3.0](http://registry.docker-cn.com/apache/skywalking-oap-server:9.3.0)
#docker搭建skywalking-ui
docker run -d \
--name skywalking-ui \
--link skywalking-oap \
--privileged=true \
-p 18080:8080 \
--restart always \
-e TZ=Asia/Shanghai \
-e SW_HEALTH_CHECKER=default \
-e SW_OAP_ADDRESS=http://host.docker.internal:12800 apache/skywalking-ui:9.3.0
#下载agent
wget [https://dlcdn.apache.org/skywalking/java-agent/9.0.0/apache-skywalking-java-agent-9.0.0.tgz](https://dlcdn.apache.org/skywalking/java-agent/9.0.0/apache-skywalking-java-agent-9.0.0.tgz)
agent目录配置
#进入skywalking-agent目录
cd skywalking-agent
#拷贝optional-plugins里面的必要插件到activations
apm-toolkit-logback-1.x-activation-9.0.0.jar
apm-toolkit-logging-common-9.0.0.jar
apm-toolkit-trace-activation-9.0.0.jar
apm-toolkit-webflux-activation-9.0.0.jar
项目运行配置
gateway和user模块都配上一下参数,注意SW_AGENT_NAME要变一下
#vm
-javaagent:/Users/y/Documents/development/skywalking-agent/skywalking-agent.jar
#env
MYSQL_HOST=127.0.0.1;SW_AGENT_NAME=gw;SW_AGENT_COLLECTOR_BACKEND_SERVICES=127.0.0.1:11800;SW_AGENT_SPAN_LIMIT=2000
Maven依赖
<dependency>
<groupId>org.apache.skywalking</groupId>
<artifactId>apm-toolkit-trace</artifactId>
<version>9.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.skywalking</groupId>
<artifactId>apm-toolkit-logback-1.x</artifactId>
<version>9.0.0</version>
</dependency>
logback配置
<?xml version="1.0" encoding="UTF-8"?>
<configuration debug="false" scan="false">
<!-- 彩色日志格式 -->
<property name="CONSOLE_LOG_PATTERN" value="[%d{yyyy-MM-dd HH:mm:ss.SSS}] [%highlight(%-5level)] [%tid] [%thread] %cyan(%logger{50}:%L) - %msg%n"/>
<!-- 彩色日志依赖的渲染类 -->
<conversionRule conversionWord="clr" converterClass="org.springframework.boot.logging.logback.ColorConverter"/>
<conversionRule conversionWord="wex" converterClass="org.springframework.boot.logging.logback.WhitespaceThrowableProxyConverter"/>
<conversionRule conversionWord="wEx" converterClass="org.springframework.boot.logging.logback.ExtendedWhitespaceThrowableProxyConverter"/>
<appender name="console" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
<layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.TraceIdPatternLogbackLayout">
<pattern>${CONSOLE_LOG_PATTERN}</pattern>
</layout>
</encoder>
</appender>
<root level="INFO">
<appender-ref ref="console"/>
</root>
</configuration>
gateway模块请求日志打印代码
就是打印四条信息:请求开始,请求头,请求结束信息,请求结束返回
@Slf4j
@Component
public class ApiLogFilter implements GlobalFilter, Ordered {
private static final String START_TIME = "startTime";
private static final String X_REAL_IP = "X-Real-IP";
private final DataBuffer emptyBuffer = new NettyDataBufferFactory(ByteBufAllocator.DEFAULT).allocateBuffer(0);
@Override
public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
//请求的基本信息
List<String> ips = exchange.getRequest().getHeaders().get(X_REAL_IP);
String ip = ips != null ? ips.get(0) : null;
String method = exchange.getRequest().getMethodValue();
String path = exchange.getRequest().getURI().getPath();
exchange.getAttributes().put(START_TIME, System.currentTimeMillis());
ServerHttpRequest request = exchange.getRequest();
if (path.contains("api-docs")) {
log.warn("{}正在请求接口文档{}", ip, path);
return chain.filter(exchange.mutate().request(request).build());
}
//打印请求开始日志
log.info("请求开始:\n\n方法:{}\nHost:{}\nip:{}\n请求接口:{}\n", method, exchange.getRequest().getURI().getHost(), ip,
path);
//打印header头
log.info("请求头:\n\n{}\n", JSONUtil.toJsonStr(exchange.getRequest().getHeaders()));
//包装response,在任意地方返回打印结束信息等
ServerHttpResponse originalResponse = exchange.getResponse();
DataBufferFactory bufferFactory = originalResponse.bufferFactory();
ServerHttpResponseDecorator decoratedResponse = new ServerHttpResponseDecorator(originalResponse) {
@Override
@NotNull
public Mono<Void> writeWith(@NotNull Publisher<? extends DataBuffer> body) {
Long startTime = exchange.getAttribute(START_TIME);
if (startTime != null) {
Long executeTime = (System.currentTimeMillis() - startTime);
int code = 500;
if (exchange.getResponse().getStatusCode() != null) {
code = exchange.getResponse().getStatusCode().value();
}
log.info("请求结束信息:\n\n响应状态码:{}\n请求耗时:{}ms\n", code, executeTime);
}
if (body instanceof Flux) {
Flux<? extends DataBuffer> fluxBody = Flux.from(body);
return super.writeWith(fluxBody.map(dataBuffer -> {
byte[] content = new byte[dataBuffer.readableByteCount()];
dataBuffer.read(content);
DataBufferUtils.release(dataBuffer);
String data = new String(content, StandardCharsets.UTF_8);//data
log.info("请求结束返回: \n\n{}\n", data);
return bufferFactory.wrap(content);
}));
} else {
log.error("<--- {} 响应code异常", getStatusCode());
}
MDC.clear();
return super.writeWith(body);
}
};
//打印请求参数
MultiValueMap<String, String> queryParams = request.getQueryParams();
if (!queryParams.isEmpty()) {
log.info("请求参数(url):\n\n{}\n", queryParams);
}
if (HttpMethod.POST.name().equals(method)) {
return DataBufferUtils.join(request.getBody()).defaultIfEmpty(emptyBuffer).flatMap(dataBuffer -> {
byte[] bytes = new byte[dataBuffer.readableByteCount()];
if (PrimitiveArrayUtil.isNotEmpty(bytes)) {
dataBuffer.read(bytes);
String bodyString = new String(bytes, StandardCharsets.UTF_8);
try {
bodyString = UriUtils.decode(bodyString, "UTF-8");
} catch (Exception e) {
log.error("请求参数编码错误有特殊符号", e);
}
log.info("请求参数(form):\n\n{}\n", bodyString);
exchange.getAttributes().put("POST_BODY", bodyString);
DataBufferUtils.release(dataBuffer);
Flux<DataBuffer> cachedFlux = Flux.defer(() -> {
DataBuffer buffer = exchange.getResponse().bufferFactory().wrap(bytes);
return Mono.just(buffer);
});
ServerHttpRequest mutatedRequest = new ServerHttpRequestDecorator(request) {
@Override
public Flux<DataBuffer> getBody() {
return cachedFlux;
}
};
return chain.filter(exchange.mutate().request(mutatedRequest).response(decoratedResponse).build());
} else {
return chain.filter(exchange.mutate().request(request).response(decoratedResponse).build());
}
});
}
return chain.filter(exchange.mutate().request(request).response(decoratedResponse).build());
}
@Override
public int getOrder() {
return Ordered.HIGHEST_PRECEDENCE;
}
}
user模块测试代码
@Slf4j
@RestController
@AllArgsConstructor
@Tag(name = "Test-测试通用")
@RequestMapping("/user/xxx/test")
public class TestController {
@NoLoginRequired
@Operation(summary = "testSkywalking")
@GetMapping(value = "/testSkywalking")
public R<String> testSkywalking() {
log.info("user-testSkywalking");
return R.ok();
}
}
访问测试
通过网关访问testSkywalking测试结果如下
gateway日志
[TID:N/A] [reactor-http-nio-4] ApiLogFilter:77 - 请求开始:
方法:GET
Host:localhost
ip:null
请求接口:/user/xxx/test/testSkywalking
[TID:N/A] [reactor-http-nio-4] com.tommi.pet.gateway.filter.ApiLogFilter:81 - 请求头:
{"Host":["localhost:9999"]}
[TID:N/A] [reactor-http-nio-6] com.tommi.pet.gateway.filter.ApiLogFilter:101 - 请求结束信息:
响应状态码:200
请求耗时:1062ms
[TID:N/A] [reactor-http-nio-6] com.tommi.pet.gateway.filter.ApiLogFilter:111 - 请求结束返回:
{"code":0,"message":null,"data":"1","success":true,"timestamp":1}
user日志
[TID:xx.17057720223290001] [XNIO-1 task-1] TestController:54 - user-testSkywalking
可以明显看到user模块的tid是正常输出的,可是gateway的日志死活都是N/A
尝试解决
分析问题
首先去去看看哪里输出的这个追踪码,没有找到9.0.0的源码,只找到8.7.0的源码,先搞下来研究研究:https://github.com/apache/skywalking/blob/v8.7.0/。然后定位到logback插件,去寻找tid在哪里输出的,最终找到如下:
apm-sniffer/apm-toolkit-activation/apm-toolkit-logback-1.x-activation/src/main/java/org/apache/skywalking/apm/toolkit/activation/log/logback/v1/x/PrintTraceIdInterceptor.java
public class PrintTraceIdInterceptor implements InstanceMethodsAroundInterceptor {
@Override
public void beforeMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes,
MethodInterceptResult result) throws Throwable {
}
@Override
public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes,
Object ret) throws Throwable {
if (!ContextManager.isActive()) {
if (allArguments[0] instanceof EnhancedInstance) {
SkyWalkingContext skyWalkingContext = (SkyWalkingContext) ((EnhancedInstance) allArguments[0]).getSkyWalkingDynamicField();
if (skyWalkingContext != null) {
return "TID:" + skyWalkingContext.getTraceId();
}
}
}
return "TID:" + ContextManager.getGlobalTraceId();
}
@Override
public void handleMethodException(EnhancedInstance objInst, Method method, Object[] allArguments,
Class<?>[] argumentsTypes, Throwable t) {
}
}
观察代码无非就是skyWalkingContext.getTraceId()和ContextManager.xx输出的N/A。
Reator模式又有点陌生,无从下手。无意间从另外一篇博客中发现,其实exchange中是有这个tid的
曲线救国
gateway也就拦截器打印点日志,要不然就是exceptionHandler中打印。我要不然直接把这个值传给skywalking-logback插件就完了嘛。拿什么传呢,对的,使用MDC来传,同一个线程都可以传,并且改动最小。直接修改插件中PrintTraceIdInterceptor的源码如下,然后直接打包这个8.7的skywalking-logback插件替换掉原本apm-toolkit-logback-1.x-activation-9.0.0.jar插件(完全可用)
@Override
public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes,
Object ret) throws Throwable {
String result;
if (!ContextManager.isActive()) {
if (allArguments[0] instanceof EnhancedInstance) {
SkyWalkingContext skyWalkingContext = (SkyWalkingContext) ((EnhancedInstance) allArguments[0]).getSkyWalkingDynamicField();
if (skyWalkingContext != null) {
result = "TID:" + skyWalkingContext.getTraceId();
return result;
} else {
String tid = MDC.get("tid");
if (tid != null) {
return "TID:" + tid;
}
}
}
}
return "TID:" + ContextManager.getGlobalTraceId();
}
然后再搞一个工具类
@Slf4j
public class SkywalkingUtil {
/**
* tid放入MDC
*
* @param exchange
*/
public static void putTidIntoMdc(ServerWebExchange exchange) {
try {
Object entrySpanInstance = exchange.getAttributes().get("SKYWALKING_SPAN");
if (ObjectUtil.isEmpty(entrySpanInstance)) {
return;
}
Class<?> entrySpanClazz = entrySpanInstance.getClass().getSuperclass().getSuperclass();
Field field = entrySpanClazz.getDeclaredField("owner");
field.setAccessible(true);
Object ownerInstance = field.get(entrySpanInstance);
Class<?> ownerClazz = ownerInstance.getClass();
Method getTraceId = ownerClazz.getMethod("getReadablePrimaryTraceId");
String traceId = (String) getTraceId.invoke(ownerInstance);
MDC.put("tid", traceId);
} catch (Exception e) {
log.error("gateway追踪码获取失败", e);
}
}
}
然后在gateway拦截器日志打印的地方加上以下代码,但是请求开始和请求结束是两个不同的线程啊,无法使用MDC来传递,架不住打印日志的地方都使用的一个exchange,当然为了保险你可以在每一个gatway模块的log之前加上以下语句
//打印开始日志的地方
SkywalkingUtil.putTidIntoMdc(exchange);
......
//打印结束日志的地方
SkywalkingUtil.putTidIntoMdc(exchange);
验证
gateway日志
[TID:xx.17057741377320011] [reactor-http-nio-2] ApiLogFilter:77 - 请求开始:
方法:GET
Host:localhost
ip:null
请求接口:/user/user/pet/test/testSkywalking
[TID:xx.17057741377320011] [reactor-http-nio-2] ApiLogFilter:81 - 请求头:
{"Host":["localhost:9999"]}
[TID:xx.17057741377320011] [reactor-http-nio-7] ApiLogFilter:101 - 请求结束信息:
响应状态码:200
请求耗时:25ms
[TID:xx.17057741377320011] [reactor-http-nio-7] ApiLogFilter:111 - 请求结束返回:
{"code":0,"message":null,"data":"1748769599551352832","success":true,"timestamp":1705774137758,"extraResult":""}
user日志
[TID:xx.17057741377320011] [XNIO-1 task-1] TestController:54 - user-testSkywalking
确实可用,完美解决
总结
- 就是把apm-toolkit-logback-1.x-activation插件改了点代码替换掉
- 然后在gatway模块的日志前面用MDC传了点值进到插件里面
- 相当取巧,因为gatway也没啥日志
- 目前看来没什么问题