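1 Background
Every now and then our ops colleagues bring up that we should write a circuit-breaker module for our financial gateway system as soon as possible. Without one they never feel at ease, worrying that some day a failing business system will drag the gateway down with it. The financial industry is in a slump right now, retail investors are still getting fleeced and in no mood to trade, system traffic is low, and we have time to spare, so I decided to get this done now.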
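2 Approach
I skimmed the user guides and source code of several common circuit-breaker frameworks. Their approaches are much the same: they all revolve around three states.
These three states determine the breaker's current behavior:
- Closed: monitor whether the failure metric within the current time window stays inside the configured range; as soon as it exceeds the limit, open the breaker;
- Open: reject all requests until the open state times out, at which point the breaker switches to half-open;
- Half-open: tentatively let a small number of requests through and monitor whether their results exceed the failure metric; if not, consider the system recovered and close the breaker, otherwise go back to open;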
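The three states above form a small state machine. Here is a minimal sketch of the transitions, not taken from any particular framework: the names `BreakerState` and `BreakerTransitions` are made up for illustration, and the two boolean inputs are simplifying assumptions standing in for the real metric check and timers.

```java
/** Hypothetical names, for illustration only. */
enum BreakerState { CLOSED, OPEN, HALF_OPEN }

final class BreakerTransitions {
    private BreakerTransitions() {}

    /**
     * Computes the next state.
     *
     * @param failureRateExceeded whether the monitored failure metric is over its limit
     * @param timedOut            whether the current OPEN or HALF_OPEN period has elapsed
     */
    static BreakerState next(BreakerState current, boolean failureRateExceeded, boolean timedOut) {
        switch (current) {
            case CLOSED:
                // Trip as soon as the metric is over the limit.
                return failureRateExceeded ? BreakerState.OPEN : BreakerState.CLOSED;
            case OPEN:
                // Stay open (rejecting everything) until the open timeout expires.
                return timedOut ? BreakerState.HALF_OPEN : BreakerState.OPEN;
            case HALF_OPEN:
                // Assumption: the verdict is taken once the trial period is over.
                if (!timedOut) return BreakerState.HALF_OPEN;
                return failureRateExceeded ? BreakerState.OPEN : BreakerState.CLOSED;
            default:
                throw new IllegalStateException("unknown state: " + current);
        }
    }
}
```

For example, `BreakerTransitions.next(BreakerState.OPEN, false, true)` yields `HALF_OPEN`, while the same call with `timedOut = false` keeps the breaker open.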
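I originally wanted to build directly on an existing circuit-breaker framework, but these frameworks carry too many features and handle multithreaded concurrency by locking. They work well for business systems but are still too heavy for a gateway, so I had to design my own.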
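First, the closed state. In this state a window is needed to collect sample data, which provides the basis for computing the metric and judging whether the system is healthy. For the window definition I looked at resilience4j, which offers two window types: time-based and count-based. A circuit breaker essentially uses the current failure metric to judge whether the system is working normally, so the metric has to be fresh. A count-based window can hold stale samples, which makes it hard to judge the current state accurately, so I implemented only the time-based window. Next, a minimum sample count for the window has to be defined. This is easy to understand: with too few samples the judgment is unreliable and may produce false positives. Computing the failure rate only requires recording the number of failed responses and the total number of responses.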
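To make the closed-state rule concrete, here is a minimal sketch of the trip decision: a percentage failure rate guarded by a minimum sample count. The class and parameter names are hypothetical, and integer percentages are an assumption chosen for simplicity.

```java
/** Hypothetical helper, for illustration only. */
final class ClosedStateCheck {
    private ClosedStateCheck() {}

    /**
     * @param failed           failed responses seen in the window
     * @param total            all responses seen in the window
     * @param minSamples       minimum responses required before judging
     * @param thresholdPercent failure-rate threshold, in percent
     * @return true if the breaker should open
     */
    static boolean shouldOpen(long failed, long total, long minSamples, int thresholdPercent) {
        // Too few samples (or none at all): do not judge, and avoid dividing by zero.
        if (total <= 0 || total < minSamples) return false;
        long failureRate = failed * 100 / total; // integer percent, rounded down
        return failureRate > thresholdPercent;
    }
}
```

With a threshold of 50 and a minimum of 5 samples, 6 failures out of 10 responses (60%) trips the breaker, whereas 3 failures out of 4 responses does not, because 4 samples are too few to judge.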
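Next, the open state. All requests are rejected, but the results of responses to earlier requests continue to be counted. On entering the open state the breaker sets an open timeout; once it expires, the breaker automatically moves to the half-open state.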
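Finally, the half-open state. Here I found that different circuit breakers behave differently. Some go straight to the closed state, some let a single request through as a probe, and others allow a certain number of requests through and compute the failure rate to decide whether to open or close the breaker. My view is that deciding whether the system has recovered still has to be computed from samples over a recent period, but the number of samples must be controlled: the system may not have recovered yet, or may still be warming up, and a flood of requests would put it under heavy pressure. So I added flow control here. In short, within the half-open time window, the breaker closes if the failure rate is below the threshold that opens the breaker; otherwise it opens again.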
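The half-open verdict can be sketched as follows. This is only an illustration with hypothetical names; it also assumes that a half-open period with no responses at all yields no verdict and simply leaves the breaker half-open.

```java
/** Hypothetical names, for illustration only. */
enum HalfOpenVerdict { STAY_HALF_OPEN, REOPEN, CLOSE }

final class HalfOpenCheck {
    private HalfOpenCheck() {}

    /**
     * Judges the samples collected during the half-open period.
     *
     * @param failed           failed responses during the half-open period
     * @param total            all responses during the half-open period
     * @param thresholdPercent the same failure-rate threshold that opens the breaker
     */
    static HalfOpenVerdict evaluate(long failed, long total, int thresholdPercent) {
        // Assumption: with no samples there is nothing to judge, so keep probing.
        if (total <= 0) return HalfOpenVerdict.STAY_HALF_OPEN;
        long failureRate = failed * 100 / total;
        // Below the threshold means recovered; anything else reopens the breaker.
        return failureRate < thresholdPercent ? HalfOpenVerdict.CLOSE : HalfOpenVerdict.REOPEN;
    }
}
```

With a threshold of 50, one failure out of four trial responses (25%) closes the breaker, while two out of four (50%) reopens it.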
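3 Design
Because data within the window has to be aggregated, a circular array can record the statistics for each time unit, with a cursor field pointing at the array element for the current time. Each time a sample is processed, the cursor locates the element corresponding to the current time. I use a scheduled thread pool that advances the cursor once per time unit and updates the metrics for the whole window. In effect, sliding the window is handed to the timer thread rather than done by business threads. This reduces the load on business threads and also paves the way for a lock-free circuit breaker.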
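The circular array and cursor can be sketched in isolation like this. It is a deliberately single-threaded illustration with a made-up class name `RingWindow`: it leaves out the atomics and the scheduler, and `slide()` stands for whatever the timer thread would call once per time unit.

```java
/** Hypothetical single-threaded sketch of a ring-buffer sliding window. */
final class RingWindow {
    private final long[] buckets; // one counter per time unit
    private int cursor;           // index of the bucket for the current time unit
    private long windowSum;       // running total over all buckets

    RingWindow(int size) {
        this.buckets = new long[size];
    }

    /** Records n events in the current time unit. */
    void record(long n) {
        buckets[cursor] += n;
        windowSum += n;
    }

    /** Advances the window by one time unit; the oldest bucket expires and is reused. */
    void slide() {
        int next = (cursor + 1) % buckets.length;
        windowSum -= buckets[next]; // drop the expiring bucket from the running total
        buckets[next] = 0;
        cursor = next;
    }

    long sum() {
        return windowSum;
    }
}
```

Keeping a running total means reading the window metric never requires summing the whole array; each slide only subtracts the one bucket that expires.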
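In addition, to improve performance, all the statistics are packed into a single AtomicLong, which also allows efficient handling under multithreading (an AtomicReference would also work here, but in terms of memory footprint and GC pressure, AtomicLong is clearly lighter). With a long, overflow has to be considered, but a field about 20 bits wide is enough for most scenarios (the demo below uses 20 or 21 bits per counter). The total request count records requests within the half-open window and is used to detect when traffic exceeds the limit in the half-open state.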
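As a small worked example of packing several counters into one long, here is a sketch that assumes a layout of 20 bits for requests, 21 for responses, 21 for failures and 2 for the status, from low to high bits. The class `PackedStats` and its method names are hypothetical.

```java
/** Hypothetical sketch: three counters and a 2-bit status packed into one long. */
final class PackedStats {
    private PackedStats() {}

    static final int REQ_BITS = 20, RESP_BITS = 21, FAIL_BITS = 21;
    static final int RESP_SHIFT = REQ_BITS;                  // 20
    static final int FAIL_SHIFT = REQ_BITS + RESP_BITS;      // 41
    static final int STATUS_SHIFT = FAIL_SHIFT + FAIL_BITS;  // 62

    static long mask(int bits) {
        return (1L << bits) - 1;
    }

    /** Builds a packed value; counters wider than their field are truncated. */
    static long pack(long status, long failures, long responses, long requests) {
        return (status << STATUS_SHIFT)
                | ((failures & mask(FAIL_BITS)) << FAIL_SHIFT)
                | ((responses & mask(RESP_BITS)) << RESP_SHIFT)
                | (requests & mask(REQ_BITS));
    }

    static long requests(long v)  { return v & mask(REQ_BITS); }
    static long responses(long v) { return (v >>> RESP_SHIFT) & mask(RESP_BITS); }
    static long failures(long v)  { return (v >>> FAIL_SHIFT) & mask(FAIL_BITS); }
    static long status(long v)    { return v >>> STATUS_SHIFT; } // unsigned shift: status uses the sign bit

    public static void main(String[] args) {
        long v = pack(3, 7, 100, 42);
        System.out.println(status(v) + " " + failures(v) + " " + responses(v) + " " + requests(v));
    }
}
```

A 20-bit field holds values up to 1,048,575, so under this layout a counter silently wraps once it passes that; the window size and expected request rate should keep counts well under the limit.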
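Cache-line false sharing also needs to be considered here. Most operations locate the current time unit's statistic through the cursor, so this variable is read very frequently. If changes to neighboring memory invalidate the CPU cache line holding cursor, performance suffers, so I added some byte padding around the cursor variable to keep reads of it efficient. I checked, and the cache lines on our machines are all 64 bytes, so for now I only pad to 64 bytes.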
The design can be illustrated with the figure below:
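The timer thread does the following:
- If the breaker is closed, check whether the current window statistics exceed the failure-rate threshold; if so, update the breaker state to open;
- If the breaker is open, update the countdown; once it has timed out, switch the breaker state to half-open;
- If the breaker is half-open, update the countdown; once it has timed out, update the breaker state according to the failure rate; if there is no sample data yet, stay half-open;
- Update the window statistics by subtracting the statistics of the unit after the cursor (WindowStatistic - CircularBuffer(cursor+1));
- Move the cursor to the next unit;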
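The business threads report request and response events and act on the breaker's feedback. For a request event the logic is:
- Read the breaker state from the window statistics: if closed, accept the request; if open, reject it outright; if half-open, first check whether traffic exceeds the limit, reject if it does and accept if it does not;
- If the breaker accepts the request, update the window statistics and the statistics of the current time unit;
For a response event, just update the window statistics and the statistics of the current time unit.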
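The request-side decision can be sketched as a pure function. The class `Admission`, its result codes and its parameters are hypothetical, and the sketch only decides; updating the counters after an accept is left out.

```java
/** Hypothetical sketch of the admission decision for a request event. */
final class Admission {
    private Admission() {}

    enum State { CLOSED, OPEN, HALF_OPEN }

    static final int ACCEPT = 0;
    static final int REJECT_OPEN = 1;             // breaker is open
    static final int REJECT_HALF_OPEN_LIMIT = 2;  // half-open and over the traffic limit

    /**
     * @param state                 current breaker state
     * @param halfOpenRequestsSoFar requests already admitted in the half-open window
     * @param halfOpenMax           maximum requests allowed in the half-open window
     */
    static int decide(State state, long halfOpenRequestsSoFar, long halfOpenMax) {
        switch (state) {
            case CLOSED:
                return ACCEPT;
            case OPEN:
                return REJECT_OPEN;
            default:
                // HALF_OPEN: admit only while this request still fits under the cap.
                return halfOpenRequestsSoFar + 1 > halfOpenMax ? REJECT_HALF_OPEN_LIMIT : ACCEPT;
        }
    }
}
```

With a cap of 1, the first half-open request is accepted (`decide(State.HALF_OPEN, 0, 1)` returns `ACCEPT`) and the second is rejected with `REJECT_HALF_OPEN_LIMIT`.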
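From what these two kinds of threads do, we can see that updates to the current time unit's statistic and to the window aggregate can conflict. Because these two values are stored in an AtomicLongArray and an AtomicLong respectively, they can be updated efficiently with CAS.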
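The CAS retry pattern on a packed value looks roughly like this. It is a sketch under assumptions: the class `CasCounter` is hypothetical, and the counter is taken to live in the low 20 bits of the long, with unrelated fields above it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: lock-free increment of one field inside a packed AtomicLong. */
final class CasCounter {
    private CasCounter() {}

    private static final long LOW_MASK = (1L << 20) - 1; // assumed 20-bit counter field

    /** Increments the low field and leaves all higher bits untouched. */
    static void incrementLowField(AtomicLong packed) {
        long current;
        long updated;
        do {
            current = packed.get();
            long field = (current & LOW_MASK) + 1;
            updated = (current & ~LOW_MASK) | (field & LOW_MASK);
            // If another thread changed the value in between, the CAS fails and we retry.
        } while (!packed.compareAndSet(current, updated));
    }

    /** Runs several threads against one packed value and returns the final low field. */
    static long runDemo(int threads, int perThread) {
        AtomicLong packed = new AtomicLong(5L << 20); // some unrelated data in the upper bits
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) incrementLowField(packed);
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return packed.get() & LOW_MASK;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000)); // prints 4000
    }
}
```

No increment is lost even though no lock is taken: a thread that loses the race simply re-reads the value and tries again.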
4 Demo code:
CircuitBreaker.java
package foo;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
class Log {
    public static void print(String str) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        String date = format.format(new Date());
        synchronized (Log.class) {
            System.out.println("###" + Thread.currentThread().getId() + " " + date + ": " + str);
        }
    }
}
class StatisticHelper {
    /* Bit layout of a packed statistic, from low to high bits:
     * total requests (20) | total responses (21) | failed responses (21) | status (2) */
    private final static long TOTAL_REQUEST_FIELD_BITS = 20;
    private final static long TOTAL_RESPONSE_FIELD_BITS = 21;
    private final static long FAILURE_RESPONSE_FIELD_BITS = 21;
    private final static long STATUS_FIELD_BITS = 2;
    private final static long TOTAL_REQUEST_FIELD_MASK = (1L << TOTAL_REQUEST_FIELD_BITS) - 1;
    private final static long TOTAL_RESPONSE_FIELD_MASK = (1L << TOTAL_RESPONSE_FIELD_BITS) - 1;
    private final static long FAILURE_RESPONSE_FIELD_MASK = (1L << FAILURE_RESPONSE_FIELD_BITS) - 1;
    private final static long TOTAL_REQUEST_FIELD_UNMASK = ~TOTAL_REQUEST_FIELD_MASK;
    private final static long TOTAL_RESPONSE_FIELD_UNMASK = ~(TOTAL_RESPONSE_FIELD_MASK << TOTAL_REQUEST_FIELD_BITS);
    private final static long FAILURE_RESPONSE_FIELD_UNMASK = ~(FAILURE_RESPONSE_FIELD_MASK << (TOTAL_REQUEST_FIELD_BITS + TOTAL_RESPONSE_FIELD_BITS));

    static long getTotalRequest(long statistic) {
        return TOTAL_REQUEST_FIELD_MASK & statistic;
    }

    static long setTotalRequest(long statistic, long totalRequest) {
        return (TOTAL_REQUEST_FIELD_UNMASK & statistic) | (TOTAL_REQUEST_FIELD_MASK & totalRequest);
    }

    static long getTotalResponse(long statistic) {
        return TOTAL_RESPONSE_FIELD_MASK & (statistic >>> TOTAL_REQUEST_FIELD_BITS);
    }

    static long setTotalResponse(long statistic, long totalResponse) {
        return (TOTAL_RESPONSE_FIELD_UNMASK & statistic) | ((totalResponse & TOTAL_RESPONSE_FIELD_MASK) << TOTAL_REQUEST_FIELD_BITS);
    }

    static long getFailureResponse(long statistic) {
        return FAILURE_RESPONSE_FIELD_MASK & (statistic >>> (TOTAL_REQUEST_FIELD_BITS + TOTAL_RESPONSE_FIELD_BITS));
    }

    static long setFailureResponse(long statistic, long failureResponse) {
        return (FAILURE_RESPONSE_FIELD_UNMASK & statistic) | ((failureResponse & FAILURE_RESPONSE_FIELD_MASK) << (TOTAL_REQUEST_FIELD_BITS + TOTAL_RESPONSE_FIELD_BITS));
    }

    static long getStatus(long statistic) {
        return statistic >>> (Long.SIZE - STATUS_FIELD_BITS);
    }

    static long setStatus(long statistic, long status) {
        return ((statistic << STATUS_FIELD_BITS) >>> STATUS_FIELD_BITS) | (status << (Long.SIZE - STATUS_FIELD_BITS));
    }

    /* Replace only the status bits, retrying until the CAS succeeds */
    static void casUpdateStatus(AtomicLong statistic, long status) {
        boolean isOk;
        long value;
        do {
            value = statistic.get();
            long actureValue = setStatus(value, status);
            isOk = statistic.compareAndSet(value, actureValue);
        }
        while (!isOk);
    }

    /* Format as [status-]failures-responses-requests */
    static String toString(boolean showStatus, long statistic) {
        String statusStr = "";
        if (showStatus) {
            int status = (int) getStatus(statistic);
            switch (status) {
                case (int) CircuitBreaker.STATUS_CLOSE:
                    statusStr = "CLOSE";
                    break;
                case (int) CircuitBreaker.STATUS_OPEN:
                    statusStr = "OPEN";
                    break;
                case (int) CircuitBreaker.STATUS_HALF_OPEN:
                    statusStr = "HAOP";
                    break;
            }
        }
        long failureResponse = getFailureResponse(statistic);
        long totalResponse = getTotalResponse(statistic);
        long totalRequest = getTotalRequest(statistic);
        StringBuilder sb = new StringBuilder();
        if (showStatus) sb.append(statusStr).append("-");
        sb.append(failureResponse).append("-");
        sb.append(totalResponse).append("-");
        sb.append(totalRequest);
        return sb.toString();
    }

    /* Print the buckets from newest (the cursor) to oldest */
    static String toString(AtomicLongArray slidingWindow, int cursor) {
        StringBuilder sb = new StringBuilder();
        int len = slidingWindow.length();
        String prefix = "";
        for (int i = 0; i < len; i++) {
            long value = slidingWindow.get(cursor >= i ? (cursor - i) : (cursor - i + len));
            sb.append(prefix).append(i).append(")").append(toString(false, value));
            prefix = "\n";
        }
        return sb.toString();
    }
}
class SlidingWindowTask implements Runnable {
    private CircuitBreaker circuitBreaker;
    private int openDuration = 0;
    private int halfOpenDuration = 0;

    protected void slide() {
        int cursor = circuitBreaker.cursor;
        int windowSize = circuitBreaker.slidingWindow.length();
        int nextCursor = (cursor + 1) % windowSize;
        /* The bucket about to be reused: its responses leave the window */
        long nextValue = circuitBreaker.slidingWindow.get(nextCursor);
        long nextTotalResponse = StatisticHelper.getTotalResponse(nextValue);
        long nextFailureResponse = StatisticHelper.getFailureResponse(nextValue);
        /* The bucket leaving the half-open window: its requests leave the request count */
        int tailHalfOpenCursor = cursor + 1 - circuitBreaker.halfOpenDuration;
        if (tailHalfOpenCursor < 0) tailHalfOpenCursor = tailHalfOpenCursor + windowSize;
        long tailHalfOpenValue = circuitBreaker.slidingWindow.get(tailHalfOpenCursor);
        long nextTotalRequest = StatisticHelper.getTotalRequest(tailHalfOpenValue);
        boolean isOk;
        do {
            long value = circuitBreaker.statistic.get();
            long totalRequest = StatisticHelper.getTotalRequest(value);
            long totalResponse = StatisticHelper.getTotalResponse(value);
            long failureResponse = StatisticHelper.getFailureResponse(value);
            long actureValue = StatisticHelper.setTotalRequest(value, totalRequest - nextTotalRequest);
            actureValue = StatisticHelper.setTotalResponse(actureValue, totalResponse - nextTotalResponse);
            actureValue = StatisticHelper.setFailureResponse(actureValue, failureResponse - nextFailureResponse);
            isOk = circuitBreaker.statistic.compareAndSet(value, actureValue);
        }
        while (!isOk);
        /* Clear the next bucket and point the cursor at it; this is the window sliding */
        circuitBreaker.slidingWindow.set(nextCursor, 0);
        circuitBreaker.cursor = nextCursor;
    }

    public SlidingWindowTask(final CircuitBreaker circuitBreaker) {
        this.circuitBreaker = circuitBreaker;
    }

    @Override
    public void run() {
        long value = circuitBreaker.statistic.get();
        long totalResponse = StatisticHelper.getTotalResponse(value);
        long failureResponse = StatisticHelper.getFailureResponse(value);
        long status = StatisticHelper.getStatus(value);
        /* Currently closed: check whether the failure threshold is exceeded; if so, open the breaker */
        if (status == CircuitBreaker.STATUS_CLOSE) {
            /* Guard against an empty window: dividing by zero would throw, and an uncaught
             * exception suppresses all further runs of a scheduled task */
            long failureRate = totalResponse > 0 ? failureResponse * 100 / totalResponse : 0;
            if (totalResponse >= circuitBreaker.minNumber && failureRate > circuitBreaker.failureRateThreshold) {
                Log.print("(CLOSE) exceed failure rate " + failureRate + "/" + circuitBreaker.failureRateThreshold);
                StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_OPEN);
                openDuration = circuitBreaker.openDuration;
            }
        }
        /* Currently open: check whether the open period has timed out; if so, go half-open */
        else if (status == CircuitBreaker.STATUS_OPEN) {
            if (--openDuration == 0) {
                Log.print("(OPEN) timeout 0/" + circuitBreaker.openDuration);
                StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_HALF_OPEN);
                halfOpenDuration = circuitBreaker.halfOpenDuration;
            }
        }
        /* Currently half-open: check whether the failure threshold is exceeded; if so, reopen the
         * breaker, otherwise close it. If no responses arrived during the half-open period, the
         * failure rate cannot be judged, so do nothing and stay half-open.
         */
        else {
            if (--halfOpenDuration <= 0) {
                long totalResponseSum = 0;
                long failureResponseSum = 0;
                int cursor = circuitBreaker.cursor;
                int slidingWindowSize = circuitBreaker.slidingWindow.length();
                for (int i = 0; i < circuitBreaker.halfOpenDuration; i++) {
                    long bucketValue = circuitBreaker.slidingWindow.get(cursor >= i ? (cursor - i) : (cursor - i + slidingWindowSize));
                    totalResponseSum += StatisticHelper.getTotalResponse(bucketValue);
                    failureResponseSum += StatisticHelper.getFailureResponse(bucketValue);
                }
                /* The threshold can only be checked if the half-open period saw at least one response */
                if (totalResponseSum > 0) {
                    long failureRate = failureResponseSum * 100 / totalResponseSum;
                    /* At or above the threshold: open the breaker and set the open duration */
                    if (failureRate >= circuitBreaker.failureRateThreshold) {
                        Log.print("(HAOP) exceed failure rate " + failureRate + "/" + circuitBreaker.failureRateThreshold);
                        StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_OPEN);
                        openDuration = circuitBreaker.openDuration;
                    }
                    /* Below the threshold: close the breaker */
                    else {
                        Log.print("(HAOP) below failure rate " + failureRate + "/" + circuitBreaker.failureRateThreshold);
                        StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_CLOSE);
                    }
                }
            }
        }
        slide();
        Log.print(StatisticHelper.toString(true, circuitBreaker.statistic.get()) + "\n" + StatisticHelper.toString(circuitBreaker.slidingWindow, circuitBreaker.cursor));
    }
}
class CircuitBreakerScheduler {
    private static int count = 0;
    private ScheduledExecutorService scheduledExecutorService;
    private int bucketDuration;

    public CircuitBreakerScheduler(int poolSize, int bucketDuration) {
        scheduledExecutorService = Executors.newScheduledThreadPool(poolSize, (runnable) -> {
            Thread t = new Thread(runnable, "scheduler-" + count++);
            t.setDaemon(true);
            return t;
        });
        this.bucketDuration = bucketDuration;
    }

    public void registry(final CircuitBreaker circuitBreaker) {
        scheduledExecutorService.scheduleAtFixedRate(new SlidingWindowTask(circuitBreaker), bucketDuration, bucketDuration, TimeUnit.SECONDS);
    }
}
public class CircuitBreaker {
    /* Breaker state constants */
    final static long STATUS_CLOSE = 0b00;
    final static long STATUS_OPEN = 0b10;
    final static long STATUS_HALF_OPEN = 0b11;
    /* Request results */
    final static int SUCCESS = 0; /* request admitted */
    final static int FAILURE_CIRCUIT_BREAKER_OPENED = 1; /* breaker open, request rejected */
    final static int FAILURE_CIRCUIT_BREAKER_HALF_OPENED = 2; /* breaker half-open, request rejected by the traffic limit */
    /* Configuration */
    int slidingWindowSize; /* sliding window size, in time units */
    int minNumber; /* minimum sample count; opening the breaker is only considered once the window holds at least this many samples */
    int failureRateThreshold; /* failure-rate threshold (%) */
    int openDuration; /* how long the breaker stays open (time units; seconds in the demo) */
    int halfOpenDuration; /* how long the breaker stays half-open (time units; seconds in the demo) */
    int haflOpenMaxNumber; /* maximum number of requests during the half-open period */
    /* Core data */
    AtomicLong statistic; /* aggregated statistics for the whole window */
    AtomicLongArray slidingWindow; /* sliding window; one statistic per time unit */
    /* Note: the JVM does not guarantee that fields are laid out in declaration order, so this padding is best-effort */
    private long p1, p2, p3, p4, p5, p6, p7; /* 64-byte cache line padding */
    private int p0;
    volatile int cursor; /* window index for the current time */
    private long p8, p9, p10, p11, p12, p13, p14; /* 64-byte cache line padding */

    public CircuitBreaker(int slidingWindowSize, int minNumber, int failureRateThreshold, int openDuration, int halfOpenDuration, int haflOpenMaxNumber) {
        this.slidingWindowSize = slidingWindowSize;
        this.minNumber = minNumber;
        this.failureRateThreshold = failureRateThreshold;
        this.openDuration = openDuration;
        this.haflOpenMaxNumber = haflOpenMaxNumber;
        /* The half-open duration must not exceed the window size */
        this.halfOpenDuration = halfOpenDuration < slidingWindowSize ? halfOpenDuration : slidingWindowSize;
        this.statistic = new AtomicLong();
        this.slidingWindow = new AtomicLongArray(slidingWindowSize);
        this.cursor = 0;
    }

    public int onRequest() {
        boolean isOk;
        /* Update the aggregated window statistics */
        do {
            long value = statistic.get();
            long status = StatisticHelper.getStatus(value);
            /* Closed: update the aggregate */
            if (status == STATUS_CLOSE) {
                long actureTotalRequest = StatisticHelper.getTotalRequest(value) + 1;
                long actureValue = StatisticHelper.setTotalRequest(value, actureTotalRequest);
                isOk = statistic.compareAndSet(value, actureValue);
            }
            /* Open: reject the request */
            else if (status == STATUS_OPEN) {
                return FAILURE_CIRCUIT_BREAKER_OPENED;
            }
            /* Half-open: check the traffic limit; if not exceeded, update the aggregate */
            else {
                long actureTotalRequest = StatisticHelper.getTotalRequest(value) + 1;
                /* Traffic limit exceeded */
                if (actureTotalRequest > haflOpenMaxNumber) return FAILURE_CIRCUIT_BREAKER_HALF_OPENED;
                else {
                    long actureValue = StatisticHelper.setTotalRequest(value, actureTotalRequest);
                    isOk = statistic.compareAndSet(value, actureValue);
                }
            }
        }
        while (!isOk);
        /* Update the current bucket of the sliding window */
        do {
            long value = slidingWindow.get(cursor);
            long totalRequest = StatisticHelper.getTotalRequest(value);
            long actureValue = StatisticHelper.setTotalRequest(value, totalRequest + 1);
            isOk = slidingWindow.compareAndSet(cursor, value, actureValue);
        }
        while (!isOk);
        return SUCCESS;
    }

    public void onResponse(boolean isSuccess) {
        Log.print("RESPONSE: " + isSuccess);
        boolean isOk;
        /* Update the aggregated window statistics */
        do {
            long value = statistic.get();
            long actureTotalResponse = StatisticHelper.getTotalResponse(value);
            long actureValue = StatisticHelper.setTotalResponse(value, actureTotalResponse + 1);
            if (!isSuccess) {
                long actureFailureResponse = StatisticHelper.getFailureResponse(value);
                actureValue = StatisticHelper.setFailureResponse(actureValue, actureFailureResponse + 1);
            }
            isOk = statistic.compareAndSet(value, actureValue);
        }
        while (!isOk);
        /* Update the current bucket of the sliding window */
        do {
            long value = slidingWindow.get(cursor);
            long totalResponse = StatisticHelper.getTotalResponse(value);
            long actureValue = StatisticHelper.setTotalResponse(value, totalResponse + 1);
            if (!isSuccess) {
                long failureResponse = StatisticHelper.getFailureResponse(value);
                actureValue = StatisticHelper.setFailureResponse(actureValue, failureResponse + 1);
            }
            isOk = slidingWindow.compareAndSet(cursor, value, actureValue);
        }
        while (!isOk);
    }
}
App.java
package foo;
public class App {
    public static void main(String[] args) throws Exception {
        CircuitBreakerScheduler scheduler = new CircuitBreakerScheduler(1, 1);
        CircuitBreaker circuitBreaker = new CircuitBreaker(3, 1, 50, 3, 3, 1);
        scheduler.registry(circuitBreaker);
        Thread t1 = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    int result = circuitBreaker.onRequest();
                    Log.print("REQUEST: " + result);
                    Thread.sleep(1000);
                }
            } catch (Exception ignore) {
            }
        });
        Thread t2 = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    circuitBreaker.onResponse(i % 2 == 0);
                    Thread.sleep(1000);
                }
            } catch (Exception ignore) {
            }
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
Finally, here's hoping the module goes live smoothly next month. ٩(●̮̃•)۶