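In "circuit-breaker", circuit refers to an electrical circuit, so the Chinese rendering 熔断器 (a fuse-style breaker) is very apt. Hystrix is a smart circuit breaker that recovers automatically. It is hosted inside the calling application and protects your system by stopping an error or outage in a dependency from setting off a chain reaction. The principle is fairly simple: within a time window, Hystrix keeps collecting request metrics for the dependency (success, failure, timeout, rejection), and it trips the circuit when the configured condition is met (by default, a request failure rate of 50%). This article is based on hystrix-core 1.5.18 (the project has seen very few updates in recent years, so upgrading to this version is recommended).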
Contents
2.1 How circuit breaker open, circuitBreaker.forceClosed, and circuitBreaker.forceOpen work
3.5 About CLOSED, OPEN, and HALF_OPEN
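1. Basic principle
Statistics works by taking a number of samples, grouping them, and then analyzing the groups. Hystrix does something similar. For example, it records how requests were handled in one-second units (counts of successful, failed, timed-out, and rejected requests) and then evaluates the most recent 10 seconds of data each time. If the failure rate reaches 50%, the circuit trips and requests are no longer passed to the dependency. The following figure is from the Hystrix website:
1.1 Buckets
Suppose requests are tallied per second. Each cell in the figure represents 1 second, and the numbers in a cell are the request counts for each outcome during that second. A cell is called a Bucket.
1.2 Rolling window
Suppose every decision is based on 10 buckets: Hystrix computes the outcomes across those 10 buckets and trips the circuit when the failure rate reaches 50%. Ten one-second buckets span 10 seconds, and that 10-second span is the rolling window. "Rolling" means that, while the circuit is not tripped, each newly completed bucket pushes out the oldest one (the dark bucket [ 23 5 2 0 ] in the figure is the one being discarded).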
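The bucket and rolling-window idea can be sketched in plain Java. This is a minimal illustration, not Hystrix code: the class name `RollingWindowSketch` and its methods are hypothetical, and each bucket is assumed to hold four counts in the order success, failure, timeout, rejection.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration of a rolling window of buckets; not Hystrix code. */
final class RollingWindowSketch {
    /** Each bucket holds {success, failure, timeout, rejection} counts for one interval. */
    private final Deque<long[]> buckets = new ArrayDeque<>();
    private final int maxBuckets;

    RollingWindowSketch(int maxBuckets) {
        this.maxBuckets = maxBuckets;
    }

    /** Appends the newest bucket and evicts the oldest one once the window is full. */
    void addBucket(long success, long failure, long timeout, long rejection) {
        buckets.addLast(new long[] {success, failure, timeout, rejection});
        if (buckets.size() > maxBuckets) {
            buckets.removeFirst();
        }
    }

    /** Percentage of non-successful requests across the buckets currently in the window. */
    int errorPercentage() {
        long total = 0;
        long errors = 0;
        for (long[] b : buckets) {
            total += b[0] + b[1] + b[2] + b[3];
            errors += b[1] + b[2] + b[3];
        }
        return total == 0 ? 0 : (int) (errors * 100 / total);
    }

    int size() {
        return buckets.size();
    }
}
```

With a window of 10 buckets, calling `addBucket` once per second keeps only the last 10 seconds of data, which is the behavior the figure describes.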
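1.3 The official full flow chart
The strategy: keep collecting data and trip the circuit once the condition is met. After tripping, reject all requests for a period (sleepWindow), then let a single request through. If that request succeeds, close the circuit; otherwise keep it open.
Note: parts of the flow chart no longer match the source exactly. For example, markFailure has been removed from the current source.
1.4 What an open circuit breaker means
When the circuit breaker is in the OPEN state, the call path is considered unhealthy. Command execution skips the normal logic and invokes the fallback logic directly.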
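A minimal sketch of that short-circuit behavior, written without Hystrix. The class `FallbackSketch` and its `execute` method are hypothetical names; a real HystrixCommand separates these steps into `run()` and `getFallback()`.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Hypothetical illustration of "open circuit means fallback"; not Hystrix code. */
final class FallbackSketch {
    private FallbackSketch() {
    }

    static <T> T execute(BooleanSupplier circuitOpen, Supplier<T> primary, Supplier<T> fallback) {
        if (circuitOpen.getAsBoolean()) {
            // Circuit is open: skip the normal logic entirely and go straight to the fallback.
            return fallback.get();
        }
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Assumption for this sketch: a failed call also ends in the fallback.
            return fallback.get();
        }
    }
}
```

The point of the first branch is that the dependency is never called while the circuit is open, so a struggling backend gets no additional load.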
2. Configuration
Hystrix's default configuration values are all defined in the HystrixCommandProperties class.
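Property | Description |
metrics.rollingStats.timeInMilliseconds | Duration of the rolling window in milliseconds. Default 10000 (10 s). This is the basic unit the circuit breaker evaluates over. |
metrics.rollingStats.numBuckets | Number of buckets in the rolling window. Default 10. The duration of each bucket is timeInMilliseconds divided by numBuckets. |
circuitBreaker.requestVolumeThreshold | Minimum number of requests in the rolling window before the circuit can trip. Default 20. If the value is 20 and only 19 requests arrive within the window, the circuit does not trip even if all 19 fail, because the sample is too small to be meaningful. |
circuitBreaker.sleepWindowInMilliseconds | How long to wait after the circuit trips before a single trial request is let through to check whether the backend has recovered. Default 5000 (5 s). |
circuitBreaker.errorThresholdPercentage | Error-rate threshold for tripping. With the default of 50, the circuit trips once the failure rate within one rolling window reaches 50%. |
circuitBreaker.enabled | Whether the circuit breaker is enabled. Default true (HystrixCircuitBreakerImpl is used); when false, a NoOpCircuitBreaker is created. |
circuitBreaker.forceClosed | Forces the circuit breaker closed. Default false. |
circuitBreaker.forceOpen | Forces the circuit breaker open. Default false. |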
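How the window settings relate to each other can be checked with a little arithmetic. The helper below is a hypothetical sketch (`WindowMath` is not part of Hystrix). It assumes the rule stated in the Hystrix configuration documentation that timeInMilliseconds must divide evenly by numBuckets.

```java
/** Hypothetical helper showing how the window settings relate; not part of Hystrix. */
final class WindowMath {
    private WindowMath() {
    }

    /** Duration of one bucket, given the rolling window length and the bucket count. */
    static int bucketSizeInMs(int timeInMilliseconds, int numBuckets) {
        if (numBuckets <= 0 || timeInMilliseconds % numBuckets != 0) {
            throw new IllegalArgumentException(
                    "timeInMilliseconds must be a positive multiple of numBuckets");
        }
        return timeInMilliseconds / numBuckets;
    }

    /** The error rate is only evaluated once the window holds at least this many requests. */
    static boolean enoughVolume(long requestsInWindow, int requestVolumeThreshold) {
        return requestsInWindow >= requestVolumeThreshold;
    }
}
```

With the defaults, `bucketSizeInMs(10000, 10)` gives 1000 ms per bucket, and `enoughVolume(19, 20)` is false, which is why 19 failures out of 19 requests do not trip the circuit.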
2.1 How circuit breaker open, circuitBreaker.forceClosed, and circuitBreaker.forceOpen work
// From AbstractCommand
/**
* ForcedOpen | ForcedClosed | CircuitBreaker open due to health ||| Expected Result
*
* T | T | T ||| OPEN (true)
* T | T | F ||| OPEN (true)
* T | F | T ||| OPEN (true)
* T | F | F ||| OPEN (true)
* F | T | T ||| CLOSED (false)
* F | T | F ||| CLOSED (false)
* F | F | T ||| OPEN (true)
* F | F | F ||| CLOSED (false)
*
* @return boolean
*/
public boolean isCircuitBreakerOpen() {
return properties.circuitBreakerForceOpen().get() || (!properties.circuitBreakerForceClosed().get() && circuitBreaker.isOpen());
}
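The truth table in the Javadoc can be reproduced by evaluating the same boolean expression for all eight combinations. This standalone sketch replaces the Hystrix property lookups with plain parameters; the class name `ForceFlagsSketch` is hypothetical.

```java
/** Hypothetical standalone version of the forceOpen / forceClosed precedence rule. */
final class ForceFlagsSketch {
    private ForceFlagsSketch() {
    }

    static boolean isCircuitBreakerOpen(boolean forceOpen, boolean forceClosed, boolean openDueToHealth) {
        // forceOpen wins over everything; forceClosed only overrides the health-based state.
        return forceOpen || (!forceClosed && openDueToHealth);
    }

    public static void main(String[] args) {
        boolean[] values = {true, false};
        for (boolean forceOpen : values) {
            for (boolean forceClosed : values) {
                for (boolean health : values) {
                    boolean open = isCircuitBreakerOpen(forceOpen, forceClosed, health);
                    System.out.printf("%b | %b | %b => %s%n",
                            forceOpen, forceClosed, health, open ? "OPEN" : "CLOSED");
                }
            }
        }
    }
}
```

The precedence is the practical takeaway: when both flags are set, forceOpen takes effect and forceClosed is ignored.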
3. Source code
3.1 HystrixCircuitBreaker
public interface HystrixCircuitBreaker {
/**
* Every {@link HystrixCommand} requests asks this if it is allowed to proceed or not.
* <p>
* This takes into account the half-open logic which allows some requests through when determining if it should be closed again.
*
* @return boolean whether a request should be permitted
*/
public boolean allowRequest();
/**
* Whether the circuit is currently open (tripped).
*
* @return boolean state of circuit breaker
*/
public boolean isOpen();
/**
* Invoked on successful executions from {@link HystrixCommand} as part of feedback mechanism when in a half-open state.
*/
void markSuccess();
}
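3.2 Implementations
As the circuitBreaker.enabled configuration item suggests, HystrixCircuitBreaker has two implementations:
- NoOpCircuitBreaker: an empty implementation, used when the circuit breaker feature is disabled
- HystrixCircuitBreakerImpl: the full circuit breaker implementation
3.3 Initializing HystrixCircuitBreaker
circuitBreaker is a member variable of AbstractCommand (covered earlier). Every command has a circuitBreaker field, and it is instantiated in AbstractCommand.
// From AbstractCommand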
private static HystrixCircuitBreaker initCircuitBreaker(boolean enabled, HystrixCircuitBreaker fromConstructor,
HystrixCommandGroupKey groupKey, HystrixCommandKey commandKey,
HystrixCommandProperties properties, HystrixCommandMetrics metrics) {
if (enabled) { // the circuit breaker is enabled
if (fromConstructor == null) { // none was passed in via the constructor: look up (or create) the one for this commandKey
// get the default implementation of HystrixCircuitBreaker
return HystrixCircuitBreaker.Factory.getInstance(commandKey, groupKey, properties, metrics);
} else {
return fromConstructor;
}
} else {
return new NoOpCircuitBreaker();
}
}
// From HystrixCircuitBreaker.Factory.getInstance: breakers are keyed by commandKey, so each commandKey has its own circuitBreaker
public static HystrixCircuitBreaker getInstance(HystrixCommandKey key, HystrixCommandGroupKey group, HystrixCommandProperties properties, HystrixCommandMetrics metrics) {
// Return the cached instance if one exists
// this should find it for all but the first time
HystrixCircuitBreaker previouslyCached = circuitBreakersByCommand.get(key.name());
if (previouslyCached != null) {
return previouslyCached;
}
// if we get here this is the first time so we need to initialize
// Create and add to the map ... use putIfAbsent to atomically handle the possible race-condition of
// 2 threads hitting this point at the same time and let ConcurrentHashMap provide us our thread-safety
// If 2 threads hit here only one will get added and the other will get a non-null response instead.
// Otherwise create one and cache it
HystrixCircuitBreaker cbForCommand = circuitBreakersByCommand.putIfAbsent(key.name(), new HystrixCircuitBreakerImpl(key, group, properties, metrics));
if (cbForCommand == null) {
// this means the putIfAbsent step just created a new one so let's retrieve and return it
return circuitBreakersByCommand.get(key.name());
} else {
// this means a race occurred and while attempting to 'put' another one got there before
// and we instead retrieved it and will now return it
return cbForCommand;
}
}
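The get-then-putIfAbsent pattern used by the factory can be isolated in a small generic cache. This is a hedged sketch: `PerKeyCache` and its `getInstance` signature are hypothetical, and the factory function stands in for the `new HystrixCircuitBreakerImpl(...)` call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical per-key instance cache using the same putIfAbsent idiom. */
final class PerKeyCache<V> {
    private final ConcurrentHashMap<String, V> byKey = new ConcurrentHashMap<>();

    V getInstance(String key, Function<String, V> factory) {
        V cached = byKey.get(key);
        if (cached != null) {
            return cached; // the common path after the first call
        }
        V created = factory.apply(key);
        // putIfAbsent returns null if our value was stored, or the winner's value if we lost a race.
        V winner = byKey.putIfAbsent(key, created);
        return winner == null ? created : winner;
    }
}
```

A thread that loses the race may construct an instance that is thrown away, but every caller ends up with the same cached object. On Java 8 or later, `computeIfAbsent` expresses the same intent in one call.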
3.4 HystrixCircuitBreakerImpl
HystrixCircuitBreakerImpl is the real circuit breaker implementation, and its source is not particularly complicated.
static class HystrixCircuitBreakerImpl implements HystrixCircuitBreaker {
private final HystrixCommandProperties properties;
private final HystrixCommandMetrics metrics;
/* track whether this circuit is open/closed at any given point in time (default to false==closed) */
private AtomicBoolean circuitOpen = new AtomicBoolean(false);
/* when the circuit was marked open or was last allowed to try a 'singleTest' */
private AtomicLong circuitOpenedOrLastTestedTime = new AtomicLong();
protected HystrixCircuitBreakerImpl(HystrixCommandKey key, HystrixCommandGroupKey commandGroup, HystrixCommandProperties properties, HystrixCommandMetrics metrics) {
this.properties = properties;
this.metrics = metrics;
}
// Close the circuit breaker and reset the metrics
public void markSuccess() {
if (circuitOpen.get()) {
if (circuitOpen.compareAndSet(true, false)) {
//win the thread race to reset metrics
//Unsubscribe from the current stream to reset the health counts stream. This only affects the health counts view,
//and all other metric consumers are unaffected by the reset
metrics.resetStream();
}
}
}
// Whether a command request is allowed to proceed
@Override
public boolean allowRequest() {
if (properties.circuitBreakerForceOpen().get()) {
// properties have asked us to force the circuit open so we will allow NO requests
return false;
}
if (properties.circuitBreakerForceClosed().get()) {
// we still want to allow isOpen() to perform it's calculations so we simulate normal behavior
isOpen();
// properties have asked us to ignore errors so we will ignore the results of isOpen and just allow all traffic through
return true;
}
return !isOpen() || allowSingleTest();
}
// Whether the half-open condition is met
public boolean allowSingleTest() {
long timeCircuitOpenedOrWasLastTested = circuitOpenedOrLastTestedTime.get();
// 1) if the circuit is open
// 2) and it's been longer than 'sleepWindow' since we opened the circuit
if (circuitOpen.get() && System.currentTimeMillis() > timeCircuitOpenedOrWasLastTested + properties.circuitBreakerSleepWindowInMilliseconds().get()) {
// We push the 'circuitOpenedTime' ahead by 'sleepWindow' since we have allowed one request to try.
// If it succeeds the circuit will be closed, otherwise another singleTest will be allowed at the end of the 'sleepWindow'.
if (circuitOpenedOrLastTestedTime.compareAndSet(timeCircuitOpenedOrWasLastTested, System.currentTimeMillis())) {
// if this returns true that means we set the time so we'll return true to allow the singleTest
// if it returned false it means another thread raced us and allowed the singleTest before we did
return true;
}
}
return false;
}
// Decide from metrics.getHealthCounts() whether the circuit should trip open
@Override
public boolean isOpen() {
if (circuitOpen.get()) {
// if we're open we immediately return true and don't bother attempting to 'close' ourself as that is left to allowSingleTest and a subsequent successful test to close
return true;
}
// we're closed, so let's see if errors have made us so we should trip the circuit open
HealthCounts health = metrics.getHealthCounts();
// check if we are past the statisticalWindowVolumeThreshold
if (health.getTotalRequests() < properties.circuitBreakerRequestVolumeThreshold().get()) {
// we are not past the minimum volume threshold for the statisticalWindow so we'll return false immediately and not calculate anything
return false;
}
if (health.getErrorPercentage() < properties.circuitBreakerErrorThresholdPercentage().get()) {
return false;
} else {
// our failure rate is too high, trip the circuit
if (circuitOpen.compareAndSet(false, true)) {
// if the previousValue was false then we want to set the currentTime
circuitOpenedOrLastTestedTime.set(System.currentTimeMillis());
return true;
} else {
// How could previousValue be true? If another thread was going through this code at the same time a race-condition could have
// caused another thread to set it to true already even though we were in the process of doing the same
// In this case, we know the circuit is open, so let the other thread set the currentTime and report back that the circuit is open
return true;
}
}
}
}
As the code above shows, HealthCounts is central: it holds the request statistics for the rolling window.
public static class HealthCounts {
// Total number of requests
private final long totalCount;
// Number of failed requests (failure + timeout + threadPoolRejected + semaphoreRejected)
private final long errorCount;
// Error percentage
private final int errorPercentage;
// Adds one bucket's event counts to the running totals
public HealthCounts plus(long[] eventTypeCounts) {
long updatedTotalCount = totalCount;
long updatedErrorCount = errorCount;
long successCount = eventTypeCounts[HystrixEventType.SUCCESS.ordinal()];
long failureCount = eventTypeCounts[HystrixEventType.FAILURE.ordinal()];
long timeoutCount = eventTypeCounts[HystrixEventType.TIMEOUT.ordinal()];
long threadPoolRejectedCount = eventTypeCounts[HystrixEventType.THREAD_POOL_REJECTED.ordinal()];
long semaphoreRejectedCount = eventTypeCounts[HystrixEventType.SEMAPHORE_REJECTED.ordinal()];
updatedTotalCount += (successCount + failureCount + timeoutCount + threadPoolRejectedCount + semaphoreRejectedCount);
updatedErrorCount += (failureCount + timeoutCount + threadPoolRejectedCount + semaphoreRejectedCount);
return new HealthCounts(updatedTotalCount, updatedErrorCount);
}
}
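The accumulation in `plus` and the two checks in `isOpen()` fit together as in the following standalone sketch. `HealthSketch` is a hypothetical stand-in for HealthCounts; the integer percentage formula and the `shouldTrip` helper are assumptions made for illustration, not copied from Hystrix.

```java
/** Hypothetical, immutable stand-in for HealthCounts plus the trip decision. */
final class HealthSketch {
    final long totalCount;
    final long errorCount;
    final int errorPercentage;

    HealthSketch(long totalCount, long errorCount) {
        this.totalCount = totalCount;
        this.errorCount = errorCount;
        // Assumption: whole-number percentage, 0 when nothing has been recorded yet.
        this.errorPercentage = totalCount > 0 ? (int) (errorCount * 100 / totalCount) : 0;
    }

    static HealthSketch empty() {
        return new HealthSketch(0, 0);
    }

    /** Returns a new instance with one bucket's counts added; successes count toward the total only. */
    HealthSketch plus(long success, long failure, long timeout, long rejected) {
        long bucketErrors = failure + timeout + rejected;
        return new HealthSketch(totalCount + success + bucketErrors, errorCount + bucketErrors);
    }

    /** Mirrors the order of checks in isOpen(): volume first, then error percentage. */
    boolean shouldTrip(int requestVolumeThreshold, int errorThresholdPercentage) {
        if (totalCount < requestVolumeThreshold) {
            return false;
        }
        return errorPercentage >= errorThresholdPercentage;
    }
}
```

Because each `plus` call returns a new object, a reader of the latest value never observes a half-updated pair of counters.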
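3.5 About CLOSED, OPEN, and HALF_OPEN
The HystrixCircuitBreakerImpl listed above has no Status field. It tracks state with the boolean circuitOpen instead of a three-valued status (CLOSED, OPEN, HALF_OPEN), which keeps the circuit breaker API simple for callers. Be aware that this matches the earlier 1.5.x sources; as far as I know, later 1.5.x releases, 1.5.18 included, model the state with an explicit Status enum (CLOSED, OPEN, HALF_OPEN), so compare against the source of the version you actually run.
You may ask: how is HALF_OPEN implemented, then?
The logic is fairly simple. allowRequest (which every command must call before it executes, as covered earlier) calls allowSingleTest, and allowSingleTest is what implements the half-open behavior. When the breaker trips open, the current time is stored in circuitOpenedOrLastTestedTime. When a new request arrives, the breaker checks whether the current time is later than circuitOpenedOrLastTestedTime plus sleepWindowInMilliseconds. If it is, allowSingleTest updates circuitOpenedOrLastTestedTime to the current time and returns true, which lets that request through. If the user task then succeeds, markSuccess closes the breaker.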
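The timestamp-based half-open check can be tried in isolation. The sketch below is hypothetical (`SingleTestSketch` is not a Hystrix class) and injects the clock as a `LongSupplier` so the behavior can be stepped through deterministically, where the real code calls `System.currentTimeMillis()`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch of half-open behavior built from a boolean and a timestamp. */
final class SingleTestSketch {
    private final AtomicBoolean circuitOpen = new AtomicBoolean(false);
    private final AtomicLong openedOrLastTested = new AtomicLong();
    private final long sleepWindowMs;
    private final LongSupplier clock; // injected so the example is deterministic

    SingleTestSketch(long sleepWindowMs, LongSupplier clock) {
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    /** Trips the circuit and records when it happened. */
    void open() {
        if (circuitOpen.compareAndSet(false, true)) {
            openedOrLastTested.set(clock.getAsLong());
        }
    }

    /** True for at most one caller per sleep window while the circuit is open. */
    boolean allowSingleTest() {
        long last = openedOrLastTested.get();
        long now = clock.getAsLong();
        if (circuitOpen.get() && now > last + sleepWindowMs) {
            // Only the thread that wins this compare-and-set gets to send the trial request.
            return openedOrLastTested.compareAndSet(last, now);
        }
        return false;
    }

    /** A successful trial request closes the circuit. */
    void markSuccess() {
        circuitOpen.set(false);
    }

    boolean isOpen() {
        return circuitOpen.get();
    }
}
```

If the trial request fails, nothing else needs to happen: the timestamp was already pushed forward, so the next trial is allowed only after another full sleep window.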
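3.6 Relationship to HystrixEventStream
While a Hystrix command executes, every outcome is emitted as an event, wrapped in a specific data structure, and finally merged into an event stream (HystrixEventStream). The event stream exposes an observe() method, which turns the stream into a data source (small creeks merge into a river, and consumers draw water from the river). Other consumers read data from it, and the circuit breaker is one of those consumers.
Who picks up after HystrixEventStream
The previous section, "Metrics collection", explained that HystrixEventStream connects the producers of events to their consumers, and the component that picks up the events is BucketedCounterStream (covered below). So how exactly does that happen? It starts with HystrixCommandMetrics.healthCountsStream, which uses HystrixCommandCompletionStream.getInstance(commandKey) to route the event stream into BucketedCounterStream.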
public class HystrixCommandMetrics extends HystrixMetrics {
HystrixCommandMetrics(final HystrixCommandKey key, HystrixCommandGroupKey commandGroup, HystrixThreadPoolKey threadPoolKey, HystrixCommandProperties properties, HystrixEventNotifier eventNotifier) {
healthCountsStream = HealthCountsStream.getInstance(key, properties); // obtain the (cached) instance
}
}
public class HealthCountsStream {
// Create the instance from the command properties
public static HealthCountsStream getInstance(HystrixCommandKey commandKey, HystrixCommandProperties properties) {
final int healthCountBucketSizeInMs = properties.metricsHealthSnapshotIntervalInMilliseconds().get();
if (healthCountBucketSizeInMs == 0) {
throw new RuntimeException("You have set the bucket size to 0ms. Please set a positive number, so that the metric stream can be properly consumed");
}
final int numHealthCountBuckets = properties.metricsRollingStatisticalWindowInMilliseconds().get() / healthCountBucketSizeInMs;
return getInstance(commandKey, numHealthCountBuckets, healthCountBucketSizeInMs);
}
// Create the instance, cached per command key
public static HealthCountsStream getInstance(HystrixCommandKey commandKey, int numBuckets, int bucketSizeInMs) {
HealthCountsStream initialStream = streams.get(commandKey.name());
if (initialStream != null) {
return initialStream;
} else {
final HealthCountsStream healthStream;
synchronized (HealthCountsStream.class) {
HealthCountsStream existingStream = streams.get(commandKey.name());
if (existingStream == null) {
// First use: initialize
HealthCountsStream newStream = new HealthCountsStream(commandKey, numBuckets, bucketSizeInMs,
HystrixCommandMetrics.appendEventToBucket);
streams.putIfAbsent(commandKey.name(), newStream);
healthStream = newStream;
} else {
healthStream = existingStream;
}
}
healthStream.startCachingStreamValuesIfUnstarted();
return healthStream;
}
}
}
public class HealthCountsStream extends BucketedRollingCounterStream{
private HealthCountsStream(final HystrixCommandKey commandKey, final int numBuckets, final int bucketSizeInMs,
Func2<long[], HystrixCommandCompletion, long[]> reduceCommandCompletion) {
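// Note the first argument to super: HystrixCommandCompletionStream.getInstance (covered earlier) returns the command's completion event stream, which is cached once created. This point matters.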
super(HystrixCommandCompletionStream.getInstance(commandKey), numBuckets, bucketSizeInMs, reduceCommandCompletion, healthCheckAccumulator);
}
}
// Inheritance chain
public abstract class BucketedRollingCounterStream extends BucketedCounterStream{
}
// Inheritance chain
public abstract class BucketedCounterStream{
protected BucketedCounterStream(final HystrixEventStream<Event> inputEventStream, final int numBuckets, final int bucketSizeInMs,
final Func2<Bucket, Event, Bucket> appendRawEventToBucket) {
this.bucketedStream = Observable.defer(new Func0<Observable<Bucket>>() {
@Override
public Observable<Bucket> call() {
return inputEventStream // the key point: this is HystrixCommandCompletionStream.observe()
.observe()
.window(bucketSizeInMs, TimeUnit.MILLISECONDS) //bucket it by the counter window so we can emit to the next operator in time chunks, not on every OnNext
.flatMap(reduceBucketToSummary) //for a given bucket, turn it into a long array containing counts of event types
.startWith(emptyEventCountsToStart); //start it with empty arrays to make consumer logic as generic as possible (windows are always full)
}
});
}
}
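3.7 HealthCountsStream
BucketedCounterStream aggregates the data into a stream whose unit is the Bucket. BucketedRollingCounterStream then builds the rolling-window logic on top of that stream of buckets, and HealthCountsStream supplies the function that sums the buckets into health counts.
In summary, metrics.getHealthCounts() returns statistics that are already aggregated per rolling window, and observe() actually returns the sourceStream of BucketedRollingCounterStream.
// Definition of the bucket stream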
public abstract class BucketedCounterStream<Event extends HystrixEvent, Bucket, Output> {
protected BucketedCounterStream(final HystrixEventStream<Event> inputEventStream, final int numBuckets, final int bucketSizeInMs,
final Func2<Bucket, Event, Bucket> appendRawEventToBucket) {
this.numBuckets = numBuckets;
// A Func1 that reduces the Hystrix events of one interval into a Bucket
this.reduceBucketToSummary = new Func1<Observable<Event>, Observable<Bucket>>() {
// Takes an Observable of Event and reduces it to a single Bucket
@Override
public Observable<Bucket> call(Observable<Event> eventBucket) {
return eventBucket.reduce(getEmptyBucketSummary(), appendRawEventToBucket);
}
};
final List<Bucket> emptyEventCountsToStart = new ArrayList<Bucket>();
for (int i = 0; i < numBuckets; i++) {
emptyEventCountsToStart.add(getEmptyBucketSummary());
}
this.bucketedStream = Observable.defer(new Func0<Observable<Bucket>>() {
// inputEventStream is the HystrixEventStream discussed throughout; observe() exposes it as the data source
@Override
public Observable<Bucket> call() {
return inputEventStream
.observe()
// Use window() to collect the events that fall within one bucket interval
.window(bucketSizeInMs, TimeUnit.MILLISECONDS) //bucket it by the counter window so we can emit to the next operator in time chunks, not on every OnNext
// Reduce those events to a single Bucket
.flatMap(reduceBucketToSummary) //for a given bucket, turn it into a long array containing counts of event types
.startWith(emptyEventCountsToStart); //start it with empty arrays to make consumer logic as generic as possible (windows are always full)
}
});
}
}
// BucketedRollingCounterStream extends the class above and adds the rolling-window implementation
public abstract class BucketedRollingCounterStream<Event extends HystrixEvent, Bucket, Output> extends BucketedCounterStream<Event, Bucket, Output> {
private Observable<Output> sourceStream;
private final AtomicBoolean isSourceCurrentlySubscribed = new AtomicBoolean(false);
protected BucketedRollingCounterStream(HystrixEventStream<Event> stream, final int numBuckets, int bucketSizeInMs,
final Func2<Bucket, Event, Bucket> appendRawEventToBucket,
final Func2<Output, Bucket, Output> reduceBucket) { // for HealthCountsStream, reduceBucket is healthCounts.plus
super(stream, numBuckets, bucketSizeInMs, appendRawEventToBucket);
// Function that reduces a window of buckets to a single summary
Func1<Observable<Bucket>, Observable<Output>> reduceWindowToSummary = new Func1<Observable<Bucket>, Observable<Output>>() {
@Override
public Observable<Output> call(Observable<Bucket> window) {
return window.scan(getEmptyOutputValue(), reduceBucket).skip(numBuckets);
}
};
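// Build on bucketedStream, which the parent BucketedCounterStream has already aggregated
this.sourceStream = bucketedStream //stream broken up into buckets
// Group the buckets into windows of numBuckets, advancing one bucket at a time
.window(numBuckets, 1) //emit overlapping windows of buckets
// Reduce each window to one summary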
.flatMap(reduceWindowToSummary) //convert a window of bucket-summaries into a single summary
.doOnSubscribe(new Action0() {
@Override
public void call() {
isSourceCurrentlySubscribed.set(true);
}
})
.doOnUnsubscribe(new Action0() {
@Override
public void call() {
isSourceCurrentlySubscribed.set(false);
}
})
.share() //multiple subscribers should get same data
.onBackpressureDrop(); //if there are slow consumers, data should not buffer
}
}
// HealthCountsStream extends the class above and exposes the health statistics
public class HealthCountsStream extends BucketedRollingCounterStream<HystrixCommandCompletion, long[], HystrixCommandMetrics.HealthCounts> {
private static final Func2<HystrixCommandMetrics.HealthCounts, long[], HystrixCommandMetrics.HealthCounts> healthCheckAccumulator = new Func2<HystrixCommandMetrics.HealthCounts, long[], HystrixCommandMetrics.HealthCounts>() {
@Override
public HystrixCommandMetrics.HealthCounts call(HystrixCommandMetrics.HealthCounts healthCounts, long[] bucketEventCounts) {
return healthCounts.plus(bucketEventCounts); // accumulate this bucket into the running totals
}
};
private HealthCountsStream(final HystrixCommandKey commandKey, final int numBuckets, final int bucketSizeInMs,
Func2<long[], HystrixCommandCompletion, long[]> reduceCommandCompletion) {
// The key step: the command completion stream is passed in as the event source
super(HystrixCommandCompletionStream.getInstance(commandKey), numBuckets, bucketSizeInMs, reduceCommandCompletion, healthCheckAccumulator);
}
}
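The effect of `window(numBuckets, 1)` followed by `scan(...).skip(numBuckets)` can be mimicked without RxJava. In the listing, `scan` with a seed emits the seed plus one running total per bucket, so skipping numBuckets items leaves only the total for a full window. The sketch below is hypothetical (`OverlappingWindowsSketch` is not Hystrix code) and simplifies each bucket to a single total.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, RxJava-free imitation of overlapping windows reduced to one sum each. */
final class OverlappingWindowsSketch {
    private OverlappingWindowsSketch() {
    }

    /** Emits one sum per full window of numBuckets consecutive buckets, sliding by one bucket. */
    static List<Long> rollingSums(List<Long> bucketTotals, int numBuckets) {
        List<Long> sums = new ArrayList<>();
        for (int start = 0; start + numBuckets <= bucketTotals.size(); start++) {
            long sum = 0;
            for (int i = start; i < start + numBuckets; i++) {
                sum += bucketTotals.get(i);
            }
            sums.add(sum); // windows with fewer than numBuckets buckets are never emitted
        }
        return sums;
    }
}
```

Each new bucket therefore produces one fresh summary covering the most recent numBuckets buckets, which is what the circuit breaker reads through getHealthCounts().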
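The section on who picks up after HystrixEventStream showed that HealthCountsStream already receives the event stream. The next question is: when does HealthCountsStream actually get subscribed to and consumed?
// This should look familiar: it is from HystrixCommandMetrics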
public HealthCounts getHealthCounts() {
return healthCountsStream.getLatest();
}
public abstract class BucketedCounterStream<Event extends HystrixEvent, Bucket, Output> {
/**
* Synchronous call to retrieve the last calculated bucket without waiting for any emissions
* @return last calculated bucket
*/
public Output getLatest() {
startCachingStreamValuesIfUnstarted(); // key step: subscribes on first use
if (counterSubject.hasValue()) {
return counterSubject.getValue();
} else {
return getEmptyOutputValue();
}
}
public void startCachingStreamValuesIfUnstarted() {
if (subscription.get() == null) {
//the stream is not yet started
// This is where it happens: observe() returns BucketedRollingCounterStream's sourceStream
Subscription candidateSubscription = observe().subscribe(counterSubject);
if (subscription.compareAndSet(null, candidateSubscription)) {
//won the race to set the subscription
} else {
//lost the race to set the subscription, so we need to cancel this one
candidateSubscription.unsubscribe();
}
}
}
}
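The subscribe-once logic in startCachingStreamValuesIfUnstarted is a general lazy-start idiom built on compareAndSet. A minimal sketch follows, using a hypothetical `Handle` interface in place of the RxJava Subscription; none of these names come from Hystrix.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical sketch of "start on first use, keep exactly one subscription". */
final class LazyStartSketch {
    /** Stand-in for an RxJava Subscription. */
    interface Handle {
        void cancel();
    }

    private final AtomicReference<Handle> subscription = new AtomicReference<>();
    private final Supplier<Handle> subscribe;

    LazyStartSketch(Supplier<Handle> subscribe) {
        this.subscribe = subscribe;
    }

    void startIfUnstarted() {
        if (subscription.get() == null) {
            Handle candidate = subscribe.get();
            if (!subscription.compareAndSet(null, candidate)) {
                // Another thread stored its subscription first, so ours is redundant.
                candidate.cancel();
            }
        }
    }

    boolean isStarted() {
        return subscription.get() != null;
    }
}
```

Two racing threads may both subscribe briefly, but the loser cancels immediately, so only one subscription keeps feeding the cached latest value.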
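In summary, Hystrix's circuit breaker relies heavily on RxJava APIs, such as window (grouping events into windows) and flatMap (flattening and transforming the type).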