2.1 Runtime performance optimizations
2.1.1 Biased locking
Biased Locking is a class of optimizations that improves uncontended synchronization performance by eliminating atomic operations associated with the Java language’s synchronization primitives. These optimizations rely on the property that not only are most monitors uncontended, they are locked by at most one thread during their lifetime.
An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations can be performed by that thread without using atomic operations, resulting in much better performance, particularly on multiprocessor machines.
Locking attempts by threads other than the one toward which the object is "biased" will cause a relatively expensive operation whereby the bias is revoked. The benefit of the elimination of atomic operations must exceed the penalty of revocation for this optimization to be profitable.
Applications with substantial amounts of uncontended synchronization may attain significant speedups while others with certain patterns of locking may see slowdowns.
Biased Locking is enabled by default in Java SE 6 and later. To disable it, add `-XX:-UseBiasedLocking` to the command line.
For more on Biased Locking, please refer to the ACM OOPSLA 2006 paper by Kenneth Russell and David Detlefs: "Eliminating Synchronization-Related Atomic Operations with Biased Locking and Bulk Rebiasing".
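The uncontended, single-owner pattern that biased locking targets can be sketched as follows. This is an illustration only (the class name and iteration count are my own, not from the HotSpot sources): one thread repeatedly enters a monitor that no other thread ever touches.

```java
public class BiasedDemo {
    private int count;

    // Each call executes monitorenter/monitorexit on 'this'. Once the
    // monitor is biased toward the calling thread, re-entry is a few
    // loads and compares rather than an atomic CAS on the object header.
    public synchronized void increment() {
        count++;
    }

    public synchronized int get() {
        return count;
    }

    public static void main(String[] args) {
        BiasedDemo d = new BiasedDemo();
        // A single thread acquires the same monitor a million times;
        // after the first acquisition the lock can stay biased to it.
        for (int i = 0; i < 1_000_000; i++) {
            d.increment();
        }
        System.out.println(d.get()); // prints 1000000
    }
}
```

If a second thread ever tried to lock `d`, the bias would have to be revoked, which is exactly the relatively expensive operation described above.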
2.1.2 Lock coarsening
There are some patterns of locking where a lock is released and then reacquired within a piece of code, with no observable operations occurring in between. The lock coarsening optimization implemented in HotSpot eliminates the unlock and relock operations in those situations. It reduces the amount of synchronization work by enlarging an existing synchronized region. Doing this around a loop could cause a lock to be held for long periods of time, so the technique is only used on non-looping control flow.
This feature is on by default. To disable it, add `-XX:-EliminateLocks` to the command line.
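As a sketch of the pattern this optimization rewrites (the class and method names here are illustrative): `StringBuffer`'s methods are synchronized, so the three back-to-back appends below acquire and release the same monitor three times in straight-line code, and the JIT may merge them into one lock/unlock pair.

```java
public class CoarsenDemo {
    private final StringBuffer buf = new StringBuffer();

    // Each append() locks and unlocks buf's monitor, with no observable
    // operation in between. Lock coarsening can merge the three pairs,
    // as if the appends were wrapped in one synchronized (buf) region.
    public String build() {
        buf.append("a");
        buf.append("b");
        buf.append("c");
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(new CoarsenDemo().build()); // prints abc
    }
}
```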
2.1.3 Adaptive spinning
Adaptive spinning is an optimization technique where a two-phase spin-then-block strategy is used by threads attempting a contended synchronized enter operation. This technique enables threads to avoid undesirable effects that impact performance such as context switching and repopulation of Translation Lookaside Buffers (TLBs). It is "adaptive" because the duration of the spin is determined by policy decisions based on factors such as the rate of success and/or failure of recent spin attempts on the same monitor and the state of the current lock owner.
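The JVM applies this strategy internally to monitors; a rough user-level sketch of the spin-then-block idea, using `java.util.concurrent.locks.ReentrantLock` and an arbitrary fixed spin limit in place of the JVM's adaptive per-monitor policy, might look like this (requires Java 9+ for `Thread.onSpinWait()`):

```java
import java.util.concurrent.locks.ReentrantLock;

public class SpinThenBlock {
    private final ReentrantLock lock = new ReentrantLock();

    public void withLock(Runnable critical) {
        // Phase 1: spin briefly, hoping the current owner releases soon.
        boolean acquired = false;
        for (int i = 0; i < 100 && !acquired; i++) {
            acquired = lock.tryLock();
            if (!acquired) {
                Thread.onSpinWait(); // hint to the CPU that we are busy-waiting
            }
        }
        // Phase 2: give up spinning and block until the lock is free.
        if (!acquired) {
            lock.lock();
        }
        try {
            critical.run();
        } finally {
            lock.unlock();
        }
    }
}
```

In the real implementation the spin limit is not a constant: it is tuned per monitor from the success rate of recent spin attempts, which is what makes the technique adaptive.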
======================================
Reposted from: Synchronization optimizations
Whenever mutable variables are shared between threads, you must use synchronization to ensure that updates made by one thread are visible to other threads on a timely basis. The primary means of synchronization is the use of `synchronized` blocks, which provide both mutual exclusion and visibility guarantees. (Other forms of synchronization include `volatile` variables, the `Lock` objects in `java.util.concurrent.locks`, and atomic variables.) When two threads want to access a shared mutable variable, not only must both threads use synchronization, but if they are using `synchronized` blocks, those `synchronized` blocks must use the same lock object.
In practice, locking falls into two categories: mostly contended and mostly uncontended. Locks that are mostly contended are the "hot" locks in an application, such as the locks that guard the shared work queue for a thread pool. The data protected by these locks is needed constantly by many threads, and you can expect that when you go to acquire this lock, you may have to wait until someone else is done with it. Mostly uncontended locks are those that guard data that is not accessed so frequently, so that most of the time when a thread goes to acquire the lock, no other thread is holding that lock. Most locks are not frequently contended, so improving the performance of uncontended locking can improve overall application performance substantially.
The JVM has separate code paths for contended ("slow path") and uncontended ("fast path") lock acquisition. A lot of effort has already gone into optimizing the fast path; Mustang further improves both the fast path and the slow path and adds a number of optimizations that can eliminate some locking entirely.
Lock elision
The Java Memory Model says that one thread exiting a synchronized block happens-before another thread enters a synchronized block protected by that same lock; this means that whatever memory operations are visible to thread A when it exits a `synchronized` block protected by lock M are visible to thread B when it enters a `synchronized` block protected by M, as shown in Figure 1. For `synchronized` blocks that use different locks, we cannot assume anything about their ordering -- it is as if there were no synchronization at all.
Figure 1. Synchronization and visibility in the Java Memory Model
![Synchronization and visibility in the Java Memory Model](https://i-blog.csdnimg.cn/blog_migrate/6a4acd1a85b49fd10a86af772f45c268.png)
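The happens-before rule in the figure can be demonstrated with a small sketch (the names are illustrative; the `join()` is only there to make the outcome deterministic, while the two `synchronized (M)` blocks play the roles of thread A's exit and thread B's entry):

```java
public class VisibilityDemo {
    private static final Object M = new Object();
    private static int x;

    static int demo() {
        Thread a = new Thread(() -> {
            synchronized (M) {
                x = 42; // this write happens-before A's unlock of M
            }
        });
        a.start();
        try {
            a.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        synchronized (M) {
            // B's lock of M happens-after A's unlock of M, so the
            // write to x is guaranteed visible here.
            return x;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```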
It stands to reason, then, that if a thread enters a `synchronized` block protected by a lock that no other thread will ever synchronize on, then that synchronization has no effect and can therefore be removed by the optimizer. (The Java Language Specification explicitly allows this optimization.) This scenario may sound unlikely, but there are cases when this is clear to the compiler. Listing 1 shows a simplified example of a thread-local lock object:
Listing 1. Using a thread-local object as a lock
```java
synchronized (new Object()) {
    doSomething();
}
```
Because the reference to the lock object disappears before any other thread could possibly use it, the compiler can tell that the above synchronization can be removed because it is impossible for two threads to synchronize using the same lock. While no one would ever directly use the idiom in Listing 1, this code is very similar to the case where the lock associated with a `synchronized` block can be proven to be a thread-local variable. "Thread-local" doesn't necessarily mean that it is implemented with the `ThreadLocal` class; it could be any variable that the compiler can prove is never accessed by another thread. Objects referenced by local variables that never escape from their defining scope meet this test -- if an object is confined to the stack of some thread, no other thread can ever see a reference to that object. (The only way an object can be shared across threads is if a reference to it is published to the heap.)
As luck would have it, escape analysis, which we discussed last month, provides the compiler with exactly the information it needs to optimize away `synchronized` blocks that use thread-local lock objects. If the compiler can prove (using escape analysis) that an object is never published to the heap, it must be a thread-local object, and therefore any `synchronized` blocks that use that object as a lock will have no effect under the Java Memory Model (JMM) and can be eliminated. This optimization is called lock elision and is another of the JVM optimizations slated for Mustang.
The use of locking with thread-local objects happens more often than you might think. There are many classes, such as `StringBuffer` and `java.util.Random`, which are thread-safe because they might be used from multiple threads, but they are often used in a thread-local manner.
Consider the code in Listing 2, which uses a `Vector` in the course of building up a string value. The `getStoogeNames()` method creates a `Vector`, adds several strings to it, and then calls `toString()` to convert it into a string. Each of the calls to one of the `Vector` methods -- three calls to `add()` and one to `toString()` -- requires acquiring and releasing the lock on the `Vector`. While all of the lock acquisitions will be uncontended and therefore fast, the compiler can actually eliminate the synchronization entirely using lock elision.
Listing 2. Candidate for lock elision
```java
public String getStoogeNames() {
    Vector v = new Vector();
    v.add("Moe");
    v.add("Larry");
    v.add("Curly");
    return v.toString();
}
```
Because no reference to the `Vector` ever escapes the `getStoogeNames()` method, it is necessarily thread-local, and therefore any `synchronized` block that uses it as a lock will have no effect under the JMM. The compiler can inline the calls to the `add()` and `toString()` methods, and then it will recognize that it is acquiring and releasing a lock on a thread-local object and can optimize away all four lock-unlock operations.
I've stated in the past that trying to avoid synchronization is generally a bad idea. The arrival of optimizations such as lock elision adds yet one more reason to avoid trying to eliminate synchronization -- the compiler can do it automatically where it is safe, and leave it in place otherwise.
Adaptive locking
In addition to escape analysis and lock elision, Mustang has some other optimizations for locking performance as well. When two threads contend for a lock, one of them will get the lock and the other will have to block until the lock is available. There are two obvious techniques for implementing blocking: having the operating system suspend the thread until it is awakened later, or using spin locks. A spin lock basically amounts to the following code:
```java
while (lockStillInUse)
    ;
```
While spin locks are CPU-intensive and appear inefficient, they can be more efficient than suspending the thread and subsequently waking it up if the lock in question is held for a very short time. There is significant overhead to suspending and rescheduling a thread, which involves work by the JVM, operating system, and hardware. This presents a problem for JVM implementers: if locks are only held for a short time, then spinning is more efficient; if locks are held for a long time, then suspension is more efficient. Without information on what the distribution of lock times will be for a given application, most JVMs are conservative and simply suspend a thread when it fails to acquire a lock.
However, it is possible to do better. For each lock, the JVM can adaptively choose between spinning and suspension based on the behavior of past acquires. It can try spinning and, if that succeeds after a short period of time, keep using that approach. If, on the other hand, a certain amount of spinning fails to acquire the lock, it can decide that this is a lock that is generally held for "a long time" and recompile the method to use only suspension. This decision can be made on a per-lock or per-lock-site basis.
The result of this adaptive approach is better overall performance. After the JVM has had some time to acquire some profiling information on lock usage patterns, it can use spinning for the locks that are generally held for a short time and suspension for the locks that are generally held for a long time. An optimization like this would not be possible in a statically compiled environment because the information about lock usage patterns is not available at static compilation time.
Lock coarsening
Another optimization that can be used to reduce the cost of locking is lock coarsening. Lock coarsening is the process of merging adjacent synchronized blocks that use the same lock object. If the compiler cannot eliminate the locking using lock elision, it may be able to reduce the overhead by using lock coarsening.
The code in Listing 3 is not necessarily a candidate for lock elision (although after inlining `addStooges()`, the JVM might still be able to elide the locking), but it could still benefit from lock coarsening. The three calls to `add()` alternate between acquiring the lock on the `Vector`, doing something, and releasing the lock. The compiler can observe that there is a sequence of adjacent blocks that operate on the same lock, and merge them into a single block.
Listing 3. Candidate for lock coarsening
```java
public void addStooges(Vector v) {
    v.add("Moe");
    v.add("Larry");
    v.add("Curly");
}
```
Not only does this reduce the locking overhead, but by merging the blocks, it turns the contents of the three method calls into one big `synchronized` block, giving the optimizer a much larger basic block to work with. So lock coarsening can also enable other optimizations that have nothing to do with locking.
Even if there are other statements between the `synchronized` blocks or method calls, the compiler can still perform lock coarsening. The compiler is allowed to move statements into a `synchronized` block -- just not out of it. So if there were intervening statements between the calls to `add()`, the compiler could still turn the whole of `addStooges()` into one big `synchronized` block that uses the `Vector` as the lock object.
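A hypothetical sketch of that rule (the variable names and interleaved statements are my own): the lock-free work between the `add()` calls below may legally be pulled into the single coarsened region, since statements can move into a `synchronized` block but never out of it.

```java
import java.util.Vector;

public class CoarseningWithGaps {
    public static String addStooges(Vector<String> v) {
        v.add("Moe");
        String sep = ", ";  // lock-free work between the synchronized calls...
        v.add("Larry");
        int n = v.size();   // ...which the compiler may move into the one
        v.add("Curly");     // coarsened synchronized (v) region.
        return n + sep + v.size();
    }

    public static void main(String[] args) {
        // n is captured after two adds; v.size() is 3 at the return.
        System.out.println(addStooges(new Vector<>())); // prints 2, 3
    }
}
```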
Lock coarsening may entail a trade-off between performance and responsiveness. By combining the three lock-unlock pairs into a single lock-unlock pair, performance is improved by reducing instruction count and reducing the amount of synchronization traffic on the memory bus. The cost is that the duration for which the lock is held may be lengthened, increasing the amount of time other threads might be locked out and also increasing the probability of lock contention. However, the lock is held for a relatively short period of time in either case, and the compiler can apply heuristics based on the length of the code protected by synchronization to make a reasonable trade-off here. (And depending on what other optimizations are enabled by the larger basic block, the duration for which the lock is held may not even be lengthened under coarsening.)
Conclusion
Uncontended locking performance has improved with nearly every JVM version. Mustang continues this trend by improving both the raw performance of uncontended and contended locking and introducing optimizations that can eliminate many lock operations.