Original article: http://www.ibm.com/developerworks/java/library/j-threads2/index.html
When we say a program is "too slow," we are generally referring to one of two performance attributes: latency or scalability. Latency describes how long it takes for a given task to complete, whereas scalability describes how a program's performance varies under increasing load or with increased computing resources. A high degree of contention is bad for both latency and scalability.
Why contention is such a problem
Contended synchronizations are slow because they involve multiple thread switches and system calls. When multiple threads contend for the same monitor, the JVM has to maintain a queue of threads waiting for that monitor (and this queue must be synchronized across processors), which means more time spent in the JVM or OS code and less time spent in your program code. Moreover, contention impairs scalability because it forces the scheduler to serialize operations, even if a free processor is available. When one thread is executing a synchronized block, any thread waiting to enter that block is stalled. If no other threads are available for execution, then processors may sit idle.
If we want to write scalable multithreaded programs, we must reduce contention for critical resources. There are a number of techniques for doing so, but before you can apply any of them, you need to take a good look at your code and figure out under what conditions you will be synchronizing on common monitors. Determining what locks are bottlenecks can be quite difficult; sometimes locks are hidden within class libraries or implicitly specified through synchronized methods, and therefore are less obvious when reviewing the code. Moreover, the current state of tools for detecting contention is quite poor.
Technique 1: Get in, get out

One obvious technique for reducing the likelihood of contention is to make synchronized blocks as short as possible. The shorter the time a thread holds a given lock, the lower the probability that another thread will request it while the first thread is holding it. So while you must use synchronization to access or update shared variables, it is usually better to do any thread-safe pre-processing or post-processing outside of the synchronized block.
Listing 1 demonstrates this technique. Our application maintains a HashMap for representing attributes of various entities; one such attribute is the list of access rights that a given user has. The access rights are stored as a comma-separated list of rights. The method userHasAdminAccess() looks up a user's access rights in the global attributes table and checks whether the user has the right called "ADMIN".
Listing 1. Spending more time in a synchronized block than necessary
public boolean userHasAdminAccess(String userName) {
    synchronized (attributesMap) {
        String rights = (String) attributesMap.get("users." + userName + ".accessRights");
        if (rights == null)
            return false;
        else
            return (rights.indexOf("ADMIN") >= 0);
    }
}
This version of userHasAdminAccess is thread-safe, but holds the lock for much longer than necessary. To create the concatenated string "users.brian.accessRights", the compiler will create a temporary StringBuffer object, call StringBuffer.append three times, and then call StringBuffer.toString, which means at least two object creations and several method calls. It will then call HashMap.get to retrieve the string, and then call String.indexOf to extract the desired rights identifier. As a percentage of the total work done in this method, the pre- and post-processing are significant; because they are thread-safe, it makes sense to move them out of the synchronized block, as shown in Listing 2.
Listing 2. Reducing the time spent in a synchronized block
public boolean userHasAdminAccess(String userName) {
    String key = "users." + userName + ".accessRights";
    String rights;
    synchronized (attributesMap) {
        rights = (String) attributesMap.get(key);
    }
    return ((rights != null) && (rights.indexOf("ADMIN") >= 0));
}
On the other hand, it's possible to take this technique too far. If you have two operations that require synchronization separated by only a small block of thread-safe code, you are generally better off using a single synchronized block: acquiring and releasing a lock has a cost of its own, and paying that cost twice can outweigh the benefit of the brief unsynchronized gap.
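To illustrate this tradeoff, here is a minimal sketch (the HitCounter class, its fields, and its method names are hypothetical and not from the article): the "split" version pays for two lock acquisitions per call to protect work separated by a cheap thread-safe operation, while the "merged" version does the thread-safe work first and takes the lock only once.

```java
// Hypothetical example: two updates to shared state separated by a cheap
// thread-safe operation.
public class HitCounter {
    private long hits = 0;
    private long lastUpdate = 0;

    // Two short synchronized blocks: the lock is acquired and released
    // twice per call, doubling the opportunities for contention.
    public void recordSplit() {
        synchronized (this) {
            hits++;
        }
        long now = System.currentTimeMillis();  // thread-safe, needs no lock
        synchronized (this) {
            lastUpdate = now;
        }
    }

    // Merged: do the thread-safe work outside, then take the lock once.
    // For a gap this small, a single block is usually cheaper.
    public void recordMerged() {
        long now = System.currentTimeMillis();
        synchronized (this) {
            hits++;
            lastUpdate = now;
        }
    }

    public synchronized long getHits() {
        return hits;
    }
}
```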
Technique 2: Reducing lock granularity
Another valuable technique for reducing contention is to spread your synchronizations over more locks. For example, suppose that you have a class that stores user information and service information in two separate hash tables, as shown in Listing 3.
Listing 3. An opportunity for reducing lock granularity
public class AttributesStore {
    private HashMap usersMap = new HashMap();
    private HashMap servicesMap = new HashMap();

    public synchronized void setUserInfo(String user, UserInfo userInfo) {
        usersMap.put(user, userInfo);
    }

    public synchronized UserInfo getUserInfo(String user) {
        return (UserInfo) usersMap.get(user);
    }

    public synchronized void setServiceInfo(String service, ServiceInfo serviceInfo) {
        servicesMap.put(service, serviceInfo);
    }

    public synchronized ServiceInfo getServiceInfo(String service) {
        return (ServiceInfo) servicesMap.get(service);
    }
}
Here, the accessor methods for user and service data are synchronized, which means that they are synchronizing on the AttributesStore object. While this is perfectly thread-safe, it increases the likelihood of contention for no real benefit. If a thread is executing setUserInfo, not only will other threads be locked out of setUserInfo and getUserInfo, as is desired, but they will also be locked out of getServiceInfo and setServiceInfo.
This problem can be avoided by having the accessors simply synchronize on the actual objects being shared (the usersMap and servicesMap objects), as shown in Listing 4.
Listing 4. Reducing lock granularity
public class AttributesStore {
    private HashMap usersMap = new HashMap();
    private HashMap servicesMap = new HashMap();

    public void setUserInfo(String user, UserInfo userInfo) {
        synchronized (usersMap) {
            usersMap.put(user, userInfo);
        }
    }

    public UserInfo getUserInfo(String user) {
        synchronized (usersMap) {
            return (UserInfo) usersMap.get(user);
        }
    }

    public void setServiceInfo(String service, ServiceInfo serviceInfo) {
        synchronized (servicesMap) {
            servicesMap.put(service, serviceInfo);
        }
    }

    public ServiceInfo getServiceInfo(String service) {
        synchronized (servicesMap) {
            return (ServiceInfo) servicesMap.get(service);
        }
    }
}
Now threads accessing the services map will not contend with threads trying to access the users map. (In this case, the same effect could also be obtained by creating the maps using the synchronized wrapper mechanism provided by the Collections framework, Collections.synchronizedMap.) Assuming that requests against the two maps are evenly distributed, this technique would cut the number of potential contentions in half.
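The synchronized-wrapper alternative mentioned above might look like the following sketch (the class name is hypothetical, and Object stands in for the UserInfo and ServiceInfo types, which are not defined in the article). Each wrapper returned by Collections.synchronizedMap synchronizes on its own underlying map, so the two maps do not contend with each other, just as in Listing 4.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Same lock-granularity effect as Listing 4, using the Collections
// framework's synchronized wrappers instead of explicit blocks.
public class WrappedAttributesStore {
    private final Map usersMap = Collections.synchronizedMap(new HashMap());
    private final Map servicesMap = Collections.synchronizedMap(new HashMap());

    public void setUserInfo(String user, Object userInfo) {
        usersMap.put(user, userInfo);   // locks only usersMap
    }

    public Object getUserInfo(String user) {
        return usersMap.get(user);
    }

    public void setServiceInfo(String service, Object serviceInfo) {
        servicesMap.put(service, serviceInfo);   // locks only servicesMap
    }

    public Object getServiceInfo(String service) {
        return servicesMap.get(service);
    }
}
```

Note that iterating over a wrapped map still requires manually synchronizing on the wrapper, so the explicit-block style of Listing 4 is clearer when you need compound operations.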
Applying Technique 2 to HashMap
One of the most common contention bottlenecks in server-side Java applications is the HashMap. Applications use HashMap to cache all sorts of critical shared data (user profiles, session information, file contents), and the HashMap.get method may execute many bytecode instructions. For example, if you are writing a Web server and all your cached pages are stored in a HashMap, every request will want to acquire and hold the lock on that map, and it will become a bottleneck.
We can extend the lock granularity technique to handle this situation, although we must be careful, as there are some potential Java Memory Model (JMM) hazards associated with this approach. The LockPoolMap in Listing 5 exposes thread-safe get() and put() methods, but spreads the synchronization over a pool of locks, reducing contention substantially.
LockPoolMap is thread-safe and functions like a simplified HashMap, but has more attractive contention properties. Instead of synchronizing on the entire map on each get() or put() operation, the synchronization is done at the bucket level. For each bucket there is a lock, and that lock is acquired when traversing a bucket either for read or write. The locks are created when the map is created (there would be JMM problems if they were not).
If you create a LockPoolMap with many buckets, many threads will be able to use the map concurrently with a much lower likelihood of contention. However, the reduced contention does not come for free. By not synchronizing on a global lock, it becomes much more difficult to perform operations that act on the map as a whole, such as the size() method. An implementation of size() would have to sequentially acquire the lock for each bucket, count the number of nodes in that bucket, then release the lock and move on to the next bucket. But once the previous lock is released, other threads are free to modify that bucket; by the time size() finishes calculating the number of elements, the result could well be wrong. However, the LockPoolMap technique works quite well in some situations, such as shared caches.
Listing 5. Reducing locking granularity on a HashMap
import java.util.*;

/**
 * LockPoolMap implements a subset of the Map interface (get, put, clear)
 * and performs synchronization at the bucket level, not at the map
 * level. This reduces contention, at the cost of losing some Map
 * functionality, and is well suited to simple caches. The number of
 * buckets is fixed and does not increase.
 */
public class LockPoolMap {
    private Node[] buckets;
    private Object[] locks;

    private static final class Node {
        public final Object key;
        public Object value;
        public Node next;

        public Node(Object key) {
            this.key = key;
        }
    }

    public LockPoolMap(int size) {
        buckets = new Node[size];
        locks = new Object[size];
        for (int i = 0; i < size; i++)
            locks[i] = new Object();
    }

    private final int hash(Object key) {
        int hash = key.hashCode() % buckets.length;
        if (hash < 0)
            hash *= -1;
        return hash;
    }

    public void put(Object key, Object value) {
        int hash = hash(key);
        synchronized (locks[hash]) {
            Node m;
            for (m = buckets[hash]; m != null; m = m.next) {
                if (m.key.equals(key)) {
                    m.value = value;
                    return;
                }
            }
            // We must not have found it, so put it at the beginning of the chain
            m = new Node(key);
            m.value = value;
            m.next = buckets[hash];
            buckets[hash] = m;
        }
    }

    public Object get(Object key) {
        int hash = hash(key);
        synchronized (locks[hash]) {
            for (Node m = buckets[hash]; m != null; m = m.next)
                if (m.key.equals(key))
                    return m.value;
        }
        return null;
    }
}
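The approximate size() method discussed above might look like the sketch below, shown here on a trimmed-down variant of the map (the class name is hypothetical; the article does not include a size() implementation). Because each bucket lock is released before the next is taken, the total is only a snapshot and may already be stale when it is returned.

```java
// A sketch of a bucket-by-bucket size() for a lock-pool map.
public class LockPoolMapWithSize {
    private static final class Node {
        final Object key;
        Object value;
        Node next;
        Node(Object key) { this.key = key; }
    }

    private final Node[] buckets;
    private final Object[] locks;

    public LockPoolMapWithSize(int size) {
        buckets = new Node[size];
        locks = new Object[size];
        for (int i = 0; i < size; i++)
            locks[i] = new Object();
    }

    private int hash(Object key) {
        int h = key.hashCode() % buckets.length;
        return (h < 0) ? -h : h;
    }

    public void put(Object key, Object value) {
        int h = hash(key);
        synchronized (locks[h]) {
            for (Node m = buckets[h]; m != null; m = m.next) {
                if (m.key.equals(key)) { m.value = value; return; }
            }
            Node m = new Node(key);
            m.value = value;
            m.next = buckets[h];
            buckets[h] = m;
        }
    }

    public Object get(Object key) {
        int h = hash(key);
        synchronized (locks[h]) {
            for (Node m = buckets[h]; m != null; m = m.next)
                if (m.key.equals(key))
                    return m.value;
        }
        return null;
    }

    // Acquires one bucket lock at a time; under concurrent modification
    // the count is only approximate, as buckets already counted may
    // change before the method returns.
    public int size() {
        int count = 0;
        for (int i = 0; i < buckets.length; i++) {
            synchronized (locks[i]) {
                for (Node m = buckets[i]; m != null; m = m.next)
                    count++;
            }
        }
        return count;
    }
}
```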
Table 1 compares the performance of three shared map implementations: a synchronized HashMap, an unsynchronized HashMap (not thread-safe), and a LockPoolMap. The unsynchronized version is present only to show the overhead of contention. A test that performs random put() and get() operations on the map was run with a variable number of threads, on a dual-processor Linux system using the Sun 1.3 JDK. The table shows the run time for each combination. This test is somewhat of an extreme case; the test programs do nothing but access the map, so there will be many more contentions than there would be in a realistic program, but it is designed to illustrate the performance penalty of contention.
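The benchmark code itself was not published with the article, but a test of the kind described above might be structured roughly like this (hypothetical harness; all names and parameters are illustrative): each thread hammers a shared map with random put() and get() operations, and the elapsed wall-clock time is measured.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// A rough sketch of a map stress test: N threads each perform random
// put() and get() operations against a shared map.
public class MapStressTest {
    static final int OPS_PER_THREAD = 10000;

    public static long timeTest(final Map map, int nThreads) {
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    Random r = new Random();
                    for (int j = 0; j < OPS_PER_THREAD; j++) {
                        Object key = Integer.valueOf(r.nextInt(1000));
                        if (r.nextBoolean())
                            map.put(key, key);   // contended write
                        else
                            map.get(key);        // contended read
                    }
                }
            });
        }
        long start = System.currentTimeMillis();
        for (int i = 0; i < nThreads; i++)
            threads[i].start();
        for (int i = 0; i < nThreads; i++) {
            try {
                threads[i].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        Map map = Collections.synchronizedMap(new HashMap());
        System.out.println("2 threads took " + timeTest(map, 2) + " ms");
    }
}
```

Swapping in an unsynchronized HashMap or a LockPoolMap for the wrapped map is what produces the other two columns of Table 1.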
Table 1. Scalability comparison between HashMap and LockPoolMap
| Threads | Unsynchronized HashMap (unsafe) | Synchronized HashMap | LockPoolMap |
|---|---|---|---|
| 1 | 1.1 | 1.4 | 1.6 |
| 2 | 1.1 | 57.6 | 3.7 |
| 4 | 2.1 | 123.5 | 7.7 |
| 8 | 3.7 | 272.3 | 16.7 |
| 16 | 6.8 | 577.0 | 37.9 |
| 32 | 13.5 | 1233.3 | 80.5 |
While all the implementations exhibit similar scaling characteristics for large numbers of threads, the synchronized HashMap implementation exhibits a huge performance penalty when going from one thread to two, because there is contention on every single put() and get() operation. With more than one thread, the LockPoolMap technique is approximately 15 times faster than the synchronized HashMap technique. This difference reflects the time lost to scheduling overhead and to idle time spent waiting to acquire locks. The advantage of LockPoolMap would be even larger on a system with more processors.
Technique 3: Lock collapsing

Another technique that may improve performance is called "lock collapsing" (see Listing 6). Recall that the methods of the Vector class are nearly all synchronized. Imagine that you have a Vector of String values and you are searching for the longest String. Suppose further that you know that elements will be added only at the end and that they will not be removed, making it (mostly) safe to access the data as shown in the getLongest() method, which simply loops through the elements of the Vector, calling elementAt() to retrieve each one.
The getLongest2() method is very similar, except that it obtains the lock on the Vector before starting the loop. The result is that when elementAt() attempts to acquire the lock, the JVM sees that the current thread already holds it and does not contend. getLongest2() lengthens the synchronized block, which appears to be in opposition to the "get in, get out" principle, but because it avoids so many potential contended acquisitions, it can be considerably faster, as less time is lost to scheduling overhead.
On a dual-processor Linux system running the Sun 1.3 JDK, a test program with two threads that simply looped calling getLongest2() was more than 10 times faster than one that called getLongest(). While both programs had the same degree of serialization, much less time was lost to scheduling overhead. Again, this is an extreme example, but it shows that the scheduling overhead of contention is not trivial. Even when run with a single thread, the collapsed version was some 30 percent faster: it is much faster to acquire a lock that you already hold than one that nobody holds.
Listing 6. Lock collapsing

Vector v;
...
public String getLongest() {
    int maxLen = 0;
    String longest = null;
    for (int i = 0; i < v.size(); i++) {
        String s = (String) v.elementAt(i);
        if (s.length() > maxLen) {
            maxLen = s.length();
            longest = s;
        }
    }
    return longest;
}

public String getLongest2() {
    int maxLen = 0;
    String longest = null;
    synchronized (v) {
        for (int i = 0; i < v.size(); i++) {
            String s = (String) v.elementAt(i);
            if (s.length() > maxLen) {
                maxLen = s.length();
                longest = s;
            }
        }
        return longest;
    }
}
Contended synchronization can have a serious impact on the scalability of your programs. Even worse, unless you perform realistic load testing, contention-related performance problems do not always present themselves during the development and testing process. The techniques presented in this article are effective for reducing the cost of contention in your programs, and increasing the load they can bear before exhibiting nonlinear scaling behavior. But before you can apply these techniques, you first have to analyze your program to determine where contention is likely to occur.
In the last installment in this series, we'll examine ThreadLocal, an oft-neglected facility of the Thread API. By using ThreadLocal, we can reduce contention by giving each thread its own copy of certain critical objects. Stay tuned!
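As a preview of that technique, here is a common pattern (the DateFormatter class is a hypothetical example, not from this series): SimpleDateFormat is not thread-safe, so sharing one instance would require a lock; giving each thread its own copy via ThreadLocal eliminates the lock entirely.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Each thread lazily gets its own SimpleDateFormat, so no thread ever
// blocks another: contention is removed rather than reduced.
public class DateFormatter {
    private static final ThreadLocal format = new ThreadLocal() {
        protected Object initialValue() {
            return new SimpleDateFormat("yyyy-MM-dd");
        }
    };

    public static String format(Date d) {
        return ((SimpleDateFormat) format.get()).format(d);
    }
}
```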
Resources

- Java Performance Tuning by Jack Shirazi (O'Reilly & Associates, 2000) provides guidance on eliminating performance issues on the Java platform.
- Java Platform Performance: Strategies and Tactics by Steve Wilson and Jeff Kesselman (Addison-Wesley, 2000) offers techniques for the experienced Java programmer to build speedy and efficient Java code.
- Java Performance and Scalability, Volume 1: Server-Side Programming Techniques by Dov Bulka (Addison-Wesley, 2000) provides a wealth of tips and tricks designed to help you increase the performance of your apps.
- Brian Goetz' recent article "Double-checked locking: Clever, but broken" (JavaWorld, February 2001) explores the JMM in detail and the surprising consequences of failing to synchronize in certain situations.
- Doug Lea's Concurrent Programming in Java, Second Edition (Addison-Wesley, 1999) is a masterful book on the subtle issues surrounding multithreaded programming in Java.
- In his article "Writing multithreaded Java applications" (developerWorks, February 2001), Alex Roetter introduces the Java Thread API, outlines issues involved in multithreading, and offers solutions to common problems.
- The performance modeling and analysis team at IBM Thomas J. Watson Research Center is researching several projects in the areas of performance and performance management.
- Find more Java technology resources in the developerWorks Java technology zone.