Analysis of Haswell’s Transactional Memory

Original author: David Kanter

Original title: Analysis of Haswell’s Transactional Memory

Source: http://www.realworldtech.com/haswell-tm/

Translated by: CoryXie <cory.xie@gmail.com>


The original English text follows:

Two of my personal areas of interest and expertise are speculative multithreading (SpMT) and transactional memory (or TM). Both are techniques designed to make multi-core processors and parallel programming more amenable to developers. For several years, I was the co-founder of Strandera, a start-up that was developing speculative multithreading based on transactional memory and dynamic binary translation. Strandera was a wonderful opportunity for me to meet many of the leading academic and commercial researchers in these areas, and a number of x86 microprocessor architects, over the last 5 years.

A while back, my friend and colleague Andreas Stiller reported a rumor that Intel’s upcoming Haswell microarchitecture would feature transactional memory. This rumor proved to be true recently, when Intel confirmed that Haswell included Transactional Synchronization Extensions, or TSX. This was incredibly exciting for me, and I am happy to take advantage of my expertise and experience to discuss these innovations.

Academic Research

One of the most profound challenges for modern software developers is the emergence of multi-core processors. Over the last 10 years, individual CPU cores have grown faster, but applications must be optimized for parallel execution to take full advantage of the power and performance benefits of Moore’s Law. Previously, parallel programming techniques were mostly used in scientific computing or mission critical and performance sensitive workloads like databases and ERP systems. The multi-core trend meant that developers working on consumer applications were thrust into a world of multithreaded programming and concurrency for the first time.

Transactional memory is a software technique that simplifies writing concurrent programs. TM draws on concepts first developed and established in the database community, which has been dealing with concurrency for roughly 30 years. The idea is to declare a region of code as a transaction. A transaction executes and atomically commits all the results to memory (when the transaction succeeds) or aborts and cancels all the results (if the transaction fails). The key for TM is to provide the Atomicity, Consistency and Isolation qualities that make databases and SQL accessible to ordinary developers. These transactions can safely execute in parallel, which replaces existing painful and bug-prone techniques such as locks and semaphores. There is also a potential performance benefit. Locks are pessimistic and assume that the locking thread will write to the data, so the progress of other threads is blocked. Two transactions which access a locked value can proceed in parallel, and a rollback only occurs if one of the transactions writes to the data.
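To make the model concrete, here is a minimal sketch of a declared transaction using GCC's experimental software TM extension (compile with -fgnu-tm); the account structure and function are hypothetical illustrations, not anything from a vendor specification:

```c
struct account { long balance; };

/* A minimal sketch, assuming GCC's software TM support (-fgnu-tm).
   Either both writes commit atomically, or the transaction rolls
   back and neither write is visible to other threads. */
void transfer(struct account *from, struct account *to, long amount)
{
    __transaction_atomic {
        from->balance -= amount;
        to->balance   += amount;
    }
}
```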

Speculative multithreading (SpMT) focuses on using multiple hardware threads (or multiple cores) to work together and accelerate a single software thread. In essence, the single software thread is speculatively split into multiple threads that can be executed in parallel. Transactions are a natural fit for speculative threads, since they offer an easy way to roll back incorrect speculation. The key advantage of SpMT is that it enables existing single threaded code (which is the vast majority of all software) to reap the benefit of multi-core processors.

Both SpMT and especially TM have a long and distinguished history in academia, which I am exquisitely familiar with from my work at Strandera. TM was first popularized in a paper by Maurice Herlihy and Eliot Moss in 1993, well before multi-core processors. SpMT dates back to Guri Sohi’s Multiscalar project in 1995. Over time both of these areas became very popular for research, and nearly every computer architecture department has projects looking at these two topics. Around 2000, researchers began to look at TM as a technique for multi-core programming and many realized that it was a natural fit with SpMT.

Software TM has been available for a number of years, and many programmers have played around with TM libraries. However, the performance overhead is cripplingly large; 2-10× slowdowns are common, which precludes widespread use. For TM to be more than a research toy, hardware acceleration and support is necessary.

A technique related to transactional memory, called lock elision, was pioneered by Ravi Rajwar and James Goodman in 2001. Rather than enabling transactions, the idea was to design hardware to handle locks more efficiently. If two threads are predicted to only read locked data, the lock is removed (or elided) and the two threads can execute in parallel. Similar to TM, if a thread writes to the data, the processor must rollback and re-execute using locks for correctness. Lock elision has the advantage of being easily integrated with legacy code, whereas TM requires substantial changes to software.

Commercial Development

One of the earliest implementations of transactional memory was the gated store buffer used in Transmeta’s Crusoe and Efficeon processors. However, this was only used to facilitate speculative optimizations for binary translation, rather than any form of SpMT or exposing it directly to programmers. Azul Systems also implemented hardware TM to accelerate their Java appliances, but this was similarly hidden from outsiders. For all intents and purposes, TM mostly languished as merely another interesting idea in academia.

Fortunately, Sun Microsystems decided to implement hardware TM and a limited form of SpMT in the high-end Rock microprocessor. Marc Tremblay, the chief architect, discussed this in papers and talks at numerous conferences. The TM was fairly modest and Sun’s developers showed that it could be used for lock elision and more complex hybrid TM systems, where transactions are handled with a combination of hardware and software. Unfortunately, Rock was cancelled by Sun in 2009. While the product never made it to market, a number of prototype systems were available to researchers, which was a tremendous help to the community.

In 2009, AMD proposed the Advanced Synchronization Facility, a set of x86 extensions that provide a very limited form of hardware TM support. The goal was to provide hardware primitives that could be used for higher level synchronization such as software TM or lock-free algorithms. However, AMD has not announced whether ASF will be used in products, and if so, in what timeframe.

More recently, IBM announced in 2011 that Blue Gene/Q had hardware support for both TM and SpMT. The TM could be configured in two modes. The first is an unordered and single version mode where a write from one transaction causes a conflict with any transactions reading the same memory address. The second mode is for SpMT and is an ordered, multi-versioned TM. Speculative threads can have different versions of the same memory address, and the hardware keeps track of the age of each thread. The younger threads can access data from older threads (but not the other way around), and writes to the same address are based on thread order. In some cases, dependencies between threads can cause the younger versions to abort.

The most recent development is of course, Intel’s TSX and the implementation in Haswell. Haswell is the first x86 processor to feature hardware transactional memory. Intel’s TSX specification describes how the TM is exposed to programmers, but withholds details on the actual TM implementation. The first section of this article discusses the software interfaces for Intel’s TM. The second section builds on this to analyze the likely implementation details of Intel’s TM, based on my experience.

Intel’s Transactional Synchronization Extensions

The TSX specification provides two interfaces for programmers to take advantage of transactional memory. The first is Hardware Lock Elision (HLE), which maps very closely to the previous work by Rajwar and Goodman. This should be no surprise, as Ravi Rajwar has worked at Intel for quite some time. The second mode is Restricted Transactional Memory (RTM), which resembles classical TM proposals. Both use new instructions to take advantage of the underlying TM hardware, but the objectives are fairly different.

The underlying TM tracks the read-set and write-set of a transaction, at a 64B cache line granularity. The read-set and write-set are respectively all the cache lines that the transaction has read from, or written to during execution. A transaction encounters a conflict if a cache line in its read-set is written by another thread, or if a cache line in its write-set is read or written by another thread. For those familiar with TM terminology, this is known as strong isolation, since non-transactional memory accesses can cause a transaction to abort. Conflicts typically cause the transaction to abort, and false conflicts within a cache line can occur.
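Because tracking is at 64B granularity, logically unrelated variables that happen to share a cache line can conflict. A small illustration (the structure and field names are invented):

```c
/* Hypothetical illustration of false conflicts at 64B granularity.
   Both counters fit in one cache line, so a transaction writing
   'hits' conflicts with any other thread touching 'misses', even
   though the data is logically unrelated. */
struct stats_packed {
    long hits;
    long misses;                /* same 64B line as 'hits' */
};

/* Padding each counter to its own line avoids the false conflict,
   at the cost of a larger memory footprint. */
struct stats_padded {
    _Alignas(64) long hits;
    _Alignas(64) long misses;   /* now on a separate 64B line */
};
```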

TSX can have transactions nested inside each other, which is conceptually handled by flattening the nest into a single transaction. However, there is an implementation-specific limit to the amount of nesting; exceeding this limit will cause an abort. Any abort inside a nested transaction will abort all the nested transactions.

Transactions can only be used with write-back cacheable memory operations, but are available at all privilege levels. Not all x86 instructions can be used safely inside of a transaction. There are several x86 instructions that will cause any transaction (for HLE or RTM) to abort, in particular CPUID and PAUSE.

In addition, there are a number of instructions that may cause an abort on specific implementations. These instructions include x87 and MMX, mixed access to XMM and YMM registers, updates to non-status parts of EFLAGS, updates to segment, debug or control registers, ring transitions, cache and TLB control instructions, any non-writeback memory type accesses, processor state save, interrupts, I/O, virtualization (VMX), trusted execution (SMX) and several miscellaneous types. Note that this means that a context switch, which typically uses state saving instructions, will almost always abort a transaction. Generally, TSX implementations are intended to be upwards compatible with respect to instructions. For example, if one implementation adds support for VZEROUPPER in a transaction, that capability will not be removed in any future versions.

There are also run-time behaviors that may cause a transaction to abort. Most faults and traps will cause an abort, while synchronous exceptions and asynchronous events may cause an abort. Self-modifying code and accesses to non-write-back memory types may also cause aborts.

There are also implementation-specific limits to the size of transactions, and various microarchitectural buffers can be limiting factors. Generally, Intel provides no guarantees about transactional execution. While this is frustrating to programmers because it requires a non-transactional fallback path, this avoids any backwards compatibility issues in the future. It also avoids deadlocking the system if the programmer writes a transaction which cannot possibly succeed.

Restricted Transactional Memory

Ironically, RTM is the more powerful and disruptive of the two techniques, but is conceptually simpler. RTM exposes nested transactional memory to the programmer, which is a very powerful model for expressing parallelism. The downside is that it does require developing separate code for transactional execution, rather than inserting hints into existing code.

RTM is an option for developers, but not a requirement. Code that is outside of a transaction does not have to obey the restrictions mentioned previously. This is necessary, given all the requirements for transactions. It also gives Intel the luxury of introducing new x86 features, without worrying about TM implementation details or overhead.

There are three new instructions, XBEGIN, XEND and XABORT. The XBEGIN instruction starts a transaction and also provides a 16-bit or 32-bit offset to a fallback address. If the transaction aborts at any time, the thread will resume execution at the fallback address. Throughout the transactional execution, memory accesses are tracked using the read-set and write-set to detect conflicts.

If a conflict occurs during a transaction, it may trigger an abort. However, the actual outcome is implementation specific. For example, if two transactions conflict, both could be aborted. A more sensible approach is to abort only one of the transactions, but consistent rules to determine which one is successful would be helpful.

The XEND instruction indicates the end of a transaction. If the XEND instruction is the last in a nest of transactions (i.e. the outermost transaction), then it will attempt to atomically commit the changes. A successful commit will overwrite the old architectural state, including both registers and memory, with the new values generated during the transaction. From an ordering perspective, all the memory operations appear to have executed atomically. If the commit fails, then the changes to architectural state are aborted.

An abort simply rolls back any and all changes to architectural state that occurred during transactional execution, restoring the state prior to the first XBEGIN instruction (i.e. the outermost transaction). This includes any writes to the architectural registers or memory. Any abort within a nested set of transactions will cause the entire nest to abort. In addition, the abort updates the EAX register with an 8-bit value that potentially describes the cause of the abort. Last, the execution is redirected to the fallback point. If a nested set of transactions aborts, then the fallback address is taken from the first XBEGIN instruction (i.e. the outermost transaction).

The XABORT instruction immediately triggers an abort, just as if a commit had been unsuccessful. An explicit abort command is useful when the programmer can determine that a transaction is going to fail, without any help from the hardware. Aborting the transaction early can help reduce the performance and power penalty.
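These instructions are exposed through compiler intrinsics (GCC and ICC provide _xbegin(), _xend() and _xabort() in immintrin.h, compiled with -mrtm). The sketch below shows the basic RTM pattern, with hypothetical fallback_lock()/fallback_unlock() functions standing in for the required non-transactional path:

```c
#include <immintrin.h>   /* _xbegin/_xend/_xabort; compile with -mrtm */

/* Hypothetical conventional lock guarding the data. A production
   version would also read the lock inside the transaction and abort
   if it is held, so the two paths remain mutually exclusive. */
extern void fallback_lock(void);
extern void fallback_unlock(void);

void increment(long *counter)
{
    unsigned status = _xbegin();      /* XBEGIN with a fallback path */
    if (status == _XBEGIN_STARTED) {
        *counter += 1;                /* tracked in the write-set */
        _xend();                      /* XEND: attempt atomic commit */
    } else {
        /* Aborted: 'status' carries the 8-bit cause code written to
           EAX. Re-execute on the mandatory non-transactional path. */
        fallback_lock();
        *counter += 1;
        fallback_unlock();
    }
}
```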

Hardware Lock Elision

HLE is much easier to integrate with existing x86 software and is also backwards compatible with x86 microprocessors that do not have TSX support. Like its namesake, the goal is to improve the performance of synchronization by enabling simultaneous non-conflicting accesses to shared data.

HLE introduces two new instruction hint prefixes, called XACQUIRE and XRELEASE, that are used to denote the bounds of the lock elision. XACQUIRE is a prefix for instructions that acquire a lock and indicates the start of a region for lock elision. The memory address of the lock is added to the read-set of the transaction, but the processor does not actually write new data to the lock address.

The thread then enters transactional execution and continues on to the instructions inside the HLE region, adding memory accesses to the read-set and write-set. Any read to the lock address from within the HLE region will return the new data, but any read from another thread will return the old data. This enables many threads to simultaneously acquire a lock and make non-conflicting memory accesses to shared data.

XRELEASE is a prefix that is used for the instruction that releases the lock address, and it marks the end of the HLE region. Since the lock address was not written to at the start of the region, no further writes are needed to restore it to the original value. At the outermost XRELEASE, the processor attempts to commit the transaction; if successful then the HLE region was executed without acquiring or releasing the lock.

If a conflict occurs, then the processor will restore the architectural register state prior to XACQUIRE and discard any writes to memory from the HLE region. The thread will execute the region again, without HLE and with the standard pessimistic locking behavior. Any lock which crosses a cache line cannot be elided and will automatically trigger re-execution without HLE.

There is also an implementation-specific limit to the number of locks that can be elided simultaneously; any further locks will be executed normally. While nested HLE is available, recursive locking is not supported. Writing to the lock address from within the HLE region will cause an HLE abort.

Another potential source of HLE aborts is locks that reside on the same cache line. Since Intel’s TM can only track at cache line granularity, two locks on the same cache line will cause aliasing and unnecessary HLE aborts.

Hardware without HLE support will simply ignore the hint prefixes and also execute the region with standard behavior. One of the key benefits is that HLE is compatible with existing lock-based programming and takes little developer effort to implement.
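In practice, HLE regions can be emitted from C with GCC's __atomic builtins, which accept HLE flags (compile with -mhle). A minimal elided spin lock might look like the following sketch; on pre-TSX parts the prefixes are ignored and this behaves as an ordinary spin lock:

```c
#include <immintrin.h>   /* for _mm_pause() */

static int lock_word;    /* 0 = free, 1 = held */

void elided_lock(void)
{
    /* XACQUIRE-prefixed xchg: the lock address joins the read-set,
       but the write is elided unless the transaction fails. */
    while (__atomic_exchange_n(&lock_word, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();     /* abort a failed elision and spin normally */
}

void elided_unlock(void)
{
    /* XRELEASE-prefixed store: end of the HLE region; the commit of
       the elided region is attempted here. */
    __atomic_store_n(&lock_word, 0,
                     __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```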

Haswell Background

While the Intel TSX documentation is fairly precise in describing the semantics for both RTM and HLE, critical details are not available for the underlying TM and the specific implementation in Haswell. However, there is sufficient information to speculate about the nature of Haswell’s TM. To review what is known, Haswell implements an unordered, single version, nested TM with strong isolation. The TM tracks the read-set and write-set at a fixed 64B cache line granularity.

As a starting assumption, Haswell’s memory hierarchy probably resembles Sandy Bridge in broad strokes, with roughly 32KB L1 and 256KB L2 caches per core (shared by two threads), and an L3 cache shared by all cores. Haswell almost certainly extends Intel’s existing cache behavior and MESIF coherency protocol to support transactional memory semantics. It is far too risky to totally redefine the coherency protocol, which would also vastly increase the challenge of maintaining full compatibility. It is also entirely out of character for Intel, which tends to favor incremental enhancements rather than radical redesigns for mainstream x86 products.

Intel’s server teams take the existing CPU core and build customized blocks including the L3 cache, QPI logic, I/O, memory controllers and power management, tied together with a high performance interconnect fabric. The Haswell coherency changes are very likely restricted to the L1 and L2 caches, strictly within the core. This has the advantage that Intel’s server group can leverage the work in Haswell with minimal additional investment and effort. Most importantly, restricting changes to the core avoids extra validation, which is one of the largest contributors to cost and schedule for server processors.

Haswell’s Transactional Memory

Based on the assumptions above, we can sketch out the most likely implementation details for Haswell’s transactional memory. The most important missing details relate to TM version management, conflict handling and the commit and abort operations.

Haswell’s transactional memory is most likely a deferred update system using the per-core caches for transactional data and register checkpoints. Specifically, the tags for each cache line in the L1D and L2 probably contain bits that indicate whether the line belongs to the read-set (RS) or write-set (WS) for the two threads that can execute on a Haswell core. A store during a transaction will simply write to the cache line, and the shared L3 cache holds the original data (which may require an update to the L3 for the first write in a transaction to a Modified line). This is consistent with the shared and inclusive L3 cache that has been used since Nehalem. To avoid data corruption, any transactional data (i.e. lines in the RS or WS) must stay in the L1D or L2 and not be evicted to the L3 or memory. The architectural registers are checkpointed and stored on-chip, and the transaction can read and write from registers freely.
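As a purely illustrative model of this speculation (every name below is invented, not Intel's design), the per-line tag state might look like:

```c
/* Invented, illustrative model of the speculated tag state: one
   read-set (RS) and one write-set (WS) bit per SMT thread, alongside
   the MESIF coherency state. Not Intel's actual design. */
enum mesif_state { MESIF_M, MESIF_E, MESIF_S, MESIF_I, MESIF_F };

struct cache_line_tag {
    unsigned long long tag;      /* address tag */
    enum mesif_state   state;    /* coherency state */
    unsigned           rs : 2;   /* RS bit for each of 2 SMT threads */
    unsigned           ws : 2;   /* WS bit for each of 2 SMT threads */
};
```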

Alternatively, it is possible that transactions are restricted to Haswell’s L1D cache. This would simplify the design, but make the TM much less robust. The L1D is really not associative enough for modest sized transactions; with a 32KB, 8-way cache, associativity evictions can occur with as few as 9 cache lines mapping to the same set, even though there are about 256 lines available per thread. Using the L2 pretty much eliminates most associativity evictions, and each thread will have about 2.2K lines on average and a hard maximum of around 4.5K lines. Again, Intel is likely to be using both the L1D and the L2 for transactional data, to avoid the potential associativity problems.

Conflict Detection

The advantage of the cache-based system I outlined above is that Haswell’s conflict detection can be handled through the existing coherency protocol with only minor enhancements to the memory ordering buffers, L1D and L2 caches. A conflict can occur in exactly three ways, all of which can be easily detected.

The first case to consider is an RS cache line that is written by another core. In Intel’s MESIF protocol, writing to a potentially shared cache line requires invalidating any copies in other private caches. The other core will send a read-for-ownership request to the shared L3 cache, which will check the snoop filter and transmit an invalidate to any cache with a copy of the line. If an L1D or L2 cache receives an invalidate for an RS cache line, this indicates that another thread is attempting to write and a conflict has occurred.

The second case to consider is a WS cache line that is accessed (read or write) by another core. For the transaction to write to the cache line in the first place, no other core can have a copy, which is standard semantics for the Modified coherency state. However, the inclusive L3 cache will retain the out-of-date, original version of the line prior to the transaction. Thus if another core attempts to access the WS cache line, it will miss in the local L1D and L2 caches, and probe the L3. However, the L3 will access the snoop filter and then send the read (or read-for-ownership) request to the core that has the cache line in the WS. If an L1D or L2 cache receives any read request (or an invalidate) for a line in the WS, then a conflict has occurred.

Those two cases handle a different core interfering with a transaction. The third possibility is that another thread on the same Haswell core, which shares the L1D and L2 caches, interferes with the transaction. In this scenario, the other thread will hit in the shared L1D or L2 cache, rather than triggering any coherency traffic through the L3 cache. However, this can be handled in the load pipeline by checking whether the cache line is in the WS for the other thread, and in the store pipeline by checking whether the cache line is in the RS or WS for the other thread.
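All three cases reduce to a few bit tests against the invented tag model sketched earlier; the following is illustrative pseudologic for reasoning about the design, not Haswell's actual circuitry:

```c
#include <stdbool.h>

/* Uses the invented 'struct cache_line_tag' from the earlier sketch. */

/* Cases 1 and 2: a coherency request arriving from another core.
   An invalidate (write intent) conflicts with any RS or WS line; a
   plain read conflicts only with a WS line. */
bool snoop_conflicts(const struct cache_line_tag *line, bool invalidate)
{
    if (invalidate)
        return line->rs != 0 || line->ws != 0;
    return line->ws != 0;
}

/* Case 3: the sibling SMT thread sharing the L1D/L2. A load checks
   the sibling's WS bit; a store checks its RS and WS bits. */
bool sibling_conflicts(const struct cache_line_tag *line, int me, bool store)
{
    int  other = 1 - me;
    bool rs = (line->rs >> other) & 1;
    bool ws = (line->ws >> other) & 1;
    return store ? (rs || ws) : ws;
}
```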

The last precaution is that if any RS or WS cache line is evicted to the L3 or memory, the transaction must be aborted. In the case of a WS line, this is to prevent the transactional data from overwriting the original copy that is held in the L3 cache. Strictly speaking, RS lines could be held in the L3 cache, but this would complicate matters considerably and it is much easier to simply abort the transaction. It is highly likely that Intel has modified the cache line replacement and eviction policies in the L2 to avoid evicting transactional data. Another advantage of using the L2 cache for transactions is that the L2 eviction algorithms can tolerate significantly more latency than the L1D cache, where they are on the critical path for any L1D miss.

Commit and Abort

To commit a transaction in the system outlined above is a straightforward matter. The L1D and L2 cache controllers make sure that any cache line in the WS is in the Modified state and zero out the WS bits. Similarly, any cache line that is in the RS (but not the WS) must be in the Shared state and the RS bits are zeroed. The old register file checkpoint is removed, and the existing contents of the register file become the architectural state.

The coherency-based conflict detection scheme can identify conflicts early, as soon as they occur (as opposed to at commit time). Once the conflict has been detected early, it is most likely that the transaction is aborted immediately. However, other problems might cause deferred aborts in some implementations.

Aborting transactions is even simpler than committing. The cache controllers change all the WS lines to the Invalid state and zero out the WS bits. The controllers must also zero out the RS bits, but it is not necessary to invalidate the cache lines (although Haswell might do this). Last, the architectural register file is restored from the old checkpoint.
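Expressed against the same invented model, commit and abort amount to a state change plus clearing the tracking bits (again, illustrative only; the register checkpoint is handled separately by the core):

```c
/* Uses the invented 'struct cache_line_tag' and mesif_state from the
   earlier sketch. */
void commit_line(struct cache_line_tag *line, int me)
{
    if ((line->ws >> me) & 1)
        line->state = MESIF_M;    /* speculative data becomes real */
    else if ((line->rs >> me) & 1)
        line->state = MESIF_S;    /* read-only line stays shared */
    line->rs &= ~(1u << me);
    line->ws &= ~(1u << me);
}

void abort_line(struct cache_line_tag *line, int me)
{
    if ((line->ws >> me) & 1)
        line->state = MESIF_I;    /* discard speculative writes */
    line->rs &= ~(1u << me);      /* RS lines may remain valid */
    line->ws &= ~(1u << me);
}
```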

Intel’s Restricted Transactional Memory more or less provides a direct interface to the underlying TM in Haswell using the appropriate instructions. Hardware Lock Elision probably uses the same basic techniques, but requires a little extra buffering. Any locks which are elided are probably pinned in Haswell’s store buffer (or perhaps a small dedicated structure that is accessed in parallel). Thus any access within an HLE region reads the new lock value, while the old value is kept in the cache so that other threads can simultaneously execute.

TSX Analysis

The success of TSX depends entirely on how well Haswell works. Fortunately, Intel’s transactional memory in Haswell seems to be straightforward, logical and relatively simple; as long as transactions can span both the L1D and L2 caches, the performance should be good. With the benefit of several implementations in hindsight (Azul, Sun’s Rock, Transmeta), TSX seems to have avoided many problems. Previous hardware TM systems were plagued by associativity conflicts, which Intel probably dealt with by using the L2 cache for transactional data. TSX abort operations can return a code that indicates the proximate cause, to help diagnose hardware and software bugs, and debugger support is an integral part of the specification. Building Hardware Lock Elision on top of the TM system is a clever move that helps existing code easily reap performance gains with minimal effort, which enables developers to test out TSX before committing to writing code for Restricted Transactional Memory.

The fact that RTM has no guarantees of forward progress and requires a non-transactional fallback path is a bit of a mixed bag. On one hand, it creates an extra burden for programmers, which is quite unattractive. However, it also neatly sidesteps the problem of dealing with poorly written transactional code and compatibility. Someone will eventually write horrible transactional code that deadlocks any feasible system. It is unreasonable to force programmers to deal with subtle details like cache associativity conflicts, but the underlying hardware has finite resources. Without some form of virtualized transactional memory, there is always a concern about transactions that cannot succeed.

Generally, Intel’s TSX should be helpful for improving the programmability and scalability for concurrent workloads. Even with a modest number of threads, locks can easily limit the benefits from additional cores. While that is not a problem for 2-4 core processors, it is a much bigger factor going forward. Extremely popular applications such as MySQL have well-known locking issues that HLE or RTM could significantly alleviate.

There are immediate applications in a few areas particularly focused on low level software systems. First, as Azul and Sun already demonstrated, lock elision and TM are powerful tools for scalable garbage collection in Java, and other dynamic languages. Additionally, all the existing research on software TM systems can take advantage of Intel’s RTM to reduce overhead. Research from the University of Toronto has shown that software TM can provide a mild performance gain for a Quake game server (compared to lock-based versions), and Intel’s RTM should have far better results. It will be quite fascinating to see what various software vendors can do with the building blocks that Intel has provided.

Transactional Memory Evolution

Looking forward, there are a number of potential improvements for Intel’s transactional memory. The first and most obvious is moving towards a multi-versioned TM that is more amenable to speculative multithreading. It was somewhat disappointing that Intel did not go down this path initially, even though it is potentially more complex. Perhaps IBM will be able to demonstrate enough benefits with Blue Gene/Q to motivate more sophisticated TM systems. Other refinements might include partial aborts for nested transactions and programmer control over handling conflicting transactions.

The other avenue for Intel is proliferating transactional memory throughout the x86 product line. Today that is mainly SoCs targeted at mobile devices and the upcoming many-core Knight’s Corner. It is hard to see an immediate application for the extremely power sensitive derivatives of Atom, but that likely depends on the adoption of TSX for mainstream products. If TSX can demonstrate a compelling benefit, perhaps for running Microsoft’s .Net languages or Android’s Dalvik, future mobile CPUs might very well ship with transactional memory support.

Transactional memory is a natural and obvious fit for future many-core products. Intel’s RTM is undeniably an easier programming model than many alternatives and can augment existing languages like C++ through libraries and extensions. Moreover, the potential to significantly improve scalability is incredibly attractive for a design with dozens of cores and hundreds of threads. For example, a presentation from Doug Carmean emphasized the challenge in handling contended locks in the early versions of Larrabee. TSX can potentially remove lock contention as an issue for the majority of programs, except where a true dependency exists and serialization is required for correctness. Overall, it would be surprising if the successor to Knight’s Corner did not support some version of TSX.

Beyond Intel, the rest of the industry seems likely to adopt variations of transactional memory in the coming years. Oracle has publicly stated that future SPARC processors will include memory versioning (another term for TM). IBM is already shipping Blue Gene/Q, although it is unclear when (or if) POWER or zArchitecture processors will adopt TM. ARM’s position on TM is unclear, but may evolve once the ARMv8 extension has been finalized.

AMD is continuing to work on ASF, which has several substantive differences from TSX; privileged instructions, forward progress guarantees and an absence of register rollback on abort are just a few examples. AMD and Intel face a choice: fragmenting the x86 ecosystem with incompatible extensions, or harmonizing ASF and TSX. From an industry standpoint, the latter is clearly preferable, since there is a risk that software vendors may be unwilling to tolerate such incompatibilities. However, it could take years for Intel and AMD to align their implementations.

Intel’s adoption of transactional memory marks a new era for computer architecture. Haswell represents a nearly two decade journey from the initial popularization of transactional memory in academia, to mainstream adoption in a microprocessor. For those of us who have worked on transactional memory, this is an exceptionally exciting moment. It is a time to look back on the contributions that brought the industry here and begin to ponder the opportunities that may unfold in the coming years.
