Please indicate the source: http://blog.csdn.net/gaoxiangnumber1
6.2.1 Locality of References to Program Data
A function such as sumvec that visits each element of a vector sequentially is said to have a stride-1 reference pattern (with respect to the element size). We refer to stride-1 reference patterns as sequential reference patterns. Visiting every kth element of a contiguous vector is called a stride-k reference pattern. In general, as the stride increases, the spatial locality decreases.
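A minimal sketch of such functions, assuming the textbook's sumvec sums an integer vector (the stride-k variant is mine, added for contrast):

```c
#include <stddef.h>

/* Stride-1 reference pattern: successive accesses touch adjacent
   memory locations, giving good spatial locality. */
int sumvec(const int *v, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Stride-k reference pattern: visits every kth element. As k grows,
   accesses land farther apart and spatial locality degrades. */
int sumvec_stride(const int *v, size_t n, size_t k)
{
    int sum = 0;
    for (size_t i = 0; i < n; i += k)
        sum += v[i];
    return sum;
}
```

Both loops do the same kind of work; only the distance between consecutive memory accesses differs.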
6.2.2 Locality of Instruction Fetches
In Figure 6.19 the instructions in the body of the for loop are executed in sequential memory order, and thus the loop enjoys good spatial locality. Since the loop body is executed multiple times, it also enjoys good temporal locality.
6.2.3 Summary of Locality
Simple rules for evaluating the locality in a program:
1. Programs that repeatedly reference the same variables enjoy good temporal locality.
2. For programs with stride-k reference patterns, the smaller the stride, the better the spatial locality. Programs with stride-1 reference patterns have good spatial locality. Programs that hop around memory with large strides have poor spatial locality.
3. Loops have good temporal and spatial locality with respect to instruction fetches. The smaller the loop body and the greater the number of loop iterations, the better the locality.
6.3 The Memory Hierarchy
6.3.1 Caching in the Memory Hierarchy
A cache is a small, fast storage device that acts as a staging area for the data objects stored in a larger, slower device. The process of using a cache is known as caching.
The central idea of a memory hierarchy is that for each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k + 1. That is, each level in the hierarchy caches data objects from the next lower level.
Data is always copied back and forth between level k and level k + 1 in block-sized transfer units. While the block size is fixed between any particular pair of adjacent levels in the hierarchy, other pairs of levels can have different block sizes. Transfers between L1 and L0 typically use one-word blocks. Transfers between L5 and L4 use blocks with hundreds or thousands of bytes. Devices lower in the hierarchy have longer access times, and thus tend to use larger block sizes in order to amortize these longer access times.
Cache Hit: When a program needs a particular data object d from level k + 1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, then we have what is called a cache hit. The program reads d directly from level k, which by the nature of the memory hierarchy is faster than reading d from level k + 1.
Cache Misses: If the data object d is not cached at level k, then we have what is called a cache miss. When there is a miss, the cache at level k fetches the block containing d from the cache at level k + 1, possibly overwriting an existing block if the level k cache is already full.
This process of overwriting an existing block is known as replacing or evicting the block. The block that is evicted is sometimes referred to as a victim block. The decision about which block to replace is governed by the cache’s replacement policy. A cache with a least-recently used (LRU) replacement policy would choose the block that was last accessed the furthest in the past.
After the cache at level k has fetched the block from level k + 1, the program can read d from level k as before.
Kinds of Cache Misses
If the cache at level k is empty, then any access of any data object will miss. An empty cache is sometimes referred to as a cold cache, and misses of this kind are called compulsory misses or cold misses. Cold misses are important because they are often transient events that might not occur in steady state, after the cache has been warmed up by repeated memory accesses.
Whenever there is a miss, the cache at level k must implement some placement policy that determines where to place the block it has retrieved from level k + 1. The most flexible placement policy is to allow any block from level k + 1 to be stored in any block at level k. For caches high in the memory hierarchy (close to the CPU) that are implemented in hardware and where speed is at a premium, this policy is usually too expensive to implement because randomly placed blocks are expensive to locate. So hardware caches typically implement a more restricted placement policy that restricts a particular block at level k + 1 to a small subset of the blocks at level k. For example, a block i at level k + 1 must be placed in block (i mod 4) at level k.
Restrictive placement policies of this kind lead to a type of miss known as a conflict miss, in which the cache is large enough to hold the referenced data objects, but because they map to the same cache block, the cache keeps missing.
Programs often run as a sequence of phases (e.g., loops) where each phase accesses some reasonably constant set of cache blocks. For example, a nested loop might access the elements of the same array over and over again. This set of blocks is called the working set of the phase. When the size of the working set exceeds the size of the cache, the cache will experience what are known as capacity misses. In other words, the cache is just too small to handle this particular working set.
The essence of the memory hierarchy is that the storage device at each level is a cache for the next lower level. At each level, some form of logic must manage the cache. That is, something has to partition the cache storage into blocks, transfer blocks between different levels, decide when there are hits and misses, and then deal with them. The logic that manages the cache can be hardware, software, or a combination of the two.
6.3.2 Summary of Memory Hierarchy Concepts
6.4 Cache Memories
Because of the increasing gap between CPU and main memory, system designers insert a small SRAM cache memory, called an L1 cache (Level 1 cache) between the CPU register file and main memory, as shown in Figure 6.26. The L1 cache can be accessed nearly as fast as the registers, typically in 2 to 4 clock cycles.
As the performance gap between the CPU and main memory continued to increase, system designers responded by inserting an additional larger cache, called an L2 cache, between the L1 cache and main memory, that can be accessed in about 10 clock cycles. Some modern systems include an additional even larger cache, called an L3 cache, which sits between the L2 cache and main memory in the memory hierarchy and can be accessed in 30 or 40 cycles.
6.4.1 Generic Cache Memory Organization
Consider a computer system where each memory address has m bits that form M = 2^m unique addresses. As illustrated in Figure 6.27(a), a cache for such a machine is organized as an array of S = 2^s cache sets. Each set consists of E cache lines. Each line consists of a data block of B = 2^b bytes, a valid bit that indicates whether or not the line contains meaningful information, and t = m − (b + s) tag bits that uniquely identify the block stored in the cache line.
In general, a cache’s organization can be characterized by the tuple (S, E, B, m). The size (or capacity) of a cache, C, is stated in terms of the aggregate size of all the blocks. The tag bits and valid bit are not included. C = S × E × B.
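As a sanity check on these formulas, the derived parameters can be computed directly; the helper names below are mine, and the test values use the example cache (S, E, B, m) = (4, 1, 2, 4) from Section 6.4.2 below:

```c
/* s = log2(S) set index bits, b = log2(B) block offset bits,
   t = m - (s + b) tag bits, capacity C = S * E * B bytes.
   Assumes S and B are powers of 2. */
static int log2i(unsigned x)
{
    int n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

int tag_bits(int S, int B, int m) { return m - (log2i(S) + log2i(B)); }
int capacity(int S, int E, int B) { return S * E * B; }
```

For the (4, 1, 2, 4) cache this gives s = 2, b = 1, t = 1, and C = 8 bytes.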
When the CPU is instructed by a load instruction to read a word from address A of main memory, it sends the address A to the cache. If the cache is holding a copy of the word at address A, it sends the word immediately back to the CPU.
How does the cache know whether it contains a copy of the word at address A?
The cache is organized so that it can find the requested word by inspecting the bits of the address, similar to a hash table with a hash function.
The parameters S and B induce a partitioning of the m address bits into the three fields shown in Figure 6.27(b). The s set index bits in A form an index into the array of S sets. The first set is set 0, the second set is set 1, and so on. When interpreted as an unsigned integer, the set index bits tell us which set the word must be stored in. Once we know which set the word must be contained in, the t tag bits in A tell us which line in the set contains the word. A line in the set contains the word if and only if the valid bit is set and the tag bits in the line match the tag bits in the address A. Once we have located the line identified by the tag in the set identified by the set index, then the b block offset bits give us the offset of the word in the B-byte data block.
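The three fields can be extracted with shifts and masks; a sketch (function names are mine):

```c
/* Partition an address into tag, set index, and block offset,
   given s set index bits and b block offset bits. */
unsigned block_offset(unsigned addr, int s, int b)
{
    (void)s;                         /* the offset needs only b bits */
    return addr & ((1u << b) - 1);
}

unsigned set_index(unsigned addr, int s, int b)
{
    return (addr >> b) & ((1u << s) - 1);
}

unsigned tag(unsigned addr, int s, int b)
{
    return addr >> (s + b);
}
```

With the (4, 1, 2, 4) example cache of Section 6.4.2 (s = 2, b = 1), address 13 = 1101₂ yields tag 1, set index 2, and block offset 1.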
6.4.2 Direct-Mapped Caches
Caches are grouped into different classes based on E, the number of cache lines per set. A cache with exactly one line per set (E = 1) is known as a direct-mapped cache.
Suppose we have a system with a CPU, a register file, an L1 cache, and a main memory. When the CPU executes an instruction that reads a memory word w, it requests the word from the L1 cache. If the L1 cache has a cached copy of w, then we have an L1 cache hit, and the cache quickly extracts w and returns it to the CPU. Otherwise, we have a cache miss, and the CPU must wait while the L1 cache requests a copy of the block containing w from the main memory. When the requested block finally arrives from memory, the L1 cache stores the block in one of its cache lines, extracts word w from the stored block, and returns it to the CPU.
The process that a cache goes through of determining whether a request is a hit or a miss and then extracting the requested word consists of three steps: (1) set selection, (2) line matching, and (3) word extraction.
Set Selection in Direct-Mapped Caches
The cache extracts the s set index bits from the address for w. These bits are interpreted as an unsigned integer that corresponds to a set number.
Line Matching in Direct-Mapped Caches
Now that we have selected set i in the previous step, the next step is to determine whether a copy of the word w is stored in one of the cache lines contained in set i.
In a direct-mapped cache, this is fast because there is exactly one line per set. A copy of w is contained in the line if and only if the valid bit is set and the tag in the cache line matches the tag in the address of w.
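The hit test is therefore one valid-bit check plus one tag comparison; a sketch (the struct layout and names are mine):

```c
struct cache_line {
    int valid;      /* does this line hold meaningful data? */
    unsigned tag;   /* identifies which block is cached here */
};

/* A direct-mapped set has exactly one line, so the hit test is a
   single comparison. */
int dm_hit(const struct cache_line *line, unsigned addr_tag)
{
    return line->valid && line->tag == addr_tag;
}
```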
Word Selection in Direct-Mapped Caches
Once we have a hit, we know that w is somewhere in the block. This last step determines where the desired word starts in the block. The block offset bits provide us with the offset of the first byte in the desired word, which is 100₂ = 4 in the running example.
Line Replacement on Misses in Direct-Mapped Caches
If the cache misses, then it needs to retrieve the requested block from the next level in the memory hierarchy and store the new block in one of the cache lines of the set indicated by the set index bits.
If the set is full of valid cache lines, then one of the existing lines must be evicted. For a direct-mapped cache, the current line is replaced by the newly fetched line.
Suppose we have a direct-mapped cache described by (S, E, B, m) = (4, 1, 2, 4). That is, the cache has four sets, one line per set, 2 bytes per block, and 4-bit addresses. We will also assume that each word is a single byte.
The concatenation of the tag and index bits uniquely identifies each block in memory. For example, block 0 consists of addresses 0 and 1, block 1 consists of addresses 2 and 3, and so on.
Since there are eight memory blocks but only four cache sets, multiple blocks map to the same cache set. For example, blocks 0 and 4 both map to set 0, blocks 1 and 5 both map to set 1, and so on.
Blocks that map to the same cache set are uniquely identified by the tag. For example, block 0 has a tag bit of 0 while block 4 has a tag bit of 1, block 1 has a tag bit of 0 while block 5 has a tag bit of 1, and so on.
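This mapping is just block-number arithmetic: with S sets, block i lands in set (i mod S), and the tag is the remaining high part, i / S. A sketch:

```c
/* With S sets, memory block i maps to set (i mod S); the tag is the
   part of the block number left over, i / S. */
int block_to_set(int i, int S) { return i % S; }
int block_to_tag(int i, int S) { return i / S; }
```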
Initially, the cache is empty (i.e., each valid bit is zero). Each row in the table represents a cache line. The first column indicates the set that the line belongs to. The next three columns represent the actual bits in each cache line.
1. Read word at address 0.
Since the valid bit for set 0 is zero, this is a cache miss. The cache fetches block 0 from memory and stores the block in set 0. Then the cache returns m[0] (the contents of memory location 0) from block[0] of the newly fetched cache line.
2. Read word at address 1.
This is a cache hit. The cache immediately returns m[1] from block[1] of the cache line. The state of the cache does not change.
3. Read word at address 13.
Since the cache line in set 2 is not valid, this is a cache miss. The cache loads block 6 into set 2 and returns m[13] from block[1] of the new cache line.
4. Read word at address 8.
This is a miss. The cache line in set 0 is indeed valid, but the tags do not match. The cache loads block 4 into set 0 and returns m[8] from block[0] of the new cache line.
5. Read word at address 0.
This is another miss, due to the fact that we just replaced block 0 during the previous reference to address 8. This kind of miss, where we have plenty of room in the cache but keep alternating references to blocks that map to the same set, is an example of a conflict miss.
Conflict misses in direct-mapped caches typically occur when programs access arrays whose sizes are a power of 2. For example, consider a function that computes the product of two vectors:
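The code for this function is missing from this copy; a sketch consistent with the description that follows (8-element float vectors, sum kept in a register) is:

```c
float dotprod(float x[8], float y[8])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}
```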
Suppose that floats are 4 bytes, that x is loaded into the 32 bytes of contiguous memory starting at address 0, and that y starts immediately after x at address 32. Suppose that a block is 16 bytes (big enough to hold four floats) and that the cache consists of two sets, for a total cache size of 32 bytes. We will assume that the variable sum is actually stored in a CPU register and thus does not require a memory reference.
Given these assumptions, each x[i] and y[i] will map to the identical cache set:
At run time, the first iteration of the loop references x[0], a miss that causes the block containing x[0]–x[3] to be loaded into set 0. The next reference is to y[0], another miss that causes the block containing y[0]–y[3] to be copied into set 0, overwriting the values of x that were copied in by the previous reference. During the next iteration, the reference to x[1] misses, which causes the x[0]–x[3] block to be loaded back into set 0, overwriting the y[0]–y[3] block. So now we have a conflict miss, and in fact each subsequent reference to x and y will result in a conflict miss as we thrash back and forth between blocks of x and y. The term thrashing describes any situation where a cache is repeatedly loading and evicting the same sets of cache blocks.
Even though the program has good spatial locality and we have room in the cache to hold the blocks for both x[i] and y[i], each reference results in a conflict miss because the blocks map to the same cache set.
One solution is to put B bytes of padding at the end of each array. For example, instead of defining x to be float x[8], we define it to be float x[12]. Assuming y starts immediately after x in memory, we get a new mapping of array elements to sets.
With the padding at the end of x, x[i] and y[i] now map to different sets, which eliminates the thrashing conflict misses.
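The effect of the padding can be checked by computing set indices directly: with 16-byte blocks and 2 sets, the set index is bit 4 of the byte address. The addresses below follow the example's assumptions (x at address 0, y immediately after x):

```c
/* Set index for a 2-set cache with 16-byte blocks: the single bit
   just above the 4-bit block offset. */
int set_of(unsigned byte_addr) { return (byte_addr >> 4) & 1; }

unsigned x_addr(int i)       { return 0 + 4 * i; }   /* 4-byte floats at 0  */
unsigned y_addr_nopad(int i) { return 32 + 4 * i; }  /* y after float x[8]  */
unsigned y_addr_pad(int i)   { return 48 + 4 * i; }  /* y after float x[12] */
```

Without padding, set_of agrees for x[i] and y[i] at every i; with padding, it always differs.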
Why do caches use the middle bits for the set index instead of the high-order bits?
If the high-order bits are used as an index, then some contiguous memory blocks will map to the same cache set. For example, the first four blocks map to the first cache set, the second four blocks map to the second set, and so on. If a program has good spatial locality and scans the elements of an array sequentially, then the cache can only hold a block-sized chunk of the array at any point in time. This is an inefficient use of the cache. Contrast this with middle-bit indexing, where adjacent blocks always map to different cache lines.
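To make this concrete, suppose 16 memory blocks and 4 cache sets (numbers are mine). With middle-bit indexing the set index is the low 2 bits of the block number; with high-order indexing it is the top 2 bits:

```c
/* 16 memory blocks (4-bit block numbers), 4 cache sets. */
int set_middle(int block) { return block & 3; }   /* low 2 bits: block mod 4 */
int set_high(int block)   { return block >> 2; }  /* top 2 of 4 block bits   */
```

A sequential scan of blocks 0, 1, 2, 3 hits four different sets under middle-bit indexing but piles into set 0 under high-order indexing.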
6.4.3 Set Associative Caches
A cache with 1 < E < C/B is often called an E-way set associative cache.
Set Selection in Set Associative Caches
Set selection is identical to a direct-mapped cache, with the set index bits identifying the set.
Line Matching and Word Selection in Set Associative Caches
The cache must check the tags and valid bits of multiple lines in order to determine whether the requested word is in the set.
An associative memory is an array of (key, value) pairs that takes a key as input and returns a value from one of the (key, value) pairs matching that key. We can think of each set in a set associative cache as a small associative memory where the keys are the concatenation of the tag and valid bits, and the values are the contents of a block.
Any line in the set can contain any of the memory blocks that map to that set. So the cache must search each line in the set, searching for a valid line whose tag matches the tag in the address. If the cache finds such a line, then we have a hit and the block offset selects a word from the block, as before.
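The search can be sketched as a scan over the E lines of the set (struct and function names are mine; real hardware compares all E tags in parallel rather than sequentially):

```c
struct line {
    int valid;
    unsigned tag;
};

/* Return the index of the matching line in the set, or -1 on a miss. */
int set_assoc_match(const struct line set[], int E, unsigned addr_tag)
{
    for (int i = 0; i < E; i++)
        if (set[i].valid && set[i].tag == addr_tag)
            return i;
    return -1;   /* miss: no valid line with a matching tag */
}
```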
Line Replacement on Misses in Set Associative Caches
The simplest replacement policy is to choose the line to replace at random. A least-frequently-used (LFU) policy will replace the line that has been referenced the fewest times over some past time window. A least-recently-used (LRU) policy will replace the line that was last accessed the furthest in the past. All of these policies require additional time and hardware.
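An LRU policy can be sketched with per-line timestamps (a software model with names of my choosing; hardware uses cheaper approximations of this bookkeeping):

```c
/* LRU bookkeeping for one set of NUM_LINES lines: record a logical
   time of last access per line; the victim is the oldest one. */
#define NUM_LINES 4

static unsigned last_used[NUM_LINES];
static unsigned clock_tick = 0;

void touch(int line)    /* call on every access to a line */
{
    last_used[line] = ++clock_tick;
}

int lru_victim(void)    /* line accessed furthest in the past */
{
    int victim = 0;
    for (int i = 1; i < NUM_LINES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}
```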
6.4.4 Fully Associative Caches
A fully associative cache consists of a single set (i.e., E = C/B) that contains all of the cache lines.
Set Selection in Fully Associative Caches
Set selection is trivial because there is only one set: there are no set index bits in the address, which is partitioned into only a tag and a block offset.
Line Matching and Word Selection in Fully Associative Caches
Because the cache circuitry must search for many matching tags in parallel, it is difficult and expensive to build an associative cache that is both large and fast. So fully associative caches are only appropriate for small caches, such as the translation lookaside buffers (TLBs) in virtual memory systems that cache page table entries (Section 9.6.2).