Translation Lookaside Buffer (TLB)

 the region offset of the linear page table, and the offset of the PTE within the linear page table. In the last summand, the factor of 8 is used because each PTE has a size of 8 bytes, and the modulo operation truncates away the most significant address bits that are not mapped by the page table. The VHPT walker then attempts to read the PTE stored at this address. Because this is again a virtual address, the CPU goes through the normal virtual-to-physical address translation. If a TLB entry exists for va3, the translation succeeds and the walker can read the PTE from physical memory and install the PTE for va. However, if the TLB entry for va3 is also missing, the walker gives up and requests assistance by raising a VHPT TRANSLATION FAULT.
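To make this calculation concrete, the sketch below computes the address at which the VHPT walker looks for the PTE of a virtual address va in the per-region linear page table. This is not the kernel's code: PT_BASE (the region offset of the linear page table) and PT_SIZE (the amount of virtual memory occupied by the linear page table) are hypothetical constants, and the calculation assumes 8-Kbyte pages and 8-byte PTEs.

#include <stdint.h>

#define PAGE_SHIFT  13                      /* 8-Kbyte pages */
#define PTE_SIZE    8                       /* each PTE is 8 bytes */
#define PT_BASE     0x1f00000000000000UL    /* hypothetical region offset of the linear page table */
#define PT_SIZE     (1UL << 33)             /* hypothetical size of the linear page table */

/* Return the virtual address at which the VHPT walker looks for the PTE of va. */
uint64_t pte_addr(uint64_t va)
{
        uint64_t vrn = va >> 61;                                  /* virtual region number */
        uint64_t vpn = (va & ((1UL << 61) - 1)) >> PAGE_SHIFT;    /* page number within the region */
        uint64_t off = (vpn * PTE_SIZE) % PT_SIZE;                /* PTE offset; modulo drops the unmapped bits */

        return (vrn << 61) + PT_BASE + off;
}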

Let us emphasize that the VHPT walker never walks the Linux page table. It could not possibly do so because it has no knowledge of the page-table structure used by Linux. For example, it does not know how many levels the page-table tree has or how big each directory is. But why use the VHPT walker at all, given that it can handle a TLB miss only if the TLB entry for the linear page table is already present? The reason is spatial locality of reference. Consider that the TLB entry that maps a particular PTE actually maps an entire page of PTEs. Thus, after a TLB entry for a page-table page is installed, all TLB misses that access PTEs in the same page can be handled entirely by the VHPT walker, avoiding costly TLB miss faults. For example, with a page size of 8 Kbytes, each page-table TLB entry maps 8 Kbytes/8 bytes = 1024 PTEs and hence 1024 · 8 Kbytes = 8 Mbytes of memory. In other words, when accessing memory sequentially, the VHPT walker reduces TLB miss faults from one per page to only one per 1024 pages! Given the high cost of fielding a fault on modern CPUs, the VHPT walker clearly has the potential to dramatically increase performance.

On the other hand, if memory is accessed in an extremely sparse pattern, the linear page table can be disadvantageous because the TLB entries for the page table take up space without being of much benefit. For example, again assuming a page size of 8 Kbytes, the most extreme case would occur when one byte is accessed every 8 Mbytes. In this case, each memory access would require two TLB entries: one for the page being accessed and one for the corresponding page-table page. Thus, the effective size of the TLB would be reduced by a factor of two! Fortunately, few applications exhibit such extreme access patterns for prolonged periods of time, so this is usually not an issue.

On a final note, it is worth pointing out that the virtually-mapped linear page table used by Linux/ia64 is not a self-mapped virtual page table (see Section 4.3.2). The two are very similar in nature, but the IA-64 page table does not have a self-mapped entry in the global directory. The reason is that none is needed: a self-mapped entry is required only if the virtual page table is used to access global- and middle-directory entries, and Linux/ia64 does not do that. Another way to look at this situation is to think of the virtual page table as existing in the TLB only: if a page in the virtual page table happens to be mapped in the TLB, it can be used to access the PTE directory, but if it is not mapped, an ordinary page-table walk is required.

Linux/ia64 and the region and protection key registers

Let us now return to Figure 4.29 on page 177 and take a closer look at the workings of the region and protection key registers and how Linux uses them. Both register files are under complete control of the operating system; the IA-64 architecture does not dictate a particular way of using them. However, they clearly were designed with a particular use in mind. Specifically, the region registers provide a means to share the TLB across multiple processes (address spaces). For example, if a unique region ID is assigned to each address space, the TLB entries for the same virtual page number vpn of different address spaces can reside in the TLB at the same time because they remain distinguishable, thanks to the region ID. With this use of region IDs, a context switch no longer requires flushing of the entire TLB. Instead, it is simply necessary to load the region ID of the new process into the appropriate region registers. This reduction in TLB flushing can dramatically improve performance for certain applications. Also, because each region has its own region register, it is even possible to have portions of different address spaces active at the same time. The Linux kernel takes advantage of this by permanently installing the region ID of the kernel in rr5–rr7 and installing the region ID of the currently running process in rr0–rr4. With this setup, kernel TLB entries and the TLB entries of various user-level processes can coexist without any difficulty and without wasting any TLB entries.

Whereas region registers make it possible to share the entire TLB across processes, protection key registers enable the sharing of individual TLB entries across processes, even if the processes must have distinct access rights for the page mapped by the TLB entry. To see how this works, suppose that a particular TLB entry maps a page of a shared object. The TLB entry for this page would be installed with the access rights (+rights) set according to the needs of the owner of the object. The key field would be set to a value that uniquely identifies the shared object. The operating system could then grant a process restricted access to this object by using one of the protection key registers to map the object's unique ID to an appropriate set of negative rights (-rights). This kind of fine-grained sharing of TLB entries has the potential to greatly improve TLB utilization, e.g., for shared libraries. However, the short-format mode of the VHPT walker cannot take advantage of the protection key registers and, for this reason, Linux disables them by clearing the pk bit in the processor status register (see Chapter 2, IA-64 Architecture).

4.4.2 Maintenance of TLB coherency

To guarantee proper operation of the Linux kernel, the TLB must be kept coherent with the page tables. For this purpose, Linux defines an interface that abstracts away the platform-specific details of how entries are flushed from the TLB. This interface, shown in Figure 4.33, is called the TLB flush interface (file include/asm/pgalloc.h).

Figure 4.33. Kernel interface to maintain TLB coherency.
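The figure itself is not reproduced here. Based on the descriptions that follow, the interface consists of roughly the declarations below; this is a sketch for a 2.4-era kernel, the exact signatures may differ, and the simplified pte_t typedef merely stands in for the real page-table-entry type.

/* Sketch of the TLB flush interface (include/asm/pgalloc.h), as described below. */
struct mm_struct;
struct vm_area_struct;
typedef unsigned long pte_t;   /* simplified stand-in for the real PTE type */

void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr);
void flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end);
void flush_tlb_pgtables(struct mm_struct *mm, unsigned long start, unsigned long end);
void flush_tlb_mm(struct mm_struct *mm);
void flush_tlb_all(void);
void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, pte_t pte);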

The first routine in this interface is flush_tlb_page(). It flushes the TLB entry of a particular page. The routine takes two arguments, a vm-area pointer vma and a virtual address addr. The latter is the address of the page whose TLB entry is being flushed, and the former points to the vm-area structure that covers this page. Because the vm-area structure contains a link back to the mm structure to which it belongs, the vma argument indirectly also identifies the address space for which the TLB entry is being flushed.

The second routine, flush_tlb_range(), flushes the TLB entries that map virtual pages inside an arbitrary address range. It takes three arguments: an mm structure pointer mm, a start address start, and an end address end. The mm argument identifies the address space for which the TLB entries should be flushed, and start and end identify the first and the last virtual page whose TLB entries should be flushed.

The third routine, flush_tlb_pgtables(), flushes TLB entries that map the virtually-mapped linear page table. Platforms that do not use a virtual page table do not have to do anything here. For the other platforms, argument mm identifies the address space for which the TLB entries should be flushed, and arguments start and end specify the virtual address range for which the virtual-page-table TLB entries should be flushed.

The fourth routine, flush_tlb_mm(), flushes all TLB entries for a particular address space. The address space is identified by argument mm, which is a pointer to an mm structure. Depending on the capabilities of the platform, this routine can either truly flush the relevant TLB entries or simply assign a new address-space number to mm.

Note that the four routines discussed so far may flush more than just the requested TLB entries. For example, if a particular platform does not have an instruction to flush a specific TLB entry, it is safe to implement flush_tlb_page() such that it flushes the entire TLB.

The next routine, flush_tlb_all(), flushes the entire TLB. This is a fallback routine in the sense that it can be used when none of the previous, more fine-grained routines are suitable. By definition, this routine flushes even TLB entries that map kernel pages. Because this is the only routine that does this, any address-translation-related changes to the page-table-mapped kernel segment or the kmap segment must be followed by a call to this routine. For this reason, calls to vmalloc() and vfree() are relatively expensive.

The last routine in this interface is update_mmu_cache(). Instead of flushing a TLB entry, it can be used to proactively install a new translation. The routine takes three arguments: a vm-area pointer vma, a virtual address addr, and a page-table entry pte. The Linux kernel calls this routine to notify platform-specific code that the virtual page identified by addr now maps to the page frame identified by pte. The vma argument identifies the vm-area structure that covers the virtual page. This routine gives platform-specific code a hint when the page table changes. Because it gives only a hint, a platform is not required to do anything. The platform could use this routine either for platform-specific purposes or to aggressively update the TLB even before the new translation is used for the first time. However, it is important to keep in mind that installation of a new translation generally displaces an existing entry from the TLB, so whether or not this is a good idea depends both on the applications in use and on the performance characteristics of the platform.

IA-64 implementation

On Linux/ia64, flush_tlb_mm() is implemented such that it forces the allocation of a new address-space number (region ID) for the address space identified by argument mm. This is logically equivalent to purging all TLB entries for that address space, but it has the advantage of not requiring execution of any TLB purge instructions.
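A minimal sketch of this approach is shown below. It assumes the get_mmu_context() and reload_context() routines of the ASN interface discussed in Section 4.4.3; the active_mm check and the meaning of a zero mm context are simplifications for illustration and are not guaranteed to match the kernel source line for line.

/* Sketch: instead of purging TLB entries, invalidate the address-space number. */
static inline void flush_tlb_mm(struct mm_struct *mm)
{
        if (!mm)
                return;
        mm->context = 0;                  /* 0 means "no region ID allocated yet" */
        if (mm == current->active_mm) {
                get_mmu_context(mm);      /* allocate a fresh region ID... */
                reload_context(mm);       /* ...and load it into rr0-rr4 */
        }
}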

The flush_tlb_all() implementation is based on the ptc.e (purge translation cache entry) instruction, which purges a large section of the TLB. Exactly how large a section of the TLB is purged depends on the CPU model. The architected sequence to flush the entire TLB is as follows:

long flags, i, j, addr = BASE_ADDR;

local_irq_save(flags);          /* disable interrupts */
for (i = 0; i < COUNT0; ++i, addr += STRIDE0) {
        for (j = 0; j < COUNT1; ++j, addr += STRIDE1)
                ptc_e(addr);
}
local_irq_restore(flags);       /* reenable interrupts */

Here, BASE_ADDR, COUNT0, STRIDE0, COUNT1, and STRIDE1 are CPU-model-specific values that can be obtained from PAL firmware (see Chapter 10, Booting). The advantage of using such an architected loop instead of an instruction that is guaranteed to flush the entire TLB is that ptc.e is easier to implement on CPUs with multiple levels and different types of TLBs. For example, a particular CPU model might have two levels of separate instruction and data TLBs, making it difficult to clear all TLBs atomically with a single instruction. However, in the particular case of Itanium, COUNT0 and COUNT1 both have a value of 1, meaning that a single ptc.e instruction flushes the entire TLB.

All other flush routines are implemented with either the ptc.l (purge local translation cache) or the ptc.ga (purge global translation cache and ALAT) instruction. The former is used on UP machines, and the latter is used on MP machines. Both instructions take two operands—a start address and a size—which define the virtual address range from which TLB entries should be purged. The ptc.l instruction affects only the local TLB, so it is generally faster to execute than ptc.ga, which affects the entire machine (see the architecture manual for exact definition [26]). Only one CPU can execute ptc.ga at any given time. To enforce this, the Linux/ia64 kernel uses a spinlock to serialize execution of this instruction.
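The sketch below illustrates how the range-based purge and the serialization might fit together. The names global_tlb_purge(), ptc_ga(), and ptcg_lock are placeholders for this illustration, and the exact encoding of the size operand of ptc.ga is glossed over.

/* Sketch: purge a virtual address range from the TLBs of all CPUs in the machine. */
spinlock_t ptcg_lock = SPIN_LOCK_UNLOCKED;   /* only one CPU may execute ptc.ga at a time */

void global_tlb_purge(unsigned long start, unsigned long end, unsigned long page_size)
{
        spin_lock(&ptcg_lock);
        do {
                ptc_ga(start, page_size);    /* purge [start, start+page_size) from all TLBs */
                start += page_size;
        } while (start < end);
        spin_unlock(&ptcg_lock);
}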

Linux/ia64 uses update_mmu_cache() to implement a cache-flushing operation that we discuss in more detail in Section 4.6. The routine could also be used to proactively install a new translation in the TLB. However, Linux gives no indication whether the translation is needed for instruction execution or for a data access, so it would be unclear whether the translation should be installed in the instruction or data TLB (or both). Because of this uncertainty and because installing a translation always entails the risk of evicting another, perhaps more useful, TLB entry, it is better to avoid proactive installation of TLB entries.

4.4.3 Lazy TLB flushing

To avoid the need to flush the TLB on every context switch, Linux defines an interface that abstracts differences in how address-space numbers (ASNs) work on a particular platform. This interface is called the ASN interface (file include/asm/mmu_context.h), shown in Figure 4.34. Support for this interface is optional in the sense that platforms with no ASN support simply define empty functions for the routines in this interface and instead define flush_tlb_mm() in such a way that it flushes the entire TLB.

Figure 4.34. Kernel interface to manage address-space numbers.
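Again, the figure itself is not reproduced here; based on the descriptions that follow, the interface consists of roughly the declarations below (a sketch only; exact signatures may differ).

/* Sketch of the ASN interface (include/asm/mmu_context.h), as described below. */
struct task_struct;
struct mm_struct;

int  init_new_context(struct task_struct *task, struct mm_struct *mm);
void get_mmu_context(struct mm_struct *mm);
void reload_context(struct mm_struct *mm);
void destroy_context(struct mm_struct *mm);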

Each mm structure contains a component called the mm context, which has a platform-specific type called the mm context type (mm_context_t in file include/asm/mmu.h). Often, this type is a single word that stores the ASN of the address space. However, some platforms allocate ASNs in a CPU-local fashion. For those, this type is typically an array of words such that the ith entry stores the ASN that the address space has on CPU i.

When we take a look at Figure 4.34, we see that it defines four routines. The first routine, init_new_context(), initializes the mm context of a newly created address space. It takes two arguments: a task pointer task and an mm structure pointer mm. The latter is a pointer to the new address space, and the former points to the task that created it. Normally, this routine simply clears the mm context to a special value (such as 0) that indicates that no ASN has been allocated yet. On success, the routine should return a value of 0.

The remaining routines all take one argument, an mm structure pointer mm that identifies the address space and, therefore, the ASN that is to be manipulated. The get_mmu_context() routine ensures that the mm context contains a valid ASN. If the mm context is already valid, nothing needs to be done. Otherwise, a free (unused) ASN needs to be allocated and stored in the mm context. It would be tempting to use the process ID (pid) as the ASN. However, this does not work because the execve() system call creates a new address space without changing the process ID.

Routine reload_context() is responsible for activating on the current CPU the ASN represented by the mm context. Logically, this activation entails writing the ASN to the CPU's asn register, but the exact details of how this is done are, of course, platform specific. When this routine is called, the mm context is guaranteed to contain a valid ASN.

Finally, when an ASN is no longer needed, Linux frees it by calling destroy_context(). This call marks the ASN represented by the mm context as available for reuse by another address space and should free any memory that may have been allocated in get_mmu_context(). Even though the ASN is available for reuse after this call, the TLB may still contain old translations with this ASN. For correct operation, it is essential that platform-specific code purges these old translations before activating a reused ASN. This is usually achieved by allocating ASNs in a round-robin fashion and flushing the entire TLB before wrapping around to the first available ASN.

IA-64 implementation

Linux/ia64 uses region IDs to implement the ASN interface. The mm context in the mm structure consists of a single word that holds the region ID of the address space. A value of 0 means that no ASN has been allocated yet. The init_new_context() routine can therefore simply clear the mm context to 0.

The IA-64 architecture defines region IDs to be 24 bits wide, but, depending on CPU model, as few as 18 bits can be supported. For example, Itanium supports just the architectural minimum of 18 bits. In Linux/ia64, region ID 0 is reserved for the kernel, and the remaining IDs are handed out by get_mmu_context() in a round-robin fashion. After the last available region ID has been handed out, the entire TLB is flushed and a new range of available region IDs is calculated such that all region IDs currently in use are outside this range. Once this range has been found, get_mmu_context() continues to hand out region IDs in a round-robin fashion until the space is exhausted again, at which point the steps of flushing the TLB and finding an available range of region IDs are repeated.
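A simplified sketch of this allocator is shown below. The ia64_ctx structure and the wrap_mmu_context() helper are named for illustration only; in particular, wrap_mmu_context() stands for the step that flushes the TLB and computes a new range of region IDs excluding all IDs currently in use, which is not shown here.

/* Sketch: hand out region IDs round-robin, flushing the TLB when the range is exhausted. */
struct {
        spinlock_t    lock;
        unsigned long next;    /* next region ID to hand out */
        unsigned long limit;   /* first region ID beyond the currently available range */
} ia64_ctx;

void wrap_mmu_context(void);   /* flush the TLB and recompute next/limit (not shown) */

void get_mmu_context(struct mm_struct *mm)
{
        spin_lock(&ia64_ctx.lock);
        if (!mm->context) {                          /* no valid region ID yet */
                if (ia64_ctx.next >= ia64_ctx.limit)
                        wrap_mmu_context();
                mm->context = ia64_ctx.next++;
        }
        spin_unlock(&ia64_ctx.lock);
}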

Region IDs are 18 to 24 bits wide but only 15 to 21 bits are effectively available for Linux. The reason is that IA-64 requires the TLB to match the region ID and the virtual page number (see Figure 4.29 on page 177) but not necessarily the virtual region number (vrn). Thus, a TLB may not be able to distinguish, e.g., address 0x2000000000000000 from address 0x4000000000000000 unless the region IDs in rr1 and rr2 are different. To ensure this, Linux/ia64 encodes vrn in the three least significant bits of the region ID.

Note that the region IDs returned by get_mmu_context() are shared across all CPUs. Such a global region ID allocation policy is ideal for UP and small to moderately large MP machines. A global scheme is advantageous for MP machines because it makes possible the use of the ptc.ga instruction to purge translations from all TLBs in the machine. On the downside, global region ID allocation is a potential point of contention and, perhaps worse, causes the region ID space to be exhausted faster the more CPUs there are in the machine. To see this, assume there are eight distinct region IDs and that a CPU on average creates a new address space once a second. With a single CPU, the TLB would have to be flushed once every eight seconds because the region ID space has been exhausted. In a machine with eight CPUs, each creating an address space once a second, a global allocation scheme would require that the TLB be flushed once every second. In contrast, a local scheme could get by with as little TLB flushing as the UP case, i.e., one flush every eight seconds. But it would have to use an interprocessor interrupt (IPI, see Chapter 8, Symmetric Multiprocessing) instead of ptc.ga to perform a global TLB flush (because each CPU uses its own set of region IDs). In other words, deciding between the local and the global scheme involves a classic tradeoff between smaller fixed overheads (ptc.ga versus IPI) and better scalability (region ID space is exhausted with a rate proportional to total versus per-CPU rate at which new address spaces are created).

On IA-64, reload_context() has the effect of loading region registers rr0 to rr4 according to the value in the mm context. As explained earlier, the value actually stored in the region registers is formed by shifting the mm context value left by three bits and encoding the region's vrn value in the least significant bits.
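A sketch of this operation follows. The helpers set_rr(), which writes the region register selected by the top three bits of its first argument, and srlz_i(), which stands for the required serialization, are hypothetical names; the region-register layout assumed here (region ID in bits 8-31, page size in bits 2-7, VHPT-walker enable in bit 0) reflects the architecture but is simplified.

/* Sketch: load rr0-rr4 with the region ID of the address space identified by mm. */
void reload_context(struct mm_struct *mm)
{
        unsigned long rid = mm->context << 3;   /* make room for the vrn bits */
        unsigned long vrn;

        for (vrn = 0; vrn <= 4; ++vrn)          /* user regions only; rr5-rr7 belong to the kernel */
                set_rr(vrn << 61, ((rid | vrn) << 8) | (PAGE_SHIFT << 2) | 1);
        srlz_i();                               /* serialize so the new region IDs take effect */
}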

The IA-64 version of destroy_context() does not need to do anything: get_mmu_context() does not allocate any memory, so no memory needs to be freed here. Similarly, the range of available region IDs is recalculated only after the existing range has been exhausted, so there is no need to flush old translations from the TLB here.
