Understanding the Linux Kernel, 3rd Edition
Preface
The Audience for This Book
We try to go beyond superficial features. We offer background, such as the history of major features and the reasons why they were used.
Organization of the Material
We tried a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent.
Level of Description
Overview of the Book
Conventions in This Book
How to Contact Us
Chapter 1. Introduction
1.1. Linux Versus Other Unix-Like Kernels
Linux regards lightweight processes as the basic execution context and handles them via the nonstandard clone() system call.
1.2. Hardware Dependency
1.3. Linux Versions
1.4. Basic Operating System Concepts
1.4.1. Multiuser Systems
1.4.2. Users and Groups
1.4.3. Processes
A process can be defined either as "an instance of a program in execution" or as the "execution context" of a running program.
1.4.4. Kernel Architecture
monolithic kernel / microkernel (modules)
1.5. An Overview of the Unix Filesystem
1.5.1. Files
1.5.2. Hard and Soft Links
1.5.3. File Types
1.5.4. File Descriptor and Inode
1.5.5. Access Rights and File Mode
When a file is created by a process, its owner ID is the UID of the process.
Its owner user group ID can be either the process group ID of the creator process or the user group ID of the parent directory,
depending on the value of the sgid flag of the parent directory.
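The group-ownership rule above can be sketched as a small function. This is a minimal illustration, not kernel code: the function name and its parameters are hypothetical, and only the `stat.S_ISGID` constant comes from the Python standard library.

```python
import stat


def new_file_gid(creator_gid: int, parent_dir_mode: int, parent_dir_gid: int) -> int:
    """Return the group ID a newly created file would receive.

    Hypothetical helper: if the parent directory has the sgid flag set,
    the file inherits the directory's group; otherwise it takes the
    group ID of the creating process.
    """
    if parent_dir_mode & stat.S_ISGID:
        return parent_dir_gid
    return creator_gid


# Directory with mode 2775 (sgid set) owned by group 50, creator in group 1000.
print(new_file_gid(1000, stat.S_IFDIR | 0o2775, 50))
# Same directory without the sgid flag.
print(new_file_gid(1000, stat.S_IFDIR | 0o0775, 50))
```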
1.5.6. File-Handling System Calls
1.5.6.1. Opening a file
1.5.6.2. Accessing an opened file
1.5.6.3. Closing a file
1.5.6.4. Renaming and deleting a file
1.6. An Overview of Unix Kernels (needs rereading)
1.6.1. The Process/Kernel Model
kernel routines can be activated in several ways:
1:A process invokes a system call.
2:The CPU executing the process signals an exception, which is an unusual condition such as an invalid instruction.
The kernel handles the exception on behalf of the process that caused it.
3:A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention,
a status change, or the completion of an I/O operation.
Each interrupt signal is dealt with by a kernel program called an interrupt handler.
Because peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.
4:A kernel thread is executed. Because it runs in Kernel Mode, the corresponding program must be considered part of the kernel.
1.6.2. Process Implementation
When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor.
These include:
1:The program counter (PC) and stack pointer (SP) registers
2:The general purpose registers
3:The floating point registers
4:The processor control registers (Processor Status Word) containing information about the CPU state
5:The memory management registers used to keep track of the RAM accessed by the process
1.6.3. Reentrant Kernels
1.6.4. Process Address Space
1.6.5. Synchronization and Critical Regions
1.6.5.1. Kernel preemption disabling
1.6.5.2. Interrupt disabling
1.6.5.3. Semaphores
1.6.5.4. Spin locks
1.6.5.5. Avoiding deadlocks
1.6.6. Signals and Interprocess Communication
1.6.7. Process Management
1.6.7.1. Zombie processes
1.6.7.2. Process groups and login sessions
1.6.8. Memory Management
1.6.8.1. Virtual memory
1.6.8.2. Random access memory usage
1.6.8.3. Kernel Memory Allocator
1.6.8.4. Process virtual address space handling
1.6.8.5. Caching
1.6.9. Device Drivers
Chapter 2. Memory Addressing
2.1. Memory Addresses
(1)Logical address
(2)Linear address
(3)Physical address
The Memory Management Unit (MMU) transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit.
Subsequently, a second hardware circuit called a paging unit transforms the linear address into a physical address.
Figure 2-1. Logical address translation
2.2. Segmentation in Hardware
2.2.1. Segment Selectors and Segmentation Registers
(1)Segment Selectors
(2)Segmentation Registers
To make it easy to retrieve segment selectors quickly, the processor provides segmentation registers whose only purpose is to hold Segment Selectors:
cs, ss, ds, es, fs, and gs.
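As a rough illustration of what a segmentation register holds, the sketch below decodes a 16-bit Segment Selector into its three fields on the 80x86: a 13-bit descriptor index, a table indicator bit (GDT or LDT), and a 2-bit requested privilege level. The function and class names are hypothetical, and the two selector values are only sample inputs.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int   # position of the Segment Descriptor in the table
    table: str   # "GDT" or "LDT", from the table indicator bit
    rpl: int     # requested privilege level, 0 (kernel) to 3 (user)


def decode_selector(selector: int) -> Selector:
    """Split a 16-bit Segment Selector into index, table indicator, and RPL."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("a Segment Selector is a 16-bit value")
    return Selector(
        index=selector >> 3,                        # bits 15..3
        table="LDT" if selector & 0x4 else "GDT",   # bit 2
        rpl=selector & 0x3,                         # bits 1..0
    )


for sample in (0x60, 0x73):
    print(hex(sample), decode_selector(sample))
```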
2.2.2. Segment Descriptors
Global Descriptor Table (GDT)
Local Descriptor Table (LDT)
Code Segment Descriptor
Data Segment Descriptor
Task State Segment Descriptor (TSSD)
Local Descriptor Table Descriptor (LDTD)
2.2.3. Fast Access to Segment Descriptors
2.2.4. Segmentation Unit
2.3. Segmentation in Linux
The 2.6 version of Linux uses segmentation only when required by the 80x86 architecture.
2.3.1. The Linux GDT
1. A Task State Segment (TSS)
2. Kernel code and data segments
3. A segment including the default Local Descriptor Table (LDT)
4. Three Thread-Local Storage (TLS) segments
5. Three segments related to Advanced Power Management (APM)
6. Five segments related to Plug and Play (PnP) BIOS services
7. A special TSS segment used by the kernel to handle "Double fault" exceptions
A few entries in the GDT may depend on the process that the CPU is executing (LDT and TLS Segment Descriptors).
2.3.2. The Linux LDTs
2.4. Paging in Hardware
page frames/page
2.4.1. Regular Paging
2.4.2. Extended Paging
2.4.3. Hardware Protection Scheme
2.4.4. An Example of Regular Paging
A simple example will help in clarifying how regular paging works. Let's assume that the kernel
assigns the linear address space between 0x20000000 and 0x2003ffff to a running process.[*] This
space consists of exactly 64 pages. We don't care about the physical addresses of the page frames
containing the pages; in fact, some of them might not even be in main memory. We are interested
only in the remaining fields of the Page Table entries.
[*] As we shall see in the following chapters, the 3 GB linear address space is an upper limit, but a User Mode process is allowed to
reference only a subset of it.
Let's start with the 10 most significant bits of the linear addresses assigned to the process, which
are interpreted as the Directory field by the paging unit. The addresses start with a 2 followed by
zeros, so the 10 bits all have the same value, namely 0x080 or 128 decimal. Thus the Directory field
in all the addresses refers to the 129th entry of the process Page Directory. The corresponding entry
must contain the physical address of the Page Table assigned to the process (see Figure 2-9). If no
other linear addresses are assigned to the process, all the remaining 1,023 entries of the Page
Directory are filled with zeros.
The values assumed by the intermediate 10 bits, (that is, the values of the Table field) range from 0
to 0x03f, or from 0 to 63 decimal. Thus, only the first 64 entries of the Page Table are valid. The
remaining 960 entries are filled with zeros.
Suppose that the process needs to read the byte at linear address 0x20021406. This address is
handled by the paging unit as follows:
1. The Directory field 0x80 is used to select entry 0x80 of the Page Directory, which points to the
Page Table associated with the process's pages.
2. The Table field 0x21 is used to select entry 0x21 of the Page Table, which points to the page
frame containing the desired page.
3. Finally, the Offset field 0x406 is used to select the byte at offset 0x406 in the desired page
frame.
If the Present flag of the 0x21 entry of the Page Table is cleared, the page is not present in main
memory; in this case, the paging unit issues a Page Fault exception while translating the linear
address. The same exception is issued whenever the process attempts to access linear addresses
outside of the interval delimited by 0x20000000 and 0x2003ffff, because the Page Table entries not
assigned to the process are filled with zeros; in particular, their Present flags are all cleared.
Figure 2-9. An example of paging
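The arithmetic in this example can be checked with a few lines of code. The sketch below assumes the regular 80x86 paging layout described in the example (10-bit Directory, 10-bit Table, 12-bit Offset); the function names are hypothetical.

```python
PAGE_SHIFT = 12        # 4 KB pages: 12 offset bits
TABLE_BITS = 10        # 1,024 entries per Page Table
DIRECTORY_SHIFT = PAGE_SHIFT + TABLE_BITS  # 22


def split_linear_address(addr: int) -> tuple[int, int, int]:
    """Return the (Directory, Table, Offset) fields of a 32-bit linear address."""
    directory = addr >> DIRECTORY_SHIFT
    table = (addr >> PAGE_SHIFT) & ((1 << TABLE_BITS) - 1)
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return directory, table, offset


def page_count(first: int, last: int) -> int:
    """Number of pages covering the inclusive, page-aligned range [first, last]."""
    return (last - first + 1) >> PAGE_SHIFT


directory, table, offset = split_linear_address(0x20021406)
print(hex(directory), hex(table), hex(offset))
print(page_count(0x20000000, 0x2003FFFF))
```

The Table field of the last address in the range, 0x2003ffff, is 0x3f, which matches the statement that only the first 64 Page Table entries are valid.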
2.4.5. The Physical Address Extension (PAE) Paging Mechanism
2.4.6. Paging for 64-bit Architectures
2.4.7. Hardware Cache (L1 cache)
The cache memory stores the actual lines of memory. The cache controller stores an array of entries, one entry for each line of the
cache memory. Each entry includes a tag and a few flags that describe the status of the cache line.
The tag consists of some bits that allow the cache controller to recognize the memory location
currently mapped by the line. The bits of the memory's physical address are usually split into three
groups: the most significant ones correspond to the tag, the middle ones to the cache controller
subset index, and the least significant ones to the offset within the line.
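The three-way split of a physical address can be sketched as follows. The cache geometry here is an assumption chosen only for illustration (32-byte lines and 128 subsets, giving 5 offset bits and 7 index bits); real processors differ, and the function name is hypothetical.

```python
LINE_BITS = 5    # assumed 32-byte cache lines
INDEX_BITS = 7   # assumed 128 subsets


def split_physical_address(paddr: int) -> tuple[int, int, int]:
    """Return (tag, subset index, offset within the line) for a physical address."""
    offset = paddr & ((1 << LINE_BITS) - 1)
    index = (paddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (LINE_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_physical_address(0x12345678)
print(hex(tag), hex(index), hex(offset))
```

Two addresses with the same subset index but different tags compete for the same lines, which is why the controller has to store the tag alongside each line.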
write-through:
The controller always writes into both RAM and the cache line, effectively switching off the cache for write operations.
write-back:
Only the cache line is updated and the contents of the RAM are left
unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller
writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of
cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).
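The difference between the two write strategies can be seen in a toy model. This is a deliberately simplified sketch: RAM is a plain dictionary, each "line" holds a single value, and the class and method names are hypothetical.

```python
class ToyCache:
    """Toy model of a cache in front of RAM, with a selectable write policy."""

    def __init__(self, ram: dict[int, int], policy: str) -> None:
        if policy not in ("write-through", "write-back"):
            raise ValueError("unknown policy: " + policy)
        self.ram = ram
        self.policy = policy
        self.lines: dict[int, int] = {}   # cached copies, keyed by address
        self.dirty: set[int] = set()      # lines newer than RAM (write-back only)

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        if self.policy == "write-through":
            self.ram[addr] = value        # RAM is updated on every write
        else:
            self.dirty.add(addr)          # RAM is updated later, on flush

    def flush(self) -> None:
        """Write every dirty line back into RAM."""
        for addr in sorted(self.dirty):
            self.ram[addr] = self.lines[addr]
        self.dirty.clear()


ram = {0x10: 0}
cache = ToyCache(ram, "write-back")
cache.write(0x10, 7)
print(ram[0x10])   # RAM still holds the old value
cache.flush()
print(ram[0x10])   # RAM now matches the cache line
```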
2.4.8. Translation Lookaside Buffers (TLB)
80x86 processors include a cache called Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is
used for the first time, the corresponding physical address is computed through slow accesses to the
Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to
the same linear address can be quickly translated.
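The caching behavior described above can be modeled in a few lines. This is a sketch under simplifying assumptions: the "Page Tables in RAM" are a flat dictionary from page number to page frame number, the TLB never evicts entries, and the class name and its `walks` counter are hypothetical.

```python
PAGE_SHIFT = 12


class ToyTLB:
    """Toy TLB that caches page-to-frame translations after the first lookup."""

    def __init__(self, page_table: dict[int, int]) -> None:
        self.page_table = page_table        # stands in for the Page Tables in RAM
        self.entries: dict[int, int] = {}   # cached translations
        self.walks = 0                      # number of slow page table lookups

    def translate(self, linear: int) -> int:
        page = linear >> PAGE_SHIFT
        offset = linear & ((1 << PAGE_SHIFT) - 1)
        frame = self.entries.get(page)
        if frame is None:                   # TLB miss: take the slow path once
            self.walks += 1
            frame = self.page_table[page]
            self.entries[page] = frame
        return (frame << PAGE_SHIFT) | offset


tlb = ToyTLB({0x20021: 0x01234})
print(hex(tlb.translate(0x20021406)))   # first use: slow lookup
print(hex(tlb.translate(0x20021FFF)))   # same page: served from the TLB
print(tlb.walks)
```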
2.5. Paging in Linux
Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear
addresses into physical ones makes the following design objectives feasible:
1.Assign a different physical address space to each process, ensuring an efficient protection
against addressing errors.
2.Distinguish pages (groups of data) from page frames (physical addresses in main memory).
This allows the same page to be stored in a page frame, then saved to disk and later reloaded
in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 17).
pgd
2.5.1. The Linear Address Fields
PAGE_SHIFT/PMD_SHIFT/PUD_SHIFT/PGDIR_SHIFT
PTRS_PER_PTE, PTRS_PER_PMD, PTRS_PER_PUD, and PTRS_PER_PGD
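To see how these macros carve up a linear address, the sketch below uses the values assumed for 32-bit 80x86 without PAE, where the Page Upper Directory and Page Middle Directory are folded away (one entry each). The constants mirror the kernel macro names, but the Python function is a hypothetical stand-in, not kernel code.

```python
# Assumed values for 32-bit 80x86 with PAE disabled.
PAGE_SHIFT = 12
PMD_SHIFT = 22
PUD_SHIFT = 22
PGDIR_SHIFT = 22

PTRS_PER_PTE = 1024
PTRS_PER_PMD = 1
PTRS_PER_PUD = 1
PTRS_PER_PGD = 1024


def linear_address_fields(addr: int) -> dict[str, int]:
    """Return the index used at each paging level, plus the page offset."""
    return {
        "pgd": (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1),
        "pud": (addr >> PUD_SHIFT) & (PTRS_PER_PUD - 1),
        "pmd": (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1),
        "pte": (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1),
        "offset": addr & ((1 << PAGE_SHIFT) - 1),
    }


print(linear_address_fields(0x20021406))
```

With one entry per folded level, the pud and pmd indexes are always 0, so the four-level model collapses to the two-level hardware paging described in section 2.4.1.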
2.5.2. Page Table Handling
(1):type-conversion macros
__pte, __pmd, __pud, __pgd, __pgprot (protection flags)
pte_val, pmd_val, pud_val, pgd_val, pgprot_val
(2):macros and functions to read or modify page table entries
pte_none, pmd_none, pud_none, pgd_none
pte_clear, pmd_clear, pud_clear, pgd_clear
set_pte, set_pmd, set_pud, set_pgd
pte_same(a,b)
pmd_large(e)
pmd_bad, pud_bad, pgd_bad
pte_present
The pmd_bad macro is used by functions to check Page Middle Directory entries passed as input
parameters. It yields the value 1 if the entry points to a bad Page Table, that is, if at least one of the
following conditions applies:
(1)The page is not in main memory (Present flag cleared).
(2)The page allows only Read access (Read/Write fl