4-FILE SYSTEMS

Please indicate the source: http://blog.csdn.net/gaoxiangnumber1

Welcome to my github: https://github.com/gaoxiangnumber1

  • Files are logical units of information created by processes. Processes can read existing files and create new ones if need be. Information stored in files must be persistent, i.e., not affected by process creation and termination. A file should disappear only when its owner explicitly removes it.

4.1 FILES

4.1.1 File Naming

  • When a process creates a file, it gives the file a name. When the process terminates, the file continues to exist and can be accessed by other processes using its name.
  • All current OS allow strings of one to eight letters as legal file names. Frequently digits and special characters are also permitted. Many file systems support names as long as 255 characters. UNIX distinguishes between upper- and lowercase letters.
  • Many OS support two-part file names that are separated by a period. The part following the period is called the file extension and usually indicates something about the file. In UNIX, the size of the extension is up to the user, and a file may even have two or more extensions(homepage.html.zip: .html indicates a Web page in HTML and .zip indicates that the file(homepage.html) has been compressed). Common file extensions and their meanings are shown in Fig. 4-1.

  • In UNIX, file extensions are just conventions and are not enforced by the OS. But Windows is aware of the extensions and assigns meaning to them. When a user double clicks on a file name, the program assigned to its file extension is launched with the file as parameter.

4.1.2 File Structure

  • Three common possibilities for structuring files are depicted in Fig. 4-2.
  • Fig. 4-2(a): A file is an unstructured sequence of bytes. The OS does not know or care what is in the file; all it sees are bytes. Any meaning must be imposed by user-level programs. Having the OS regard files as just byte sequences provides flexibility: user programs can put anything they want in their files and name them any way they find convenient. UNIX uses this file model.
  • Fig. 4-2(b): A file is a sequence of fixed-length records, each with some internal structure. The read operation returns one record and the write operation overwrites or appends one record.
  • Fig. 4-2(c): A file consists of a tree of records, not necessarily all the same length, each containing a key field in a fixed position in the record. The tree is sorted on the key field, to allow rapid searching for a particular key. New records can be added to the file with the OS, not the user, deciding where to place them. This type of file is used on some large mainframe computers for commercial data processing.

4.1.3 File Types

  • UNIX has regular files, directories, character and block special files.
    1. Regular files contain user information. All the files of Fig. 4-2 are regular files.
    2. Directories are system files for maintaining the structure of the file system.
    3. Character special files are related to input/output and used to model serial I/O devices(terminals, printers, and networks).
    4. Block special files are used to model disks.
  • Regular files are either ASCII files or binary files.
    1. ASCII files consist of lines of text. In some systems each line is terminated by a carriage return character; in others, the line feed character is used. The advantage of ASCII files is that they can be displayed and printed as is, and they can be edited with any text editor.
    2. Binary files are not ASCII files. Listing them on the printer gives an incomprehensible listing full of random junk. They have some internal structure known to programs that use them.
  • Every OS must recognize its own executable file.

  • Fig. 4-3(a) is an executable binary file from UNIX. OS will execute a file only if it has the proper format. (b) is a binary file(an archive).

4.1.4 File Access

  • Sequential access: a process can read all the bytes or records in a file in order, starting at the beginning, but cannot skip around and read them out of order.
    Random access: bytes or records of a file can be read in any order.
  • Two methods can be used for specifying where to start reading.
    1. Every read operation gives the position in the file to start reading at.
    2. Operation seek() is provided to set the current position. After a seek, the file can be read sequentially from the now-current position. This method is used in UNIX.
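The two positioning methods can be contrasted with POSIX-style calls. A minimal Python sketch (os.pread corresponds to method 1, where each read carries its own position; os.lseek plus os.read corresponds to the UNIX method 2):

```python
import os
import tempfile

# Scratch file with known contents (setup for the demo only).
fd, path = tempfile.mkstemp()
os.write(fd, b"abcdefghij")

# Method 1: every read operation names its own starting position.
assert os.pread(fd, 3, 4) == b"efg"      # 3 bytes starting at offset 4

# Method 2: seek() sets the current position; subsequent reads proceed
# sequentially from there (the UNIX model).
os.lseek(fd, 4, os.SEEK_SET)
assert os.read(fd, 3) == b"efg"
assert os.read(fd, 3) == b"hij"          # continues where the last read stopped

os.close(fd)
os.unlink(path)
```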

4.1.5 File Attributes

  • All OS associate file attributes with each file. Some people call them meta-data. The table of Fig. 4-4 shows some of the attributes.

4.1.6 File Operations

  • Common system calls relating to files.
    1. Create. The file is created with no data.
    2. Delete. When the file is no longer needed, it has to be deleted to free up disk space.
    3. Open. Before using a file, a process must open it. This allows the system to fetch the attributes and list of disk addresses into main memory for rapid access on later calls.
    4. Close. When all the accesses are finished, the attributes and disk addresses are no longer needed, so the file should be closed to free up internal table space. A disk is written in blocks, and closing a file forces writing of the file’s last block, even though that block may not be full yet.
    5. Read. The caller must specify how much data is needed and provide a buffer to put it in.
    6. Write. Data are written to the file, usually at the current position. If the current position is the end of the file, the file’s size increases. If the current position is in the middle of the file, existing data are overwritten and lost forever.
    7. Append. This call can add data only to the end of the file.
    8. Seek. For random-access files, seek() repositions the file pointer to a specific place in the file. After this call has completed, data can be read from, or written to that position.
    9. Get attributes.
    10. Set attributes.
    11. Rename.

4.1.7 An Example Program Using File-System Calls

  • Fig. 4-5: The program must be called with two legal file names: the first is source; the second is the output file.
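The book's program is written in C; the loop below is a hedged Python equivalent using only the low-level calls listed above (buffer size and file names are illustrative choices, not from the text):

```python
import os
import tempfile

BUF_SIZE = 4096                          # chunk size; any value works

def copyfile(src, dst):
    """Copy src to dst using only open/read/write/close, in the style of Fig. 4-5."""
    in_fd = os.open(src, os.O_RDONLY)
    out_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    while True:
        buf = os.read(in_fd, BUF_SIZE)   # read one chunk
        if not buf:                      # zero bytes returned: end of file
            break
        os.write(out_fd, buf)            # write the chunk to the output file
    os.close(in_fd)
    os.close(out_fd)

# Demo on a scratch file (the real program takes the two names from argv).
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "source"), os.path.join(d, "copy")
with open(src, "wb") as f:
    f.write(b"hello" * 1000)
copyfile(src, dst)
with open(dst, "rb") as f:
    assert f.read() == b"hello" * 1000
```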

4.2 DIRECTORIES

  • To keep track of files, file systems normally have directories or folders, which are themselves files.

4.2.1 Single-Level Directory Systems

  • The simplest form of directory system is having one directory containing all the files. Sometimes it is called the root directory.

  • Fig. 4-6. The advantages of this scheme are its simplicity and the ability to locate files quickly. It is sometimes used on embedded devices.

4.2.2 Hierarchical Directory Systems

  • Fig. 4-7. With hierarchy approach, there can be as many directories as are needed to group the files in natural ways. The ability to create an arbitrary number of subdirectories provides a structuring tool for users to organize their work. Nearly all modern file systems are organized in this manner.

4.2.3 Path Names

  • Two different methods are used to specify file names.
    1. Each file is given an absolute path name consisting of the path from the root directory to the file. If the first character of the path name is the separator, then the path is absolute.
    2. Relative path name is used with the concept of the working directory(i.e., current directory). A user can designate one directory as the current working directory and all path names not beginning at the root directory are taken relative to the working directory.
  • Each process has its own working directory, so when it changes its working directory and later exits, no other processes are affected and no traces of the change are left behind in the file system.
  • Most OS that support a hierarchical directory system have two special entries in every directory, “.” and “..”. Dot refers to the current directory; dotdot refers to its parent(except in the root directory, where it refers to itself).

4.2.4 Directory Operations

  • Common system calls for managing directories:
    1. Create. A directory is created. It is empty except for “.” and “..”.
    2. Delete. Only an empty directory can be deleted. A directory containing only . and .. is considered empty.
    3. Opendir. Before a directory can be read, it must be opened.
    4. Closedir. After a directory has been read, it should be closed to free up internal table space.
    5. Readdir. This call returns the next entry in an open directory.
    6. Rename.
    7. Link. Linking is a technique that allows a file to appear in more than one directory. This system call specifies an existing file and a path name, and creates a link from the existing file to the name specified by the path. A link of this kind, which increments the counter in the file’s i-node(to keep track of the number of directory entries containing the file) is called a hard link.
    8. Unlink. A directory entry is removed. If the file being unlinked is only present in one directory, it is removed from the file system. If it is present in multiple directories, only the path name specified is removed. The others remain. In UNIX, the system call for deleting files is unlink.
  • A variant on linking files is the symbolic link. Instead of having two names point to the same internal data structure representing a file, a new name can be created that refers to a small file containing the path name of another file. When the first file is used, the file system follows the path and finds the real file at the end.
    Pro: can cross disk boundaries and even name files on remote computers.
    Con: implementation is less efficient than hard links.
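The difference between the two link kinds can be observed directly with the UNIX calls (a sketch on a scratch directory; the file names are arbitrary):

```python
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "file")
with open(target, "w") as f:
    f.write("data")

# Hard link: a second directory entry for the same i-node;
# the i-node's link count goes from 1 to 2.
hard = os.path.join(d, "hard")
os.link(target, hard)
assert os.stat(target).st_nlink == 2
assert os.stat(target).st_ino == os.stat(hard).st_ino   # same i-node

# Symbolic link: a separate little file holding a path name.
sym = os.path.join(d, "sym")
os.symlink(target, sym)
assert os.readlink(sym) == target
assert os.lstat(sym).st_ino != os.stat(target).st_ino   # its own i-node

# Removing the target leaves the hard link usable but breaks the symlink.
os.unlink(target)
with open(hard) as f:
    assert f.read() == "data"
assert not os.path.exists(sym)           # dangling symbolic link
```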

4.3 FILE-SYSTEM IMPLEMENTATION

4.3.1 File-System Layout

  • File systems are stored on disks. Most disks can be divided up into one or more partitions, with independent file systems on each partition. Sector 0 of the disk is called the MBR(Master Boot Record) and is used to boot the computer. The end of the MBR contains the partition table that gives the starting and ending addresses of each partition. One of the partitions in the table is marked as active. When the computer is booted, the BIOS reads in and executes the MBR.
  • The first thing MBR does is locate the active partition, read in its first block, which is called the boot block, and execute it. The program in the boot block loads the OS contained in that partition. For uniformity, every partition starts with a boot block, even if it does not contain a bootable OS.
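A decoder for the partition table can be sketched as follows. The layout assumed here (table at offset 446, four 16-byte entries, boot signature 0x55 0xAA at offset 510, little-endian LBA fields) is the classic PC MBR format, not something stated in the notes above:

```python
import struct

def parse_mbr(sector0: bytes):
    """Decode the four primary-partition entries of a classic PC MBR.

    Each 16-byte entry: boot flag, CHS start (skipped), type byte,
    CHS end (skipped), little-endian LBA start, little-endian sector count.
    """
    assert len(sector0) == 512 and sector0[510:512] == b"\x55\xaa"
    parts = []
    for i in range(4):
        boot, ptype, lba_start, n_sectors = struct.unpack_from(
            "<B3xB3xII", sector0, 446 + 16 * i)
        parts.append({"active": boot == 0x80, "type": ptype,
                      "start": lba_start, "sectors": n_sectors})
    return parts

# Fabricate a sector 0 with one active type-0x83 partition.
sector = bytearray(512)
struct.pack_into("<B3xB3xII", sector, 446, 0x80, 0x83, 2048, 100000)
sector[510:512] = b"\x55\xaa"
table = parse_mbr(bytes(sector))
assert table[0] == {"active": True, "type": 0x83, "start": 2048, "sectors": 100000}
assert not table[1]["active"]            # remaining entries are empty
```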
  • Other than starting with a boot block, the layout of a disk partition varies from file system to file system. Often the file system contains some of the items shown in Fig. 4-9.

  • Superblock: contains all the key parameters about the file system and is read into memory when the computer is booted or the file system is first touched. Typical information in the superblock includes a magic number to identify the file-system type, the number of blocks in the file system, and other key administrative information.
  • Free space: information about free blocks in the file system, e.g. in the form of a bitmap or a list of pointers.
  • I-nodes: an array of data structures, one per file, telling all about the file.
  • Root directory: contains the top of the file-system tree.

4.3.2 Implementing Files

Contiguous Allocation

  • Fig. 4-10. Each file begins at the start of a new block, so if file A occupies, say, 3.5 blocks, some space is wasted at the end of its last block.
  • Two advantages.
    1. Simple to implement. Because we only need to remember two numbers(the disk address of the first block and the number of blocks in the file).
    2. High performance. The read performance is excellent because the entire file can be read from the disk in a single operation. Only one seek is needed to the first block, after that, no more seeks or rotational delays are needed, so data come in at the full bandwidth of the disk.
  • Drawback: the disk becomes fragmented over the course of time. When a file is removed, its blocks are freed, leaving a run of free blocks on the disk. The disk is not compacted on the spot to squeeze out the hole since that involves copying all the blocks following the hole, which would be slow. Eventually the disk will fill up and it is necessary to either compact the disk(expensive) or to reuse the free space in the holes. Reusing the space requires maintaining a list of holes, which is doable. But when a new file is to be created, it is necessary to know its final size in order to choose a hole of the correct size to place it in.
  • There is one situation in which contiguous allocation is still used: on CD-ROMs. Here all the file sizes are known in advance and will never change during subsequent use of the CD-ROM file system.

Linked-List Allocation

  • Fig. 4-11: keep each file as a linked list of disk blocks. The first word of each block is used as a pointer to the next one. The rest of the block is for data.
  • Every disk block can be used and no space is lost to disk fragmentation(except for internal fragmentation in the last block). The directory entry needs only store the disk address of the first block and the rest can be found starting there.
  • Disadvantages:
    1. Though reading a file sequentially is straightforward, random access is slow. To get to block n, the OS has to start at the beginning and read the n − 1 blocks prior to it.
    2. The amount of data storage in a block is no longer a power of two because the pointer takes up a few bytes. Having a peculiar size is less efficient because many programs read and write in blocks whose size is a power of two. With the first few bytes occupied by a pointer, reads of the full block size require acquiring and concatenating information from two disk blocks, which generates extra overhead due to the copying.

Linked-List Allocation Using a Table in Memory

  • Both disadvantages can be eliminated by taking the pointer from each disk block and putting it in a table in memory. Figure 4-12 shows what the table looks like for the example of Fig. 4-11. Chains are terminated with a special marker(e.g., -1) that is not a valid block number. Such a table in main memory is called a FAT(File Allocation Table).
  • Random access is quicker: the chain is entirely in memory, so it can be followed without making any disk references.
  • Disadvantage: the entire table must be in memory all the time to make it work.
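The FAT can be modeled as an in-memory table mapping each block to its successor. A small sketch, using the block chains of the Fig. 4-11/4-12 example (file A in blocks 4, 7, 2, 10, 12; file B in blocks 6, 3, 11, 14):

```python
EOF = -1  # special marker terminating a chain (not a valid block number)

# fat[block] -> next block of the same file.
fat = {4: 7, 7: 2, 2: 10, 10: 12, 12: EOF,   # file A
       6: 3, 3: 11, 11: 14, 14: EOF}          # file B

def file_blocks(first_block):
    """Follow the chain entirely in memory -- no disk references needed."""
    chain, b = [], first_block
    while b != EOF:
        chain.append(b)
        b = fat[b]
    return chain

assert file_blocks(4) == [4, 7, 2, 10, 12]
assert file_blocks(6) == [6, 3, 11, 14]
# Random access to block n still walks n links, but all of them in RAM:
assert file_blocks(4)[3] == 10
```

The directory entry stores only the first block number (4 or 6 here); everything else follows from the table.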

I-nodes

  • Associate with each file a data structure called an i-node(index-node), which lists the attributes and disk addresses of the file’s blocks.
  • Advantage of this scheme over linked files using an in-memory table is that the i-node need be in memory only when the corresponding file is open. If each inode occupies n bytes and a maximum of k files may be open at once, the total memory occupied by the array holding the i-nodes for the open files is only k*n bytes. Only this much space need be reserved in advance.
  • This array is smaller than the space occupied by the file table in the previous section.
    1. The table for holding the linked list of all disk blocks is proportional in size to the disk itself. If the disk has n blocks, the table needs n entries. As disks grow larger, this table grows linearly with them.
    2. The i-node scheme requires an array in memory whose size is proportional to the maximum number of files that may be open at once; it is independent of the disk size.
  • Problem with i-nodes is that if each one has room for a fixed number of disk addresses, what happens when a file grows beyond this limit?
    Solution is to reserve the last disk address for the address of a block containing more disk-block addresses. Even more advanced would be two or more such blocks containing disk addresses or even disk blocks pointing to other disk blocks full of addresses.
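The direct-plus-indirect addressing can be sketched as a lookup function. The constants (10 direct slots, 256 addresses per indirect block) are illustrative, not from the text:

```python
N_DIRECT = 10     # direct address slots in the i-node (illustrative)
APB = 256         # addresses per indirect block (e.g., 1-KB blocks, 4-byte addrs)

def lookup(inode, disk, n):
    """Return the disk address of logical block n of a file."""
    if n < N_DIRECT:
        return inode["direct"][n]        # address is right in the i-node
    n -= N_DIRECT
    if n < APB:
        # Single indirect: one extra disk read fetches the block of addresses.
        return disk[inode["indirect"]][n]
    raise ValueError("file too large for this single-indirect sketch")

# With this layout the maximum file size is N_DIRECT + APB blocks:
assert N_DIRECT + APB == 266

disk = {99: [500 + i for i in range(APB)]}   # block 99 holds the indirect addresses
inode = {"direct": [7, 8, 9, 10, 11, 12, 13, 14, 15, 16], "indirect": 99}
assert lookup(inode, disk, 3) == 10
assert lookup(inode, disk, 10) == 500        # first block reached via the indirect
assert lookup(inode, disk, 265) == 755       # last addressable block
```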

4.3.3 Implementing Directories

  • Before a file can be read, it must be opened. When a file is opened, the OS uses the path name supplied by the user to locate the directory entry on the disk. The directory entry provides the information needed to find the disk blocks. Depending on the system, this information may be the disk address of the entire file(contiguous allocation), the number of the first block(both linked-list schemes), or the number of the i-node. In all cases, the main function of the directory system is to map the ASCII name of the file onto the information needed to locate the data.
  • Every file system maintains various file attributes and they must be stored somewhere. One possibility is to store them directly in the directory entry. This option is shown in Fig. 4-14(a): a directory consists of a list of fixed-size entries, one per file, containing a file name, a structure of the file attributes, and one or more disk addresses telling where the disk blocks are.

  • Fig. 4-14(b). Systems that use i-nodes can store the attributes in the i-nodes. The directory entry then stores just a file name and an i-node number.
  • So far we have assumed that files have short, fixed-length names. How can modern OS support longer, variable-length file names?
    1. Set a limit X on file-name length and use one of the designs of Fig. 4-14 with X characters reserved for each file name. This approach wastes a great deal of directory space since few files have such long names.
    2. Each directory entry contains a fixed portion, typically starting with the length of the entry, followed by data with a fixed format (usually including the owner, creation time, protection information, and other attributes). This fixed-length header is followed by the actual file name, as shown in Fig. 4-15(a).

  • We have three files(project-budget, personnel, foo) and each file name is terminated by a special character. To allow each directory entry to begin on a word boundary, each file name is filled out to an integral number of words.
  • Disadvantage:
    1. When a file is removed, a variable-sized gap is introduced into the directory, and the next file to be entered may not fit into it. Compacting the directory is feasible because it is entirely in memory.
    2. A single directory entry may span multiple pages, so a page fault may occur while reading a file name.
  • Fig. 4-15(b): Another way to handle variable-length names is to make the directory entries themselves all fixed length and keep the file names together in a heap at the end of the directory.
    Advantage: when an entry is removed, the next file entered will always fit there.
    Disadvantage: the heap must be managed and page faults can still occur while processing file names.
    One minor win is that there is no longer any real need for file names to begin at word boundaries, so no filler characters are needed after file names as in Fig. 4-15(a).
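The Fig. 4-15(a) layout can be sketched as a pair of pack/unpack routines. The header format (entry length plus one attribute word) and the 4-byte word size are assumptions for illustration:

```python
import struct

WORD = 4  # pad names so each entry begins on a word boundary

def pack_entry(name: bytes, attrs: int) -> bytes:
    """One Fig. 4-15(a)-style entry: fixed header, then the padded name."""
    body = name + b"\x00"                       # name, terminated
    pad = (-len(body)) % WORD                   # fill out to a word boundary
    entry_len = 8 + len(body) + pad             # 8-byte header assumed
    return struct.pack("<II", entry_len, attrs) + body + b"\x00" * pad

def unpack_dir(blob: bytes):
    """Walk a block of packed entries, yielding (name, attrs) pairs."""
    off, out = 0, []
    while off < len(blob):
        entry_len, attrs = struct.unpack_from("<II", blob, off)
        name = blob[off + 8 : blob.index(b"\x00", off + 8)]
        out.append((name.decode(), attrs))
        off += entry_len                        # the length field finds the next entry
    return out

blob = b"".join([pack_entry(b"project-budget", 1),
                 pack_entry(b"personnel", 2),
                 pack_entry(b"foo", 3)])
assert unpack_dir(blob) == [("project-budget", 1), ("personnel", 2), ("foo", 3)]
```

Removing the middle entry would leave a variable-sized gap in `blob`, which is exactly the compaction problem described above.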
  • In all designs so far, directories are searched linearly from beginning to end when a file name has to be looked up and linear searching can be slow. One way to speed up the search is to use a hash table in each directory.
  • Call the size of the table n. To enter a file name:
    1. The name is hashed onto a value between 0 and n − 1.
    2. The table entry corresponding to the hash code is inspected.
      (1) If it is unused, a pointer is placed there to the file entry. File entries follow the hash table.
      (2) If that slot is already in use, a linked list is constructed, headed at the table entry and threading through all entries with the same hash value.
  • Looking up a file follows the same procedure. The file name is hashed to select a hash-table entry. All the entries on the chain headed at that slot are checked to see if the file name is present. If the name is not on the chain, the file is not present in the directory.
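A toy model of the enter/lookup procedure (Python lists stand in for the on-disk chains; the table size 8 is arbitrary):

```python
class HashedDirectory:
    """Directory lookup via a hash table with chaining, as described above."""

    def __init__(self, n=8):
        self.n = n
        self.table = [[] for _ in range(n)]   # each slot heads a chain

    def _slot(self, name):
        return hash(name) % self.n            # hash onto 0 .. n-1

    def enter(self, name, inode):
        self.table[self._slot(name)].append((name, inode))

    def lookup(self, name):
        # Check every entry on the chain headed at the hashed slot.
        for entry_name, inode in self.table[self._slot(name)]:
            if entry_name == name:
                return inode
        return None                           # not present in the directory

d = HashedDirectory()
d.enter("foo", 41)
d.enter("bar", 42)
assert d.lookup("foo") == 41
assert d.lookup("bar") == 42
assert d.lookup("baz") is None
```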
  • Another way to speed up searching large directories is to cache the results of searches. Before starting a search, a check is first made to see if the file name is in the cache. If so, it can be located immediately.

4.3.4 Shared Files

  • Figure 4-16. The connection between B’s directory and the shared file is called a link. The file system is now a Directed Acyclic Graph(DAG).
  • Sharing files is convenient, but it also introduces problems.
    If directories really do contain disk addresses, then a copy of the disk addresses will have to be made in B’s directory when the file is linked. If either B or C subsequently appends to the file, the new blocks will be listed only in the directory of the user doing the append. The changes will not be visible to the other user, thus defeating the purpose of sharing.
  • Two solutions:
    1. Disk blocks are not listed in directories, but in a data structure associated with the file itself. The directories then point to the data structure. This is the approach used in UNIX(where the data structure is the i-node).
    2. B links to one of C’s files by having the system create a new file of type LINK and entering that file in B’s directory. The new file contains just the path name of the file to which it is linked. When B reads from the linked file, the OS sees that the file being read from is of type LINK, looks up the name of the file, and reads that file. This approach is called symbolic linking, contrast with hard linking.

  • Fig. 4-17. At the moment that B links to the shared file, the i-node records the file’s owner as C. Creating a link does not change the ownership, but it increases the link count in the i-node, so the system knows how many directory entries currently point to the file. If C subsequently tries to remove the file, the system is faced with a problem. If it removes the file and clears the i-node, B will have a directory entry pointing to an invalid i-node. If the i-node is later reassigned to another file, B’s link will point to the wrong file. The system can see from the count in the i-node that the file is still in use, but there is no easy way for it to find all the directory entries for the file, in order to erase them.
  • The only thing to do is remove C’s directory entry, but leave the i-node intact with count set to 1, as shown in Fig. 4-17(c). B is the only user having a directory entry for a file owned by C.
  • With symbolic links this problem does not arise because only the true owner has a pointer to the i-node. Users who have linked to the file just have path names, not i-node pointers. When the owner removes the file, it is destroyed. Subsequent attempts to use the file via a symbolic link will fail when the system is unable to locate the file. Removing a symbolic link does not affect the file at all.
  • The problem with symbolic links is the extra overhead required. The file containing the path must be read, then the path must be parsed and followed, until the i-node is reached. All of this activity may require a considerable number of extra disk accesses. Furthermore, an extra i-node is needed for each symbolic link, as is an extra disk block to store the path, although if the path name is short, the system could store it in the i-node itself, as a kind of optimization.
  • Symbolic links have the advantage that they can be used to link to files on machines anywhere in the world, by simply providing the network address of the machine where the file resides in addition to its path on that machine.
  • There is another problem introduced by links, symbolic or otherwise. When links are allowed, files can have two or more paths. Programs that start at a given directory and find all the files in that directory and its subdirectories will locate a linked file multiple times. E.g., a program that dumps all the files in a directory and its subdirectories onto a tape may make multiple copies of a linked file. Furthermore, if the tape is then read into another machine, unless the dump program is clever, the linked file will be copied twice onto the disk, instead of being linked.

4.3.5 Log-Structured File Systems

  • The idea that drove the LFS design is that as CPUs get faster and RAMs get larger, disk caches are also increasing rapidly. It is possible to satisfy a very substantial fraction of all read requests directly from the file-system cache, with no disk access needed. It follows from this observation that in the future, most disk accesses will be writes, so the read-ahead mechanism used in file systems to fetch blocks before they are needed no longer gains much performance. What’s worse, in most file systems, writes are done in very small chunks. Small writes are highly inefficient, since a 50-μsec disk write is often preceded by a 10-msec seek and a 4-msec rotational delay.
  • To see where all the small writes come from, consider creating a new file on a UNIX system. To write this file, the i-node for the directory, the directory block, the i-node for the file, and the file itself must all be written. While these writes can be delayed, doing so exposes the file system to consistency problems if a crash occurs before the writes are done. So, the i-node writes are generally done immediately.
  • From this reasoning, the LFS designers structure the entire disk as a great big log to achieve the full bandwidth of the disk, even in the face of a workload consisting in large part of small random writes.
  • Periodically, and when there is a need for it, all the pending writes being buffered in memory are collected into a single segment and written to the disk as a single contiguous segment at the end of the log. A single segment may thus contain i-nodes, directory blocks, and data blocks, all mixed together. At the start of each segment is a segment summary, telling what can be found in the segment. If the average segment can be made to be about 1 MB, almost the full bandwidth of the disk can be utilized.
  • In this design, i-nodes still exist and have the same structure as in UNIX, but they are scattered all over the log, instead of being at a fixed position on the disk. Finding an i-node is harder since its address cannot be calculated from its i-number as in UNIX. When an i-node is located, locating the blocks is done in the usual way.
  • To find i-nodes, an i-node map indexed by i-number is maintained. Entry i in this map points to i-node i on the disk. The map is kept on disk, but it is also cached, so the heavily used parts will be in memory most of the time.
  • Summary: All writes are initially buffered in memory, and periodically all the buffered writes are written to the disk in a single segment, at the end of the log. Opening a file consists of using the map to locate the i-node for the file. Once the i-node has been located, the addresses of the blocks can be found from it. All of the blocks will be in segments, somewhere in the log.
  • Because real disks are finite, eventually the log will occupy the entire disk, at which time no new segments can be written to the log.
  • Solution: LFS has a cleaner thread that scans the log circularly to compact it.
    It starts out by reading the summary of the first segment in the log to see which i-nodes and files are there. It then checks the current i-node map to see if the i-nodes are still current and file blocks are still in use. If not, that information is discarded. The i-nodes and blocks that are still in use go into memory to be written out in the next segment. The original segment is then marked as free, so that the log can use it for new data.
    In this manner, the cleaner moves along the log, removing old segments from the back and putting any live data into memory for rewriting in the next segment. So, the disk is a circular buffer, with the writer thread adding new segments to the front and the cleaner thread removing old ones from the back.
  • The bookkeeping here is nontrivial since when a file block is written back to a new segment, the i-node of the file(somewhere in the log) must be located, updated, and put into memory to be written out in the next segment. The i-node map must then be updated to point to the new copy.
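The segment-plus-i-node-map idea can be simulated with a toy model (all data structures here are invented for illustration; a real LFS keeps them on disk):

```python
class LogFS:
    """Toy log-structured layout: buffered writes go out as one segment
    appended to the log; the i-node map records each i-node's latest position."""

    def __init__(self):
        self.log = []       # the disk: a list of segments
        self.imap = {}      # i-number -> (segment index, entry offset of i-node)
        self.pending = []   # writes buffered in memory

    def write_file(self, inum, data):
        self.pending.append((inum, data))

    def flush(self):
        # Collect all pending writes into a single contiguous segment.
        seg = []
        for inum, data in self.pending:
            seg.append(("data", inum, data))
            seg.append(("inode", inum, len(seg) - 1))  # i-node points at its data
            self.imap[inum] = (len(self.log), len(seg) - 1)
        self.log.append(seg)
        self.pending = []

    def read_file(self, inum):
        seg_i, off = self.imap[inum]          # the map finds the live i-node
        _kind, _n, data_off = self.log[seg_i][off]
        return self.log[seg_i][data_off][2]

fs = LogFS()
fs.write_file(1, b"hello")
fs.flush()
fs.write_file(1, b"hello, world")             # rewrite: a new copy in a new segment
fs.flush()
assert fs.read_file(1) == b"hello, world"     # the map points at the latest copy
assert len(fs.log) == 2                       # the stale copy still sits in segment 0
```

The stale copy left behind in segment 0 is precisely what the cleaner thread reclaims.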

4.3.6 Journaling File Systems

  • Journaling file systems: keep a log of what the file system is going to do before it does it, so that if the system crashes before it can do its planned work, upon rebooting the system can look in the log to see what was going on at the time of the crash and finish the job. The Linux ext3 file system uses journaling.
  • Consider removing a file operation in UNIX:
    1. Remove the file from its directory.
    2. Release the i-node to the pool of free i-nodes.
    3. Return all the disk blocks to the pool of free disk blocks.
    Now consider what happens if the system crashes partway through:
    1. Suppose the first step is completed and then the system crashes. The i-node and file blocks will not be accessible from any file, but will also not be available for reassignment; they are just off in limbo somewhere, decreasing the available resources. If the crash occurs after the second step, only the blocks are lost.
    2. If the order of operations is changed and the i-node is released first, then after rebooting, the i-node may be reassigned, but the old directory entry will continue to point to it, hence to the wrong file.
    3. If the blocks are released first, then a crash before the i-node is cleared will mean that a valid directory entry points to an i-node listing blocks now in the free storage pool, which are likely to be reused shortly, leading to two or more files randomly sharing the same blocks.
  • What the journaling file system does is first write a log entry listing the three actions to be completed and then the log entry is written to disk. Only after the log entry has been written, do the various operations begin.
  • After the operations complete successfully, the log entry is erased. If the system now crashes, upon recovery the file system can check the log to see if any operations were pending. If so, all of them can be rerun (multiple times in the event of repeated crashes) until the file is correctly removed.
  • To make journaling work, the logged operations must be idempotent, which means they can be repeated as often as necessary without harm. Operations such as “Update the bitmap to mark i-node k or block n as free” can be repeated until the cows come home with no danger. But adding the newly freed blocks from i-node k to the end of the free list is not idempotent, since they may already be there. Journaling file systems have to arrange their data structures and loggable operations so they all are idempotent. Under these conditions, crash recovery can be made fast and secure.
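The idempotency requirement can be demonstrated with a toy replay (the bitmap/free-list structures here are invented for illustration):

```python
# Idempotent operation: "mark block n free" in a bitmap.
free_bitmap = [False] * 16        # True = block is free

def mark_free(bitmap, n):
    """Replaying this any number of times leaves the same state."""
    bitmap[n] = True

# Non-idempotent alternative: appending to a free list.
free_list = []

def append_free(lst, n):
    lst.append(n)                 # a replay duplicates the entry

journal = [5, 9]                  # blocks the logged removal frees

# Simulate replaying the log after three repeated crashes.
for _ in range(3):
    for n in journal:
        mark_free(free_bitmap, n)
assert free_bitmap[5] and free_bitmap[9]
assert sum(free_bitmap) == 2      # still exactly two blocks freed: harmless

for _ in range(3):
    for n in journal:
        append_free(free_list, n)
assert free_list == [5, 9, 5, 9, 5, 9]   # replay corrupted the free list
```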

4.3.7 Virtual File Systems

  • Linux attempts to integrate multiple file systems into a single structure. From the user’s point of view, there is a single file-system hierarchy; the fact that it spans multiple (incompatible) file systems is not visible to users or processes.
  • Most UNIX systems have used the concept of a VFS(virtual file system) to try to integrate multiple file systems into an orderly structure. The idea is to abstract out that part of the file system that is common to all file systems and put that code in a separate layer that calls the underlying concrete file systems to actually manage the data. The overall structure is illustrated in Fig. 4-18.

  • All system calls relating to files are directed to the virtual file system for initial processing. These calls, coming from user processes, are the standard POSIX calls. Thus the VFS has an upper POSIX interface to user processes.
  • The VFS also has a lower VFS interface to the concrete file systems. This interface consists of several dozen function calls that the VFS can make to each file system to get work done.
  • Most VFS implementations are object oriented.
    1. There are several object types that are normally supported. These include the superblock(which describes a file system), the v-node(which describes a file), and the directory(which describes a file system directory). Each of these has associated operations(methods) that the concrete file systems must support.
    2. The VFS has some internal data structures for its own use, including the mount table and an array of file descriptors to keep track of all the open files in the user processes.
  • When the system is booted, the root file system is registered with the VFS. When other file systems are mounted, either at boot time or during operation, they also must register with the VFS. When a file system registers, what it basically does is provide a list of the addresses of the functions the VFS requires, either as one long call vector(table) or as several of them, one per VFS object, as the VFS demands. Thus once a file system has registered with the VFS, the VFS knows how to do the actual operation(e.g., read a block from it: it calls the fourth(or whatever) function in the vector supplied by the file system). Similarly, the VFS also knows how to carry out every other function the concrete file system must supply: it just calls the function whose address was supplied when the file system registered.
  • After a file system has been mounted, it can be used. E.g., if a file system has been mounted on /usr and a process makes the call
    open("/usr/include/unistd.h", O_RDONLY);
    While parsing the path, the VFS sees that a new file system has been mounted on /usr and locates its superblock by searching the list of superblocks of mounted file systems. Then it can find the root directory of the mounted file system and look up the path include/unistd.h there. The VFS then creates a v-node and makes a call to the concrete file system to return all the information in the file’s inode. This information is copied into the v-node(in RAM), along with other information, most importantly the pointer to the table of functions to call for operations on v-nodes, such as read, write, close, and so on.
  • After the v-node has been created, the VFS makes an entry in the file-descriptor table for the calling process and sets it to point to the new v-node.(The file descriptor actually points to another data structure that contains the current file position and a pointer to the v-node.) Finally, the VFS returns the file descriptor to the caller so it can use it to read, write, and close the file.
  • Later when the process does a read using the file descriptor, the VFS locates the v-node from the process and file descriptor tables and follows the pointer to the table of functions, all of which are addresses within the concrete file system on which the requested file resides. The function that handles read is now called and code within the concrete file system goes and gets the requested block. The data structures involved are shown in Fig. 4-19.
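The registration-and-dispatch mechanism described above can be sketched as follows. This is a minimal Python analogy, not the real Linux VFS: the class, table layout, and function names are invented for illustration. The point is that the VFS calls through a table of functions supplied at registration time, without knowing which concrete file system it is talking to.

```python
# Toy VFS-style dispatch. A concrete file system registers a call vector
# (a table of its operations); the VFS later calls through that table.

class ConcreteFS:
    """Hypothetical concrete file system; just maps block numbers to data."""
    def __init__(self, blocks):
        self.blocks = blocks              # block number -> data

    def read_block(self, n):
        return self.blocks[n]

mounted = {}                              # mount point -> operations table

def register(mount_point, fs):
    # On registration the file system supplies its table of functions.
    mounted[mount_point] = {"read_block": fs.read_block}

def vfs_read_block(mount_point, n):
    # The VFS just calls the function whose address was registered.
    return mounted[mount_point]["read_block"](n)

register("/usr", ConcreteFS({0: b"superblock", 4: b"unistd.h data"}))
data = vfs_read_block("/usr", 4)          # dispatches into the concrete FS
```

In the real kernel the table is split per object type (superblock operations, i-node operations, file operations), but the dispatch idea is the same.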

  • To add new file systems, the designers first get a list of function calls the VFS expects and then write their file system to provide all of them. If the file system already exists, then they have to provide wrapper functions that do what the VFS needs, usually by making one or more native calls to the concrete file system.

4.4 FILE-SYSTEM MANAGEMENT AND OPTIMIZATION

  • Two strategies are possible for storing an n-byte file: n consecutive bytes of disk space are allocated, or the file is split up into a number of(not necessarily contiguous) blocks.
    Storing a file as a contiguous sequence of bytes has the problem that if a file grows, it may have to be moved on the disk. For this reason, nearly all file systems chop files up into fixed-size blocks that need not be adjacent.

Block Size

    1. Having a large block size means that even a tiny file ties up an entire block(in the extreme, an entire cylinder), so small files waste a large amount of disk space.
    2. A small block size means that most files will span multiple blocks and thus need multiple seeks and rotational delays to read them, reducing performance.
      That is, if the allocation unit is too large, we waste space; if it is too small, we waste time.
  • Making a good choice requires having information about the file-size distribution. The results are shown in Fig. 4-20.

  • Figure 4-21.
    The curves show that performance and space utilization are in conflict.
    1. Data Rate. The access time for a block is completely dominated by the seek time and rotational delay, so given that it is going to cost x msec to access a block, the more data that are fetched, the better. Hence the data rate goes up with block size.
    2. Space Efficiency. With 4-KB files and 1-KB, 2-KB, or 4-KB blocks, files use 4, 2, and 1 block, respectively, with no wastage. With an 8-KB block and 4-KB files, the space efficiency drops to 50%, and with a 16-KB block it is down to 25%.
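The space-efficiency arithmetic above is easy to check directly. A minimal sketch (the function name is invented):

```python
import math

def space_efficiency(file_size, block_size):
    """Fraction of the allocated disk space that actually holds file data."""
    blocks = math.ceil(file_size / block_size)   # blocks are allocated whole
    return file_size / (blocks * block_size)

# A 4-KB file with 1-KB, 2-KB, or 4-KB blocks wastes nothing (efficiency 1.0);
# with 8-KB blocks efficiency drops to 0.5, and with 16-KB blocks to 0.25.
eff = {bs: space_efficiency(4096, bs) for bs in (1024, 2048, 4096, 8192, 16384)}
```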

Keeping Track of Free Blocks

  • Two methods are used to keep track of free blocks(Fig. 4-22).
    1. Use a linked list of disk blocks, with each block holding as many free disk block numbers as will fit. With a 1-KB block and a 32-bit disk block number, each block on the free list holds the numbers of 255 free blocks(one slot is required for the pointer to the next block). If free blocks tend to come in long runs of consecutive blocks, the free-list system can be modified to keep track of runs of blocks rather than single blocks. An 8-, 16-, or 32-bit count could be associated with each block giving the number of consecutive free blocks.
    2. Use a bitmap. A disk with n blocks requires a bitmap with n bits. Free blocks are represented by 1s in the map, allocated blocks by 0s(or vice versa).
  • For the free-list method, only one block of pointers need be kept in main memory.
    1. When a file is created, the needed blocks are taken from the block of pointers. When it runs out, a new block of pointers is read in from the disk.
    2. When a file is deleted, its blocks are freed and added to the block of pointers in main memory. When this block fills up, it is written to disk.

  • This method leads to unnecessary disk I/O. Consider Fig. 4-23(a): the block of pointers in memory has room for only two more entries.
    1. If a three-block file is freed, the pointer block overflows and has to be written to disk, leading to the situation of Fig. 4-23(b).
    2. If a three-block file is now written, the full block of pointers has to be read in again, taking us back to Fig. 4-23(a).
      That is, when the block of pointers is almost empty, a series of short-lived files can cause a lot of disk I/O.
  • Solution: The idea is to keep most of the pointer blocks on disk full, keep the one in memory about half full, so it can handle both file creation and file removal without disk I/O on the free list. We go from Fig. 4-23(a) to Fig. 4-23(c) when three blocks are freed.
  • With a bitmap, it is also possible to keep just one block in memory, going to disk for another only when it becomes completely full or empty. A benefit of this approach is that by doing all the allocation from a single block of the bitmap, the disk blocks will be close together, thus minimizing disk-arm motion. Since the bitmap is a fixed-size data structure, if the kernel is paged, the bitmap can be put in virtual memory and have pages of it paged in as needed.
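A bitmap allocator that keeps allocations close together, as suggested above, can be sketched as follows. This is a toy in-memory model (the class and method names are invented); a real file system would operate on on-disk bitmap blocks.

```python
class BitmapAllocator:
    """Toy free-block bitmap: 1 = free, 0 = allocated (convention arbitrary)."""
    def __init__(self, n_blocks):
        self.bits = [1] * n_blocks

    def alloc_near(self, goal):
        # Scan outward from 'goal' so allocated blocks stay close together,
        # minimizing disk-arm motion when the file is later read sequentially.
        for dist in range(len(self.bits)):
            for b in (goal - dist, goal + dist):
                if 0 <= b < len(self.bits) and self.bits[b]:
                    self.bits[b] = 0
                    return b
        raise OSError("disk full")

    def free(self, b):
        self.bits[b] = 1

disk = BitmapAllocator(16)
first = disk.alloc_near(5)    # block 5 itself is free, so it is chosen
second = disk.alloc_near(5)   # block 5 is taken; a neighbor is chosen
```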

Disk Quotas

  • To prevent people from hogging too much disk space, multiuser operating systems assign each user a maximum allotment of files and blocks, and the OS makes sure that users do not exceed their quotas.
  • When a user opens a file, the attributes and disk addresses are located and put into an open-file table in main memory. Among the attributes is an entry telling who the owner is. Any increases in the file’s size will be charged to the owner’s quota. A second table contains the quota record for every user with a currently open file, even if the file was opened by someone else.

  • This table is shown in Fig. 4-24. It is an extract from a quota file on disk for the users whose files are currently open. When all the files are closed, the record is written back to the quota file.
  • When a new entry is made in the open-file table, a pointer to the owner’s quota record is entered into it, to make it easy to find the various limits.
    Every time a block is added to a file, the total number of blocks charged to the owner is incremented, and a check is made against both the hard and soft limits. The soft limit may be exceeded, but the hard limit may not. An attempt to append to a file when the hard block limit has been reached will result in an error.
    Analogous checks exist for the number of files to prevent a user from hogging all the i-nodes.
  • When a user attempts to log in, the system examines the quota file to see if the user has exceeded the soft limit for either number of files or number of disk blocks. If either limit has been violated, a warning is displayed, and the count of warnings remaining is reduced by one. If the count gets to zero, the user has ignored the warning too many times, and is not permitted to log in.
  • This method has the property that users may go above their soft limits during a login session, provided they remove the excess before logging out. The hard limits may never be exceeded.
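The soft/hard-limit check described above can be sketched in a few lines. The record layout and limit values are invented for illustration; real quota records also track file counts and warning counters.

```python
class QuotaRecord:
    """Hypothetical per-user quota record (block limits only)."""
    def __init__(self, soft_blocks, hard_blocks):
        self.soft_blocks = soft_blocks
        self.hard_blocks = hard_blocks
        self.blocks_in_use = 0

def charge_block(q):
    """Charge one new block to the owner.

    The hard limit may never be exceeded: an append past it fails with an
    error. Returns True if the user is now over the soft limit (allowed,
    but the user will be warned at login time)."""
    if q.blocks_in_use + 1 > q.hard_blocks:
        raise OSError("hard block limit reached")
    q.blocks_in_use += 1
    return q.blocks_in_use > q.soft_blocks

q = QuotaRecord(soft_blocks=2, hard_blocks=3)
charge_block(q); charge_block(q)      # within the soft limit
over = charge_block(q)                # third block exceeds the soft limit
```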

4.4.2 File-System Backups

  • Backups to tape are made to handle one of two potential problems:
    1. Recover from disaster.
    2. Recover from stupidity(users accidentally remove files that they later need again).
  • First, should the entire file system be backed up or only part of it?
    1. At many installations, the executable programs are kept in a limited part of the file-system tree. It is not necessary to back up these files if they can all be reinstalled from the manufacturer’s Website.
    2. Most systems have a directory for temporary files. There is no reason to back it up either. In UNIX, all the special files(I/O devices) are kept in a directory /dev.
      It is desirable to back up only specific directories and everything in them rather than the entire file system.
  • Second, it is wasteful to back up files that have not changed since the previous backup, which leads to the idea of incremental dumps.
    The simplest form of incremental dumping is to make a complete dump(backup) periodically and to make a daily dump of only those files that have been modified since the last full dump. Even better is to dump only those files that have changed since they were last dumped.
    While this scheme minimizes dumping time, it makes recovery more complicated, because first the most recent full dump has to be restored, followed by all the incremental dumps in reverse order.
  • Third, since immense amounts of data are dumped, it is desirable to compress the data before writing them to tape.
    But with many compression algorithms, a single bad spot on the backup tape can foil the decompression algorithm and make an entire file or even an entire tape unreadable.
  • Fourth and last, it is difficult to perform a backup on an active file system. If files and directories are being added, deleted, and modified during the dumping process, the resulting dump may be inconsistent.
    Algorithms have been devised for making rapid snapshots of the file-system state by copying critical data structures, and then requiring future changes to files and directories to copy the blocks instead of updating them in place. In this way, the file system is effectively frozen at the moment of the snapshot, so it can be backed up at leisure afterward.
  • Two strategies can be used for dumping a disk to a backup disk: a physical dump or a logical dump.

Physical Dump

  • A physical dump starts at block 0 of the disk, writes all the disk blocks onto the output disk in order, and stops when it has copied the last one.
  • Since there is no value in backing up unused disk blocks, if the dumping program can obtain access to the free-block data structure, it can avoid dumping unused blocks. But skipping unused blocks requires writing the number of each block in front of the block, since it is no longer true that block k on the backup was block k on the disk.
  • Bad blocks complicate physical dumps. Sometimes when a low-level format is done, the bad blocks are detected, marked as bad, and replaced by spare blocks reserved at the end of each track for such emergencies. In many cases, the disk controller handles bad-block replacement transparently without the OS knowing about it. But sometimes blocks go bad after formatting, in which case the OS will eventually detect them. The OS solves the problem by creating a file consisting of all the bad blocks to make sure they never appear in the free-block pool and are never assigned. This file is completely unreadable.
  • If all bad blocks are remapped by the disk controller and hidden from the OS, physical dumping works fine. But if they are visible to the OS and maintained in one or more bad-block files or bitmaps, it is essential that the physical dumping program get access to this information and avoid dumping them to prevent endless disk read errors while trying to back up the bad-block file.
  • Specific systems may also have other internal files that should not be backed up, so the dumping program needs to be aware of them.
  • Advantages: simplicity and great speed.
    Disadvantages: the inability to skip selected directories, make incremental dumps, and restore individual files upon request. For these reasons, most installations make logical dumps.

Logical Dump

  • A logical dump starts at one or more specified directories and recursively dumps all files and directories found there that have changed since some given base date. So, the dump disk gets a series of identified directories and files, which makes it easy to restore a specific file or directory upon request.

  • Fig. 4-25: The shaded items have been modified since the base date and thus need to be dumped; the unshaded ones do not need to be dumped.
  • This algorithm also dumps all directories that lie on the path to a modified file or directory for two reasons.
    1. Make it possible to restore the dumped files and directories to a fresh file system on a different computer. In this way, the dump and restore programs can be used to transport entire file systems between computers.
    2. Make it possible to incrementally restore a single file. Suppose that a full file system dump is done on t0. On t1, the directory /usr/jhs/proj/nr3 is removed, along with all the directories and files under it. On t2, the user wants to restore the file /usr/jhs/proj/nr3/plans/summary. It is not possible to just restore the file summary because there is no place to put it, so, the directories nr3 and plans must be restored first. To get their owners, modes and whatever, these directories must be present on the dump disk even though they themselves were not modified since the previous full dump.

  • The dump algorithm maintains a bitmap indexed by i-node number with several bits per i-node. The algorithm operates in four phases.
    1. Phase 1 begins at the starting directory(the root in this example) and examines all the entries in it. For each modified file, its i-node is marked in the bitmap. Each directory is also marked and then recursively inspected. At the end of phase 1, all modified files and all directories have been marked in the bitmap, Fig. 4-26(a).
    2. Phase 2 recursively walks the tree to un-mark any directories that have no modified files or directories in them or under them. The directories and files that must be dumped are marked in Fig. 4-26(b). For efficiency, phases 1 and 2 can be combined in one tree walk.
    3. Phase 3 consists of scanning the i-nodes in numerical order and dumping all the directories that are marked for dumping, shown in Fig. 4-26(c). Each directory is prefixed by the directory’s attributes so that they can be restored.
    4. Phase 4, the files marked in Fig. 4-26(d) are dumped, again prefixed by their attributes. This completes the dump.
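Phases 1 and 2 above can be sketched on a toy directory tree. The tree layout and field names are invented; the point is that phase 1 marks every directory plus all modified files, and phase 2 unmarks directories with nothing dumpable under them.

```python
# Toy file-system tree for the dump algorithm's marking phases.
tree = {
    "/":      {"type": "dir",  "children": ["/usr", "/tmp"]},
    "/usr":   {"type": "dir",  "children": ["/usr/a"]},
    "/usr/a": {"type": "file", "modified": True},
    "/tmp":   {"type": "dir",  "children": ["/tmp/b"]},
    "/tmp/b": {"type": "file", "modified": False},
}
marked = set()

def phase1(path):
    node = tree[path]
    if node["type"] == "dir":
        marked.add(path)                 # every directory is marked
        for c in node["children"]:
            phase1(c)
    elif node["modified"]:
        marked.add(path)                 # modified files are marked

def phase2(path):
    """Return True if anything under 'path' must be dumped; unmark if not."""
    node = tree[path]
    if node["type"] == "file":
        return path in marked
    # Evaluate every child (a list, not a lazy any()) so all subtrees
    # are visited and unmarked where appropriate.
    keep = [phase2(c) for c in node["children"]]
    if not any(keep):
        marked.discard(path)
    return any(keep)

phase1("/")
phase2("/")   # /tmp has no modified files under it, so it gets unmarked
```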
  • Restore a file system from the dump disk:
    1. Create an empty file system on disk.
    2. Restore the most recent full dump. Since the directories appear first on the dump disk, they are all restored first, giving a skeleton of the file system; then the files are restored.
    3. Step 1 and 2 are then repeated with the first incremental dump made after the full dump, then the next one, and so on.

A few tricky issues

  • Free block list: Since the free block list is not a file, it is not dumped and hence it must be reconstructed from scratch after all the dumps have been restored. Doing so is possible since the set of free blocks is the complement of the set of blocks contained in all the files combined.
  • Links: If a file is linked to two or more directories, the file must be restored only once, and all the directories that are supposed to point to it must be made to do so.
  • UNIX files’ holes: It is legal to open a file, write a few bytes, then seek to a distant file offset and write a few more bytes. The blocks in between are not part of the file and should not be dumped and must not be restored. Core files often have a hole of hundreds of megabytes between the data segment and the stack. If not handled properly, each restored core file will fill this area with zeros and thus be the same size as the virtual address space(2^32 bytes or 2^64 bytes).
  • Special files(named pipes and anything that is not a real file): should never be dumped, no matter in which directory they may occur(they need not be confined to /dev).
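The hole behavior described above is easy to demonstrate: seeking past the end of a file and writing produces a file whose logical size includes the gap, even though (on file systems that support sparse files) no disk blocks back the hole.

```python
import os
import tempfile

# Write a few bytes, seek ~1 MB ahead without writing, write a few more.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"head")
    f.seek(1_000_000)          # the bytes in between are never written
    f.write(b"tail")

size = os.path.getsize(path)   # logical size: 1_000_000 + len(b"tail")
os.remove(path)
```

Whether the hole actually consumes disk blocks (visible via st_blocks on UNIX) depends on the underlying file system, so only the logical size is checked here.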

4.4.3 File-System Consistency

  • Many file systems read blocks, modify them, and write them out later. If the system crashes before all the modified blocks have been written out, the file system can be left in an inconsistent state.
  • To deal with inconsistent file systems, most computers have a program that checks file-system consistency(UNIX has fsck). This utility can be run whenever the system is booted, especially after a crash. The description below tells how fsck works. All file-system checkers verify each file system(disk partition) independently of the other ones. Two kinds of consistency checks can be made: blocks and files.

Check for Block Consistency

  • To check for block consistency, the program builds two tables, each one containing a counter for each block, initially set to 0. The counters in the first table keep track of how many times each block is present in a file; the counters in the second table record how often each block is present in the free list(or the bitmap of free blocks).
  • The program then reads all the i-nodes using a raw device, which ignores the file structure and returns all the disk blocks starting at 0. Starting from an inode, it is possible to build a list of all the block numbers used in the corresponding file. As each block number is read, its counter in the first table is incremented.
  • The program then examines the free list or bitmap to find all the blocks that are not in use. Each occurrence of a block in the free list results in its counter in the second table being incremented.

  • Fig. 4-27(a): If the file system is consistent, each block will have a 1 either in the first table or in the second table.
  • Fig. 4-27(b): Block 2 does not occur in either table, so it will be reported as being a missing block. Though missing blocks do no real harm, they waste space and reduce the capacity of the disk. Solution: the checker adds them to the free list.
  • Fig. 4-27(c): Block 4 occurs twice in the free list(Duplicates can occur only if the free list is really a list; with a bitmap it is impossible). Solution: rebuild the free list.
  • Fig. 4-27(d): The same data block is present in two or more files
    1. If either of these files is removed, block 5 will be put on the free list, so the same block is both in use and free at the same time.
    2. If both files are removed, the block will be put onto the free list twice.
      Solution: the checker allocates a free block, copy the contents of block 5 into it, and insert the copy into one of the files. So the information content of the files is unchanged and the file-system structure is made consistent. The error should be reported, to allow the user to inspect the damage.
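The two-table block check above can be sketched directly. This toy setup deliberately contains one anomaly of each kind (a block in neither table, a block twice on the free list, and a data block shared by two files); the file contents and block numbers are invented.

```python
N = 8                                        # total blocks on the toy disk
files = {"a": [1, 3], "b": [5], "c": [5]}    # block 5 appears in two files
free_list = [0, 2, 6, 7, 7]                  # block 7 listed twice; 4 nowhere

# Table 1: how many times each block is present in some file.
in_files = [0] * N
for blocks in files.values():
    for b in blocks:
        in_files[b] += 1

# Table 2: how many times each block is present in the free list.
in_free = [0] * N
for b in free_list:
    in_free[b] += 1

# In a consistent system every block has exactly one 1 across both tables.
missing  = [b for b in range(N) if in_files[b] == 0 and in_free[b] == 0]
dup_free = [b for b in range(N) if in_free[b] > 1]
dup_data = [b for b in range(N) if in_files[b] > 1]
```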
  • The file-system checker also checks the directory system. It uses a table of counters per file. It starts at the root directory and recursively descends the tree, inspecting each directory in the file system. For every i-node in every directory, it increments a counter for that file’s usage count. Due to hard links, a file may appear in two or more directories. Symbolic links do not count and do not cause the counter for the target file to be incremented.
  • When the checker is done, it has a list, indexed by i-node number, telling how many directories contain each file. It then compares these numbers with the link counts stored in the i-nodes themselves. These counts start at 1 when a file is created and are incremented each time a hard link is made to the file. In a consistent file system, both counts will agree.
  • Two kinds of errors can occur: the link count in the i-node can be too high or it can be too low.
    1. If the link count is higher than the number of directory entries, then even if all the files are removed from the directories, the count will still be nonzero and the inode will not be removed. This error is not serious, but it wastes space on the disk with files that are not in any directory. It should be fixed by setting the link count in the i-node to the correct value.
    2. If two directory entries are linked to a file, but the i-node says that there is only one, when either directory entry is removed, the i-node count will go to zero and then the file system marks it as unused and releases all of its blocks. This action will result in one of the directories now pointing to an unused i-node, whose blocks may soon be assigned to other files. Solution is to force the link count in the inode to the actual number of directory entries.
  • These two operations(checking blocks and checking directories) are often integrated for efficiency reasons(i.e., only one pass over the i-nodes is required).
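The directory check can be sketched in the same style as the block check. The paths and i-node numbers are invented; i-node 7 illustrates the dangerous too-low case and i-node 9 the space-wasting too-high case.

```python
# path -> i-node number it refers to (hard links share an i-node).
dir_entries = {"/a": 7, "/b": 7, "/c": 9}

# i-node -> link count stored in the i-node itself.
inode_links = {7: 1,    # too low: removing /a or /b would free a live i-node
               9: 2}    # too high: i-node would never be freed

# Count actual directory entries per i-node.
seen = {}
for ino in dir_entries.values():
    seen[ino] = seen.get(ino, 0) + 1

# The fix in both cases: force the stored link count to the actual number.
fixes = {ino: n for ino, n in seen.items() if inode_links.get(ino) != n}
```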

4.4.4 File-System Performance

Block Cache

  • One technique to reduce disk accesses is the block cache or buffer cache. A cache is a collection of blocks that logically belong on the disk but are being kept in memory for performance reasons.
  • A common algorithm is to check all read requests to see if the needed block is in the cache. If it is, the read request can be satisfied without a disk access. If the block is not in the cache, it is first read into the cache and then copied to wherever it is needed. Subsequent requests for the same block can be satisfied from the cache.

  • Fig. 4-28. The usual way to determine if a given block is present is to hash the device and disk address and look up the result in a hash table. All the blocks with the same hash value are chained together on a linked list so that the collision chain can be followed.
  • When a block has to be loaded into a full cache, some block has to be removed and rewritten to the disk if it has been modified since being brought in. One fortunate difference from paging is that cache references are relatively infrequent, so it is feasible to keep all the blocks in exact LRU order with linked lists.
  • In Fig. 4-28, in addition to the collision chains starting at the hash table, there is a bidirectional list running through all the blocks in the order of usage, with the least recently used block on the front of this list and the most recently used block at the end. When a block is referenced, it can be removed from its position on the bidirectional list and put at the end. In this way, exact LRU order can be maintained.
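An exact-LRU cache of this shape can be sketched with an ordered dictionary, which plays the role of both the hash table and the bidirectional list (front = least recently used, rear = most recently used). This is a toy model; names and capacity are invented.

```python
from collections import OrderedDict

class BlockCache:
    """Toy buffer cache in exact LRU order."""
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                   # (device, block) -> data
        self.cache = OrderedDict()         # insertion order = usage order

    def read(self, dev, block):
        key = (dev, block)
        if key in self.cache:
            self.cache.move_to_end(key)    # referenced: move to the rear
            return self.cache[key]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False) # evict the LRU block at the front
        data = self.disk[key]              # miss: read the block from "disk"
        self.cache[key] = data
        return data

disk = {(0, b): ("block%d" % b).encode() for b in range(10)}
c = BlockCache(2, disk)
c.read(0, 1); c.read(0, 2); c.read(0, 1)   # block 1 is now most recent
c.read(0, 3)                                # evicts block 2, not block 1
```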
  • But it turns out that LRU is undesirable. If a critical block, such as an i-node block, is read into the cache and modified, but not rewritten to the disk, a crash will leave the file system in an inconsistent state. If the i-node block is put at the end of the LRU chain, it may be quite a while before it reaches the front and is rewritten to the disk. Some blocks, such as i-node blocks, are rarely referenced two times within a short interval. These considerations lead to a modified LRU scheme, taking two factors into account:
    1. Is the block likely to be needed again soon?
    2. Is the block essential to the consistency of the file system?
  • For both questions, blocks can be divided into categories such as i-node blocks, indirect blocks, directory blocks, full data blocks, and partially full data blocks.
    1. Blocks that will probably not be needed again soon go on the front, rather than the rear of the LRU list, so their buffers will be reused quickly.
    2. Blocks that might be needed again soon, such as a partly full block that is being written, go on the end of the list, so they will stay around for a long time.
  • If the block is essential to the file-system consistency(basically, everything except data blocks), and it has been modified, it should be written to disk immediately, regardless of which end of the LRU list it is put on.
  • Even with this measure to keep the file-system integrity intact, it is undesirable to keep data blocks in the cache too long before writing them out. Consider someone who is using a computer to write a book. Even if our writer periodically tells the editor to write the file being edited to the disk, there is a good chance that everything will still be in the cache and nothing on the disk. If the system crashes, the file-system structure will not be corrupted, but a whole day’s work will be lost.
  • Solution in UNIX: There is a system call, sync, which forces all the modified blocks out onto the disk immediately. When the system is started up, a program is started up in the background to sit in an endless loop issuing sync calls, sleeping for 30 sec between calls. So, no more than 30 seconds of work is lost due to a crash.
  • Some OS integrate the buffer cache with the page cache. This is attractive when memory-mapped files are supported. If a file is mapped onto memory, then some of its pages may be in memory because they were demand paged in. Such pages are not different from file blocks in the buffer cache. In this case, they can be treated the same way, with a single cache for both file blocks and pages.

Block Read Ahead

  • A second technique for improving file-system performance is to try to get blocks into the cache before they are needed to increase the hit rate. When the file system is asked to produce block k in a file, it does that; when it is finished, it makes a sneaky check in the cache to see if block k + 1 is already there. If it is not, it schedules a read for block k + 1 in the hope that when it is needed, it will have already arrived in the cache. At the very least, it will be on the way.
  • This read-ahead strategy works only for files that are actually being read sequentially. If a file is being randomly accessed, read ahead does not help.
  • In fact, it hurts by tying up disk bandwidth reading in useless blocks and removing potentially useful blocks from the cache and possibly tying up more disk bandwidth writing them back to disk if they are dirty. To see whether read ahead is worth doing, the file system can keep track of the access patterns to each open file. E.g., a bit associated with each file can keep track of whether the file is in “sequential-access mode” or “random-access mode”. Initially, the file is put in sequential-access mode. Whenever a seek is done, the bit is cleared. If sequential reads start happening again, the bit is set once again. In this way, the file system can make a reasonable guess about whether it should read ahead or not.
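The per-file mode bit described above can be sketched as follows. The class and method names are invented; the rule is: start sequential, clear the bit on any seek, and set it again once reads resume back-to-back.

```python
class OpenFile:
    """Toy open-file state tracking sequential- vs random-access mode."""
    def __init__(self):
        self.pos = 0
        self.last_end = 0
        self.sequential = True     # files start in sequential-access mode

    def read(self, n):
        if self.pos == self.last_end:
            self.sequential = True # consecutive reads restore the bit
        self.pos += n
        self.last_end = self.pos

    def seek(self, pos):
        self.sequential = False    # any seek clears the bit
        self.pos = pos

    def should_read_ahead(self):
        return self.sequential

f = OpenFile()
f.read(1024); f.read(1024)
before = f.should_read_ahead()     # sequential so far: read ahead pays off
f.seek(0)
after = f.should_read_ahead()      # after a seek: no read ahead
```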

Reducing Disk-Arm Motion

  • Another technique is to reduce the amount of disk-arm motion by putting blocks that are likely to be accessed in sequence close to each other, preferably in the same cylinder. When an output file is written, the file system has to allocate the blocks one at a time, on demand.
    1. If the free blocks are recorded in a bitmap, and the whole bitmap is in main memory, it is easy to choose a free block as close as possible to the previous block.
    2. With a free list, part of which is on disk, it is harder to allocate blocks close together.
  • Even with a free list, some block clustering can be done. The trick is to keep track of disk storage in groups of consecutive blocks. If all sectors consist of 512 bytes, the system could use 1-KB blocks(2 sectors) but allocate disk storage in units of 2 blocks (4 sectors). This is not the same as having 2-KB disk blocks, since the cache would still use 1-KB blocks and disk transfers would still be 1 KB, but reading a file sequentially on an idle system would reduce the number of seeks by a factor of two, improving performance.
  • A variation on the same theme is to take account of rotational positioning. When allocating blocks, the system attempts to place consecutive blocks in a file in the same cylinder.

  • Another performance bottleneck in systems that use i-nodes is that reading even a short file requires two disk accesses: one for the i-node and one for the block.
    The usual i-node placement is shown in Fig. 4-29(a): all the i-nodes are near the start of the disk, so the average distance between an inode and its blocks will be half the number of cylinders, requiring long seeks.
  • One improvement shown in Fig. 4-29(b) is to divide the disk into cylinder groups, each with its own i-nodes, blocks, and free list. When creating a new file, any i-node can be chosen, but an attempt is made to find a block in the same cylinder group as the i-node. If none is available, then a block in a nearby cylinder group is used.

4.4.5 Defragmenting Disks

  • When the OS is initially installed, the programs and files it needs are installed consecutively starting at the beginning of the disk, each one directly following the previous one. All free disk space is in a single contiguous unit following the installed files. As time goes on, files are created and removed and typically the disk becomes fragmented, with files and holes all over the place. When a new file is created, the blocks used for it may be spread all over the disk, giving poor performance.
  • The performance can be restored by moving files around to make them contiguous and to put most of the free space in one or more large contiguous regions on the disk.
  • Defragmentation works better on file systems that have a lot of free space in a contiguous region at the end of the partition. This space allows the defragmentation program to select fragmented files near the start of the partition and copy all their blocks to the free space. Doing so frees up a contiguous block of space near the start of the partition into which the original or other files can be placed contiguously. The process can then be repeated with the next chunk of disk space, etc.
  • Some files cannot be moved, including the paging file, the hibernation file, and the journaling log, because the administration that would be required to do this is more trouble than it is worth. In some systems, these are fixed-size contiguous areas, so they do not have to be defragmented. The one time when their lack of mobility is a problem is when they happen to be near the end of the partition and the user wants to reduce the partition size. The only solution is to remove them altogether, resize the partition, and then recreate them afterward.
  • SSDs do not suffer from fragmentation at all. Defragmenting an SSD gives no gain in performance; moreover, SSDs wear out as they are written, so defragmenting them merely shortens their lifetimes.

4.5 EXAMPLE FILE SYSTEMS

4.6 RESEARCH ON FILE SYSTEMS

4.7 SUMMARY

