CSCI.4210 Operating Systems Fall, 2009 Class 13
More on File Systems

Log Structured File Systems

If a system has a large cache, most reads are satisfied from the cache rather than from the disk. This means that the vast majority of disk accesses are writes, and writes are often done in small chunks. To create a new file, a directory entry has to be written, an inode has to be written, and the actual data has to be written. (For example, if the user enters the shell command
ls -l > temp
the system has to create a new file.) These writes cannot simply be held in the buffer cache; they must eventually reach the disk.

One solution to this problem is a Log Structured File System (LFS), in which the entire disk is treated as a log. All pending writes are collected and written as a single segment at the end of the log. This is done as one continuous write, so it is very efficient: there is little or no seek time. At the start of each segment is a segment summary listing the contents of the segment. Finding the contents of a file from its inode is no harder than before, but finding the inodes themselves is much harder (because inodes are simply part of the log, and therefore scattered more or less randomly across the disk), so the system maintains an inode map.

The log is the only structure on the disk. This makes writes much more efficient than in a traditional file system. With a traditional file system write, the system has to read the inode to determine the sector number (although the inode is presumably cached, so this may not be costly), and then it has to perform a seek operation to move the read/write head over the appropriate cylinder. This is a very costly operation.

With a log structured file system, all of the writing is done sequentially. The data to be written is first kept in a buffer, and this buffer is flushed periodically (perhaps every 30 seconds). Since it is written sequentially, there is no seek time, so writes can be done much more quickly. The initial paper describing the Log Structured File System found that it could use 70% of the disk bandwidth for writing, compared to only 5% to 10% for a traditional file system, and overall performance was improved by an order of magnitude when dealing with mostly small files (the advantage disappears with huge files, but it is no slower).

For example, to write a new single-block file, a traditional file system requires four non-sequential write operations: the inode for the directory, the directory contents, the inode for the new file, and the actual contents of the file. Each of these requires a separate disk seek. In contrast, the LFS still performs the same four writes, but they are done in one long sequential write of all of the information.
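Here is a rough sketch of this buffering idea in C. It is not actual LFS code; the block size, segment size, and function names are invented for illustration:

/* Sketch of log-structured writing: collect dirty blocks (data, inode,
 * or directory blocks alike) in an in-memory segment buffer and flush
 * them to the end of the log in one sequential write. */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define SEG_BLOCKS 256                      /* a 1 MB segment */

static char  segment[SEG_BLOCKS * BLOCK_SIZE];
static int   blocks_in_segment = 0;
static off_t log_end = 0;                   /* next free offset in the log */

void lfs_write_block(int log_fd, const void *block)
{
    memcpy(segment + blocks_in_segment * BLOCK_SIZE, block, BLOCK_SIZE);
    if (++blocks_in_segment == SEG_BLOCKS) {
        /* One long sequential write at the end of the log: no seeks
         * between the directory block, the inode, and the data. */
        pwrite(log_fd, segment, sizeof(segment), log_end);
        log_end += sizeof(segment);
        blocks_in_segment = 0;
    }
}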

If the disk were infinitely large, this would be the end of the story, but over time many blocks in old segments become defunct (their contents have been superseded or deleted), so the LFS has a cleaner program that runs as a kernel thread and cleans things up. All it has to do is ignore defunct blocks and put the still-live blocks back into the cache for rewriting. This ensures that there are always large chunks of free space available on the disk for writing the log.

A number of Unix systems, NetBSD in particular, use such a system.

Journaling File Systems

It is important to maintain file system consistency. It would not do to have blocks that are neither assigned to a file nor on the free list, or inode pointers that point to the wrong blocks. This could happen if the system crashed in the middle of a file system operation.

Here are the steps involved in deleting a file:

  1. Remove the file from the directory
  2. Release the inode to the pool of free inodes
  3. Return all the blocks to the pool of free blocks
The order in which these are executed does not matter as long as the system does not crash, but if the system crashed between steps 1 and 2, the inode and disk blocks would remain allocated, yet there would be no way for a user to access them.

A Journaling File System, such as ReiserFS or ext3, both of which are used on various flavors of Linux, keeps track of such pending operations. All of the steps for an operation are written to a journal before any of them are executed, and the journal is forced to disk. Whenever the system is rebooted, it checks the journal and executes any steps recorded there that did not complete.
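Here is a rough sketch of the write-ahead idea applied to the file-deletion example above. The journal record format and the do_* helpers are made up for illustration (the helpers are empty stand-ins for real file system code); real journals such as ext3's are far more elaborate:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Stand-ins for the real file system operations. */
static void do_remove_directory_entry(const char *path) { (void)path; }
static void do_free_inode(int inum)                      { (void)inum; }
static void do_free_blocks(int inum)                     { (void)inum; }

void delete_file(int journal_fd, const char *path, int inum)
{
    char record[256];

    /* 1. Describe all three steps in the journal before doing any of them. */
    snprintf(record, sizeof(record),
             "BEGIN delete %s inode %d: remove dirent, free inode, free blocks\n",
             path, inum);
    write(journal_fd, record, strlen(record));
    fsync(journal_fd);                 /* the journal must reach the disk first */

    /* 2. Now perform the steps; a crash here is repaired by replaying
     *    the journal at reboot. */
    do_remove_directory_entry(path);
    do_free_inode(inum);
    do_free_blocks(inum);

    /* 3. Mark the transaction complete so replay can skip it. */
    snprintf(record, sizeof(record), "END delete %s\n", path);
    write(journal_fd, record, strlen(record));
    fsync(journal_fd);
}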

Unfortunately, Hans Reiser was convicted of murdering his wife last year and is in jail, so this may slow down further development of such systems.

Virtual File Systems

It is common now for one operating system to support multiple file systems. Microsoft systems do this by designating different volumes with different letters: A:, B:, C:. These do not all need to be running the same type of file system. When you load a CD, it uses a different file system than your hard drive, and if you are running Samba (which maps a Unix file system to your PC), that is yet another type of file system.

A Unix system can also run several different file systems, but Unix unites them all in a single directory tree with a single root. Many Unix systems use a networked file system in which files live on a separate machine, although they appear to the user as part of one local system.

The solution to this problem is a Virtual File System (VFS). With a VFS, all of the system calls (read, open, mkdir, etc.) are identical across all file systems. These calls are sent to the VFS layer in the kernel, which has a separate interface to each underlying file system.

Each file system has a VFS interface which consists of several dozen function calls. Generally each file system has a vector of all of these functions so the VFS layer simply has to index into this vector.
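A rough sketch of such a function vector in C might look like the following; the struct and its members are illustrative, not the actual kernel definitions:

#include <stddef.h>
#include <sys/types.h>

struct vfs_ops {
    int     (*open)(const char *path, int flags);
    ssize_t (*read)(int fd, void *buf, size_t count);
    ssize_t (*write)(int fd, const void *buf, size_t count);
    int     (*mkdir)(const char *path, mode_t mode);
    /* ... several dozen entries in a real system ... */
};

/* One table per file system type (names are illustrative). */
extern struct vfs_ops ext3_ops, nfs_ops, iso9660_ops;

/* The VFS layer simply dispatches through the table belonging to the
 * file system that the descriptor refers to. */
ssize_t vfs_read(struct vfs_ops *fs, int fd, void *buf, size_t count)
{
    return fs->read(fd, buf, count);
}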

In the Unix world, the process of establishing an interface to a new file system is called a mount.

File System Backup

There are two ways to back up an entire file system: logical and physical. A logical backup starts at the root and copies each file, directory, etc. to the backup medium (often still tape).

A physical backup simply copies the contents of an entire disk to the tape.

The advantage of the logical backup is that only sectors containing valid data are copied, so the backup is smaller. The disadvantage is that it involves a lot of jumping around on the disk. The advantage of the physical backup is that it can be done very efficiently; there is essentially no jumping around. However, empty or invalid sectors are copied as well.

There is also the issue of what to back up. Once a complete backup has been made, it is usually only necessary to copy those files that have changed since the last backup (an incremental dump). This has the disadvantage that restoring a particular file is more difficult.
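A sketch of the test an incremental dump might apply to each file as it walks the tree is shown below; the copy_to_backup helper is a hypothetical stand-in for the real copy to tape:

#include <stdio.h>
#include <time.h>
#include <sys/stat.h>

static void copy_to_backup(const char *path)
{
    printf("backing up %s\n", path);   /* stand-in for the real copy */
}

void maybe_backup(const char *path, time_t last_backup_time)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);
        return;
    }
    /* Only files changed since the last dump go onto the tape. */
    if (st.st_mtime > last_backup_time)
        copy_to_backup(path);
}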

If the file system is active, backing up is more difficult, and can result in inconsistencies.

Logical dumps are more common.

Maintaining file system consistency

Operating systems generally have a utility to detect file system inconsistencies. On Unix, this is called fsck, and on Windows, it is called scandisk.

These utilities build two tables, each with a counter for every block. The first table keeps track of how many times each block appears in a file; the second keeps track of how many times each block appears on the free list.

The program then goes through the list of all the inodes and, for each block of each file, increments that block's counter in the first table. It does the same for the free list in the second table.

Once these two tables have been built, the utility checks for inconsistencies. In a completely consistent file system, each block will either appear in a file exactly once or appear on the free list exactly once.

Here is a list of the possible inconsistencies:

  1. A block appears in no file and is not on the free list (a missing block); the fix is simply to add it to the free list.
  2. A block appears on the free list more than once; the fix is to rebuild the free list with the block listed only once.
  3. A block appears in more than one file; this is the most serious case, since freeing either file would leave the other pointing at a free block.
  4. A block appears both in a file and on the free list; the fix is to remove it from the free list.
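A sketch of the check in C, assuming the two tables have already been filled in by walking the inodes and the free list as described above (the volume size is illustrative):

#include <stdio.h>

#define NBLOCKS 100000              /* illustrative volume size */

int in_files[NBLOCKS];              /* how many files use each block */
int on_free_list[NBLOCKS];          /* how many times each block is free */

void check_consistency(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (in_files[b] == 0 && on_free_list[b] == 0)
            printf("block %d missing: add it to the free list\n", b);
        else if (on_free_list[b] > 1)
            printf("block %d duplicated on the free list\n", b);
        else if (in_files[b] > 1)
            printf("block %d used by more than one file\n", b);
        else if (in_files[b] == 1 && on_free_list[b] >= 1)
            printf("block %d both in use and on the free list\n", b);
    }
}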

It is also necessary to traverse each directory to make sure that every file has an entry in at least one directory and that each entry in a directory refers to a valid file. This pass also checks that the number of links to each file, as recorded in its inode, is correct.

Example File Systems

CD-ROM File System (ISO-9660)
These are very simple because files are not allowed to grow. This means that space can be allocated contiguously. A CD does not have concentric cylinders; the sectors are laid out in one continuous spiral, so writing is very easy. Each sector (logical block) contains 2,352 bytes, but each includes some metadata (preamble, error correction, etc.), so the actual data is 2,048 bytes per block.

There is also space at the front of the CD for information such as the creation date and volume identifier. This is followed by the root directory.

Unix
I described the inode structure earlier. Many implementations of Unix use NFS, the Network File System, or a similar system in which the files are distributed over several different computers (file servers). There will be a detailed discussion of this later in the course.

Windows File system implementation

FAT

Modern Microsoft operating systems support several different file systems. The simplest of these is the FAT file system. Recall that FAT stands for File Allocation Table, and the idea behind it is that the system maintains a table recording the disk addresses of the blocks of each file.

FAT comes in several flavors, including FAT12, FAT16, and FAT32. A pure file allocation table would have an entry in the table for each sector (512 bytes), but for performance reasons the disk is broken up into larger units called clusters or allocation units, each containing a number of contiguous sectors, usually in the range of 4 to 64 (2,048 to 32,768 bytes). Each file is allocated a whole number of clusters; one cluster cannot contain data from more than one file.

A portion of each disk is allocated for the FAT, but this is usually kept in memory for faster access.

FAT12 was a component of early DOS systems. The FAT12 file system used a 12 bit field to hold the cluster number. This meant that a volume using FAT12 could only have 4,086 clusters (2^12 = 4,096, minus a few values reserved for housekeeping). It is hardly ever used today.

FAT16 is still used on many older systems; you might have been able to guess that it uses a 16 bit field for cluster numbers. A FAT16 file system could have 65,526 clusters.

FAT32 is used on the more modern Windows operating systems. It uses a 28 bit field to store cluster numbers (that's not a typo; apparently they thought that the name FAT32 was more aesthetically pleasing than FAT28). Since the maximum cluster size that it can support is 32KB, the FAT32 file system can support disks of up to 2^41 bytes (2 TB), which is more than enough.

There is one FAT for the entire file system. A directory entry contains the cluster number of the first cluster of the file. The FAT entry for that cluster contains the cluster number of the second cluster of the file, the FAT entry for the second cluster contains the number of the third cluster, and so on. This allows the operating system to obtain the cluster numbers of all of the clusters of a file without any time-consuming disk reads (assuming that the FAT is in memory).
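A sketch of following the chain for one file, assuming the FAT has been loaded into memory; the end-of-chain marker shown is illustrative of a FAT32-style table, not an exact on-disk value:

#include <stdint.h>
#include <stdio.h>

#define EOC 0x0FFFFFFF              /* illustrative end-of-chain marker */

/* fat[i] holds the number of the cluster that follows cluster i,
 * or EOC if cluster i is the last cluster of its file. */
void print_cluster_chain(const uint32_t *fat, uint32_t first_cluster)
{
    for (uint32_t c = first_cluster; c != EOC; c = fat[c])
        printf("cluster %u\n", c);
}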

Note that deleting a file on a FAT file system (or on most other file systems) does not mean that the contents are erased. It simply means that the file's entries in the FAT and the directory are released. In fact, the file system often does not even erase these entries; it simply marks them as deleted. This makes it possible to recover a deleted file (third-party software often provides such a utility). Of course, over time the clusters get overwritten, and then recovery is no longer possible.

The clusters of a file can be distributed anywhere on the disk. It is more efficient if the clusters of a file are stored contiguously. The defrag utility goes over the entire disk and stores logically contiguous clusters of each file on physically contiguous clusters. This permits much faster file access.

NTFS

The New Technology File System (NTFS) was introduced as a part of the Windows NT operating system. It incorporated a number of improvements over the various FAT systems, particularly FAT16, which was the standard at the time that NTFS was first developed. These improvements included:

Files are not just a string of bytes; they have a number of attribute fields such as the name, and one file can have multiple data streams. For example, the file foo could have additional streams foo:stream1, foo:stream2, etc.
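For example, the following program, run on an NTFS volume under Windows, creates a file with an alternate data stream. The stream name is just an example, and this does not work on FAT or Unix file systems:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("foo", "w");              /* the default (unnamed) stream */
    if (f) { fputs("main data\n", f); fclose(f); }

    FILE *s = fopen("foo:stream1", "w");      /* an alternate data stream */
    if (s) { fputs("side data\n", s); fclose(s); }
    return 0;
}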

Each volume (partition) has a Master File Table (MFT). This holds metadata such as a bitmap of the blocks in use, the root directory, a list of bad blocks, and security descriptors for files, and it also has a record (1 KB) for each file. If a file is short enough, its data can be kept directly in this record. Big files may need more than one MFT record.

Storage allocation is based on runs. A run is a series of consecutive blocks, and one file can have many runs. Note that, unlike an inode, where there is a pointer to every block, the number of pointers may be much smaller: all that is needed is a pointer to the starting point of each run and its length.
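A sketch of run-based mapping; the structures are illustrative, not the actual NTFS on-disk format:

#include <stdint.h>
#include <stddef.h>

struct run {
    uint64_t start_block;   /* first disk block of this run */
    uint64_t length;        /* number of consecutive blocks */
};

/* Map a logical block number within the file to a disk block number.
 * Returns 0 (illustratively, "not mapped") if the file is not that long. */
uint64_t logical_to_disk(const struct run *runs, size_t nruns, uint64_t lbn)
{
    for (size_t i = 0; i < nruns; i++) {
        if (lbn < runs[i].length)
            return runs[i].start_block + lbn;
        lbn -= runs[i].length;
    }
    return 0;   /* beyond the end of the file */
}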

NTFS also supports file compression, journaling, and file encryption.



Input/Output Systems

When people want a powerful computer, they look at the clock speed or the design of the CPU, but what determines the performance of most systems is the speed and design of the various I/O devices. Most processes read from a hard disk, and often from other devices as well, and these operations tend to take many orders of magnitude longer than simple processing of data already in main memory.

There is an enormous range of devices that can be attached to computers these days: keyboards, mice, displays, disks, CD and DVD drives, network interfaces, printers, scanners, and many others.

Mechanisms for performing I/O

The operating system provides an interface between applications and the I/O devices. Most operating systems make the I/O operations as general as possible. For example, Unix systems have only two I/O system calls, read and write, regardless of whether the device is a hard drive, a floppy drive, a keyboard or a network interface. This makes programming I/O very simple for an applications programmer. The operating system has the job of determining the nature of the device that the program is accessing, handling the low level operations, and scheduling the operations.
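For example, the same read call works whether the descriptor refers to an ordinary disk file or to a device. The regular file name here is hypothetical; /dev/zero exists on most Unix systems:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[16];
    int fd1 = open("notes.txt", O_RDONLY);     /* ordinary disk file */
    int fd2 = open("/dev/zero", O_RDONLY);     /* character device */

    if (fd1 >= 0) { printf("file:   %zd bytes\n", read(fd1, buf, sizeof buf)); close(fd1); }
    if (fd2 >= 0) { printf("device: %zd bytes\n", read(fd2, buf, sizeof buf)); close(fd2); }
    return 0;
}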

There are three ways that a device can get data into or out of memory: programmed I/O, interrupt-driven I/O, and direct memory access (DMA).

Programmed I/O is rarely used today, and interrupt driven I/O is being replaced by DMA as the preferred method for I/O. This is part of a general trend to do more I/O off line, that is, with a separate I/O processor rather than involving the CPU in I/O.

DMA is far more complicated than the other two methods. The DMA controller copies data directly to or from main memory without involving the processor. It uses the main system bus to do this. Meanwhile the CPU is doing other things, and will be accessing the bus. The bus controller hardware needs a mechanism for multiplexing these two independent modules and dealing with possible contention when both the CPU and the DMA controller want to use the bus at the same time. However, with modern cache memory, the CPU does not need to access main memory on each instruction as it used to, and so it is possible for the DMA controller to access the bus when the CPU doesn't need it. This is known as cycle stealing.

Another problem that DMA has to address is that the kernel has to make sure that the process that requested the data is not swapped out. Since the DMA controller copies the data directly into the memory space of the process, that memory, or more technically, that page or pages, has to be pinned in memory. Recall that if the system is busy, the kernel may choose to copy some processes entirely to disk for a while to improve performance. If the process that requested the read is swapped out, a deadlock situation could occur, in which the process is swapped out while waiting for data to be transferred but the DMA is blocked waiting for the process memory to be reloaded.

One solution, which many operating systems use, is to pin the pages of memory where DMA will be performing I/O, so that there is no danger that they will be replaced even while the process is sleeping.
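As a user-space illustration of the same idea, the POSIX mlock call pins a buffer's pages in physical memory so they cannot be paged out. This is not the kernel's internal DMA path, just the pinning concept exposed to applications:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;                /* a 1 MB I/O buffer */
    char *buf = malloc(len);

    if (buf && mlock(buf, len) == 0) {   /* pin the pages in RAM */
        /* ... perform I/O into buf ... */
        munlock(buf, len);               /* unpin when the I/O is done */
    } else {
        perror("mlock");
    }
    free(buf);
    return 0;
}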