CSCI.4210 Operating Systems Fall, 2009 Class 12
File Systems

File System Implementation Issues

The contents of files are stored in physical sectors on a disk, but when users, even systems programmers, want to read from or write to a file, they don't care about cylinders, tracks, and sectors; they simply refer to files.

A file is a collection of data. Everyone in this course knows what a file is, so there is no reason to describe it further. Files are stored in directories (folders in Windows). Every operating system provides a file system, which is a logical structure for data, and there has to be a mapping from the logical file structure to the physical addresses (cylinders and sectors) of the disk. Typically, a user is not allowed to access a file directly by its physical address (and most users don't want to); the contents of a file must be accessed through the file system.

Unix and most other modern operating systems view a file as simply a string of bytes with no inherent structure. Of course, many (most) files have an internal structure; a file can be a database, an executable image, a spreadsheet, a bitmapped image, or any of numerous other file types, but to the operating system a file is just a string of bytes; the structure is imposed by an application program.

Windows systems encourage users to append a three-character suffix to each file name which tells the system what type of file it is (for example, .txt for a text file, .doc for a Word document, .exe for an executable file, etc.). The system maintains a mapping of these suffixes to applications in the registry.

Unix systems allow this as well, but the operating system does not usually treat files with different suffixes differently. Applications such as compilers often use this information.

Older operating systems had a concept of a record within a file, and this was known to the operating system. On some systems, read operations had to read an entire record at a time, and often the operating system itself could randomly access records (Note that random access for a file simply means that records are not necessarily read sequentially; reads can skip around in the file). For example, older IBM systems had a concept of an indexed sequential access method (ISAM) file, in which the user program could look for a particular record within a file by passing an appropriate key to the read statement.

Storing files on disks

The problem of storing files on disks presents some of the same problems that we encountered with memory management: the operating system has to decide how to allocate blocks to files, keep track of which blocks belong to which file, and keep track of the free space on the disk.

The simplest solution is to store the contents of a file in contiguous blocks. This makes sequential reads and writes very fast. However, it is difficult to implement in practice. It might be possible to implement such a system efficiently if the file size was known at creation time, and if files did not grow after they were created, but this is not the case. When a new file is created, the application program does not generally specify how large the file will be, and many files grow or shrink over time.

Another problem with storing files as contiguous blocks on a disk is fragmentation. At first, each new file can just be appended onto the end of the previous file, but over time, files get deleted, and there is no more room at the end. The result is a disk layout with many holes of varying sizes.

This is not to say that storing files in contiguous blocks is impossible. One solution is to allocate a fixed number of blocks whenever a new file is created, and if the file expands beyond that size, allocate a larger number of blocks (twice as many for example), copy the contents of the file to the new blocks, and treat the old blocks as free space.
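
Here is a rough, self-contained C sketch of this doubling strategy on a toy 32-block disk, the same layout used in the example below. The first-fit search and the file letters are invented for the illustration; a real file system would work with on-disk data structures rather than an in-memory array.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 32                /* toy disk with 32 blocks, as in the example below */
static char disk[NBLOCKS];        /* 0 = free; otherwise the letter of the file using the block */

/* First-fit search for n contiguous free blocks; returns the starting block or -1. */
static int find_hole(int n)
{
    for (int start = 0; start + n <= NBLOCKS; start++) {
        int ok = 1;
        for (int i = 0; i < n; i++)
            if (disk[start + i] != 0) { ok = 0; break; }
        if (ok)
            return start;
    }
    return -1;
}

/* Grow a file from old_n contiguous blocks at old_start to twice as many blocks,
 * copying its data to the new location and turning the old blocks into a hole. */
static int grow(char name, int old_start, int old_n)
{
    int new_start = find_hole(old_n * 2);
    if (new_start < 0)
        return -1;                            /* no hole big enough: time to run compaction */
    memset(&disk[old_start], 0, old_n);       /* old extent becomes free space */
    memset(&disk[new_start], name, old_n * 2);
    return new_start;
}

int main(void)
{
    for (int f = 0; f < 4; f++)               /* create A, B, C, D with four blocks each */
        memset(&disk[f * 4], "ABCD"[f], 4);
    grow('C', 8, 4);                          /* C outgrows its space */
    for (int i = 0; i < NBLOCKS; i++)
        putchar(disk[i] ? disk[i] : '.');     /* prints AAAABBBB....DDDDCCCCCCCC........ */
    putchar('\n');
    return 0;
}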

The problem of fragmentation can be solved by periodically running a compaction program which compresses all of the files into contiguous space, replacing many small holes with one large hole.

Here is a small example. Suppose that a disk has 32 blocks, and when files are created, they are given four blocks. (Real systems would have millions of blocks, but that's hard to model.) Files A, B, C, and D are created. Allocation looks like this.

Now file C outgrows its four blocks, so it is assigned a new eight block chunk, creating a four block hole.

File A also outgrows its space so it is also given a new eight block chunk.

The compaction algorithm is then run, which eliminates the two small holes and makes one large hole at the end.

Moving files around on the disk, which is done whenever a file outgrows its assigned blocks or whenever the compaction algorithm is run, is very time consuming. As a result, no real operating system uses such an algorithm.

Linked List Allocation

A solution to the disk allocation problem which should occur to any Computer Science major is a linked list allocation, in which the first (or last) word of each block is a pointer to the next block of the file, and the rest of the block is the data. This has the advantage that there is essentially no loss due to fragmentation except for internal fragmentation within the last block of a file. (If a block is 512 bytes, and the very last block of the file only needs 200 bytes, the rest of the block is wasted). This also solves the problem of files expanding beyond their allocated space. If more data is added to a file, it just allocates new blocks at the end of the linked list, with none of the movement of data on the disk that presented such a problem with the prior method.
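
A toy C sketch of the idea follows. The block numbers, block size, and in-memory "disk" are all made up for the illustration; the point is that the first word of each block names the next block of the file.

#include <stdio.h>

#define BLOCK_SIZE  512
#define NBLOCKS     64
#define END_OF_FILE -1

/* Toy disk: the first word of each block is the number of the next block
 * in the file; the remaining bytes hold data. */
struct block {
    int next;
    char data[BLOCK_SIZE - sizeof(int)];
};
static struct block disk[NBLOCKS];

/* To find the nth block of a file we must follow n pointers, reading every
 * earlier block along the way -- each step would be a separate disk read. */
int nth_block(int first_block, int n)
{
    int b = first_block;
    while (n-- > 0 && b != END_OF_FILE)
        b = disk[b].next;
    return b;
}

int main(void)
{
    /* Lay out a small file whose blocks are scattered: 7 -> 23 -> 4 -> end. */
    disk[7].next = 23;
    disk[23].next = 4;
    disk[4].next = END_OF_FILE;
    printf("block 2 of the file is disk block %d\n", nth_block(7, 2));   /* prints 4 */
    return 0;
}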

However, this has three major disadvantages. First, random access is very slow: to read block n of a file, the system has to follow the chain of pointers, reading every block that comes before it. Second, because the pointer takes up the first few bytes of each block, the amount of data in a block is no longer a power of two, which makes reading a full block's worth of data awkward. Third, the pointers are scattered all over the disk, so a single lost or damaged pointer makes the rest of the file unreachable.

A hybrid approach is to collect blocks into groups, called clusters, and allocate clusters rather than individual blocks. This mitigates some of these problems.

There is a variant of the linked list solution which solves all of these problems: keep all of the pointers for a file in a single table in memory rather than in the blocks themselves. Random access can then be very fast; if the program wants to read the bytes starting at offset 50,000, the system calculates which block contains that offset and follows the chain through the in-memory table to find the corresponding entry, with no extra disk reads. Also, because the pointers no longer take up space inside the blocks, the entire block holds data, and the block size and the sector size can be the same.

Such a system is called a File Allocation Table (FAT) and is used in many real operating systems, including Windows. It can be diagrammed like this.
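
A sketch of the same lookup using an in-memory table (again, the block numbers and sizes are invented for the illustration):

#include <stdio.h>

#define FAT_EOF -1
#define NBLOCKS 64

/* The file allocation table lives in memory: fat[b] holds the number of the
 * block that follows block b in its file, or FAT_EOF at the end of the file. */
static int fat[NBLOCKS];

/* Find the disk block holding byte 'offset' of a file whose first block is
 * 'first'.  Following the chain touches only the in-memory table, so no disk
 * reads are needed until the final block is actually fetched. */
int block_for_offset(int first, long offset, int block_size)
{
    int b = first;
    for (long skip = offset / block_size; skip > 0 && b != FAT_EOF; skip--)
        b = fat[b];
    return b;
}

int main(void)
{
    /* A file occupying blocks 7 -> 23 -> 4, with 512-byte blocks. */
    fat[7] = 23;
    fat[23] = 4;
    fat[4] = FAT_EOF;
    printf("byte 1200 is in disk block %d\n", block_for_offset(7, 1200, 512));   /* prints 4 */
    return 0;
}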

I-nodes and V-nodes

In Unix, the information about a file is stored in a structure called an I-node (index node). Each file has one I-node. When a file is opened, the I-node data is copied into memory. The I-node contains information such as the file's owner and group, its protection mode, its size, the number of hard links to it, its access, modification, and status-change times, and the addresses of its data blocks.

Unix I-nodes keep track of the blocks of a file using a method called multilevel indexing. Most files on most Unix systems are fairly small; studies of many Unix systems show that the median file size is about 1K. However, modern operating systems need to be able to support files of essentially unlimited size. The I-node has direct pointers to the first twelve blocks of the file. Thus, if the block size is 4,096 bytes, which is typical for modern systems, the file system can reach files of up to 48K bytes (12 times 4,096 = 49,152) directly. This is adequate for almost all files on a typical system.

The I-node also has three indirect block pointers, labeled single indirect, double indirect, and triple indirect. The single indirect pointer points to a block on the disk which contains pointers to the next 1024 blocks. If a file has more than 48K bytes, and the file system needs to access bytes beyond the first 48K, it reads the single indirect block from the disk, looks up the appropriate block address in it, and then accesses that block on the disk. Since the single indirect block contains 1024 block pointers, and each block is 4,096 bytes, files of up to 4,243,456 bytes (1024 times 4,096, plus 49,152) can be handled through the single indirect block.

However, file systems need to be able to handle even larger files, and this is done through the double indirect block. The double indirect block pointer in the I-node points to a block on the disk; this block in turn contains 1024 pointers to other blocks on the disk, and each of those blocks in turn contains 1024 pointers to blocks of data.

Thus the double indirect pointer can reach 1024 x 1024, or 1,048,576, additional blocks, each of which contains 4,096 bytes, for a total of about 4.3 billion more bytes.

But even that may not be enough. Files larger than this can use the triple indirect block pointer, which points to a block containing 1024 pointers; each of these points to a block containing another 1024 pointers, and each of those points to a block containing 1024 pointers to data blocks. The arithmetic to calculate the largest file size possible using the triple indirect block would make my head hurt, but it is far larger than the largest file in the known universe.

This is called multilevel indexing for obvious reasons.
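
The arithmetic is easy to check with a few lines of C. This is a sketch, not from the notes; it assumes 4-byte block pointers, so 1024 of them fit in a 4,096-byte block.

#include <stdio.h>

#define BLOCK_SIZE   4096
#define NDIRECT      12
#define PTRS_PER_BLK 1024          /* a 4096-byte block holding 4-byte block pointers */

/* How many index blocks have to be read to locate logical block n of a file?
 * 0 means a direct pointer in the I-node is enough. */
int index_levels(long long n)
{
    if (n < NDIRECT) return 0;                                   /* direct pointer */
    n -= NDIRECT;
    if (n < PTRS_PER_BLK) return 1;                              /* single indirect */
    n -= PTRS_PER_BLK;
    if (n < (long long)PTRS_PER_BLK * PTRS_PER_BLK) return 2;    /* double indirect */
    return 3;                                                    /* triple indirect */
}

int main(void)
{
    long long single = NDIRECT + PTRS_PER_BLK;
    long long dbl    = single + (long long)PTRS_PER_BLK * PTRS_PER_BLK;
    long long triple = dbl + (long long)PTRS_PER_BLK * PTRS_PER_BLK * PTRS_PER_BLK;

    printf("direct only:          %lld bytes\n", (long long)NDIRECT * BLOCK_SIZE); /* 49,152 */
    printf("plus single indirect: %lld bytes\n", single * BLOCK_SIZE);             /* 4,243,456 */
    printf("plus double indirect: %lld bytes\n", dbl * BLOCK_SIZE);                /* ~4.3 billion */
    printf("plus triple indirect: %lld bytes\n", triple * BLOCK_SIZE);             /* ~4.4 trillion */

    printf("logical block 200 needs %d index block reads\n", index_levels(200));   /* prints 1 */
    return 0;
}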

In the first assignment, we used the Unix stat() system call.

#include <sys/types.h>
#include <sys/stat.h>

int stat(const char *filename, struct stat *buf);
int lstat(const char *filename, struct stat *buf);
int fstat(int filedesc, struct stat *buf);

struct stat
{
    dev_t       st_dev;     /* ID of device containing file */
    ino_t       st_ino;     /* inode number */
    mode_t      st_mode;    /* protection */
    nlink_t     st_nlink;   /* number of hard links */
    uid_t       st_uid;     /* user ID of owner */
    gid_t       st_gid;     /* group ID of owner */
    dev_t       st_rdev;    /* device ID (if special file) */
    off_t       st_size;    /* total size, in bytes */
    blksize_t   st_blksize; /* blocksize for filesystem I/O */
    blkcnt_t    st_blocks;  /* number of blocks allocated */
    time_t      st_atime;   /* time of last access */
    time_t      st_mtime;   /* time of last modification */
    time_t      st_ctime;   /* time of last status change */
};

This returns most of the information in the inode. lstat() is the same as stat() except that if the named file is a symbolic link, it returns information about the link itself rather than the file it points to (more on symbolic links below), and fstat() is the same as stat() except that its first argument is an open file descriptor instead of a pathname.
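
For example, here is a short program (not part of the assignment) that prints a few of these fields for each file named on the command line:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat buf;
    for (int i = 1; i < argc; i++) {
        if (stat(argv[i], &buf) < 0) {      /* fill in buf from the file's inode */
            perror(argv[i]);
            continue;
        }
        printf("%s: inode %ld, %ld bytes, %ld hard links, %ld blocks\n",
               argv[i], (long)buf.st_ino, (long)buf.st_size,
               (long)buf.st_nlink, (long)buf.st_blocks);
    }
    return 0;
}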

Many Unix file systems are now distributed, which means that files may be stored on other computers on the network. This is generally transparent to the user. There are a number of different distributed file system protocols. Modern Unix systems refer to a Vnode (V for virtual) in place of an Inode; this is a mechanism which allows several different file system protocols to be supported, transparently to the user. Distributed file systems will be discussed after we talk about networking in a few weeks. You can think of a Vnode as a more general implementation of an Inode.

Implementing Directories

A directory is just a file containing an array of pairs: a file name and a pointer to the file's inode (its inode number).
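
In the classic Unix layout (old-style fixed-size entries; modern file systems use more elaborate, variable-length entries), a directory is little more than an array of structures like the ones in this sketch:

#include <stdint.h>
#include <string.h>

#define NAME_MAX_LEN 14      /* the historical limit; modern systems allow longer names */

/* One directory entry: a file name paired with the number of its inode.
 * A directory is simply a file holding an array of these. */
struct dir_entry {
    uint16_t inode_number;           /* 0 marks an unused slot */
    char     name[NAME_MAX_LEN];     /* padded, not necessarily NUL-terminated */
};

/* Look a name up in a directory: return its inode number, or 0 if it is absent. */
uint16_t dir_lookup(const struct dir_entry *dir, int nentries, const char *name)
{
    for (int i = 0; i < nentries; i++)
        if (dir[i].inode_number != 0 &&
            strncmp(dir[i].name, name, NAME_MAX_LEN) == 0)
            return dir[i].inode_number;
    return 0;
}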

Unix supports shared files, i.e. one file listed in two or more directories. Also, it is possible for a file to have two (or more) names in the same directory, as shown in the above diagram.

The shell command
link existingfile newfile
will do this.

There is also a system call
#include <unistd.h>

int link(const char *existingfile, const char *newfile);

so that you can do the same thing from inside a program.

You can create a symbolic link with the system call
int symlink(const char *name1, const char *name2);
One important difference between hard links and symbolic links is that hard links must stay within a single file system (and therefore on a single server), while symbolic links can cross file systems and file servers (more on this shortly).
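
A small illustration of both calls (the file names are invented and error checking is omitted):

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
    struct stat buf;

    /* Assume data.txt already exists. */
    link("data.txt", "data-hard.txt");       /* a second name for the same inode */
    symlink("data.txt", "data-sym.txt");     /* a new small file holding the pathname */

    stat("data.txt", &buf);
    printf("hard links to data.txt: %ld\n", (long)buf.st_nlink);   /* one more than before */

    lstat("data-sym.txt", &buf);             /* information about the link itself */
    printf("the symlink is %ld bytes long\n", (long)buf.st_size);  /* length of "data.txt" */
    return 0;
}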

Windows supports the same concept. The Windows term for a symbolic link is a shortcut. These have a .lnk suffix.

The fact that one file (inode) can have multiple pointers to it from different directories complicates the problem of deleting files. Note that one of the fields of the Inode is the number of links. Whenever a new hard link is created, this count is incremented, and whenever a link is deleted, it is decremented. If the count reaches zero, the system can free the Inode and the file's blocks, allowing them to be overwritten.

The name of a file is an entry in a directory, and a file can have more than one name. The Inode does not have a field for the file name.

Also note that a dump (i.e. a backup) may copy the file two or more times, once for each link to it; and a search (the Unix search command is grep (don't ask)) may search the same file two or more times.

The connection between the name of a file in a directory and the pointer to its Inode is called a hard link. Unix also allows the user to create symbolic links (these are called shortcuts in Windows). A symbolic link differs in a subtle way from a hard link: a symbolic link is itself a small file whose contents are the pathname of the file it refers to.

When I do an ls -l / on mary-kate, asking for a long listing of the root directory, one of the lines looks like this.
sys -> usr/src/sys
This means that the entry /sys is really a symbolic link to the directory /usr/src/sys.

Here are the steps involved in creating a new file in Unix: the system allocates a free I-node and fills in its fields, and then adds an entry pairing the new file's name with its I-node number to the directory that will contain it.

Note that to traverse a pathname such as /usr/include/sys/types.h, the system first has to read the inode for the root directory and then read the root directory's data. This provides a pointer to the inode for the directory usr, so that inode has to be loaded into memory and read so that the contents of the usr directory itself can be read. This provides a pointer to the inode for include, and so on. Thus a total of ten reads (an inode read and a data read for each of the five components of the path) have to take place in order to find the location of the file and read it into memory.
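
A quick sanity check of that count, using a deliberately simplified model in which every directory along the path, and the file itself, each cost one inode read plus one data read:

#include <stdio.h>
#include <string.h>

/* Count the disk reads needed to resolve a pathname under the simple model
 * above.  Path components are split out with strtok(). */
int reads_to_resolve(const char *pathname)
{
    char path[256];
    strncpy(path, pathname, sizeof(path) - 1);
    path[sizeof(path) - 1] = '\0';

    int reads = 2;                      /* the root directory: its inode plus its data */
    for (char *comp = strtok(path, "/"); comp != NULL; comp = strtok(NULL, "/"))
        reads += 2;                     /* this component's inode plus its data */
    return reads;
}

int main(void)
{
    printf("%d\n", reads_to_resolve("/usr/include/sys/types.h"));   /* prints 10 */
    return 0;
}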

Keeping track of free space on the disk

The file system has to keep track of the free blocks on the disk. There are two widely used approaches. Some systems maintain a linked list of free blocks, while others keep track of free blocks with a bitmap. With a linked list, when a block becomes free because a file is deleted, a pointer to that block is simply added to the list, and when a new block is needed because a new file is created or because an existing file grows, the system just removes any available block from the list.

If the system uses a bitmap, the bitmap is maintained in memory, with one bit corresponding to each block on the disk. When a block is freed, its corresponding bit is set to 0, and when a new block is needed, the system finds a bit that is 0 and sets it to 1.
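
A minimal sketch of the bitmap approach (the block count and the linear search are arbitrary choices for the illustration):

#include <stdio.h>

#define NBLOCKS 4096
static unsigned char bitmap[NBLOCKS / 8];   /* one bit per disk block; 0 = free, 1 = in use */

/* Mark block b allocated (used = 1) or free (used = 0). */
void set_block(int b, int used)
{
    if (used)
        bitmap[b / 8] |=  (1 << (b % 8));
    else
        bitmap[b / 8] &= ~(1 << (b % 8));
}

/* Find a free block, mark it allocated, and return its number (-1 if the disk is full). */
int allocate_block(void)
{
    for (int b = 0; b < NBLOCKS; b++)
        if ((bitmap[b / 8] & (1 << (b % 8))) == 0) {
            set_block(b, 1);
            return b;
        }
    return -1;
}

int main(void)
{
    set_block(0, 1);                      /* pretend block 0 already holds data */
    int b = allocate_block();
    printf("allocated block %d\n", b);    /* prints 1 */
    set_block(b, 0);                      /* file deleted: the block becomes free again */
    return 0;
}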

Improving File System Performance

Processing speed has improved much faster than disk access speed, which means that the bottleneck in many processes is disk access. Anything that might reduce the number of disk accesses can dramatically improve performance. There are a number of methods that file systems use to accomplish this.

Block Size

The block size is an important design decision for file systems. Obvious candidates are the sector size or the page size. A larger block size results in fewer seeks and rotational delays, but more internal fragmentation. The median file size in Unix is about 1K, so if the block size is 4K, the majority of files fit in a single block. However, if a file is, say, 500 bytes, there are about 3.5K of wasted disk space in its block.

Disk space is now inexpensive, and disks are very large, so the space wasted by internal fragmentation is less of a problem than it used to be.