CSCI.4210 Operating Systems Fall, 2009 Class 19
Networked File Systems, Multiple Processor Systems

Networked File Systems

The file systems of many modern computer systems are distributed; this means that the files themselves are on file servers, computers with huge disk farms attached. This allows users to sit down at any computer on the network and get access to their files, with the illusion that the files reside on their local machine.

It should be noted that the structure of the file system as it appears to the user is independent of whether it is remote or local; a typical user cannot easily tell whether the file system is distributed or local, and usually doesn't care (until the file server crashes). All of the file system calls are identical, but for a distributed file system their implementation in the kernel has to use Remote Procedure Calls (RPCs). A remote procedure call is a procedure call that is executed on a different system, across a network. On Unix systems, this is done through the vnode (the v stands for virtual). In the discussion of the Unix file system, I described the inode in some detail and mentioned the vnode, which is a more general form of the inode. On a system with a local file system, the vnode is just a pointer to the inode. On a system with a distributed file system, the vnode contains information about the location of the file as well as a pointer to the inode.
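To make this concrete, here is a minimal sketch in C of what a vnode might look like. The field and function names are illustrative, not the actual definitions from any particular kernel:

    /* A vnode is a file-system-independent handle for an open file. */
    struct vnode_ops;

    struct vnode {
        int   v_type;                    /* regular file, directory, ...   */
        int   v_count;                   /* reference count                */
        const struct vnode_ops *v_ops;   /* per-file-system operations     */
        void *v_data;                    /* local fs: pointer to the inode;
                                            NFS: remote server + file handle */
    };

    /* Each file system supplies its own implementations; the kernel
       dispatches through this table, so a read() behaves the same
       whether the file is local or remote. */
    struct vnode_ops {
        int (*vop_read)  (struct vnode *vn, void *buf, long n, long off);
        int (*vop_write) (struct vnode *vn, const void *buf, long n, long off);
        int (*vop_lookup)(struct vnode *dir, const char *name,
                          struct vnode **out);
    };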

There are two widely used distributed file system protocols, both of which are used on our campus. The Network File System (NFS), developed by Sun Microsystems, was the first widely used distributed file system and is used by the Computer Science Department system. AFS (the Andrew File System) is a more robust distributed file system and is used by RCS.

NFS

NFS was developed by Sun Microsystems, originally for diskless clients, but it is now a standard. It is a protocol, not a product, which means that anyone can implement it if they wish. Many computer manufacturers use code licensed from Sun, but people are free to write their own implementations of NFS.

A key component of NFS is the concept of a mount. The mount protocol allows the file server to hand out remote access privileges to clients. The file server runs a mount daemon, mountd. When a new client is booted, it calls the mount system call, which attaches a directory tree exported by the server to a mount point, a node in the client's local directory tree.
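Conceptually, the attach is a single system call on the client. Here is a minimal sketch for Linux; the server name server, export /D, and mount point /C are illustrative, and in practice the mount(8) command is used, since it fills in the NFS-specific options that the raw system call requires:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Attach the tree that "server" exports as /D onto the local
           directory /C.  From now on, path lookups that cross /C are
           handled by the NFS client code in the kernel. */
        if (mount("server:/D", "/C", "nfs", 0, "") == -1) {
            perror("mount");
            return 1;
        }
        return 0;
    }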

An example will clarify this. Suppose the client's local tree contains the directories A, B, and C:

    /
    |-- A
    |-- B
    `-- C

and the server exports a tree rooted at D, containing (say) the files w and x:

    D
    |-- w
    `-- x

When the client is booted, it makes a call to mount, requesting the server to mount file system D onto /C. After this is done, the file system on the client would look like this to a user:

    /
    |-- A
    |-- B
    `-- C
        |-- w
        `-- x

Note that some of the files are local and others are remote, but this is transparent to the user; it looks like a single file system. The user does not need to know where a file actually resides.

One potential source of confusion with NFS is that different clients can mount the same remote file system in different places. In our example, another client could choose to mount D onto /B instead. This means that the same files are available to users on different machines, but in different places in the directory tree.

NFS servers are dumb and NFS clients are smart. It is the clients that do the work required to convert the generalized file access that servers provide into a file access method that is useful to applications and users.

The server is stateless. (Actually, newer releases of NFS include some stateful information on the server.) A stateless protocol means that each call is independent of every other call: a server should not need to maintain any state about any of its clients in order to function correctly. Stateless servers have a distinct advantage over stateful servers in the event of a failure. With a stateless server, a client need only retry a request until the server responds; it does not even need to know that the server crashed or that the network temporarily went down. The client of a stateful server, on the other hand, needs to either detect a server failure and rebuild the server's state when it comes back up, or cause client operations to fail.
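The client-side consequence is a simple retry loop. A sketch in C, with the request text and the timeouts faked for illustration:

    #include <stdio.h>
    #include <unistd.h>

    /* Stand-in for sending one self-contained request and waiting for
       the reply; returns 0 on timeout.  Here the first two attempts
       are pretended failures. */
    static int send_and_wait(const char *req)
    {
        static int tries;
        printf("sending: %s\n", req);
        return ++tries >= 3;
    }

    int main(void)
    {
        /* Every request carries all the state the server needs, so the
           client just resends the same request until it gets an answer.
           It never needs to learn that the server crashed and rebooted
           in the meantime. */
        while (!send_and_wait("READ(handle=7, offset=0, count=512)"))
            sleep(1);    /* back off, then retry */
        printf("reply received\n");
        return 0;
    }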

When a user calls open, the kernel has to figure out whether the file is local or remote; if it is remote, the NFS client code on that machine has to contact the appropriate server.

Since NFS has to accommodate heterogeneous file systems (e.g., Windows), the client is the only party that interprets full path names. This may mean multiple NFS queries to resolve a single request: for example, the file /a/b/c/d might take four requests to resolve, one per path component. But this means that the server does not need to know anything about the client's naming system or directory structure.
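A sketch of that client-side loop in C. The handle type and the fake nfs_lookup stand in for the real LOOKUP remote procedure call, which maps a (directory handle, component name) pair to the handle of the named entry:

    #include <stdio.h>
    #include <string.h>

    typedef struct { unsigned id; } fhandle;    /* opaque to the client */

    /* Stand-in for one LOOKUP RPC.  The server sees only a directory
       handle and a single name, never the full path. */
    static fhandle nfs_lookup(fhandle dir, const char *name)
    {
        fhandle h = { dir.id + 1 };             /* fake result */
        printf("LOOKUP(dir=%u, \"%s\")\n", dir.id, name);
        return h;
    }

    int main(void)
    {
        char path[] = "a/b/c/d";
        fhandle h = { 0 };                      /* handle of the root */
        for (char *comp = strtok(path, "/"); comp != NULL;
             comp = strtok(NULL, "/"))
            h = nfs_lookup(h, comp);            /* four RPCs for /a/b/c/d */
        return 0;
    }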

Once a client has identified and opened a file, the server gives the client a file handle, an opaque data structure that the client passes back to the server on each subsequent read and write of that file.

Note that since the server is stateless, the client has to store the file offset itself. This means that a call to lseek is completely local: it changes the client-side offset and sends nothing to the server.
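A sketch of the client side in C, with the read RPC faked. The offset travels with every read, and lseek touches only client memory:

    #include <stdio.h>

    typedef struct { unsigned id; } fhandle;

    /* Client-side open file: the offset lives here, because the
       stateless server remembers nothing between requests. */
    struct nfs_file { fhandle handle; long offset; };

    /* Stand-in for the read RPC: (handle, offset, count) is all the
       state the server ever needs. */
    static long nfs_read(fhandle h, long off, void *buf, long n)
    {
        printf("READ(handle=%u, offset=%ld, count=%ld)\n", h.id, off, n);
        return n;                          /* fake success */
    }

    static long my_read(struct nfs_file *f, void *buf, long n)
    {
        long got = nfs_read(f->handle, f->offset, buf, n);
        f->offset += got;                  /* client updates its own offset */
        return got;
    }

    static void my_lseek(struct nfs_file *f, long pos)
    {
        f->offset = pos;                   /* no message to the server */
    }

    int main(void)
    {
        struct nfs_file f = { { 7 }, 0 };
        char buf[512];
        my_read(&f, buf, sizeof buf);
        my_lseek(&f, 4096);                /* purely local */
        my_read(&f, buf, sizeof buf);
        return 0;
    }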

Performance is an extremely important concern. For this reason, communication between clients and servers usually uses UDP. Also, the server uses preforking: when it starts, it forks off a number of worker processes, so that it does not need to fork a new process or create a thread for each incoming request.
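A minimal preforking sketch in C; the worker body is a placeholder for a loop that would receive requests and send replies:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define NWORKERS 8

    static void serve_requests(void)
    {
        printf("worker %d ready\n", (int)getpid());
        pause();                    /* sketch: wait forever */
    }

    int main(void)
    {
        /* Fork the whole pool once at startup, so nothing needs to be
           forked on the fast path when a request arrives. */
        for (int i = 0; i < NWORKERS; i++) {
            pid_t pid = fork();
            if (pid == 0) {         /* child: become a worker */
                serve_requests();
                _exit(0);
            } else if (pid < 0) {
                perror("fork");
                exit(1);
            }
        }
        for (int i = 0; i < NWORKERS; i++)
            wait(NULL);             /* parent: reap workers at shutdown */
        return 0;
    }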

Security and authentication are also important concerns. The system administrator has to tell each server which clients may connect to it, and the server has to check all of the usual permissions to make sure that individual users are who they say they are and have access to particular files.

NFS has function calls to do almost anything that a user might want to do with a file. Here are some examples:

GETATTR / SETATTR - get or set a file's attributes
LOOKUP - map a name in a directory to a file handle
READ / WRITE - read or write a range of bytes in a file
CREATE / REMOVE - create or delete a file
MKDIR / RMDIR - create or delete a directory
RENAME - rename a file
READDIR - read the entries in a directory


AFS

AFS is a distributed filesystem that enables co-operating hosts (clients and servers) to efficiently share filesystem resources across both local area and wide area networks. It is far more robust than NFS.

AFS was developed at Carnegie Mellon University and is named the Andrew File System after both Andrew Carnegie and Andrew Mellon. There are several implementations of AFS. A proprietary version from Transarc Corporation (now owned by IBM) used to be the most widely used, but it is losing favor. There is an open source version called OpenAFS.

Recall that with NFS, different clients could mount the same file system in different places. AFS goes to the opposite extreme: there is one AFS file system for the planet. If you are on an AFS system such as RCS, the root is /afs. This provides access to every (or at least almost every) system running AFS. Hint: don't type ls -l /afs, because the long listing needs to contact each of the sites in the world to get its information, and this takes a while; you might want to go to lunch while you wait. But you should try the plain ls /afs.

AFS files are grouped together in cells. An AFS cell is a collection of servers grouped together administratively and presenting a single, cohesive filesystem. Typically, an AFS cell is a set of hosts that use the same Internet domain name. For example, all of the files in the rpi.edu domain constitute a cell. AFS cells can range from the small (1 server/client) to the massive (with tens of servers and thousands of clients).

AFS client machines run a very efficient Cache Manager process. The Cache Manager maintains information about the identities of the users logged into the machine, finds and requests data on their behalf, and keeps chunks of retrieved files on local disk.

The effect of this is that as soon as a remote file is accessed, a chunk of that file (often the whole file) gets copied to local disk, so subsequent accesses are almost as fast as local disk access and considerably faster than a read across the network. Local caching also significantly reduces the amount of network traffic.

Unlike NFS, which makes use of /etc/filesystems (on a client) to map (mount) between a local directory name and a remote filesystem, AFS does its mapping (filename to location) at the server. This has the tremendous advantage of making the served file space location independent.

Location independence means that a user does not need to know which file server holds a file; the user only needs to know its pathname.

To understand why such location independence is useful, consider having 20 clients and two servers. Let's say you had to move a filesystem "/home" from server a to server b.

Using NFS, you would have to change the /etc/filesystems file on 20 clients and take "/home" off-line while you moved it between servers.

With AFS, you simply move the AFS volume(s) which constitute "/home" between the servers. You do this "on-line" while users are actively using files in "/home" with no disruption to their work.

With location independence comes scalability. An architectural goal of the AFS designers was a client/server ratio of 200:1, which has been successfully exceeded at some sites.

AFS files are stored in structures called volumes, which reside on the disks of the AFS file server machines. Volumes containing frequently accessed data can be replicated read-only on several servers. For example, if many users use the C compiler gcc, there can be several instances of it on different servers. A given user does not know anything about this; he or she just types gcc, and AFS finds an instance.

AFS (and thus RCS) does not use the standard Unix permission system, and this is a source of confusion, because the Unix file permission bits are settable and visible, but ignored. On a typical Unix system, permissions are set on a per-file basis; on AFS, permissions are set on a per-directory basis.

The AFS permission system allows the owner of a directory to set four types of permission for that directory: lookup, insert, delete, and administer. Each file has three types of permissions: read, write, and lock. Unlike normal Unix permissions, these can be set for specific users: you can give Suzy read privileges for the files in a directory, for example. You can even give read privileges to everyone except Suzy.
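On OpenAFS, these rights are abbreviated r (read), l (lookup), i (insert), d (delete), w (write), k (lock), and a (administer), and are managed with the fs setacl command. For example (the directory path here is made up):

    fs setacl /afs/rpi.edu/home/me/shared suzy rl             # Suzy may list and read
    fs setacl /afs/rpi.edu/home/me/shared system:anyuser rl   # everyone may list and read
    fs setacl /afs/rpi.edu/home/me/shared suzy rl -negative   # ... except Suzy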

AFS is generally more secure than NFS. It uses the Kerberos authentication system, which will be discussed in detail in a later lesson. Newer versions of NFS also use Kerberos.

Samba

Samba is a freeware system that allows Unix and Microsoft Windows systems to share files. Samba is a suite of Unix applications that speak the SMB (Server Message Block) protocol, the protocol used by Windows to perform client-server networking. By supporting this protocol, Samba allows Unix servers to communicate with the same networking protocol as Microsoft Windows products. Thus, a Samba-enabled Unix machine can masquerade as a server on a Microsoft network and offer services such as:

- sharing one or more directory trees with Windows clients
- sharing printers
- assisting clients with network browsing
- authenticating clients logging onto a Windows domain
- providing or assisting with WINS name server resolution

The Samba suite revolves around a pair of Unix daemons that provide shared resources -- or shares -- to SMB clients on the network. (Shares are sometimes called services as well.) These daemons are smbd, which provides the file, print, and authentication services, and nmbd, which provides NetBIOS name service and network browsing support.

The official Samba web site is www.samba.org.



Multiprocessor and Distributed Systems

The term supercomputer is loosely defined as an extremely fast computer with lots of memory and storage. Numerous companies have built so-called supercomputers over the years. While these have often been technological wonders and have brought prestige to their manufacturers, in general they have been commercial failures, primarily because of their high cost. The trend now in supercomputing is away from the development of specialized processors and toward using commercial off-the-shelf (COTS) components in multiprocessor computers or clusters of computers. These often have all of the computational power and high availability of the fastest and most powerful special-purpose supercomputers but cost several orders of magnitude less.

One of the most active areas of computer science today is the development of techniques of parallel programming; that is, taking a large computation and spreading it over a number of CPUs to achieve speedup. One important issue in such programming is load balancing. If one process has multiple threads potentially running on different CPUs, it is important to keep all of the CPUs busy; if one is not careful, it is likely that one CPU will be overworked while another is sitting idle.

There are three different models of multiprocessor systems.

I will discuss each of these models, with particular emphasis on what this might mean for operating system design.

Shared memory multiprocessors

The simplest model for a shared memory multiprocessor is several CPUs on a common bus.

Since only one processor at a time can be accessing memory through the bus, this model works well with a small number of processors, but breaks down completely with a large number of processors because of bus contention.

Generally, each processor has its own L1 cache, which can dramatically reduce the number of main memory accesses, but this also introduces the problem of cache consistency. There are two kinds of problems here.

First, CPU 1 might read the value of a memory location, storing it in its local cache, and then it modifies this value. This means that there is now an inconsistency between the value in the cache of CPU 1 and main memory. If this is a shared variable, CPU 1 and CPU 2 see different values for the same variable.

The second possible source of cache inconsistency is if CPU 1 reads a value and places it in its cache, then CPU 2 writes a different value from its cache out to main memory.

There are a number of solutions to this problem, all ugly. One solution is to mark each cache block as read only, in which case it can be present in multiple caches, or read/write, in which case it can be present in only one cache at a time. If a CPU attempts to write to a word which is in one or more remote caches, the bus hardware detects this and sends a signal to all of the other caches. Try writing the code for that in such a way that it doesn't slow things down!
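The flavor of such a write-invalidate scheme can be shown with a toy simulation in C. This models in software what the snooping cache hardware does; all names are made up:

    #include <stdio.h>

    #define NCPUS 4

    enum state { INVALID, SHARED, EXCLUSIVE };

    /* One cached copy of a single memory word per CPU. */
    struct line { enum state st; int value; };
    static struct line cache[NCPUS];
    static int memory = 42;

    /* A read miss fetches from memory; the copy is read-only (SHARED),
       so it may coexist with copies in other caches. */
    static int cpu_read(int cpu)
    {
        if (cache[cpu].st == INVALID) {
            cache[cpu].value = memory;
            cache[cpu].st = SHARED;
        }
        return cache[cpu].value;
    }

    /* A write invalidates every other copy, so at most one cache holds
       the line in EXCLUSIVE (read/write) state at a time. */
    static void cpu_write(int cpu, int v)
    {
        for (int i = 0; i < NCPUS; i++)
            if (i != cpu)
                cache[i].st = INVALID;
        cache[cpu].st = EXCLUSIVE;
        cache[cpu].value = v;
        memory = v;                 /* write-through keeps the sketch simple */
    }

    int main(void)
    {
        printf("CPU0 reads %d\n", cpu_read(0));
        printf("CPU1 reads %d\n", cpu_read(1));
        cpu_write(1, 99);                        /* invalidates CPU0's copy */
        printf("CPU0 reads %d\n", cpu_read(0));  /* miss: sees the new value */
        return 0;
    }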

Because of the issue of bus contention, multiprocessor systems with more than a few CPUs need a more complex connection between the CPUs and memory. One such solution is the crossbar switch. Memory is divided up into a number of modules, and there is a potential connection between each CPU and each memory module, controlled by a switch at each junction of the grid. For example, in a system with eight CPUs and eight memory modules, for CPU 3 to access memory module 6, the switch at the junction of row 3 and column 6 would need to be turned on.

The crossbar switch allows different CPUs to access memory simultaneously, as long as each CPU accesses a different memory module. The number of switches is the number of CPUs times the number of memory modules (8 x 8 = 64 in this example), so this too becomes infeasible with very large numbers of CPUs or memory modules.

An alternative architecture is an omega network, which is built from a number of two-by-two switches. Each switch has two input lines and two output lines, and can route either input to either output, giving four possible connections:

In-1 -> Out-1
In-2 -> Out-1
In-1 -> Out-2
In-2 -> Out-2

With an omega network, eight CPUs can be connected to eight memory units with only 12 switches: there are log2 8 = 3 stages, each containing 8/2 = 4 of the two-by-two switches, and 3 x 4 = 12, compared with the 64 switches a crossbar would need.

NUMA Multiprocessors

The architectures discussed above are all Uniform Memory Access (UMA) since all processors can access all of memory equally fast. An alternative model is Non-uniform Memory Access (NUMA), in which each processor has some memory associated with it, and can access its own memory faster than the memory on another processor.

This architecture is still a multiprocessor system because typically all of the processors share a common memory addressing scheme (address space).

Multiprocessor Operating Systems

The text discusses three possible configurations of operating systems for a shared memory multiprocessor system: giving each CPU its own private copy of the operating system, sharing only the hardware; a master-slave arrangement, in which one CPU runs the operating system and hands out work to the others; and symmetric multiprocessing (SMP), in which any CPU can run the operating system, with kernel data structures protected by locks.

Even implementing synchronization primitives on a multiprocessor system is more difficult. On a single processor system, the kernel can ensure that any code that locks a mutex is executed atomically, without the possibility of interruption. On a multiprocessor system, this is not so simple: there needs to be a way to make sure that two different CPUs don't both lock a mutex at the same time.

Since all of the CPUs share a common bus, one solution is to allow the bus to enforce mutual exclusion with a special flag. In other words, when a CPU wants to perform an operation on a mutex, it requests the bus in the usual way and also requests that this flag be set. Thus, no other CPU can use the bus (meaning that no other CPU can access memory), until the CPU finishes locking the mutex and releases the bus. This works, but it requires special bus hardware, and it can also incur a substantial performance penalty because other CPUs are prevented from using the bus and accessing memory.
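The usual refinement is to lock the bus for only a single atomic read-modify-write instruction, such as test-and-set, rather than for the whole mutex operation. A spinlock sketch using C11 atomics:

    #include <stdatomic.h>
    #include <stdio.h>

    /* The hardware makes the test-and-set on this one word atomic
       across all CPUs, so the bus need not be held for long. */
    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void spin_lock(void)
    {
        while (atomic_flag_test_and_set(&lock))
            ;                        /* spin: someone else holds the lock */
    }

    static void spin_unlock(void)
    {
        atomic_flag_clear(&lock);
    }

    int main(void)
    {
        spin_lock();
        printf("in critical region\n");
        spin_unlock();
        return 0;
    }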

Unfortunately, deadlock can also be an issue here.

Process scheduling on a multiprocessor system

Process scheduling on a single CPU machine is far from trivial, and process scheduling on a multiprocessor system is even more complex. If all processes are independent, then we can use a scheduling algorithm similar to that used for a single processor: each CPU runs a process, and the system maintains a single list of runnable jobs. When a process on a particular CPU blocks, uses up its time quantum, or terminates, that CPU just takes the next runnable job from the list and continues.

But there are some complications. The first is that since mutexes are much more common on symmetric multiprocessors, it is possible that a process will use up its time quantum while it holds a mutex. It can't easily yield the mutex because it is in the process of updating a global data structure. On the other hand, if it is simply returned to the ready queue, any other processes which are waiting for that mutex will be forced to wait even longer. One solution is to give such processes extra time; if a process uses up its time quantum while it is in a critical region, it is permitted to finish its critical region and unlock the mutex before being replaced by another process.

Another issue is that it might make sense to restart a process on the same CPU when it returns from a blocked state: it is possible that much of its data is still in that CPU's cache, and that some of its pages are still in the TLB. This leads to a two-level scheduling algorithm. Once a process is assigned to a CPU, it always (or almost always) runs on that CPU, even if it is blocked for an extended period. Scheduling processes on a given CPU is one level; assigning new processes to a CPU is the second level.
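On Linux, this kind of processor affinity can also be requested explicitly. A sketch using the Linux-specific sched_setaffinity call; the choice of CPU 2 is arbitrary:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        /* Pin the calling process to CPU 2, so it keeps reusing that
           CPU's cache and TLB contents across blocks and wakeups. */
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);
        if (sched_setaffinity(0, sizeof mask, &mask) == -1) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to CPU 2\n");
        return 0;
    }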
