Both Multiprocessor Systems and Multicomputers are more or less tightly coupled: typically all of the computers are in the same room, and often in the same rack. The idea behind distributed computing is that many essentially independent computers, often widely distributed, can work together on a problem. A current initiative in our department is world wide computing, sometimes called Grid Computing, in which computers all over the world work together, communicating over the Internet.
There are few special implications for the design of operating systems for computers which will be doing loosely coupled, distributed computing. Each node simply runs its own native operating system. Note that it is possible, and indeed likely, that different nodes are running different operating systems. Interprocess communication between processes on different computers is done using standard protocols over the network.
We can expand our definition of an operating system to include any system which coordinates computation, even including widely distributed computation. If we do this, a distributed system introduces new problems.
One problem with distributed computing over the Internet is that there can be significant and unpredictable delays in transmission. This is a problem for applications such as distributed games, which need a consistent global state; that is difficult to maintain when communication lags are substantial and unpredictable. For example, suppose Alice and Bob are playing a distributed game in which players score points by killing bad guys, and both have a chance to kill the same bad guy at about the same time. Alice may kill the bad guy on her machine while, at the same time, Bob kills the same bad guy on his machine, because the news that Alice had already killed the bad guy has not yet reached his machine. This is an inconsistent state; each player thinks that they should get the points for killing the bad guy.
For a more formal analysis of the problem, I'll use a more mundane example (borrowed from Stallings, Operating Systems). An individual has accounts in two different branches of a bank, and every day at 3:00 each branch sends its balance to a central server. At about 3:00, each account has $100, and the account holder transfers $20 from the account in Branch A to the account in Branch B. There are two scenarios in which this could result in the wrong answer being stored on the central server. In the first, Branch A reports its balance after sending the transfer ($80) while Branch B reports its balance before receiving it ($100), so the server records a total of $180 and the $20 in transit has disappeared. In the second, Branch A reports before sending ($100) while Branch B reports after receiving ($120), so the server records a total of $220.
The algorithm uses a control message called a marker. Some process initiates the algorithm by recording its state and sending out a marker on all of its outgoing channels before any other messages are sent. Each process p, when it first receives the marker (say from process q), records its own state, records the state of the channel from q as empty, sends the marker out on all of its own outgoing channels, and begins recording the messages that arrive on its other incoming channels; it stops recording a channel when the marker arrives on that channel.
Each process terminates when it receives the marker from all other processes. A globally consistent state can be recorded by having every process send its state data along every outgoing channel.
Using a marker to solve our bank problem is trivial. Suppose that when the server thinks that it is 3:00, it sends out a marker. Consider our first case, in which Branch A has sent the transfer message but Branch B has not yet received it. Branch A receives the marker from the server and records its state (balance is $80). Branch B receives the marker from the server and records its state (balance is $100). But we're not done yet. Branch A sends the marker to Branch B, and by the rules that we set, it does not arrive until after the transfer message; so when the marker does arrive, Branch B has already received the transfer message, and the $20 is recorded as in transit on the channel from Branch A. Branch B also sends the marker to Branch A. Now the algorithm terminates, and the recorded state ($80 + $100 + $20 in transit = $200) is consistent.
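Here is a minimal sketch of the per-process marker rules in Java, driven by a main() that replays the bank scenario above. The class name, the integer balance used as the local state, and the way main() delivers the messages by hand are all illustrative; a real implementation would hook these rules into the application's actual channels.

import java.util.*;

// A sketch of the marker (snapshot) rules for one process.  The "balance"
// integer stands in for whatever local state the application really has.
class SnapshotProcess {
    final String name;
    int balance;                                                      // local state
    Integer recordedBalance = null;                                   // recorded local state
    final Map<String, List<Integer>> channelState = new HashMap<>();  // recorded in-transit messages
    final Set<String> markerSeenFrom = new HashSet<>();

    SnapshotProcess(String name, int balance) { this.name = name; this.balance = balance; }

    // Record the local state; the caller must then send a marker on every
    // outgoing channel before sending anything else.
    void recordState() { if (recordedBalance == null) recordedBalance = balance; }

    // Rule for receiving a marker on the channel from 'from'.
    void onMarker(String from) {
        recordState();                          // first marker: record local state
        markerSeenFrom.add(from);               // stop recording this channel
        channelState.putIfAbsent(from, new ArrayList<>());
    }

    // Rule for receiving an ordinary message (here, a transfer of 'amount') from 'from'.
    void onTransfer(String from, int amount) {
        balance += amount;
        // A message that arrives after we recorded our state but before the marker
        // on this channel is part of the channel's recorded state.
        if (recordedBalance != null && !markerSeenFrom.contains(from))
            channelState.computeIfAbsent(from, k -> new ArrayList<>()).add(amount);
    }

    public static void main(String[] args) {
        // Branch A has already sent $20 to Branch B; the transfer is still in flight.
        SnapshotProcess a = new SnapshotProcess("BranchA", 80);
        SnapshotProcess b = new SnapshotProcess("BranchB", 100);

        a.onMarker("Server");          // A records $80 and (conceptually) forwards the marker to B
        b.onMarker("Server");          // B records $100
        b.onTransfer("BranchA", 20);   // the in-flight transfer arrives (before A's marker, FIFO)
        b.onMarker("BranchA");         // A's marker arrives; the channel A->B is recorded as {20}

        System.out.println("A recorded " + a.recordedBalance + ", B recorded " + b.recordedBalance
                + ", in transit to B: " + b.channelState.get("BranchA"));
        // Prints 80, 100 and [20]: a total of $200, which is consistent.
    }
}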
The text (and everyone else) talks about middleware as if it were a separate layer in the protocol stack, but it isn't. These are just protocols for getting work done on a distributed network. I will discuss several of these systems.
The World Wide Web is based on a very simple protocol called HTTP (the Hypertext Transfer Protocol). A client (a web browser) sends a GET request for a document to a web server, using TCP/IP. Here is a simple GET request to retrieve the course home page. It would be sent to the CS department web server www.cs.rpi.edu on port 80.
GET /academics/courses/os/index.html HTTP/1.0

A web browser would follow this with some additional information, but this request alone would work.
The web server (Apache) would retrieve the web page and send it back to the browser. The browser would then display the document.
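To make the exchange concrete, here is a short Java sketch that sends that same request over a raw TCP socket and prints whatever comes back. It assumes the server still answers plain HTTP/1.0 requests on port 80; a modern server may instead redirect or refuse.

import java.io.*;
import java.net.Socket;

// Sends a bare HTTP/1.0 GET request and prints the response (status line,
// headers, a blank line, and then the document itself).
public class SimpleGet {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("www.cs.rpi.edu", 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {

            // The request line, followed by the blank line that ends the request.
            out.print("GET /academics/courses/os/index.html HTTP/1.0\r\n\r\n");
            out.flush();

            String line;
            while ((line = in.readLine()) != null)
                System.out.println(line);
        }
    }
}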
The web was initially conceived to retrieve static documents such as this, but very quickly, it was modified to include new features.
CGI (the Common Gateway Interface). This allows the client (the browser) to send a more specific request to the server. For example, a request to a stock market quote server might look something like this.
GET /quote?stock=IBM HTTP/1.0

Rather than returning a static document, the server would query a database to retrieve the current price of a share of IBM stock, format a document on the fly, and send it to the client.
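Here is a sketch of the server side of such a call, using the JDK's built-in com.sun.net.httpserver package rather than a real CGI program. The /quote path and the stock parameter match the request above; lookupPrice is a stub standing in for the database query.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// A toy quote server: it parses the query string and builds the document on the fly.
public class QuoteServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/quote", exchange -> {
            String query = exchange.getRequestURI().getQuery();          // e.g. "stock=IBM"
            String symbol = (query != null && query.startsWith("stock="))
                    ? query.substring("stock=".length()) : "unknown";
            String body = "<html><body>" + symbol + " is trading at $"
                    + lookupPrice(symbol) + "</body></html>";
            exchange.sendResponseHeaders(200, body.length());
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body.getBytes());
            }
        });
        server.start();   // try http://localhost:8080/quote?stock=IBM
    }

    // Stand-in for the real database query.
    static double lookupPrice(String symbol) { return 123.45; }
}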
There are a number of widely used server technologies that have been developed to facilitate such CGI calls. Examples include the free Apache Tomcat server, which runs Java Server Pages. Commercial products include Microsoft ASP, BEA WebLogic, ColdFusion, and IBM WebSphere.
Cookies. Initially HTTP was conceived as a stateless protocol: the server simply received requests and returned the appropriate documents. However, for more elaborate interactions, the server needs to associate a series of requests with one another, so it sends a small piece of information, called a cookie, to the client. The browser stores this cookie and, whenever it makes a new request to the server, sends the cookie along, so that the server knows the request is part of an ongoing session.
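Here is a small client-side sketch of cookie handling using java.net.http.HttpClient (Java 11 or later). The CookieManager stores any cookie the first response sets and automatically attaches it to the second request; the course web server is used as the target only for illustration and may not actually set a cookie.

import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Two requests to the same server; cookies set by the first response are sent
// back automatically with the second request.
public class CookieDemo {
    public static void main(String[] args) throws Exception {
        CookieManager cookies = new CookieManager();
        HttpClient client = HttpClient.newBuilder().cookieHandler(cookies).build();

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://www.cs.rpi.edu/academics/courses/os/index.html")).build();

        client.send(request, HttpResponse.BodyHandlers.discarding());   // server may set a cookie here
        System.out.println("Stored cookies: " + cookies.getCookieStore().getCookies());

        client.send(request, HttpResponse.BodyHandlers.discarding());   // cookie (if any) sent along
    }
}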
Object based Middleware
We have already discussed Remote Procedure Calls and the Java equivalent, Remote Method Invocation (RMI). The logical extension of this is distributed objects. The best-known attempt to implement this is CORBA (the Common Object Request Broker Architecture).
Developed by the Object Management Group, it allows clients to access remote objects. Unlike RPCs and RMI, it is not limited to one language.
The Object Request Broker (ORB) is the key component. It manages all communication between components, allowing objects to interact in a heterogeneous, distributed environment, independent of the platforms on which the objects reside and the techniques used to implement them.
The functions of the ORB are as follows:
The Dynamic Invocation Interface (DII) API in CORBA provides a dynamic interface to a CORBA object whose interface is not known to the client at compile time.
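As a rough illustration, here is what a DII call can look like in Java. This assumes a CORBA-enabled JDK (Java 8 or earlier; org.omg.CORBA was removed from the JDK in Java 11), and the operation name, argument, and stringified object reference are all invented for the example.

import org.omg.CORBA.ORB;
import org.omg.CORBA.Request;
import org.omg.CORBA.TCKind;

// Invokes an operation on a CORBA object whose interface was not known at
// compile time, by building the request dynamically.
public class DiiClient {
    public static void main(String[] args) {
        ORB orb = ORB.init(args, null);

        // Placeholder reference: a real client would obtain the stringified IOR
        // from a naming service or from a file published by the server.
        org.omg.CORBA.Object target = orb.string_to_object("IOR:...");

        Request request = target._request("get_quote");                  // operation name
        request.add_in_arg().insert_string("IBM");                        // one "in" argument
        request.set_return_type(orb.get_primitive_tc(TCKind.tk_double));  // expected result type

        request.invoke();
        System.out.println("Result: " + request.return_value().extract_double());
    }
}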
CORBA seems to be going out of fashion.
File System Based Middleware
We have already discussed two instances of file system based middleware (what I call a networked file system); here is a third.
Google has some unique file system needs. It generates enormous numbers of files, many of which are themselves enormous; the files have to be widely distributed; enormous numbers of clients need extremely fast access to them; and the system has to be highly fault tolerant. As a result, Google developed its own file system, GFS (the Google File System).
Here are some of its constraints
Computers are grouped into clusters. A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Files are divided into 64 MB chunks (no, that's not a typo). Each chunk is identified by a globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range, bypassing the inodes and vnodes. For reliability, each chunk is replicated on multiple chunkservers; the default is three replicas.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
With only one master per cluster, there is a danger that the master could become a bottleneck. To avoid this, clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact, caches this information for a limited time, and interacts with the chunkservers directly for many subsequent operations.
Here is how a typical read works. Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. It sends the master a request containing the file name and the chunk index, and the master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information and then sends a read request to one of the replicas, most likely the closest one, specifying the chunk handle and a byte range within that chunk; further reads of the same chunk require no interaction with the master until the cached information expires.
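Below is a sketch of this flow from the client's point of view. The MasterClient and ChunkserverClient interfaces and their methods are hypothetical stand-ins for GFS's real RPCs; the point is only the division of labor between the master (metadata) and the chunkservers (data).

import java.util.List;

// A sketch of a GFS-style client read.  All of the RPC interfaces below are
// hypothetical; they stand in for whatever wire protocol the real system uses.
public class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

    // Metadata answer from the master: which chunk, and who holds replicas of it.
    record ChunkLocation(long chunkHandle, List<String> replicaAddresses) {}

    // Hypothetical RPC stubs.
    interface MasterClient {
        ChunkLocation findChunk(String fileName, long chunkIndex);   // metadata only
    }
    interface ChunkserverClient {
        byte[] readChunk(String address, long chunkHandle, long offsetInChunk, int length);
    }

    final MasterClient master;
    final ChunkserverClient chunkservers;

    GfsReadSketch(MasterClient master, ChunkserverClient chunkservers) {
        this.master = master;
        this.chunkservers = chunkservers;
    }

    byte[] read(String fileName, long fileOffset, int length) {
        // 1. Translate (file name, byte offset) into a chunk index.
        long chunkIndex = fileOffset / CHUNK_SIZE;
        long offsetInChunk = fileOffset % CHUNK_SIZE;

        // 2. Ask the master for the chunk handle and replica locations
        //    (a real client would cache this answer for a while).
        ChunkLocation loc = master.findChunk(fileName, chunkIndex);

        // 3. Read the data directly from a chunkserver, bypassing the master.
        String replica = loc.replicaAddresses().get(0);   // e.g. pick the closest replica
        return chunkservers.readChunk(replica, loc.chunkHandle(), offsetInChunk, length);
    }

    public static void main(String[] args) {
        // Tiny in-memory fakes so the sketch runs end to end.
        MasterClient fakeMaster = (file, index) ->
                new ChunkLocation(42L, List.of("chunkserver-a:7000", "chunkserver-b:7000"));
        ChunkserverClient fakeChunkservers = (addr, handle, off, len) ->
                ("data from " + addr + " chunk " + handle).getBytes();

        GfsReadSketch client = new GfsReadSketch(fakeMaster, fakeChunkservers);
        System.out.println(new String(client.read("/logs/web.log", 3 * CHUNK_SIZE + 100, 1024)));
    }
}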
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages. This eliminates the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
File system consistency is very important. To ensure it, all namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master; namespace locking guarantees atomicity and correctness. All metadata changes are logged, and the log file is replicated. In the event of a crash, the master can recreate its metadata by replaying the log file. The master periodically writes a checkpoint (storing all the metadata), so the log file only has to record events since the last checkpoint; this keeps the log file size manageable.
Data are also kept consistent. An append operation writes data at the end of a file, but the end is determined by the file system, not by where the client thinks the end is. This means that multiple concurrent appends will be handled correctly (the order of the two appends might be determined by a race condition, but both writes will be done correctly). Appends (and writes) happen to all of the replicas in a set order.
Each chunk has a version number, updated at every write or append, so it is easy for the master to determine whether a particular chunk is stale, i.e., not the latest version. All stale chunks are deleted. Once a problem is detected, the master quickly creates new replicas, so the only way data in a chunk can be lost is if all of the replicas are lost before the file system can correct the problem. Even then, data may be lost, but the file system structure remains consistent.
Here is a more detailed breakdown of the steps involved in a write or append. The client asks the master which chunkserver holds the current lease on the chunk (the primary) and where the other replicas are. The client pushes the data to all of the replicas, which buffer it. The client then sends the write request to the primary, which assigns the mutation a serial number, applies it locally, and forwards the request to the secondary replicas, which apply it in the same serial order. The secondaries report completion to the primary, which finally replies to the client.
The remaining systems allow widely distributed computing over the Internet.
Jini (pronounced "gee nee," like the word genie) is a distributed system developed by Sun that allows a client on one machine to find and use a service on a different machine. A service may be a computation, storage, a communication channel to another user, a software filter, a hardware device, or another user. Two examples of services are printing a document and translating from one word-processor format to another.
The essence of Jini is the lookup service (LUS). When a client wishes to use a service, it connects to the LUS, which returns a registrar object (other systems might call this a handle) that the client uses to look up a particular service. Thus Jini is more powerful than a Remote Procedure Call or Java Remote Method Invocation, where the client needs to know the location of the service in advance.
Once it locates the service, the client communicates with it directly, as it would if the service were on the local machine.
Services register with the lookup service, advertising their existence. The set of services can change dynamically: new services become available, and old services are taken down.
The Jini package itself provides the tools for service construction, lookup, and communication. These are available for many different platforms.
Here is an example.
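A minimal client sketch, assuming a hypothetical Printer service interface and a lookup service reachable at a host called jini-host (the net.jini classes come from the Jini libraries, now maintained as Apache River):

import net.jini.core.discovery.LookupLocator;
import net.jini.core.lookup.ServiceRegistrar;
import net.jini.core.lookup.ServiceTemplate;

// Finds a printing service through the lookup service and uses it as if it
// were local.  Printer and "jini-host" are made up for the example.
public class PrintClient {
    public interface Printer { void print(String document) throws java.rmi.RemoteException; }

    public static void main(String[] args) throws Exception {
        // Contact the lookup service and obtain the registrar object.
        LookupLocator locator = new LookupLocator("jini://jini-host");
        ServiceRegistrar registrar = locator.getRegistrar();

        // Ask for any service that implements the Printer interface.
        ServiceTemplate template =
                new ServiceTemplate(null, new Class[] { Printer.class }, null);
        Printer printer = (Printer) registrar.lookup(template);

        // From here on the client talks to the service directly.
        printer.print("hello, world");
    }
}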
Globus is an alliance formed to develop fundamental grid technologies that let people share computing power, databases, and other on-line tools across the Internet. Its main product is the Globus Toolkit, an open-source software toolkit for building Grid-based applications.
Here are some examples:
The Globus Architecture
The Globus toolkit provides three sets of components
BOINC, the Berkeley Open Infrastructure for Network Computing, is a middleware infrastructure for grid computing. It was originally developed for the SETI@home project but is now publicly available. It is used for very large scale scientific computation in which individual workunits are distributed to volunteers. The volunteers perform their computations when their computers are not being used for other purposes; when they complete a particular computation, they return the results to the server and receive a new assignment. The clients may be running many different operating systems and can go down at any time, so the server(s) have to take this into account. BOINC has been used for projects in drug discovery, mathematics, climatology, and astrophysics.
The architecture consists of a server system and client software that communicate with each other to distribute, process, and return workunits.
The server complex of a BOINC project is centered around a relational database that stores descriptions of applications, platforms, versions, workunits, results, accounts, and so on. It provides tools for creating, starting, stopping, and querying projects, adding new applications and platforms, creating workunits, and monitoring server performance.
BOINC has to be able to deal with erroneous computations (i.e. wrong answers). To do this, it supports redundant computation. If different clients return different answers to the same workunit, it distributes yet more identical workunits to other clients until there is agreement.
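Here is a much-simplified sketch of that idea: a result for a workunit is accepted only when some number of clients agree on it. The Workunit record, the quorum of two, and the string results are invented for the example; BOINC's real validator framework is far more elaborate.

import java.util.*;

// A toy version of redundant computation with agreement: keep assigning the
// same workunit to different clients until 'QUORUM' of them return the same result.
public class RedundancySketch {
    record Workunit(String id, long input) {}

    static final int QUORUM = 2;

    // Counts how many clients reported each result; accepts one once it reaches the quorum.
    static String validate(Workunit wu, List<String> reportedResults) {
        Map<String, Integer> counts = new HashMap<>();
        for (String result : reportedResults) {
            int n = counts.merge(result, 1, Integer::sum);
            if (n >= QUORUM)
                return result;            // enough clients agree: accept this answer
        }
        return null;                      // no agreement yet: send out another copy
    }

    public static void main(String[] args) {
        Workunit wu = new Workunit("wu-17", 123456789L);
        // Two clients agree; one returned a wrong answer (flaky hardware, bad client, ...).
        List<String> reports = List.of("0.8515", "0.8431", "0.8515");
        String accepted = validate(wu, reports);
        System.out.println(accepted == null
                ? "no quorum yet, issue another replica of " + wu.id()
                : "accepted result " + accepted + " for " + wu.id());
    }
}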
Here are some of the projects