Both Multiprocessor Systems and Multicomputers are more or less tightly coupled: typically all of the computers are in the same room, and often in the same rack. The idea behind distributed computing is that many essentially independent computers, often widely distributed, can work together on a problem. A current initiative in our department is world wide computing, sometimes called Grid Computing, in which computers all over the world work together, communicating over the Internet.
There are few special implications for the design of operating systems for computers which will be doing loosely coupled, distributed computing. Each node simply runs its own native operating system. Note that it is possible, and indeed likely, that different nodes are running different operating systems. Interprocess communication between processes on different computers is done using standard protocols over the network.
We can expand our definition of an operating system to include any system which coordinates computation, even including widely distributed computation. If we do this, a distributed system introduces new problems.
One problem with distributed computing over the Internet is that there can be significant and unpredictable delays in transmission. This is a problem for applications such as distributed games, which need a consistent global state; that is difficult to maintain when communication lags are substantial and unpredictable. For example, suppose Alice and Bob are playing a distributed game in which players score points by killing bad guys, and both have a chance to kill the same bad guy at about the same time. Alice may kill the bad guy on her machine while, at the same time, Bob kills the same bad guy on his machine, because the news that Alice had already killed the bad guy has not yet reached his machine. This is an inconsistent state; each player thinks that they should get the points for killing the bad guy.
For a more formal analysis of the problem, I'll use a more mundane example (borrowed from Stallings, Operating Systems). An individual has accounts in two different branches of a bank, and every day at 3:00 each branch sends its balance to a central server. At about 3:00, each account has $100, and the account holder transfers $20 from the account in Branch A to the account in Branch B. There are two scenarios in which this could result in the wrong answer being stored on the central server. In the first, Branch A reports its balance after sending the transfer ($80) while Branch B reports its balance before receiving it ($100), so the server records a total of $180 and the $20 in transit has disappeared. In the second, Branch A reports before sending ($100) while Branch B reports after receiving ($120), so the server records a total of $220.
The algorithm uses a control message called a marker. Some process initiates the algorithm by recording its state and sending out a marker on all of its outgoing channels before any other messages are sent. Each process p, when it first receives the marker (say from process q), records its own state, records the state of the channel from q as empty, sends the marker out on all of its own outgoing channels, and begins recording the messages that arrive on its other incoming channels; it stops recording a channel when the marker arrives on that channel.
Each process terminates when it receives the marker from all other processes. A globally consistent state can be recorded by having every process send its state data along every outgoing channel.
Using a marker to solve our bank problem is trivial. Suppose that when the server thinks that it is 3:00, it sends out a marker. Consider our first case, in which Branch A has sent the transfer message but Branch B has not yet received it. Branch A receives the marker from the server and records its state (balance is $80). Branch B receives the marker from the server and records its state (balance is $100). But we're not done yet. Branch A sends the marker to Branch B, and by the rules that we set, it does not arrive until after the transfer message; so when the marker does arrive, Branch B has already received the transfer message, and the $20 is recorded as in transit on the channel from Branch A. Branch B also sends the marker to Branch A. Now the algorithm terminates, and the recorded state ($80 + $100 + $20 in transit = $200) is consistent.
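Here is a minimal sketch of the per-process marker rules in Java, driven by a main() that replays the bank scenario above. The class name, the integer balance used as the local state, and the way main() delivers the messages by hand are all illustrative; a real implementation would hook these rules into the application's actual channels.

import java.util.*;

// A sketch of the marker (snapshot) rules for one process.  The "balance"
// integer stands in for whatever local state the application really has.
class SnapshotProcess {
    final String name;
    int balance;                                                      // local state
    Integer recordedBalance = null;                                   // recorded local state
    final Map<String, List<Integer>> channelState = new HashMap<>();  // recorded in-transit messages
    final Set<String> markerSeenFrom = new HashSet<>();

    SnapshotProcess(String name, int balance) { this.name = name; this.balance = balance; }

    // Record the local state; the caller must then send a marker on every
    // outgoing channel before sending anything else.
    void recordState() { if (recordedBalance == null) recordedBalance = balance; }

    // Rule for receiving a marker on the channel from 'from'.
    void onMarker(String from) {
        recordState();                          // first marker: record local state
        markerSeenFrom.add(from);               // stop recording this channel
        channelState.putIfAbsent(from, new ArrayList<>());
    }

    // Rule for receiving an ordinary message (here, a transfer of 'amount') from 'from'.
    void onTransfer(String from, int amount) {
        balance += amount;
        // A message that arrives after we recorded our state but before the marker
        // on this channel is part of the channel's recorded state.
        if (recordedBalance != null && !markerSeenFrom.contains(from))
            channelState.computeIfAbsent(from, k -> new ArrayList<>()).add(amount);
    }

    public static void main(String[] args) {
        // Branch A has already sent $20 to Branch B; the transfer is still in flight.
        SnapshotProcess a = new SnapshotProcess("BranchA", 80);
        SnapshotProcess b = new SnapshotProcess("BranchB", 100);

        a.onMarker("Server");          // A records $80 and (conceptually) forwards the marker to B
        b.onMarker("Server");          // B records $100
        b.onTransfer("BranchA", 20);   // the in-flight transfer arrives (before A's marker, FIFO)
        b.onMarker("BranchA");         // A's marker arrives; the channel A->B is recorded as {20}

        System.out.println("A recorded " + a.recordedBalance + ", B recorded " + b.recordedBalance
                + ", in transit to B: " + b.channelState.get("BranchA"));
        // Prints 80, 100 and [20]: a total of $200, which is consistent.
    }
}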
The text (and everyone else) talks about middleware as if it were a separate layer in the protocol stack, but it isn't. These are just protocols for getting work done on a distributed network. I will discuss several of these systems.
The World Wide Web is based on a very simple protocol called HTTP (the Hypertext Transfer Protocol). A client (a web browser) sends a GET request for a document to a web server, using TCP/IP. Here is a simple GET request to retrieve the course home page. It would be sent to the CS department web server www.cs.rpi.edu on port 80.
GET /academics/courses/os/index.html HTTP/1.0

A web browser would follow this with some additional information, but this request alone would work.
The web server (Apache) would retrieve the web page and send it back to the browser. The browser would then display the document.
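To make the exchange concrete, here is a short Java sketch that sends that same request over a raw TCP socket and prints whatever comes back. It assumes the server still answers plain HTTP/1.0 requests on port 80; a modern server may instead redirect or refuse.

import java.io.*;
import java.net.Socket;

// Sends a bare HTTP/1.0 GET request and prints the response (status line,
// headers, a blank line, and then the document itself).
public class SimpleGet {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("www.cs.rpi.edu", 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {

            // The request line, followed by the blank line that ends the request.
            out.print("GET /academics/courses/os/index.html HTTP/1.0\r\n\r\n");
            out.flush();

            String line;
            while ((line = in.readLine()) != null)
                System.out.println(line);
        }
    }
}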
The web was initially conceived to retrieve static documents such as this, but very quickly, it was modified to include new features.
CGI (the Common Gateway Interface). This allows the client (the browser) to send a more specific request to the server. For example, a request to a stock market quote server might look something like this.
GET /quote?stock=IBM HTTP/1.0

Rather than returning a static document, the server would query a database to retrieve the current price of a share of IBM stock, format a document on the fly, and send it to the client.
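Here is a sketch of the server side of such a call, using the JDK's built-in com.sun.net.httpserver package rather than a real CGI program. The /quote path and the stock parameter match the request above; lookupPrice is a stub standing in for the database query.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// A toy quote server: it parses the query string and builds the document on the fly.
public class QuoteServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/quote", exchange -> {
            String query = exchange.getRequestURI().getQuery();          // e.g. "stock=IBM"
            String symbol = (query != null && query.startsWith("stock="))
                    ? query.substring("stock=".length()) : "unknown";
            String body = "<html><body>" + symbol + " is trading at $"
                    + lookupPrice(symbol) + "</body></html>";
            exchange.sendResponseHeaders(200, body.length());
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body.getBytes());
            }
        });
        server.start();   // try http://localhost:8080/quote?stock=IBM
    }

    // Stand-in for the real database query.
    static double lookupPrice(String symbol) { return 123.45; }
}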
There are a number of widely used server technologies that have been developed to facilitate such CGI calls. Examples include the free Apache Tomcat server, which runs Java Server Pages. Commercial products include Microsoft ASP, BEA WebLogic, ColdFusion, and IBM WebSphere.
Cookies. Initially HTTP was conceived as a stateless protocol: the server simply received requests and returned the appropriate documents. However, for more elaborate interactions, the server needs to associate a series of requests with one another, so it sends a small piece of information, called a cookie, to the client. The browser stores this cookie and, whenever it makes a new request to the server, sends the cookie along, so that the server knows the request is part of an ongoing session.
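Here is a small client-side sketch of cookie handling using java.net.http.HttpClient (Java 11 or later). The CookieManager stores any cookie the first response sets and automatically attaches it to the second request; the course web server is used as the target only for illustration and may not actually set a cookie.

import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Two requests to the same server; cookies set by the first response are sent
// back automatically with the second request.
public class CookieDemo {
    public static void main(String[] args) throws Exception {
        CookieManager cookies = new CookieManager();
        HttpClient client = HttpClient.newBuilder().cookieHandler(cookies).build();

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://www.cs.rpi.edu/academics/courses/os/index.html")).build();

        client.send(request, HttpResponse.BodyHandlers.discarding());   // server may set a cookie here
        System.out.println("Stored cookies: " + cookies.getCookieStore().getCookies());

        client.send(request, HttpResponse.BodyHandlers.discarding());   // cookie (if any) sent along
    }
}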
Object based Middleware
We have already discussed Remote Procedure Calls and the Java equivalent, Remote Method Invocation (RMI). The logical extension of this is distributed objects. The best-known attempt to implement this is CORBA (the Common Object Request Broker Architecture).
Developed by the Object Management Group, it allows clients to access remote objects. Unlike RPCs and RMI, it is not limited to one language.
The Object Request Broker (ORB) is the key component. It manages all communication between components, allowing objects to interact in a heterogeneous, distributed environment, independent of the platforms on which the objects reside and the techniques used to implement them.
The functions of the ORB are as follows:
The Dynamic Invocation Interface (DII) API in CORBA provides a dynamic interface to a CORBA object whose interface is not known to the client at compile time.
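As a rough illustration, here is what a DII call can look like in Java. This assumes a CORBA-enabled JDK (Java 8 or earlier; org.omg.CORBA was removed from the JDK in Java 11), and the operation name, argument, and stringified object reference are all invented for the example.

import org.omg.CORBA.ORB;
import org.omg.CORBA.Request;
import org.omg.CORBA.TCKind;

// Invokes an operation on a CORBA object whose interface was not known at
// compile time, by building the request dynamically.
public class DiiClient {
    public static void main(String[] args) {
        ORB orb = ORB.init(args, null);

        // Placeholder reference: a real client would obtain the stringified IOR
        // from a naming service or from a file published by the server.
        org.omg.CORBA.Object target = orb.string_to_object("IOR:...");

        Request request = target._request("get_quote");                  // operation name
        request.add_in_arg().insert_string("IBM");                        // one "in" argument
        request.set_return_type(orb.get_primitive_tc(TCKind.tk_double));  // expected result type

        request.invoke();
        System.out.println("Result: " + request.return_value().extract_double());
    }
}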
CORBA seems to be going out of fashion.
File System Based Middleware
We have already discussed two instances of file system based middleware (what I call a networked file system); here is a third.
Google has some unique file system needs. It generates enormous numbers of files, many of which are themselves enormous; the files have to be widely distributed; enormous numbers of clients need extremely fast access to them; and the system has to be highly fault tolerant. As a result, Google developed its own file system, GFS (the Google File System).
Here are some of its constraints
Computers are grouped into clusters. A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Files are divided into 64 MB chunks (no, that's not a typo). Each chunk is identified by a globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range, bypassing the inodes and vnodes. For reliability, each chunk is replicated on multiple chunkservers; the default is three replicas.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
With only one master per cluster, there is a danger that the master could become a bottleneck. To avoid this, clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact, caches this information for a limited time, and interacts with the chunkservers directly for many subsequent operations.
Here is how a typical read works. Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. It sends the master a request containing the file name and the chunk index, and the master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information and then sends a read request to one of the replicas, most likely the closest one, specifying the chunk handle and a byte range within that chunk; further reads of the same chunk require no interaction with the master until the cached information expires.
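Below is a sketch of this flow from the client's point of view. The MasterClient and ChunkserverClient interfaces and their methods are hypothetical stand-ins for GFS's real RPCs; the point is only the division of labor between the master (metadata) and the chunkservers (data).

import java.util.List;

// A sketch of a GFS-style client read.  All of the RPC interfaces below are
// hypothetical; they stand in for whatever wire protocol the real system uses.
public class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

    // Metadata answer from the master: which chunk, and who holds replicas of it.
    record ChunkLocation(long chunkHandle, List<String> replicaAddresses) {}

    // Hypothetical RPC stubs.
    interface MasterClient {
        ChunkLocation findChunk(String fileName, long chunkIndex);   // metadata only
    }
    interface ChunkserverClient {
        byte[] readChunk(String address, long chunkHandle, long offsetInChunk, int length);
    }

    final MasterClient master;
    final ChunkserverClient chunkservers;

    GfsReadSketch(MasterClient master, ChunkserverClient chunkservers) {
        this.master = master;
        this.chunkservers = chunkservers;
    }

    byte[] read(String fileName, long fileOffset, int length) {
        // 1. Translate (file name, byte offset) into a chunk index.
        long chunkIndex = fileOffset / CHUNK_SIZE;
        long offsetInChunk = fileOffset % CHUNK_SIZE;

        // 2. Ask the master for the chunk handle and replica locations
        //    (a real client would cache this answer for a while).
        ChunkLocation loc = master.findChunk(fileName, chunkIndex);

        // 3. Read the data directly from a chunkserver, bypassing the master.
        String replica = loc.replicaAddresses().get(0);   // e.g. pick the closest replica
        return chunkservers.readChunk(replica, loc.chunkHandle(), offsetInChunk, length);
    }

    public static void main(String[] args) {
        // Tiny in-memory fakes so the sketch runs end to end.
        MasterClient fakeMaster = (file, index) ->
                new ChunkLocation(42L, List.of("chunkserver-a:7000", "chunkserver-b:7000"));
        ChunkserverClient fakeChunkservers = (addr, handle, off, len) ->
                ("data from " + addr + " chunk " + handle).getBytes();

        GfsReadSketch client = new GfsReadSketch(fakeMaster, fakeChunkservers);
        System.out.println(new String(client.read("/logs/web.log", 3 * CHUNK_SIZE + 100, 1024)));
    }
}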
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages. This eliminates the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
File system consistency is very important. To ensure it, all namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master; namespace locking guarantees atomicity and correctness. All metadata changes are logged, and the log file is replicated. In the event of a crash, the master can recreate its metadata by replaying the log file. The master periodically writes a checkpoint (storing all the metadata), so the log file only has to record events since the last checkpoint; this keeps the log file size manageable.
Data are also kept consistent. An append operation writes data at the end of a file, but the end is determined by the file system, not by where the client thinks the end is. This means that multiple concurrent appends will be handled correctly (the order of the two appends might be determined by a race condition, but both writes will be done correctly). Appends (and writes) happen to all of the replicas in a set order.
Each chunk has a version number, updated at every write or append, so it is easy for the master to determine whether a particular chunk is stale, i.e., not the latest version. All stale chunks are deleted. Once a problem is detected, the master quickly creates new replicas, so the only way data in a chunk can be lost is if all of the replicas are lost before the file system can correct the problem. Even then, data may be lost, but the file system structure remains consistent.
Here is a more detailed breakdown of the steps involved in a write or append. The client asks the master which chunkserver holds the current lease on the chunk (the primary) and where the other replicas are. The client pushes the data to all of the replicas, which buffer it. The client then sends the write request to the primary, which assigns the mutation a serial number, applies it locally, and forwards the request to the secondary replicas, which apply it in the same serial order. The secondaries report completion to the primary, which finally replies to the client.
The remaining systems allow widely distributed computing over the Internet.
Jini (pronounced "gee nee," like the word genie) is a distributed system developed by Sun that allows a client on one machine to find and use a service on a different machine. A service may be a computation, storage, a communication channel to another user, a software filter, a hardware device, or another user. Two examples of services are printing a document and translating from one word-processor format to another.
The essence of Jini is the lookup service (LUS). When a client wishes to use a service, it connects to the LUS, which returns a registrar object (other systems might call this a handle) that the client uses to look up a particular service. Thus Jini is more powerful than a Remote Procedure Call or Java Remote Method Invocation, where the client needs to know the location of the service in advance.
Once it locates the service, the client communicates with it directly, as it would if the service were on the local machine.
Services register with the lookup service, advertising their existence. The set of services can change dynamically: new services become available, and old services are taken down.
The Jini package itself provides the tools for service construction, lookup, and communication. These are available for many different platforms.
Here is an example.
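A minimal client sketch, assuming a hypothetical Printer service interface and a lookup service reachable at a host called jini-host (the net.jini classes come from the Jini libraries, now maintained as Apache River):

import net.jini.core.discovery.LookupLocator;
import net.jini.core.lookup.ServiceRegistrar;
import net.jini.core.lookup.ServiceTemplate;

// Finds a printing service through the lookup service and uses it as if it
// were local.  Printer and "jini-host" are made up for the example.
public class PrintClient {
    public interface Printer { void print(String document) throws java.rmi.RemoteException; }

    public static void main(String[] args) throws Exception {
        // Contact the lookup service and obtain the registrar object.
        LookupLocator locator = new LookupLocator("jini://jini-host");
        ServiceRegistrar registrar = locator.getRegistrar();

        // Ask for any service that implements the Printer interface.
        ServiceTemplate template =
                new ServiceTemplate(null, new Class[] { Printer.class }, null);
        Printer printer = (Printer) registrar.lookup(template);

        // From here on the client talks to the service directly.
        printer.print("hello, world");
    }
}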
Globus is an alliance formed to develop fundamental grid technologies that let people share computing power, databases, and other on-line tools across the Internet. Its main product is the Globus Toolkit, an open-source software toolkit for building Grid-based applications.
Here are some examples:
The Globus Architecture
The Globus toolkit provides three sets of components
BOINC, the Berkeley Open Infrastructure for Network Computing, is a middleware infrastructure for grid computing. It was originally developed for the SETI@home project but is now publicly available. It is used for very large scale scientific computation in which individual workunits are distributed to volunteers. The volunteers perform their computations when their computers are not being used for other purposes; when they complete a particular computation, they return the results to the server and receive a new assignment. The clients may be running many different operating systems and can go down at any time, so the server(s) have to take this into account. BOINC has been used for projects in drug discovery, mathematics, climatology, and astrophysics.
The architecture consists of a server system and client software that communicate with each other to distribute, process, and return workunits.
The server complex of a BOINC project is centered around a relational database that stores descriptions of applications, platforms, versions, workunits, results, accounts, and so on. It provides tools for creating, starting, stopping, and querying projects, adding new applications and platforms, creating workunits, and monitoring server performance.
BOINC has to be able to deal with erroneous computations (i.e. wrong answers). To do this, it supports redundant computation. If different clients return different answers to the same workunit, it distributes yet more identical workunits to other clients until there is agreement.
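Here is a much-simplified sketch of that idea: a result for a workunit is accepted only when some number of clients agree on it. The Workunit record, the quorum of two, and the string results are invented for the example; BOINC's real validator framework is far more elaborate.

import java.util.*;

// A toy version of redundant computation with agreement: keep assigning the
// same workunit to different clients until 'QUORUM' of them return the same result.
public class RedundancySketch {
    record Workunit(String id, long input) {}

    static final int QUORUM = 2;

    // Counts how many clients reported each result; accepts one once it reaches the quorum.
    static String validate(Workunit wu, List<String> reportedResults) {
        Map<String, Integer> counts = new HashMap<>();
        for (String result : reportedResults) {
            int n = counts.merge(result, 1, Integer::sum);
            if (n >= QUORUM)
                return result;            // enough clients agree: accept this answer
        }
        return null;                      // no agreement yet: send out another copy
    }

    public static void main(String[] args) {
        Workunit wu = new Workunit("wu-17", 123456789L);
        // Two clients agree; one returned a wrong answer (flaky hardware, bad client, ...).
        List<String> reports = List.of("0.8515", "0.8431", "0.8515");
        String accepted = validate(wu, reports);
        System.out.println(accepted == null
                ? "no quorum yet, issue another replica of " + wu.id()
                : "accepted result " + accepted + " for " + wu.id());
    }
}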
Here are some of the projects