CSCI.4210 Operating Systems Fall, 2009, Class 21

CSCI.4210 Operating Systems Fall, 2009 Class 21
Distributed Computing

Review

In distributed computing, many, essentially independent computers, often widely distributed, can work together on a problem.

We have to expand our definition of an operating system to include any system which coordinates computation, even including widely distributed computation.

Distributed Mutual Exclusion

How would you implement mutual exclusion in a fully distributed system?

There are two methods. The first is a centralized system in which one node is considered to be the coordinator. Whenever a process wants to enter its critical region (CR), it first sends a message to the coordinator. If the coordinator knows that no other process is in its CR, it sends back an immediate OK reply (the conceptual equivalent of locking a mutex). If another process is in its CR, it does not reply until it gets another message from the process in the CR saying that it has left its CR (the conceptual equivalent of unlocking a mutex). If two or more messages arrive at the coordinator at the same time, it chooses one and sends an OK message, then puts the other requests on a queue.

The downside of this method is that the coordinator can be a bottle neck, and if it ever goes down, the entire system breaks (not to mention what happens if a process dies while it is in its CR).

An alternative is a fully distributed method. But before we discuss this, we have to talk about solving the problem of event ordering. Lamport (1978) developed a method of distributed event ordering. Each process involved in this system keeps an integer time stamp. An event with a lower time stamp is assumed to have happened before an event with a higher time stamp. Note that it may not have actually happened earlier in real clock time.

Whenever a process sends a message to another process, it attaches its time stamp. We make the reasonable assumption that the receipt of a message is always later than the time that the message was sent. If the receiving process thinks that the time is later than the time stamp of the sending process, everything is in agreement. However, if the receiving process thinks that the time is earlier than than the sending process (for example, the receiving process time stamp is 56, and the time stamp on the message from the sending process is 60), then the receiver sets its time stamp to one more than that of the sender, 61 in this case. There needs to be a simple tie breaking scheme as well, such as assigning a number to all of the processes, and arbitrarily specifying that the event from the lower numbered process occurred first.

This method assures that all processes agree on an ordering of events.

To implement mutual exclusion, a process that wants to enter its CR sends a message to every other process in the system saying that it wants to enter a CR, and it attaches its time stamp. It will not enter its CR until it receives an OK from every other node.

If a node receives this message and does not want to enter its CR, it responds with an OK immediately. If it is in the CR, it waits until it leaves the CR and then responds. If it wants to enter its CR but is not yet in, it uses the time stamp to determine which goes first, earlier request first.

Here is an example with three processes. P₀ sends everyone a request with timestamp 8. At the same time P₂ sends everyone a request with timestamp 12. P₁ does not want to enter its CR, so it sends an immediate OK to both. Processes 0 and 2 compare timestamps. P₂ sees that it has lost, so it sends an OK to P₀. P₀ queues the request from 2, and when it is finished with the CR, it sends an OK to P₂.

There are several downsides to this system.

Each process needs to know about every other process in the system.
It generate a lot of network traffic.
If any node goes down, the system breaks down

Some Distributed Computing Architectures

Document Based Middleware

The World Wide Web is based on a very simple protocol called http (Hypertext Transport Protocol). A client (a web browser) sends a GET request for a document to a web server, using TCP/IP. Here is a simple GET request to retrieve the course home page. It would be sent to the CS department web server www.cs.rpi.edu on port 80.

 
GET /academics/courses/os/index.html HTTP/1.0

A web browser would follow this with some additional information, but this statement would work .

The web server (apache) would retrieve the web page and send it back to the browser. The browser would then display this document.

The web was initially conceived to retrieve static documents such as this, but very quickly, it was modified to include new features.

cgi (The common gateway interface) This allows the client (the browser) to send a more specific request to the server. For example, a request to a stock market quote server might look something like this.

GET /quote?stock=IBM HTTP/1.0

Rather than returning a static document, the server would query a database to retrieve some information about the current price of a share of IBM stock, and it would format a document on the fly and send this to the client.

There are a number of widely used server technologies which have been developed to facilitate such cgi calls. Examples include the freeware tomcat which runs Java Server Pages. Commercial products include Microsoft ASP, BEA weblogic, Cold Fusion, and IBM WebSphere.

cookies Initially http was conceived as a stateless protocol. The server simply received requests and returned the appropriate document. However, for more elaborate interactions, it was necessary for the server to send some information to the client. This information is called a cookie. The browser stores this cookie and whenever it makes a new request to the server, it sends the cookie along, so that the server knows that this request is part of an ongoing session.

Object based Middleware

We have already mentioned Remote Procedure Calls and the java equivalent, Remote Method Invocation (RMI). The logical extension of this is distributed objects. The best attempt to implement this is CORBA (Common Object Request Broker Architecture).

Developed by the Object Management Group, it allows clients to access remote objects. Unlike RPCs and RMI, it is not limited to one language.

The Object Request Broker (ORB) is the key component. It manages all communication between its components. It allows objects to interact in a heterogeneous, distributed environment, independent of the platforms on which these objects reside and techniques used to implement them.

The functions of the ORB are as follows:

Lookup and instantiate objects on remote machines
Marshal parameters from one application object to the other
Handle security issues across machine boundaries
Retrieve and publish data on objects on local machine for another ORB to use
Invoke methods on a remote object using static method invocation
Invoke methods on a remote object using dynamic method invocation
Automatically instantiate objects that aren't currently running
Route callback methods to the appropriate local object being managed
Communicate with other ORBs using Internet Inter-ORB Protocol ( IIOP )

The Dynamic Invocation Interface (DII) API in CORBA provide a Dynamic Interface to a CORBA Object whose interface is not known at compile time to the client of the CORBA Object

CORBA seems to be going out of fashion.

File System Based Middleware

We have already discussed two instances of File System based middleware (what I call a networked file system), here is a third.

The Google File System

Google has some unique file system needs. They generate enormous numbers of files, and many of the files are enormous; the files have to be widely distributed; they have to have extremely fast access to enormous numbers of clients; and they have to be highly fault tolerant. As a result, they have developed their own unique file system GFS.

Here are some of its constraints

Scalability The file system has to support petabytes (2⁵⁰ Bytes) of data. The file system is growing rapidly.
Fault tolerant Google's hardware philosophy is to use inexpensive Commercial Off-the-Shelf (COTS) components rather than designing and building their own hardware. With many thousands of computers distributed in many different sites, failures of both hardware and software occur regularly, and the system has to have enough redundancy built in so that failures do not result in a loss of functionality.
Large file sizes Unlike most file systems, most of the files in the google file system are enormous, many are multi-GB. However, virtually all writes are appends. There are hardly ever any random writes to the middle of a file. Even most reads are sequential rather than random access.

To address these issues, Google designed their own file system from scratch. It uses some of the standard posix system calls like read, write, open, and close, but it has built in append and snapshot operations. The latter creates a copy of a file or directory.

Computers are grouped into clusters. A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Files are divided into 64MB chunks (No, that's not a typo) . Each chunk is identified by a globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range, bypassing the inodes and vnodes. For reliability, each chunk is replicated on multiple chunkservers. The default is three replicas.

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

With only one master per cluster, there is the danger that the master could become a bottleneck. To avoid this,clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.

Here is how a typical read works:

Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
The client sends the master a request containing the file name and chunkindex.
The master replies with the corresponding chunkhandle and locations of the replicas.
The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk.
Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.

A client can ask for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master communications, at essentially no cost.

The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages. This eliminates the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.

File system consistency is very important. To assure this, all mutations (e.g., file creation, writes and appends) are atomic. They are handled exclusively by the master; namespace locking guarantees atomicity and correctness. All metadata changes are logged, and the log file is replicated. In the event of a crash, the master can recreate the data by replaying the log file. The master periodically sets a checkpoint (storing all the metadata) and so the logfile only has to keep track of events since the last checkpoint. This keeps the logfile size manageable.

Data are also kept consistent. An append operation writes data at the end of a file, but the end is determined by the file system, not by where the client thinks the end is. This means that multiple concurrent appends will be handled correctly (the order of the two appends might be determined by a race condition, but both writes will be done correctly). Appends (and writes) happen to all of the replicas in a set order.

Each chunk has a version number, updated at every write or append, and so it is easy for the master to determine if a particular chunk is stale, i.e. not the latest version. All stale chunks are deleted. Once a problem is detected, the master creates new replications quickly, and so the only way that data in a chunk can be lost is if all of the replicas are lost before the file system can correct the problem. And even then, data may be lost, but the file system structure is consistent.

Here is a more detailed breakdown of the steps involved in a write or append.

The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses.
The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out.
Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.
The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
The secondaries all reply to the primary indicating that they have completed the operation.
The primary replies to the client. Any errors encountered at any of the replicas are reported to the client.

Unlike other file systems, GFS does not have a per-directory data structure that lists all the files in that directory and it does not support multiple hard or soft links to the same file. There is a single lookup table which maps full pathnames to the metadata. There is a sophisticated locking system which allows a range to be locked during a relatively long operation such as a snapshot (copy), but which allows concurrent operations on other fields in the namespace. Locks are obtained in a consistent global order to prevent deadlock.

Coordination based Middleware

This refers to systems that allow widely distributed computing on the Internet.

Jini

Jini (Pronounced gee nee, like the word genie) is a distributed system developed by Sun which allows a client on one machine to find and use a service on a different machine. A service may be a computation, storage, a communication channel to another user, a software filter, a hardware device, or another user. Two examples of services are printing a document and translating from one word-processor format to some other.

The essence of Jini is the lookup service (LUS). When a client wishes to use a service, it connects with the LUS, which returns a registrar object (other systems might call this a handle) which it uses to look up a particular service. Thus it is more powerful than a Remote Procedure Call or Java Remote Method Invocation, where the client needs to know the location of the service in advance.

Once it locates the service, the client communicates directly with it as it would if the service was on the local machine.

Services register with the lookup service, advertising their existence. These services can change dynamically, new services become available, old services are taken down.

The jini package itself provides the tools for service construction, lookup, and communication. These are available for many different platforms.

Here is an example.

A high resolution color printer is powered up and connected to the network.
The printer has to discover the Lookup Service by sending a multicast message. The LUS replies with a service registrar object.
The printer creates a service object of the correct type for printing services. It provides an array of LookupEntry instances to specify the attributes (i.e. paper size, color, resolution), and it sends this to the LUS. The service is now registered.
When the printer registers, it receives a lease. Periodically the printer will have to renew the lease with the LUS.
A digital camera (the client) would like to print some high resolution images. It first has to locate a LUS. The LUS returns a service registrar object.
the Camera creates a template, which is used to match and filter the set of existing services. This specifies the type, and any attributes (i.e. color, resolution) This is sent to the LUS and the LUS returns a single service object to the camera.
The LUS is no longer involved. The camera configures the printer, then sends the images to be printed. The printer sends a notification when it is done of if an error occurred.

Globus

Globus is an alliance to develop fundamental grid technologies to let people share computing power, databases, and other on-line tools across the Internet. It's main product is the Globus Toolkit, an open source software toolkit for building Grid based applications.

Here are some examples:

A scientist or business analyst needs to access data located in different storage systems across a scientific collaboration.
A business or scientific community must allocate computing, storage, and network resources dynamically for a time varying e-commerce or scientific data analysis workload
An engineer needs to design and operate experiments on remote equipment, linking and comparing numerical and physical simulations
An astronomy experiment must replicate a terabyte of data a day to partner sites worldwide.

Here is some functionality that is often required for such problems.

discovering available resources
configuring a computing resource to run an application or deploy a service.
Manage an application or service
Move data reliably from one site to another
Monitor system components
Control who can do what
manage user credentials and attributes

Globus makes extensive use of Web Services, XML based mechanisms for describing, discovering, and invoking network services.

The Globus Architecture

The Globus toolkit provides three sets of components

A set of service implementations. Services include
- Execution Management
- Data access and movement
- Replica management
- Monitoring and discovery
- Credentials management
- Instrument management
Three containers used to host user developed services, One in Java, one in C, and one in Python
A set of client libraries which allow clients to invoke operations

BOINC

BOINC, the Berkeley Open Infrastructure for Network Computing, is a middleware infrastructure for grid computing. It was originally developed for the SETI@home project but is now publicly available. It is used for very large scale scientific computation in which individual workunits can be distributed to volunteers. The volunteers perform their computations when their computers are not being used for other purposes, and when they have completed a particular computation, they return the results to the server and receive a new assignment. The clients can be running many different operating systems, and they can go down at any time, and so the server(s) have to take this into account. BOINC has been used for projects in drug discovery, mathematics, climatology and astrophysics.

The architecture consists of a server system and client software that communicate with each other to distribute, process, and return workunits.

The server complex of a BOINC project is centered around a relational database that stores descriptions of applications, platforms, versions, workunits, results, accounts and so on. it provides tools for creating, starting, stopping, and querying projects, adding new applications, adding new platforms, creating workunits and monitoring server performance.

BOINC has to be able to deal with erroneous computations (i.e. wrong answers). To do this, it supports redundant computation. If different clients return different answers to the same workunit, it distributes yet more identical workunits to other clients until there is agreement.

Here are some of the projects

SETI@home The Search for ExtraTerrestial Intelligence
Predictor studies protein dynamics and mechanics. (Based at the Scripps Research Institute)
Folding@home Studies protein folding, misfolding, and aggregation (base at Stanford University)
Climateprediction.net aims to quantify and reduce the uncertainties in long-term client prediction based on computer simulations.
Climate@home This project has similar goals to the above project.
Einstein@home A project to detect certain types of gravitational waves such as those from spinning neutron stars, that can be detected only by using highly selective filtering techniques that require extreme computing power.