CSCI.4210 Operating Systems Fall, 2009 Class 20
Multiprocessors, Multicomputers

Review of Multiprocessor Systems

There are three different models of multiprocessor systems.

In the last class we discussed shared memory multiprocessors.

Multicomputers

The second computer system architecture is the multicomputer, in which each CPU has its own memory but the CPUs are tightly coupled. Processes can no longer use shared memory for communication, so interprocess communication is done with message passing. Because the messages need to go over a network, message passing communication is several orders of magnitude slower than communication through shared memory. A group of tightly coupled computers like this is called a cluster.
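
To make this concrete, here is a minimal sketch of message passing between two nodes, written against MPI, a message-passing library commonly used on clusters. (MPI is not covered in these notes; it appears here only as an illustration, and the message contents are made up.)

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    char buf[64];

    MPI_Init(&argc, &argv);                /* join the cluster job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which node am I? */

    if (rank == 0) {
        strcpy(buf, "hello from node 0");
        /* send the buffer to node 1; no shared memory is involved */
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* block until the message arrives over the interconnect */
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("node 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}

Note that the send and receive each name the other node explicitly; every byte exchanged crosses the network, which is why this is so much slower than a shared-memory write.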

Clusters provide a number of attractive features.

There are several methods of connecting the nodes together. One obvious method is using Ethernet, which is a broadcast protocol. However, other systems use various point-to-point protocols. For example, the nodes can be connected using a grid or mesh.

One interesting example of this is a hypercube. The number of nodes (CPUs) in a hypercube should be a power of 2. Picture a cube, with the corners as nodes and the edges between them as connections. There are eight nodes, and no node is more than three hops from any other node; it takes three hops to get from the top left front node to the bottom right back node.

A four-dimensional hypercube would have 16 nodes, and a maximum of four hops to get from one node to another.

A five-dimensional hypercube would have 32 nodes and a maximum of five hops to get from one node to another, and so on.
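
The arithmetic behind these hop counts is simple. Label each node of an n-dimensional hypercube with an n-bit number, and connect two nodes exactly when their labels differ in one bit; the number of hops between two nodes is then the number of bit positions in which their labels differ (the Hamming distance), which is at most n. A short sketch in C:

#include <stdio.h>

/* Hops between two hypercube nodes = number of 1 bits in the
 * XOR of their labels (the Hamming distance). */
int hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;   /* bits where the labels differ */
    int count = 0;
    while (diff) {
        count += diff & 1;
        diff >>= 1;
    }
    return count;
}

int main(void)
{
    printf("%d\n", hops(0, 7));   /* 3-cube corners 000 and 111: 3 hops */
    printf("%d\n", hops(0, 15));  /* 4-cube corners 0000 and 1111: 4 hops */
    return 0;
}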

Clusters often include middleware to provide a single image to all of the users; that is, the fact that there are numerous nodes is transparent to the typical user. A user logs into the cluster, not onto a particular machine in the cluster, and sees a single file system, such as NFS, a common user interface, and a common view of resources and devices.

Even though each node in the cluster has its own memory, it is possible for a cluster to provide a single memory image to all of the CPUs; this is called Distributed Shared Memory (DSM). Each page is in the memory of one of the nodes, but all nodes have access to it. This is made possible by virtual memory. When a CPU tries to access a page that it does not have in its page table, the OS locates the page on the network (perhaps on some other computer), the page is sent over the network to the requesting node, and there it is mapped into the page table.

This can result in inconsistencies if a CPU writes to its own copy of a page. As a result, whenever a process is about to execute a write to a virtual memory page, a message is sent to all the other CPUs which hold a copy of that page, telling them to unmap and discard it.
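
Here is a sketch of what the fault-handling path might look like. The helper functions (page_number, page_present, locate_owner, and so on) are invented for illustration; a real DSM implementation is considerably more involved.

/* Hypothetical helper interface, invented for this sketch */
extern int  page_number(void *addr);
extern int  page_present(int page);
extern int  locate_owner(int page);
extern void fetch_page(int owner, int page);
extern void map_page(int page, int prot);
extern void broadcast_invalidate(int page);
#define READ_ONLY  0
#define READ_WRITE 1

/* Called by the kernel when a process faults on a DSM page */
void dsm_fault(void *addr, int is_write)
{
    int page = page_number(addr);

    /* If the page is not mapped locally, fetch a copy over the
     * network from the node that currently holds it. */
    if (!page_present(page)) {
        int owner = locate_owner(page);
        fetch_page(owner, page);
    }

    if (is_write) {
        /* Before writing, tell every other node holding a copy to
         * unmap and discard it, so no stale copies survive. */
        broadcast_invalidate(page);
        map_page(page, READ_WRITE);
    } else {
        map_page(page, READ_ONLY);
    }
    /* the faulting instruction is then restarted */
}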

Load balancing is a major issue in such systems. Once a process has been assigned to a node, it is difficult to move it, so if the system is busy it is important to keep all of the processors occupied. It is easy for a situation to develop where one node is overworked while another sits idle.

There are a number of algorithms for this, depending on how much information is available to the scheduler. Here are three:

In order to implement load balancing, there must be a mechanism to migrate a process from one node to another. There are several ways to do this.

A final issue which has to be resolved in process migration is the status of outstanding signals and messages.

The standard cluster software for a bunch of Linux machines is Beowulf.

Virtualization

A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. Java programs typically run in a virtual machine, the Java Virtual Machine (JVM), which is shipped as part of the Java Runtime Environment (JRE). Cygwin is a virtual machine of sorts.

Virtualization as applied to operating systems refers to running several different OSs (or multiple instances of the same OS) on a single machine. This concept has been around for a long time; in the 1970s, IBM had a mainframe operating system called VM in which each user was given their own instance of the OS. The company VMware has made virtualization popular, although there are other virtualization technologies such as Solaris Zones. Virtualization has become so important that Intel has added features to its processors to assist it. This technology, which Intel calls VT, creates containers in which virtual machines can be run.

Here are some advantages of virtualization.

The main disadvantage is that it is less efficient than a real machine because it has to access the hardware indirectly.

One obvious use is software testing. A company can test software on multiple platforms, including different versions of an operating system. Also, each guest OS can be treated as a sandbox: an environment, isolated from the production environment, in which to test potentially buggy or virus-infected code.

The underlying software that controls all of this is called a hypervisor, and it is the true operating system. The other operating systems (the ones that the users see) are called guest operating systems. Whenever a process on a guest OS tries to execute a privileged instruction, the attempt is intercepted by the hypervisor.

There are two types of hypervisors, Type 1 and Type 2.

A type 1 hypervisor runs on the bare metal; a type 2 hypervisor runs as a process on the host OS (Linux, Windows, or whatever).

There are a number of problems that virtualization has to solve. The first is that the guest operating system will have to execute privileged instructions (recall from the first class that there are some processor instructions which should only be executed in kernel mode). In all types of virtualization, these privileged instructions are trapped by the hypervisor rather than run directly, and the hypervisor then executes them.

VMware uses a type 2 hypervisor. In this case, as guest code is run, it is inspected to find basic blocks, that is, straight-line runs of instructions ending in a jump or other control transfer. If a block contains privileged instructions, the hypervisor rewrites them into code that it handles itself; otherwise, the block runs directly on the hardware, so there is no slowdown. This can work because once a block has been inspected and cached, it does not need to be reprocessed.

Another problem is memory virtualization. Each guest OS thinks that it can see the entire memory, and of course each has its own page table. What happens if two different guests try to map to the same physical page? The hypervisor solves this problem by creating a shadow page table that maps the virtual pages used by the virtual machine onto real pages. This can result in a performance penalty, because every change to a guest's virtual page table requires an update to the shadow page table as well.
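
A rough sketch of the bookkeeping, with invented types and helpers (guest_to_host stands in for the hypervisor's private record of which real frames it gave each VM):

typedef unsigned long pfn_t;    /* page frame number */

/* Hypothetical helpers, invented for this sketch */
extern pfn_t guest_to_host(int vm, pfn_t guest_pfn);
extern void  shadow_set(int vm, pfn_t vpage, pfn_t host_pfn);

/* Called when the hypervisor intercepts a guest writing one of its
 * own page table entries: virtual page vpage now maps to what the
 * guest believes is physical frame guest_pfn. */
void on_guest_pte_write(int vm, pfn_t vpage, pfn_t guest_pfn)
{
    /* Translate the guest's notion of a physical frame into the
     * real frame the hypervisor assigned to this VM, so two guests
     * that pick the same "physical" frame land on different ones. */
    pfn_t host_pfn = guest_to_host(vm, guest_pfn);

    /* Mirror the change into the shadow table that the MMU actually
     * walks; this extra update is the performance penalty. */
    shadow_set(vm, vpage, host_pfn);
}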

Distributed Computing

Both multiprocessor systems and multicomputers are more or less tightly coupled; typically all of the computers are in the same room, and often in the same rack. The idea behind distributed computing is that many essentially independent computers, often widely distributed, can work together on a problem. A current initiative in our department is world wide computing, sometimes called Grid Computing, in which computers all over the world work together, communicating over the Internet.

There are few special implications for the design of operating systems for computers which will be doing loosely coupled, distributed computing. Each node simply runs its own native operating system. Note that it is possible, and indeed likely, that different nodes are running different operating systems. Interprocess communication between processes on different computers is done using standard protocols over the network.

We can expand our definition of an operating system to include any system which coordinates computation, even including widely distributed computation. If we do this, a distributed system introduces new problems.

One problem with distributed computing over the Internet is that there can be significant and unpredictable delays in transmission. This is a problem for some applications, such as distributed games, because it is important to have a consistent global state, and that is difficult when communication lags are substantial and unpredictable. For example, suppose Alice and Bob are playing a distributed game in which users score points by killing bad guys, and both have a chance to kill the same bad guy at about the same time. Alice may kill the bad guy on her machine while Bob kills the same bad guy on his, because the news that Alice had already killed him has not yet reached Bob's machine. This is an inconsistent state: each player thinks that they should get the points for killing the bad guy.

For a more formal analysis of the problem, I'll use a more mundane example (borrowed from Stallings, Operating Systems). An individual has accounts in two different branches of a bank, and every day at 3:00 each branch sends its balance to a central server. At about 3:00, each account has $100, and the account holder transfers $20 from the account in Branch A to the account in Branch B. There are two scenarios in which this could result in the wrong answer being stored on the central server: the server could record Branch A's balance after the transfer was sent ($80) but Branch B's before it arrived ($100), for a total of $180 rather than $200; or it could record Branch A's balance before the transfer ($100) and Branch B's after it arrived ($120), for a total of $220.

One solution to this problem is the distributed snapshot algorithm. This assumes that messages are delivered in the order that they are sent, and no messages are lost. Note that TCP satisfies this requirement.

The algorithm uses a control message called a marker. Some process initiates the algorithm by recording its state and sending out a marker on all of its outgoing channels before any other messages are sent. Each process p, when it first receives the marker from another process, say q, does the following: it records its own local state, records the state of the channel from q to p as empty, and sends out a marker on all of its own outgoing channels before sending any other messages.

At any time after recording its state, when process p receives a marker from process r, it records the state of the channel from r to p as the sequence of messages that it received from the time it recorded its local state to the time that it received the marker from r.

Each process terminates when it receives the marker from all other processes. A globally consistent state can be recorded by having every process send its state data along every outgoing channel.
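
This is the Chandy-Lamport snapshot algorithm. Here is a sketch of the logic at one process; the message-passing helpers are invented for illustration.

#define NPROC 3                 /* number of processes, for illustration */

/* Hypothetical helpers, invented for this sketch */
extern void send_marker_on_all_channels(void);
extern void record_channel_message(int from, const char *msg);

static int recorded = 0;        /* have we recorded our local state? */
static int marker_seen[NPROC];  /* marker received on channel from q? */

static void record_local_state(void)
{
    recorded = 1;
    /* ... save balances or whatever constitutes local state ... */
    send_marker_on_all_channels();  /* markers go out before any
                                       further ordinary messages */
}

/* Called when a marker arrives on the channel from process q */
void on_marker(int q)
{
    if (!recorded)
        record_local_state();   /* first marker: the channel from q is
                                   recorded as empty, since FIFO delivery
                                   means nothing sent before the marker
                                   is still behind it */
    marker_seen[q] = 1;         /* the channel from q is now closed */
}

/* Called when an ordinary message arrives from process q */
void on_message(int q, const char *msg)
{
    if (recorded && !marker_seen[q])
        record_channel_message(q, msg); /* in flight when the snapshot
                                           started: part of the state
                                           of the channel from q */
    /* ... then process the message normally ... */
}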

Using a marker to solve our bank problem is trivial. Suppose that when the server thinks that it's 3:00, it sends out a marker. Consider our first case, where Branch A has sent the message for the transfer but Branch B has not yet received it. Branch A receives the marker from the server and records its state (balance $80). Branch B receives the marker from the server and records its state (balance $100). But we're not done yet. Branch A sends the marker to Branch B, and by the rules that we set, it cannot arrive until after the transfer message; so when the marker does arrive, Branch B has already received the transfer message and records the $20 as part of the state of the channel from A. Branch B also sends the marker to Branch A. Now the algorithm terminates, and the recorded states are consistent: they account for the full $200.

Some Distributed Computing Architectures

The text (and everyone else) talks about middleware as if it were a separate layer in the protocol stack, but it isn't; it is just a set of protocols for getting work done on a distributed network. I will discuss several of these systems.

Document Based Middleware

The World Wide Web is based on a very simple protocol called HTTP (the Hypertext Transfer Protocol). A client (a web browser) sends a GET request for a document to a web server, using TCP/IP. Here is a simple GET request to retrieve the course home page. It would be sent to the CS department web server www.cs.rpi.edu on port 80.

GET /academics/courses/os/index.html HTTP/1.0 
A web browser would follow this with some additional header lines, but this request alone would work.

The web server (Apache) would retrieve the web page and send it back to the browser. The browser would then display this document.
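
There is nothing magic about the browser; any program that can open a TCP connection can speak HTTP. Here is a bare-bones sketch using the ordinary sockets API (most error handling omitted):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    const char *req =
        "GET /academics/courses/os/index.html HTTP/1.0\r\n\r\n";
    char buf[4096];
    int sock, n;

    /* look up the web server's address, port 80 */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo("www.cs.rpi.edu", "80", &hints, &res) != 0)
        return 1;

    sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (connect(sock, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    /* send the same GET request shown above; the blank line ends it */
    write(sock, req, strlen(req));

    /* print whatever comes back: headers first, then the page */
    while ((n = read(sock, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, n, stdout);

    close(sock);
    freeaddrinfo(res);
    return 0;
}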

The web was initially conceived to retrieve static documents such as this, but very quickly, it was modified to include new features.

CGI (the Common Gateway Interface). This allows the client (the browser) to send a more specific request to the server. For example, a request to a stock market quote server might look something like this:

GET /quote?stock=IBM HTTP/1.0
Rather than returning a static document, the server would query a database to retrieve the current price of a share of IBM stock, format a document on the fly, and send it to the client.
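
On the server side, a CGI program is just an ordinary program: the server puts everything after the ? into the QUERY_STRING environment variable, runs the program, and sends its standard output back to the client. A toy version in C (the price is made up; a real quote server would query a database):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* the web server passes "stock=IBM" in QUERY_STRING */
    char *query = getenv("QUERY_STRING");
    char *stock = "unknown";

    if (query != NULL && strncmp(query, "stock=", 6) == 0)
        stock = query + 6;

    /* a CGI program writes an HTTP header, a blank line, and then
     * the document itself, all on standard output */
    printf("Content-Type: text/html\r\n\r\n");
    printf("<html><body>\n");
    printf("<p>Quote for %s: $127.50 (a made-up price)</p>\n", stock);
    printf("</body></html>\n");
    return 0;
}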

There are a number of widely used server technologies that have been developed to facilitate such CGI calls. Examples include the free Apache Tomcat server, which runs Java Server Pages; commercial products include Microsoft ASP, BEA WebLogic, ColdFusion, and IBM WebSphere.

Cookies. Initially HTTP was conceived as a stateless protocol: the server simply received requests and returned the appropriate documents. However, for more elaborate interactions, it was necessary for the server to send some information to the client to be held between requests. This information is called a cookie. The browser stores the cookie, and whenever it makes a new request to the same server it sends the cookie along, so that the server knows that the request is part of an ongoing session.
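
The mechanics are just two more HTTP headers. The server includes a Set-Cookie header in its response, and on every later request to that server the browser echoes the value back in a Cookie header (the session id below is made up):

HTTP/1.0 200 OK
Set-Cookie: session=abc123
(rest of the response follows)

GET /quote?stock=IBM HTTP/1.0
Cookie: session=abc123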
