CSCI.4210 Operating Systems Fall, 2009, Class 17

Computer Networking

Networking

Computers have been connected together since the 1970's, but in the 1990's, the Internet created a revolution in how computers were used. Now essentially all computers are connected to a network of some sort, and the notion of a computer as a free-standing box on your desk is obsolete. Since most of what this course has covered so far has treated a computer as a free-standing box, this means that a lot of what we have discussed so far is rapidly becoming obsolete. To an increasing extent, computing itself, and the implementation of an operating system is distributed; i.e., spread across several computers. The logical image of a file system is a set of files on a hard drive on one computer, but many (most) file systems are distributed now. Many large applications such as databases are distributed, with the data residing on one computer and the users connecting to it remotely. Some fundamental operating system functions can be done remotely; we can talk about remote procedure calls in which a process on one computer executes a procedure call which is executed on another computer.

Before we can discuss how these are implemented, we need some basic networking terminology.

computer network - an interconnected collection of autonomous computers, i.e. they can exchange info
Tightly coupled - (multiprocessor computers) - computers that share memory and a clock. These generally need to be located in the same box.
Loosely coupled (Distributed System) - a collection of processors that do not share memory or a clock. Often this term is used to describe a network where the existence of multiple autonomous computers is transparent to the user. The system explicitly finds an appropriate computer to run something without the user knowing.

There are two types of network technology

broadcast networks - a single communication channel shared by all the machines on the network. All packets are sent to all machines.
point-to-point networks - many connections between individual machines.

There are two broad classes of networks based on their scale.

LANs (local area networks) - privately owned networks spanning up to a few kilometers.
WANs (wide area networks) - networks spanning a large area, often a country or the whole planet.
Networks have two components, transmission lines and switching elements. The switching elements are computers that connect the lines and route packets. These include routers, gateways, bridges and other kinds of switches.
In general, LANs are broadcast and WANs are point-to-point, but there are exceptions.
Protocol - a set of rules governing the operation of and communication on a network.
The Internet - a global network of networks (LANs and WANs). This is important because different networks run different protocols, so there needs to be a way for different protocols to communicate with each other.

The architecture of the Internet

The Internet is a worldwide network of networks that connects devices.

These devices, called hosts or end systems were always computers until recently, and most still are, but increasingly these end systems are devices such as PDAs, GPS systems, environmental sensing devices, cell phones, webcams, even refrigerators. No one has a good idea of how many such devices are on the Internet at any given time, because devices are added and removed continually, and there is no central authority to register or control these.

We can talk loosely of three tiers of networks. The outermost tier, tier 3, is the network of a company, a college, a government agency or other organization which has hosts connected to it. These have one or more connections to Internet Service Providers (ISPs), which constitute tier 2. These sometimes have direct connections to other ISPs, but they always connect to one or more Tier 1 providers, the Internet Backbone. These are large telecommunications providers such as Sprint, MCI, Qwest, Cable and Wireless, or AT&T.

All Tier 1 providers connect to all other tier 1 providers, and each is connected to a number of tier 2 ISPs.

Suppose an RPI student wants to order a pizza on the web. He sends a request from his computer in his dorm room. This request is routed through the RPI network, eventually reaching RPI's Edge Router or Gateway, which passes the request on to RPI's ISP, Broadwing. It may take several hops within the Broadwing network before it is passed on to an Internet backbone provider. It may take several hops inside this network before being passed on to another backbone provider, which in turn passes it on to one of its tier 2 ISP customers. This ISP passes the request on to the pizza web server. The pizza web server sends a reply back to the student's computer in the same fashion, although the reply may not necessarily take the same route through the network.

A typical message may take 15 or more hops to get from its source to its destination, regardless of whether its destination is just down the street or halfway around the world.

Protocols

Devices are connected to the Internet in many different ways. Many computers are on a local area network (LAN) running Ethernet; some can be connected through a wireless hub; home users can connect to their ISPs with a dial-up connection, a cable connection or DSL, and newer devices connect via satellite.

There has to be agreement between a sender and a receiver on how to communicate. Such an agreement is called a Protocol. A protocol defines the format and order of messages exchanged between communicating entities and actions taken in response to various messages.

Internet protocols are developed by the Internet Engineering Task Force (IETF). The standards documents are called Requests for Comments (RFCs). The first RFC was published in 1969. As of October, 2009, there were 5734 RFCs.

All end user systems and all of the intermediate nodes on the Internet communicate using the Internet Protocol (IP). IP is a packet switching protocol. This concept is fundamental to understanding the Internet.

We can divide networking protocols into two broad categories, connection-oriented and connectionless. In a connection oriented protocol, the two ends of communication contact each other, agree on parameters, and sometimes determine a path through the network and reserve resources prior to any actual data being transmitted. After the communication is done, the reserved resources are freed up. The analogy is a traditional phone call.

A connectionless network is the opposite; messages are sent to the network, and each intermediate node simply passes the message on until it reaches its destination. The analogy here is the post office.

IP is a connectionless protocol. Another name for this is packet switching. Each intermediate node is called an IP Router. A router receives a packet from some other router, looks at the destination address, and passes it on. There are no acknowledgments (at this level).

A sender breaks a large message down into small chunks, called packets. Of course the packets are sent in order, but there is no guarantee that they will arrive at their destination in the same order that they were sent. Packets can get dropped, and different packets to the same destination can take different routes through the network, so it is the responsibility of the receiver to reassemble the packets into a message. (Different packets of a stream taking different routes is a theoretical possibility, but in reality it is extremely rare)

IP is an unreliable protocol; this does not mean that it loses a lot of information (in fact the Internet is now extremely reliable). The term unreliable when applied to a protocol means that it does not guarantee delivery; packets are forwarded on a best effort basis. A reliable protocol is one which acknowledges receipt of messages in some fashion.

There are tradeoffs to using a packet switching network architecture vs a connection-oriented design. A packet switching network can be much more efficient, efficiency defined as the percentage of the available bandwidth that is actually used. With a connection oriented protocol that reserves bandwidth prior to any communication, not only is there overhead associated with setting up and tearing down the communication channel, but during most communication sessions, there is a significant amount of down time when no data is being transferred, but the resources are still reserved and thus are not available for other sessions.

Packet switching uses store and forward transmission. Each link has a set of buffers. It reads a packet from a link, stores it in one of the buffers, figures out where to send it, and passes it on. It is possible under conditions of heavy usage that packets can arrive at a node faster than the router can process them, resulting in buffer overflow and lost packets.
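The buffer overflow described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the notes: time advances in discrete ticks, a fixed number of packets arrives per tick, and the router forwards a fixed number per tick. The function name and parameters are hypothetical.

```python
from collections import deque


def simulate_router(arrivals_per_tick, service_per_tick, buffer_size, ticks):
    """Return (forwarded, dropped) packet counts for one output link."""
    buffer = deque()
    forwarded = 0
    dropped = 0
    next_id = 0
    for _ in range(ticks):
        # Store: each arriving packet is queued if a buffer is free.
        for _ in range(arrivals_per_tick):
            if len(buffer) < buffer_size:
                buffer.append(next_id)
            else:
                dropped += 1  # buffer overflow: the packet is lost
            next_id += 1
        # Forward: send out at most service_per_tick queued packets.
        for _ in range(min(service_per_tick, len(buffer))):
            buffer.popleft()
            forwarded += 1
    return forwarded, dropped


if __name__ == "__main__":
    print(simulate_router(1, 2, 4, 10))  # light load
    print(simulate_router(3, 2, 4, 10))  # arrivals outpace the router
```

Under light load nothing is dropped; once arrivals exceed what the router can forward, the buffer fills and every excess packet is lost.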

With a virtual circuit system, the actual routing process can be done more quickly than with datagrams.

Protocol Stacks

The concept of a protocol stack underlies much of the study of networking. A protocol stack is a set of protocols that work together to transmit information from one computer to another. The protocols are layered, and each layer has the illusion that it is communicating with the equivalent layer on the other computer, but in fact it is communicating with the layers above it and below it in the stack.

An analogy might be President Bush talking to Vladimir Putin, the President of Russia. President Bush speaks English (sort of), and President Putin speaks Russian (He might speak English as well, but let's pretend that he doesn't). Bush says something in English, looking at President Putin, but in fact he is speaking to a translator. The translator translates what he says into Russian. President Putin replies in Russian, looking at President Bush, but in fact he is speaking to a translator.

The archetype protocol stack is the ISO (International Organization for Standardization) OSI (Open Systems Interconnection) Reference Model. This is a seven layer protocol stack on which all other protocol stacks are based. No "real world" communication systems actually use this model in its entirety. Here is the ISO OSI protocol stack.

There are so many good descriptions of this model that I am not going to try to describe the seven layers. Here is a good link. You are responsible for this material.

The Wikipedia description of the OSI model

The Internet runs a four level protocol stack.

This diagram shows two computers, labeled Host A and Host B, and two Intermediate Routers. In practice there would be many intermediate switching elements, but only two are shown. An application on Host A wants to communicate with a peer application on Host B. The two applications must speak the same protocol. A typical example would be a web browser which wishes to request a document from a web server in a distant city. The protocol that web browsers and web servers use to communicate is HTTP, which stands for HyperText Transfer Protocol.

The Application Layer (the web browser) on Host A has the illusion that it is communicating directly with the server on Host B (pardon the anthropomorphism), but in reality it sends its message to the Transport Layer software on the same computer. In the diagram, the dotted arrow represents the illusion, the solid arrow represents reality.

The Transport Layer is responsible for making sure that complete messages are delivered end to end. This may sound like a trivial problem but it is not, because messages are often broken up into chunks as they are sent over the Internet, and it is possible for these chunks to get lost. Also, they may not arrive in the same order that they are sent.

The Transport Layer Protocol that is generally used on the Internet is TCP, the Transmission Control Protocol. This will be described in more detail below. For the moment, you need to know that the TCP layer on the sending computer establishes a connection with the TCP layer on the receiving computer, and they talk TCP to make sure that the message is received in its entirety and free of errors.

TCP may break a large message into smaller segments. The Segment is the unit of transmission.

The two TCP layers have the illusion that they are talking to each other, but in reality, they communicate with the Network Layer. The only network layer protocol used on the Internet is the Internet Protocol (IP).

There is a second transport layer protocol called User Datagram Protocol (UDP), which is also widely used on the Internet. In contrast to TCP, which is connection oriented and reliable, UDP is connectionless and unreliable. UDP is used when speed is of the essence, such as with a file server.

The Network Layer (called the Internet Layer in the diagram) is responsible for routing messages from one place to another. All routers on the Internet run the IP protocol. Each has several possible output lines and it has to figure out which output line to send each packet in order to get it to its destination.

In the Protocol stack diagram above, there are two Intermediate Routers. The top two layers, the Application Layer and the Transport Layer, run only on the two end computers, but the lower two layers, the Network Layer and the Link Layer, run on each intermediate node as well. Recall that although there are only two Intermediate Routers shown, there may be many such switching elements between the two hosts.

The bottom layer is the Link Layer. This is responsible for actually translating the software message into a physical representation and putting it on the wire (or through the air in a wireless network). This is an enormously complex undertaking, but is primarily in the realm of computer engineering rather than computer science, so for this course, we can just assume that there is a physical layer without going into too much detail.

The unit of transmission on the link layer is the frame.

There are numerous different physical layer protocols, and a message which takes a number of hops on the Internet to get from one host to another will be translated into a number of different physical representations. A typical physical layer protocol is IEEE 802.3, commonly known as Ethernet.

Each layer of the protocol stack on the sender side does its work by attaching a header (and sometimes trailing information as well) to the message which is passed down from the next higher layer in the stack. The Transport Layer receives a message from the application; it attaches a TCP header onto the front and passes this down to the network layer. The network layer prepends an IP header onto the front of this and passes it on to the physical layer. The physical layer (Ethernet for example) attaches a header (and a checksum trailer) to this message and sends it to the next switching element.

In theory, the message received at each layer is identical to that sent by the corresponding peer at the other end. (Nit picking readers can find instances where this is not the case, such as the TTL field in the IP header)

The physical layer of the receiver reads the header information, strips the header (and trailer) off and passes the remainder to the network layer. The network layer reads the IP header (ignoring the rest of the message). If this is the final destination of the message, the network layer strips off the IP header and passes the remainder of the message up the stack to the TCP layer. Otherwise, the network layer determines where to send the message for its next hop and passes the message back down the physical layer for its next journey.

The Transport layer on the receiving host reads the TCP header, strips it off, and passes the message up to the appropriate application process.
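The way headers are added on the way down the stack and stripped on the way up can be sketched in a few lines of Python. This is a toy illustration only: the bracketed tags such as [TCP] and [IP] are made-up stand-ins for real binary headers, and all function names are hypothetical.

```python
TCP_HDR = b"[TCP]"      # stand-in for a real TCP header
IP_HDR = b"[IP]"        # stand-in for a real IP header
ETH_HDR = b"[ETH]"      # stand-in for an Ethernet header
ETH_TRAILER = b"[FCS]"  # stand-in for the Ethernet checksum trailer


def encapsulate(message: bytes) -> bytes:
    """Sender side: each layer wraps what the layer above handed down."""
    segment = TCP_HDR + message            # transport layer
    packet = IP_HDR + segment              # network layer
    return ETH_HDR + packet + ETH_TRAILER  # link layer adds header and trailer


def strip_header(data: bytes, header: bytes) -> bytes:
    """Remove one layer's header, checking that it is really there."""
    if not data.startswith(header):
        raise ValueError("unexpected header")
    return data[len(header):]


def decapsulate(frame: bytes) -> bytes:
    """Receiver side: each layer strips its own header and passes the rest up."""
    if not frame.endswith(ETH_TRAILER):
        raise ValueError("missing trailer")
    packet = strip_header(frame[:-len(ETH_TRAILER)], ETH_HDR)
    segment = strip_header(packet, IP_HDR)
    return strip_header(segment, TCP_HDR)


if __name__ == "__main__":
    frame = encapsulate(b"GET /index.html")
    print(frame)
    print(decapsulate(frame))
```

An intermediate router would stop after stripping the link layer header, read the IP header, and then wrap the packet in a new link layer frame for the next hop.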

There are many different protocols at each level. Here are some representative protocols for the Internet.

Application Layer - HTTP, telnet, ftp, email, VoIP

Transport Layer - TCP, UDP

Network Layer - IP

Link Layer - Ethernet, WiFi, ATM, X.25, Frame Relay

The Network Layer, IP

The purpose of the network layer is to route packets from source host to destination host. In a point-to-point network such as the Internet, there are two models; virtual circuits, in which the complete path is laid out prior to any data being transmitted, or best effort packet switching. The Internet Protocol (IP) uses the latter. Most other protocols, such as ATM or frame relay, use virtual circuits. Virtual Circuits can provide higher reliability and more stable delivery times (all packets take the same route), at the cost of bandwidth efficiency.

IP Routers

There is only one protocol running on the network layer of the Internet, IP, the Internet Protocol. The current version is version 4. This is a remarkably stable protocol; it has been around for nearly 30 years.

The intermediate nodes of the Internet are called IP Routers. An IP router has five components

A set of input ports (these perform the data link functions)
A set of output ports (ditto)
A switching fabric connecting these two
A routing processor to determine which output line to use for each incoming packet.
A routing table which provides data to the processor. Each entry in the routing table has two fields, the destination address and the output line (there are other less important fields as well).

The job of an IP Router is to receive packets (sometimes known as datagrams) on its input ports and forward each one onto the next hop on one of its output ports. It may have to buffer the packet briefly while it determines which output line to use. The decision about which output line to use is made by reading the destination address of the packet and looking this address up in the routing table. Once the entry is found, the router passes the packet on to the next hop.

Large routers may have to process a million or more packets a second, so the routing table search cannot be linear. Large Cisco routers have 64K of content addressable memory for each input port so it can perform the lookup in constant time. Typically a router will handle packets on a First-In-First-Out (FIFO) basis, since they do not have time to process them on the basis of Type of Service or precedence.

traceroute There is a utility called traceroute which can be used to determine a route that packets take to get from your computer to any destination on the Internet. Here is the route that packets take to get from mary-kate to slashdot.

The IP Header

Version - The current version is 4, soon to be upgraded to 6.
Internet Header Length - The unit is 32 bit words; the value is usually 5, but may be more if there are options.
Type Of Service - The first three bits are precedence, from 0 for routine up to 7 for control messages. Other bits are minimize delay, maximize throughput, and maximize reliability. Production routers in the Internet almost always ignore this field.
Total length - The unit is bytes, so a packet can theoretically be up to 64K bytes, but in practice, they are always much smaller.

The next three fields deal with fragmentation. Each link has a Maximum Transmission Unit (MTU), the maximum size of the payload of a frame. It is 1500 bytes for Ethernet but may be larger or smaller for other protocols. If a packet is too big to be sent over a particular physical layer, it is fragmented, that is, divided into smaller chunks. These are reassembled at the destination.

Identification - All fragments of a particular packet have the same identification number so that the destination can determine which fragments belong to which packets.
Flags - There are three one bit flags, but only two are used. If the do-not-fragment bit is set, then the packet will not be fragmented. If it happens to come to a network where it is too big to send without fragmenting, it is dropped, and an Internet Control Message Protocol (ICMP) message is sent to the sender. If the more-fragments flag is set, this means that this fragment is not the last fragment; more fragments of the same packet are on the way.
Fragment Offset - The offset from the start of the packet where this fragment belongs. The unit is an 8 byte word. The size of all fragments except the last must be a multiple of 8 bytes.

For example, a packet carrying 3980 bytes of data that crosses a link with an MTU of 1500 bytes (a 20 byte IP header plus up to 1480 bytes of data per fragment) is split into three fragments. Here flag is the more-fragments flag.

Frag 1 1480 bytes of data, id is 777, offset=0, flag=1
Frag 2 1480 bytes of data, id is 777, offset=185 (185*8=1480), flag=1
Frag 3 1020 bytes of data, id is 777, offset=370, flag=0
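The arithmetic behind these three fragments can be checked with a short Python sketch. It assumes a 20 byte IP header with no options; the function name and the dictionary layout are hypothetical, chosen only to mirror the example.

```python
def fragment(data_len, mtu=1500, header_len=20, ident=777):
    """Split data_len bytes of payload into fragments that fit the MTU."""
    # Data per fragment must fit in the MTU after the header
    # and be a multiple of 8 bytes (the unit of the offset field).
    max_data = (mtu - header_len) // 8 * 8
    fragments = []
    offset = 0  # byte offset into the original payload
    while offset < data_len:
        size = min(max_data, data_len - offset)
        more = 1 if offset + size < data_len else 0  # more-fragments flag
        fragments.append(
            {"id": ident, "bytes": size, "offset": offset // 8, "flag": more}
        )
        offset += size
    return fragments


if __name__ == "__main__":
    for frag in fragment(3980):
        print(frag)
```

With 3980 bytes of data and an MTU of 1500 this yields offsets 0, 185 and 370, with the more-fragments flag cleared only on the last fragment.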

TTL - Time to Live. Initially set to a value such as 30, decremented by 1 by each Router. If this reaches zero, the packet must be dropped. This prevents packets from traveling endlessly around the network if a routing table is misconfigured to create a cycle.

Protocol - The protocol of the transport layer (TCP is 6, UDP is 17).

Header checksum - Calculated at each hop to make sure that the header was not corrupted.

Source IP address

Destination IP address
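To make the header layout concrete, here is a Python sketch that packs the fields listed above into a 20 byte header (no options) and computes the header checksum. The checksum is the standard Internet checksum, the ones' complement of the ones' complement sum of the 16 bit words; the notes do not spell this out, so treat that detail as an addition. The function names and default values are hypothetical.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16 bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16 bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_ipv4_header(src, dst, payload_len, ident=777, ttl=30, protocol=6):
    """Pack a 20 byte IPv4 header with a valid checksum."""
    layout = "!BBHHHBBH4s4s"
    fields = [
        (4 << 4) | 5,           # version 4, header length 5 words
        0,                      # type of service
        20 + payload_len,       # total length in bytes
        ident,                  # identification
        0,                      # flags and fragment offset
        ttl,                    # time to live
        protocol,               # 6 = TCP, 17 = UDP
        0,                      # checksum placeholder
        socket.inet_aton(src),  # source IP address
        socket.inet_aton(dst),  # destination IP address
    ]
    fields[7] = internet_checksum(struct.pack(layout, *fields))
    return struct.pack(layout, *fields)


if __name__ == "__main__":
    header = build_ipv4_header("128.113.67.21", "207.171.183.16", 100)
    print(header.hex())
    print(internet_checksum(header))  # 0 when the header is intact
```

A router can verify a header by running the same sum over all 20 bytes, checksum field included; an intact header gives zero. Because each router changes the TTL, the checksum has to be recomputed at every hop.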

There are a number of possible IP options. These are mostly used for debugging and network tracing.

Here are some options.

Security - almost always ignored
Strict source routing - gives the sender the option of specifying the complete path to be followed to the destination
Loose source routing - the sender appends a list of routers not to be missed
Record Route - make each router append its IP address. used for traceroute and similar utilities
Timestamp - make each router append its address and timestamp

An IP address is 32 bits, so there are potentially more than 4 billion hosts on the Internet. In fact, there are far fewer than this, but nonetheless, current estimates put the number of hosts at somewhere around 230 million. Clearly, routing tables cannot be this large.

A 32 bit IP address is divided into two parts, a network part and a host part. All routing except at the destination network is done on the basis of the network only. For example, the IP address of every host on the RPI network starts with 128.113. The first sixteen bits are the network field and the last 16 are the host field. Outside of RPI, routers only look at the network field. Routers know that when the network field is 128.113, they need to send the packet to the RPI network. But they know nothing about the hosts inside the RPI network.
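The split into a network field and a host field is just bit masking, as the following Python sketch shows. It assumes the fixed 16 bit split used in the RPI example; the routing table contents and the output line names are invented for illustration.

```python
import ipaddress


def split_address(addr: str, prefix_len: int = 16):
    """Return (network part as dotted quad, host part as integer)."""
    value = int(ipaddress.ip_address(addr))
    host_bits = 32 - prefix_len
    network = value >> host_bits << host_bits  # clear the host bits
    host = value & ((1 << host_bits) - 1)      # keep only the host bits
    return str(ipaddress.ip_address(network)), host


# Hypothetical routing table of a router outside RPI: it is keyed by
# network only and knows nothing about individual RPI hosts.
ROUTING_TABLE = {"128.113.0.0": "line 2"}


def output_line(addr: str) -> str:
    network, _host = split_address(addr)
    return ROUTING_TABLE.get(network, "default line")


if __name__ == "__main__":
    print(split_address("128.113.67.21"))
    print(output_line("128.113.67.21"))
    print(output_line("207.171.183.16"))
```

Every RPI address maps to the same table entry, which is why the table stays small no matter how many hosts RPI adds.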

This also solves another problem. Suppose an RPI network engineer adds another host to the RPI network and runs a web server on this machine. The host is assigned the IP address 128.113.67.21. Immediately, any web browser on the entire planet could connect to this server (if it somehow knew of its existence). None of the millions of intermediate network routers need to be updated. They just need to know that this is on the RPI network. Once the packet gets to the RPI network, a router somewhere inside RPI needs to know where this machine is, but this is entirely a local RPI problem (see the section in the prior class on ARP).

The above description is for IPv4 (version 4) which has been the version for the past 25 years or so (very impressive). This is in the process of being replaced by IPv6, which offers a much wider variety of options, including larger address space, improved security options, and improved quality of service options.

Transport Layer Protocols

There are two widely used transport layer protocols, TCP and UDP

User Datagram Protocol (UDP)

This is a very simple protocol. It is little more than a chunk of data enclosed in an IP Packet.
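How little UDP adds can be seen from its header, which is only 8 bytes: source port, destination port, length and checksum, each 16 bits. The notes do not list these fields, so the layout below is supplied from the UDP specification. In this Python sketch the function names are hypothetical and the checksum is left at zero, which over IPv4 means that no checksum was computed.

```python
import struct


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prefix the payload with an 8 byte UDP header."""
    length = 8 + len(payload)  # header plus data, in bytes
    checksum = 0               # 0 = checksum not computed (allowed over IPv4)
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + payload


def parse_udp_datagram(datagram: bytes):
    """Return (source port, destination port, payload)."""
    src_port, dst_port, length, _checksum = struct.unpack("!HHHH", datagram[:8])
    return src_port, dst_port, datagram[8:length]


if __name__ == "__main__":
    datagram = build_udp_datagram(40000, 53, b"query")
    print(len(datagram))
    print(parse_udp_datagram(datagram))
```

There are no sequence numbers, acknowledgments or connection state; anything beyond port multiplexing is left to the application.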

Transmission Control Protocol (TCP)

In contrast to UDP, TCP is quite complicated. It is connection oriented and reliable. Before any data is transferred the client and server have established a connection; each knows that it can send and receive data from the other and has allocated appropriate buffer space and other system resources. How can two entities communicate reliably over an unreliable network?

The transport layer receives a message from the application layer. If it is a long message, TCP breaks it down into segments to pass on to the network layer, which may in turn break it down into smaller packets. The other end reassembles the segments into a continuous byte stream and passes this on up to the application.

Services provided by TCP

Multiplexing - there can be many network applications running on the same computer. When a packet of data arrives from the network, TCP has to determine which application to send it to. This is done with the port. Each host has many ports, each corresponding to a different process or thread.
Ensuring reliable transport. This is done by acknowledging segment arrival. When a sender transmits a segment, it sets a timer, and if the segment is not acknowledged within a certain amount of time, it transmits the segment again.
assembling segments into the correct order (segments do not necessarily arrive in the same order that they were sent)
Managing flow control - making sure that a fast sender does not overwhelm a slower receiver, dealing with network congestion etc.
Session and connection control. TCP is connection oriented. Before any data is transmitted, the two ends establish a connection and agree on parameters.

The TCP Header

TCP prepends a 20 byte header to the front of its payload. This header has the following fields

Source Port (16 bits)
Destination Port (16 bits)
Sequence Number (32 bits) - Recall that TCP views the data as a continuous stream of bytes. This is the sequence number of the first byte of the segment.
Acknowledgment Number (32 bits) - This is the sequence number of the last byte received plus one, i.e., the next byte that the receiver is expecting (TCP is full duplex; data is transported in both directions)
Header Length (4 bits)
Unused (6 bits)
Flags (one bit each)
- URG - there is urgent (out of band) data
- ACK - the acknowledgment field is valid; receipt of data is being acknowledged
- PSH - Don't buffer data, make it available to the application immediately (obsolete)
- RST - Reset (abort) the connection
- SYN - Synchronize - used in establishing the connection
- FIN - Sender is breaking off the connection
Receive Window (16 bits) Used in flow control. This is the amount of available buffer space that the receiver has remaining. The sender should not send more data than this.
Checksum (16 bits)
Urgent data pointer (16 bits) points to the end of the out-of-band data. Only valid if the URG flag is set.
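The field list above maps directly onto a 20 byte binary layout. The following Python sketch packs and unpacks such a header; it assumes no TCP options, leaves the checksum at zero rather than computing it, and uses hypothetical function names.

```python
import struct

# Flag bits, from least significant: FIN, SYN, RST, PSH, ACK, URG.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}
LAYOUT = "!HHIIHHHH"  # ports, seq, ack, length+flags, window, checksum, urgent


def pack_tcp_header(src_port, dst_port, seq, ack, flags, window,
                    checksum=0, urgent=0):
    flag_value = 0
    for name in flags:
        flag_value |= FLAG_BITS[name]
    # Header length (in 32 bit words) sits in the top 4 bits,
    # followed by 6 unused bits and the 6 flag bits.
    length_and_flags = (5 << 12) | flag_value
    return struct.pack(LAYOUT, src_port, dst_port, seq, ack,
                       length_and_flags, window, checksum, urgent)


def unpack_tcp_header(header: bytes) -> dict:
    (src, dst, seq, ack, length_and_flags,
     window, checksum, urgent) = struct.unpack(LAYOUT, header[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": (length_and_flags >> 12) * 4,  # bytes
        "flags": {n for n, bit in FLAG_BITS.items() if length_and_flags & bit},
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
    }


if __name__ == "__main__":
    syn = pack_tcp_header(49152, 80, 1000, 0, ["SYN"], 65535)
    print(len(syn))
    print(unpack_tcp_header(syn))
```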

Establishing a connection

Before any data is transmitted, the two ends establish a connection with a three way handshake, so called because three packets are transmitted.

The host that initiates the connection is called the client. It sends a request to establish a connection (a segment with the SYN flag set).
The host at the other end (the server) receives the packet, allocates buffers for the connection, and replies with a packet that has both the SYN and ACK flags set.
When the client receives the ACK packet from the server, it allocates buffers and sends an acknowledgment back to the server.

Once this has been completed, the client sends a request or requests to the server, and the server replies.
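The three packets of the handshake can be written out with their flags and sequence numbers. In this Python sketch the initial sequence numbers are arbitrary made-up values and the dictionary layout is hypothetical; the rule it illustrates is that a SYN consumes one sequence number, so each side acknowledges the other's initial sequence number plus one.

```python
def three_way_handshake(client_isn: int, server_isn: int):
    """Return the three segments exchanged when a connection is opened."""
    # 1. Client asks to open a connection.
    syn = {"from": "client", "flags": {"SYN"}, "seq": client_isn}
    # 2. Server accepts and acknowledges the client's SYN.
    syn_ack = {"from": "server", "flags": {"SYN", "ACK"},
               "seq": server_isn, "ack": syn["seq"] + 1}
    # 3. Client acknowledges the server's SYN.
    ack = {"from": "client", "flags": {"ACK"},
           "seq": client_isn + 1, "ack": syn_ack["seq"] + 1}
    return [syn, syn_ack, ack]


if __name__ == "__main__":
    for segment in three_way_handshake(client_isn=100, server_isn=300):
        print(segment)
```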

Connection Termination

Terminating a connection uses a four way handshake. Each end of the connection sends a segment with the FIN flag set, and the other end acknowledges this. Note that it is possible for one end of a connection to send a FIN, meaning "I am not going to send any more data to you", but the other end continues to send data.

Ensuring Reliable Transport

Once the connection has been established with the three way handshake, both sides can exchange data. TCP ensures reliable delivery. Conceptually this is easy. The sender transmits a segment, setting the sequence number in the header to the offset of the first byte. It also sets a timer. When the receiver receives the segment, it transmits an acknowledgment. The Acknowledgment field of the TCP header is set to the sequence number of the last byte received plus one (i.e., the next byte that it is expecting). If the sender does not receive an acknowledgment by the time that the timer goes off, it sends the segment again.

The receiver does not need to acknowledge each segment individually; if it receives segments n, n+1, and n+2, it just has to acknowledge segment n+2, and the receipt of the other two is implicit.
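A cumulative acknowledgment is easy to compute: it is the first byte that has not yet arrived in order. The Python sketch below is an illustration rather than how any particular TCP implementation is written; segments are represented as hypothetical (sequence number, length) pairs.

```python
def cumulative_ack(next_expected: int, received) -> int:
    """Return the acknowledgment number after the given segments arrive.

    received is an iterable of (seq, length) pairs, in any order.
    """
    for seq, length in sorted(received):
        if seq > next_expected:
            break  # gap: an earlier segment is still missing
        next_expected = max(next_expected, seq + length)
    return next_expected


if __name__ == "__main__":
    # Three consecutive 100 byte segments: one ACK covers all of them.
    print(cumulative_ack(0, [(0, 100), (100, 100), (200, 100)]))
    # The middle segment is missing: only the first can be acknowledged.
    print(cumulative_ack(0, [(0, 100), (200, 100)]))
```

In the second call the acknowledgment stays at 100, so the sender learns nothing about the segment that starts at byte 200.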

You might ask what a receiver does when it receives an out-of-order segment. It could either store it, on the assumption that the missing segment will soon arrive, or it could simply drop it, on the assumption that it will be retransmitted. The TCP specification is silent on this, but in practice, most implementations will drop the out-of-order packet because it is more work to store it and retrieve it appropriately and because the sender is likely to resend it anyway since the receiver has no way of acknowledging out-of-order segments.

Setting the Timer

When the sender transmits a segment, it sets a timer, and if it does not receive an acknowledgment by the time that the timer goes off, it retransmits the segment. The timer value varies dynamically depending on the average Round Trip Time of recent packets.
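One common way to derive the timer value from recent round trip times is an exponentially weighted moving average, with a safety margin based on how much the samples vary. The notes do not give a formula, so the Python sketch below is an assumption modeled on the smoothing described in RFC 6298, with the usual weights of 1/8 and 1/4; it leaves out details such as the minimum timeout and clock granularity. The class name is hypothetical.

```python
class RetransmissionTimer:
    """Tracks a smoothed round trip time and the resulting timeout."""

    def __init__(self, alpha=0.125, beta=0.25, initial_rto=1.0):
        self.alpha = alpha      # weight of a new sample in the average
        self.beta = beta        # weight of a new sample in the variation
        self.srtt = None        # smoothed round trip time (seconds)
        self.rttvar = None      # smoothed variation (seconds)
        self.rto = initial_rto  # current retransmission timeout (seconds)

    def update(self, sample_rtt: float) -> float:
        if self.srtt is None:
            # First measurement initializes both estimates.
            self.srtt = sample_rtt
            self.rttvar = sample_rtt / 2
        else:
            # The variation is updated first, using the old average.
            deviation = abs(self.srtt - sample_rtt)
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * deviation
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample_rtt
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto


if __name__ == "__main__":
    timer = RetransmissionTimer()
    for sample in (0.10, 0.10, 0.30, 0.10):
        print(round(timer.update(sample), 4))
```

A steady round trip time makes the timeout shrink toward the average, while a sudden slow sample pushes it up sharply, which avoids needless retransmissions on a path whose delay fluctuates.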

Clients and Servers

When two computers communicate on the Internet or any other network, one of the two is called the client and the other is called the server. The client is the one that initiates the connection; thus, the client is analogous to the person who initiates a phone call or mails a letter. The other computer is called the server, analogous to the person who receives a phone call or receives a letter.

The server must be started first. Once a server is started, it goes to sleep waiting for clients to connect. Whenever a client connects, it wakes up, handles the request, and waits for the next client to connect. If handling the request is complex, a server might fork off a new thread or even a new process for each connection.
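The server behavior described above looks roughly like this with Python's socket module. This is a minimal sketch, not a production server: it runs client and server in one program over the loopback address, lets the operating system choose a free port, handles a fixed number of clients, and replies with the request converted to upper case. The function names are hypothetical.

```python
import socket
import threading


def run_server(listener: socket.socket, max_clients: int) -> None:
    """Handle max_clients connections, one at a time."""
    for _ in range(max_clients):
        conn, _addr = listener.accept()  # sleeps until a client connects
        with conn:
            request = conn.recv(1024)
            conn.sendall(request.upper())  # handle the request, then loop


def request_reply(message: bytes) -> bytes:
    # The server is started first: create_server binds and listens.
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: OS picks one
    port = listener.getsockname()[1]
    server = threading.Thread(target=run_server, args=(listener, 1), daemon=True)
    server.start()

    # The client initiates the connection and sends its request.
    with socket.create_connection(("127.0.0.1", port), timeout=2.0) as client:
        client.sendall(message)
        reply = client.recv(1024)

    server.join(timeout=2.0)
    listener.close()
    return reply


if __name__ == "__main__":
    print(request_reply(b"hello"))
```

A server that has to handle slow requests would hand each accepted connection to a new thread or process instead of serving it inside the accept loop.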

A server on Unix systems is typically run as a daemon. A daemon is a process which is not connected to a controlling terminal; it is running in background.

Note that the client needs to know about the existence of the server, but, until the connection is established, the server does not necessarily know about the existence of the client.

A single Internet host can run multiple different servers. One of the jobs of the TCP software is to multiplex incoming packets; that is, it has to determine the destination process. This is done with a port. A computer has 64K ports for TCP and an additional 64K ports for UDP. Recall that the destination port number was one of the fields of the TCP header.

In order for a client to establish a connection to a server, it needs to know

the IP address of the server and
the port number on which the application process is listening.

Most computers on the Internet have a name as well as an IP address. When you want to connect to www.amazon.com, you do not need to know that its IP address is 207.171.183.16. There is a fairly elaborate name service system on the Internet (which uses the same client server system) called DNS (the Domain Name System), so when an application has only a name, it can connect to a name server which will return the IP address.

Likewise, most people who use services on the Internet do not know anything about port numbers. The standard Internet services listen on well known port numbers. For example, a production web server always listens on TCP port 80.

A web server in a testing environment might listen on some other port. If you append a colon followed by a number at the end of a url, (for example, www.amazon.com:8080), this tells your browser (the client) to connect to a different port (8080 in the example), but of course you would only do this if you knew that there was a web server listening on that particular port.
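A client can work out which host and port to connect to from the URL alone. The Python sketch below uses the standard urllib.parse module; the function name and the table of default ports are hypothetical, and the URL must include the http:// scheme for the parser to recognize the host and port.

```python
from urllib.parse import urlsplit

# Well known port used when the URL does not name one.
DEFAULT_PORTS = {"http": 80}


def host_and_port(url: str):
    """Return the (host, port) pair a browser would connect to."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


if __name__ == "__main__":
    print(host_and_port("http://www.amazon.com/"))
    print(host_and_port("http://www.amazon.com:8080/"))
```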

Here are some other well known services and their port numbers

ssh (the secure shell that you use to connect to Solaris) port 22
telnet (an insecure version of ssh) port 23
smtp (simple mail transfer protocol; i.e. email) port 25
finger (a way to query users on a system) port 79
ftp (file transfer protocol, a service to copy files from one system to another) port 21 for sending commands and port 20 for transferring data

Ports below 1024 are reserved by the kernel and require administrative privileges to use; the ports above this can be used by user processes.
