CSCI.4210 Operating Systems Fall, 2009 Class 9
Memory Management

There are many levels of memory in a computer. In general, there is a tradeoff between memory access time (speed) vs. cost and/or capacity. The different types of memory can be viewed as a hierarchy, with very fast access but expensive/limited capacity memory at the top, and slow access but cheap/large capacity memory at the bottom. The levels of this hierarchy, ranked by access time, are described below.

The upper levels of this hierarchy (registers, cache, and main memory) are volatile: data stored in volatile memory is lost when the computer is shut down or the power goes off. The lower levels (such as magnetic disk) are non-volatile: data stored on these media is not lost when the power goes off.

Registers

All CPUs have a number of user registers where a program can store data. Access time to registers is the fastest of any memory type, but the number of registers is very limited. Compilers try to optimize programs by storing often used data in registers during the execution of a program. C has a feature which allows a user to suggest to the compiler that a particular variable be stored in a register by declaring it like this:
register int x;
However, this is not used much because modern compilers are pretty good at guessing what data should be stored in registers.
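
For illustration, here is a minimal example of the register hint; the variable names and the loop are invented for this sketch, and a modern compiler will usually make the same decision on its own:

#include <stdio.h>

int main(void)
{
    register int i;              /* hint: keep the loop counter in a register */
    long sum = 0;

    for (i = 0; i < 1000000; i++)
        sum += i;                /* note: you may not take the address of i   */

    printf("sum = %ld\n", sum);
    return 0;
}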

Cache

The concept of cache memory will come up repeatedly in the remainder of this course. The basic idea is that frequently used data is stored in memory which has faster access. Cache memory is based on the principle of locality. This principle states that programs access a relatively small portion of their address space at any point in time. There are two types of locality.

Temporal locality: An item that was recently referenced is likely to be referenced again soon.

Spatial locality: A program that references a particular address is likely to reference nearby addresses soon.

Most real-world programs display these two principles very strongly. If you look at a large program, at any given time it is likely to be executing inside a loop which accesses only a small number of variables.
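
To make the two kinds of locality concrete, here is a small C fragment; the array and variable names are invented for the illustration:

#include <stdio.h>

#define N 1024

int main(void)
{
    int a[N];
    long sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Temporal locality: sum and i are reused on every iteration.
       Spatial locality: a[0], a[1], a[2], ... are adjacent in memory, so once
       one element has been fetched, its neighbors are likely to be in the
       same cache block. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("%ld\n", sum);
    return 0;
}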

Cache memory is based primarily on the principle of temporal locality: recently used data is stored in a faster (i.e. more expensive/smaller capacity) level of memory on the assumption that it will be used again soon. Because data is brought into the cache in blocks of adjacent addresses, caching also takes advantage of spatial locality.

There are two types of volatile memory, SRAM (static random access memory) and DRAM (dynamic random access memory). SRAM is faster, but is more expensive. SRAM has a transfer rate as fast as the CPU, and is used in cache memory. DRAM has a transfer rate slower than the CPU and is used in main memory.

Here is a chart from Patterson and Hennessy showing cost and access time by memory type

Memory technology    Typical access time    $ per GB in 2004
SRAM                 0.5-5 ns               $4,000-$10,000
DRAM                 50-70 ns               $100-$200
Magnetic disk        5-20 million ns        $0.50-$2.00

Cache memory can either be on the processor chip (Level 1) or on a separate chip (Level 2). L1 cache is faster than L2 cache, but both are faster than ordinary DRAM (main memory). Many processors have two L1 caches, one for data and one for instructions. Modern processors typically have 64KB of L1 instruction cache and 64KB of L1 data cache. L2 cache is typically 512KB or 1MB.

Caching only works if the program first searches the cache to see if the data that it is looking for is there. The unit of memory in the cache is the block. If the data that the program needs is in the cache, this is called a hit, and if it is not in the cache, this is called a miss (you could have guessed that). The percentage of cache searches which produce hits is called the hit rate.

If the program searches the cache and does not find the data that it is looking for, it has to look in main memory. When this occurs, the contents of the address in main memory are copied to the cache. Not only is this slower, but the program has incurred an additional penalty because it took some time to determine that the data was not in the cache. Thus, caching can only be successful if searching the cache is substantially faster than accessing main memory and the hit rate is high.

Searching the cache has to be implemented in hardware. It makes no sense to have a software implementation of a process that has to be faster than accessing main memory. Even if the search algorithm is implemented in hardware, it has to be fast. One possible mapping strategy for cache memory is direct mapping, in which the address of a particular item in the cache is determined by its address in main memory. For example, we could simply use the main memory address (or one of the fields of the address), modulo the cache size.

Suppose a cache holds 4096 words (cache size is generally a power of 2).

Here is a typical 32 bit memory address in binary. Since the cache holds 4096 = 2^12 words, the last 12 bits determine the cache address.

01110110 00011001 11001101 10101100

This address would be stored in our cache at address 1101 10101100.

Implementing such an algorithm in hardware is easy. The only problem is that if two (or more) addresses that need to be cached happen to have the same values in this field, only one of them can be in the cache at a time.

The cache memory has to contain not only the data value, but also a tag containing the rest of the memory address, so that the CPU can confirm that the value in the cache is indeed the value that it is looking for. It also needs a flag indicating whether or not the entry is valid.
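
Here is a rough C sketch of a direct-mapped lookup along these lines. The sizes and structure layout are invented for illustration (4,096 one-word slots and 32 bit addresses, matching the example above); a real cache does all of this in hardware.

#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 4096                 /* 2^12 slots, as in the example above */

struct cache_line {
    bool     valid;                      /* is this slot holding real data?     */
    uint32_t tag;                        /* the rest of the memory address      */
    uint32_t data;                       /* the cached word itself              */
};

static struct cache_line cache[CACHE_SLOTS];

/* Returns true on a hit and stores the word in *value; false on a miss. */
bool cache_lookup(uint32_t address, uint32_t *value)
{
    uint32_t index = address % CACHE_SLOTS;   /* low 12 bits pick the slot      */
    uint32_t tag   = address / CACHE_SLOTS;   /* remaining 20 bits are the tag  */

    if (cache[index].valid && cache[index].tag == tag) {
        *value = cache[index].data;           /* hit                            */
        return true;
    }
    return false;                             /* miss: caller must go to memory */
}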

Direct mapping as described above is not used much now because of this collision problem. There are two other mapping functions. N-way set associative cache allows each logical address to be stored in any of N slots in the cache. If N is 4, then each address can be stored in any of 4 slots, which means that up to four words (or blocks) with the same cache address can be stored at the same time. The problem, of course, is that finding an address in the cache becomes more complicated. It has to be done very fast, so the searching is done in parallel, with all N entries searched at the same time. Patterson and Hennessy describe the design of such a circuit.

You can think of direct mapping as an instance of this where N is 1.

The third mapping function is fully associative, in which any value can be stored in any slot in the cache. This is equivalent to N-way set associative cache where N is the size of the cache. It takes very complex circuitry to implement this, but it has the highest hit rate.
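
The following sketch shows what an N-way set-associative lookup computes. The loop over the N ways is sequential only because this is software; the hardware compares all N tags in parallel. The sizes and names are again invented for illustration. When the number of sets is 1 and N equals the number of lines, the same loop describes a fully associative cache.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 1024                    /* number of sets in the cache        */
#define WAYS     4                       /* N = 4: four slots per set          */

struct cache_line {
    bool     valid;
    uint32_t tag;
    uint32_t data;
};

static struct cache_line cache[NUM_SETS][WAYS];

bool set_assoc_lookup(uint32_t address, uint32_t *value)
{
    uint32_t set = address % NUM_SETS;   /* which set this address maps to     */
    uint32_t tag = address / NUM_SETS;   /* the rest of the address is the tag */

    for (int way = 0; way < WAYS; way++) {       /* parallel in hardware       */
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *value = cache[set][way].data;
            return true;                         /* hit                        */
        }
    }
    return false;                                /* miss                       */
}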

Cache design issues

There are a number of other issues in the design of a cache, such as the choice of block size, the block replacement policy, and whether writes are passed through to main memory immediately (write-through) or only when a block is replaced (write-back).

Memory Management

In order to implement multiprogramming, the code for several different processes has to be in memory at the same time. This introduces some issues regarding memory layout and memory management.

All modern operating systems solve this problem using virtual memory, which will be discussed in detail shortly, but first we will look at some solutions which do not involve virtual memory.

The introduction of multiprogramming resulted in two new problems for the operating system which were not issues when only one process at a time was in memory (monoprogramming). These problems were address relocation and protection.

Address Relocation: When the linker is creating an executable file, it has to assign memory addresses for all of the instructions and data. If all programs were loaded at the same address, this would be straightforward, but with multiprogramming, the linker does not know exactly where in memory the process will be loaded, so it does not know the actual physical addresses.

Many modern processor architectures solve this problem by calculating addresses in two parts, a segment base and an offset from this segment base. The instructions and data are divided into segments. Often each function is in its own segment. Each individual instruction or data address is then calculated not as an absolute physical memory address, but as an offset from the base. The segment base is stored in one of the registers.

Thus, when a program is loaded into memory, it is not necessary to recalculate every address in the program, but it is still necessary to calculate the base register values for each segment, based on where in main memory the process happens to be loaded.

Protection: If there are several processes loaded into memory at the same time, it is crucial that the operating system prevent one process from accessing data belonging to another process. Usually this would happen only by accident, because one process would not normally know the addresses of the other processes that happen to be in memory at the same time. However, since the kernel is customarily loaded at the same place in memory, it is especially important to prevent user programs from accessing kernel memory space.

If the architecture supports segmentation, then protection is often implemented by having not only a base register for a segment, but a limit register as well, which represents the highest address in that segment. Whenever a running program accesses an address, its offset is checked to make sure that it is less than the value in the limit register.
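
Here is a sketch of the check the hardware performs on every memory access under such a scheme; the structure and the fault handling are invented for illustration:

#include <stdio.h>
#include <stdlib.h>

struct segment {
    unsigned int base;    /* physical address where the segment starts         */
    unsigned int limit;   /* size of the segment; valid offsets are 0..limit-1 */
};

/* Translate a (segment, offset) pair into a physical address,
   faulting if the offset is out of range. */
unsigned int translate(const struct segment *seg, unsigned int offset)
{
    if (offset >= seg->limit) {
        fprintf(stderr, "protection fault: offset %u out of range\n", offset);
        exit(1);                          /* a real CPU would raise a trap     */
    }
    return seg->base + offset;
}

int main(void)
{
    struct segment code = { 0x4000, 0x1000 };
    printf("0x%x\n", translate(&code, 0x123));    /* prints 0x4123             */
    return 0;
}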

Memory management without virtual memory

Consider a simple computer which has 120 units of memory, with addresses 0 .. 119. The bottom 20 units of memory (addresses 0 - 19) are reserved for the kernel, which leaves 100 units for user processes. Each process has a fixed size which is known when it is loaded, and processes have to be loaded into contiguous memory.

The first process takes up 20 units of memory and is loaded at addresses 20 - 39. The next process takes up 40 units of memory and is loaded at addresses 40 - 79. The third process takes up 30 units of memory and is loaded at addresses 80 - 109. Memory now looks like this.

Process 2 now completes, thus freeing up its memory. Process 4, of size 30, is loaded at addresses 40 - 69. Memory now looks like this.

Process 5, of size 30, needs to be loaded, but there is no hole large enough, so it has to wait, swapped out on disk. Process 1 then terminates, freeing up its memory. There are now 40 units of available memory, theoretically more than enough to load process 5, but there is no contiguous block of 30 free units, so it still cannot be loaded.

This demonstrates a problem known as fragmentation. If the operating system uses a simple placement scheme to decide where to place processes in memory, over time, many small holes develop, which reduces the efficiency of memory utilization. There are a number of solutions to the memory placement issue, and each has some strengths and weaknesses.

Memory Management without partitions

The model above used memory management without partitions. When a new process was created, if there was a large enough unused space in memory, it was loaded; otherwise it was put aside until a large enough segment of memory became available. As we saw, this strategy tends to result in memory fragmentation: lots of small holes which are too small to hold most processes.

If several holes are available, where does the system place the new process? Suppose that a new process of size 20 has been created, and there are three holes in memory, one of size 30 at address 20, one of size 20 at address 60, and one of size 40 at address 100.

Where do we place our new process? Four algorithms have been proposed.

Best fit: The best fit algorithm says to place a process in the smallest hole that will accommodate it, on the assumption that this will lead to the least fragmentation. The best fit strategy would place our new process at address 60.

Worst fit: The worst fit algorithm places the new process in the largest hole. This would place the new process at address 100.

First fit: The first fit algorithm places the new process in the first hole that it finds that is large enough. This would place our new process at address 20. The rationale is that the first two algorithms have to search the entire list of holes, and of course this takes time.

Next fit: The next fit algorithm is similar to first fit except that instead of starting at the beginning of the list of holes each time, we start searching where we left off last time, placing the new process in the first hole that we find that is large enough. This is a small improvement over first fit, because with first fit, processes tend to be bunched at the low end of memory.

Research and simulations on these four algorithms do not show large differences in performance between any of them. Worst fit performs somewhat more poorly than the others, as one might predict, but all have the problem of fragmentation over time.
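
Here is a rough sketch of first fit and best fit over a simple table of holes, using the three holes from the example above; the data structures are invented for illustration, and a real implementation would use the linked free list described below.

#include <stdio.h>

struct hole {
    int address;
    int size;
};

/* The three holes from the example above. */
static struct hole holes[] = { { 20, 30 }, { 60, 20 }, { 100, 40 } };
static const int nholes = 3;

/* First fit: return the address of the first hole big enough, or -1. */
int first_fit(int request)
{
    for (int i = 0; i < nholes; i++)
        if (holes[i].size >= request)
            return holes[i].address;
    return -1;
}

/* Best fit: return the address of the smallest hole big enough, or -1. */
int best_fit(int request)
{
    int best = -1;
    for (int i = 0; i < nholes; i++)
        if (holes[i].size >= request &&
            (best == -1 || holes[i].size < holes[best].size))
            best = i;
    return best == -1 ? -1 : holes[best].address;
}

int main(void)
{
    printf("first fit: %d\n", first_fit(20));   /* 20 */
    printf("best fit:  %d\n", best_fit(20));    /* 60 */
    return 0;
}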

There are two ways that such a memory model might be implemented. One is to maintain a linked list of all of the holes in memory. This list is often called the free list because it is a list of all of the free space in memory.

This list would need to be updated whenever a new process is loaded into one of the holes or a process terminates, thus creating a new hole or making an existing hole larger. When a process terminates, it has two neighbors, one above and one below (unless it is the very first process or the very last process). There are four possibilities: neither neighbor is a hole, only the neighbor above is a hole, only the neighbor below is a hole, or both neighbors are holes. In the first case the terminating process simply becomes a new hole; in the other three cases it is merged with the adjacent hole or holes into a single larger hole.

An alternative data structure is to maintain a bitmap, with a bit corresponding to each allocation unit of memory. If a particular allocation unit is occupied, its bit is set to 1; if it is unoccupied, its bit is set to 0. The smaller the allocation unit, the larger the bitmap has to be. When the memory manager wants to allocate a process of size k units, it has to search the bitmap for a run of at least k free units, and this can be time consuming.
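
Here is a sketch of the search the memory manager has to perform with a bitmap. For simplicity it uses one byte per allocation unit; a real bitmap would pack eight units per byte.

#include <stdio.h>

#define UNITS 32

/* 1 = allocated, 0 = free.  One entry per allocation unit. */
static char bitmap[UNITS] = {
    1,1,1,1, 0,0,1,1, 1,0,0,0, 0,1,1,1,
    1,1,0,0, 0,0,0,0, 1,1,1,1, 0,0,0,0
};

/* Find the first run of at least k free units; return its start or -1. */
int find_run(int k)
{
    int run_start = -1, run_len = 0;

    for (int i = 0; i < UNITS; i++) {
        if (bitmap[i] == 0) {
            if (run_len == 0)
                run_start = i;
            if (++run_len >= k)
                return run_start;
        } else {
            run_len = 0;
        }
    }
    return -1;        /* no hole large enough */
}

int main(void)
{
    printf("%d\n", find_run(6));   /* prints 18: units 18-23 are free */
    return 0;
}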

Compaction

One solution to the problem of fragmentation is to periodically run a program which compacts memory by moving all of the current processes together, replacing the many small holes with one large hole.

The obvious disadvantage of this is that such a program is fairly time consuming. The segment base addresses of each process have to be recalculated and changed, and the memory pointers in the Process Control Blocks have to be updated to reflect the new locations of the processes.

Fixed Partitions

Another solution to the fragmentation problem is to use partitions of fixed size. There are two models. One is to have all partitions the same size; a process then has to be loaded into one or more contiguous partitions. This prevents a lot of the fragmentation described above, because lots of small holes can no longer develop. However, it introduces a new kind of fragmentation. The type of fragmentation described above is called external fragmentation, because the holes are outside of any particular partition. If partitions are of fixed size, we can get internal fragmentation, which is defined as wasted space inside a partition. Suppose that the partitions are all of size 64KB. A process of size 70KB has to be assigned to two partitions, and so the remaining 58KB of the second partition is wasted; it is not available for other processes.

Some IBM mainframe computers in the 1960s had fixed partitions but of various sizes. Some partitions were large, others were small.

The buddy system: An interesting memory management model which represents a compromise between the fixed partition model and the variable partition model is the buddy system. This is actually used on some Linux systems.

Memory blocks are available in sizes 2^K, where L ≤ K ≤ U. So 2^L is the smallest block allocated, and 2^U is the largest block allocated (often the size of all available memory).

Initially the entire space available for allocation is treated as a single block of size 2^U. If a request of size s, where 2^(U-1) < s ≤ 2^U, arrives, the entire space is allocated. Otherwise the block is split into two equal-sized buddies, and the splitting continues until the smallest block whose size is a power of two at least as large as s (and at least 2^L) is produced and assigned. The system maintains a list of the holes of each size 2^i. A hole is removed from its list when it is split in half. Whenever both buddies of size 2^i become available, they are coalesced into a single block of size 2^(i+1) and put on the list of holes of that size.

An example will clarify this. Consider a system with 128K of memory. Initially, all 128K (2^17 bytes) are in one partition.

The first process to be loaded is 30K. The one large partition is divided into two partitions of 64K each, and one of these is divided again into two partitions of 32K each. The new process is loaded into one of these 32K partitions, leaving two empty partitions, one of size 32K and one of size 64K. There are 2K of internal fragmentation, that is, memory inside a partition which is not used or usable. The memory layout looks like this.

The next process to be loaded is 12K. The 32K partition is divided into two 16K partitions, and this process is loaded into one of them. Memory now looks like this.

The next process to be loaded is of size 50K, so it is loaded into the 64K partition.

The only remaining empty partition is 16K. The next process to be loaded is of size 20K, but there is no partition large enough, so it has to wait for one of the currently running processes to finish.

Process 1 terminates, freeing up a 32K partition, so Process 4 is loaded here.

Processes 2 and 4 terminate, freeing up their partitions. Process 2's 16K partition is coalesced with its free 16K buddy into a 32K partition, which in turn is coalesced with the 32K partition freed by Process 4, leaving a single free partition of size 64K.

The buddy system of memory management results in internal fragmentation, but the advantage is that the small partitions get consolidated over time as processes terminate.
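
Two small computations capture much of the buddy system: rounding a request up to the next power of two, and finding a block's buddy, which (with block offsets measured from the start of the managed area) is just the offset with the bit corresponding to the block's size flipped. This is only a sketch, not a full allocator.

#include <stdio.h>

/* Round a request up to the next power of two: the partition size
   the buddy system would actually hand out. */
unsigned int buddy_block_size(unsigned int request)
{
    unsigned int size = 1;
    while (size < request)
        size <<= 1;
    return size;
}

/* The buddy of the block starting at offset 'addr' with size 'size'
   (both relative to the start of the managed region, size a power of 2). */
unsigned int buddy_of(unsigned int addr, unsigned int size)
{
    return addr ^ size;
}

int main(void)
{
    printf("30K request -> %uK partition\n", buddy_block_size(30));  /* 32K */
    printf("50K request -> %uK partition\n", buddy_block_size(50));  /* 64K */

    /* In the 128K example, the 32K block at offset 32K has its buddy at 0K,
       and the 16K block at offset 48K has its buddy at offset 32K. */
    printf("buddy of 32K block at 32K: %uK\n", buddy_of(32, 32));    /* 0  */
    printf("buddy of 16K block at 48K: %uK\n", buddy_of(48, 16));    /* 32 */
    return 0;
}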

Swapping

The class on process scheduling discussed swapping. Swapping involves removing entire processes from memory and copying them to disk when the system becomes overloaded. Swapping is done either when there are too many runnable processes, so that none of them receives much CPU time, or, more commonly, when there is not enough memory. Swapping is particularly important when there are processes in memory which spend long periods of time waiting for user input or for other I/O events which occur infrequently. If there are processes waiting to be loaded, it is wasteful to keep a process in memory waiting for an event which is unlikely to occur soon.

The system does not know when these events will occur, but a general rule is that the longer a process has been waiting for an event, the less likely it is that the event will happen soon, so a reasonable swapping strategy is to look for a process or processes which have been blocked for a long time and swap them out.

Virtual Memory

The memory management strategies for multiprogramming systems described above were based on two assumptions:

  • A process needed to be loaded into memory in its entirety before it could run.
  • A process needed to be loaded into contiguous memory.

    These seemed like obvious assumptions when multiprogramming was first introduced. However, as programs became larger and as systems started to load larger numbers of processes at the same time, these primitive memory management systems began to break down.

    In the 1970s, most computer systems began to switch to a memory management system called virtual memory in which both of these assumptions were violated. Virtual memory is based on the observation that at any given time, a process is only using a small fraction of its code and data. A process which may consist of tens of thousands of lines of code might only be running in a few functions. In fact, a substantial fraction of most real world code consists of error handling routines which are seldom used.

    Recall the principles of spatial and temporal locality discussed above. These two principles also apply to virtual memory.

    The idea behind virtual memory is that physical memory is divided into fixed size page frames, typically 512 to 8192 bytes each, with 4096 being a typical value. The page size is virtually always a power of two, for reasons to be explained below. The loadable module is likewise divided into a number of pages, which are always the same size as the page frames in memory.

    Pages are loaded into page frames only when they are needed. Adjacent pages are not necessarily loaded into adjacent page frames in memory. At any point during the execution of a program, only a small fraction of the total pages of a process may be loaded.

    Note that there are two kinds of addresses (two address spaces): virtual addresses and real addresses. Virtual addresses, also called logical addresses, are program generated. In a system without virtual memory, a physical address and a logical address are the same; the virtual address is placed on the address bus directly, because it refers to a physical address in main memory. On systems with virtual memory, there is a mapping between virtual addresses and real addresses. Suppose, for example, that the code for the next instruction to be executed is in page 3 of the process's address space. The running program only knows the virtual address. Rather than putting the virtual address directly on the address bus, the system sends this address to the Memory Management Unit or MMU. The MMU provides the mapping between virtual addresses and physical addresses; in this example, it might convert a virtual address in page 3 to a real address in page frame 10. The data structure which contains this mapping is called a page table.

    It is possible that a process might access a page which is not in memory. For example, suppose that the process tries to access data in page 8. The MMU will determine that this page is not loaded, and this will generate a page fault. A page fault is a type of interrupt. The handler for a page fault locates the needed page on the disk, copies it into a page frame in memory, and updates the page table, recording where the page is now loaded. Control then returns to the process.

    When a program first starts running, it will generate a number of page faults. Suppose that the first executable instruction is
    MOV A704,AX; copy value at address hex A704 to register AX
    This one instruction will generate two page faults. The instruction itself will be at an address in the code segment, so the page which contains this instruction will have to be loaded into a page frame in memory. Then the page in the data segment or stack which contains virtual address A704 will have to be loaded into another frame. However, if the program conforms to the principles of temporal and spatial locality, subsequent instructions are likely to be in the same page as this instruction, and subsequent data accesses are likely to be to the same page of data.

    Structure of the page table

    A row of the page table typically has the following fields: the number of the page frame which holds the page (if it is in memory), a valid flag, a dirty bit, and protection information.

    The dirty bit is important. If at least one value in the page has been modified since it was loaded from disk into its page frame, there is a discrepancy between the copy of the page in memory and the copy on the disk. In this case the dirty bit is on; otherwise it is off. This matters because there will come a time when this page frame is needed for another page, and when that occurs, if the dirty bit is on, the contents of the frame have to be copied back to disk by the page fault handler. If the dirty bit is off, the copy on the disk is identical to the copy in memory, and so the page can be overwritten without losing any data.
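
    Here is a sketch of the decision the page fault handler makes when it reuses a frame; the structure and helper names are invented for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    struct pte {                     /* one page table entry                     */
        uint32_t frame;              /* number of the frame holding the page     */
        bool     valid;              /* is the page in memory?                   */
        bool     dirty;              /* written since it was loaded from disk?   */
    };

    /* Hypothetical helpers, assumed to be defined elsewhere. */
    void write_frame_to_disk(uint32_t frame);
    void read_page_from_disk(uint32_t page, uint32_t frame);

    /* Evict the victim page and load new_page into the freed frame. */
    void evict_and_load(struct pte *victim, struct pte *new_pte, uint32_t new_page)
    {
        if (victim->dirty)                       /* memory copy differs from disk */
            write_frame_to_disk(victim->frame);  /* so it must be written back    */
        victim->valid = false;

        read_page_from_disk(new_page, victim->frame);
        new_pte->frame = victim->frame;
        new_pte->valid = true;
        new_pte->dirty = false;                  /* fresh copy matches the disk   */
    }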

    The MMU

    When virtual memory was first proposed, many people thought that it would never work because it would be too difficult to implement paging rapidly enough. The page table provides a mapping between the logical (virtual) address and the real address. This mapping has to be very fast, and the page tables can be very large.

    Obviously this mapping has to be implemented in hardware. The section of the CPU that provides this service is called the Memory Management Unit (MMU). Note that every memory reference has to go through the MMU.

    Let's start with a toy implementation to demonstrate some of the issues that the MMU has to deal with. Suppose a system has a 16 bit virtual address space (program addresses range from 0 to 64K), but only 32K of physical memory, divided into eight page frames of size 4K. A 16 bit virtual address can be divided into a 4-bit page field and a 12-bit offset, so each page and page frame is 4,096 bytes (2^12). The page table consists of 16 entries, one for each possible value of the 4-bit page field. The offset is simply the address within that page. For example, the virtual address 0x5068
    (0101 0000 0110 1000 in binary)
    is divided into two parts. The first four bits refer to the page (0x5), and the last 12 bits refer to the offset within that page (0x068), which is 104 in decimal.

    The page table has an entry for each logical page, 16 in all, and the four bit page field is an index into the page table.
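
    Here is a rough C sketch of what this toy MMU computes; the page table contents are invented for illustration, and a real MMU does all of this in hardware on every memory reference:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct pte {
        bool    valid;          /* is the page currently in a page frame?      */
        uint8_t frame;          /* which of the 8 page frames (0..7) holds it  */
    };

    static struct pte page_table[16];        /* one entry per 4K virtual page  */

    /* Translate a 16 bit virtual address to a physical address. */
    uint16_t translate(uint16_t vaddr)
    {
        uint16_t page   = vaddr >> 12;       /* top 4 bits: the page number    */
        uint16_t offset = vaddr & 0x0FFF;    /* bottom 12 bits: the offset     */

        if (!page_table[page].valid) {
            printf("page fault on page %u\n", (unsigned)page);
            /* the page fault handler would load the page here */
        }
        return (uint16_t)((page_table[page].frame << 12) | offset);
    }

    int main(void)
    {
        page_table[0x5].valid = true;        /* pretend page 5 is in frame 2   */
        page_table[0x5].frame = 2;

        printf("0x%04x\n", (unsigned)translate(0x5068));   /* prints 0x2068    */
        return 0;
    }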

    If the entire page table is loaded into registers, this is trivial to implement in hardware, and performance would be very fast, but it has some serious problems. The obvious one is that most modern systems allow processes to have very large address spaces. An address space of 32 bits per process is typical now. If the offset field is 12 bits (a page size of 4,096 bytes, also a typical figure), then the page field has to be 20 bits, so the page table has to have 2^20 (1,048,576) entries. This is clearly far too large to load into registers.

    Also, keep in mind that each process would have its own page table, and so every time a context switch occurred, the page table for the new process would have to be loaded into the registers, and this would make context switching very slow, far too slow for a typical system which may make a hundred or more context switches per second.

    The alternative is to keep the page table in main memory, but this is inefficient because each memory reference now involves two memory references, one to get the appropriate page table entry and one to fetch the actual data. That is not a viable solution.

    One solution is to have a multi-level page table. In the above example, a logical address was divided into two fields, a page field and an offset field. With a multi-level page table the logical address is divided into three fields, one for the primary page table, one for the secondary page table, and one for the offset of the address within the page.

    Here is an example. A system with a 32 bit logical address divides such an address into a ten bit field for the first level page table, a ten bit field for the second level page table, and a 12 bit field for the offset, giving a page size of 2^12 or 4,096 bytes.

    There are potentially 2^20 entries in the page table, far too many to keep in memory at one time. However, keep in mind that although each process has a potential address space of 2^32 bytes, most processes only use a tiny fraction of this. In particular, when a process is loaded, there is a large chunk of unused memory in the middle of the address space for the stack and the heap to grow into.

    Recall that in the page table design described above, each entry in the page table contained a pointer to real memory. In a two level page table, each entry in the first level page table contains a pointer to a second level page table, and the second level page table provides the address of the page frame in real memory. There are 2^10 or 1,024 entries in the first level page table, each pointing to a second level page table. The first level page table is always in memory, but only a handful of second level page tables are usually needed.

    If the 32 bit virtual address is
    00000101 11100110 11100100 01100100 (0x05E6E464)
    then to determine the physical address, this logical address is sent to the MMU, which extracts the first ten bits, 0000010111, to use as an index into the first level page table. The first level page table provides a pointer to one of 1,024 second level page tables (in fact most of these will not exist). This table is hopefully in memory. The MMU then extracts the next ten bits, 1001101110, and uses them as an index into the second level page table, which provides a pointer to the actual page frame in real memory. The last 12 bits, 010001100100, give the offset within that page of the actual byte which is requested.
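
    Here is a sketch of the bit manipulation involved; only the field extraction matters, and the printed values correspond to the example address above:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t vaddr = 0x05E6E464;              /* the address from the example   */

        uint32_t top    = (vaddr >> 22) & 0x3FF;  /* bits 31-22: first level index  */
        uint32_t second = (vaddr >> 12) & 0x3FF;  /* bits 21-12: second level index */
        uint32_t offset =  vaddr        & 0xFFF;  /* bits 11-0:  offset in the page */

        printf("first level index:  %u\n", (unsigned)top);     /* 23   */
        printf("second level index: %u\n", (unsigned)second);  /* 622  */
        printf("offset:             %u\n", (unsigned)offset);  /* 1124 */
        return 0;
    }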

    This model is still too slow to be effective. Even if both of the required page tables are in memory, one logical memory access would now require three memory accesses, because the MMU would have to read the first level page table and the second level page table to be able to generate the physical address.

    The solution is the Translation Lookaside Buffer or TLB. TLBs are based on the principles of temporal and spatial locality which most programs conform to. Most processes, no matter how large they are, are usually only executing steps in a small part of the text segment and accessing only a small fraction of their data during any given short time period. This means that they are accessing only a small number of pages. The idea behind a TLB is that the mappings for these pages are cached.

    The TLB is inside the MMU, and contains pointers to a small number of frequently used pages. Whenever the running program needs to access memory, it first checks to see if the pointer to the page containing that address is stored in the TLB. Most of the time it is, and so the MMU can immediately put the real address on the address bus without having to read any other memory.

    Suppose the TLB has slots for 32 pages. Even with this small number, it cannot do a linear search to determine if the page is in the TLB. It needs to search in parallel, and so this has to be implemented in hardware.

    Entries in the TLB would look very similar to entries in the actual page table; each record would contain the virtual page number, the physical page number, a valid flag, a dirty bit, and protection information.
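
    Here is a sketch of a TLB entry and lookup. As with the cache examples earlier, the sequential loop stands in for what the hardware does in parallel, and the sizes and field names are invented:

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_SLOTS 32

    struct tlb_entry {
        bool     valid;
        bool     dirty;
        uint32_t vpage;       /* virtual page number        */
        uint32_t frame;       /* physical page frame number */
        uint8_t  protection;  /* read/write/execute bits    */
    };

    static struct tlb_entry tlb[TLB_SLOTS];

    /* Return true and fill in *frame if the virtual page is in the TLB. */
    bool tlb_lookup(uint32_t vpage, uint32_t *frame)
    {
        for (int i = 0; i < TLB_SLOTS; i++) {        /* parallel in hardware */
            if (tlb[i].valid && tlb[i].vpage == vpage) {
                *frame = tlb[i].frame;
                return true;                         /* TLB hit              */
            }
        }
        return false;                                /* TLB miss             */
    }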

    The relationship between the TLB, the MMU, and the cache is fairly complicated. A typical arrangement works as follows.

    The CPU generates a virtual address. First, the TLB is searched to see if the mapping to the physical address is cached. If there is a hit, we know the physical address. The next step is to search the cache to see if the value is in the cache. If there is a hit, this value is returned to the CPU. In most cases, both of these queries are hits, and so the value can be returned quickly without accessing main memory.

    If the TLB does not have a pointer to the page that contains the virtual address, the next step is to search the actual page table. If there is a hit, then we can generate the physical address, and we search the cache to determine if that value is in the cache. We also add this mapping to the TLB.

    If the page is not in memory (there is no valid entry for it in the page table), then a page fault is generated. In this case, the page will be read in from the disk, the page table will be updated, and the TLB will be updated.
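
    In outline, the combined lookup looks something like the sketch below; every helper function here is a hypothetical placeholder for the hardware and operating system mechanisms just described:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers standing in for the mechanisms described above. */
    bool     tlb_lookup(uint32_t vpage, uint32_t *frame);
    void     tlb_insert(uint32_t vpage, uint32_t frame);
    bool     page_table_lookup(uint32_t vpage, uint32_t *frame);
    uint32_t handle_page_fault(uint32_t vpage);
    bool     cache_lookup(uint32_t paddr, uint32_t *value);
    uint32_t read_main_memory(uint32_t paddr);

    /* One memory reference, assuming 4K pages. */
    uint32_t memory_reference(uint32_t vaddr)
    {
        uint32_t vpage  = vaddr >> 12;
        uint32_t offset = vaddr & 0xFFF;
        uint32_t frame, value;

        if (!tlb_lookup(vpage, &frame)) {            /* TLB miss                */
            if (!page_table_lookup(vpage, &frame))   /* page is not in memory   */
                frame = handle_page_fault(vpage);    /* read the page from disk */
            tlb_insert(vpage, frame);                /* remember the mapping    */
        }

        uint32_t paddr = (frame << 12) | offset;
        if (cache_lookup(paddr, &value))             /* cache hit               */
            return value;
        return read_main_memory(paddr);              /* cache miss              */
    }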

    Inverted Page Tables

    Most page tables are relatively sparse; they contain space for perhaps a million pages, but only a tiny fraction of these pages are actually used by a typical process. Thus, a traditional page table is somewhat wasteful. Also, as 64 bit computers become more common, traditional page tables become inconceivably huge. One solution is the inverted page table.

    In a regular page table, there is an entry for each virtual page; in an inverted page table, there is an entry for each physical page frame, so the table is much smaller. This would not seem to make much sense, because to find a particular virtual address it would be necessary to search the entire inverted page table to see which entry, if any, holds that page. Of course, the system first checks the TLB, and if there is a hit, there is no need to access the page table at all.

    An inverted page table can use associative or content addressable memory, in which each cell contains a key field and a data field. More commonly, to search an inverted page table the system uses a hash function: the virtual page number is passed through the hash function to generate an index to the appropriate row in the inverted page table. You all remember from your data structures course that a hash function allows very rapid search of large data sets.
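
    Here is a sketch of a hashed inverted page table lookup; the structure, hash function, and chaining scheme are all invented for illustration:

    #include <stdint.h>

    #define NUM_FRAMES 4096                 /* one entry per physical page frame   */

    struct ipt_entry {
        uint32_t pid;                       /* process that owns this frame        */
        uint32_t vpage;                     /* virtual page stored in this frame   */
        int      next;                      /* next frame in the hash chain, or -1 */
    };

    static struct ipt_entry ipt[NUM_FRAMES];
    static int hash_anchor[NUM_FRAMES];     /* first frame in each bucket, or -1   */

    static unsigned int hash(uint32_t pid, uint32_t vpage)
    {
        return (pid ^ (vpage * 2654435761u)) % NUM_FRAMES;
    }

    /* Return the frame holding (pid, vpage), or -1 if the page is not in memory. */
    int ipt_lookup(uint32_t pid, uint32_t vpage)
    {
        for (int f = hash_anchor[hash(pid, vpage)]; f != -1; f = ipt[f].next)
            if (ipt[f].pid == pid && ipt[f].vpage == vpage)
                return f;                   /* the frame number is the answer      */
        return -1;                          /* not in memory: page fault           */
    }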
