CSCI.4210 Operating Systems Fall, 2008 Class 9
Virtual Memory

Virtual Memory

The memory management strategies for multiprogramming systems described in the previous lesson were based on two assumptions:

  • A process needed to be loaded into memory in its entirety before it could run.
  • A process needed to be loaded into contiguous memory.

    These seemed like obvious assumptions when multiprogramming was first introduced. However, as programs became larger and as systems started to load larger numbers of processes at the same time, these primitive memory management systems began to break down.

    In the 1970s, most computer systems began to switch to a memory management system called virtual memory in which both of these assumptions were violated. Virtual memory is based on the observation that at any given time, a process is only using a small fraction of its code and data. A process which may consist of tens of thousands of lines of code might only be running in a few functions. In fact, a substantial fraction of most real world code consists of error handling routines which are seldom used.

    Recall the principles of spatial and temporal locality discussed above. These two principles also apply to virtual memory.

    The idea behind virtual memory is that physical memory is divided into fixed size pages. Pages are typically 512 to 8192 bytes, with 4096 being a typical value. Page size is virtually always a power of two, for reasons to be explained below. Loadable modules are also divided into a number of page frames. Page frames are always the same size as the pages in memory.

    Pages frames are loaded into memory only when they are needed. Adjacent page frames are not necessarily loaded into adjacent pages in memory. At a point in time during the execution of a program, only a small fraction of the total pages of a process may be loaded.

    Note that there are two kinds of addresses (or two address spaces), virtual addresses and real addresses. Virtual addresses, also called logical addresses, are program generated. In a system without virtual memory, a physical address and a logical address are the same; the virtual address is placed on the address bus directly because it refers to a physical address in main memory. On systems with virtual memory, there is a mapping between virtual addresses and real addresses. In the above figure, supposed that code for the next instruction to be executed is in page frame 3. The running program only knows the virtual address. Rather than putting the virtual address directly on the address bus, the system sends this address to the Memory Management Unit or MMU. The MMU provides this mapping between virtual addresses and physical addresses. In the example above, this mapping would convert a virtual address in page frame 3 to a real address in page 10. The data structure which contains this mapping is called a page table.

    It is possible that a process might access a page which is not in memory. For example, suppose that the process tries to access data in page frame 8. The MMU will determine that this page is not loaded, and this will generate a page fault. A page fault is an interrupt. The handler for a page fault locates the needed page frame on the disk, copies it to a page in memory, and updates the page table, telling it where the page frame is loaded. Control then returns to the process.

    When a program first starts running, it will generate a number of page faults. Suppose that the first executable instruction is
    MOV A704,AX; copy value at address hex A704 to register AX
    This one instruction will generate two page faults. The instruction itself will be at an address in the code segment and so the page frame which has this instruction will have to be loaded into memory. Then the page in the data segment or stack which contains virtual address A704 will have to be loaded into another frame. However, if the program conforms to the principles of temporal and spatial locality, subsequent instructions are likely to be in the same page frame as this instruction and subsequent accesses to data are likely to be in the same page frame of data.

    Structure of the page table

    A row of the page table typically has the following fields.

    The dirty bit is important. If at least one value in the page frame has been modified since the frame was loaded from disk into memory, there is a discrepancy between the page frame in memory and the page frame on the disk. In this case the dirty bit is on, otherwise it is off. This is important, because there will come a time when this page will be replaced by another page frame, and when this occurs, if the dirty bit is on, the contents of the page frame will have to be copied back to disk by the page fault handler. If the dirty bit is off, the contents of the page frame on the disk are identical to the contents of the page frame in memory, and so the page can be overwritten without losing any data.

    The MMU

    When virtual memory was first proposed, many people thought that it would never work because it would be too difficult to implement paging rapidly enough. The page table provides a mapping between the logical (virtual) address and the real address. This mapping has to be very fast, and the page tables can be very large.

    Obviously this mapping has to be implemented in hardware. The section of the CPU that provides this service is called the Memory Management Unit (MMU). Note that every memory reference has to go through the MMU.

    Let's start with a toy implementation to demonstrate some of the issues that the MMU has to deal with. Suppose a system has a 16 bit virtual address space (the program addresses range from 0 .. 64K), but it only has 32K of memory, divided into eight pages of size 4K. A 16 bit virtual address can be divided into a 4-bit page reference and a 12-bit offset. Thus, each page and page frame is 4,096 bytes (212 The page table consists of 16 entries, one for each possible value of the 4-bit page field. The offset is simply the address within that page. For example, the virtual address 0X5068
    (0101 0000 0110 0100 in binary)
    is divided into two parts. The first four bits refer to the page (0x5), and the last 12 bits refer to the offset within that page (0x068) which is 100 in decimal.

    The page table has an entry for each logical page, 16 in all, and the four bit page field is an index into the page table.

    If the entire page table is loaded into registers, this is trivial to implement in hardware, and performance would be very fast, but it has some serious problems. The obvious one is that most modern systems allow processes to have very large address spaces. An address space of 32 bits per process is typical now. If a page size is 12 bits (4096 bytes, also a typical figure), then the page field has to be 20 bits, so the page table has to have 220 (1,048,57610) entries. This is clearly far too large to load into registers.

    Also, keep in mind that each process would have its own page table, and so every time a context switch occurred, the page table for the new process would have to be loaded into the registers, and this would make context switching very slow, far too slow for a typical system which may make a hundred or more context switches per second.

    The alternative is to keep the page table in main memory, but this is inefficient because each memory reference now involves two memory references, one to get the appropriate page table entry and one to fetch the actual data. That is not a viable solution.

    One solution is to have a multi-level page table. In the above example, a logical address was divided into two fields, a page field and an offset field. With a multi-level page table the logical address is divided into three fields, one for the primary page table, one for the secondary page table, and one for the offset of the address within the page.

    Here is an example. A system with a 32 bit logical address field divides such an address into a ten bit field for the first page table, a ten bit field for the second level page table and a 12 bit field for the offset, creating a page size of 212 or 4,096 bytes.

    There are potentially 220 entries in the page table, far too many to keep in memory at one time. However, keep in mind that although each process has a potential address space of 232, most processes only use a tiny fraction of this. In particular, when a process is loaded, there is a large chunk of unused memory in the middle of the address space for the stack and the heap to grow into.

    Recall that in the page table design described above, each entry in the page table contained a pointer to real memory. In a two level page table, each entry in the first level page table contains a pointer to a second level page table, and the second level page table provides the address of the page in real memory. There are 210 or 1,024 entries in the first level page table, each pointing to a second level page table. The first level page table is always in memory, but only a handful of second level page tables are usually needed.

    If the 32 bit virtual address is
    00000101 11100110 11100100 01100100 or 0x05EAE864
    to determine the physical address, this logical address is sent to the MMU, which extracts the first ten bits 0000010111 to use as an index to the first level page table. The first level page table provides a pointer to one of 1,024 second level page tables (in fact most of these will not exist). This table is hopefully in memory. The MMU will extract the second ten bits 1001101110 and use this as an index to the second level page table, which will provide a pointer to the actual page in real memory. The last 12 bits 010001100100 will provide the offset in the page for the actual byte which is requested.

    This model is still too slow to be effective. Even if the both of the required page tables are in memory, this now means that one logical memory access would require three memory accesses because the MMU would have to read the first page table and the second page table to be able to generate the physical address.

    The solution is Translation Lookaside Buffers or TLBs. These are based on the principles of temporal and spatial locality which most programs conform to. Most processes, no matter how large they are, are usually only executing steps in a small part of the text segment and accessing only a small fraction of the data during any given time short time period. This means that they are accessing only a small number of pages. The idea behind a TLB is that these pages are cached.

    The TLB is inside the MMU, and contains pointers to a small number of frequently used pages. Whenever the running program needs to access memory, it first checks to see if the pointer to the page containing that address is stored in the TLB. Most of the time it is, and so the MMU can immediately put the real address on the address bus without having to read any other memory.

    Suppose the TLB has slots for 32 pages. Even with this small number, it cannot do a linear search to determine if the page is in the TLB. It needs to search in parallel, and so this has to be implemented in hardware.

    Entries in the TLB would look very similar to entries in the actual page table; each record would contain the virtual page number, the physical page number, a valid flag, a dirty bit, and protection information.

    The relationship between the TLB, the MMU, and the cache is fairly complicated. Here is a typical arrangement.

    The CPU generates a virtual address. First, the TLB is searched to see if the address of the physical address is cached. If there is a hit, we know the physical address. The next step is to search the cache to see if the value is in the cache. If there is a hit, this value is returned to the CPU. In most cases, both of these queries are hits and so the value can be returned quickly without accessing main memory.

    If the TLB does not have a pointer to the page that contains the virtual address, the next step is to search the actual page table. If there is a hit, then we can generate the physical address and so we search the cache to determine if this value is in the cache. We also will add this page to the TLB.

    If the page is not in the page table, then a page fault is generated. In this case, the page will be read in from the disk, the page table will be updated, and the TLB will be updated.

    Inverted Page Tables

    Most page tables are relatively sparse; they contain space for perhaps a million pages, but only a tiny fraction of these pages are actually used by the typical process. Thus, a traditional page table is somewhat wasteful. Also, as 64 bit computers become more common, page tables will be inconceivably huge. The solution is inverted page tables

    In a regular page table, there is an entry for each virtual page, in an inverted page table there is an entry for each physical page, so the page table is much smaller. This would not seem to make much sense, because to access a particular virtual address, it would be necessary to search each page in the inverted page table to see if it is there. Of course the system first checks the TLB, and if there is a hit, there is no need to access the page table at all.

    An inverted page table uses associative or content addressable memory. Each cell in an associative memory contains a key field and a data field. To search an inverted page table, the system uses a hash function, in which the physical address is passed through a function to generate an index to the appropriate row in the inverted page table. You all remember from your data structures course that a hash function allows very rapid search of large data sets.

    Page Replacement Algorithms

    When a page fault occurs, the system has to copy a new page frame from disk into a page of memory. This may involve replacing a page which is already loaded. How does the system decide which page to replace? Note that if the page which is being replaced has been modified (the dirty bit is on), it has to be copied to disk. It is important to have an efficient page replacement algorithm, because if the page which is replaced is accessed again soon, the result will be another page fault.

    Page faults are time consuming. If the system is generating too many page faults, it is doing little real work. This is a situation called thrashing.

    Let's assume that we have perfect knowledge of the page reference stream into the future. The forward distance of a particular page at a particular time is the next place in the stream where that page is referenced again. The forward distance would be 1 if the very next reference was to that page, and infinity if that page will never be referenced again. The optimal page replacement algorithm would be to replace that page which has the highest forward distance, in other words, replace that page which will not be used for the longest time.

    Unfortunately, the system has no way of knowing this. However, it is possible to make a reasonable guess about which pages may not be referenced again soon.

    One simple and obvious strategy is the not-recently-used algorithm (NRU). This algorithm says to replace a page in memory which has not been accessed recently. Because of the principle of temporal locality, we can assume that a page which has not been referenced recently is unlikely to be referenced soon.

    The ideal implementation of the NRU algorithm would be to replace that page which has been least recently used (LRU); i.e. has not been accessed for the longest time. However, it is difficult to implement a pure LRU algorithm because this would require keeping track of how long ago each page was referenced, and the overhead associated with this is far too high to be practical. An approximation of LRU will be discussed below.

    It is simple to implement a Not Recently Used algorithm. A row in a page table usually has a flag called the referenced bit. Most systems have a clock which periodically generates an interrupt about once every 20 msec. On each clock interrupt (or similar schedule), the referenced bits of all of the pages in the page table are set to zero. As each page is referenced, its referenced bit is turned on. When a page fault occurs, if an active page needs to be replaced, the system looks for a page where the referenced bit is off, and replaces that page.

    This is far from perfect. If a page fault occurs shortly after a clock interrupt, essentially all of the pages will be marked unreferenced, and so the system will have little information about which page to overwrite.

    This system is further complicated by the fact that if the dirty bit is on, the page has to be copied to disk. In this case, a page fault results in both a read-from-disk operation and a write-to-disk operation. Thus, a variant of this NRU strategy is to look first for a page where both the referenced bit and the dirty bit are off, and it only replaces a page with the dirty bit on if there are no pages with both bits off. Of course this requires additional overhead because the entire page table has to be searched.

    Another low overhead strategy is the First In First Out (FIFO) page replacement algorithm. In this case, the system keeps track of the load time for each page, and it replaces that page with the longest time in memory. This can be easily implemented by putting each new page at the tail of a list in memory, and removing pages from the head.

    The FIFO algorithm should be included in this list for completeness, but it is not used in practice. There is no particular reason to assume that the page which was loaded longest ago is less likely to be accessed soon than a page which was just loaded.

    We can improve on the NRU algorithm. The text describes a second chance algorithm and a clock algorithm, but the latter is just an implementation of the former. The clock algorithm keeps all of the current pages in a circular list. There is a pointer which moves around this list like the hand of a clock. When a page fault occurs, the system examines the page that the hand is currently pointing to. If the referenced bit is off, this page is replaced in memory (copying the contents back to disk if the dirty bit is on). If the referenced bit is on for that page, it is set to off and the hand continues around the list. This continues until it finds a page where the referenced bit is off. The algorithm is guaranteed to find such a page, because the list is circular, and the hand will eventually get back to the page where it started, and this page will have the referenced bit off.

    It goes without saying that the referenced bit is turned on whenever a page is referenced.

    This algorithm has an advantage over the NRU algorithm, because when it finds a page frame with the referenced bit off, it can be certain that this page has not been referenced in one full sweep across all of the pages, whereas with the NRU algorithm, if the system finds a page with the reference bit off, there is no guarantee regarding how long ago it was referenced, only that it has not been referenced since the last clock interrupt.

    The clock algorithm or variants of it are used in many real operating systems.

    It was stated above that the Least Recently Used (LRU) algorithm would be very efficient, but that it would be difficult to implement in practice. The text talks about two ways to implement LRU in hardware.

    One method is for the system to maintain a counter which is incremented each time any instruction is accessed. The row of data in the page frame needs to maintain room for this counter. Whenever a particular page is referenced, the counter is incremented and the value is copied to the page frame.

    When it is time to replace a page in memory, the system checks this value in each page and replaces that page where the value is the lowest.

    Here is an example. Suppose that the value of the counter at the start of the scenario is 1000 (in practice, the counter should be very large, the text suggests 64 bits). There are eight pages in memory, numbered one to eight. Here is a string of page references.

    6, 7, 3, 8, 8, 1, 4, 2, 5, 6, 1

    After executing this string, here is the value of the counter for each of the eight page frames.

    If a page fault occurred at this point in time, and one of these eight pages had to be replaced, the system would look for the lowest value of the counter, see that it was page 7, and it would replace that page.

    This algorithm can only be implemented if there is hardware support for it in the MMU, and few if any hardware developers have implemented such an algorithm.

    A close approximation of the LRU algorithm can be implemented entirely in software. The text refers to this as Not Frequently Used (NFU) with aging. It works like this. Each page has a counter associated with it. Recall that operating systems have a clock interrupt which occurs roughly every 20 msec. At each clock interrupt, each counter is right shifted by one bit (this is equivalent to dividing by 2. The least significant bit is lost). If the page has been referenced since the last interrupt, the most significant bit of the counter is set to one, otherwise it is set to zero. Then all of the referenced flags are set to zero.

    Here is an example. There are six pages in memory, and the counter is eight bits. (Eight bits is usually enough). During each time slice (the period of time between clock interrupts), the following pages are referenced. For example, during the first time slice, pages 2 and 4 were referenced, the others were loaded but not referenced.

    12,4
    21,2,3,4
    32,3,4,5,6
    41,2
    51,2,5,6
    62
    71,2,6
    81,2,5

    After this set of eight time slices, the counters look like this
    111011010 (218)
    211111111 (255)
    300000110 (6)
    400000111 (7)
    510010100 (148)
    601010100 (84)

    When a page fault occurs, the page with the lowest value of the counter is replaced. If a page fault occurred after the above string, it would replace page 3, because its counter had the lowest value.

    Note that any page which was referenced in the last clock cycle will have a higher value than a page not so referenced, because its most significant bit will be one, and the most significant bit of any page not referenced in the last cycle will be zero. Likewise, any page referenced in the past two cycles will have a higher value than a page which has not been referenced in the past two cycles, and so on.

    Working Sets All of the above models of paging use demand paging. In demand paging, a page is only loaded when it is needed, and a page fault always loads exactly one page. When a process is first created, none of its pages are in memory, and so there will always be a number of page faults when a process starts. After a while, enough pages will be loaded into memory so that the process runs for long periods with few page faults. The set of pages which need to be in memory in order for the program to run for an extended period with few page faults is called the working set.

    Operating systems which are running at or near capacity will swap out processes, that is, remove all of their pages from memory and remove the job from the running queue. At some point a swapped out job will be swapped back in, that is, returned to the running queue. If the system is using demand paging, such a job will again generate a number of page faults before having enough pages in memory to run. Thus, it would make sense to keep track of the pages of a process which were in memory when the job was swapped out, and when the job was swapped back in, restore all of these pages to memory before starting to run the job. Loading pages before they generate a page fault is called prepaging, and this model is called the working set model because it restores the entire working set of a process before running it.

    This is more complicated than it seems. The operating system needs to keep track of which pages have been recently accessed. It is possible, indeed likely, that a process may have pages loaded which it has not accessed for a while, but which have not been replaced. These pages should not be considered to be a part of the working set.

    We can define the working set w(k,t) at time t as the set of pages that have been referenced by a particular process within the past k memory references. Up to a point, as k increases, the size of the working set increases, but this function levels off quickly.

    In order to implement this, the system would need to update a register at each memory reference, and this is far too costly to be done even in hardware. However, it is possible to have a reasonable simulation. A field for time of last use is associated with each page (although this cannot be kept entirely current). As with other page replacement algorithms, each page also has a referenced bit. At each clock tick (about once every 20 msec), the system clears this bit (sets it to zero), and each page reference sets it to one.

    All operating systems keep track of the total running time of a process. When a page fault occurs, the system checks all of the pages associated with the current process. If the referenced bit is on, meaning that the page has been referenced since the last clock tick, the time of last use field is updated to the current value of the total run time of the process. If the referenced bit is zero, the value in time of last use field is compared to the current total time value, and if the difference is greater than some value, tau, this page is removed.

    Here is an example. At an instant in time, the currently running process has six pages in memory. The total running time of the process is now 1000 msec (this value is always an approximation, since it is only updated at clock interrupts). The value of tau, set by the system, is 50 msec, meaning that pages which have not been referenced in the past 50 msec of run time for that process are candidates to be replaced because they are no longer in the working set. Here are the values of the six pages when a page fault occurs. Clearly the page table would have other data as well, but I only show the data relevant to this algorithm.

    Page numberTime of last useReferenced flag
    19801
    29800
    39600
    49801
    59301
    69400

    The system needs to find a page to replace. The first page has been referenced since the last clock interrupt, so it is clearly not a candidate for replacement. Its Time of last use field is updated to 1000 (the current running time value). The second page has not been referenced in this clock cycle, but the curren time (1000) minus its time of last use (980) is less than tau (20 is less than 50), so it is not a candidate for removal. The same is true of the third page. Pages 4 and 5 have been recently referenced, so their time of last use field is updated to 1000. Page six has not been referenced since the last clock tick (its referenced bit is zero) and the current time minus its time of last use is 60 msec, which is greater than tau, so it is no longer in the working set and it is a candidate for replacement.

    This algorithm works well, but there is a lot of overhead associated with it because at each page fault, every page associated with that process has to be examined and possibly updated. This algorithm can be combined with the clock algorithm to get an efficient page replacement strategy which is widely used in practice.

    In this algorithm, the pages associated with a process are kept in a circular list. Each page has a referenced bit, a modified bit (dirty bit), and a time of last use field as well as a pointer to the physical page address. At each page fault, the list is searched to find a suitable candidate for replacement. Searching starts where it ended at the last page fault (Picture the hand of a clock). The algorithm differs from the above in that not all pages are searched, searching stops when a suitable page for replacement is found. Also, if the page is not in the working set but the modified bit is on, meaning that the page cannot be replaced until its new values are written to disk, rather than performing the write immediately, the write to disk is scheduled and the hand moves on, looking for a better candidate page.

    Design Issues in Paging

    There are a number of design issues for paging algorithms that can impact performance.

    Local vs. Global Paging Strategies

    If there are a number of runnable processes loaded into memory, and a page fault occurs, the system has to decide which page to eliminate. It can consider all pages in memory, i.e. the pages for all processes, or it can only consider those pages which are a part of the currently running process, which presumably is the process which produced the page fault. The former strategy is called a global page replacement strategy, while the latter is called a local page replacement strategy.

    Local page replacement strategies have the effect of keeping a more or less constant number of pages in memory for a given process. This may seem fair, but there is no reason to assume that all of the processes will need about the same number of pages. Also, during a run of a process, the number of pages that it needs may grow or shrink.

    In general, global replacement strategies work best. However, if the page replacement algorithm is one of the working set algorithms, then it would have to use a local strategy.

    Page Size

    An important design decision for an operating system is the page size. There are tradeoffs to having larger (or smaller) pages.

    A smaller page size means less internal fragmentation. Internal fragmentation is wasted space in a page frame. If a page is 4K, but the program only needs 1K of data, three K are wasted. Also, if page size is small, it is likely that more of what is loaded into memory at any given instant will be exactly what is needed, whereas with larger pages, it is possible that a part of the contents of many of the pages will not be referenced.

    On the other hand, a smaller page size means that there are more pages, which means a larger page table. If there are fewer pages, per process, the page reference is more likely to be in the TLB. Also, as we will see next week, the additional time to load large pages is far less than proportional. It takes essentially the same amount of time to load a large page as a small page. When a new process is started, larger pages will enable the memory management system to load a working set of pages faster. However, simulations of paging seem to show that a larger number of smaller pages produces fewer page faults than a smaller number of larger pages.

    Over time, the page size of operating systems has grown larger. The typical page size now is 4K, although many systems now have pages of 8K. (Page sizes are almost always a power of 2. Why?).

    Shared Pages

    Several processes can share the same page. For example, when a fork occurs on a Unix system, logically the parent process is duplicated but in practice, the code segment is not copied; both the parent process and the child process use the same pages. This saves space in memory and increases the efficiency of a fork system call. However, it complicates the managment of paging because the system needs to keep track of how many processes are using the same page.

    In some cases, both the parent and child share data segments as well. As long as neither process changes the values of any variables, this is efficient. This is called copy on write because when either the parent or the child updates one of its pages, the system is required to make another copy of this page, one for the child and one for the parent. Shared pages with a copy on write strategy works particularly well when the child immediately calls exec because essentially no copying of the parent is done at all. When the child process is created, its page tables are given the same values as the parent and no new memory is allocated.

    Locking pages in memory

    Most paging systems have a mechanism in which pages can be pinned in memory. A page which is pinned is not a candidate to be overwritten, even if it has not been referenced recently. For example, some kernel pages should always be in memory. The pages which contain the code for memory management are an obvious example.

    If an operating system uses Direct Memory Access (DMA) for I/O, in'which I/O devices copy data directly to or from the memory space of an application, then the page(s) which contains this memory must be pinned, even if the process itself is blocked for an extended period of time.

    Stack Algorithms and Belardy's Anomaly

    It should be obvious that the more pages are allocated to a process, the fewer page faults will occur, and in general this is true. However, it is possible to contrive strings of page references where this is not the case. Consider this string of page references.

    1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5

    If the page replacement algorithm is FIFO and there are three pages of memory, the run stream would look like this. The asterisks indicate page faults.

    Note that with a memory size of 3, there are nine page faults. Here is the same chart with a memory size of 4

    Note that there are ten page faults with a memory size of four. This is known as Belardy's anomaly, and it is a potential problem, because it should not be the case that increasing resources for a problem decrements performance.

    This problem arises because the set of pages loaded with three pages of memory is not necessarily the same pages that are loaded with a memory size of four pages. For example, when page 5 is first referenced, page 1 is left in memory when size is three, but is replaced when size is 4.

    Belardy's anomaly does not occur if the set of pages which is loaded with the larger memory size always includes the set of pages which would be loaded with the smaller memory size. Such algorithms are called stack algorithms. Most of the algorithms that were described in the previous lesson are stack algorithms, and so in practice, Belardy's anomaly is not an issue.

    Paging and Segmentation

    All of the above discussion treats the logical image of a process as a more or less large contiguous block, divided into fixed sized pages. Many processor architectures, including the Intel series, divide a process into variable sized segments. Often each procedure is in its own segment. With a segmented architecture, the computation of a logical address involves two addresses, the segment base, which is customarily loaded into a register, and the offset from the base.

    Segments differ from pages in two ways. First, they are of variable sizes. Second, the contents of any particular segment are logically related; for example, it might be the code for a single function. This can be contrasted with a pure paging system in which any one page may have the code for several unrelated functions.

    It might make sense to have a virtual memory system based on segments rather than pages. Such a system would have a segment table which corresponds to the page table in a pure paging system. A page fault would then result in a new segment being loaded. The segment table would then have to have an additional field for the segment size. Such a system has the advantage that essentially all of the code loaded into physical memory would be exactly what is needed, because, for example, all of the code for a particular function would be loaded at once when that function was called.

    However, designing a paging system with variable page sizes is more difficult. It can result in many small holes in the memory, which results in external fragmentation. Also, it turns out that many segments are quite small, often they are substantially smaller than the typical page size of modern systems. As a result of these two problems, such a system is not used today.

    The Intel processors use a combination of segments and paging, in a fairly complex system. The architecture makes reference to a Global Descriptor Table (GDT), whose address is stored in a specific register, the GDTR. In addition, each process has its own Local Descriptor Table (LDT). The former contains information about segments for the operating system itself, while the latter contains information about segments for a particular process (code segments, data segments, stack segments etc).

    To calculate an address, the system has to load a segment base from the LDT or GDT into a register. Then a complete address is calculated by adding the offset to the address in the base register. This is called a linear address. The operating system can either enable or disable paging, by setting a bit in the global control register. If paging is disabled, a linear address is treated as an absolute physical address. If paging is enabled (as is the case for any operating system that we discuss), then the linear address is in turn passed to the page tables as described in the first lesson of this week.

    Program Design Issues

    Cache memory and paging only work because of the principle of locality. There are two trends in programming which are tending to weaken this principle, thus leading to more page faults.

    Paging in Unix and Windows

    Windows 2000

    In Windows 2000 every process has 4 GB (232 ) of address space. This is much larger than the physical address of any computer, and also much more memory than most processes will need. In fact, half of this memory space is for reference addresses used by the operating system (supervisor space).

    When an executable (a .exe file) is created by the linker, the number of pages needed by the code and by the global variables is known, but programs need more memory as the program executes because the run time stack grows (and shrinks) and the program can dynamically request more memory with malloc or new. There are two phases to this. A region of virtual memory (a block of pages) can be reserved. This does not cause anything to be written to the page table. Pages are only allocated when a region is committed. An example of pages which are reserved but not committed are pages in the stack space but which are not a part of the stack because the stack has not grown large enough. When the stack grows into a new page, that page is then committed. An entry is made for it in the page table.

    Pages are reserved in blocks. The block size is typically 64K, but can be larger or smaller. By default a thread has 1MB of stack size reserved, although this can be overridden with the CreateThread call (recall that one of the arguments was the stack size).

    Windows 2000 uses a two level page table. All virtual addresses are 32 bits. On an Intel processor, the most significant 10 bits refer to an entry in the page directory. Each process has one page directory table and its entries are called page directory entries (PDEs). Each PDE contains the address of a page table. The next ten bits of the address refer to an entry in the page table pointed to by the entry in the page directory. There are potentially 1024 210 page tables for each process, but most processes will have only a few page tables. The last 12 bits of the address refer to the offset within the page, so the page size is 4K.

    Windows 2000 uses a working set with a clock algorithm for page replacement. Each process is given a default working set size between 20 and 50 pages, and the working set size is not allowed to grow above a maximum set by the system administrator. The size of the working set for a particular process is computed dynamically by the system based on the availability of memory (if there are a lot of free pages, the system can increase the working set size of a process without other processes suffering a performance penalty) and the number of page faults. Ironically, if a process generates too few page faults and the system is heavily used, its working set size may be reduced. Likewise, if a process has many page faults, its working set size is increased.

    Page replacement strategies are local, not global.

    When a page is requested, the system loads the requested page and a small number of pages which follow it. Using the principle of spatial locality, it is likely that these pages might be accessed soon, and the cost of loading additional pages is small once the first page has been found.

    You can view the performance of the Windows 2000 memory system by invoking the task manager (hit cntl-alt-del and choose task manager from the popup window) and then choosing the Performance tab.

    Programmers can control much of the behavior of the paging system with Win32 APIs. For example a program can reserve or commit memory with the VirtualAlloc function. A program can lock a page or pages in memory with the VirtualLock function (or release them with VirtualUnlock).

    A single program can have more than one heap. The API
    HANDLE HeapCreate(DWORD flOptions, DWORD initialsize, DWORD dwMaxSize)
    returns a handle which is a pointer to a new heap. The reason why a programmer might want to do this is if there was a dynamic data structure such as a tree, if all of the nodes are in this heap, it can be freed all at once with
    BOOL HeapDestroy(HANDLE hHeap)

    Unix

    There are many flavors of Unix, each with their own virtual memory systems. By necessity, the virtual memory system must be tightly coupled to the hardware.

    Solaris has two separate memory management schemes, one for the kernel and one for user processes. Most kernel pages are locked and thus not technically a part of the paging system. The user process memory management system is a paging system as described above. It uses pure demand paging, and a global replacement system, with no attempt to preload pages or maintain a working set for a process. It uses several data structures:

    A key feature of the paging system of many implementations of Unix is the paging daemon. This is a process which is awakened several times per second. If the memory is more than 75 percent full, it swaps out some pages; Otherwise it goes back to sleep. This means that when a page fault occurs, there are always free page frames available so that each page fault does not need to do the work associated with deciding which page to remove and whether to copy it. This is far more efficient than requiring the page fault interrupt handler to have to locate a page to overwrite.

    The page daemon uses a two handed clock algorithm on the page frame data table for page replacement. Like other page replacement algorithms, it uses the referenced bit of the page table. This is set to zero when the page is loaded, and set to one when it is referenced. The front hand sweeps through the list of page frames, setting the referenced bit back to zero. The back hand sweeps through some time later, checking the referenced bit. If it is still zero, meaning that the page has not been referenced since the front hand passed it, that page is placed on a list to be paged out. The rates at which the hands sweep though the page frames and the distance between the two hands can both be set by the system administrator. Note that the further apart the two hands are, the more this algorithm resembles the classic clock algorithm.

    When the page daemon runs the two handed clock algorithm, it does not necessarily go through the entire page frame data table; it runs until enough pages have been marked for deletion (the default is 25%).

    Unix has two system calls, brk and sbrk to deal with memory management. These change the amount of space allocated for the calling process's data segment The change is made by resetting the process's break value and allocating the appropriate amount of space. The break value is the address of the first location beyond the end of the data segment. The amount of allocated space increases as the break value increases.

    System calls such as malloc and new use these brk and/or sbrk as a part of their memory management strategy. Ordinary users should not use these, particularly if the program also uses malloc or new.

    Return to the course home page