The principles of spatial and temporal locality described for caching also apply to virtual memory.
The idea behind virtual memory is that the address space of a process is divided into fixed-size pages, and physical memory is divided into page frames of the same size. Pages are typically 512 to 8192 bytes, with 4096 being a typical value. Page size is virtually always a power of two, for reasons to be explained below.
Pages are loaded into page frames only when they are needed. Adjacent pages are not necessarily loaded into adjacent page frames in memory. At any point during the execution of a program, only a small fraction of the total pages of a process may be loaded.
There are two kinds of addresses (two address spaces): virtual addresses and real addresses. Virtual addresses, also called logical addresses, are program generated. In a system without virtual memory, a virtual address and a physical address are the same; the virtual address is placed on the address bus directly, because it refers directly to a location in main memory. On systems with virtual memory, there is a mapping between virtual addresses and real addresses. This mapping is done by the Memory Management Unit (MMU), which consults a data structure called a page table.
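To make the mapping concrete, here is a small sketch in Python (the page table contents are invented for illustration). Because the page size is a power of two, the page number and offset can be extracted from a virtual address with a simple division and remainder, which in hardware is just a shift and a mask; this is the reason page sizes are powers of two.

```python
# Hypothetical single-level page table: virtual page number -> page frame number.
PAGE_SIZE = 4096                  # 2^12 bytes, so the low 12 bits are the offset

page_table = {0: 5, 1: 2, 2: 7}   # toy contents; a real table is indexed by VPN

def translate(vaddr):
    """Map a virtual address to a physical address, as the MMU would."""
    vpn    = vaddr // PAGE_SIZE   # virtual page number (high-order bits)
    offset = vaddr %  PAGE_SIZE   # offset within the page (low-order bits)
    if vpn not in page_table:
        raise LookupError("page fault")   # trap to the OS to load the page
    return page_table[vpn] * PAGE_SIZE + offset

# Virtual address 4100 is page 1, offset 4; page 1 is in frame 2,
# so the physical address is 2*4096 + 4 = 8196.
```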
Structure of the page table
A row of the page table typically has the following fields: a present/absent (valid) bit, the page frame number, a referenced bit, a modified (dirty) bit, and protection bits.
Page tables are potentially huge. On a 32 bit computer with a page size of 4 KB (2^12 bytes), the page table has 2^20 (roughly a million) entries per process. This is far too much to keep in main memory for every process. One solution to this problem is a multilevel page table, in which each entry in the first-level table points to a page table in the second level, but only a few second-level tables need be in memory at any time.
This model, by itself, would be too slow to be effective. Even if both of the required page tables are in memory, one logical memory access would now require three memory accesses, because the MMU would have to read the first-level page table and then the second-level page table before it could generate the physical address.
The solution is a cache of the page table, called the Translation Lookaside Buffer or TLB. The TLB is inside the MMU, and holds the mappings for a small number of frequently used pages. Whenever the running program accesses memory, the MMU first checks whether the mapping for the page containing that address is stored in the TLB. Most of the time it is, and so the MMU can immediately put the real address on the address bus without having to read any other memory. The TLB has to perform a parallel search, so it is implemented in hardware.
The relationship between the TLB, the MMU, and the cache is fairly complicated. Here is a typical arrangement.
Inverted Page Tables
In the last class we discussed the problem of how to store the page table. Most page tables are relatively sparse; they contain space for perhaps a million pages, but only a tiny fraction of these entries are actually used by the typical process. Thus, a traditional page table is somewhat wasteful. Also, as 64 bit computers become more common, flat page tables become inconceivably huge. The solution is the inverted page table.
In a regular page table there is an entry for each virtual page; in an inverted page table there is an entry for each physical page frame, so the table is much smaller. At first this would not seem to make much sense, because to find a particular virtual address it would seem necessary to search every entry of the inverted page table. Of course, the system first checks the TLB, and if there is a hit, there is no need to access the page table at all.
On a TLB miss, an inverted page table is searched using a hash function: the virtual page number is hashed to produce an index into a hash table, whose entry leads to the inverted page table entries with that hash value. You all remember from your data structures course that hashing allows very rapid search of large data sets. Some designs instead use associative, or content addressable, memory, in which each cell contains a key field and a data field and all keys are compared in parallel.
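As a sketch of the hashed lookup (the structure, sizes, and names here are illustrative, not from any particular system): each inverted-table entry records which (process, virtual page) pair occupies that frame, and entries whose keys hash to the same bucket are chained together.

```python
NFRAMES = 8

# One entry per physical frame: (process id, virtual page number) or None.
inverted = [None] * NFRAMES
next_in_chain = [None] * NFRAMES   # links frames whose keys hash to the same bucket
hash_anchor = {}                   # hash bucket -> first frame in its chain

def bucket(pid, vpn):
    return hash((pid, vpn)) % NFRAMES

def insert(pid, vpn, frame):
    """Record that (pid, vpn) now occupies the given frame."""
    inverted[frame] = (pid, vpn)
    b = bucket(pid, vpn)
    next_in_chain[frame] = hash_anchor.get(b)
    hash_anchor[b] = frame

def lookup(pid, vpn):
    """Return the frame holding (pid, vpn), or None, meaning a page fault."""
    f = hash_anchor.get(bucket(pid, vpn))
    while f is not None:
        if inverted[f] == (pid, vpn):
            return f
        f = next_in_chain[f]
    return None
```

The hash means that on average only one or two entries are examined per lookup, rather than all NFRAMES.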
Page Replacement Algorithms
When a page fault occurs, the system has to copy the needed page from disk into a page frame in memory. This may involve replacing a page which is already loaded. How does the system decide which page to replace? Note that if the page being replaced has been modified (its dirty bit is on), it first has to be copied back to disk. It is important to have an efficient page replacement algorithm, because if the page which is replaced is accessed again soon, the result will be another page fault.
Page faults are time consuming. If the system is generating too many page faults, it is doing little real work. This is a situation called thrashing.
Let's assume that we have perfect knowledge of the page reference stream into the future. The forward distance of a particular page at a particular time is the number of references until that page is referenced again. The forward distance would be 1 if the very next reference was to that page, and infinity if that page will never be referenced again. The optimal page replacement algorithm is to replace the page which has the greatest forward distance; in other words, replace the page which will not be used for the longest time.
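The optimal algorithm can be stated precisely in a few lines of Python (a sketch; it is usable only in simulation, since it requires knowledge of the future):

```python
def forward_distance(refs, t, page):
    """Number of references after position t until page is used again
    (infinity if it is never referenced again)."""
    for d, p in enumerate(refs[t + 1:], start=1):
        if p == page:
            return d
    return float("inf")

def opt_victim(refs, t, resident):
    """Optimal replacement: evict the resident page with the greatest
    forward distance at time t."""
    return max(resident, key=lambda p: forward_distance(refs, t, p))

# With refs = [1, 2, 3, 4, 1, 2, 5] and pages 1, 2, 3 resident just after
# the reference to 4 (t = 3), page 3 is never used again, so it is evicted.
```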
Unfortunately, the system has no way of knowing this. However, it is possible to make a reasonable guess about which pages may not be referenced again soon.
One simple and obvious strategy is the not-recently-used algorithm (NRU). This algorithm says to replace a page in memory which has not been accessed recently. Because of the principle of temporal locality, we can assume that a page which has not been referenced recently is unlikely to be referenced soon.
The ideal implementation of the NRU algorithm would be to replace that page which has been least recently used (LRU); i.e. has not been accessed for the longest time. However, it is difficult to implement a pure LRU algorithm because this would require keeping track of how long ago each page was referenced, and the overhead associated with this is far too high to be practical. An approximation of LRU will be discussed below.
It is simple to implement a Not Recently Used algorithm. A row in a page table usually has a flag called the referenced bit. Most systems have a clock which periodically generates an interrupt about once every 20 msec. On each clock interrupt (or similar schedule), the referenced bits of all of the pages in the page table are set to zero. As each page is referenced, its referenced bit is turned on. When a page fault occurs, if an active page needs to be replaced, the system looks for a page where the referenced bit is off, and replaces that page.
This is far from perfect. If a page fault occurs shortly after a clock interrupt, essentially all of the pages will be marked unreferenced, and so the system will have little information about which page to overwrite.
This system is further complicated by the fact that if the dirty bit is on, the page has to be copied to disk. In this case, a page fault results in both a read-from-disk operation and a write-to-disk operation. Thus, a variant of this NRU strategy is to look first for a page where both the referenced bit and the dirty bit are off, and to replace a page whose dirty bit is on only if there are no pages with both bits off. Of course this requires additional overhead, because the entire page table has to be searched.
Another low overhead strategy is the First In First Out (FIFO) page replacement algorithm. In this case, the system keeps track of the load time for each page, and it replaces that page with the longest time in memory. This can be easily implemented by putting each new page at the tail of a list in memory, and removing pages from the head.
The FIFO algorithm should be included in this list for completeness, but it is not used in practice. There is no particular reason to assume that the page which was loaded longest ago is less likely to be accessed soon than a page which was just loaded.
We can improve on the NRU algorithm. The text describes a second chance algorithm and a clock algorithm, but the latter is just an implementation of the former. The clock algorithm keeps all of the current pages in a circular list. There is a pointer which moves around this list like the hand of a clock. When a page fault occurs, the system examines the page that the hand is currently pointing to. If the referenced bit is off, this page is replaced in memory (copying the contents back to disk if the dirty bit is on). If the referenced bit is on for that page, it is set to off and the hand continues around the list. This continues until it finds a page where the referenced bit is off. The algorithm is guaranteed to find such a page, because the list is circular, and the hand will eventually get back to the page where it started, and this page will have the referenced bit off.
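Here is a sketch of the clock algorithm in Python (the class and method names are mine; disk I/O for dirty pages is omitted):

```python
class ClockReplacer:
    """Clock (second chance) page replacement over a fixed set of frames."""

    def __init__(self, nframes):
        self.pages      = [None] * nframes   # page loaded in each frame
        self.referenced = [False] * nframes  # referenced (R) bit per frame
        self.hand       = 0                  # the clock hand

    def touch(self, page):
        """Hardware sets the referenced bit whenever a page is accessed."""
        self.referenced[self.pages.index(page)] = True

    def replace(self, newpage):
        """Advance the hand past referenced pages, clearing their bits,
        and evict the first page found with its referenced bit off."""
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        self.pages[self.hand] = newpage
        self.hand = (self.hand + 1) % len(self.pages)
        return victim

# Example: frames hold pages A, B, C; A and C have been referenced recently.
clock = ClockReplacer(3)
clock.pages = ["A", "B", "C"]
clock.touch("A")
clock.touch("C")
victim = clock.replace("D")   # the hand skips A (clearing its bit) and evicts B
```

Because the sweep clears referenced bits as it goes, the loop in `replace` always terminates: in the worst case the hand returns to its starting frame, whose bit it has just cleared.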
It goes without saying that the referenced bit is turned on whenever a page is referenced.
This algorithm has an advantage over the NRU algorithm, because when it finds a page frame with the referenced bit off, it can be certain that this page has not been referenced in one full sweep across all of the pages, whereas with the NRU algorithm, if the system finds a page with the reference bit off, there is no guarantee regarding how long ago it was referenced, only that it has not been referenced since the last clock interrupt.
The clock algorithm or variants of it are used in many real operating systems.
It was stated above that the Least Recently Used (LRU) algorithm would be very efficient, but that it would be difficult to implement in practice. The text talks about two ways to implement LRU in hardware.
One method is for the system to maintain a counter which is incremented on every memory reference. Each page table entry needs a field large enough to hold this counter. Whenever a particular page is referenced, the counter is incremented and its value is copied into that page's entry.
When it is time to replace a page in memory, the system checks this value in each entry and replaces the page where the value is lowest.
Here is an example. Suppose that the value of the counter at the start of the scenario is 1000 (in practice the counter field should be very large; the text suggests 64 bits). There are eight pages in memory, numbered one to eight. Here is a string of page references.
6, 7, 3, 8, 8, 1, 4, 2, 5, 6, 1
After executing this string, here is the value of the counter for each of the eight pages.
Page number | Counter value |
---|---|
1 | 1011 |
2 | 1008 |
3 | 1003 |
4 | 1007 |
5 | 1009 |
6 | 1010 |
7 | 1002 |
8 | 1005 |
If a page fault occurred at this point in time, and one of these eight pages had to be replaced, the system would look for the lowest value of the counter, see that it was page 7, and it would replace that page.
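This bookkeeping is easy to simulate; the following Python sketch reproduces the scenario above:

```python
# Counter starts at 1000; on every reference it is incremented and its value
# is copied into the referenced page's table entry.
counter = 1000
last_use = {}                    # page number -> latest counter value

for page in [6, 7, 3, 8, 8, 1, 4, 2, 5, 6, 1]:
    counter += 1
    last_use[page] = counter

victim = min(last_use, key=last_use.get)   # page with the lowest counter
# victim is page 7, whose counter value is 1002.
```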
This algorithm can only be implemented if there is hardware support for it in the MMU, and few if any hardware developers have implemented such an algorithm.
A close approximation of the LRU algorithm can be implemented entirely in software. The text refers to this as Not Frequently Used (NFU) and illustrates it with a variant called aging. It works like this. Each page has a counter associated with it. Recall that operating systems have a clock interrupt which occurs roughly every 20 msec. At each clock interrupt, each counter is shifted right by one bit (this is equivalent to dividing by 2; the least significant bit is lost). If the page has been referenced since the last interrupt, the most significant bit of the counter is set to one, otherwise it is set to zero. Then all of the referenced flags are set back to zero.
Here is an example. There are six pages in memory, and each counter is eight bits. (Eight bits is usually enough.) During each time slice (the period of time between clock interrupts), the following pages are referenced. For example, during the first time slice, pages 2 and 4 were referenced; the others were loaded but not referenced.
Time slice | Pages referenced |
---|---|
1 | 2,4 |
2 | 1,2,3,4 |
3 | 2,3,4,5,6 |
4 | 1,2 |
5 | 1,2,5,6 |
6 | 2 |
7 | 1,2,6 |
8 | 1,2,5 |
After the eighth time slice, the counters hold the following values.
Page number | Counter (decimal) |
---|---|
1 | 11011010 (218) |
2 | 11111111 (255) |
3 | 00000110 (6) |
4 | 00000111 (7) |
5 | 10010100 (148) |
6 | 01010100 (84) |
Note that any page which was referenced in the last clock cycle will have a higher value than a page not so referenced, because its most significant bit will be one, and the most significant bit of any page not referenced in the last cycle will be zero. Likewise, any page referenced in the past two cycles will have a higher value than a page which has not been referenced in the past two cycles, and so on.
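The aging computation is easy to verify with a short Python sketch that reproduces the example above:

```python
# Six pages, 8-bit counters, eight clock ticks. Each entry in refs_per_tick
# is the set of pages referenced during that time slice (from the table above).
refs_per_tick = [{2, 4}, {1, 2, 3, 4}, {2, 3, 4, 5, 6}, {1, 2},
                 {1, 2, 5, 6}, {2}, {1, 2, 6}, {1, 2, 5}]

counters = {page: 0 for page in range(1, 7)}

for referenced in refs_per_tick:
    for page in counters:
        counters[page] >>= 1              # age every counter (divide by 2)
        if page in referenced:
            counters[page] |= 0b10000000  # set the MSB if referenced this tick

# counters is now {1: 218, 2: 255, 3: 6, 4: 7, 5: 148, 6: 84},
# matching the table above; page 3 would be the replacement victim.
```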
Working Sets
All of the above models of paging use demand paging. In demand paging, a page is only loaded when it is needed, and a page fault always loads exactly one page. When a process is first created, none of its pages are in memory, and so there will always be a number of page faults when a process starts. After a while, enough pages will be loaded into memory so that the process runs for long periods with few page faults. The set of pages which need to be in memory in order for the program to run for an extended period with few page faults is called the working set.
Operating systems which are running at or near capacity will swap out processes, that is, remove all of their pages from memory and remove the job from the running queue. At some point a swapped out job will be swapped back in, that is, returned to the running queue. If the system is using demand paging, such a job will again generate a number of page faults before having enough pages in memory to run. Thus, it would make sense to keep track of the pages of a process which were in memory when the job was swapped out, and when the job was swapped back in, restore all of these pages to memory before starting to run the job. Loading pages before they generate a page fault is called prepaging, and this model is called the working set model because it restores the entire working set of a process before running it.
This is more complicated than it seems. The operating system needs to keep track of which pages have been recently accessed. It is possible, indeed likely, that a process may have pages loaded which it has not accessed for a while, but which have not been replaced. These pages should not be considered to be a part of the working set.
We can define the working set w(k,t) at time t as the set of pages that have been referenced by a particular process within the past k memory references. Up to a point, as k increases, the size of the working set increases, but this function levels off quickly.
In order to implement this, the system would need to update a register at each memory reference, and this is far too costly to be done even in hardware. However, it is possible to have a reasonable simulation. A field for time of last use is associated with each page (although this cannot be kept entirely current). As with other page replacement algorithms, each page also has a referenced bit. At each clock tick (about once every 20 msec), the system clears this bit (sets it to zero), and each page reference sets it to one.
All operating systems keep track of the total running time of a process. When a page fault occurs, the system checks all of the pages associated with the current process. If the referenced bit is on, meaning that the page has been referenced since the last clock tick, the time of last use field is updated to the current value of the total run time of the process. If the referenced bit is zero, the value in time of last use field is compared to the current total time value, and if the difference is greater than some value, tau, this page is removed.
Here is an example. At an instant in time, the currently running process has six pages in memory. The total running time of the process is now 1000 msec (this value is always an approximation, since it is only updated at clock interrupts). The value of tau, set by the system, is 50 msec, meaning that pages which have not been referenced in the past 50 msec of run time for that process are candidates to be replaced because they are no longer in the working set. Here are the values of the six pages when a page fault occurs. Clearly the page table would have other data as well, but I only show the data relevant to this algorithm.
Page number | Time of last use | Referenced flag |
---|---|---|
1 | 980 | 1 |
2 | 980 | 0 |
3 | 960 | 0 |
4 | 980 | 1 |
5 | 930 | 1 |
6 | 940 | 0 |
The system needs to find a page to replace. The first page has been referenced since the last clock interrupt, so it is clearly not a candidate for replacement. Its Time of last use field is updated to 1000 (the current running time value). The second page has not been referenced in this clock cycle, but the current time (1000) minus its time of last use (980) is less than tau (20 is less than 50), so it is not a candidate for removal. The same is true of the third page. Pages 4 and 5 have been recently referenced, so their time of last use field is updated to 1000. Page six has not been referenced since the last clock tick (its referenced bit is zero) and the current time minus its time of last use is 60 msec, which is greater than tau, so it is no longer in the working set and it is a candidate for replacement.
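A Python sketch of this scan, reproducing the table above (field names are mine; the referenced bits themselves are cleared at clock ticks, not here):

```python
TAU = 50   # msec of process run time; pages older than this leave the working set

def scan_working_set(pages, current_time):
    """On a page fault, scan the process's pages: pages referenced since the
    last clock tick get a fresh time of last use; unreferenced pages whose
    age exceeds TAU are candidates for replacement.
    pages maps page number -> [time_of_last_use, referenced_bit]."""
    candidates = []
    for num, entry in pages.items():
        if entry[1]:                          # referenced since the last tick:
            entry[0] = current_time           # still in the working set
        elif current_time - entry[0] > TAU:
            candidates.append(num)            # aged out of the working set
    return candidates

# The table above: page -> [time of last use, referenced flag], at time 1000.
pages = {1: [980, 1], 2: [980, 0], 3: [960, 0],
         4: [980, 1], 5: [930, 1], 6: [940, 0]}
# scan_working_set(pages, 1000) identifies page 6 as the only candidate.
```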
This algorithm works well, but there is a lot of overhead associated with it because at each page fault, every page associated with that process has to be examined and possibly updated. This algorithm can be combined with the clock algorithm to get an efficient page replacement strategy which is widely used in practice. It is called the WSClock algorithm.
In this algorithm, the pages associated with a process are kept in a circular list. Each page has a referenced bit, a modified bit (dirty bit), and a time of last use field, as well as a pointer to the physical page address. At each page fault, the list is searched to find a suitable candidate for replacement. Searching starts where it ended at the last page fault (picture the hand of a clock). If a page's referenced bit is on, it is cleared and the hand moves on. If the referenced bit is off and the page's age (the current time minus its time of last use) is greater than tau, the page is a candidate for eviction because it is not in the working set. The algorithm differs from the one above in that not all pages are searched; searching stops when a suitable page for replacement is found. Also, if a page is not in the working set but its modified bit is on, meaning that the page cannot be replaced until its contents are written to disk, then rather than performing the write immediately, the write to disk is scheduled and the hand moves on, looking for a better candidate.
Here is a summary of page replacement algorithms.
Design Issues in Paging
There are a number of design issues for paging algorithms that can impact performance.
Local vs. Global Paging Strategies
If there are a number of runnable processes loaded into memory, and a page fault occurs, the system has to decide which page to eliminate. It can consider all pages in memory, i.e. the pages for all processes, or it can only consider those pages which are a part of the currently running process, which presumably is the process which produced the page fault. The former strategy is called a global page replacement strategy, while the latter is called a local page replacement strategy.
Local page replacement strategies have the effect of keeping a more or less constant number of pages in memory for a given process. This may seem fair, but there is no reason to assume that all of the processes will need about the same number of pages. Also, during a run of a process, the number of pages that it needs may grow or shrink.
In general, global replacement strategies work best. However, if the page replacement algorithm is one of the working set algorithms, then it would have to use a local strategy.
Page Size
An important design decision for an operating system is the page size. There are tradeoffs to having larger (or smaller) pages.
A smaller page size means less internal fragmentation. Internal fragmentation is wasted space in a page frame. If a page is 4K, but the program only needs 1K of data, three K are wasted. Also, if page size is small, it is likely that more of what is loaded into memory at any given instant will be exactly what is needed, whereas with larger pages, it is possible that a part of the contents of many of the pages will not be referenced.
On the other hand, a smaller page size means that there are more pages, which means a larger page table. If there are fewer pages per process, a page reference is more likely to hit in the TLB. Also, as we will see next week, the additional time to load large pages is far less than proportional; it takes essentially the same amount of time to load a large page as a small page. When a new process is started, larger pages enable the memory management system to load a working set of pages faster. However, simulations of paging seem to show that a larger number of smaller pages produces fewer page faults than a smaller number of larger pages.
Over time, the page size of operating systems has grown larger. The typical page size now is 4K, although many systems now have pages of 8K. (Page sizes are almost always a power of 2. Why?).
Shared Pages
Several processes can share the same page. For example, when a fork occurs on a Unix system, logically the parent process is duplicated but in practice, the code segment is not copied; both the parent process and the child process use the same pages. This saves space in memory and increases the efficiency of a fork system call. However, it complicates the management of paging because the system needs to keep track of how many processes are using the same page.
In some cases, both the parent and child share data segments as well. As long as neither process changes the values of any variables, this is efficient. This is called copy on write because when either the parent or the child updates one of its pages, the system is required to make another copy of this page, one for the child and one for the parent. Shared pages with a copy on write strategy works particularly well when the child immediately calls exec because essentially no copying of the parent is done at all. When the child process is created, its page tables are given the same values as the parent and no new memory is allocated.
Locking pages in memory
Most paging systems have a mechanism in which pages can be pinned in memory. A page which is pinned is not a candidate to be overwritten, even if it has not been referenced recently. For example, some kernel pages should always be in memory. The pages which contain the code for memory management are an obvious example.
If an operating system uses Direct Memory Access (DMA) for I/O, in which I/O devices copy data directly to or from the memory space of an application, then the page(s) which contains this memory must be pinned, even if the process itself is blocked for an extended period of time.
Stack Algorithms and Belady's Anomaly
It should be obvious that the more pages are allocated to a process, the fewer page faults will occur, and in general this is true. However, it is possible to contrive strings of page references where this is not the case. Consider this string of page references.
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
If the page replacement algorithm is FIFO and there are three page frames of memory, the reference stream would look like this, where the asterisks indicate page faults.
1* 2* 3* 4* 1* 2* 5* 1 2 3* 4* 5
Note that with a memory size of 3, there are nine page faults. Here is the same string with a memory size of 4.
1* 2* 3* 4* 1 2 5* 1* 2* 3* 4* 5*
Note that there are ten page faults with a memory size of four. This is known as Belady's anomaly, and it is a potential problem, because it should not be the case that increasing the resources available to a program degrades performance.
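A short Python simulation of FIFO replacement (a sketch) confirms the fault counts:

```python
from collections import deque

def fifo_faults(refs, nframes):
    """Count page faults for FIFO replacement with nframes page frames."""
    frames, faults = deque(), 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) == nframes:
                frames.popleft()        # evict the page loaded longest ago
            frames.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
# fifo_faults(refs, 3) gives 9 faults, while fifo_faults(refs, 4) gives 10:
# Belady's anomaly.
```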
This problem arises because the set of pages loaded with three pages of memory is not necessarily the same pages that are loaded with a memory size of four pages. For example, when page 5 is first referenced, page 1 is left in memory when size is three, but is replaced when size is 4.
Belady's anomaly does not occur if the set of pages which is loaded with the larger memory size always includes the set of pages which would be loaded with the smaller memory size. Such algorithms are called stack algorithms. Most of the algorithms that were described in the previous lesson are stack algorithms, and so in practice, Belady's anomaly is not an issue.
Paging and Segmentation
All of the above discussion treats the logical image of a process as a more or less large contiguous block, divided into fixed sized pages. Many processor architectures, including the Intel series, divide a process into variable sized segments. Often each procedure is in its own segment. With a segmented architecture, the computation of a logical address involves two addresses, the segment base, which is customarily loaded into a register, and the offset from the base.
Segments differ from pages in two ways. First, they are of variable sizes. Second, the contents of any particular segment are logically related; for example, it might be the code for a single function. This can be contrasted with a pure paging system in which any one page may have the code for several unrelated functions.
It might make sense to have a virtual memory system based on segments rather than pages. Such a system would have a segment table which corresponds to the page table in a pure paging system. A page fault would then result in a new segment being loaded. The segment table would then have to have an additional field for the segment size. Such a system has the advantage that essentially all of the code loaded into physical memory would be exactly what is needed, because, for example, all of the code for a particular function would be loaded at once when that function was called.
However, designing a paging system with variable page sizes is more difficult. It can result in many small holes in memory, which is known as external fragmentation. Also, it turns out that many segments are quite small; often they are substantially smaller than the typical page size of modern systems. As a result of these two problems, such a system is not used today.
The Intel processors use a combination of segments and paging, in a fairly complex system. The architecture makes reference to a Global Descriptor Table (GDT), whose address is stored in a specific register, the GDTR. In addition, each process has its own Local Descriptor Table (LDT). The former contains information about segments for the operating system itself, while the latter contains information about segments for a particular process (code segments, data segments, stack segments etc).
To calculate an address, the system loads a segment base from the LDT or GDT into a register. A complete address is then calculated by adding the offset to the address in the base register. This is called a linear address. The operating system can either enable or disable paging by setting a bit in a control register (the PG bit of the CR0 register on Intel processors). If paging is disabled, a linear address is treated as an absolute physical address. If paging is enabled (as is the case for any operating system that we discuss), then the linear address is in turn translated by the page tables as described in the first lesson of this week.
Program Design Issues
Cache memory and paging only work because of the principle of locality. There are two trends in programming which tend to weaken this principle, thus leading to more page faults.
Paging in Unix and Windows
Windows
In recent Windows operating systems every process has 4 GB (2^32 bytes) of virtual address space. This is much larger than the physical memory of most computers, and also much more memory than most processes will need. In fact, half of this address space is reserved for addresses used by the operating system (supervisor space).
When an executable (a .exe file) is created by the linker, the number of pages needed by the code and by the global variables is known, but a program needs more memory as it executes, because the run time stack grows (and shrinks) and the program can dynamically request more memory with malloc or new. There are two phases to this. A region of virtual memory (a block of pages) can be reserved; this does not cause anything to be written to the page table. Pages are only allocated when a region is committed. An example of pages which are reserved but not committed is the pages in the stack space beyond the current top of the stack. When the stack grows into a new page, that page is committed and an entry is made for it in the page table.
Pages are reserved in blocks. The block size is typically 64K, but can be larger or smaller. By default a thread has 1MB of stack size reserved, although this can be overridden with the CreateThread call (recall that one of the arguments was the stack size).
Windows uses a two level page table. All virtual addresses are 32 bits. On an Intel processor, the most significant 10 bits refer to an entry in the page directory. Each process has one page directory and its entries are called page directory entries (PDEs). Each PDE contains the address of a page table. The next ten bits of the address refer to an entry in the page table pointed to by the entry in the page directory. There are potentially 1024 (2^10) page tables for each process, but most processes will have only a few page tables. The last 12 bits of the address are the offset within the page, so the page size is 4K.
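The bit manipulation is easy to show in Python (a sketch; `split` is my own name, not a Windows API):

```python
def split(vaddr):
    """Split a 32-bit virtual address into its two-level paging fields:
    top 10 bits index the page directory, the next 10 bits index the
    selected page table, and the low 12 bits are the offset in the page."""
    pde    = (vaddr >> 22) & 0x3FF   # page directory index (1024 entries)
    pte    = (vaddr >> 12) & 0x3FF   # page table index (1024 entries)
    offset =  vaddr        & 0xFFF   # byte within the 4K page
    return pde, pte, offset

# split(0xC0000000) -> (768, 0, 0): the start of the upper 1 GB of the
# address space lands at page directory entry 768.
```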
Windows uses a working set with a clock algorithm for page replacement. Each process is given a default working set size between 20 and 50 pages, and the working set size is not allowed to grow above a maximum set by the system administrator. The size of the working set for a particular process is computed dynamically by the system based on the availability of memory (if there are a lot of free pages, the system can increase the working set size of a process without other processes suffering a performance penalty) and the number of page faults. Ironically, if a process generates too few page faults and the system is heavily used, its working set size may be reduced. Likewise, if a process has many page faults, its working set size is increased.
Page replacement strategies are local, not global.
When a page is requested, the system loads the requested page and a small number of pages which follow it. By the principle of spatial locality, it is likely that these pages will be accessed soon, and the cost of loading additional pages is small once the first page has been found.
You can view the performance of the Windows memory system by invoking the Task Manager (hit Ctrl-Alt-Del and choose Task Manager from the popup window) and then choosing the Performance tab.
Programmers can control much of the behavior of the paging system with Win32 APIs. For example a program can reserve or commit memory with the VirtualAlloc function. A program can lock a page or pages in memory with the VirtualLock function (or release them with VirtualUnlock).
A single program can have more than one heap. The API

HANDLE HeapCreate(DWORD flOptions, SIZE_T dwInitialSize, SIZE_T dwMaximumSize)

returns a handle to a new heap. One reason why a programmer might want to do this is a dynamic data structure such as a tree: if all of the nodes are allocated in their own heap, the entire structure can be freed at once with

BOOL HeapDestroy(HANDLE hHeap)
Unix
There are many flavors of Unix, each with its own virtual memory system. By necessity, the virtual memory system must be tightly coupled to the hardware.
Solaris has two separate memory management schemes, one for the kernel and one for user processes. Most kernel pages are locked in memory and thus not technically a part of the paging system. The user process memory management system is a paging system as described above. It uses pure demand paging and a global replacement strategy, with no attempt to preload pages or maintain a working set for a process. It uses several data structures, including the page frame data table described below.
The page daemon uses a two handed clock algorithm on the page frame data table for page replacement. Like other page replacement algorithms, it uses the referenced bit of the page table. This is set to zero when the page is loaded, and set to one when it is referenced. The front hand sweeps through the list of page frames, setting the referenced bit back to zero. The back hand sweeps through some time later, checking the referenced bit. If it is still zero, meaning that the page has not been referenced since the front hand passed it, that page is placed on a list to be paged out. The rates at which the hands sweep through the page frames and the distance between the two hands can both be set by the system administrator. Note that the further apart the two hands are, the more this algorithm resembles the classic clock algorithm.
When the page daemon runs the two handed clock algorithm, it does not necessarily go through the entire page frame data table; it runs until enough pages have been marked for deletion (the default is 25%).
Unix has two system calls, brk and sbrk, to deal with memory management. These change the amount of space allocated for the calling process's data segment. The change is made by resetting the process's break value and allocating the appropriate amount of space. The break value is the address of the first location beyond the end of the data segment. The amount of allocated space increases as the break value increases.
Library routines such as malloc and the C++ new operator use brk and/or sbrk as a part of their memory management strategy. Ordinary user programs should not call brk or sbrk directly, particularly if the program also uses malloc or new.