CSCI.4210 Operating Systems Fall, 2009 Class 8
Process and Thread Scheduling

Process Scheduling

If there are several runnable jobs, the operating system has to decide which job to run next; this is known as process scheduling.

In olden days, when computers ran batch jobs, this was not an issue. The computer operator simply submitted the jobs in the order that they were delivered to him or her, and each job ran to completion. We can call this algorithm First come first served, or FIFO (first in first out).

However, even this primitive system had problems. Suppose there are five jobs waiting to be run. Four of the five jobs will take about ten seconds each to run, and one will take ten minutes, but the ten minute job was submitted first. In a FIFO system, the four fast jobs will all be held up for a long time by a large job that happened to be delivered first.

In a batch system, this was not serious since jobs were never interactive. However, if we knew ahead of time how long each job would take, we could maximize throughput. In fact, some computer centers in the '60s could actually do this. When a user submitted a job, he or she was also asked to specify an estimated run time. The operator would then allow a job to run only for that amount of time (or perhaps a few seconds longer). This was important, because even then, programmers could write programs with infinite loops. If a program exceeded its estimated run time, the operator killed it.

This permitted the operator to run jobs using a shortest job first algorithm. As the name implies, instead of running jobs in the order that they are delivered, the operator would search through all available jobs and run the job which had the shortest run time. This provably minimizes the average time to completion. A simple example will demonstrate this.

Suppose four jobs arrived at about the same, but not exactly the same, time. We happen to know exactly how long each job will take. Here are the run times of the four jobs, in the order that they arrived.

Job One 25
Job Two 10
Job Three 20
Job Four 15

If the jobs are run with a FIFO algorithm, here are the total times to completion for each of the four jobs (assuming that it takes no time to load a job).

Job One 25
Job Two 35
Job Three 55
Job Four 70

The average time to completion was 46.25 seconds ((25 + 35 + 55 + 70) / 4).

If the jobs are run shortest job first (Job Two, Job Four, Job Three, Job One), here are the total times to completion of the four jobs.

Job One 70
Job Two 10
Job Three 45
Job Four 25

It still takes 70 seconds to run all four jobs, but the average time to completion for a job was 37.5 seconds ((70 + 10 + 45 + 25) / 4).
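
The same arithmetic can be expressed as a short program. The sketch below is illustrative only (it is not part of the original example, and the function names are assumptions); it computes the average completion time of the four jobs under both orderings.

/* Compare FCFS and shortest job first for the four example jobs. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Run the jobs in the order given and return the average completion time. */
static double average_completion(const int runtime[], int n)
{
    int clock = 0, total = 0;
    for (int i = 0; i < n; i++) {
        clock += runtime[i];   /* job i finishes at this time */
        total += clock;
    }
    return (double)total / n;
}

int main(void)
{
    int fifo[] = {25, 10, 20, 15};          /* arrival (FIFO) order */
    int sjf[]  = {25, 10, 20, 15};
    qsort(sjf, 4, sizeof sjf[0], cmp);      /* shortest job first   */

    printf("FIFO average: %.2f\n", average_completion(fifo, 4)); /* 46.25 */
    printf("SJF  average: %.2f\n", average_completion(sjf, 4));  /* 37.50 */
    return 0;
}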

Aside: Asking users to estimate the run time of their jobs put them in somewhat of a bind, because often they did not know exactly how long the job would take. If they submitted a high estimate, their job would almost certainly run to completion, but it might be delayed by the operator. On the other hand, if they submitted a somewhat lower estimate, the operator would start running it sooner, but there was a greater chance that it would time out before completing.

Process scheduling on a modern multiprogramming operating system is far more complex. Recall our state diagram for the states of a process (running, ready to run, and blocked).

The earlier discussion made reference to a kernel process called the scheduler or the dispatcher. This is the process which decides which process to run next. The scheduler has a number of competing demands on it.

Efficiency
Keeping the CPU and other computer resources as busy as possible. If the CPU is idle but there are jobs to run, everything is slowed down.
Minimizing response time for interactive jobs
Modern computers are interactive; there are users waiting for responses to their commands. It is important to give priority to jobs for which a user is waiting for a response.
Maximizing throughput
Throughput is the number of jobs completed per unit time. Clearly this should be as high as possible.
Minimizing turnaround time
The average time to complete jobs should be as low as possible.
No starvation
Even long running batch jobs should be making progress.
Low overhead
The process scheduler itself is a process, but this process should not take too much time. We can use the term overhead to refer to all of the time which is taken by the process scheduler and by the work involved in switching from one process to another.
Enforcing priorities
On a large system, some jobs may be run at a higher priority than others, because their owner is more important or they are doing more important work. Scheduling algorithms have to take these priorities into account.
Enforcing real-time deadlines
A real time process is one which has strict deadlines because it is impacting events in the world. An example is a process which is displaying a video in which each frame has to be loaded within a strict schedule in order that it not appear jittery. A system which is running many processes, some of which are real time processes, has to make sure that the real time process deadlines are met.
Modern process schedulers are preemptive; this means that it is possible for a running job to be replaced by another process before it has completed. Here are some reasons why a process might be preempted: it has used up its time slice, a higher priority process has become ready to run, or an interrupt has occurred that needs to be serviced.

The process of switching from one job to another is called a context switch. At a minimum, all that is involved is saving the register values of the process which is being preempted, copying in the register values of the new process which is about to run, and changing the Program Counter (the register in the CPU which contains the address of the next instruction to be run). In some cases, however, a context switch may involve loading pages into memory or other much more time consuming tasks. In the interest of minimizing overhead, context switches should take very little time.
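
For a concrete picture of what is saved and restored, the POSIX <ucontext.h> routines let an ordinary user program switch between saved execution contexts. The sketch below is only a user level illustration, assuming swapcontext() is available (it is on most Unix systems); the kernel's own dispatcher does the equivalent work in assembly and saves considerably more state.

/* Save one execution context and resume another, as a context switch does. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];

static void task(void)
{
    printf("task: running, now switching back\n");
    swapcontext(&task_ctx, &main_ctx);   /* save task's state, resume main */
    printf("task: resumed where it left off\n");
}

int main(void)
{
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp   = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link          = &main_ctx;   /* return here when task ends */
    makecontext(&task_ctx, task, 0);

    printf("main: switching to task\n");
    swapcontext(&main_ctx, &task_ctx);       /* save main, run task */
    printf("main: back in main, switching to task again\n");
    swapcontext(&main_ctx, &task_ctx);
    printf("main: done\n");
    return 0;
}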

There are many types of jobs which may be running more or less concurrently. Some jobs are interactive; the user is sitting at the monitor hitting keys and moving the mouse and waiting for a response. Other jobs may be running in the background. For example, a web server has no user, but it still requires relatively rapid response to requests. Still other jobs are pure batch jobs, with no user waiting for an immediate response. On many mainframes, files of transactions are read in and a database is updated, but this does not happen in real time. Kernel processes, such as interrupt handlers, should usually run with very high priority.

We can distinguish two categories of jobs: compute intensive jobs and I/O intensive jobs. Any running process consists of alternating periods of computation and I/O operations. In a compute intensive job, the computation phase is long relative to the I/O phase. In an I/O intensive job, the computation phase is very short; the computer reads a record from a file, for example, performs a trivial computation on it, then reads the next record.

Question: Which type of job should be given priority, a compute intensive job or an I/O intensive job?

Answer: The I/O intensive job should be given priority, because it will run for a very brief period of time before issuing its next I/O command, which will put it to sleep, and then the compute intensive job can run.

Here is a simple example: Two jobs are submitted at the same time; P1 is compute intensive and P2 is I/O intensive. The compute intensive job will require 300 milliseconds of CPU time to run, while the I/O intensive job will perform 3 I/O operations. Each I/O operation takes 60 msec, and the process needs 20 msec to process the data from each I/O operation. There is also an initial startup of 20 msec, so P2 requires a total of 80 msec of CPU time. The time slice is 100 msec.

Consider two scenarios. In the first scenario, the compute intensive job is given priority. In the second scenario, the I/O intensive job is given priority and will preempt a running compute intensive job whenever it is runnable. To keep things simple, we will assume that a context switch takes no time.

In the first scenario, P1 takes a total of 340 msec and P2 takes 440 msec, because P2 spends most of its time waiting behind P1's time slices. In the second scenario, P1 takes 380 msec but P2 takes only 260 msec; P1 loses relatively little time because it can run while P2 is blocked on its I/O operations.

It is important to keep in mind throughout all of this discussion that the scheduler does not know how long a process will run or whether it will be I/O intensive or CPU intensive. Also, the scheduler does not usually know whether a particular job is interactive. The only information that it has is each process's priority and past history.

If we look at the kinds of jobs that are actually run on computers doing real work, it turns out that we can make two useful observations: most CPU bursts are short, and a process's behavior in the recent past is a good predictor of its behavior in the near future.

Earlier we said that the best algorithm for a non-preemptive batch system was a shortest job first algorithm. When we try to generalize this to a preemptive system, we conclude that the algorithm which maximizes throughput is to always run the job which has the least time remaining. In practice, this is impossible to implement because the scheduler has no way of knowing how long a particular job has to run. But we can get some clues from the above two observations.

Scheduling Algorithms and Strategies

Let's consider some simple scheduling algorithms to see how well they meet our criteria.

We have already discussed two non-preemptive algorithms:

First-come-first-served (FCFS) - Just run the jobs as they arrive. This is simple to implement, but it means that a long running job can block a quick job, so it is not very efficient.

Shortest job first - Run the shortest job in the queue first. This is the most efficient non-preemptive strategy, but it is based on the assumption that the system knows how long each job will run, and generally this is not known. Thus, it is not practical.

Preemptive strategies

Essentially all modern operating systems use a preemptive scheduling strategy, so we can discuss several such algorithms.

Round Robin

A simple yet fairly effective scheduling algorithm is called Round Robin. This is a variant of the FCFS algorithm discussed above with a time slice preemption added. In a Round Robin scheduler, the processes which are ready to run are kept in a FIFO queue. A running job continues running until it terminates, it becomes blocked, or it uses up its time slice. When its time slice is used up, it is put on the end of the ready queue. When sleeping processes are awakened or when new processes are started, they do not preempt the currently running job; they are put on the end of the queue.

The Round Robin scheduling algorithm has the following advantages: it is simple to implement, every ready process gets a turn at the CPU, and no process can starve.

It has the following disadvantages: short jobs can spend a long time waiting behind longer jobs ahead of them in the queue, so average turnaround time is worse than it would be with shortest job first, and frequent time slice expirations add context switch overhead.

Let's compare Round Robin to our idealized (run shortest remaining job first) algorithm. Suppose that we have four jobs, with the following times remaining.
Job 1 40
Job 2 10
Job 3 30
Job 4 20
Here is how the timing would look with Round Robin scheduling using a time quantum of 10. Job 1 runs from 0 to 10, Job 2 from 10 to 20 (finishing), Job 3 from 20 to 30, Job 4 from 30 to 40, Job 1 from 40 to 50, Job 3 from 50 to 60, Job 4 from 60 to 70 (finishing), Job 1 from 70 to 80, Job 3 from 80 to 90 (finishing), and Job 1 from 90 to 100 (finishing).

The average time to completion is (100 + 20 + 90 + 70) / 4 = 70 msec. With shortest job first scheduling, the jobs finish at 10, 30, 60, and 100 msec, so the average time to completion of the four jobs is 50 msec.
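
The timing above can be reproduced with a short simulation. The following sketch is illustrative code, not part of the original notes (the array based ready queue and the variable names are assumptions made for the example); it runs the four jobs with a quantum of 10 and prints their completion times.

/* A minimal round robin simulation; context switch time is ignored. */
#include <stdio.h>

int main(void)
{
    int remaining[] = {40, 10, 30, 20};     /* Job 1 .. Job 4 */
    int finish[4]   = {0};
    int queue[64], head = 0, tail = 0;
    int clock = 0, done = 0, quantum = 10;

    for (int i = 0; i < 4; i++)             /* all jobs are on the ready */
        queue[tail++] = i;                  /* queue at time 0           */

    while (done < 4) {
        int j   = queue[head++];            /* take the job at the front */
        int run = remaining[j] < quantum ? remaining[j] : quantum;
        clock        += run;
        remaining[j] -= run;
        if (remaining[j] == 0) {            /* job finished              */
            finish[j] = clock;
            done++;
        } else {
            queue[tail++] = j;              /* back of the ready queue   */
        }
    }

    int total = 0;
    for (int i = 0; i < 4; i++) {
        printf("Job %d finished at %d\n", i + 1, finish[i]);
        total += finish[i];
    }
    printf("average completion time: %d\n", total / 4);   /* prints 70 */
    return 0;
}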

The size of the time quantum is an important consideration. Our examples have ignored the time to perform a context switch, but in practice there is some overhead associated with switching from one running process to another. If the goal is to minimize overhead, then it is better to have a large time quantum. In practice most processes will not use up their entire quantum because they will block on an I/O operation or a mutex.

However, if the time quantum is large, the turnaround time on short jobs will be longer because they will spend more time waiting for CPU intensive jobs to complete their time slice.

Multiple queues based on priority

If job priority is an important criterion, as it is in real time systems for example, another algorithm is to have several queues based on priority. High priority jobs are placed on one queue, low priority jobs on another. High priority jobs are run first. This algorithm runs the risk that low priority jobs might starve if there are enough high priority jobs; one solution is to raise the priority of jobs that are ready to run but have not received any CPU time recently.

Multiple queues based on prior usage (feedback)

A variant of this is to assign jobs to different priority queues depending on recent CPU usage. Consider a simple system with two queues. Newly started jobs, or jobs that have just been awakened after being blocked, are placed on the high priority queue. A job which is preempted because it used up its time slice is placed on the lower priority queue. Jobs on the high priority queue are run first. If there are no jobs on this queue, then jobs on the low priority queue are run. Each queue operates on a FIFO basis.

This algorithm has a number of advantages over a simple round robin scheduler. Interactive processes are given priority because they tend to be either very short or to spend most of their time blocked waiting for input. The disadvantage of this is that long running, CPU intensive jobs can starve if there are enough interactive jobs. One solution to this problem is to have the scheduler periodically scan the low priority queue to look for jobs which have not received any CPU time recently and move them to the higher priority queue.

If two queues are better than one queue, perhaps three or more queues are better than two. With multiple queues, a job on, say, queue 2 which uses up its time slice is then placed on queue 3 (lower numbers mean higher priority), and so on. Thus a long, compute intensive job will quickly migrate to a lower priority queue. In order to avoid starvation, there needs to be a mechanism to move low priority jobs which have not received any CPU time for a while to a higher priority queue.

Yet another variant of this, which helps to address the problem of low priority jobs starving, is to vary the time slice depending on the priority. Low priority jobs run less often, but when they get to run, they are given a longer time slice.

Scheduling systems which use multiple queues based on prior usage are sometimes called feedback models.
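
The queue movement rules described above can be sketched in a few lines of code. This is purely illustrative; the number of queues, the quantum sizes, the aging threshold, and the function names are all assumptions made for the example, not any particular system's implementation.

/* Illustrative multilevel feedback queue bookkeeping. */
#include <stdio.h>

#define NQUEUES 3

struct task {
    const char *name;
    int queue;          /* 0 = highest priority                 */
    int ticks_waiting;  /* ticks since this task last ran       */
};

/* Quantum grows as priority drops, as described in the text. */
static const int quantum[NQUEUES] = {10, 20, 40};

/* A task that used its whole quantum is demoted one level. */
void used_full_quantum(struct task *t)
{
    if (t->queue < NQUEUES - 1)
        t->queue++;
}

/* A task that blocked (e.g. on I/O) goes back to the top queue on wakeup. */
void woke_up(struct task *t)
{
    t->queue = 0;
}

/* Anti-starvation: periodically boost tasks that have waited too long. */
void age(struct task *t)
{
    if (t->ticks_waiting > 100 && t->queue > 0) {
        t->queue--;
        t->ticks_waiting = 0;
    }
}

int main(void)
{
    struct task t = {"cruncher", 0, 0};

    used_full_quantum(&t);   /* CPU bound: demoted after each full quantum */
    used_full_quantum(&t);
    printf("%s now on queue %d (quantum %d)\n",
           t.name, t.queue, quantum[t.queue]);   /* queue 2, quantum 40 */

    woke_up(&t);             /* blocked on I/O, then awakened */
    printf("%s boosted back to queue %d\n", t.name, t.queue);
    return 0;
}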

Calculating CPU usage with decay

A multiple queue system assigns jobs a priority based on CPU usage. One simple but poor solution is to just assign priority based on total CPU usage so far. This effectively starves long running jobs. A second solution is to base priority only on CPU usage during the recent past, the past second for example. Thus, a job which received no CPU time in the past second is treated the same as a brand new job. An even better solution is to use an algorithm which is a hybrid between these two such that recent CPU time reduces priority more than CPU time in the distant past.

For example, suppose we divide time up into discrete periods of, say, one second. We want to count usage in the most recent time period more heavily than usage in the second most recent time period, and so on. This is called a decay function. Each second, a value T is calculated for each current process according to a formula such as the following:

T_N = 0.5 * Time_(N-1) + 0.25 * Time_(N-2) + 0.125 * Time_(N-3) + 0.0625 * Time_(N-4) + ...

In this function, T_N is our total usage indicator at time N. Time_(N-1) is the amount of CPU time that the process used in the last time period, Time_(N-2) is the amount of CPU time that the process used in the second to last time period, Time_(N-3) is the amount of CPU time that the process used in the third to last time period, and so on.

Jobs are prioritized by T: the job with the lowest value of T is run first, the job with the second lowest value of T is run second, and so on.

This algorithm has several advantages: recent CPU usage counts more heavily than older usage, so interactive and I/O bound jobs quickly rise in priority when they become active, while long running jobs are not permanently penalized for CPU time that they used in the distant past.

It would seem that it would require a lot of overhead to compute such a function, but in fact that is not the case. The following function performs this computation.

T_N = P * Time + (1 - P) * T_(N-1)

In this formula, Time is the amount of CPU time that the process used since T was last recomputed, and P is a value between zero and one. If P has the value 0.5, this exactly computes the expansion above. A higher value of P gives more weight to the recent past; in other words, the function decays more quickly.
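
As a small illustration (the function name and the sample usage figures are made up for the example), here is the recurrence applied to a process that used 100 units of CPU in one period and then went idle; its usage indicator decays by half each period.

/* Exponentially decayed CPU usage, recomputed once per period. */
#include <stdio.h>

#define P 0.5   /* weight given to the most recent period */

double update_usage(double T_prev, double recent_cpu_time)
{
    return P * recent_cpu_time + (1.0 - P) * T_prev;
}

int main(void)
{
    double T = 0.0;
    double usage[] = {100.0, 0.0, 0.0, 0.0};   /* busy, then idle */
    for (int i = 0; i < 4; i++) {
        T = update_usage(T, usage[i]);
        printf("after period %d: T = %.2f\n", i + 1, T);  /* 50, 25, 12.5, 6.25 */
    }
    return 0;
}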

Fair-share scheduling

All of the models described above treat each process as independent. However, on a modern multiuser system with threading, this may not be the case. In the discussion of threads, we made a distinction between two different ways to implement threads. On a system which implements user threads, thread scheduling is handled entirely within a process, without the kernel dispatcher knowing anything about the threads. In this case, there is a thread library which is part of the process, and this library contains a scheduling mechanism, but of course it only schedules threads within that process when that process is actually running. The opposite is kernel threads, in which the kernel knows about the multiple threads in a single process and is responsible for scheduling them.

In the latter case, the scheduler might want to schedule threads such that each process gets its fair share of the CPU, in contrast to giving a process with, say, six threads, six times as much run time as a process with only a single thread. This is known as fair-share scheduling (FSS). In a typical FSS system, the system divides the executable threads into related groups, and allocates a fraction of the total CPU time to each group. Scheduling is thus based on a complex priority system which can take into account the recent history of that particular thread, the total usage of all of the threads in that group, and perhaps the priority of the group, and/or the priority of the threads within that group.
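
As a rough illustration of the idea (real fair share schedulers differ in detail, and the weights, field names, and numbers below are assumptions), a fair share priority might charge a thread both for its own recent CPU time and for the CPU time used by the rest of its group, scaled by the group's allotted share.

/* A sketch of one possible fair share priority calculation. */
#include <stdio.h>

struct group  { double group_cpu; double share; };   /* share = fraction of CPU */
struct thread { double own_cpu;  struct group *grp; };

/* Lower value = better priority.  A thread pays both for its own recent CPU
 * time and for the time used by everything else in its group, so a process
 * with six busy threads cannot crowd out a process with one. */
double fss_priority(const struct thread *t)
{
    return t->own_cpu / 2.0 + t->grp->group_cpu / (2.0 * t->grp->share);
}

int main(void)
{
    struct group busy  = { 600.0, 0.5 };   /* six threads, lots of recent CPU */
    struct group quiet = { 100.0, 0.5 };   /* one thread                      */
    struct thread a = { 100.0, &busy }, b = { 100.0, &quiet };

    printf("thread in busy group:  %.1f\n", fss_priority(&a));  /* 650.0 */
    printf("thread in quiet group: %.1f\n", fss_priority(&b));  /* 150.0 */
    return 0;
}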

Swapping

Multiprogramming only works efficiently if all runnable processes are in memory. During times of heavy load, it may be the case that not all processes will fit into memory at the same time. This leads to a lot of memory page faults (exception conditions in which a process or a component of a process needs to be loaded into memory from disk before it can run), which are extremely inefficient. One solution which is adopted by many real world operating systems is to swap out some jobs during times of peak load. This means that entire jobs are copied onto disk and removed from the ready queue, even though they are potentially runnable. This not only frees up memory for higher priority jobs, but it also makes the scheduler work more efficiently because it has fewer potentially runnable processes to consider. Only low priority jobs should be swapped out.

The obvious disadvantage is that swapping can dramatically slow down long running jobs, so there needs to be a mechanism to periodically copy swapped out jobs back to memory and give them some CPU time.

Swapping processes in and out of memory is called medium level scheduling. This can be contrasted with short term scheduling on the one hand, which was the kind of scheduling that was discussed in the rest of this lesson, and long term scheduling, on the other hand, which is deciding which and how many jobs to run. Although many operating systems texts discuss long term scheduling, modern systems usually have little control over when new processes are created, and so it is only relevant for a few large mainframe systems that run mostly batch jobs.

A Final Note

Although process scheduling is a crucial component of the study of operating systems, it is only important for systems which are running many more or less simultaneous processes. In practice, the typical PC or Unix workstation is only running one or at most a small number of active processes at one time, and so process scheduling is not important.

Scheduling in Unix and Windows

In spite of the disclaimer at the end of the last lesson, both of the operating systems that we discuss in this course use scheduling algorithms which are more complex than any of those discussed above.

Unix (4.4BSD)

Unix uses multilevel feedback queues. All runnable processes are assigned a scheduling priority that determines which queue they are placed in. Each queue uses round robin. Priorities are dynamically adjusted.

When a nonrunning process with a higher priority becomes runnable, it immediately preempts the currently running process if that process is in user mode; otherwise the switch happens when the current process exits the kernel.

Unix processes have a priority value called niceness, which ranges from -20 (highest priority) to +20 (lowest priority). The default value for a user process is zero. A user can run a process at a lower priority with the nice command at the shell. For example:
nice a.out
This sets its niceness value to 10. It is called nice because users who use this are being nice to other users by running their jobs at a lower priority. Only the superuser is allowed to increase the priority of a job.

Short term priority of user processes is based on two values: recent cpu utilization and niceness. Every four clock ticks (40 msec), priority is recalculated based on this equation.

priority = PUSER + (recentcputime / 4) + 2 * niceness

recentcputime is incremented each time that the system clock ticks and the process is found to be executing. It is adjusted once per second by a decay factor which causes 90 percent of the cpu usage accumulated over the past second to be forgotten over a period of time which depends on the system load. PUSER is a constant.

recentcputime = ((2 * load) / (2 * load + 1)) * recentcputime + niceness

load is the average length of the run queue. Thus, this function is a decay function as described above, with the rate of decay determined by system load. When the system is lightly loaded, decay is rapid; when the system is heavily loaded, decay is slower.

Each type of job has its own base priority, determined by the constant in the equation. User jobs use the value PUSER; processes running in kernel mode have a higher priority than user jobs, so they use a smaller constant (in this scheme, lower numerical values mean higher priority).
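
Putting the two equations together, a simplified sketch of the recalculation might look like the following. The value of PUSER, the tick accounting, and the function names are illustrative assumptions; this is not the actual kernel source.

/* Sketch of 4.4BSD style priority recalculation and decay. */
#include <stdio.h>

#define PUSER 50    /* base priority constant for user processes (illustrative) */

/* Recomputed every four clock ticks.  Lower numbers mean higher priority. */
int user_priority(int recentcputime, int niceness)
{
    return PUSER + recentcputime / 4 + 2 * niceness;
}

/* Applied once per second; load is the average length of the run queue.
 * The higher the load, the closer the factor is to 1 and the slower the decay. */
int decay(int recentcputime, double load, int niceness)
{
    return (int)(((2.0 * load) / (2.0 * load + 1.0)) * recentcputime) + niceness;
}

int main(void)
{
    int cpu = 80;                                    /* accumulated ticks */
    printf("priority now: %d\n", user_priority(cpu, 0));
    cpu = decay(cpu, 1.0, 0);                        /* light load: fast decay   */
    printf("after one second at load 1: cpu = %d, priority = %d\n",
           cpu, user_priority(cpu, 0));
    cpu = decay(cpu, 4.0, 0);                        /* heavy load: slower decay */
    printf("after one second at load 4: cpu = %d, priority = %d\n",
           cpu, user_priority(cpu, 0));
    return 0;
}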

Linux uses a multilevel feedback queue with 140 levels of priority. Levels 0-99 are used for kernel tasks and real time tasks, and the rest are for ordinary user processes. The time quantum for tasks with priority 0-99 is greater than the time quantum for user tasks.

Process Scheduling in Windows 2000

Windows 2000 also uses a priority driven, preemptive scheduling algorithm. A running job is assigned a time quantum, but the size of the time quantum varies. Unlike Unix, what is scheduled is threads, not processes, and in general no consideration is given to what process a thread belongs to. This means that a process with many runnable threads can receive much more CPU time than a process with only one runnable thread.

Windows 2000 uses 32 priority levels (higher numbers indicate higher priority). The top 16 are called real time levels and the bottom 16 are for ordinary user processes.

A thread in a real time process is assigned a real time priority and this does not change. However, note that in spite of the name, Windows 2000 does not support true real time scheduling (i.e. guaranteed turnaround time); these simply refer to higher priority jobs. Many kernel threads run with real time priority.

The priority of a thread in an ordinary user process changes dynamically. Each thread starts out with its process base priority (the default is 8). However, priority can be increased by certain events, such as the completion of an I/O operation or being released from a wait on a semaphore or event; the size of the boost depends on the event (for example, a thread that was waiting for keyboard input receives a larger boost than one that was waiting for disk I/O).

The scheduler also changes the length of a time quantum, thus effectively increasing the CPU time of certain processes. One of the advantages of integrating the Graphical User Interface with the kernel is that the system knows which window has the focus, and can increase the time quantum for the process or processes associated with that window (foreground processes). This makes sense because these processes are more likely to be interactive.

The dispatcher runs higher priority threads first. A thread that uses up its time quantum has its priority reduced.
