This class will discuss some of the important programming concepts that you will need for this course. Even if you think that you are an excellent programmer, you should pay close attention.
The C Language
Most of the example code for this course will be in C. This is because a lot of what we will be doing will be making system calls, and the system call interface is identical to the C function call interface for both of the operating systems that we will cover in this course. We might look at some Unix kernel source code and most versions of Unix are written mostly in C, with a little assembler.
Much of the Windows operating system code is also written in C, although the source code is not publicly available.
C is very nearly a subset of C++. Most students have been programming in C++ rather than C, so this may take some adjustment. If you have not programmed in C before, here is a brief C tutorial for C++ programmers.
The Unix C compiler that we will use is gcc (originally the GNU C Compiler, now the GNU Compiler Collection).
To use gcc in its simplest form to compile a C program called hello.c, enter this at the command line.
> gcc hello.c
If there are no errors, the prompt will be displayed almost immediately. The name of the executable program will be a.out. To run the program, just type ./a.out at the command line.
Digression: The Unix shell (in fact every Unix process) has an area of memory called the environment, where a number of environment variables are defined. The Unix command printenv displays all of the currently set environment variables. These always take the form
VARNAME=value
By convention, the variable names are in all upper case. Here is a sample
USER=ingallsr
HOME=/cs/ingallsr
TERM=dtterm
One of these environment variables is called PATH. This consists of a list of directories, delimited by colons, where the shell searches for executable files when you type a command. To display the value of PATH, enter
> echo $PATH
The output might be something like this.
/usr/local/bin/:/usr/local/sbin/:/usr/bin/:/usr/local/X11R6/bin/
When you type a command such as a.out, the shell has to find where this executable is. It searches the first directory in the path (/usr/local/bin), then the second (/usr/local/sbin), then the third (/usr/bin), and so on until it finds an executable file of that name, and it executes it. If there is no executable program called a.out in any of the directories in the path, the shell will display command not found. If there are executable files called a.out in more than one of the directories in the path, it will execute only the first one that it finds.
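The search described above can be sketched in C. This is only an illustration under simplifying assumptions, not how any particular shell is implemented: find_in_path is a hypothetical helper, it skips empty PATH entries, and it uses access() with X_OK as its test, which would also accept a directory.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* access(), X_OK */

/*
 * Search the colon-delimited directory list `path` for a file called
 * `name` that we are allowed to execute. On success the full pathname is
 * written to `out` and 1 is returned; otherwise 0 is returned.
 * (find_in_path is a hypothetical helper, not a library function.)
 */
int find_in_path(const char *path, const char *name, char *out, size_t outsz)
{
    const char *start = path;

    while (start != NULL && *start != '\0') {
        const char *end = strchr(start, ':');
        size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);

        /* Build "directory/name"; skip empty entries and ones that do not fit. */
        if (len > 0 && len + 1 + strlen(name) + 1 <= outsz) {
            snprintf(out, outsz, "%.*s/%s", (int)len, start, name);
            if (access(out, X_OK) == 0)
                return 1;              /* the first match wins */
        }
        start = (end != NULL) ? end + 1 : NULL;
    }
    return 0;                          /* "command not found" */
}
```

A caller would normally pass getenv("PATH") as the first argument. Because the function returns at the first hit, a second a.out later in the list is never reached, which matches the behaviour described above.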
The first dot in the command ./a.out refers to the current directory. This tells the shell not to search the directories in the path, but only to search in the current working directory. There is some controversy about whether . should be in the list of PATH directories, and if so, where it should be. One camp says that . should be the first entry in the path. If that is the case, then you can just type a.out, and it will execute the program that you just compiled. Another camp says that the dot should be at the end of the path. In this case, if you type a.out, it will execute the file that you just created with the compile, unless there happens to be another executable file called a.out in one of the directories in the path, in which case, that will be executed instead. A third camp says not to put the dot in your path at all. The system administrator will set a default path for you and usually dot is not in the default path.
The argument against putting a dot in your path is that it is a potential security problem. If someone breaks into your computer, they can put an executable called ls in your home directory, which does something malicious. The next time that you log in and type ls, which is the most commonly used command, instead of executing the system ls it will execute the malicious version.
There should be a hidden file in your home directory called .bashrc. (In Unix, a hidden file has a dot as the first character. If you want ls to show hidden files, use the -a option (ls -a)). This file contains statements that are run whenever bash is started. If you want to put . at the end of your path, add the following line to .bashrc
export PATH=$PATH:.
The export command is a shell command that sets the environment to be exported to any program that it launches. This command says to set the value of PATH to its current value with a :. concatenated onto the end.
To put . at the start of your path, type this
export PATH=.:$PATH
Back from the digression.
The C compiler has literally hundreds of options. You can learn about all of them by typing man gcc. There are only a few that will be important for this course.
Here is a link to a short overview of gdb.
Any large program will have more than one source file, and it is possible to pass in several source files as arguments to gcc, and these will be combined to produce a single executable. There must be one and only one function called main in one of these files, and this will be the entry point.
For example
> gcc FileOne.c FileTwo.c FileThree.c
will produce a single executable called a.out.
> gcc -o hello -g FileOne.c FileTwo.c FileThree.c
will produce an executable file called hello (the -o flag names the output file), and because of the -g flag, which includes debugging information, you will be able to use gdb to debug it.
What happens during a compile
It is worth spending some time looking at what happens when a C program is compiled. The example will be from the Unix compiler but the principles apply to any C compiler. The process of creating the executable file involves at least four separate steps: preprocessing the input, the actual compilation to produce an assembler file, assembling this to create an object file, and linking to create an executable.
The C Preprocessor
A preprocessor is a program that takes as input a C program which has preprocessor directives in it, and it expands these directives. The output is a C program with the preprocessor directives expanded. Preprocessor directives are sometimes called macros, particularly in assembler. Expansions always take the form of string substitution.
C preprocessor directives are indicated by a # in column 1 of a line. One directive that you are almost certainly familiar with is
#include
This is followed by the name of a file (by convention the file has a .h suffix) and the contents of that file are copied into the output file. The file name is enclosed either in < > or in double quotes. If the file name is in < >, the preprocessor will look in the directory /usr/include (or some other directory set by the administrator). If the file is in double quotes, the preprocessor will look in the current directory or will treat it as an absolute pathname.
Another simple preprocessor directive is
#define
This usually takes two arguments. The preprocessor would simply
replace any instance of the first string by the second string.
For example:
#define BUFSIZE 1024
The preprocessor would replace all instances of BUFSIZE in the
input file with 1024 in the output file.
The string char buffer[BUFSIZE]
would be replaced by
char buffer[ 1024 ]
Alert: A common error is to put a semicolon at the end of the second
string. This results in a compiler error which is hard to detect.
For example
#define BUFSIZE 1024;
would result in the output string
char buffer[ 1024; ]
which is syntactically wrong.
It is possible to write macro expansions that take arguments.
For example
#define SQUARE(X) X * X
If there were a line in the text that looked like this:
n = SQUARE(m);
This would be expanded to
n = m * m;
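Because the expansion is pure text substitution, the SQUARE macro as written gives a surprising result when its argument is an expression rather than a single name. The sketch below contrasts it with a parenthesised variant. SQUARE_SAFE and the two wrapper functions are hypothetical names introduced only for this illustration.

```c
#define SQUARE(X) X * X               /* the definition from the text */
#define SQUARE_SAFE(X) ((X) * (X))    /* hypothetical parenthesised variant */

/* Expands to  m + 1 * m + 1,  which the compiler reads as  m + (1 * m) + 1. */
int next_squared_unsafe(int m)
{
    return SQUARE(m + 1);
}

/* Expands to  ((m + 1) * (m + 1)),  which is what was intended. */
int next_squared_safe(int m)
{
    return SQUARE_SAFE(m + 1);
}
```

With m equal to 3, the first function returns 7 and the second returns 16. Note that even the parenthesised form still evaluates its argument twice, so an argument with a side effect such as i++ remains unsafe.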
Here is a more complex example
#define SWAP(TYPE,M,N) {TYPE temp; temp=M; M=N; N=temp;}
The following line in the text
SWAP(int,a,b)
would be expanded to
{int temp; temp=a; a=b; b=temp;}
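One caveat with a brace-only macro like this: if it is followed by a semicolon inside an if ... else, the extra semicolon ends the if statement and the else no longer compiles. A common idiom, shown here as a sketch, wraps the body in do { ... } while (0) so that the macro behaves like a single statement. SWAP_STMT and order_pair are hypothetical names, not part of the original example.

```c
/* Hypothetical variant of SWAP that can be followed by a semicolon anywhere
   a statement is allowed. It still breaks if an argument is named temp. */
#define SWAP_STMT(TYPE, M, N) \
    do { TYPE temp = (M); (M) = (N); (N) = temp; } while (0)

/* Make sure *lo <= *hi. Returns 1 if the values were swapped, 0 if not. */
int order_pair(int *lo, int *hi)
{
    int swapped = 1;

    if (*lo > *hi)
        SWAP_STMT(int, *lo, *hi);   /* the semicolon here is harmless */
    else
        swapped = 0;
    return swapped;
}
```

Calling order_pair on the values 5 and 2 swaps them and returns 1; calling it again on the now ordered pair leaves them alone and returns 0.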
There are a number of predefined preprocessor macros. For example
__FILE__
will expand to the name of the current input file
__LINE__
will expand to the current line number
__TIME__
will expand to the current time in the form hh:mm:ss
(Note that this is the time of the compile, not the time at which the program starts running)
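These predefined macros are mostly used for diagnostic messages. The following is a minimal sketch: WHERE and build_time are hypothetical names, and WHERE assumes that its first argument is a character array (not a pointer), so that sizeof gives the buffer size.

```c
#include <stdio.h>

/* Hypothetical helper: write "file:line: message" into the array BUF.
   It has to be a macro so that __LINE__ is the line of the caller. */
#define WHERE(BUF, MSG) \
    snprintf((BUF), sizeof(BUF), "%s:%d: %s", __FILE__, __LINE__, (MSG))

/* Returns the time at which this file was compiled, in the form hh:mm:ss. */
const char *build_time(void)
{
    return __TIME__;
}
```

A call such as WHERE(buf, "checkpoint") fills buf with the source file name and the line number of the call itself, followed by the message. build_time() keeps returning the same string no matter when the program is run, because the substitution happened at compile time.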
Note that no actual processing, or even checking for correct syntax takes place during preprocessing. The preprocessor is simply substituting one string for another.
The preprocessor can define a variable without actually setting
a value. For example:
#define _MYHEADER_H_
This is used for controlling Conditional Compilation, another feature of the preprocessor. Conditional compilation means that certain code will be compiled only if a variable is defined or not defined, with the preprocessor keywords #ifdef (if defined) and #ifndef (if not defined) along with a matching #endif.
For example:
#ifndef _MYHEADER_H_
#define _MYHEADER_H_
/* code for my header.h, which will only be compiled if
   _MYHEADER_H_ had not been previously defined */
#endif
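To see why this pattern is useful, the sketch below pastes the same guarded header text into one file twice, which is what happens when two different #include lines pull in the same header. The header name point.h, the guard name POINT_H_, and point_sum are all hypothetical.

```c
/* First copy of the hypothetical header point.h, as #include would paste it. */
#ifndef POINT_H_
#define POINT_H_
struct point { int x; int y; };
#endif

/* Second copy: POINT_H_ is already defined, so the preprocessor drops
   everything up to the matching #endif and the struct is not redefined. */
#ifndef POINT_H_
#define POINT_H_
struct point { int x; int y; };
#endif

int point_sum(struct point p)
{
    return p.x + p.y;
}
```

Without the #ifndef lines, the second copy would be a redefinition of struct point and the compile would fail.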
It is possible to define variables on the gcc command line as well with the -D option. For example
> gcc -D __sparc__ -o outfile infile.c
You could then put code in infile.c like this:
#ifdef __sparc__
/* code to be compiled for sparc, but not for other architectures. */
#endif
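The same mechanism is often used to switch features on or to override a default value at compile time. In this sketch, VERBOSE and the two functions are hypothetical names, and BUFSIZE reuses the name from the earlier #define example. The -D option also accepts the form -DNAME=value.

```c
/* Compile with  gcc -DVERBOSE file.c  to select the first branch. */
const char *build_mode(void)
{
#ifdef VERBOSE
    return "verbose";
#else
    return "quiet";
#endif
}

/* Compile with  gcc -DBUFSIZE=4096 file.c  to override the default. */
#ifndef BUFSIZE
#define BUFSIZE 1024
#endif

int buffer_size(void)
{
    return BUFSIZE;
}
```

Compiled with no -D options, build_mode() returns "quiet" and buffer_size() returns 1024.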
The C preprocessor is cpp. The input is a file with preprocessor directives
and the output is a file with all of the preprocessor directives expanded.
This file will be given a temporary name with an .i suffix.
The temporary file is automatically deleted after it has been
used, but you can stop the process after this step by passing the -E flag
to gcc. The output file will be written to standard output, so
you can redirect it to a file if you would like to see what the preprocessor
did. For example
> gcc -E myfile.c > myfile.i
Click here for an exercise on the preprocessor
The actual compilation
Once a program has been run through the preprocessor, the next step is the actual compilation. The input to the C compiler is the file which was the output of the preprocessor, in other words, a pure C file. The output is an assembler file. On most Unix systems, including FreeBSD, assembler files have a .s suffix. If you would like to look at the assembler file, you can stop the process before assembly by passing the -S flag to gcc. Otherwise, after the assembly process, the .s file is deleted.
Here is the FreeBSD assembly file for helloworld.c
Assembly
The gcc driver next invokes the assembler on the host machine. On most Unix machines, the native assembler is as, and the GNU equivalent is gas. The input is the assembler file produced by the compiler, the output is an object file, with a .o suffix (a .obj suffix on Windows).
You can stop the process after the creation of the object file but before the creation of the actual executable by using the -c flag with gcc.
If there is more than one input file, the above process (preprocessing, compiling, assembling) is repeated for each input file before the next step.
Linking
Suppose we have two source files which look like this.
First File, file1.c
/* file1.c */
#include <stdio.h>

int g; /* a global variable */
extern double dg; /* another global var, defined in some other file */

void fctnOne(); /* a function prototype */

int main()
{
    int x; /* a variable local to main (an automatic variable) */
    x = 3;
    dg = 3.14;
    g = 17;
    fctnOne();
    printf("x is %d, g is %d, dg is %f\n",
           x, g, dg);
    return 0;
}

void fctnTwo()
{
    int x;
    x = 5;
    g = 11;
    dg = dg * 2;
}
Second File file2.c
/* file2.c */
extern int g;
double dg;
void fctnTwo(); /* function prototype */

void fctnOne()
{
    int x = 44;
    g = x;
    dg = dg + 2;
    fctnTwo();
}
If the compile line looks like this:
> gcc -g -Wall file1.c file2.c
preprocessing, compiling and assembling the two source files would produce two object files called file1.o and file2.o. However, both of these have unresolved references, i.e. references to variables and functions which are defined in some other file. When file1.o is produced, it will contain a call to a function called fctnOne, but it does not know the address of this function at assembly time. Likewise, there is a reference to a variable dg and the assembler does not know the address of this variable. There is also a call to printf, which is not defined in either file. The file file2.o has unresolved references to fctnTwo and g.
The job of the linker is to resolve all of these unresolved
references. It does this by using two tables which are
attached to each of the object files. One table, the
definition table, lists all of the global functions
and variables which are defined in that file, along with
the address of each of these. The other table, the
use table lists each instance where an undefined
variable or function is used.
Here are the four tables for these two files.
Definition table for file1.c | Use table for file1.c |
---|---|
g | dg (line 13) |
main() | fctnOne() (line 15) |
fctnTwo() | printf() (line 16) |
 | dg (line 17) |
 | dg (line 26, first instance) |
 | dg (line 26, second instance) |

Definition table for file2.c | Use table for file2.c |
---|---|
dg | g (line 9) |
fctnOne() | fctnTwo() (line 11) |
The linker usually does its job in two passes. It first goes through all of the definition tables and builds a global definition table consisting of all variables and functions defined in any of the files along with their addresses. It then goes through all of the files again replacing all unresolved references listed in the use table with the actual address.
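The two passes can be sketched as follows. This is a toy model under stated assumptions, not the data layout of any real object file format: the struct names, MAX_GLOBALS, and link_objects are all hypothetical, and addresses are plain numbers.

```c
#include <string.h>

struct definition { const char *name; unsigned long address; };
struct use        { const char *name; unsigned long address; int resolved; };

struct object_file {
    const struct definition *defs; size_t ndefs;   /* definition table */
    struct use *uses;              size_t nuses;   /* use table */
};

#define MAX_GLOBALS 64

/* Returns the number of references still unresolved, or -1 if a symbol is
   defined twice or there are too many symbols for this toy table. */
int link_objects(struct object_file *objs, size_t nobjs)
{
    struct definition global[MAX_GLOBALS];
    size_t nglobal = 0;
    int unresolved = 0;

    /* Pass 1: merge every definition table into one global table. */
    for (size_t i = 0; i < nobjs; i++) {
        for (size_t j = 0; j < objs[i].ndefs; j++) {
            for (size_t k = 0; k < nglobal; k++)
                if (strcmp(global[k].name, objs[i].defs[j].name) == 0)
                    return -1;
            if (nglobal == MAX_GLOBALS)
                return -1;
            global[nglobal++] = objs[i].defs[j];
        }
    }

    /* Pass 2: patch each use-table entry with the address found in pass 1. */
    for (size_t i = 0; i < nobjs; i++) {
        for (size_t j = 0; j < objs[i].nuses; j++) {
            struct use *u = &objs[i].uses[j];
            u->resolved = 0;
            for (size_t k = 0; k < nglobal; k++) {
                if (strcmp(global[k].name, u->name) == 0) {
                    u->address = global[k].address;
                    u->resolved = 1;
                    break;
                }
            }
            if (!u->resolved)
                unresolved++;
        }
    }
    return unresolved;
}
```

If this is fed the tables for file1.c and file2.c with made-up addresses, every reference is resolved except printf, which is defined in neither file.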
Finally, the linker has to search libraries to resolve yet more unresolved references. The linker is usually configured to automatically search the standard C library libc which contains code for functions such as printf. You can tell the linker to search other libraries with the -l flag (which can be passed to the call to gcc). For example, if you use functions in the math library, you can tell the linker to link to this with the -lm flag.
Most modern systems use dynamic linking to link to library functions. With static linking, the code for the library functions is copied directly into the executable. With dynamic linking, the symbolic names of the library functions are stored in the executable, and when the program runs, the system has to look up the location of the code for a library function before it can execute it. The advantage of dynamic linking is that there is only one instance of the code for often used library functions like printf (with static linking, the code for printf is copied into each executable that uses it). The disadvantage of dynamic linking is that there is a small run time penalty, because the system has to look up the address of a library function, typically the first time that it is called, and calls then go through an extra level of indirection.
The linker also has to combine all of the various inputs into a single executable image, and this may need to involve relocation of addresses in some or all of the modules.
Here is an exercise on the linker
Passing arguments to programs
When you run a program from the command line, you can pass arguments to the program. The function main() can access these arguments. To do this, write your main as if it had two arguments, int argc and char *argv[]. The variable argc will automatically be set at run time to contain the total number of arguments, including the name of the executable itself. The variable argv (the argument vector) is an array of pointers to character strings. The size of the array is argc. Here is a short program which displays its arguments, one per line.
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;

    for (i = 0; i < argc; i++) {
        printf("%s\n", argv[i]);
    }
    return 0;
}
If this is the command line
a.out first second third
the value of argc would be 4.
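Arguments always arrive as character strings, so a program that expects numbers has to convert them itself. Here is a minimal sketch of that, assuming a hypothetical helper sum_args which main would call with its own argc and argv.

```c
#include <stdlib.h>   /* strtol() */

/* Add up the arguments that are whole numbers, ignoring anything else. */
long sum_args(int argc, char *argv[])
{
    long total = 0;

    for (int i = 1; i < argc; i++) {      /* argv[0] is the program name */
        char *end;
        long value = strtol(argv[i], &end, 10);

        /* Accept the argument only if the entire string was a number. */
        if (end != argv[i] && *end == '\0')
            total += value;
    }
    return total;
}
```

For the command line a.out 10 abc -3 7x the function returns 7: the arguments 10 and -3 are added, while abc and 7x are skipped because they are not entirely numeric.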
Actually, on Unix systems there is a third argument which is also passed to main, a pointer to the environment, which is an array of pointers to character strings, ending with a NULL pointer. This is usually called envp.
Here is a short program which prints out the environment, one value per line.
#include <stdio.h>

int main(int argc, char *argv[], char *envp[])
{
    int i;

    for (i = 0; envp[i] != NULL; i++) {
        printf("%s\n", envp[i]);
    }
    return 0;
}
Here is the output when I ran this program. Your output may differ.
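Since every entry has the form VARNAME=value, finding one variable is a matter of scanning the array for a matching prefix. The sketch below shows one way to do it. lookup_env is a hypothetical helper written for illustration; in a real program the standard library function getenv() does this job for the process's own environment.

```c
#include <string.h>

/* Return a pointer to the value of `name` in the environment array `envp`,
   or NULL if there is no such variable. */
const char *lookup_env(char *envp[], const char *name)
{
    size_t len = strlen(name);

    for (int i = 0; envp[i] != NULL; i++) {
        /* The name must be followed directly by '=', so that looking up
           HOM does not match the entry HOME=... */
        if (strncmp(envp[i], name, len) == 0 && envp[i][len] == '=')
            return envp[i] + len + 1;
    }
    return NULL;
}
```

Called with the sample environment shown earlier, lookup_env(envp, "HOME") would return the string /cs/ingallsr.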
Large software systems often involve many files, and it is time consuming to have to recompile all of these files whenever a single change is made to one header file or .c file. There is a utility called make which keeps track of dependencies between files, and only recompiles those files which need to be rebuilt.
The make utility uses a resource file called a makefile which keeps track of instructions of compiling each file and also keeps track of dependencies between files. Examples of dependencies between two files would be a .c file which uses header files, or a source file which calls a function in a different source file or which accesses a global variable in a different source file.
A simple makefile consists of rules with the following form:
target ... : prerequisites ...
	command
	...
	...
A target is usually the name of a file that is generated by a program; examples of targets are executable or object files.
A prerequisite is a file that is used as input to create the target. A target often depends on several files.
A command is an action that make carries out. A rule may have more than one command, each on its own line. Note that you need to put a tab character at the beginning of every command line!
Here is a very simple example of a make file.
Suppose you are working on a system that has two C source files, called test1.c and test2.c. The file test1.c uses a header file test1.h and the file test2.c uses a header file called test2.h. These two files are linked to create an executable called test.exe. Here is the makefile for this.
test.exe: test1.o test2.o
	gcc -o test.exe test1.o test2.o
test1.o: test1.c test1.h
	gcc -c test1.c
test2.o: test2.c test2.h
	gcc -c test2.c
Note that there is a tab character before each of the three command lines (which all start with gcc in this example).
The first line says that building the executable test.exe depends on the two object files test1.o and test2.o.
The next line is the command to create test.exe.
The next line says that the object file test1.o depends on the source file test1.c and the header file test1.h.
The next line is the command to create test1.o. Recall that the -c flag tells the compiler to build the object file, but not to call the linker to create an executable.
You should be able to understand the last two lines.
If you change one line of the file test2.h, and then call make, the make routine will see that test2.h has been updated more recently than test2.o, which depends on it, so it will recompile test2.c. Since test2.o is then newer than test.exe, it will also rebuild test.exe, but it will not recompile test1.c.
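The decision that make takes for each rule comes down to comparing file modification times. The sketch below shows that comparison in isolation. needs_rebuild is a hypothetical helper, and the times are assumed to be the st_mtime values that stat() reports for each file; it is not how make itself is written.

```c
#include <stddef.h>
#include <time.h>

/*
 * A target has to be rebuilt if it does not exist yet, or if at least one
 * of its prerequisites was modified more recently than the target itself.
 */
int needs_rebuild(int target_exists, time_t target_mtime,
                  const time_t *prereq_mtimes, size_t nprereqs)
{
    if (!target_exists)
        return 1;
    for (size_t i = 0; i < nprereqs; i++)
        if (prereq_mtimes[i] > target_mtime)
            return 1;
    return 0;
}
```

In the scenario above, the check for test2.o succeeds because test2.h is newer, and once test2.o has been recreated the same check succeeds for test.exe. The check for test1.o fails, so test1.c is left alone.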
When make is called from the command line with no arguments, it looks for a file called makefile (or Makefile). You can ask make to use a different file with the -f flag (e.g. make -f mymakefile).
For this course, you must submit a makefile for any project that has more than one file. Your makefile can be a trivial one such as the example above. Make has a much more complex syntax. You can learn more about make at the GNU make man page.