This document is the beginning of a collection of fine points, "missing details", gotchas, and the like that I think would be useful to students switching from programming in C++ to C or from "generic" coding to coding with Unix system calls. It makes no pretense at being complete.
There are some other resources you need to know about. A huge number of "news groups" have carried discussions on a variety of topics, both technical and nontechnical, to net users worldwide. It has been common for people, frequently naive users, to post questions before checking the resources at hand -- often such questions were answered in manual pages. That behavior frequently prompted the response RTFM to simple questions -- Read The Fine Manual (or Read The Fruitful Manual or Read The Fantastic Manual or ...). More commonly, one of those words has been an emphatic vulgarity.
To reduce the volume of repetitive questions, many groups developed FAQs -- lists of Frequently Asked Questions with answers. Many such FAQs are archived on the web at http://www.faqs.org/. Of special interest to you will be the lists for newsgroups comp.unix.questions (and several other comp.unix.* groups) and for comp.lang.c (and several other comp.lang.* groups). The home site for the C-language FAQ (comp.lang.c's FAQ) is http://www.eskimo.com/~scs/C-faq/top.html. That document is available in several forms, including as a book.
I have an old document I prepared about using C -- mostly about style issues. It is an ASCII file, formatted for printing, with headings and page numbers. If you want to see it, follow this link. Years ago, Andy Tanenbaum posted some opinions about C-style -- I saved a copy of it.
Some items to be discussed in the first edition of this document:
Future? or Now?
And here's the same list with the discussion inserted....
C, like many other programming languages, has gone through a number of versions -- even a number of versions of "the standard". Some of the code you will be writing will be used on systems whose compilers aren't brand-new. It behooves you to exercise restraint in using new features and/or extensions. Remember: simple is good.
Note in particular
My first choice would be to declare non-global variables at the beginning of the function to which they are to be local. My second choice -- in case there is good reason for not announcing your intentions at the beginning of a function -- is to have an inner-block (an inner, anonymous block) in which the variable is declared. For example:
int some_function(int ar1)
{
    int local_1, local_2;
    ....executable code...
    {   int local_inner_var;    /* the { begins inner block A */
        ....executable code...
    }                           /* end inner block A */
    ....executable code...
}
struct complex {
    double real_part;
    double imaginary_part; };
struct complex x, y, z;
Note that the structure tag (complex) isn't omitted, and that
declarations of variables repeat "struct complex" -- they don't
just use "complex".
Note also that you may list variables to be declared
between the "}" and the ";" in that top part, as in,
struct complex {
    double real_part;
    double imaginary_part; } x, y, z;
Look at the manual pages and/or a reference like K and R or Harbison and Steele.
(double)sum/n     sum/(double)n     (double)sum/(double)n
I prefer the third, but the type conversion rules (coercions) dictate that the other two mean the same thing.
(void)printf(............);
announces that you've thought about it and you really want to discard printf()'s return value.
The GNU (GNU's Not Unix) tool set includes a C compiler. The command to invoke that compiler is gcc. Currently the preferred C compiler on CS is gcc. gcc implements standard C and some extensions; the default mode is to enable those extensions. However, gcc will enforce the standard if you put both these options in the gcc command line: -ansi -pedantic
C (and C++) actually have many more kinds of numbers.
The common types for signed integers are:
short, int, and long.
These types differ in the amount of storage assigned for them, and consequently
in the range of available values. Int is expected to be the "natural"
signed integer type for the computer the program will run on.
For example,
on a PDP-11 an int was 2 bytes (16 bits) and
on a DEC VAX or an IBM 370 an int was 4 bytes (32 bits).
In addition to these signed types, C has unsigned versions of each of them.
Unsigned arithmetic has only values >= 0 and doesn't treat "spills" as
"overflow". The ranges of values used for signed and unsigned types are, thus,
different. For example, 16-bit signed integers run from -X through +32767 (X
depends on how negatives are represented -- it's either 32767 or 32768)
while 16-bit unsigned integers run from 0 through 65535.
It is possible to convert a value from one type to another (an explicit
conversion is called a "cast" in C). The facts of number ranges mean that
some conversions can lose information.
The rules say that if you convert a signed value to an unsigned type of
the same size, the result is the value reduced modulo one more than the
largest value the unsigned type can hold.
That means that -1 converted to a 16-bit unsigned type becomes 65535 on
every machine. In 16-bit 2s complement that conversion also preserves the
bit-pattern; in 16-bit sign/magnitude and 16-bit ones complement it does not.
Converting an unsigned integer to a signed type of the same size gives the
same value when the value fits. When it doesn't fit, the result is
implementation-defined; 2s complement machines usually preserve the
bit-pattern, which converts large unsigned values to things that are
negative. Such conversions are usually said to have overflowed.
There is yet more in integer arithmetic. In C, the type char
is actually a type of small integers. A char is now normally
8 bits wide. A compiler can treat char as either signed
or unsigned values. Which one is chosen affects some comparisons and affects
what happens when chars are converted to wider integers.
You should not be surprised to see types like unsigned char,
since forcing a choice of signed/unsigned on char then affects conversions
between char and int.
I have not yet discussed double and float.
Since 1985 there has been an IEEE standard for floating point (IEEE 754); work on it began in the late 1970s.
Actually there's more than one version of it. (See Tanenbaum's quip:
"The nice thing about standards is that there are so many to choose from.")
However, many machines have backward compatibility issues to machines
that existed long before those standards. So, you'll see a variety
of floating point representations. The first thing you need to know
is that all floating point types are inherently approximate. They're
like scientific notation, and after each operation the number of retained
digits is adjusted to what fits. Most, but not necessarily all, machines
use binary (base 2) or hexadecimal (base 16) representations -- since
which fractions "terminate" depends on the radix, rounding might not happen
as you'd expect. For now we'll stop there.
There is yet another issue that can affect numbers if you're writing
them to files or pipes or ... without converting them to characters.
(The statements 'printf("%d", integer);' and 'cout << integer;'
convert the internal representation of 'integer' to a character string.
The statement 'write(1,&integer,sizeof integer);' doesn't.)
Most current machines are byte addressable. Most numeric types are wider
than one byte. Which byte of a multi-byte number has the smallest address
varies -- and if the number is wider than two bytes, there are more than
two possible orders. If you stay within the same machine, byte order
is probably not a problem, but....
NUL is the spelling of the ASCII character with all its bits
zero. C doesn't acknowledge that spelling (but standard
documents often do). You can spell that character as '\0'
or as 0 -- usually we prefer the former.
Several standard headers define NULL for use as a spelling of
the pointer explicitly marked as pointing at nothing.
You can also use 0 as a spelling of it.
You'll sometimes see 0 cast to some pointer type, but usually
you're safer sticking with one of the first two spellings.
A null string (you might spell it "") is different from either
of the "nulls" above. That null is a pointer to a byte
that contains 0 (ASCII NUL).
In C, characters and strings of characters are quite different
sorts of things.
A character is a small integer. A character variable is commonly
one-byte of storage whose value is the character code (usually, but
not necessarily, ASCII code) of some character in the machine's
character set. That value can be treated as a signed value or as
an unsigned value -- it's up to the implementation. Whether
the value is signed or not affects what happens when the value is
involved in comparisons and what happens when the value is involved
in type conversions (casts).
The notation 'a' is called a character constant. It is an integer
whose value is the character code of the single character between
the apostrophes.
A string is actually an array of characters that contains the
indicated characters followed by an ASCII NUL as a terminator.
Thus "a" is an array of two characters ('a' and the NUL), while 'a' is a single integer.
A few symbols with more than one character between
apostrophes have meaning -- these symbols are the so-called "escape
sequences" or "character escape codes", things like \n or \t or \r.
All of those escape
sequences consist of a backslash followed by another character.
And, although it requires several keystrokes to type them, each
represents a single character code.
The codes can be used in character constants (as in '\n') or
in character strings (as in "\n").
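ANSI C defines:
\a (alert), \b (backspace), \f (form feed), \n (newline), \r (carriage return), \t (horizontal tab), \v (vertical tab), \\ (backslash), \' (apostrophe), \" (double quote), \? (question mark), \ooo (octal character code), and \xhh (hexadecimal character code).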
A pointer is a variable whose value is "a reference to" some
other thing. Often that other thing is another variable or
a dynamically allocated block of storage, but it can also be
the entry point of a function (in a so-called function pointer).
Most implementations use machine addresses as the values of
pointers, but that need not be the case. In particular,
there have been "word addressed" machines whose "words" would
hold several characters; such machines might use an address
for an int-pointer and something else for a char-pointer.
There is nothing in the rules that requires that pointer values
be addresses or that the NULL pointer have all its bits 0 (although
that representation is common). There is a rule that says that
when the integer value 0 is cast to a pointer type, the result must be
a null pointer -- but that doesn't say the NULL pointer is all 0s.
When a pointer is defined, storage is allocated for the pointer
value, but the pointer is not made to point at anything.
No storage is allocated for anything for the pointer to point to.
The initialization rule for static variables (initially 0)
makes pointers of static storage class start out as null pointers.
Automatic variables (ones allocated on the stack in functions)
are not initialized; the garbage in the pointer's storage
area might point at something (but who knows what?) or it might
be zero (you should be so "lucky") or it might be something that
causes an addressing error if the pointer is de-referenced
(e.g., an address out of bounds or something that violates alignment
rules).
So far, so good. Arrays and pointers are quite different.
Now let us look at operators. C allows "pointer arithmetic".
Incrementing a pointer that points to something of type
X makes the new value of the pointer find the thing of type X next
higher in memory from the thing found by the old value.
Decrementing a pointer that points to something of type
X makes the new value of the pointer find the thing of type X next
lower in memory from the thing found by the old value.
If you add an integer and a pointer,
the integer is rescaled by the size of the type that the pointer
points at (so that adding a positive integer is equivalent to
incrementing it multiple times).
You may write integer+pointer or pointer+integer.
(There is also a rule about storage blocks -- it says, essentially,
that the pointer values both before and after arithmetic must point
at elements of an array or just past the last element of an array.)
The rules of pointer arithmetic
mean that pointer arithmetic can be used to
"march through" arrays.
In fact, the subscripting operator is defined in terms of
pointer arithmetic. And, herein lies the root of the confusion.
In C, when the name of an "array of X" appears in an expression,
the name is converted to type "pointer to X" and the value of that
pointer is &(array_name[0]) (i.e., pointer to the first element).
Consider a function call with an array argument. If we want
to pass the array by reference (so we don't have to copy it),
this rule does exactly the right thing.
So, what if we have an expression like array_name[integer]?
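C defines that array_name[integer] means *(array_name + integer).
Those rules now mean that we can write pointer[integer] with any pointer, and even integer[array_name].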
A pointer can "dangle" because the block of storage "malloc( )'d"
onto it has been "free( )'d" -- without any change having been made
to the value of the pointer. (Because parameters are passed
by value, that is the expected behavior of a free(pointer)
call.)
A pointer can also "dangle" because it had pointed to an
automatic local variable of a function and the function has returned,
because the pointer is an uninitialized automatic variable,
or because the pointer is in a union that has been overwritten.
Some might describe "dangling" pointers as "stale" pointers,
because they often reflect a situation that used to exist but
has changed without an update to the pointer.
If you're lucky,
dangling pointers result in addressing exceptions.
If you're not lucky, dereferencing a dangling pointer will find
a storage area that is being used for something else and the
reference won't cause an exception.
In such a case, you'll be "reading" garbage or "scribbling" on
something else.
On a PC, errors like this have wiped out a hard disk's FAT --
a major disaster.
Machines whose addressable unit is small must use several
addressable units for variables that must represent many possible
values. For example, it is now common for computers to
have 8-bit bytes as their addressable unit. A byte can hold
only 2**8 = 256 different values, so it has much too little
storage capacity for floating point values and for most
integer applications.
Many such machines have had operations
that support use of 2-byte and 4-byte integers and 4-byte and
8-byte floating point values (among other things).
Some machines allow these 2, 4, or 8 byte things to begin
at any address, but many others have imposed so-called
alignment requirements. An alignment requirement specifies that
one of these bigger values may not be stored at an arbitrary
address. Typical alignment requirements might be that a 2-byte
value be stored at an address that is a multiple of 2, a 4-byte
value be stored at an address that is a multiple of 4, or
an 8-byte value be stored at an address that is a multiple of 8.
Imposing such requirements can simplify the design of the
interface to main memory.
When the compiler assigns storage to variables, it must consider
whatever alignment requirements the run-time system imposes.
While the compiler is allocating storage, if the next available
storage location does not satisfy the alignment requirements of
the next variable to allocate, the compiler must leave some bytes
unused and/or it must reorder the variables in memory.
Bytes skipped over in this way are often referred to as padding bytes.
Starting at an address satisfying the most restrictive alignment
requirement (8-byte in the above) and first allocating all the widest
variables, then all the ones of next width, etc., can reduce
the amount of padding required.
In a record structure, the compiler may be constrained to allocate
fields in an order that requires padding to be placed between them.
The padding requirements may be different on different machines.
On a machine that has alignment rules, a compiler might achieve
better data storage efficiency by violating the alignment rules in
memory and inserting extra code to do things like shifting and
masking to assemble wide values in registers. Doing that costs
extra instructions, so it isn't free. It could be that two
different compilers on the same machine would do padding differently.
Because structs may contain padding bytes, the size of
a struct can be bigger than the sum of the sizes of its components.
Because different compilers might pad differently, knowing things
about the machine architecture might not be enough to let a
programmer correctly compute the size of a struct. You should
always use sizeof( ) applied to the structure type to
obtain the size of the structure, for example
sizeof(struct node).
On a machine with alignment requirements, casts of pointer types must
be used with caution -- a pointer to a correctly aligned int
might have a value that wouldn't be correctly aligned for a
double.
The authors of malloc( ), C's dynamic storage allocator,
must understand alignment, and must deliver blocks of storage that
satisfy the system's most stringent alignment requirements.
On the other hand, C has no way to express that a function
returning a pointer makes such a promise. Sometimes you'll
get warnings on correct code because the declarations can't
tell the compiler that a function does better than the worst
case allows.
Since C allows you to have pointers to named variables,
you can pass a pointer and use that to change the value
of one of the caller's variables.
Note the discussion of arrays above. A call listing an
array name (as: a) rather than an array element name
(as: a[3]), has an item that's a pointer -- the value is
"pointer to first element". So the value the subprogram has is
a pointer to the origin of the array. Note the description
above of how subscripts work -- the array element references in
the subprogram "find" the storage of the caller's array
(but there isn't subscript bounds checking unless you hand code
it).
Here's an example of a simple C program that uses a
"void function" to do addition.
C permits recursive functions. That means that each invocation
of a function needs to have its own set of parameters and local
variables. The traditional way of implementing that behavior is to have
a single run-time stack and have each function call allocate
a block on that stack to hold the parameters, local variables,
scratch storage, etc. for that invocation. It's reasonable for
you to expect that something like that is going on.
C also allows global variables. Global variables retain
their values across function calls and returns.
Global variables are declared outside function definitions.
The scope of those definitions is from the point of declaration
to the end of the file containing the definition.
C allows programs to be broken into multiple files.
It provides a mechanism for code in one file to access global
variables defined in another file.
It also provides a mechanism for having a global
variable that is defined in a file and lasts for the run of the program
but can be accessed only in the file in which it is defined.
Now the terminology....
Descriptions of C talk about storage class ....
The storage class of a variable can be automatic.
That's the storage class for ordinary local variables in a
function -- the automatic comes from the storage
allocation behavior -- it's automatically allocated when
the function is entered (and deallocated when the function returns).
The storage class of a variable can be static.
That's the storage class for global variables -- for variables that
are allocated storage at program load time and retain the same
storage allocation for the run of the program.
C also provides for on demand allocation and deallocation of
storage by use of library functions malloc(),
free(), and related functions. Such items do not necessarily
remain allocated for the run of the program, but their lifetime
is unrelated to the "function call tree".
Variables declared outside of functions have static storage class.
By default, such variables are made entry points
in the object module generated for the file. That means that
code in other files can gain access to the variable, if they "say"
the right thing. That "right thing" for the
int variable x is
extern int x;
Sometimes we want a variable to be global to the functions in
a file, but not be accessible to code in other files.
To achieve that we put the word static at the front
of the declaration. Similarly, functions defined in a file
default to being entry points in the object module generated from it.
That means that if file XX contains the definition of an int
function doit(int a, double b), not only can code in XX
call it, but also code in other files can call it if they
include the prototype:
int doit(int a, double b);
Essentially what's going on is that if a declaration that would
normally generate an entry point is preceded by the word
static that definition doesn't generate an entry point.
Variables declared inside functions, by default, have automatic
storage class. Prefixing a variable definition inside a function
with the word static gives it static storage class --
it is allocated storage in a block of memory separate from the run-time
stack, and it remains associated with that storage location
for the run of the program. That means that it retains values
across function call/return/call and that, if the function is
recursive, every invocation of the function uses the same storage
location for that variable. Scope of the variable is unaffected
by the word static.
The preprocessor is a program that examines the source code
of the program very "early" in compilation. Lines of
source code that have a # in column 1 are pre-processor
directives. The # is followed by a word that tells
what the directive asks the preprocessor to do. Whitespace (runs
of spaces and tabs) is allowed between the # and the word
(although you won't often see any). Preprocessor directives
include: #include, #define, #undef, #if, #ifdef, #ifndef, #elif, #else, #endif, #line, #error, and #pragma.
The crucial facts about C macros are that they are "called" by
textual substitution, that they are expanded early in compilation
(before expressions are analyzed), and that their arguments
might be evaluated more than once.
Redefining a macro is an error unless the new definition
is identical (almost) to the old definition, but
applying #undef to an undefined macro is not erroneous.
These rules account for some of the care taken in headers to arrange that
symbols are not redefined. They also account for the #undef
#define sequences you sometimes see.
Constants created with #define have values during preprocessing,
and so they can be used in preprocessor conditionals (#if, #ifdef,
#ifndef) in addition to being used in ordinary C statements as
named constants or to create something approximating in-line functions.
Their values can also be set by "-D...." directives on the "cc"
command line.
You will therefore see them used to do things like select or omit
code for specialized versions of programs -- that is, they are used
as "configuration constants".
If in the "#define" line
the name of a macro is immediately followed by a left parenthesis
(in a construction that looks like a function reference), the macro
has substitutable parameters. For example,
The parameter substitution, like macro
expansion, is purely textual substitution, and it occurs long before
the compiler deals with issues like operator precedence.
Because of the substitution rules, omitting some of the parentheses
might produce unexpected results. For example, if you define
Macros with arguments look like functions but they're
actually quite different.
Signals can be sent to a process because of
asynchronous events, such as the user typing "Control-C" or a timer
alarm "going off" or a child process terminating.
Signals can also be sent because of events that are synchronous with
the code of the process, for example, the process fetches an invalid
instruction for execution or it generates an arithmetic exception or
it generates an invalid memory reference or it uses invalid parameters in
a system call.
Every signal has a default action associated with it, usually
to terminate the process.
For most signals, a process can choose to ignore the signal or to "catch"
it rather than have the default action invoked.
A process wishing to "catch" a signal supplies the operating system with
a pointer to a function (called a signal handler) that is to be invoked
when the signal is received.
If a process is sent a signal that it is "catching", the code being
run is suspended and the signal handler is invoked. If the signal handler
returns, execution resumes where it left off. Note that signals might
arrive when the process is executing a library routine. The signal handler
has no way to know what the program was doing just before the signal
handler began execution. Since many libraries are not reentrant, signal
handlers that invoke library routines might not be consistent with resuming
execution where it was interrupted. Signal handlers that intend to
resume execution where it was interrupted should
probably just set a global flag that a "main loop" will test.
The signal handler we used in boxer and in racing sorts
was a "clean-up and exit" routine -- it restored the state of the terminal
interface and terminated the process. If we had been using temporary files,
our handler might have closed and unlinked (removed) them.
To use a signal handler as a "reset" function, you'll either need to
have it set a flag and resume execution where it was interrupted or need to
have it use setjmp/longjmp.
Exploring exactly how signals work and exploring setjmp/longjmp
are both beyond the scope of this document.
Header files are text files (ASCII, ordinarily) that contain declarations
of interfaces or of global entities. For example, here are some extracts
from CS's
Header files provide the compiler the information it needs to select correct
machine instruction sequences to do things like call functions. Normally
they do not provide code to carry out the function (library calls implemented
as macros are an exception).
For each *.c file the compiler generates a relocatable object module and
writes it into a file with a .o extension. That relocatable object module
contains machine code for the functions defined in the .c file but it
does not contain the code for functions called by code in the .c file
but not defined there. Addresses are (mostly) relocatable -- they're
recorded in a way that allows a linker to take this module and other ones
and decide EXACTLY where in memory they will go and then make the references
absolute. The object module also contains two lists of symbols (identifiers):
A list of entry points (names defined here that are to be made available to
other modules); A list of external symbols required (names used here but
not defined here -- the linker must find them as entry points in some other
module).
The cc command (or the gcc command, if you're using GNU's cc) is not
actually the compiler -- it's a driver program that knows where the compilers
are, where the assembler is, where the linker (loader) is, and the name
and location of the standard C library. It also knows which things to use
to make a .o file from a .c file and which things to use to make a .o
file from a .s (assembly source code) file. After cc has produced .o
files from all the files you named to it, it invokes the linking loader (ld)
telling it to make a bound program from those modules and the standard C
library (which it identifies to ld). ld tries to make a bound program.
If the collection of modules and libraries given to ld contains external
symbols that do not appear as entry points in the collection, it complains
about "unsatisfied externals".
For a variety of reasons, some of them historical and some for current
utility, there is not a single library. Some frequently used routines
are in libraries other than the standard C library. The first such example
most programmers encounter is the math library -- the modules that implement
the math functions, like sqrt, cos, sin, etc. are stored in a separate
library. To use them you invoke cc with a line like:
So is the name of the math library -lm ? Unfortunately it's not that
simple. The command line argument -lm consists of the "flag" '-l' and
the shorthand name 'm'. That shorthand name is used to construct the
pathname to the library. On CS (and many other Unix systems) the pathname
to the math library is: /usr/lib/libm.a. The name is constructed by
concatenating "/usr/lib/lib", the shorthand string, and ".a". There are
lots of libraries on CS:
The answer to the first of those questions is, "at the right end".
Think about it this way... The linker (loader) is putting things in memory
and keeping track of the entry points that have been defined and the
external symbols that have been used (but not yet defined). It is supposed
unconditionally to load the modules explicitly mentioned (the *.o files).
The libraries are intended to be used to supply the "missing pieces".
The obvious implementation of checking libraries is to scan through
them one by one looking for missing pieces.
Unless the linker makes multiple passes it probably will have difficulty
deciding what is needed from a library unless the library is at the end.
In addition, it is possible that when multiple libraries are named, their
order might be important.
The answer to the other two questions
is that the manual page has that information. The way it does that
is to identify the name of the library and to expect you to figure the
command line option from that.
Here's a piece of a termcap(3X) manual page:
The subtlety of the reference to libraries isn't the only subtle reference
on the manual page. You'll rarely see a manual page that says "You must
include
We're not here to debate whether these conventions are good. They've been
in use for decades and are deeply enshrined in standard practice. Besides,
even if you could wave a wand and change them for new releases, you're still
likely to encounter "legacy" releases that use them.
See a good book on C for more information.
Read carefully written code to see examples -- there are some in the Minix
source tree.
If I get really energetic, I'll add things here.
A common mistake in use of these routines is to forget that the routines
are of type int and not of type char. The routines return
the next character if there is one; at end of file they return EOF
(a negative int, commonly defined as -1). Putting the returned value into a char
rather than an int can make it impossible to detect end of file.
(If it doesn't do that, it loses you the ability to read the character that
EOF folds onto.)
Some things that you might not expect to have values, actually have
values. For example, x=5, (where x is an int)
has the value 5. When you write the assignment as a
statement, x=5;, the expression still has a value, but
you're choosing to ignore and discard it. The C compiler doesn't care.
That rule should help you understand why a compiler wouldn't
gripe about your use of a value returning function as if it were
a void function. C comes from an old heritage that says,
"The human (programmer) knows what he's doing -- AND the human
knows best" -- even if the request is strange, we'll quietly
do what we're told. You may have used software that subscribes to
a rule like "I (the software) know what you really need, so I'll
give you what I think is good for you." That's not the C style (nor
is it the Unix style).
A common idiom for reading a file (stdin in the example below)
is to have an int variable (here c) that is given
successive characters of a file. The following loop copies what's left
in the file to stdout.
A similar idiom is often used in making library or system calls that
can fail.
A different idiom is used in copying strings. The code below presumes
When a user types the data you are requesting, the characters are seen first
by a TTY device driver. That driver is capable of several different types
of input processing. Normally it buffers keystrokes until it sees
a line terminator (normally the character generated by the "big key" --
on a PC it's usually labeled "Enter", but it might say "Return" or something
else).
This style of input buffering is sometimes called "line buffering".
While it collects the characters of the line, the driver normally
honors backspaces (and the like) and does echoing.
When the "return" is seen, the driver is prepared to hand over
the line it saw to whatever user-space routine did the read( )
system call.
In some cases the user actually wrote a read( ) call, but in many
others the user wrote getchar( ) or fgets( ) or
scanf( ) or some other library call.
(In C++ the cin >> statements are similar to stdio
calls.)
Most of those calls are part of the stdio package.
The stdio package usually does buffering --
but exactly what buffering it does is mostly up to it.
Note that stdio can read from files or from the keyboard.
In both of those situations it normally buffers, but the buffering strategy
may be different.
For output, most users write printf( ), fprintf( ),
putchar( ), or similar stdio calls.
(In C++ the cout << statements are similar to stdio
calls.)
The stdio package buffers output.
How it buffers may depend on whether or not the output is being sent
to a terminal device or to a file.
When the output is sent to a file,
the size of the buffer might depend on the device the file resides on.
For output sent to a "terminal", it is common to use "line buffering" --
hold characters until you see a "newline" or until the buffer is full or
until a request for input is made.
When stdio decides that it is time to "flush" the buffer, it
does a write( ) system call.
The write( ) system call hands a sequence of bytes to the operating
system's "file system".
The file system may buffer things.
The file system (eventually) hands the bytes to a device driver.
Some drivers (e.g., the TTY driver) might do buffering.
The file system commits to the semantics that bytes given to it get to
their destination (eventually) if the system doesn't crash.
It can do that because whatever it is buffering can be kept separate from
user-process address spaces -- the crashing of a user process doesn't
trash those buffers.
Stdio, on the other hand, can't make such a promise because
it is a subroutine package bound into the user process -- the operating system
doesn't see it as special (it doesn't see it at all).
Sometimes programs using stdio crash with useful data in
stdio buffers -- and the user never sees it.
You can tell stdio that you want it to flush its buffers associated
with a particular stream by making an fflush( ) call.
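The following sketch shows one way to see the difference between "stdio has it" and "the kernel has it". It assumes a POSIX system (fileno( ) and fstat( ) are not part of ISO C), and both function names are made up for this example.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Size of the file as the kernel sees it, or -1 on error.
   Bytes still sitting in a stdio buffer are not counted. */
long kernel_size(FILE *fp)
{
    struct stat st;

    if (fstat(fileno(fp), &st) < 0) {
        return -1;
    }
    return (long)st.st_size;
}

/* Write msg and push it out of the stdio buffer.
   Returns 0 on success, -1 on failure. */
int put_and_flush(FILE *fp, const char *msg)
{
    if (fputs(msg, fp) == EOF) {
        return -1;
    }
    if (fflush(fp) == EOF) {
        return -1;
    }
    return 0;
}
```

After put_and_flush( ) returns 0, the bytes have been handed to the operating system, so a later crash of your process should not lose them.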
Library calls that are not really "system calls" often use the same
error reporting interface as true system calls. In what follows,
statements made about "system calls" also apply to most "library calls"
(even if they're implemented as user-space code).
Almost any request you make has some possibility of failure, even if
that failure is that you botched the information you wrote in the
statement making the request.
Many of the Unix system calls can fail for reasons of some substance.
For example, you might not be authorized to do what you're requesting,
a needed file might not exist, or the system might not have the required
resources available.
The Unix system call interface specifies that when you make a system call
request, either the system will do it and return what you ask for
or the system will refuse and will return to you a code that can be
used to determine why the system refused.
System calls do not generate printed error messages nor do they terminate
processes making erroneous calls.
Most system calls are functions returning int (or a similar integer type).
A correct system call that contains a request that is honored returns
a nonnegative value (>= 0).
An incorrect or refused call returns a negative value (ordinarily -1) and
puts an integer code in errno.
If you make a system call, you MUST check the value it
returns. Failure to make the check produces a program that muddles along
after a failed call and often leads you to look in the wrong place for
the causes of and fixes for problems you detect later.
The call interface says that if a call returns a failure
code, then errno "tells why".
It promises nothing about the value of errno after a call that
succeeds.
Don't look at errno unless the return value of a call tells you
that it failed -- errno can be non-zero when all is well.
The header file <errno.h> declares errno
and defines the symbols used by the manual pages to describe error codes.
If your code detects that a call has failed, it can include code that
checks the value of errno to determine an appropriate response --
you might have a Plan B.
The library routine perror can be used to display an appropriate
error message, if that's what you want (RTFM).
Some systems also have an array of strings that can be indexed
by errno to obtain a text message describing the code.
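A sketch of the check-then-look-at-errno pattern, assuming a POSIX system. The function name open_readonly is hypothetical; open( ), errno, and strerror( ) are the real interfaces (strerror( ) is the standard, portable way to get the message text).

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Try to open path read-only.
   On success return the descriptor (>= 0).
   On failure report why, store errno in *why, and return -1. */
int open_readonly(const char *path, int *why)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        *why = errno;   /* save it: later calls may change errno */
        fprintf(stderr, "%s: %s\n", path, strerror(*why));
        return -1;
    }
    return fd;
}
```

A caller with a Plan B might test the saved code, for example trying a different file when it equals ENOENT.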
16-bits <= bit_length(short) <= bit_length(int) <= bit_length(long)
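You can ask your own compiler where it stands. This sketch computes storage widths from sizeof and CHAR_BIT (from limits.h); the function names are invented for the example, and the widths it reports include any padding bits.

```c
#include <limits.h>
#include <stddef.h>

/* Width in bits of each type's storage, computed from sizeof. */
size_t bits_short(void) { return sizeof(short) * CHAR_BIT; }
size_t bits_int(void)   { return sizeof(int)   * CHAR_BIT; }
size_t bits_long(void)  { return sizeof(long)  * CHAR_BIT; }

/* Returns 1 if the ordering holds on this compiler, 0 otherwise. */
int widths_in_order(void)
{
    return bits_short() >= 16
        && bits_short() <= bits_int()
        && bits_int()   <= bits_long();
}
```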
Binary computers have used an assortment of representations for negative
numbers: sign/magnitude (what you learned in junior high),
ones complement, and twos complement.
For at least the last couple of decades, twos complement has been the most
commonly used of these, but neither current standards nor the facts of
computer design assure that your program will be portable if you assume
twos complement arithmetic.
bit_length(short) < bit_length(long)
We use the word pronounced "null" for several things including:
a throw-away character, a pointer explicitly marked as
not pointing at anything, a string containing no characters.
In C, characters and strings of characters are quite different
sorts of things.
                   ------------------
"abc" is stored as | a | b | c | \0 |
                   ------------------
And the expression "abc" is the name of a constant array of characters.
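A quick sketch of what the trailing \0 means in practice. The three function names are hypothetical; sizeof and strlen( ) are standard.

```c
#include <string.h>

/* 'a' is a single character value; "a" is an array of two chars. */
size_t bytes_in_literal(void)  { return sizeof "abc"; }   /* counts the \0 */
size_t length_of_literal(void) { return strlen("abc"); }  /* stops at the \0 */
char   terminator(void)        { return "abc"[3]; }       /* the \0 itself */
```

The first returns 4 and the second returns 3: the array always needs one more byte than the string's length.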
\a alert (e.g., bell {ASCII BEL is ^G or 7 or 07})
\b backspace {ASCII BS is ^H or 8 or 010}
\f formfeed {ASCII FF is ^L or 12 or 014} aka NP
\n newline {ASCII NL is ^J or 10 or 012} aka LF
\r carriage return {ASCII CR is ^M or 13 or 015}
\t horizontal tab {ASCII HT is ^I or 9 or 011} aka TAB
\v vertical tab {ASCII VT is ^K or 11 or 013}
\\ backslash
\' single quote
\" double quote
\? question mark
An array is a homogeneous aggregate accessed positionally.
Saying that a little "slower", we have...
An array of X (where X is some type) is a sequence
of Xs next to each other in storage. The X that is at the lowest
address is the first one and it is accessed as array_name[0].
The one right after ("above") that one is the second
and it is accessed as array_name[1]. And, so forth.
Note that some datatypes require alignment at an address that is
a multiple of some number (usually 2, 4, 8, or 16); if an item
of the datatype
requires alignment but is "small", some padding bytes might be
required between elements.
As you well know, the subscript can be an expression of
almost any complexity, so long as it produces an integer in
the range 0, 1, ... (1 less than declared size).
a[b] means *(a + b),
provided that one of a and b is a pointer and
the other is an integer.
Combining the type change rule on array names with this rule means
that
array_name[integer] means *(&(array_name[0]) + integer),
and, applying the rules of pointer arithmetic gets us the array
element we want.
Slick! (???)
int *p;
.....
a = p[3];
providing that the value of p points into an array in a way that
three elements past it doesn't jump too far past the end of the
array. The rules also mean that p[3] and 3[p] are synonyms.
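The equivalences above can be checked directly. In this sketch the function names are made up, and the caller is assumed to pass a pointer with at least three more elements after it.

```c
/* All three returns spell the same access: element 3 past p. */
int via_subscript(const int *p) { return p[3]; }
int via_pointer(const int *p)   { return *(p + 3); }
int via_swapped(const int *p)   { return 3[p]; }   /* legal, but don't */
```

Given int a[5] = {10, 20, 30, 40, 50}, each of these returns 40 when called with a, and 50 when called with a + 1.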
A pointer is said to "dangle" if its current value doesn't
find the storage area of a currently "valid" thing.
A struct is an aggregate that may be non-homogeneous
and whose pieces are accessed using names given to them.
Some languages use the name record for essentially
the same kind of thing.
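A minimal sketch of a struct with pieces of different types. The type point3 and the function weighted_x are hypothetical names chosen for the example.

```c
/* A record with pieces of different types, reached by name. */
struct point3 {
    char   label;
    int    x, y;
    double weight;
};

double weighted_x(const struct point3 *p)
{
    return p->x * p->weight;    /* p->x is shorthand for (*p).x */
}
```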
In C, all function parameters are passed by value.
There is no call by reference.
#include <stdio.h>

void get_sum(int a, int b, int *c);

int main(int argc, char *argv[])
{
    int i, j, k;

    i = 3;
    j = 5;
    get_sum(i, j, &k);
    printf("The sum of %d and %d is %d\n", i, j, k);
    return 0;
}

/* c is a pointer passed by value; the function writes through it. */
void get_sum(int a, int b, int *c)
{
    *c = a + b;
}
C is a "block-structured" language --
A block in C looks like this:
{
declarations
...
executable statements
}
The "{" corresponds to ALGOL's begin and "}"
to end.
The scope of variables declared in a block is that block.
Note that the body of a function definition is a block.
When a function returns or when control "falls through" the
end of a block, the local variables are no longer accessible,
and any values they had are "forgotten".
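A sketch of the scope rule, using a block nested inside a function body. The function name scope_demo is invented for the example.

```c
/* Returns 1: the inner x is gone once its block ends. */
int scope_demo(void)
{
    int x = 1;
    {
        int x = 2;    /* a different variable; hides the outer x */
        x = x + 40;   /* changes only the inner x */
    }
    return x;
}
```

Reusing a name this way is legal, but it is an easy way to confuse yourself; some compilers will warn about it if asked.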
extern int x;
Think about the things you learned about entry points and external
symbols when you learned about writing assembly language code.
The notions are essentially the same.
extern int doit(int a, double b);
Sometimes we want to write "helper functions" to be called by other
functions defined in the same file. Often it would be inappropriate
for code in other files to call such functions. We can restrict
access to a function to the file in which it is defined by putting the
word static before its return type in the function definition.
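For instance, a file might look like the sketch below. The names square and sum_of_squares are hypothetical.

```c
/* Visible only inside this file. */
static int square(int n)
{
    return n * n;
}

/* Visible to other files (external linkage is the default). */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}
```

Code in another file can call sum_of_squares( ), but an attempt to call square( ) from there fails at link time.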
C compilers older than the ANSI C standard lacked the const
qualifier that most programmers are accustomed to use to create
named constants. The mechanism available was pre-processor
#define directives. This mechanism is still available
in ANSI C.
#define to specify replacements for certain words (macros)
#undef to remove a macro
#include to insert the content of another file here
#if, #ifdef, #ifndef, #else, #endif, .....
to specify whether certain lines should be
ignored or should be seen by the rest of the compiler
Our concern in this item is the first two of those.
#include is discussed with header files.
The others are discussed with "conditional compilation".
#define prod(a,b) ((a)*(b))
defines a macro with two arguments and its value is the product of them.
While
#define prod (a,b) ((a)*(b))
defines a macro that expands to (a,b) followed by ((a)*(b))
with no substitution for a or b.
#define p(a,b) a*b
then x*p(y+1,z+1)*w
expands to x*y+1*z+1*w
which is not x times the product of (y+1) and (z+1) times w
That is, it "looks like" x*(y+1)*(z+1)*w
but it actually is x*y + z + w
Also, the commonly used macro,
#define max(a,b) (((a)>(b))?(a):(b))
computes the larger of its two arguments. However, the way it works
evaluates one of the arguments (the larger one) twice. Using max( )
with an argument that has a side effect, such as max(i++, j), can apply
that side effect twice, which is almost never what you meant.
(Look elsewhere for a discussion of "undefined" == "Don't do that!";
the same advice applies here.)
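Both macro pitfalls can be seen with concrete numbers. This sketch uses upper-case macro names (BAD_PROD, PROD, MAX), which are my choice for the example and not names from the discussion above; the numbers stand in for x, y, z, and w.

```c
#define BAD_PROD(a,b) a*b
#define PROD(a,b)     ((a)*(b))
#define MAX(a,b)      (((a)>(b))?(a):(b))

int bad_prod_demo(void) { return 2*BAD_PROD(3+1,4+1)*5; }  /* 2*3+1*4+1*5 */
int prod_demo(void)     { return 2*PROD(3+1,4+1)*5; }      /* 2*(4)*(5)*5 */

/* Stores the final value of i in *i_after and returns MAX(i++, 3). */
int max_demo(int *i_after)
{
    int i = 5;
    int m = MAX(i++, 3);   /* i++ is evaluated twice */

    *i_after = i;
    return m;
}
```

On a conforming compiler bad_prod_demo( ) returns 15 while prod_demo( ) returns 200, and max_demo( ) returns 6 and leaves i at 7 rather than the 5 and 6 you might have expected.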
The Unix system call for changing what program a process is running is
known as "exec". There are several interfaces to this facility. The
manual page on CS lists:
int execl(const char *path, const char *arg, ...);
int execv(const char *path, char * const argv[ ]);
int execle(const char *path, const char *arg, ..., char * const envp[ ]);
int execve(const char *path, char * const argv[ ], char * const envp[ ]);
int execlp(const char *file, const char *arg, ...);
int execvp(const char *file, char * const argv[ ]);
Note that these calls either include an argv array or a variable
number of pointers to argument strings (forms with an ellipsis, "...").
When a form with a variable number of arguments is used, the called
routine needs some piece of information that marks for the routine the
end of the list of argument strings. That piece of information is
an argument to the exec function that is a NULL pointer.
If you don't supply that argument, the function is likely to do nasty things
because it doesn't recognize the end of the argument list.
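Here is a sketch of a variable-argument exec call with its terminating null pointer. It assumes a POSIX system with a shell at /bin/sh; the function name run_sh is hypothetical. Writing the terminator as (char *)NULL, rather than a bare NULL or 0, is the usual precaution because the compiler cannot convert arguments in the "..." part for you.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "sh -c command" in a child; return its exit status, or -1. */
int run_sh(const char *command)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;                       /* fork() failed */
    }
    if (pid == 0) {
        /* The (char *)NULL marks the end of the argument list. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                      /* reached only if execl() failed */
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that a successful exec never returns, so the only code that can follow it is the failure case.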
In a Unix environment, signals are analogous to the hardware
functions: interrupts and traps. In other programming
environments, the notion exception is similar.
When you use languages like C in a Unix environment, it is important for you
to understand the roles of header files and libraries. Please note that
in writing about C, a distinction is normally made between "declaration"
and "definition" -- a "declaration" tells you about properties of a thing
but does not "create its storage image"; a "definition" not only tells about
properties of a thing but also has the compiler allocate the thing's memory
image.
| ....
| typedef unsigned long size_t;
| ....
| #define EXIT_FAILURE (1) /* exit function failure */
| #define EXIT_SUCCESS 0 /* exit function success */
| ....
| extern double atof __((const char *));
| extern int atoi __((const char *));
| ....
Standard system headers are rarely just typedefs, #defines, and prototypes.
Usually they contain a lot of preprocessor directives for conditional
compilation -- #if, #ifdef, and the like.
cc -o prog main.c sub1.c sub2.c -lm
The -lm at the end of the line says to use the math library for linking --
in addition, the *.c file that uses the routine(s) must #include <math.h>.
| ls /usr/lib/lib*a
|
| /usr/lib/libAF.a /usr/lib/libaio_raw.a /usr/lib/libots.a
| /usr/lib/libDXm.a /usr/lib/libaud.a /usr/lib/libots2.a
| /usr/lib/libDtHelp.a /usr/lib/libbkr.a /usr/lib/libots3.a
| /usr/lib/libDtSvc.a /usr/lib/libbsd.a /usr/lib/libpacl.a
| /usr/lib/libDtTerm.a /usr/lib/libc.a /usr/lib/libpas.a
| /usr/lib/libDtWidget.a /usr/lib/libc_r.a /usr/lib/libpdf.a
| /usr/lib/libFS.a /usr/lib/libcdrom.a /usr/lib/libpset.a
| /usr/lib/libFutil.a /usr/lib/libcfg.a /usr/lib/libpthread.a
| /usr/lib/libICE.a /usr/lib/libcob.a /usr/lib/libpthreads.a
| /usr/lib/libMrm.a /usr/lib/libcomplex.a /usr/lib/libresolv.a
| /usr/lib/libPW.a /usr/lib/libcsa.a /usr/lib/librpc.a
| /usr/lib/libSM.a /usr/lib/libcurses.a /usr/lib/librpcsvc.a
| /usr/lib/libUfor.a /usr/lib/libcxx.a /usr/lib/librsvp.a
| /usr/lib/libUil.a /usr/lib/libdb.a /usr/lib/librt.a
| /usr/lib/libX11.a /usr/lib/libdbm.a /usr/lib/libst.a
| /usr/lib/libXETrap.a /usr/lib/libdnet_stub.a /usr/lib/libstor.a
| /usr/lib/libXIE.a /usr/lib/libfilsys.a /usr/lib/libsys5.a
| /usr/lib/libXau.a /usr/lib/libfor.a /usr/lib/libsys5_r.a
| /usr/lib/libXaw.a /usr/lib/libisam.a /usr/lib/libtask.a
| /usr/lib/libXaw3d.a /usr/lib/libkdbx.a /usr/lib/libtermcap.a
| /usr/lib/libXdmcp.a /usr/lib/libl.a /usr/lib/libtermlib.a
| /usr/lib/libXext.a /usr/lib/liblmf.a /usr/lib/libtli.a
| /usr/lib/libXi.a /usr/lib/libln.a /usr/lib/libtps_stub.a
| /usr/lib/libXie.a /usr/lib/liblsm.a /usr/lib/libtt.a
| /usr/lib/libXm.a /usr/lib/libm.a /usr/lib/libutil.a
| /usr/lib/libXmu.a /usr/lib/libm_c32.a /usr/lib/libvti.a
| /usr/lib/libXp.a /usr/lib/libmach.a /usr/lib/libxkbfile.a
| /usr/lib/libXpm.a /usr/lib/libmld.a /usr/lib/libxproc.a
| /usr/lib/libXt.a /usr/lib/libmme.a /usr/lib/libxti.a
| /usr/lib/libXtst.a /usr/lib/libmp.a /usr/lib/liby.a
| /usr/lib/libXv.a /usr/lib/libndb.a /usr/lib/libz.a
| /usr/lib/libaio.a /usr/lib/libnuma.a
|
So now the obvious questions are,
"Where on the command line do you put the reference to the library?",
"How do you know that you need to refer to a library?",
and
"How do you name the library on the command line?"
|
| NAME
|
| tgetent, tgetnum, tgetflag, tgetstr, tgoto, tputs - Terminal independent
| operation routines
|
| LIBRARY
|
| Termcap library (libtermcap.a or libtermlib.a)
|
| SYNOPSIS
|
|
The "LIBRARY" section is the message that you need to make sure that the
loader uses a particular library. You almost certainly won't see a sample
cc command. The manual page expects you to know that libtermcap.a corresponds
to -ltermcap and libtermlib.a corresponds to -ltermlib.
In careful discussions about C-code, a distinction is made between
a definition and a declaration.
A definition both describes something and allocates storage for
it. For example, the definition of a variable both shows its
type and causes the compiler to allocate storage; the definition
of a function occurs where we write the code for the body of the function.
A declaration describes properties of an entity. For example,
a file might contain an extern declaration for a variable defined in
another file. A function prototype is a special kind of declaration.
With variables, only the definition may contain an initialization.
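The distinction is easiest to see side by side. In this sketch the names hits and record_hit are hypothetical; in real code the two declarations would normally live in a header file and the two definitions in exactly one .c file.

```c
extern int hits;          /* declaration: describes hits, no storage */
int record_hit(void);     /* declaration: a function prototype */

int hits = 0;             /* definition: storage allocated, initialized */

int record_hit(void)      /* definition: the body is written here */
{
    hits = hits + 1;
    return hits;
}
```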
Preprocessor directives
#if,
#ifdef,
#ifndef,
#else, and
#endif
allow you to have a single file that contains multiple
versions of certain passages of code, one of which will be used.
Reading the directives is tedious, but not profound.
The #endif is required -- that eliminates the "dangling else" ambiguity.
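A small sketch of two common uses: a debugging version selected by a symbol, and a default that can be overridden. NOTES_TRACE, NOTES_BUFSIZE, TRACE, and buffer_size are all names invented for this example.

```c
/* NOTES_TRACE is a hypothetical switch; define it with
   cc -DNOTES_TRACE ... to get the tracing version. */
#ifdef NOTES_TRACE
#include <stdio.h>
#define TRACE(msg) fprintf(stderr, "trace: %s\n", msg)
#else
#define TRACE(msg) ((void)0)
#endif

#ifndef NOTES_BUFSIZE      /* allow an override from the command line */
#define NOTES_BUFSIZE 512
#endif

int buffer_size(void)
{
    TRACE("buffer_size called");
    return NOTES_BUFSIZE;
}
```

Compiled with no -D options, the TRACE line vanishes and buffer_size( ) returns 512.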
advanced header files
See what Andy has to say about his use of
EXTERN
in the Minix sources.
I'll add more here later (I hope).
Until I write some more and dig out some sample code, I'll leave you
with this advice: RTFM (Read The Fantastic Manual).
Seriously, the getopt(3) manual page on most systems is really useful,
and it usually contains an example of using the routine.
Use the command:
man 3 getopt
(Many systems have a getopt(1) for use in scripts, and you're not interested
in that one today.)
The library routines getc( ) and getchar( )
can be used to get the next character from an input source.
getc( ) returns the next character from a stream
that is its argument (a stream is something of type
FILE* -- the type is defined in stdio.h); getchar( )
is the same as getc(stdin).
In C, assignment is an operator rather than
a statement format. An expression followed by a semicolon
is a statement.
while( (c=getchar()) != EOF ) {
putchar(c);
}
Note that an assignment to c happens as a side-effect of
the expression that controls the loop. There's what you might think
is an extra set of parens in that expression -- but it's not extra.
!= has higher precedence than =, so if you omit
those parens, the comparison is done before the assignment and the value
of c is either 0 or 1. (Recall that there is another gotcha
in this idiom -- if c is of type char, you lose
either your ability to copy the byte value 0xFF (assuming
your computer does twos complement arithmetic;
if not, it's a different byte value that you lose) or
your ability to recognize end-of-file, depending on whether char
is signed or not.)
if( (fp=fopen("path_name","r")) == NULL ) {
perror("my_prog_name");
/* Maybe some more error messages out
or a "recovery" or, more likely.... */
exit(1);
}
or
if( (pid=fork()) < 0) {
/* Code for error case -- fork() failed */
}else if( pid == 0 ) {
/* Code for child process. There are calls for
child to learn its pid and its parent's pid */
}else {
/* Code for parent
The positive value of pid identifies the new process */
}
Omitting the parenthesis pair just to the left of the relational operator
in either of these cases does not produce a syntax error, but it doesn't
do what you want either. As coded above, the variable to the left of
the = gets the value the function returns -- which is what
the program needs. Without that parenthesis pair, the variable gets
the result of the comparison, either 0 or 1.
And, now the code:
while( *q++ = *p++ ) ;
This isn't the only place where C-Code uses something of
type char * where it quietly assumes
Violating any of those conditions can lead to disasters.
Depending on where the violation occurs, such disasters
can lead to almost anything, up to and (sometimes)
including a root-compromise of the system.
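One defensive alternative is to tell the copying code how big the destination is. The sketch below is an illustration only: copy_bounded is a hypothetical function, not a library routine, and it still assumes that src points at a properly terminated string.

```c
#include <stddef.h>

/* Copy src into dst, never writing more than dstsize bytes.
   Returns 0 if everything fit, -1 if the copy was truncated.
   dst is always NUL-terminated when dstsize > 0. */
int copy_bounded(char *dst, size_t dstsize, const char *src)
{
    size_t i;

    if (dstsize == 0) {
        return -1;
    }
    for (i = 0; i + 1 < dstsize && src[i] != '\0'; i++) {
        dst[i] = src[i];
    }
    dst[i] = '\0';
    return (src[i] == '\0') ? 0 : -1;
}
```

The return value matters: a silently truncated string can be its own kind of disaster, so the caller should check it.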
Most input and output are buffered.
You need to remember that.
You saw make and Makefiles in CS 3481-3482.
The sort of stuff you saw was probably
something like this.
You can also learn about make and makefiles from the manual pages
and by reading makefiles distributed with various pieces of code (e.g., Minix).
There is still a lot more to learn.
There are books about make (see, for example, O'Reilly's list of titles).
Table driven code is often very compact and easy to maintain.
I won't say much about it here, but I do refer you to the main program
of the sort code given to you for the racing sorts exercise and
to some sample code related to termcap and signals.
You'll find examples of table driven code there.
checking return values, using perror( )
A system call is a special kind of function call.
It looks like an ordinary library function call, but it is
different in that the "guts" of the implementation reside in the
operating system rather than in code bound into "user space".
The exact list of system calls is implementation dependent.
For example, the Unix environment provides several variations on "exec".
They might all be implemented as system calls or an implementation might
have only one "exec" system call and the others might be library
routines providing alternate "front ends" to that call.
----
Modified: 21-Feb-2002, 17-Mar-2002, 31-Jan-2003 (fixing typos),
24-Jan-2005 (Typos, portability note, link to cmd-line-args, C++ vs C,
parameter passing, scope and lifetime)
File time-stamp: Wednesday, 09-Feb-2005 11:10:36 EST