Draft Version 0.5, November 10 2003
Programming languages today are based on concepts identical to the ones spoken languages are based on. This was not the case with early programming languages, which were created ad-hoc. Languages improved when designers applied formal language theory and formal semantics.
Programming language syntax is usually defined in a formal language such as Extended Backus-Naur Form or syntax diagrams. Semantics are much harder to define and are created with the aid of an 'abstract machine' to model the computer hardware.
Most programming languages belong to a class called context-free languages that are easy to parse using a stack. Context-free means that the language's syntax definitions do not depend on the context in which syntax elements appear. This makes it possible to generate the lexer and parser (parts of the compiler) automatically from lexical specifications and syntax. The programs that do this are called, respectively, a lexer generator and a parser generator.
Open a book about a programming language and scan the contents. You find chapter headings like:
The similarities are apparent, but so are the differences. At the highest level, a programming language consists of:
Syntax errors - When program code fails to correspond to the syntax rules of the language, a syntax error occurs. Examples of syntax errors include using the wrong number or type of arguments in a function call; using improper capitalization of keywords in a case-sensitive language; misspelling a keyword; using comment delimiters not supported by the language.
Semantic errors - When the meaning of program text does not make sense in the context in which it appears during program execution. Thus, semantic errors are often called runtime errors because they often manifest only at runtime. Examples of semantic errors include type mismatch - when a program object (e.g. variable) is used in the wrong context; using an undefined value, i.e. the value in a variable which has not been initialized; attempting to use a method not exposed by a class.
(The above modified from Fischer and Grodzinsky)
If syntax and semantics were all we cared about there might be only one programming language. However there are many, and they exist because programs are used in different kinds of work. These kinds of work fall into two categories that form our highest-level idea of what a program does. (ibid):
These two concepts indicate very different fundamental requirements in a language, and therefore two major categories of languages. I will refer to these as systems languages (first in the above list) and non-systems languages. Requirements for languages in these categories differ in terms of level of support for abstraction, capability for implicit and explicit communication, capacity for alternate expressions of a process or model, and the way data is mapped onto the machine.
After writing a few programs with a particular language and getting to know its capabilities, a programmer may remark on the language's lack of support to perform certain tasks, or that it requires writing lots of code to accomplish. The language lacks the ability to represent in a sufficiently abstract way the ideas the programmer needs to communicate.
Capacity to abstract ideas using a language is indicated by three qualities of data representation. Systems languages have less capacity in these qualities than do non-systems languages (ibid).
Source code is read by a program called a parser or compiler. There are several stages in the process:
Besides semantic clarity - a quality of a language whereby the semantic intent of program text is easy to determine - language quality is characterized by the quality and availability of its features (ibid):
One must keep in mind that languages are more alike than different, and often belong to more than one category. These family classifications are intended to extend the reader's vocabulary more than to indicate their differences. This is only a summary of the more well-known language categories and is not meant to be thorough. (ibid).
Metalanguage describes the parts of a programming language used to manipulate the language rather than data in a program. This section describes the elements usually found in a metalanguage, and comments on the tradeoffs due to the way these features are implemented (ibid).
When parsing tokens, the lexer follows a set of lexical rules defined by the language. Lexical rules describe the tokens supported by a language. Common types include:
Rules for delimiting tokens have a huge impact on the language - tradeoffs are made in order to
Many languages use a single pair of marks to delimit beginning and end of scope in all cases. There are tradeoffs in doing this: the language is easier to learn and simpler to translate, but at the price of semantic clarity. For example, nesting errors can be misinterpreted as a logic error by the compiler, and these are notoriously difficult and time-consuming to debug.
The kinds of comments permitted in languages today are:
Allowing comments to begin and end anywhere is considered overly-general because it is easy to forget or misplace an end marker. This could result in a semantically invalid program that compiles, though it would not work correctly.
You need some understanding of the hardware to really understand primitive types. Such understanding includes knowing that computer memory (RAM) consists of bits, bytes and words; that bytes are segments of 8 bits; and that words on modern hardware are 4 or 8 bytes long.
The most primitive of the primitive data types are bytes and words. All other data types are mapped onto these. Computer instruction sets have a few logical instructions for operations on bytes and words. These are the logical instructions that perform left-shift, and, or, xor, and bitwise complement. The other instructions are for use with all other data types, which are represented by encodings (e.g. ASCII) mapped onto bit strings of bytes and words.
Three physical properties determine how a type's elements are mapped to memory.
Language designers choose a set of primitive types to include in a language, but language implementors may choose to support additional types.
There are two kinds of object:
Semantics differentiate variables and pure values:
Initialization and Assignment
Storage objects that have not been initialized or assigned to are said to contain garbage, or that the value is undefined.
Assignment is usually designated using an assignment operator such as '=', ':='. In some languages assign returns a value, allowing multiple assignment. In other languages assign is a statement and the operator cannot be included in the middle of an expression.
Dereferencing Contents of Storage Objects
Dereferencing - extracting contents from a storage object - is often accomplished by merely using the storage object's name. Many languages use context to determine when to dereference a storage object. Besides adding complexity to a language, this means that a particular name will sometimes be a reference and other times a pure value.
Operations that dereference storage objects require a reference to the object and return a value. When the object reference is a pointer variable, the return is another reference. Languages that support pointer expressions provide an operator or combined operations to designate a dereference, rather than performing the dereference automatically. E.g. C provides the "*"
Pointer assignment - dereferencing
In these expressions dereferencing may be automatic (e.g. in Pascal) or the language may allow for non-automatic dereferencing (e.g. C provides the "&" to prevent automatic dereferencing).
Note that language-specific syntax can vary for pointer assignment: right side assignments that refer to an array or a function are not dereferenced in C.
Storage Objects and Memory
The language translator has three strategies to use in managing storage objects: static storage, stack storage, and heap storage. Stack and heap storage comprise dynamic storage.
Static storage objects - called "immortal" - are allocated before execution and remain there until the program terminates. The number of these objects is fixed. The kinds of static objects vary by language. Global variables are static; some languages (C, Visual Basic) allow nonglobal object declarations using the keyword "Static." Usefullness of static local variables over global variables is that it prevents accidental change in unrelated routines.
Dynamic Storage - Stack
Procedure calls and block structured code give rise to nested lifetimes of storage objects. Storage for these objects is easy to manage using a stack.
A storage manager updates a stack allocation pointer when a block is entered and when it is exited. Allocating local storage objects increments the pointer the number of bytes required, deallocating decrements the pointer.
Rule of thumb: when a language supports both heap and stack storage, use stack storage whenever possible. Lifetime control of stack-allocated objects is very efficient and automatic with a stack - this is where the "auto" storage (type) specifier in C comes from.
Structure of stack frames varies depending on scope rules and whether a language is block-structured or not. Assuming a block-structured language which is lexically-scoped, the storage manager will create a new stack frame whenever control enters a new block of code (e.g. in C this would be a function or any control structure that sets off code with curly brackets).
Fischer, Alice E. and Grodzinsky, Frances S., 1993, The Anatomy of Programming Languages: Prentice Hall, pages. ISBN 0-13-035155-5
Kernighan, Brian W. and Ritchie, Dennis M., 1988, The C Programming Language, Second Edition: Prentice Hall, pages. ISBN 0-13-110370-9
Schwartz, Randal L., Olson, Erik and Christiansen, Tom, 1997, Learning Perl on Win32 Systems: O`Reilly & Associates, pages. ISBN 1-56592-324-3