Introduction to languages and compilers

Course notes for Section 2, Chapter 2.1 of the course SEG 2106
by Gregor v. Bochmann, February 2004 (translated to English: January 2008, revised 2009, 2010, 2012, 2013, 2015)

Natural languages

In a (natural) language, a sentence is a sequence of words, and a word (also called a lexeme or lexical unit) is a sequence of characters (possibly a single one). The set of characters used in a language is finite. The set of possible sentences that can be formed in a language is infinite (there is no limit on the length of a sentence). One uses a dictionary to list all the words of a language. These words (or lexemes) are usually classified into different lexical categories, such as verb, noun, pronoun, preposition, etc. One uses a grammar (also considered a set of syntax rules) to determine which sequences of words are well-formed (that is, those that are part of the language). The well-formed sentences normally have some meaning that humans can understand.
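
The dictionary lookup and lexical classification described above can be sketched in a few lines of Python. The sentence, the dictionary entries, and the category names below are invented for illustration; they are not taken from the course notes.

```python
# A minimal sketch of lexical classification: split a sentence into
# words and look up each word's lexical category in a small dictionary.
# (Words and categories here are illustrative only.)

DICTIONARY = {
    "the": "article",
    "cat": "noun",
    "dog": "noun",
    "sees": "verb",
    "sleeps": "verb",
}

def classify(sentence):
    """Return each word of the sentence paired with its lexical category."""
    words = sentence.lower().rstrip(".").split()
    return [(w, DICTIONARY.get(w, "unknown")) for w in words]

print(classify("The cat sees the dog."))
# each word is paired with its lexical category, e.g. ('cat', 'noun')
```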

Examples of sentences (in French)

Analysis of sentences in three levels:

Examples of (French) sentences that have problems

Object-oriented view of sentence analysis

Computer languages

Similarly, in computer (or programming) languages, one speaks of a program (corresponding to a sentence), which is a sequence of lexical units, and the lexical units (or lexemes) are sequences of characters.
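
To make this correspondence concrete, here is a sketch (not from the course notes) of a tiny tokenizer that splits the text of a statement into its lexical units, each tagged with a lexical category. The token categories and the sample statement are chosen for illustration.

```python
# A program is a sequence of lexical units: this toy tokenizer turns
# the characters of a statement into (category, lexeme) pairs.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),          # integer literals
    ("IDENT",  r"[A-Za-z_]\w*"), # identifiers
    ("OP",     r"[+\-*/=]"),     # operators
    ("SEMI",   r";"),            # statement terminator
    ("SKIP",   r"\s+"),          # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Split the character sequence into its lexical units."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("x = 42 + y;"))
# the statement is a sequence of five lexical units plus the semicolon
```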

Language processing

A compiler translates a well-formed program in the source language into object code (a program written in the target language). Normally, one wants the semantics of the generated object code (as defined by the target language) to be the same as the semantics of the source program (as defined by the source language). Sometimes one also says that the semantics of the source program is the semantics of the object code generated by the compiler; however, this definition is usually not very useful, since the semantics of the generated object code would be difficult to understand.
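
The requirement that the object code preserve the semantics of the source program can be illustrated with a toy example. The following sketch (entirely invented for illustration) "compiles" arithmetic expressions into code for a hypothetical stack machine and then checks that running the object code gives the same result as evaluating the source expression directly.

```python
# Toy compiler from arithmetic expressions (source language) to
# stack-machine instructions (target language), plus an interpreter
# for the target language. The instruction set is invented here.
import ast
import operator

OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}

def compile_expr(node, code):
    """Emit stack-machine instructions for an arithmetic expression."""
    if isinstance(node, ast.Constant):
        code.append(("PUSH", node.value))
    elif isinstance(node, ast.BinOp):
        compile_expr(node.left, code)
        compile_expr(node.right, code)
        code.append((OPS[type(node.op)], None))
    return code

def run(code):
    """Interpret the generated object code on a stack."""
    funcs = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(funcs[op](a, b))
    return stack[0]

source = "2 + 3 * 4"
code = compile_expr(ast.parse(source, mode="eval").body, [])
# semantics preserved: object code computes the same value as the source
assert run(code) == eval(source)
print(code, "->", run(code))
```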

A compiler usually realizes the translation in several steps; correspondingly, it contains several components. This may be represented by different diagrams, such as the following: Version-1, Version-2. Usually, a compiler includes (at least) separate components for verifying the lexical rules and the syntax rules:
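
The two components can be sketched as follows (a minimal illustration, not the structure of any real compiler): a lexical analyzer that turns characters into tokens, and a syntax analyzer that checks the token sequence against a grammar. The grammar assumed here is `expr -> NUMBER (("+" | "*") NUMBER)*`.

```python
# Two separate components: lexical analysis, then syntax analysis.
import re

def lexer(text):
    """Lexical analysis: characters -> list of tokens."""
    tokens = re.findall(r"\d+|[+*]", text)
    if "".join(tokens) != text.replace(" ", ""):
        raise SyntaxError("lexical error: illegal character")
    return tokens

def parser(tokens):
    """Syntax analysis: accept the token sequence or raise an error.
    Grammar assumed: expr -> NUMBER (("+" | "*") NUMBER)*"""
    pos = 0

    def expect_number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError("syntax error: number expected")
        pos += 1

    expect_number()
    while pos < len(tokens):          # ( ("+" | "*") NUMBER )*
        if tokens[pos] not in "+*":
            raise SyntaxError("syntax error: operator expected")
        pos += 1
        expect_number()
    return True

print(parser(lexer("12 + 3 * 4")))   # well-formed input is accepted
```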

These rules can be defined in different manners and using different formalisms. In general, the syntax rules are more complex than the lexical rules. Therefore, the formalism used to express the syntax rules of a language is normally more powerful than the formalism used to describe the lexical rules. Certain formalisms may be used for both purposes, as for instance context-free grammars (see the chapter on syntax analysis).
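
The difference in power can be illustrated with a small sketch (the examples are mine, not the course's): a lexeme such as an integer is described by a regular expression, while a nested structure such as balanced parentheses requires a recursive grammar rule like `S -> "(" S ")" S | epsilon`, which no regular expression can express.

```python
# Lexical rules are regular; syntax rules are typically context-free.
import re

def is_number(lexeme):
    """Lexical rule: a regular expression suffices."""
    return re.fullmatch(r"\d+", lexeme) is not None

def parse_S(s, pos):
    """Syntax rule: recursive-descent recognizer for
    S -> "(" S ")" S | epsilon"""
    if pos < len(s) and s[pos] == "(":
        pos = parse_S(s, pos + 1)
        if pos >= len(s) or s[pos] != ")":
            raise SyntaxError("missing ')'")
        return parse_S(s, pos + 1)
    return pos

def balanced(s):
    """True iff s is a well-formed string of balanced parentheses."""
    try:
        return parse_S(s, 0) == len(s)
    except SyntaxError:
        return False

print(is_number("42"), balanced("(()())"), balanced("(()"))
# the recursion in parse_S mirrors the recursion in the grammar rule
```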

In general, there are two problems to be dealt with in relation to computer languages (and also for natural languages when they are processed by computers):

- the generation problem: producing sentences (or programs) that belong to the language;
- the recognition problem: deciding whether a given sequence of words (or lexical units) belongs to the language.

Certain definition formalisms are more suitable for the generation problem, while others are more suitable for the recognition problem. Clearly, a compiler has to solve the recognition problem.
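
The two problems can be contrasted on a small example language (chosen for illustration, not taken from the notes): the language {aⁿbⁿ} defined by the grammar `S -> "a" S "b" | epsilon`. A generator produces sentences of the language from the grammar; a recognizer decides whether a given string belongs to it.

```python
# Generation vs. recognition for the language {a^n b^n},
# defined by the grammar  S -> "a" S "b" | epsilon.

def generate(n):
    """Generation: derive a sentence using n applications of the
    recursive rule, then the epsilon rule."""
    return "" if n == 0 else "a" + generate(n - 1) + "b"

def recognize(s):
    """Recognition: decide whether a given string is in the language."""
    if s == "":
        return True
    if s.startswith("a") and s.endswith("b"):
        return recognize(s[1:-1])
    return False

print(generate(3))          # a sentence of the language
print(recognize("aaabbb"))  # accepted
print(recognize("aabbb"))   # rejected: unequal numbers of a's and b's
```

A compiler plays the role of `recognize`: given an arbitrary input, it must decide whether the input is a well-formed program, whereas a grammar read "forwards" naturally describes how sentences are generated.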

In the following two chapters of this course, we will discuss lexical analysis and syntax analysis.


Last update: February 3, 2015