SEG-2106 - Software Construction - Winter 2016

Assignment 2 : Lexical Analysis

Given: February 22 - Due date: Monday or Tuesday, March 14 or 15, during the lab session - Students may work individually or in groups of two.

Please submit: Report on paper. Lex files in a zip file sent by e-mail to the TA Rui Chen ‎[rchen040@uottawa.ca]‎ who will mark your report.

1. Find regular expressions for languages (15 points)

Find regular expressions that define the following languages:

XML tags where the opening tags may include attribues. The form of such tags is given by the following examples:
- opening tag: <name-x attribute-1="some string" attribute-2="xyz">
- opening tag with closing: <funny laugh123="go" s-t-o-p=";>"/>
- closing tag: </name-x>
- Note: The alphabet is composed of the sets alpha, digit and {-,<,>,/, ", =, ; }.
All strings over the alphabet {a, b} that do not contain the substring aba.
All strings over the alphabet {a, b} for which the number of "a" is a multiple of 3 (including zero).

Clarifications:

Les noms des tabs et attributs ont la forme habituelle des identificateurs, excepté que le caractère "-" peut être utilisé.
Pour la définition des expressions régulières, vous pouvez vous servir des extensions aux opérateurs de base suivantes: +, ?, character class; mais non pas le ^ .

2. Find accepting automata (15 points)

For each of the languages defined under point (1) above, find a deterministic accepting automaton that accepts exactly that language.

3. Comparison of regular expressions (20 points)

We consider the following pairs of regular expressions. For each pair, you should indicate whether the two expressions define the same language or not. If they are equivalent, you should give a proof by considering the algebraic properties of the regular expression operators. If they are not equivalent, you should give an example of a string that is included in one of the languages, but not in the other.

(a|b|c)* a*|b*|c*
a(b|c)|(aa|bb)c ac|bbc|ab|aac
a(bca)*bc ab(cab)*c
a (b*a|cba)* (ab*|acb)*a

4. Construct a lexical analyser using Lex (50 points)

Your task Construct a lexical analyzer for a subset of the Pascal language The analyzer must (at least) be able to count the occurences of identifiers (including keywords and reserved words), numbers, comments and lines of code in the program. Note: for this exercise, you should assume that keywords and reserved words count as identifiers.

Method: You should use Lex (or Flex) as you have seen in Lab-5.

Information: In the following, you find the definition of the Pascal syntax, some notes on lexical analysis, and a sample Pascal program. The syntax is defined in a version of BNF notation (you do not have to understand this notation in detail. Here is some explanation of the notation used in the grammar definition:

Words in italics represent non-terminals (sub-language identifiers) which you may ignore. They are important for syntax analysis.
The token id and num are explained below.
Words written in bold are mostly keywords or reserved words (for instance program, array, etc.). You can count them as identifiers.
Other characters, such as [, ], :, ;, (, ), are also tokens. They need not be counted.
Some words written in bold are operators, such as assignop (assignment operator, ":="), or relop (relational operator, e.g. "<", ">", etc.) and others.

Syntax definition: