Building an expression parser using Antlr and JavaCC parser generators

The objective of this tutorial is to show how you can write a lexer and a parser for the following grammar:

The following is the set of rules:

expr := sum

sum := term ( ( PLUS | MINUS ) term )*

term := element ( ( MULTIPLY | DIVIDE ) element )*

element := INTEGER | LPAREN sum RPAREN

The following are the terminals:

PLUS := "+"

MINUS := "-"

MULTIPLY := "*"

DIVIDE := "/"

LPAREN := "("

RPAREN := ")"

INTEGER := ( DIGIT ) +

DIGIT := "0" – "9"
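Before turning to the tools, it may help to see what a parser for this grammar does when written by hand. The following is a minimal sketch, not output of ANTLR or JavaCC: the class name ExprSketch and its helper methods are made up for illustration, and it reads characters directly instead of using a separate lexer. Each rule of the grammar becomes one method.

```java
// Hand-written sketch of the grammar above. ExprSketch and its methods are
// illustrative names, not code generated by ANTLR or JavaCC.
final class ExprSketch {
    private final String src;
    private int pos = 0;

    private ExprSketch(String src) { this.src = src; }

    /** expr := sum, followed by the end of the input. */
    static float eval(String text) {
        ExprSketch p = new ExprSketch(text);
        float value = p.sum();
        if (p.peek() != '\0') {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    /** sum := term ( ( PLUS | MINUS ) term )* */
    private float sum() {
        float f = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            float f2 = term();
            f = (op == '+') ? f + f2 : f - f2;
        }
        return f;
    }

    /** term := element ( ( MULTIPLY | DIVIDE ) element )* */
    private float term() {
        float f = element();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            float f2 = element();
            f = (op == '*') ? f * f2 : f / f2;
        }
        return f;
    }

    /** element := INTEGER | LPAREN sum RPAREN */
    private float element() {
        if (peek() == '(') {
            pos++;
            float f = sum();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ) at " + pos);
            }
            pos++;
            return f;
        }
        int start = pos;
        while (pos < src.length() && src.charAt(pos) >= '0' && src.charAt(pos) <= '9') {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected integer at " + pos);
        }
        return Float.parseFloat(src.substring(start, pos));
    }

    /** Skips blanks and returns the next character, or '\0' at end of input. */
    private char peek() {
        while (pos < src.length() && src.charAt(pos) == ' ') {
            pos++;
        }
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(eval("2 + 3 * (4 - 1)")); // prints 11.0
    }
}
```

The nesting of the methods is what gives MULTIPLY and DIVIDE a higher precedence than PLUS and MINUS: sum only ever sees the finished value of a term.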

First a few words about each tool:

ANTLR Installation:

Download the file antlr_2_5_0.zip from http://www.ANTLR.org/
Unpack it into your home directory, using the superzip utility.
The directory antlr-2.5.0 will be created.
Set up your CLASSPATH to point to this directory, for example:

CLASSPATH=H:\antlr\antlr-2.5.0;.

To run antlr on the grammar file mygrammar.g, use the command:

java antlr.Tool mygrammar.g

JavaCC Installation:

Download the file JavaCC0_8pre1.class from http://www.suntest.com/JavaCC/
This is a Java class file; you can install JavaCC by running it:

java JavaCC0_8pre1

The version of JavaCC has changed, so you may find a different file on the JavaCC website.
The installation program will guide you through the installation process; you will have to provide an installation directory.
Add the installation path to your PATH environment variable.
To run JavaCC on the grammar file mygrammar.jj, use the following command:

javacc mygrammar.jj

Both tools are written in Java and produce Java code.

Both use LL(k) analysis.

Comments, character literals, and string literals use the same syntax as in Java.

Building the Grammar file for Antlr:

The grammar file has the following structure:

header
{

// A header section contains source code that must be
// placed before any ANTLR-generated code. Usually you will
// have the imported classes here.
}

class MyParser extends Parser;
options
{
}

{
// you can enter code for additional methods and
// variables here ….
}

parser_rules:
.
.
.

class MyLexer extends Lexer;
options
{
}

{
// you can enter code for additional methods and
// variables here ….
}

lexer_rules:
.
.

EOF

As you can see, the syntax for the parser and the lexer is the same. The only difference is that lexer rules match characters on the input stream, while parser rules match tokens on the token stream. Lexer rule names start with an uppercase letter, and parser rule names start with a lowercase letter.

The syntax of the rule is:

rule_name

[int a, String s] // input arguments

returns [int x] // return values

...

;

EBNF Rule Elements

Antlr supports extended BNF notation through the following four kinds of subrule:

( P1 | P2 | ... | Pn ) -> match P1 or P2 or ... Pn

( P1 | P2 | ... | Pn )? -> match zero or one of P1 or P2 ... Pn

( P1 | P2 | ... | Pn )* -> match zero or more of P1 or P2 ... Pn

( P1 | P2 | ... | Pn )+ -> match one or more of P1 or P2 ... Pn
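In a hand-written LL parser these subrule forms usually turn into plain control flow: an optional subrule becomes an if, a zero-or-more subrule becomes a while loop guarded by the lookahead, and a one-or-more subrule becomes a loop whose body must run at least once. The sketch below illustrates this for a made-up rule, list := ( '+' | '-' )? ( DIGIT )+ ( ',' ( DIGIT )+ )*. The rule, the class name SubruleSketch and its methods are assumptions chosen for illustration; this is not what Antlr actually generates.

```java
// Illustrative only: a hand-written matcher for the made-up rule
//   list := ( '+' | '-' )? ( DIGIT )+ ( ',' ( DIGIT )+ )*
final class SubruleSketch {
    private final String s;
    private int pos = 0;

    private SubruleSketch(String s) { this.s = s; }

    /** Returns true if the whole text matches the rule "list". */
    static boolean matches(String text) {
        SubruleSketch m = new SubruleSketch(text);
        return m.list() && m.pos == text.length();
    }

    private boolean list() {
        // ( '+' | '-' )?  : optional, so an if without an else branch
        if (la() == '+' || la() == '-') {
            pos++;
        }
        // ( DIGIT )+      : one or more
        if (!digits()) {
            return false;
        }
        // ( ',' ( DIGIT )+ )*  : zero or more, a loop guarded by the lookahead
        while (la() == ',') {
            pos++;
            if (!digits()) {
                return false;
            }
        }
        return true;
    }

    /** ( DIGIT )+ : the body must match once, then repeats while it can. */
    private boolean digits() {
        if (!isDigit(la())) {
            return false;
        }
        do {
            pos++;
        } while (isDigit(la()));
        return true;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** One character of lookahead; '\0' marks the end of the input. */
    private char la() {
        return pos < s.length() ? s.charAt(pos) : '\0';
    }
}
```

Every decision in the sketch is made by looking at a single upcoming character, which corresponds to a lookahead of k=1.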

Writing the Grammar for the Calc Example:

Create a new file and call it "calc.g" (the name has no influence on the grammar).
Enter a header. We do not need one for our parser; it is included only as an example:

header
{
   import java.lang.*;
}

I will write the lexer rules, to define the terminals or the tokens for our calc grammar:

class CalcLexer extends Lexer;
options
{
   // a lookahead of 2
   k=2;
}

// white space, these will be skipped by the lexer:
WS: ( ' ' | '\t' | '\r') { _ttype = Token.SKIP; } ;

// terminals:

EOL       : '\n' ;
LPAREN    : '(' ;
RPAREN    : ')' ;
PLUS      : '+' ;
MINUS     : '-' ;
MULTIPLY : '*' ;
DIVIDE    : '/' ;
INT       : (DIGIT)+ ;

// protected rules, can not be accessed from the parser, can be used
// only from other lexer rules:

protected
DIGIT : '0'..'9' ;

Unless changed manually, each lexer rule returns a token.
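To make the lexer's job concrete, here is a small hand-written sketch of the same tokenization in plain Java. It is only an approximation of what the lexer rules describe, not the code antlr generates; the class name LexerSketch and the string form of the tokens are assumptions made for illustration. White space is dropped (as WS does with Token.SKIP), and every other rule yields one token.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative hand-written tokenizer mirroring the lexer rules above.
final class LexerSketch {
    /** Splits the input into token names; INT tokens also carry their text. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // WS: skipped, no token is produced
            if (c == ' ' || c == '\t' || c == '\r') {
                i++;
                continue;
            }
            // INT: (DIGIT)+ consumes as many digits as possible
            if (c >= '0' && c <= '9') {
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("INT(" + input.substring(start, i) + ")");
                continue;
            }
            // single-character terminals
            switch (c) {
                case '\n': tokens.add("EOL"); break;
                case '(':  tokens.add("LPAREN"); break;
                case ')':  tokens.add("RPAREN"); break;
                case '+':  tokens.add("PLUS"); break;
                case '-':  tokens.add("MINUS"); break;
                case '*':  tokens.add("MULTIPLY"); break;
                case '/':  tokens.add("DIVIDE"); break;
                default:
                    throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at " + i);
            }
            i++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [INT(12), PLUS, LPAREN, INT(3), MULTIPLY, INT(4), RPAREN, EOL]
        System.out.println(tokenize("12 + (3*4)\n"));
    }
}
```

Note that DIGIT never appears in the output: like the protected rule, it only helps to build INT.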

Now we will write the rules for the parser:

class CalcParser extends Parser;
expr returns [float f]
{
   f=0;
}
:
     f=sum
     ( EOL )*
     EOF
;

sum returns [float f]
{
   f=0;
}
:
   term
   (
      ( PLUS term | MINUS term )
   )*
;

term returns [float f]

{
   f=0;
}
:
    element
    (
      ( MULTIPLY element | DIVIDE element )
    )*
;

element returns [float f]
{
   f=0;
}
:
   INT | LPAREN sum RPAREN
;

Provide actions in the parser rules, to handle the calculation of the expression:

For example:

sum returns [float f]
{
    f=0; float f2=0;
}
:
    f=term //assign to f the value returned by the rule term
    (
       (    PLUS f2=term { f += f2; }
         | MINUS f2=term { f -= f2; }
       )
    )*
;

Notice that you do not need to return the value f explicitly at the end of the rule; antlr does it for you.
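Because each action runs as soon as its alternative has been matched, the loop in the sum rule accumulates its result from left to right. The following sketch replays just those actions in plain Java; SumActionSketch and fold are hypothetical names, and the operators and operand values are passed in as arrays instead of being parsed.

```java
// Illustrative replay of the actions in the sum rule; nothing here is parsed.
final class SumActionSketch {
    /** Evaluates "first (op operand)*" in input order, like the rule's loop. */
    static float fold(float first, char[] ops, float[] operands) {
        float f = first;                  // f=term
        for (int i = 0; i < ops.length; i++) {
            float f2 = operands[i];       // f2=term
            if (ops[i] == '+') {
                f += f2;                  // PLUS  f2=term { f += f2; }
            } else {
                f -= f2;                  // MINUS f2=term { f -= f2; }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        // 8 - 3 - 2 is evaluated as (8 - 3) - 2
        System.out.println(fold(8f, new char[] {'-', '-'}, new float[] {3f, 2f})); // prints 3.0
    }
}
```

A right-to-left evaluation would give 8 - (3 - 2) = 7 instead, so the iterative form of the rule is what makes subtraction left-associative.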

The modifications to the other rules are very similar to those in the sum rule. The complete file is listed here:

class CalcParser extends Parser;

expr returns [float f]
{
   f=0;
}
:
    f=sum
    ( EOL )*
    EOF
;

sum returns [float f]
{
   f=0;
   float f2=0;
}
:
   f=term
   (
      (    PLUS f2=term { f += f2; }
        | MINUS f2=term { f -= f2; }
       )
    )*
;

term returns [float f]
{
   f=0;
   float f2=0;
}
:
    f=element
    (
       ( MULTIPLY f2=element { f *= f2; }
         | DIVIDE f2=element { f /= f2; }
       )
    )*
;

element returns [float f]
{
   f=0;
   Float x;
}
:
    d:INT { x = new Float(d.getText());
            f = x.floatValue();
          }
    |
      LPAREN f=sum RPAREN
;

class CalcLexer extends Lexer;
options
{
k=1;
}

WS: ( ' ' | '\t' | '\r')
   {
      _ttype = Token.SKIP;
   }
;

EOL      : '\n' ;
LPAREN   : '(' ;
RPAREN   : ')' ;
PLUS     : '+' ;
MINUS    : '-' ;
MULTIPLY : '*' ;
DIVIDE   : '/' ;
INT      : (DIGIT)+ ;

protected
DIGIT : '0'..'9' ;
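Since all rules compute with float, a few results of the calculator follow directly from Java's floating-point arithmetic rather than from the grammar. The sketch below shows three of them; FloatNotes, toFloat and divide are hypothetical helper names, and Float.parseFloat is used here as an equivalent of the new Float(...).floatValue() conversion in the element rule.

```java
// Illustrative notes on the float arithmetic used by the actions.
final class FloatNotes {
    /** Converts the text of an INT token to a float. */
    static float toFloat(String digits) {
        return Float.parseFloat(digits);
    }

    /** The same operation as the DIVIDE action, f /= f2. */
    static float divide(float f, float f2) {
        return f / f2;
    }

    public static void main(String[] args) {
        System.out.println(divide(7f, 2f));      // prints 3.5 (no integer division)
        System.out.println(divide(1f, 0f));      // prints Infinity (no exception is thrown)
        System.out.println(toFloat("16777217")); // prints 1.6777216E7 (float precision is limited)
    }
}
```

In other words, an input such as 1/0 is a valid expression for this grammar, and very large integers lose their last digits once they are converted.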

To generate the Parser and Lexer code, we need to run antlr.

On the command line, type the following command:

java antlr.Tool calc.g

I am assuming that your PATH points to the directory where Java is installed. If not, add the following directory to your PATH environment variable:

Q:\WINDOWS\jdk-w32\bin

antlr should finish after generating a few Java files. To compile them, run javac, for example:

javac *.java

This will produce the class files.

The following program shows how we can use the lexer and the parser:

import java.io.*;

import antlr.*;

class Calc
{
    public static void main(String[] args)
    {
      try
      {
        CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
        CalcParser parser = new CalcParser(lexer);

        // Parse the input expression
        System.out.print("Enter Expression: ");
        System.out.flush();

        float res = parser.expr();

        System.out.println("Result: " + res);
      }
      catch(IOException e)
      {
         System.err.println("IOException: " + e);
      }
      catch(ParserException e)
      {
         System.err.println("ParserException: " + e);
      }
   }
}

Put the above code into a file called "Calc.java", compile it, and run it using the java command:

java Calc

Building the Grammar file for JavaCC:

The grammar file has the following structure:

options {
// options for the parser generator
}

PARSER_BEGIN(my_parser_name)

public class my_parser_name
{
// code for methods and variables for the class.
}

PARSER_END(my_parser_name)

// the SKIP section lists the white space that separates the tokens:
SKIP :
{
}

// The TOKEN section defines the tokens for the lexer:
TOKEN :
{
}

production_rules
.
.
.

EOF

Production rules have the following syntax:

java_return_type
identifier ( parameter_list )
:
{
//java code
}
{
expansion_choices
}

Writing the Grammar for the Calc Example:

Create a file and call it "calc.jj"; again, the filename has no influence on its contents.
Enter a header for the file. In this case the header contains the options that change the behavior of JavaCC.

options
{
LOOKAHEAD=1;
}

Enter the compilation unit. This is the code that lets user code access the parser; for now it creates the parser and reads input from stdin:

PARSER_BEGIN(Calc)
public class Calc
{
   public static void main(String args[])
   {
      try
      {
         Calc parser = new Calc(System.in);
         System.out.print("Enter Expression: ");
         System.out.flush();
         float res = parser.expr();
         System.out.println("Result: " + res);
      }
      catch (ParseException e)
      {
         System.out.println("Exception: " + e);
      }
   }
}

PARSER_END(Calc)

Enter SKIP and TOKEN specification:

// we want to skip these characters; notice that here characters are
// enclosed in "", unlike in the antlr syntax.

SKIP :
{
" "
| "\r"
| "\t"
}

// the end of line is a token
TOKEN :
{
< EOL: "\n" >
}

// the terminals in the grammar:
TOKEN :
{
< PLUS: "+" >
| < MINUS: "-" >
| < MULTIPLY: "*" >
| < DIVIDE: "/" >
| < LPAREN: "(" >
| < RPAREN: ")" >
}

// here the definition of an INT depends on the definition of DIGIT.
// DIGIT is prefixed with "#", which means that it cannot be referenced
// from the parser and can only be part of another token (similar to
// the protected lexer rules in antlr).

TOKEN :
{
< INT: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
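To check which strings the INT token accepts, the same shape can be written as a regular expression in plain Java. This is only a sketch for experimenting outside JavaCC; the class name IntTokenSketch and the method isInt are made up for illustration, and java.util.regex is used in place of the JavaCC token manager.

```java
import java.util.regex.Pattern;

// Illustrative stand-in for < INT: ( <DIGIT> )+ > with < #DIGIT: ["0" - "9"] >.
final class IntTokenSketch {
    // [0-9] plays the role of DIGIT, the trailing + the role of ( ... )+
    private static final Pattern INT = Pattern.compile("[0-9]+");

    /** Returns true if the whole text has the form of an INT token. */
    static boolean isInt(String text) {
        return INT.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInt("007")); // prints true
        System.out.println(isInt("-5"));  // prints false
        System.out.println(isInt("1.5")); // prints false
    }
}
```

A sign or a decimal point is not part of INT, so an input such as -5 is not a single token for this grammar.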

Write the Parser rules:

float expr() :
{
   float f = 0;
}
{
    sum()
    ( <EOL> )*
    <EOF>
}

float sum() :
{
   float f = 0;
   float f2 = 0;
}
{
   term()
   (
     (   <PLUS> term() | <MINUS> term())
   )*
}

float term() :
{
   float f = 0;
   float f2 = 0;
}
{
   element()
   (
     ( <MULTIPLY> element() | <DIVIDE> element() )
   )*
}

float element() :
{
   float f = 0;
   Token t;
   Float x;
}
{
   ( <INT> | <LPAREN> f=sum() <RPAREN> )
}

Notice that references to other rules look like function calls, and references to tokens are enclosed in <>.

Run JavaCC:

javacc calc.jj

Compile and run the Calc example:

javac *.java

java Calc

The complete grammar file, with the actions inserted, is listed below:

options
{
   LOOKAHEAD=1;
}

PARSER_BEGIN(Calc)

public class Calc
{

    public static void main(String args[])
    {
       try
       {
          Calc parser = new Calc(System.in);
          System.out.print("Enter Expression: ");
          System.out.flush();
          float res = parser.expr();
          System.out.println("Result: " + res);
       }
       catch (ParseException e)
       {
          System.out.println("Exception: " + e);
       }
    }
}

PARSER_END(Calc)

SKIP :
{
" "
| "\r"
| "\t"
}

TOKEN :
{
< EOL: "\n" >
}

TOKEN :
{
< PLUS: "+" >
| < MINUS: "-" >
| < MULTIPLY: "*" >
| < DIVIDE: "/" >
| < LPAREN: "(" >
| < RPAREN: ")" >
}

TOKEN :
{
< INT: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}

float expr() :
{
float f = 0;
}
{
    (
       f=sum()
       ( <EOL> )*
       <EOF>
    )

    {
       return f;
    }
}

float sum() :
{
float f = 0;
float f2 = 0;
}
{
    f=term()
    (
       (
            <PLUS> f2=term()   { f += f2; }
         | <MINUS> f2=term()   { f -= f2; }
       )
    )*

    {
       return f;
    }
}

float term() :
{
float f = 0;
float f2 = 0;
}
{
    f=element()
    (
       (
            <MULTIPLY> f2=element() { f *= f2; }
          |   <DIVIDE> f2=element() { f /= f2; }
       )
    )*

    {
        return f;
    }
}

float element() :
{
float f = 0;
Token t;
Float x;
}
{
   (
      t=<INT> {
                x= new Float(t.image);
                f= x.floatValue();
              }

      |
          <LPAREN> f=sum() <RPAREN>
   )

   {
      return f;
   }
}