An Unsuccessful Attempt: Using An Antlr Tree Parser with Non-ANTLR Trees

This note involves ANTLR's C++ "code generation". Although I know that some academic compilers have been written in Java, I am not convinced that Java is an ideal language for implementing a modern optimizing compiler. In most cases Java has the performance problems associated with an interpreted language. Also, although Java is a safer language than C++, it does not provide the power or direct interface with POSIX that C++ provides.

One way to build a compiler is to have the parser, in this case created by ANTLR, create a syntaticly correct abstract syntax tree (AST) that reflects the input source. If there are syntax errors found by the parser (see Recognizing ANTLR Parser Errors) the compiler terminates.

If the parse phase completes without finding a syntactic error, the next phase of the compiler resolves type, variable and named constant definitions. The name resolution phase walks the abstract syntax tree and associates a symbol table pointer with every identifier in the statement AST and associates types with most of the non-terminal AST nodes (e.g., operators like plus, times, etc...) After all names are resolved the declaration trees are removed from the AST.

A semantic analysis phase follows the name resolution phase. The semantic analysis phase checks semantic rules like assignment compatibility.

Following the semantic analysis phase there is a constant folding phase (e.g., fold subtrees like 3 + 4 into the constant 7).

A tree transformation phase hoists expressions out of function calls and performs other transformations on the trees.

Each of these phases is based on a tree walker. Usually these tree walkers are written by hand. A better approach might be to use ANTLR to create a tree parser for each phase. Here a grammar and a set of actions would be defined for each phase. Using tree grammars should make the phases easier to understand and make them easier to maintain. The tree parser is also more likely to be correct since it is based on a conflict free grammar that has been automaticly checked by ANTLR.

The AST used in a compiler is one of the central data structures of the compiler. So I want to be able to exactly define the structure of the AST. I also want to control the allocation and deallocation of the AST, so I don't want to use the default new and delete operators. As a result of these concerns I have not used ANTLR's automatic AST generation. Instead I have generated trees from the actions associated with the ANTLR grammar productions.

In theory using a custom tree should not prohibit using an ANTLR tree parser. The ANTLR documentation on Tree Parsers states:

ANTLR tree parsers can walk any tree that implements the AST interface, which imposes a child-sibling like structure to whatever tree data-structure you may have. The important navigation methods are:

Each AST node is considered to have a list of children, some text, and a "token type". Trees are self-similar in that a tree node is also a tree. An AST is defined completely as:


public interface AST {
        /** Get the token type for this node */
        public int getType();
        /** Set the token type for this node */
        public void setType(int ttype);
        /** Get the token text for this node */
        public String getText();
        /** Set the token text for this node */
        public void setText(String text);
        /** Get the first child of this node; null if no children */
        public AST getFirstChild();
        /** Set the first child of a node. */
        public void setFirstChild(AST c);
        /** Get the next sibling in line after this one */
        public AST getNextSibling();
        /** Set the next sibling after this one. */
        public void setNextSibling(AST n);
        /** Add a (rightmost) child to this node */
        public void addChild(AST c);
}
    

Note that the above definition is a Java interface definition. An interface my be implemented by any class that satisifies the interface. C++ has no equivalent to the interface defintion.

But ignoring the differences between Java and C++ the above seems to imply that if the abstract syntax tree class is named AST, the above class functions are provided and the ANTLR tree structure is used (e.g., the child and sibling pointers), ANTLR's tree parser generator can be used to walk any tree. To better understand this I wrote a small assignment expression tree walker. The grammar is shown below.


options {
        language="Cpp";
}


class ExprTreeParser extends TreeParser;

options {
        importVocab=TreeNode;
}


exprlist
   : #( N_EXPR_LIST ( assign )+ )
   ;

assign
   : #( N_ASSIGN N_ID expr )
   | term
   ;

expr
   : unary_expr
   | binary_expr
   ;

unary_expr
   : #( N_UMINUS term )
   ;

binary_expr
   : #( N_TIMES term term )
   | #( N_DIVIDE term term )
   | #( N_MOD term term )
   | #( N_PLUS term term )
   | #( N_MINUS term term )
   ;

term
   : N_ID
   | N_CONST
   | N_NULL
   ;

Note that this grammar imports a token vocabulary named TreeNode. Normally ANTLR token vocabularies are generated from an ANTLR grammar. If automatic AST generation is selected, this grammar defines the nodes in the AST trees. If these ANTLR trees are walked by an ANTLR tree parser this grammar would be imported. In this case, however, the trees are being generated by grammar productions and the nodes are defined by the code to support tree creation. Since the trees are generated "by hand" a token vocabulary file must be created for the tokens associated with the tree nodes. An example, for the expression grammar, is shown below:

 
//
// file TreeNodeTokenTypes.txt
//
TreeNode      // Expression tree grammar tokens (created by hand)
N_ASSIGN = 5
N_CONST = 6
N_DIVIDE = 7
N_EXPR_LIST = 8
N_ICON = 9
N_ID = 10
N_MINUS = 11
N_MOD = 12
N_PLUS = 13
N_TIMES = 14
N_UMINUS = 15

At this point it becomes clear that ignoring the differences between Java and C++ is not a good idea. Java would allow an object like class node to be defined that implements the AST interface. There is no way to do this in C++.

ANTLR will generate a tree parser from the above grammar and token vocabulary. This parser will compile properly, but it will only parse abstract syntax trees composed of ANTLR AST nodes. At least in the case of C++, it will not parse arbitrary trees. In the C++ generated parser the type of the node is not abstracted out. For example, as the start of the ANTLR generated parser shows, the Token.hpp and AST.hpp file are included. Token.hpp defines the RefAST type, which is ANTLR's reference counted node type.

/*
 * ANTLR-generated file resulting from grammar expr_tree.g
 * 
 * Terence Parr, MageLang Institute
 * with John Lilley, Empathy Software
 * ANTLR Version 2.6.1; 1996-1999
 */

#include "ExprTreeParser.hpp"
#include "antlr/Token.hpp"
#include "antlr/AST.hpp"
...
...

So currently it appears that ANTLR cannot be used to create C++ tree parsers for trees that are not at least derived from the ANTLR AST type. In fact it is very difficult, if not impossible to divorce ANTLR tree construction and parsing from the ANTLR type structure, at lease in the case of the C++ parser.

With enough work there may be a way around this. It may, for example, be possible to completely redefine the AST class and the other supporting classes. But since the stock ANTLR code must be linked into the compiler as well, the names will collide. Perhaps the C++ name space feature can be used to avoid name collision. But in the end, as elegant as it would be to define middle passes with tree grammars, it just seems easier to write these passes by hand.

Ian Kaplan, September 21, 1999


back to ANTLR table of contents