Why Use ANTLR?

Introduction

This Web page discussions parser generators and the ANTLR parser generator in particular. See Compiler Front End and Infrastructure Software for compiler construction software suites.

If you just stumbled on this Web page you may be wondering what a parser generator is. I should warn you that this is probably not the best place to learn the answer. There are a number of excellent books that discuss parser generators and compiler construction. These include the all time classic compiler text (the so called "red dragon book") Compilers: Principles, Techniques and Tools by Aho, Sethi and Ullman, Addison Wesley, 1986 and Advanced Compiler Design and Implementation by Muchnick, Morgan Kaufmann, 1997. Briefly, a parser generator is a program that reads a grammar that describes a language and outputs source code (in language like C, C++, or Java) that will recognize the language described by the grammar. The most commonly used parser generator is the UNIX program YACC (the Free Software Foundation program Bison is a GNU version of YACC). Both UNIX and Windows NT versions of Bison are available from Cygnus.

A low key, largely non-technical introduction to compiler constructon has been written by Jack Crenshaw, who currently writes an outstanding column for Embedded Systems.

Another longer and more academic Web based reference is the book Compilers and Compiler Generators: an introduction with C++ by P.D. Terry, Rhodes University, 1996. This web pages publishes P.D. Terry's book in a variety of formats, including postscript and Adobe pdf format.

ANTLR is a second generation parser generator. The first generation was PCCTS. Both parsers were designed and implemented by Terence Parr. Various people have helped over the years. See the ANTLR and PCCTS source for the complete credits.

Although ANTLR is implemented in Java, it will generate either a Java or a C++ parser. I sometimes get e-mail asking which parser should be used: ANTLR or PCCTS. I don't have enough experience with PCCTS to comment on this. My view is that ANTLR represents the current main stream of development. Also, I would assume that the lessons Terence learned implemented PCCTS were incorporated into ANTLR. But there is an active PCCTS user base, so I am sure there are other, probably better informed view on this.

Although there is some brief introductory material, this Web page is written for people who already have some knowledge of compiler and language processor construction.

The ANTLR documentation and software can be found at www.antlr.org. ANTLR is a available in source form, with only the restriction being that the authors are credited. ANTLR evolved from PCCTS (the Purdue Compiler Construction Tool Set), which was developed by Terance Parr as part of his PhD thesis work. ANTLR is a rewrite of PCCTS. A history of ANTLR can be found here.

Automaticly Generated Parsers vs. Hand Crafted Parsers

The first compiler I worked on was the UCSD Pascal compiler. This compiler was based on Niklaus Wirth's Pascal compiler written at ETH in Switzerland. At the time, parser generators were in their infancy. Parsers were written by hand. The Pascal parser was a recursive decent parser, whose structure mirrored the Pascal syntax.

The technique for creating this kind of recursive decent parser was to carefully develop the BNF syntax for the language to be parsed and then write the parser so that it mirrored this syntax. Any change in the grammar had to be mirrored with a change in the software. This could be time consumming and any mistakes in the parser software would result in incorrect parsing.

Parser generators allow the programmer to work at a higher level of abstraction. Creating the language parser involves writing the grammar, instead of directly writing the software that parses the grammar. Changes that might take a few hours to implement in a hand crafted recursive decent parser can be made in a few minutes when a parser generator is used. A few hundred lines of grammar is also much easier to understand than a few thousand lines of hand crafted parser. Two arguments have classicly been used against automaticly created parsers:

  1. Automaticly generated parsers are slow and use lots of memory.

  2. Error reporting is poor in automaticly generated parsers.

Most language processors spend a fraction of their time parsing. For example, most compilers spend more time in optimization and code generation than they do in language parsing. On a modern computer system the time taken by the parser will probably be dominated by disk access, not parser execution. Nor is the issue of memory usage an argument against automaticly generated parsers. On a computer system with 128 Mb of RAM and perhaps 256 Mb of Virtual memory, the amount of memory used by the parser will usually be insignificant.

Error reporting is a problem, especially for YACC/Bison. ANTLR generates a recursive decent parser from the grammar and has fairly good error reporting (this is discussed at greater length below). Of course "good" is a matter of opinion.

Although I started my software career writing recursive decent parsers and have written hand crafted parsers for a variety of lanuages, I now greatly prefer automaticly generated parsers.

Since we're on the topic of YACC, a minor digression:

Another problem with YACC are the obscure shift-reduce and reduce-reduce conflicts. Terance Parr writes that PCCTS, the predecessor to ANTLR, was created in response to his horror of having to debug YACC grammars. If you do have to debug YACC grammars, Reiner Pesch has written a tool that might help. I have not tried it out and have included this link primarily because it might be of help in the future, since I am sometimes forced to use YACC on projects at work. Reiner writes, in comp.compilers (February 12, 2001):

Due to my diploma thesis in computer science I have implemented a tool which intends to assist developers using LALR parser generators like yacc with the analysis of shift/reduce- and reduce/reduce-conflicts. (The tool currently works with yacc, byacc, bison and happy).

So if you're a user of any LALR parser generator mentioned above, you could do me (and hopefully yourself) a favour by taking a look at ...

http://www-users.rwth-aachen.de/reiner.pesch/yca/yca.html.

The yca.html Web page publishes a tool called YCA (YACC Conflict Analyser[sic]). YACC/Bison grammar conflicts can take a lot of time to understand so any tool that helps is worth looking into.

Parser Generator Options

Lookahead

The term lookahead refers to the number of lexical tokens that a parser looks at, beyond the current one, to determine the matching grammar production. Parser generators have a long history and they originally generated parsers to run on machines that, by modern standards, had very little memory (e.g., 129K bytes to 4Mb). Since token lookahead tends to increase parser memory use and make the parser slower, most parser generators created parsers that supported only one token lookahead. This is denoted by a number next to the grammar type (for example, LALR(1) is an LALR grammar with one token lookahead). Several more recent parser generators (ANTLR and JavaCC) can create parsers that can selectively lookahead more than one token (e.g., LL(k)).

Creating a grammar for a complex language is a lot of work. It is easy to accidently create a grammar with ambiguities, where the parser cannot tell two or more productions apart. Parser generators will report errors or warnings when ambiguous grammars are encountered. However, with every parser generator I've ever used, these messages are difficult to interpret. You must pour over the grammar considering how the grammar productions collided to create ambiguity. At first glance, the grammar writer may think that an LL(k) grammar will allow sloppier grammars. This is not actually the case. A grammar that contains ambiguity can actually be more ambiguous with k lookahead. Having k lookahead does allow some productions to be more clearly recognized, however. The most common example of this is the else clause in C, C++ and Java which can be clearly specified in a grammar production with selective k lookahead. This allows the grammar to be somewhat smaller. A parser with k lookahead can also recognize a wider range of languages than a parser with only one token lookahead. But these languages must still be unambiguously specified. So k lookahead provides power and flexibility, but I have not found that it greatly decreased the work necessary to create a grammar.

Open Source or Free (no license fee) Parser Generators

Until YACC became part of the UNIX software tool set, parser generators existed only in academia and at a few companies that kept the tools to themselves. Yacc is now a quarter century old. The original paper on yacc was written in July, 1975. An HTML version can be found here.

There are now a remarkable variety of parser generators available in addition to ANTLR. These parser generators include open source software, free binaries and commercial products. Originally this list came from a google search. I have updated it with information from comp.compilers. This list is getting long enough that I should probably break it out into its own web page.

Commercial parser generators

Tree Parsers

Related to parser generators are tree parsers. Tree parsers read in a grammar that describes an abstract syntax tree that is generated by a compiler front end or compiler middle pass and generates a tree walker that processes the trees. ANTLR can be used to generate tree parsers. Another tree parser generator is iburg, developed by C. W. Fraser, D. R. Hanson, and T. A. Proebsting at Princeton and used to construct the code generator for Fraser and Hanson's lcc ANSI C compiler.

State Machine Generators

State machine generators are closely related to the deterministic finite automata (DFA) generated by scanner generators like the one built into ANTLR. However, scanner generators are aimed at language applications and are not well suited for state machine generation outside this area. The Libero state machine generator from iMatix supports state machine generation for a variety of applications.

The input to Libero is a state machine diagram expressed in a textural language. The output can be in a variety of languages ranging from Java and C++ to COBOL. The Libero tool is provided as open source software under the Free Software foundation GPL. The code generated by Libero can be used without restriction.

Other Lists of Parser Generators

Why did I become interested in ANTLR?

I have used YACC (which is more or less interchangeable with Bison) for many years. YACC has several notable problems however:

I am planning to use the parser generator for a production quality compiler that I hope may be a commercial product one day where I might license the compiler source. So I had some other requirements:

Given the issues listed above, ANTLR is the most attractive parser generator I have looked at:

Although ANTLR has some compelling advantages, it has some disadvantages as well:

Afterward

I did use ANTLR to generate a Java parser based on the John Mitchell, Terence Parr, John Lilley and Scott Stanchfield grammar. This is a very nice grammar and represents a lot of work on the author's part. I would like to publicly thank them for putting this grammar in the public domain. I am very happy with ANTLR. The parser is now complete and consists of about 5K lines of grammar, C++ actions and comments.

I do not use ANTLR tree generation. In part the reason for this is that I generate a C++ parser and C++ does not have an "interface" construct, like Java has. The Java parser defines the tree via a Java interface, which allows the Java parser user to define the actual implementation of the tree in any number of ways. For the C++ user however, the tree base class must be used and this had more features (like reference count) than I wanted. So I generate my own light weight trees using C++ actions.

Tree node definitions aside, there is another reason to generate AST through C++ or Java actions. The AST generated by the parser represents the core information for the statement or expression. As a result, tree shape is very important and a lot of work can go into generating trees that will make later processing easier. If trees are generated directly from ANTLR, tree shape tends to be controlled by modifying the grammar. This can fragment rules into several sub-productions. A grammar for a language like Java is already large. Fragmenting rules may make it more difficult to maintain. Direct generation of trees gives allows more freedom than generating trees using ANTLR. The drawback is that it is also more work.


Ian Kaplan, July 19, 1999
Revised most recently on: February 13, 2001


back to main ANTLR page