Java Symbol Table Design Issues

Many Java language processors do not read Java source. Instead they read the Java class file and build the symbol table and abstract syntax tree from it. The Java represented in a class file is already syntactically and semantically correct. As a result, the authors of these tools avoid the considerable difficulty involved in implementing a Java front end.

The designers of the Java programming language did not have ease of implementation in mind when they designed the language. This is as it should be, since ease of use in the language is more important. One of the difficulties encountered in designing a Java front end that does semantic analysis is symbol table design. This web page provides a somewhat rambling discussion of the issues involved in the design of a Java symbol table.

The front end phase of a compiler is responsible for:

Parsing the source language to recognize correct programs and report syntax errors for incorrect language constructs. In the case of the BPI Java front end, this is done by a parser generated with the ANTLR parser generator. The output of the parser is an abstract syntax tree (AST) which includes all declarations that were in the source.
Reading declaration information in Java class files and, for a native Java compiler, building ASTs from the byte code stream. This also involves following the transitive closure of the classes required to define the root class. (Def: transitive closure - All the nodes in a graph that are reachable from the root. In this case the graph is the tree of classes that are needed to define all the classes read by the compiler).
Processing the declarations in the AST and class files to build the symbol table. Once they are processed the declarations are pruned from the AST.

The output of the front end is a syntactically and semantically correct AST where each node has a pointer to either an identifier (if it is a leaf) or a class type (if it is a non-terminal or a type reference like MyType.class).

The term "symbol table" is generic and usually refers to a data structure that is much more complex than a table (e.g., an array of structures). While symbols and types are being resolved, the symbol table must reflect the current scope of the AST being processed. For example, in the C code fragment below there are three variables named "x", all in different scopes.


  static char x;

  int foo() {
     int x;

     {
        float x;
     }
  }

Resolving symbols and types requires traversing the AST to process the various declarations. As the traversal moves through scope in the AST, the symbol table reflects current scope, so that when the symbol for "x" is looked up, the symbol in the current scope will be returned.

The scoped structure of the symbol table is only important while symbols and types are being resolved. After names are resolved, the association between a name in the AST and its symbol can be found directly via a pointer.

Compilers for languages like Pascal and C, which have simple hierarchical scope, frequently use symbol tables that directly mirror the language scope. There is a symbol table for every scope. Each symbol table has a pointer to its parent scope. At the root of the symbol table hierarchy is the global symbol table, which contains global symbols and functions (or, in the case of Pascal, procedures). When a function scope is entered, a function symbol table is created. The function symbol table parent pointer points to the next scope "upward" in the hierarchy (either the global symbol table, or in the case of Pascal, an enclosing procedure or function). A block symbol table would point to its parent, which would be a function symbol table. Symbol search traverses upward, starting with the local scope and moving toward the global scope.
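The parent-pointer scheme described above can be sketched in Java as follows. This is a minimal illustration, not the BPI implementation: the class name ScopeTable, the scope labels and the use of a plain string for the symbol's type are all assumptions made for the example. It models the three declarations of "x" from the C fragment.

```java
import java.util.HashMap;
import java.util.Map;

/** One symbol table per scope; each table points to its enclosing scope. */
class ScopeTable {
    private final String name;        // label for the scope, e.g. "global"
    private final ScopeTable parent;  // null for the global scope
    private final Map<String, String> symbols = new HashMap<>(); // identifier -> type

    ScopeTable(String name, ScopeTable parent) {
        this.name = name;
        this.parent = parent;
    }

    void declare(String identifier, String type) {
        symbols.put(identifier, type);
    }

    /** Search the local table first, then each enclosing scope up to the global one. */
    String lookup(String identifier) {
        for (ScopeTable scope = this; scope != null; scope = scope.parent) {
            String type = scope.symbols.get(identifier);
            if (type != null) {
                return type + " (declared in " + scope.name + ")";
            }
        }
        return null; // not found in any scope: the symbol does not exist
    }
}

class ScopeTableDemo {
    public static void main(String[] args) {
        ScopeTable global = new ScopeTable("global", null);
        global.declare("x", "char");
        ScopeTable foo = new ScopeTable("foo", global);
        foo.declare("x", "int");
        ScopeTable block = new ScopeTable("block", foo);
        block.declare("x", "float");

        System.out.println(block.lookup("x"));  // float (declared in block)
        System.out.println(foo.lookup("x"));    // int (declared in foo)
        System.out.println(global.lookup("x")); // char (declared in global)
    }
}
```

A lookup that starts in the innermost block never reaches the outer tables once a match is found, which is how the innermost declaration hides the others.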

The scope hierarchy is not needed once symbols and types have been resolved. However, the local scope for a method or a class remains important, and the symbol tables for these local scopes must remain accessible to allow the compiler to iterate over all symbols in a given scope. For example, to generate code to allocate a stack frame when a method is called, the compiler must be able to find all the variables associated with the method. A Java compiler must be able to keep track of the members of a class, since these variables will be allocated in garbage collected memory.

Scope for most object oriented languages is more complicated than the scope for procedural languages like C and Pascal. C++ supports multiple inheritance and Java supports multiple interface definitions (multiple inheritance done right). The symbol table must also be efficient so compiler performance is not hurt by symbol table lookup in the front end. Symbol table design considerations for a Java compiler include:

Java has a large global scope, since all classes and packages are imported into the global name space. Global symbols must be stored in a high capacity data structure that supports fast (O(1) on average) lookup (a hash table, for example).
Java has lots of local scopes (classes, methods and blocks) that have relatively few symbols (compared to the global scope). Data structures that support fast high capacity lookup tend to introduce overhead (in either memory use or code complexity). This is overkill for the local scope. The symbol table for the local scopes should be implemented with a data structure that is simple and relatively fast (e.g., O(log₂ n)). Examples include balanced binary trees and skip lists.
The symbol table must be able to support multiple definitions for a name within a given scope. The symbol table must also help the compiler resolve the error cases where the same kind of symbol (e.g., a method) is declared more than once in a given scope.
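The three considerations above can be sketched with standard library containers. This is only an illustration under stated assumptions: java.util.HashMap stands in for the high capacity global table, java.util.TreeMap (a red-black tree) stands in for the small local table, and each entry holds a list so that a name can have more than one definition. The class name TwoTierTables and the string descriptions of definitions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoTierTables {
    // Large global scope: hash table with expected constant-time lookup.
    final Map<String, List<String>> global = new HashMap<>(4096);
    // Small local scope: TreeMap is a red-black tree with O(log n) lookup.
    final Map<String, List<String>> local = new TreeMap<>();

    /** A name may have several definitions, so each entry holds a list. */
    static void define(Map<String, List<String>> table, String name, String definition) {
        table.computeIfAbsent(name, k -> new ArrayList<>()).add(definition);
    }

    /** Local scope first, then global; returns an empty list if the name is undefined. */
    List<String> lookup(String name) {
        List<String> found = local.get(name);
        if (found == null) {
            found = global.get(name);
        }
        return found == null ? Collections.<String>emptyList() : found;
    }
}
```

A caller that gets back more than one definition still has to decide, from context, which one the source text means.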

In C names within a given scope must be unique. For example, in C a type named MyType and a function named MyType are not allowed. In Java names in a given scope are not required to be unique. Names are resolved by context. For example:
```
        class Rose {
          Rose( int val ) { juliette = val; }
          public int juliette;
        } // Rose

        class Venice {
          void thorn() {
            garden = new Rose( 42 );   // constructor of the type Rose
            Rose( 86 );                // call to the method Rose
            this.Rose( 94 );           // the same method, called through this
          }

          Rose Rose( int val ) { garden.juliette = val; return garden; }

          Rose garden;
        } // Venice
```
In this example there is a type named Rose, a Rose constructor, and a method named Rose that returns an object of type Rose. The compiler must know by context which is which. Also, note that the references to the Rose method and the garden field are references to members declared later in the class.

Most of the symbol scope in Java can be described by a simple hierarchy where a lower scope points to the next scope up. The exception is the interface list that may be associated with a Java class. Note that interfaces may also inherit from super interfaces. The scopes in Java are outlined below:

    
     Global (objects imported via import statements)
        Parent Interface (this may be a list)
          Interface (there may be a list of interfaces)
             Parent class
               Class
                 Method
                   Block

The symbol table and the semantic analysis code that checks the Java AST returned by the parser must be able to resolve whether a symbol definition is semantically correct. The presence of multiple definitions for a given name (e.g., multiple definitions of a class member) is allowed. However, ambiguous symbol use is not allowed:

Java Language Specification (JLS) 8.3.3.3

A class may inherit two or more fields with the same name, either from two interfaces or from its superclass and an interface. A compile-time error occurs on any attempt to refer to any ambiguously inherited field by its simple name. A qualified name or field access expression that contains the keyword super (15.10.2) may be used to access such fields unambiguously.

Both a parent class and an interface place symbols defined in the class or interface in the local scope. In the example below the symbol x is defined in both bar and fu. This is allowed, since x is not referenced in the class DoD.

interface bar {
  int x = 42;
}

class fu {
  double x;
}


class DoD extends fu implements bar {
  int y;  // No error, since there is no local reference to x
}

If x is referenced in the class DoD, the compiler must report an error, since the reference to x is ambiguous.

class DoD extends fu implements bar {
  int y; 

  DoD() {
    y = x + 1;   // Error, since the reference to x is ambiguous
  }
}

Similar name ambiguity can exist with inner classes defined in an interface and a parent class:

interface BuildEmpire
{
  class KhubilaiKahn {
    public int a, b, c;
  }
}

class GengisKahn
{
  class KhubilaiKahn {
    public double x, y, z;
  }
}


class mongol extends GengisKahn implements BuildEmpire
{
  void mondo() {
    KhubilaiKahn TheKahn;  // Ambiguous reference to class KhubilaiKahn
  }
}

Java does not support multiple inheritance in the class hierarchy, but Java does allow a class to implement multiple interfaces or an interface to extend multiple interfaces.

Java Language Specification (JLS) 9.3

It is possible for an interface to inherit more than one field with the same name (8.3.3.3). Such a situation does not in itself cause a compile-time error. However, any attempt within the body of the interface to refer to either field by its simple name will result in a compile-time error, because such a reference is ambiguous.

For example, in the code below key is ambiguous.


interface Maryland
{
  String key = "General William Odom";
}

interface ProcurementOffice
{
  String key = "Admiral Bobby Inman";
}


interface NoSuchAgency extends Maryland, ProcurementOffice
{
  String RealKey = key + "42"; // ambiguous reference to key
}

When the semantic analysis phase looks up the symbol key the symbol table must allow the semantic checking code to determine that there are two member definitions for key. The symbol table must only group like symbols in the same scope together (e.g., members with members and types with types). Unlike symbols (methods, classes and member variables) are not grouped together because they are distinguished by context.
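The grouping rule can be illustrated with a small sketch. It assumes a per-scope container keyed first by identifier and then by symbol kind; the names KindedScope and Kind are invented for this example and the definitions are reduced to the name of the declaring type. Like symbols are collected together, so two member definitions of key are reported as ambiguous, while a definition of a different kind with the same name does not clash.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KindedScope {
    enum Kind { MEMBER, METHOD, TYPE }

    // identifier -> (kind -> names of the declaring types)
    private final Map<String, Map<Kind, List<String>>> entries = new HashMap<>();

    void define(String identifier, Kind kind, String declaredIn) {
        entries.computeIfAbsent(identifier, k -> new EnumMap<>(Kind.class))
               .computeIfAbsent(kind, k -> new ArrayList<>())
               .add(declaredIn);
    }

    /** All definitions of the identifier that have the requested kind. */
    List<String> lookup(String identifier, Kind kind) {
        Map<Kind, List<String>> byKind = entries.get(identifier);
        if (byKind == null) {
            return new ArrayList<>();
        }
        return byKind.getOrDefault(kind, new ArrayList<>());
    }

    /** A reference is ambiguous when more than one like symbol is visible. */
    boolean isAmbiguous(String identifier, Kind kind) {
        return lookup(identifier, kind).size() > 1;
    }
}

class KindedScopeDemo {
    public static void main(String[] args) {
        KindedScope scope = new KindedScope();   // scope of interface NoSuchAgency
        scope.define("key", KindedScope.Kind.MEMBER, "Maryland");
        scope.define("key", KindedScope.Kind.MEMBER, "ProcurementOffice");
        System.out.println(scope.isAmbiguous("key", KindedScope.Kind.MEMBER)); // true
        System.out.println(scope.lookup("key", KindedScope.Kind.MEMBER));      // [Maryland, ProcurementOffice]
    }
}
```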

Multiple definitions of a method do not cause a semantic error in Java, since there is no multiple inheritance. If a method of the same name is inherited from two interfaces, for example, the method must either be the same or must define an overloaded version of the method. If there is a local method with the same name and arguments (e.g., same type signature) as a method defined in a parent class, the local method will be in a "lower" scope and will override the definition of the parent.

Design of a Java Symbol Table

Symbol table requirements

Taking into account the issues discussed above, a Java symbol table must fulfill the following requirements:

Support for multiple definitions for a given identifier.
Fast lookup (O(1) on average) for a large global (e.g., package level) symbol base.
Relatively fast lookup (O(log₂ n)) for local symbols (e.g., local to a class, method or block)
Support for Java hierarchical scope
Searchable by symbol type (e.g., member, method, class).
Quickly determine whether a symbol definition is ambiguous.

Symbol lifetime

Languages like C can be compiled one function at a time. The global symbol table must retain the symbol information for the functions defined in the current file, along with their arguments. But other local symbol information can be discarded after the function is compiled. When the compiler has processed all the functions in a given .c file (and its referenced include files), all symbols can be discarded.

C++ can be compiled in a similar fashion. Class definitions are placed in header files (e.g., .h files) that are included by each source file (e.g., a .C or .cpp file) that references an object of the class. When the file has been processed, all symbols can be discarded.

Java is more complicated. The Java compiler must read the Java symbol definitions for the class tree that is needed to define all classes referenced by the current class being compiled (the transitive closure of all the class hierarchy). In the case of the object containing the main method, this includes all classes referenced in the program.
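Computing this transitive closure is a standard worklist traversal. The sketch below assumes the direct references of each class are already available as a map from class name to referenced class names (in a real compiler they would be discovered while reading each class file or source file); the class name ClassClosure and the method name closure are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ClassClosure {
    /**
     * Returns every class reachable from root, given a map from a class name
     * to the class names it references directly.
     */
    static Set<String> closure(String root, Map<String, List<String>> references) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(root);
        while (!work.isEmpty()) {
            String current = work.pop();
            if (!seen.add(current)) {
                continue; // already visited; this also terminates cycles
            }
            List<String> refs = references.getOrDefault(current, Collections.<String>emptyList());
            for (String ref : refs) {
                if (!seen.contains(ref)) {
                    work.push(ref);
                }
            }
        }
        return seen;
    }
}
```

The visited set matters because Java classes routinely refer to each other in cycles, so the reference graph is not strictly a tree.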

In theory Java symbols could be discarded once all of the classes that reference them are compiled. In practice this is probably more trouble than it is worth on a modern computer system with lots of memory. So Java symbols live throughout the compile.

Building Symbol Table Scope

Hierarchical scope in the symbol table only needs to be available during the semantic analysis phase. After this phase, all symbols (identifier nodes) will point to the correct symbol. However, once scope is built, it is left in place.

Each local scope (e.g., block, method or class) has a local symbol table which points to the symbol table in the enclosing scope. At the root of the hierarchy is the global symbol table containing all global classes and imported symbols. During semantic analysis symbol search starts with the local symbol table and searches upward in the hierarchy, searching each symbol table until the global symbol table is searched. If the global symbol table is searched and the symbol is not present, the symbol does not exist.

Java scope is not a simple hierarchy composed of unique symbols, as is the case with C. There may be multiple definitions for a symbol (e.g., a class member, a method and a class name). The symbols at a given scope level may come from more than one source. For example, in the Java code below the class gin and the interface tonic define symbols at the same level of hierarchy.

  interface tonic {
    int water = 1;
    int quinine = 2;
    int sugar = 3;
    int TheSame = 4;
  }
  
  class gin {
    public int water, alcohol, juniper;
    public float TheSame;
  }
  
  class g_and_t extends gin implements tonic {
    class contextName {
      public int x, y, z;
    } // contextName
  
    public int contextName( int x ) { return x; }
    public contextName contextName;
  }

Scope and Local Variables and Arguments

Local variables in Java are variables in methods. These variables are allocated in a stack frame and have a "lifetime" that lasts as long as the method is active. A method may also have local scope created by blocks or statements. For example:

      class bogus {
        public void foobar() {
          int a, b, c;

          { // this is a scope block
            int x, y, z;
          }
        }
      }

Unlike C and C++, Java does not allow a local variable to be redeclared in a nested scope:

JLS 14.3.2

If a declaration of an identifier as a local variable appears within the scope of a parameter or local variable of the same name, a compile-time error occurs. Thus the following example does not compile:

      class Test {
        public static void main( String[] args ) {
          int i;

          for (int i = 0; i < 10; i++)  // Error: local variable redeclared
            System.out.println(i);
        }
      }

A local variable is allowed to redefine a class member. This makes variable redefinition a semantic check in the semantic analysis phase.
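A sketch of this check, under the assumption that each block in a method gets its own small scope object linked to the enclosing block: a declaration is rejected if the name is already a parameter or local in this block or any enclosing block of the same method. Class member tables are deliberately not consulted, which is why hiding a member is accepted. The class name LocalScopeCheck is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class LocalScopeCheck {
    private final LocalScopeCheck enclosing;  // null at the method's outermost scope
    private final Set<String> locals = new HashSet<>();

    LocalScopeCheck(LocalScopeCheck enclosing) {
        this.enclosing = enclosing;
    }

    /**
     * Declares a parameter or local variable. Returns false (a semantic error)
     * if the name is already declared in this block or an enclosing block.
     */
    boolean declare(String name) {
        for (LocalScopeCheck s = this; s != null; s = s.enclosing) {
            if (s.locals.contains(name)) {
                return false;
            }
        }
        locals.add(name);
        return true;
    }
}

class LocalScopeCheckDemo {
    public static void main(String[] args) {
        LocalScopeCheck mainScope = new LocalScopeCheck(null);
        mainScope.declare("args");
        mainScope.declare("i");
        LocalScopeCheck forScope = new LocalScopeCheck(mainScope);
        System.out.println(forScope.declare("i")); // false: i is already a local
    }
}
```

Two sibling blocks may each declare the same name, since neither encloses the other.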

Forward reference of symbols

A forward reference is a reference to a symbol that is defined textually later in the code.

When a class field is initialized, any field referenced by simple name in its initializer must have been declared earlier in the class. The following example (from JLS 6.3) results in a compile-time error:

  class Test {
    int i = j;  // compile-time error: incorrect forward reference
    int j = 1;
  }

Nor is forward reference allowed for local variables. For example:

  class geomancy {
    public float circleArea( float r ) {
      float area;
  
      area = pie * r * r;     // undefined variable 'pie'
      float pie = (float)Math.PI;
  
      return area;
    }
  }

However, forward reference is allowed from a local scope (e.g., a method) to a class member defined in the enclosing class. For example, in the Java below the method getHexChar makes a forward reference to the class member hexTab:


class HexStuff {

  public char getHexChar( byte digit ) {

    digit = (byte)(digit & 0xf);
    char ch = hexTab[digit];  // legal forward reference to class member

    return ch;
  } // getHexChar

  private static char hexTab[] = new char[] { '0', '1', '2', '3',
                                              '4', '5', '6', '7',
                                              '8', '9', 'a', 'b',
                                              'c', 'd', 'e', 'f' };

} // HexStuff

Packages

The root compilation unit in Java is the package, either an explicitly named package or an unnamed package (e.g., the file containing the main method). All packages import the default packages which include java.lang.* and any other packages that may be required by the local system. The user may also explicitly import other packages.

When package A imports package B, package B provides:

Class and interface definitions that have the public modifier.
Sub-packages (e.g., packages that are imported into package B).

If package B imports package X which contains the public class foo, the class foo is referred to via the qualified name X.foo.

Packages add yet another level of complexity to the symbol table. A package exists as an object that defines a set of classes, interfaces and sub-packages. Once a package has been read by the compiler, it does not need to be read again when subsequent import statements are encountered, since its definition is already known to the compiler.

The classes, interfaces and packages defined by a package are "imported" into the global scope of the current package. In the Java source, the type names defined in the imported package are referenced via simple names (JLS 6.5.4) and type names defined in the sub-packages of an imported package are referenced via qualified names. However, in the symbol table all type names have an associated fully qualified name.
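The two ideas in this section, reading a package only once and recording a fully qualified name for every imported type, can be sketched as below. Everything here is an assumption made for illustration: the class name PackageTable, the reader function that stands in for whatever actually reads a package, and the reduction of a package to a list of public type names (sub-packages are left out).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class PackageTable {
    // package name -> names of the public types it defines
    private final Map<String, List<String>> packages = new HashMap<>();
    // hypothetical stand-in for the code that reads a package from disk
    private final Function<String, List<String>> reader;
    int reads = 0; // how many times the reader was actually invoked

    PackageTable(Function<String, List<String>> reader) {
        this.reader = reader;
    }

    /** Reads a package at most once; later imports reuse the stored definition. */
    List<String> get(String packageName) {
        return packages.computeIfAbsent(packageName, p -> {
            reads++;
            return reader.apply(p);
        });
    }

    /** Imports a package: each simple type name maps to its fully qualified name. */
    void importInto(Map<String, String> globalScope, String packageName) {
        for (String type : get(packageName)) {
            globalScope.put(type, packageName + "." + type);
        }
    }
}
```

The source text refers to a type by its simple name, while the symbol table entry carries the fully qualified name.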

Symbol Table Implementation Overview

Support for multiple definitions for a given identifier.

All symbols that share the same identifier at a particular scope level are contained in a container. As noted above, an identifier may be a class member, a method and a local class definition. There may also be multiple instances for a given kind of definition. For example, in the Java above there are two definitions for the class member TheSame. The container is searchable by identifier type (member, method or class) and it can quickly be determined whether there is more than one definition of a given type (leading to an ambiguous reference). If the object is named, the symbol will have a field that points to the symbol for its parent (e.g., a method or class). For a block this pointer will be null. Note that parent is not necessarily the parent scope. The symbols defined in the class gin and the interface tonic are in the same scope, but they may have different parents.

Fast global lookup

The global symbol table is implemented by a hash table with a large capacity (the hash table can support a large number of symbols without developing long hash chains).

Package information

Once a package is imported into the global scope, the package is not referenced again. The imported type names (classes and interfaces) are referenced as if they were defined in the current compilation unit (e.g., via simple type names). The sub-packages become objects in the global scope as well. Package type names and additional sub-packages are referenced via qualified names.

Package definitions are kept in a separate package table. Packages are imported into the global scope of the compilation unit from this table. Package information is live for as long as the main compilation unit is being compiled (e.g., throughout the compile process).

Local lookup

In general the number of symbols in a local Java scope is small. Local symbol lookup must be fast, but not as fast as the global lookup, since there will usually be fewer symbols.

I have considered three data structures for implementing the local symbol tables:
- skip lists (see also Thomas Niemann's excellent web page on skip lists).
- Red-Black Trees (a form of balanced binary tree)
- Simple binary tree
For small symbol table sizes the search time does not differ much for these three data structures. The binary tree has the advantage of being the smallest and simplest algorithm, so it has been chosen for local symbol tables.

Support for Java hierarchical scope

Each symbol table contains a pointer to the symbol table in the next scope up.

Searchable by symbol type

The semantic analysis phase knows the context for the symbol it is searching for (e.g., whether the symbol should be a member, method or class). The symbol table hierarchy is searched by identifier and type.

Quickly determine whether a symbol definition is ambiguous

Multiple symbol definitions for a given type of symbol (e.g., two member definitions) are chained together. If the next pointer is not NULL, there are multiple definitions. The error reporting code can use these definitions to report to the user where the clashing symbols were defined.
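The chaining scheme can be sketched as a singly linked list of definitions. The class SymbolDef and its fields are invented for this illustration, and the definition site is reduced to the name of the declaring class or interface; a real symbol would carry source positions as well.

```java
class SymbolDef {
    final String name;
    final String definedIn;  // class or interface holding this definition
    SymbolDef next;          // next definition of the same name and kind, or null

    SymbolDef(String name, String definedIn) {
        this.name = name;
        this.definedIn = definedIn;
    }

    /** Appends another definition of the same kind to the end of the chain. */
    void chain(SymbolDef other) {
        SymbolDef tail = this;
        while (tail.next != null) {
            tail = tail.next;
        }
        tail.next = other;
    }

    /** A non-null next pointer means there is more than one definition. */
    boolean isAmbiguous() {
        return next != null;
    }

    /** Builds an error message listing where each clashing definition came from. */
    String describeClash() {
        StringBuilder sb = new StringBuilder("ambiguous reference to " + name + ": defined in ");
        for (SymbolDef d = this; d != null; d = d.next) {
            sb.append(d.definedIn);
            if (d.next != null) {
                sb.append(", ");
            }
        }
        return sb.toString();
    }
}
```

For the member TheSame in the g_and_t example, chaining the definition from tonic onto the one from gin produces the message "ambiguous reference to TheSame: defined in gin, tonic".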

Symbol Table Construction

All class member declarations are processed and entered into the symbol table before methods are processed. This allows references to class members within a method to be properly resolved.

Declarations in a method are processed sequentially. If a name referenced in a method has not been "seen", an error will be reported (e.g., Undefined name).
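The two rules above can be sketched with a toy encoding of a method body. The encoding is an assumption made only for this example: each statement is either "decl NAME" or "use NAME", listed in source order, and the class members are assumed to have been collected in an earlier pass. The class name ForwardRefCheck is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForwardRefCheck {
    /**
     * members: names of all class members, entered before any method is processed.
     * body: method statements in source order, each "decl NAME" or "use NAME".
     * Returns one error message per use of a name that has not been seen.
     */
    static List<String> check(Set<String> members, List<String> body) {
        Set<String> locals = new HashSet<>();
        List<String> errors = new ArrayList<>();
        for (String stmt : body) {
            String[] parts = stmt.split(" ");
            String name = parts[1];
            if (parts[0].equals("decl")) {
                locals.add(name);  // visible from this point onward
            } else if (!locals.contains(name) && !members.contains(name)) {
                errors.add("Undefined name: " + name);
            }
        }
        return errors;
    }
}
```

Encoding circleArea as decl r, decl area, use pie, use r, decl pie, use area yields a single "Undefined name: pie" error, while getHexChar, with hexTab in the member set, yields none.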

Recursive Compilation and the Symbol Table

When a compilation unit (a package) is compiled, type and package information for all of the packages and classes that it references must be available. The Java Language Specification does not define exactly how this happens. The JLS states that compiled Java code may be stored in a database or in a directory hierarchy that mirrors the qualified names for imported packages and classes. Classes and packages must be accessible. The Java Virtual Machine Specification defines the information in a Java .class file, but it is silent on the issue of compile ordering. Although there is no specification for how Java should be compiled, there is "common practice". At least in the case of this design, "common practice" is based on Sun's javac compiler and Microsoft's Visual J++ compiler jvc.

When a compilation unit is compiled, all information about external classes referenced in the compilation unit is contained in .class files which are produced by compiling the associated Java code (usually stored in .java files). Class files may be packaged in .jar files, which are compressed archived .class file hierarchies in zip file format. The .class or .jar files are located in reference to either the local directory or the CLASSPATH environment variable. For this scheme to work, file names must correspond to the associated type name (e.g., class FooBar is implemented by FooBar.java).

If, when searching for a type definition, the Java compiler finds only a .java file defining the type or the .java file has a newer time stamp (usually file date and time) than the associated Java .class file, the Java compiler will recompile the type definition.
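The recompile decision described above amounts to a file existence and time stamp comparison. A minimal sketch using java.io.File follows; the class and method names are hypothetical, and a real compiler would also have to search the CLASSPATH and look inside .jar files, which this example leaves out.

```java
import java.io.File;

class RecompileCheck {
    /**
     * Returns true if the type must be compiled from source: either only the
     * .java file exists, or the .java file is newer than the .class file.
     */
    static boolean needsRecompile(File javaFile, File classFile) {
        if (!javaFile.exists()) {
            return false;  // no source to compile from; use the .class file
        }
        if (!classFile.exists()) {
            return true;   // only the .java file was found
        }
        return javaFile.lastModified() > classFile.lastModified();
    }
}
```

For the FooBar example this would be called as RecompileCheck.needsRecompile(new File("FooBar.java"), new File("FooBar.class")).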

While compiling the top level compilation unit, the Java compiler keeps track of package objects (where a package contains lists of types and sub-packages) imported by the compilation unit. Package type definitions that are not public are not kept by the compiler, since they cannot be seen outside the package.

Ian Kaplan, May 2, 2000
Revised most recently: May 31, 2000
