ANTLR Tree Construction

ANTLR helps you build intermediate form trees, or abstract syntax trees (ASTs), by providing grammar annotations that indicate which tokens are to be treated as subtree roots, which are to be leaves, and which are to be ignored with respect to tree construction. As with PCCTS 1.33, you may manipulate trees using tree grammar actions.

Programmers often either have existing tree definitions or need a special physical structure, which prevents ANTLR from dictating the implementation of AST nodes. ANTLR therefore specifies only an interface describing minimum behavior. Your tree implementation must implement this interface so ANTLR knows how to work with your trees. Further, you must tell the parser the name of your tree nodes or provide a tree "factory" so that ANTLR knows how to create nodes with the correct type (rather than hardcoding a new AST() expression everywhere). ANTLR can construct and walk any tree that satisfies the AST interface. A number of common tree definitions are provided.

Notation

In this and other documents, tree structures are represented by a LISP-like notation. For example:

    #(A B C)

is a tree with A at the root, and children B and C. This notation can be nested to describe trees of arbitrary structure. For example:

    #(A B #(C D E))

is a tree with A at the root, B as a first child, and an entire subtree as the second child. The subtree, in turn, has C at the root and D,E as children.

Controlling AST construction

AST construction in an ANTLR Parser, or AST transformation in a Tree-Parser, is turned on and off by the buildAST option.

From an AST construction and walking point of view, ANTLR considers all tree nodes to look the same (i.e., they appear to be homogeneous). Through a tree factory or by specification, however, you can instruct ANTLR to create nodes of different types. See the section below on heterogeneous trees.
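The LISP-like notation from the Notation section maps directly onto a first-child/next-sibling structure. The following is a minimal sketch of that mapping in plain Java. The names NotationDemo, Node, make, leaf, and toNotation are hypothetical and exist only for this illustration; they are not ANTLR classes or methods.

```java
class NotationDemo {
    /** Hypothetical child-sibling node for illustration; not an ANTLR class. */
    static final class Node {
        final String text;
        Node firstChild;   // leftmost child, or null for a leaf
        Node nextSibling;  // next node to the right, or null

        Node(String text) { this.text = text; }

        /** Append a child at the right end of the child list. */
        void addChild(Node c) {
            if (firstChild == null) { firstChild = c; return; }
            Node last = firstChild;
            while (last.nextSibling != null) last = last.nextSibling;
            last.nextSibling = c;
        }

        /** Render this node and its subtree as #(root child ...); siblings of this node are ignored. */
        String toNotation() {
            if (firstChild == null) return text;
            StringBuilder sb = new StringBuilder("#(").append(text);
            for (Node c = firstChild; c != null; c = c.nextSibling) {
                sb.append(' ').append(c.toNotation());
            }
            return sb.append(')').toString();
        }
    }

    /** Build a root with the given children, mirroring #(root c1 c2 ...). */
    static Node make(String root, Node... children) {
        Node r = new Node(root);
        for (Node c : children) r.addChild(c);
        return r;
    }

    static Node leaf(String text) { return new Node(text); }

    public static void main(String[] args) {
        Node t = make("A", leaf("B"), make("C", leaf("D"), leaf("E")));
        System.out.println(t.toNotation()); // prints "#(A B #(C D E))"
    }
}
```

Every node here has the same class, which is the sense in which such trees appear homogeneous to the code that builds and walks them.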
Grammar annotations for building ASTs

Leaf nodes

ANTLR assumes that any nonsuffixed token reference or token-range is a leaf node in the resulting tree for the enclosing rule. If no suffixes at all are specified in a grammar, then a Parser will construct a linked-list of the tokens (a degenerate AST), and a Tree-Parser will copy the input AST.

Root nodes

Any token suffixed with the "^" operator is considered a root token. A tree node is constructed for that token and is made the root of whatever portion of the tree has been built so far. For example, the rule

    a : A B^ C^ ;

results in tree #(C #(B A)). First A is matched and made a lonely child, followed by B which is made the parent of the current tree, A. Finally, C is matched and made the parent of the current tree, making it the parent of the B node. Note that the same rule without any operators results in the flat tree A B C.

Turning off standard tree construction

Suffix a token reference with "!" to prevent incorporation of the node for that token into the resulting tree (the AST node for the token is still constructed and may be referenced in actions; it is just not added to the result tree automatically). Suffix a rule reference with "!" to indicate that the tree constructed by the invoked rule should not be linked into the tree constructed for the current rule.

Suffix a rule definition with "!" to indicate that tree construction for the rule is to be turned off. Rules and tokens referenced within that rule still create ASTs, but they are not linked into a result tree. The following rule does no automatic tree construction. Actions must be used to set the return AST value, for example:

    begin!
        : INT PLUS i:INT
          { #begin = #(PLUS INT i); }
        ;

For finer granularity, prefix alternatives with "!" to shut off tree construction for that alternative only. This granularity is useful, for example, if you have a large number of alternatives and you only want one to have manual tree construction:

    stat:
        ID EQUALS^ expr   // auto construction
        ... some alternatives ...
      |! RETURN expr
         {#stat = #([IMAGINARY_TOKEN_TYPE] expr);}
        ... more alternatives ...
      ;

Tree node construction

With automatic tree construction off (but with
To construct a tree structure from a set of nodes, you can set the first-child and next-sibling references yourself, call the factory make method, or use the #(...) notation described below.

AST Action Translation

In parsers and tree parsers with buildAST set to true, ANTLR will translate portions of user actions in order to make it easier to build ASTs within actions. In particular, the following constructs starting with '#' will be translated:
The target code generator performs this translation with the help of a special lexer that parses the actions and asks the code-generator to create appropriate substitutions for each translated item.

Invoking parsers that build trees

Assuming that you have defined a lexer L and a parser P in your grammar, you can invoke them sequentially on the system input stream as follows.

    L lexer = new L(System.in);
    P parser = new P(lexer);
    parser.setASTNodeType("MyAST");
    parser.startRule();

If you have set buildAST=true in your parser grammar, then it will build an AST, which can be accessed via parser.getAST(). If you have defined a tree parser called T, you can invoke it with:

    T walker = new T();
    walker.startRule(parser.getAST());  // walk tree

If, in addition, you have set buildAST=true in your tree-parser to turn on transform mode, then you can access the resulting AST of the tree-walker:

    AST results = walker.getAST();
    DumpASTVisitor visitor = new DumpASTVisitor();
    visitor.visit(results);

where DumpASTVisitor is a predefined ASTVisitor implementation that simply prints the tree to the standard output.

You can also get a LISP-like printout of a tree via

    String s = parser.getAST().toStringList();

AST Factories

ANTLR uses a factory pattern to create and connect AST nodes. This is done primarily to separate the tree construction facility from the parser, but it also gives you a hook in between the parser and the tree node construction. Subclass ASTFactory to alter the create methods.

If you are only interested in specifying the AST node type at runtime, use the setASTNodeClass(String className) method on the parser or factory. By default, trees are constructed of nodes of type CommonAST.

The ASTFactory has some generically useful methods:

    /** Copy a single node. clone() is not used because
     *  we want to return an AST not a plain object...type
     *  safety issue. Further, we want to have all AST node
     *  creation go through the factory so creation can be
     *  tracked. Returns null if t is null.
     */
    public AST dup(AST t);

    /** Duplicate tree including siblings
     *  of root.
     */
    public AST dupList(AST t);

    /** Duplicate a tree, assuming this is a
     *  root node of a tree--duplicate that node
     *  and what's below; ignore siblings of root
     *  node.
     */
    public AST dupTree(AST t);

Heterogeneous ASTs

Each node in an AST must encode information about the kind of node it is; for example, is it an ADD operator or a leaf node such as an INT? There are two ways to encode this: with a token type or with a Java (or C++ etc...) class type. In other words, do you have a single class type with numerous token types or no token types and numerous classes? For lack of better terms, I (Terence) have been calling ASTs with a single class type homogeneous trees and ASTs with many class types heterogeneous trees.

The only reason to have a different class type for the various kinds of nodes is for the case where you want to execute a bunch of hand-coded tree walks or your nodes store radically different kinds of data. The example I use below demonstrates an expression tree where each node overrides value() so that root.value() is the result of evaluating the input expression. From the perspective of building trees and walking them with a generated tree parser, it is best to consider every node as an identical AST node. Hence, the schism that exists between the hetero- and homogeneous AST camps.

ANTLR supports both kinds of tree nodes--at the same time! If you do nothing but turn on the "buildAST=true" option, you get a homogeneous tree. Later, if you want to use physically separate class types for some of the nodes, just specify that in the grammar that builds the tree. Then you can have the best of both worlds--the trees are built automatically, but you can apply different methods to and store different data in the various nodes. Note that the structure of the tree is unaffected; just the type of the nodes changes.
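To make the two encodings concrete, here is a small plain-Java sketch that evaluates the expression 3+4*5+21 both ways. It is only an illustration under simplifying assumptions: the class names (EncodingDemo, HomoNode, Expr, IntLit, Add, Mul) and the token type constants are made up for this sketch, nodes hold explicit left/right references rather than child-sibling links, and nothing here uses ANTLR itself.

```java
class EncodingDemo {
    // Made-up token type constants for this sketch only.
    static final int INT = 1, PLUS = 2, STAR = 3;

    /** Homogeneous style: one node class; the kind of node lives in a token type. */
    static final class HomoNode {
        final int type;
        final String text;
        final HomoNode left, right; // both null for leaves

        HomoNode(int type, String text, HomoNode left, HomoNode right) {
            this.type = type; this.text = text; this.left = left; this.right = right;
        }
    }

    /** External walk that dispatches on the token type. */
    static int eval(HomoNode n) {
        switch (n.type) {
            case INT:  return Integer.parseInt(n.text);
            case PLUS: return eval(n.left) + eval(n.right);
            case STAR: return eval(n.left) * eval(n.right);
            default:   throw new IllegalArgumentException("unknown type " + n.type);
        }
    }

    /** Heterogeneous style: the kind of node is its class; each class evaluates itself. */
    interface Expr { int value(); }

    static final class IntLit implements Expr {
        final int v;
        IntLit(int v) { this.v = v; }
        public int value() { return v; }
    }

    static final class Add implements Expr {
        final Expr l, r;
        Add(Expr l, Expr r) { this.l = l; this.r = r; }
        public int value() { return l.value() + r.value(); }
    }

    static final class Mul implements Expr {
        final Expr l, r;
        Mul(Expr l, Expr r) { this.l = l; this.r = r; }
        public int value() { return l.value() * r.value(); }
    }

    static HomoNode num(String s) { return new HomoNode(INT, s, null, null); }

    /** 3+4*5+21 as a homogeneous tree. */
    static HomoNode homoSample() {
        HomoNode mul = new HomoNode(STAR, "*", num("4"), num("5"));
        HomoNode inner = new HomoNode(PLUS, "+", num("3"), mul);
        return new HomoNode(PLUS, "+", inner, num("21"));
    }

    /** 3+4*5+21 as a heterogeneous tree. */
    static Expr heteroSample() {
        return new Add(new Add(new IntLit(3), new Mul(new IntLit(4), new IntLit(5))),
                       new IntLit(21));
    }

    public static void main(String[] args) {
        System.out.println(eval(homoSample()));      // prints 44
        System.out.println(heteroSample().value());  // prints 44
    }
}
```

The tree shape is identical in both cases; only where the "kind" information lives, and therefore where the evaluation logic lives, differs.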
ANTLR applies a "scoping" sort of algorithm for determining the class type of a particular AST node that it needs to create. The default type is CommonAST unless, prior to parser invocation, you override that with a call to:

    myParser.setASTNodeClass("com.acme.MyAST");

In the grammar, you can override the default class type by setting the type for nodes created from a particular input token. Use the element option <AST=typename> in the tokens section:

    tokens {
        PLUS<AST=PLUSNode>;
        ...
    }

You may further override the class type by annotating a particular token reference in your parser grammar:

    anInt : INT<AST=INTNode> ;

This reference override is super useful for tokens such as ID that you might want converted to a TYPENAME node in one context and a VARREF in another context.

ANTLR uses the AST factory to create nodes for which it does not know a specific type. In other words, ANTLR generates code similar to the following:

    AST tmp2_AST = (AST)astFactory.create(LT(1));

On the other hand, if you specify a class to use, either in the tokens section or on a particular reference, ANTLR generates the more appropriate:

    INTNode tmp3_AST = new INTNode(LT(1));

Besides being faster and more obvious, this code alleviates another problem with homogeneous ASTs: you have to cast like mad from your type to AST and back (though ANTLR does let you set the AST label with grammar option ASTLabelType).

An Expression Tree Example

This example includes a parser that constructs expression ASTs, the usual lexer, and some AST node class definitions.

Let's start by describing the AST structure and node types. Expressions have plus and multiply operators and integers. The operators will be subtree roots (nonleaf nodes) and integers will be leaf nodes. For example, input 3+4*5+21 yields a tree with structure:

    ( + ( + 3 ( * 4 5 ) ) 21 )

or:

    +
    |
    +--21
    |
    3--*
       |
       4--5

All AST nodes are subclasses of CalcAST, which are BaseAST's that also answer method value().
Method value() evaluates the tree starting at that node. Naturally, for integer nodes, value() will simply return the value stored within that node. Here is CalcAST:

    public abstract class CalcAST extends antlr.BaseAST {
        public abstract int value();
    }

The AST operator nodes must combine the results of computing the value of their two subtrees. They must perform a depth-first walk of the tree below them. For fun and to make the operations more obvious, the operator nodes define left() and right() instead, making them appear even more different than the normal child-sibling tree representation. Consequently, these expression trees can be treated as both homogeneous child-sibling trees and heterogeneous expression trees.

    public abstract class BinaryOperatorAST extends CalcAST {
        /** Make me look like a heterogeneous tree */
        public CalcAST left() {
            return (CalcAST)getFirstChild();
        }

        public CalcAST right() {
            CalcAST t = left();
            if ( t==null ) return null;
            return (CalcAST)t.getNextSibling();
        }
    }

The simplest node in the tree looks like:

    import antlr.BaseAST;
    import antlr.Token;
    import antlr.collections.AST;
    import java.io.*;

    /** A simple node to represent an INT */
    public class INTNode extends CalcAST {
        int v=0;

        public INTNode(Token tok) {
            v = Integer.parseInt(tok.getText());
        }

        /** Compute value of subtree; this is
         *  heterogeneous part :)
         */
        public int value() {
            return v;
        }

        public String toString() {
            return " "+v;
        }

        // satisfy abstract methods from BaseAST
        public void initialize(int t, String txt) {
        }
        public void initialize(AST t) {
        }
        public void initialize(Token tok) {
        }
    }

The operators derive from BinaryOperatorAST and define value() in terms of left() and right().
For example, here is PLUSNode:

    import antlr.BaseAST;
    import antlr.Token;
    import antlr.collections.AST;
    import java.io.*;

    /** A simple node to represent PLUS operation */
    public class PLUSNode extends BinaryOperatorAST {
        public PLUSNode(Token tok) {
        }

        /** Compute value of subtree;
         *  this is heterogeneous part :)
         */
        public int value() {
            return left().value() + right().value();
        }

        public String toString() {
            return " +";
        }

        // satisfy abstract methods from BaseAST
        public void initialize(int t, String txt) {
        }
        public void initialize(AST t) {
        }
        public void initialize(Token tok) {
        }
    }

The parser is pretty straightforward except that you have to add the options to tell ANTLR what node types you want to create for which token matched on the input stream. The tokens section lists the operators with element option AST appended to their definitions. This tells ANTLR to build PLUSNode objects for any PLUS tokens seen on the input stream, for example. For demonstration purposes, INT is not included in the tokens section--the specific token reference is suffixed with the element option to specify that nodes created from that INT should be of type INTNode (of course, the effect is the same as there is only that one reference to INT).

    class CalcParser extends Parser;
    options {
        buildAST = true; // uses CommonAST by default
    }

    // define a bunch of specific AST nodes to build.
    // can override at actual reference of tokens in
    // grammar below.
    tokens {
        PLUS<AST=PLUSNode>;
        STAR<AST=MULTNode>;
    }

    expr:   mexpr (PLUS^ mexpr)* SEMI!
        ;

    mexpr
        :   atom (STAR^ atom)*
        ;

    // Demonstrate token reference option
    atom:   INT<AST=INTNode>
        ;

Invoking the parser is done as usual. Computing the value of the resulting AST is accomplished by simply calling method value() on the root.
    import java.io.*;
    import antlr.CommonAST;
    import antlr.collections.AST;

    class Main {
        public static void main(String[] args) {
            try {
                CalcLexer lexer =
                    new CalcLexer(
                        new DataInputStream(System.in)
                    );
                CalcParser parser =
                    new CalcParser(lexer);
                // Parse the input expression
                parser.expr();
                CalcAST t = (CalcAST)parser.getAST();

                System.out.println(t.toStringTree());

                // Compute value and return
                int r = t.value();
                System.out.println("value is "+r);
            } catch(Exception e) {
                System.err.println("exception: "+e);
                e.printStackTrace();
            }
        }
    }

For completeness, here is the lexer:

    class CalcLexer extends Lexer;

    WS  :   (' '
        |   '\t'
        |   '\n'
        |   '\r')
            { $setType(Token.SKIP); }
        ;

    LPAREN: '(' ;

    RPAREN: ')' ;

    STAR:   '*' ;

    PLUS:   '+' ;

    SEMI:   ';' ;

    protected
    DIGIT
        :   '0'..'9' ;

    INT :   (DIGIT)+ ;

Describing Heterogeneous Trees With Grammars

So what's the difference between this approach and default homogeneous tree construction? The big difference is that you need a tree grammar to describe the expression tree and compute resulting values. But, that's a good thing as it's "executable documentation" and negates the need to handcode the tree parser (the value() methods). If you used homogeneous trees, here is all you would need beyond the parser/lexer to evaluate the expressions: [This code comes from the examples/java/calc directory.]

    class CalcTreeWalker extends TreeParser;

    expr returns [float r]
    {
        float a,b;
        r=0;
    }
        :   #(PLUS a=expr b=expr)   {r = a+b;}
        |   #(STAR a=expr b=expr)   {r = a*b;}
        |   i:INT
            {r = (float)Integer.parseInt(i.getText());}
        ;

Because Terence wants you to use tree grammars even when constructing heterogeneous ASTs (to avoid handcoding methods that implement a depth-first-search), implement the following methods in your various heterogeneous AST node class definitions:

    /** Get the token text for this node */
    public String getText();

    /** Get the token type for this node */
    public int getType();

That is how you can use heterogeneous trees with a tree grammar.
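As a rough illustration of why these two methods matter, the sketch below shows heterogeneous node classes reporting a token type and text, and a hand-written dispatch that stands in for what a generated tree walker does when it matches on node type. Everything in it is hypothetical: HeteroTypeDemo, TokenTypes, TypedNode, PlusNode, IntNode, and describe are names invented for this example, TypedNode is a two-method stand-in rather than the real AST interface, and the value chosen for INT is an assumption.

```java
class HeteroTypeDemo {
    /** Stand-in for a generated token types interface; INT's value is assumed. */
    interface TokenTypes {
        int PLUS = 4;
        int STAR = 5;
        int INT = 6;
    }

    /** Two-method stand-in for the part of the AST interface a tree walker relies on. */
    interface TypedNode {
        String getText();
        int getType();
    }

    /** Heterogeneous operator node that still reports its token type. */
    static final class PlusNode implements TypedNode {
        public String getText() { return "+"; }
        public int getType() { return TokenTypes.PLUS; }
    }

    /** Heterogeneous leaf node that stores an int rather than token text. */
    static final class IntNode implements TypedNode {
        private final int v;
        IntNode(int v) { this.v = v; }
        public String getText() { return Integer.toString(v); }
        public int getType() { return TokenTypes.INT; }
    }

    /** Hand-written dispatch on getType(), standing in for a generated walker. */
    static String describe(TypedNode n) {
        switch (n.getType()) {
            case TokenTypes.PLUS: return "operator " + n.getText();
            case TokenTypes.INT:  return "integer " + n.getText();
            default:              return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new PlusNode()));  // prints "operator +"
        System.out.println(describe(new IntNode(21))); // prints "integer 21"
    }
}
```

If a node class returned the wrong constant from getType(), the dispatch would take the wrong branch even though the object's class is correct.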
Note that your token types must match the PLUS and STAR token types imported from your parser. I.e., make sure PLUSNode.getType() returns CalcParserTokenTypes.PLUS. The token types are generated by ANTLR in interface files that look like:

    public interface CalcParserTokenTypes {
        ...
        int PLUS = 4;
        int STAR = 5;
        ...
    }

AST (XML) Serialization

[Oliver Zeigermann olli@zeigermann.de provided the initial implementation of this serialization. His XTAL XML translation code is worth checking out, particularly for reading XML-serialized ASTs back in.]

For a variety of reasons, you may want to store an AST or pass it to another program or computer. Class antlr.BaseAST is Serializable using the Java code generator, which means you can write ASTs to the disk using the standard Java stuff. You can also write the ASTs out in XML form using the following methods from BaseAST:
All methods throw IOException.

You can override xmlSerializeNode and so on to change the way nodes are written out. By default the serialization uses the class type name as the tag name and has attributes text and type to store the text and token type of the node.

The output from running the simple heterogeneous tree example, examples/java/heteroAST, yields:

    ( + ( + 3 ( * 4 5 ) ) 21 )
    <PLUS><PLUS><int>3</int><MULT>
    <int>4</int><int>5</int>
    </MULT></PLUS><int>21</int></PLUS>
    value is 44

The LISP-form of the tree shows the structure and contents. The various heterogeneous nodes override the open and close tags and change the way leaf nodes are serialized to use <int>value</int> instead of tag attributes of a single node.

Here is the code that generates the XML:

    Writer w = new OutputStreamWriter(System.out);
    t.xmlSerialize(w);
    w.write("\n");
    w.flush();

AST enumerations

The AST findAll and findAllPartial methods return enumerations of tree nodes that you can walk. Interface antlr.collections.ASTEnumeration and class antlr.collections.impl.ASTEnumerator implement this functionality. Here is an example:

    // Print out all instances of the subtree of
    // interest found within tree 't'.
    // 'subtreeOfInterest' is a placeholder for the
    // AST you are searching for.
    ASTEnumeration matches;
    matches = t.findAll(subtreeOfInterest);
    while ( matches.hasMoreNodes() ) {
        System.out.println(
            matches.nextNode().toStringList()
        );
    }

A few examples

    sum : term ( PLUS^ term )* ;

The "^" suffix on the PLUS tells ANTLR to create an additional node and place it as the root of whatever subtree has been constructed up until that point for rule sum. The subtrees returned by the term references are collected as children of the addition nodes. If the subrule is not matched, the associated nodes would not be added to the tree. The rule returns either the tree matched for the first term reference or a PLUS-rooted tree.

The grammar annotations should be viewed as operators, not static specifications.
In the above example, each iteration of the (...)* will create a new PLUS root, with the previous tree on the left, and the tree from the new term on the right, thus preserving the usual associativity for "+".

Look at the following rule that turns off default tree construction.

    decl!
        : modifiers type ID SEMI
          { #decl = #([DECL], ID, ([TYPE] type),
                      ([MOD] modifiers) ); }
        ;

In this example, a declaration is matched. The resulting AST has an "imaginary" DECL node at the root, with three children. The first child is the ID of the declaration. The second child is a subtree with an imaginary TYPE node at the root and the AST from the type rule as its child. The third child is a subtree with an imaginary MOD at the root and the results of the modifiers rule as its child.

Labeled subrules

[THIS WILL NOT BE IMPLEMENTED AS LABELED SUBRULES...We'll do something else eventually.]

In 2.00 ANTLR, each rule has exactly one tree associated with it. Subrules simply add elements to the tree for the enclosing rule, which is normally what you want. For example, expression trees are easily built via:

    expr: ID ( PLUS^ ID )*
        ;

However, many times you want the elements of a subrule to produce a tree that is independent of the rule's tree. Recall that exponents must be computed before coefficients are multiplied in for exponent terms. The following grammar matches the correct syntax.

    // match exponent terms such as "3*x^4"
    eterm
        : expr MULT ID EXPONENT expr
        ;

However, to produce the correct AST, you would normally split the ID EXPONENT expr portion into another rule like this:

    eterm:
        expr MULT^ exp
        ;

    exp:
        ID EXPONENT^ expr
        ;

In this manner, each operator would be the root of the appropriate subrule.
For input 3*x^4, the tree would look like:

    #(MULT 3 #(EXPONENT ID 4))

However, if you attempted to keep this grammar in the same rule:

    eterm
        : expr MULT^ (ID EXPONENT^ expr)
        ;

both "^" root operators would modify the same tree, yielding

    #(EXPONENT #(MULT 3 ID) 4)

This tree has the operators as roots, but they are associated with the wrong operands.

Using a labeled subrule allows the original rule to generate the correct tree.

    eterm
        : expr MULT^ e:(ID EXPONENT^ expr)
        ;

In this case, for the same input 3*x^4, the labeled subrule would build up its own subtree and make it the operand of the MULT tree of the eterm rule. The presence of the label alters the AST code generation for the elements within the subrule, making it operate more like a normal rule. Annotations of "^" make the node created for that token reference the root of the tree for the e subrule.

Labeled subrules have a result AST that can be accessed just like the result AST for a rule. For example, we could rewrite the above decl example using labeled subrules (note the use of ! at the start of the subrules to suppress automatic construction for the subrule):

    decl!
        : m:(! modifiers { #m = #([MOD] modifiers); } )
          t:(! type { #t = #([TYPE] type); } )
          ID
          SEMI
          { #decl = #( [DECL] ID t m ); }
        ;

What about subrules that are closure loops? The same rules apply to a closure subrule--there is a single tree for that loop that is built up according to the AST operators annotating the elements of that loop. For example, consider the following rule.

    term: T^ i:(OP^ expr)+
        ;

For input T OP A OP B OP C, the following tree structure would be created:

    #(T #(OP #(OP #(OP A) B) C) )

which can be drawn graphically as

    T
    |
    OP
    |
    OP--C
    |
    OP--B
    |
    A

The first important thing to note is that each iteration of the loop in the subrule operates on the same tree. The resulting tree, after all iterations of the loop, is associated with the subrule label.
The result tree for the above labeled subrule is:

    #(OP #(OP #(OP A) B) C)

The second thing to note is that, because T is matched first and there is a root operator after it in the rule, T would be at the bottom of the tree if it were not for the label on the subrule.

Loops will generally be used to build up lists of subtrees. For example, if you want a list of polynomial assignments to produce a sibling list of ASSIGN subtrees, then you would normally split the following rule into two rules.

    interp
        : ( ID ASSIGN poly ";" )+
        ;

Normally, the following would be required:

    interp
        : ( assign )+
        ;

    assign
        : ID ASSIGN^ poly ";"!
        ;

Labeling a subrule allows you to write the above example more easily as:

    interp
        : ( r:(ID ASSIGN^ poly ";") )+
        ;

Each recognition of a subrule results in a tree, and if the subrule is nested in a loop, all trees are returned as a list of trees (i.e., the roots of the subtrees are siblings). If the labeled subrule is suffixed with a "!", then the tree(s) created by the subrule are not linked into the tree for the enclosing rule or subrule.

Labeled subrules within labeled subrules result in trees that are linked into the surrounding subrule's tree. For example, the following rule results in a tree of the form X #( A #(B C) D) Y.

    a   : X r:( A^ s:(B^ C) D) Y
        ;

Labeled subrules within nonlabeled subrules result in trees that are linked into the surrounding rule's tree. For example, the following rule results in a tree of the form #(A X #(B C) D Y).

    a   : X ( A^ s:(B^ C) D) Y
        ;

Reference nodes

Not implemented. A node that does nothing but refer to another node in the tree. Nice for embedding the same tree in multiple lists.

Required AST functionality and form

The data structure representing your trees can have any form or type name as long as it implements the AST interface:

    package antlr.collections;

    /** Minimal AST node interface used by ANTLR
     *  AST generation and tree-walker.
     */
    public interface AST {
        /** Get the token type for this node */
        public int getType();

        /** Set the token type for this node */
        public void setType(int ttype);

        /** Get the token text for this node */
        public String getText();

        /** Set the token text for this node */
        public void setText(String text);

        /** Get the first child of this node;
         *  null if no children
         */
        public AST getFirstChild();

        /** Set the first child of a node */
        public void setFirstChild(AST c);

        /** Get the next sibling in line after this
         *  one
         */
        public AST getNextSibling();

        /** Set the next sibling after this one */
        public void setNextSibling(AST n);

        /** Add a (rightmost) child to this node */
        public void addChild(AST node);

        /** Are two nodes exactly equal? */
        public boolean equals(AST t);

        /** Are two lists of nodes/subtrees exactly
         *  equal in structure and content?
         */
        public boolean equalsList(AST t);

        /** Are two lists of nodes/subtrees
         *  partially equal? In other words, 'this'
         *  can be bigger than 't'
         */
        public boolean equalsListPartial(AST t);

        /** Are two nodes/subtrees exactly equal? */
        public boolean equalsTree(AST t);

        /** Are two nodes/subtrees exactly partially
         *  equal? In other words, 'this' can be
         *  bigger than 't'.
         */
        public boolean equalsTreePartial(AST t);

        /** Return an enumeration of all exact tree
         *  matches for tree within 'this'.
         */
        public ASTEnumeration findAll(AST tree);

        /** Return an enumeration of all partial
         *  tree matches for tree within 'this'.
         */
        public ASTEnumeration findAllPartial(
            AST subtree);

        /** Init a node with token type and text */
        public void initialize(int t, String txt);

        /** Init a node using content from 't' */
        public void initialize(AST t);

        /** Init a node using content from 't' */
        public void initialize(Token t);

        /** Convert node to printable form */
        public String toString();

        /** Treat 'this' as list (i.e.,
         *  consider 'this'
         *  siblings) and convert to printable
         *  form
         */
        public String toStringList();

        /** Treat 'this' as tree root
         *  (i.e., don't consider
         *  'this' siblings) and convert
         *  to printable form
         */
        public String toStringTree();
    }

This scheme does not preclude the use of heterogeneous trees versus homogeneous trees. However, you will need to write extra code to create heterogeneous trees (via a subclass of ASTFactory, or by specifying the node types at the token reference sites or in the tokens section), whereas the homogeneous trees are free.

Version: $Id: //depot/code/org.antlr/release/antlr-2.7.0/doc/trees.html#3 $