Although covering all aspect of the Java CUP system, this manual is relatively brief, assumes you have at least a little bit of knowledge of LR parsing, and preferably have a bit of experience with a program such as YACC. A number of compiler construction textbooks (such as [2,3]) cover this material, and discuss the YACC system (which is quite similar to this one) as a specific example.
Using Java CUP involves creating a simple specifications based on the grammar for which a parser is needed, along with construction of a scanner capable of breaking characters up into meaningful tokens (such as keywords, numbers, and special symbols).
As a simple example, consider a system for evaluating simple arithmetic expressions over integers. This system would read expressions from standard input (each terminated with a semicolon), evaluate them, and print the result on standard output. A grammar for the input to such a system might look like:
expr_list ::= expr_list expr_part | expr_part expr_part ::= expr ';' expr ::= expr '+' term | expr '-' term | term term ::= term '*' factor | term '/' factor | term '%' factor | factor factor ::= number | '-' expr | '(' expr ')'To specify a parser based on this grammar, our first step is to identify and name the set of terminal symbols that will appear on input, and the set of non terminal symbols. In this case, the non terminals are:
expr_list, expr_part, expr, term, and factor.For terminal names we might choose:
SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD, NUMBER, LPAREN, and RPARENBased on these namings we can construct a small Java CUP specification as follows:
// Java CUP specification for a simple expression evaluator (no actions) import java_cup.runtime.*; /* Preliminaries to set up and use the scanner. */ init with {: scanner.init(); :}; scan with {: return scanner.next_token(); :}; /* Terminals (tokens returned by the scanner). */ terminal token SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD, LPAREN, RPAREN; terminal int_token NUMBER; /* Non terminals */ non terminal symbol expr_list, expr_part; non terminal int_token expr, term, factor; /* The grammar */ expr_list ::= expr_list expr_part | expr_part; expr_part ::= expr SEMI; expr ::= expr PLUS term | expr MINUS term | term; term ::= term TIMES factor | term DIVIDE factor | term MOD factor | factor; factor ::= NUMBER | MINUS factor | LPAREN expr LPAREN;
To produce a parser from this specification we use the Java CUP generator. If this specification were stored in a file parser.cup, then (on a Unix system at least) we might invoke Java CUP using a command like:
java java_cup.Main < parser.cupIn this case, the system will produce two Java source files containing parts of the generated parser: sym.java and parser.java. As you might expect, these two files contain declarations for the classes sym and parser. The sym class contains a series of constant declarations, one for each terminal symbol. This is typically used by the scanner to refer to symbols (e.g. with code such as "return new token(sym.SEMI);" ). The parser class implements the parser itself.
The specification above, while constructing a full parser, does not perform any semantic actions -- it will only indicate success or failure of a parse. To calculate and print values of each expression, we must embed Java code within the parser to carry out actions at various points. In Java CUP, actions are contained in code strings which are surrounded by delimiters of the form {: and :} (we can see examples of this in the init with and scan with clauses above). In general, the system records all characters within the delimiters, but does not try to check that it contains valid Java code.
A more complete Java CUP specification for our example system (with actions
embedded at various points in the grammar) is shown below:
// Java CUP specification for a simple expression evaluator (w/ actions) import java_cup.runtime.*; /* Preliminaries to set up and use the scanner. */ init with {: scanner.init(); :}; scan with {: return scanner.next_token(); :}; /* Terminals (tokens returned by the scanner). */ terminal token SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD, LPAREN, RPAREN; terminal int_token NUMBER; /* Non terminals */ non terminal symbol expr_list, expr_part; non terminal int_token expr, term, factor; /* The grammar */ expr_list ::= expr_list expr_part | expr_part; expr_part ::= expr:e {: System.out.println("= " + e.int_val); :} SEMI ; expr ::= expr:e1 PLUS term:e2 {: RESULT.int_val = e1.int_val + e2.int_val; :} | expr:e1 MINUS term:e2 {: RESULT.int_val = e1.int_val - e2.int_val; :} | term:e1 {: RESULT.int_val = e1.int_val; :} ; term ::= term:e1 TIMES factor:e2 {: RESULT.int_val = e1.int_val * e2.int_val; :} | term:e1 DIVIDE factor:e2 {: RESULT.int_val = e1.int_val / e2.int_val; :} | term:e1 MOD factor:e2 {: RESULT.int_val = e1.int_val % e2.int_val; :} | factor:e {: RESULT.int_val = e.int_val; :} ; factor ::= NUMBER:n {: RESULT.int_val = n.int_val; :} | MINUS factor:e {: RESULT.int_val = -e.int_val; :} | LPAREN expr:e RPAREN {: RESULT.int_val = e.int_val; :} ;
expr ::= expr:e1 PLUS term:e2 {: RESULT.int_val = e1.int_val + e2.int_val; :}the non terminal expr has been labeled with e1, while term has been labeled with e2. The left hand side symbol of each production is always implicitly labeled as RESULT.
Each symbol appearing in a production is represented at runtime by an object (on the parse stack). These labels allow code embedded in a production to refer to these objects. Since expr and term were both declared as int_token, they are both represented by an object of class int_token. These objects are created as the result of matching some other production. The code in that production fills in various fields of its result object, which are in turn used here to fill in a new result object, and so on. Overall this is a very common form of syntax directed translation related to attribute grammars and discussed at length in compiler construction textbooks such as [2,3].
In our specific example, the int_token class includes an int_val field which stores an int value. We use this field to compute the value of the expression from its component parts. In the production above, we compute the int_val field of the left hand side symbol (i.e. RESULT) as the sum of the values computed by the expr and term parts making up this expression. That value in turn may be combined with other to compute a final result.
The final step in creating a working parser is to create a scanner (also known as a lexical analyzer or simply a lexer). This routine is responsible for reading individual characters, removing things things like white space and comments, recognizing which terminal symbols from the grammar each group of characters represents, then returning token objects representing these symbols to the parser. Example code for a workable (if not elegant or efficient) scanner for our example system can be found in Appendix B.
Like the very simple one given in Appendix B, all scanners need to return objects which are instances of java_cup.runtime.token (or one of its subclasses). The runtime system predefines three such classes: token which contains no specific information beyond the token type (and some internal information used by the parser), int_token which also records a single int value, and str_token which records a single string value.
The code contained in the init with clause of the specification will be executed before any tokens are requested. Each token will be requested using whatever code is found in the scan with clause. Beyond this, the exact form the scanner takes is up to you.
In the next section a more detailed and formal
explanation of all parts of a Java CUP specification will be given.
Section 3 describes options for running the
Java CUP system. Section 4 discusses the details
of how to customize a Java CUP parser, while Section 5
considers error recovery. Finally, Section 6
provides a conclusion.
2. Specification Syntax
Now that we have seen a small example, we present a complete description of all
parts of a Java CUP specification. A specification has four sections with
a total of eight specific parts (however, most of these are optional).
A specification consists of:
package name;where name name is a Java package identifier, possibly in several parts separated by ".". In general, Java CUP employs Java lexical conventions. So for example, both styles of Java comments are supported, and identifiers are constructed beginning with a letter, dollar sign ($), or underscore (_), which can then be followed by zero or more letters, numbers, dollar signs, and underscores.
After an optional package declaration, there can be zero or more import declarations. As in a Java program these have the form:
import package_name.class_name;or
import package_name.*;The package declaration indicates what package the sym and parser classes that are generated by the system will be in. Any import declarations that appear in the specification will also appear in the source file for the parser class allowing various names from that package to be used directly in user supplied action code.
action code {: ... :};where {: ... :} is a code string whose contents will be placed directly within the action class class declaration.
After the action code declaration is an optional parser code declaration. This declaration allows methods and variable to be placed directly within the generated parser class. Although this is less common, it can be helpful when customizing the parser -- it is possible for example, to include scanning methods inside the parser and/or override the default error reporting routines. This declaration is very similar to the action code declaration and takes the form:
parser code {: ... :};Again, code from the code string is placed directly into the generated parser class definition.
Next in the specification is the optional init declaration which has the form:
init with {: ... :};This declaration provides code that will be executed by the parser before it asks for the first token. Typically, this is used to initialize the scanner as well as various tables and other data structures that might be needed by semantic actions. In this case, the code given in the code string forms the body of a void method inside the parser class.
The final (optional) user code section of the specification indicates how the parser should ask for the next token from the scanner. This has the form:
scan with {: ... :};As with the init clause, the contents of the code string forms the body of a method in the generated parser. However, in this case the method returns an object of type java_cup.runtime.token. Consequently the code found in the scan with clause should return such a value.
terminal classname name1, name2, ...;and
non terminal classname name1, name2, ...;where classname can be a multiple part name separated with "."s. Since the parser uses these objects for internal bookkeeping, the classes used for non terminal symbols must be a subclass of java_cup.runtime.symbol. Similarly, the classes used for terminal symbols must be a subclass of java_cup.runtime.token (note that java_cup.runtime.token is itself a subclass of java_cup.runtime.symbol).
start with nonterminal;This indicates which non terminal is the start or goal non terminal for parsing. If a start non terminal is not explicitly declared, then the non terminal on the left hand side of the first production will be used.
The grammar itself follows the optional start declaration. Each production in the grammar has a left hand side non terminal followed by the symbol "::=", which is then followed by a series of zero or more actions, terminal, or non terminal symbols, and terminated with a semicolon (;). Each symbol on the right hand side can optionally be labeled with a name. Label names appear after the symbol name separated by a colon (:). Label names must be unique within the production, and can be used within action code to refer to the runtime object that represents the symbol. If there are several productions for the same non terminal they may be declared together. In this case the productions start with the non terminal and "::=". This is followed by multiple right hand sides each separated by a bar (|). The full set of productions is then terminated by a semicolon.
Actions appear in the right hand side as code strings (e.g., Java code inside
{: ... :} delimiters). These are executed by the parser
at the point when the portion of the production to the left of the
action has been recognized. (Note that the scanner will have returned the
token one past the point of the action since the parser needs this extra
lookahead token for recognition.)
3. Running Java CUP
As mentioned above, Java CUP is written in Java. To invoke it, one needs
to use the Java interpreter to invoke the static method
java_cup.Main(), passing an array of strings containing options.
Assuming a Unix machine, the simplest way to do this is typically to invoke it
directly from the command line with a command such as:
java java_cup.Main options < inputfileOnce running, Java CUP expects to find a specification file on standard input and produces two Java source files as output.
In addition to the specification file, Java CUP's behavior can also be changed by passing various options to it. Legal options include:
The parser class contains the actual generated parser. It is a subclass of java_cup.runtime.lr_parser which implements a general table driven framework for an LR parser. The generated parser class provides a series of tables for use by the general framework. Three tables are provided:
Beyond the parse tables, generated (or inherited) code provides a series of methods that can be used to customize the generated parser. Some of these methods are supplied by code found in part of the specification and can be customized directly in that fashion. The others are provided by the lr_parser base class and can be overridden with new versions (via the parser code declaration) to customize the system. Methods available for customization include:
In addition to the normal parser, the runtime system also provides a debugging version of the parser. This operates in exactly the same way as the normal parser, but prints debugging messages (by calling public void debug_message(String mess) whose default implementation prints a message to System.err).
Based on these routines, invocation of a Java CUP parser is typically done with code such as:
/* create a parsing object */ parser parse_obj = new parser(); /* open input files, etc. here */ try { if (do_debug_parse) parser_obj.debug_parse(); else parser_obj.parse(); } catch (Exception e) { /* do cleanup here -- possibly rethrow e */ } finally { /* do close out here */ }
The error symbol only comes into play if a syntax error is detected. If a syntax error is detected then the parser tries to replace some portion of the input token stream with error and then continue parsing. For example, we might have productions such as:
stmt ::= expr SEMI | while_stmt SEMI | if_stmt SEMI | ... | error SEMI ;This indicates that if none of the normal productions for stmt can be matched by the input, then a syntax error should be declared, and recovery should be made by skipping erroneous tokens (equivalent to matching and replacing them with error) up to a point at which the parse can be continued with a semicolon (and additional context that legally follows a statement). An error is considered to be recovered from if and only if a sufficient number of tokens past the error symbol can be successfully parsed. (The number of tokens required is determined by the error_sync_size() method of the parser and defaults to 3).
Specifically, the parser first looks for the closest state to the top
of the parse stack that has an outgoing transition under
error. This generally corresponds to working from
productions that represent more detailed constructs (such as a specific
kind of statement) up to productions that represent more general or
enclosing constructs (such as the general production for all
statements or a production representing a whole section of declarations)
until we get to a place where an error recovery production
has been provided for. Once the parser is placed into a configuration
that has an immediate error recovery (by popping the stack to the first
such state), the parser begins skipping tokens to find a point at
which the parse can be continued. After discarding each token, the
parser attempts to parse ahead in the input (without executing any
embedded semantic actions). If the parser can successfully parse past
the required number of tokens, then the input is backed up to the point
of recovery and the parse is resumed normally (executing all actions).
If the parse cannot be continued far enough, then another token is
discarded and the parser again tries to parse ahead. If the end of
input is reached without making a successful recovery (or there was no
suitable error recovery state found on the parse stack to begin with)
then error recovery fails.
6. Conclusion
This manual has briefly described the Java CUP LALR parser generation system.
Java CUP is designed to fill the same role as the well known YACC parser
generator system, but is written in and operates entirely with Java code
rather than C or C++. Additional details on the operation of the system can
be found in the parser generator and runtime source code. See the Java CUP
home page below for access to the API documentation for the system and its
runtime.
This document covers the system as it stands at the time of its fourth alpha release (v0.9d). Check the Java CUP home page: http://www.cc.gatech.edu/gvu/people/Faculty/hudson/java_cup/home.html for the latest release information, instructions for downloading the system, and additional news about the system. Bug reports and other comments for the developers can be sent to java-cup@cc.gatech.edu
Java CUP was originally written by Scott Hudson, in August of 1995.
java_cup_spec ::= package_spec import_list code_part init_code scan_code symbol_list start_spec production_list package_spec ::= PACKAGE multipart_id SEMI | empty import_list ::= import_list import_spec | empty import_spec ::= IMPORT import_id SEMI code_part ::= action_code_part parser_code_part action_code_part ::= ACTION CODE CODE_STRING SEMI | empty parser_code_part ::= PARSER CODE CODE_STRING SEMI | empty init_code ::= INIT WITH CODE_STRING SEMI | empty scan_code ::= SCAN WITH CODE_STRING SEMI | empty symbol_list ::= symbol_list symbol | symbol symbol ::= TERMINAL type_id term_name_list SEMI | NON TERMINAL type_id non_term_name_list SEMI term_name_list ::= term_name_list COMMA new_term_id | new_term_id non_term_name_list ::= non_term_name_list COMMA new_non_term_id | new_non_term_id start_spec ::= START WITH nt_id SEMI | empty production_list ::= production_list production | production production ::= nt_id COLON_COLON_EQUALS rhs_list SEMI rhs_list ::= rhs_list BAR rhs | rhs rhs ::= prod_part_list prod_part_list ::= prod_part_list prod_part | empty prod_part ::= symbol_id opt_label | CODE_STRING opt_label ::= COLON label_id | empty multipart_id ::= multipart_id DOT ID | ID import_id ::= multipart_id DOT STAR | multipart_id type_id ::= multipart_id new_term_id ::= ID new_non_term_id ::= ID nt_id ::= ID symbol_id ::= ID label_id ::= ID
// Simple Example Scanner Class import java_cup.runtime.*; public class scanner { /* single lookahead character */ protected static int next_char; /* advance input by one character */ protected static void advance() { next_char = System.in.read(); } /* initialize the scanner */ public static void init() { advance(); } /* recognize and return the next complete token */ public static token next_token() { for (;;) switch (next_char) { case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': /* parse a decimal integer */ int i_val = 0; do { i_val = i_val * 10 + (next_char - '0'); advance(); } while (next_char >= '0' && next_char <= '9'); return new int_token(sym.NUMBER, i_val); case ';': advance(); return new token(sym.SEMI); case '+': advance(); return new token(sym.PLUS); case '-': advance(); return new token(sym.MINUS); case '*': advance(); return new token(sym.TIMES); case '/': advance(); return new token(sym.DIVIDE); case '%': advance(); return new token(sym.MOD); case '(': advance(); return new token(sym.LPAREN); case ')': advance(); return new token(sym.RPAREN); case -1: return new token(sym.EOF); default: /* in this simple scanner we just ignore everything else */ advance(); break; } } };