Reading 18: Parsers

Software in 6.031

Safe from bugs: Correct today and correct in the unknown future.
Easy to understand: Communicating clearly with future programmers, including future you.
Ready for change: Designed to accommodate change without rewriting.

Objectives

After today’s class, you should:

Be able to use a grammar in combination with a parser generator, to parse a character sequence into a parse tree
Be able to convert a parse tree into a useful data type

Parser generators

A parser generator is a good tool that you should make part of your toolbox. A parser generator takes a grammar as input and automatically generates a parser, which takes a sequence of characters and tries to match the sequence against the grammar.

The parser typically produces a parse tree, which shows how grammar productions are expanded into a sentence that matches the character sequence. The root of the parse tree is the root nonterminal of the grammar. Each node of the parse tree expands into one production of the grammar. We’ll see how a parse tree actually looks later in this reading.

The final step of parsing is to do something useful with this parse tree. We are going to translate it into a value of a recursive data type. Recursive abstract data types are often used to represent an expression in a language, like HTML, or Markdown, or Java, or algebraic expressions. A recursive abstract data type that represents a language expression is called an abstract syntax tree (AST).

In 6.031, we are going to use ParserLib, a parser generator for Java developed by the course staff. It is similar in spirit to more widely used parser generators like ANTLR, but it has a simpler interface and is generally easier to use.

A ParserLib grammar

The code for the examples that follow is downloadable as ex18-parsers.

Here is what our HTML grammar looks like as a ParserLib grammar:

html ::= ( italic | normal ) * ;
italic ::= '<i>' html '</i>' ;
normal ::= text ; 
text ::= [^<>]+ ;  /* represents a string of one or more characters that are not < or > */

Let’s break it down.

Each ParserLib rule consists of a name, followed by a ::=, followed by its definition, terminated by a semicolon. The ParserLib grammar can also include Java-style comments, both single line and multiline.

By convention, we use lowercase for nonterminals: html, normal, italic, text. The ParserLib library is case-insensitive with respect to nonterminal names; it canonicalizes names to all-lowercase, so even if you don’t write all your names in lowercase, you will see them as lowercase when you print your grammar.

Terminals are either quoted strings, like '<i>', or regular expressions, like [^<>]+.

html ::= ( italic | normal ) * ;

This rule shows that ParserLib rules can have the alternation operator |, repetition operators like * (and also + and ?, even though they’re not shown in this rule), and parentheses for grouping, in the same way we’ve been using in the grammars reading.

Whitespace in a grammar file is not significant (outside of quoted strings or [...] character classes). Here the operators have been surrounded by spaces to make them more visible. Writing the rule as html::=(italic|normal)*; would have the same effect, just with less readability.

The html nonterminal also happens to be the root of this grammar (also called the start symbol). The root is the nonterminal that the whole input needs to match. It’s good practice to put the rule for the root nonterminal first in the grammar file, so that a human reader can take a top-down view, but it isn’t essential. When we create a parser from the grammar, we tell ParserLib which nonterminal the parse should use as the root.

italic ::= '<i>' html '</i>' ;
normal ::= text ;
text ::= [^<>]+ ;

Note that the text production uses the inverted character class [^<>], discussed in the grammars reading, to represent any character except < and >.
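The character class in the text rule can be tried out on its own with Java's java.util.regex, which uses the same notation for character classes. The sketch below is only an illustration of what [^<>]+ accepts; the class name TextTerminalDemo and the helper isText are made up for this example and are not part of ParserLib.

```java
import java.util.regex.Pattern;

class TextTerminalDemo {
    // Same character class as the grammar's text rule:
    // one or more characters that are not < or >
    static final Pattern TEXT = Pattern.compile("[^<>]+");

    // True if the whole string matches the text rule's regular expression
    static boolean isText(String s) {
        return TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isText("hello world")); // only ordinary characters
        System.out.println(isText("a<i>b"));       // contains < and >
        System.out.println(isText(""));            // + requires at least one character
    }
}
```

Only the first call returns true: the second string contains excluded characters, and the empty string fails because + requires at least one character.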

Whitespace

Consider the grammar shown below.

expr ::= sum ;
sum ::= primary ('+' primary)* ;
primary ::= number | '(' sum ')' ;
number ::= [0-9]+ ;

This grammar will accept an expression like 42+2+5, but will reject a similar expression that has any spaces between the numbers and the + signs. We could modify the grammar to allow whitespace around the plus sign by modifying the production rule for sum like this:

sum ::= primary (whitespace* '+' whitespace* primary)* ;
whitespace ::= [ \t\r\n]+ ;

However, this can become cumbersome very quickly once the grammar becomes more complicated. ParserLib allows a shorthand to indicate that certain kinds of characters should be skipped.

// the IntegerExpression grammar
@skip whitespace {
    expr ::= sum ;
    sum ::= primary ('+' primary)* ;
    primary ::= number | '(' sum ')' ;
}
whitespace ::= [ \t\r\n]+ ;
number ::= [0-9]+ ;

The @skip whitespace notation indicates that zero or more matches to the whitespace nonterminal should be automatically ignored, before and after each terminal, nonterminal, and character class on the righthand side of expr, sum, and primary. So from the point of view of grammar matching, these three rules effectively become:

expr ::= whitespace* sum whitespace* ;
sum ::= whitespace* primary whitespace* (whitespace* '+' whitespace* primary whitespace*)* whitespace* ;
primary ::= whitespace* number whitespace* | whitespace* '('  whitespace* sum  whitespace* ')' whitespace* ;

Several things are important to note about @skip. First, there is nothing special about whitespace. The @skip directive works with any nonterminal defined in the grammar – so you could @skip punctuation or @skip spacesAndComments instead, by defining appropriate rules for those nonterminals.

Second, note how the definition of number was intentionally placed outside the @skip block. This is because we want to accept expressions with extra space around a number, like 42 + 2, but we want to reject spaces within numbers, like 4 2 + 2. Putting the number rule inside the @skip block would cause it to effectively become number ::= (whitespace* [0-9] whitespace*)+, so it would accept numbers with spaces inside them. Putting the rule for primary inside the @skip block – so that its use of number can be surrounded by whitespace – but keeping the rule for number outside, provides the desired behavior.
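To see the effect of keeping number outside the @skip block, here is a rough approximation using java.util.regex rather than ParserLib. It assumes a simplified version of the grammar in which primary is always a number (no parentheses), so that the language can be described by a single regular expression. The names SkipWhitespaceDemo, FLAT_SUM, and matchesFlatSum are hypothetical.

```java
import java.util.regex.Pattern;

class SkipWhitespaceDemo {
    // whitespace* : zero or more skipped whitespace characters
    static final String WS = "[ \\t\\r\\n]*";
    // number is defined outside the @skip block, so no whitespace is allowed inside it
    static final String NUMBER = "[0-9]+";

    // sum ::= primary ('+' primary)* with whitespace* around each symbol,
    // simplified so that primary is always a number
    static final Pattern FLAT_SUM = Pattern.compile(
            WS + NUMBER + WS + "(\\+" + WS + NUMBER + WS + ")*");

    static boolean matchesFlatSum(String s) {
        return FLAT_SUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesFlatSum("42 + 2"));  // spaces around the numbers
        System.out.println(matchesFlatSum("4 2 + 2")); // space inside a number
    }
}
```

Under these assumptions, "42 + 2" matches but "4 2 + 2" does not, because nothing in the pattern allows whitespace between two digits of the same number.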

reading exercises

@skip

Consider this simple grammar, which has semester as its root nonterminal:

@skip spaces {
  semester ::= season year ;
  season ::= 'Fall' | 'Spring' ;
  year ::= [0-9] [0-9] ;
}
spaces ::= ' '+ ;

Which of the following strings are matched by this grammar (where ␣ indicates a space in the string)?

Fall15

␣␣␣␣Spring␣␣␣␣␣␣23

Spring␣9␣9

F␣a␣l␣l␣06


@skip less

Now suppose the grammar is instead:

@skip spaces {
  semester ::= season year ;
}
season ::= 'Fall' | 'Spring' ;
year ::= [0-9] [0-9] ;
spaces ::= ' '+ ;

The rules are still the same, but @skip spaces only wraps around the semester rule.

Which of these strings match the new grammar?

Fall15

␣␣␣␣Spring␣␣␣␣␣␣23

Spring␣9␣9

F␣a␣l␣l␣06


Another @skip

Now suppose the grammar is:

@skip space {
  semester ::= season year ;
}
season ::= 'Fall' | 'Spring' ;
year ::= [0-9] [0-9] ;
space ::= ' ' ;

The spaces nonterminal has been renamed space, and it now matches exactly one space.

Which of these strings match the new grammar?

Fall15

␣␣␣␣Spring␣␣␣␣␣␣23

Spring␣9␣9

F␣a␣l␣l␣06


Generating the parser

Javadoc documentation for ParserLib can be found online.

The rest of this reading will use as a running example the IntegerExpression grammar defined just above, which we’ll store in a file called IntegerExpression.g.

The ParserLib parser generator tool converts a grammar like IntegerExpression.g into a parser. In order to do this, you need to follow three steps. First, you need to import the ParserLib library, which resides in a package edu.mit.eecs.parserlib:

import edu.mit.eecs.parserlib.*;

The second step is to define an Enum type that contains all the nonterminals used by your grammar. This will tell ParserLib which definitions to expect in the grammar and will allow it to check for any missing ones.

IntegerExpressionParser.java

private static enum IntegerGrammar {
  EXPR, SUM, PRIMARY, NUMBER, WHITESPACE
}

Note that ParserLib itself is case insensitive, but by convention, the names of enum values are all upper case.
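One way to picture the relationship between lowercase nonterminal names in the grammar and uppercase enum values is a case-insensitive lookup. The sketch below is not ParserLib's implementation, just an illustration of the idea; the class NonterminalNames and the helper toNonterminal are hypothetical.

```java
import java.util.Locale;

class NonterminalNames {
    enum IntegerGrammar { EXPR, SUM, PRIMARY, NUMBER, WHITESPACE }

    // Hypothetical helper: find the enum value for a nonterminal name, ignoring case.
    // Enum.valueOf throws IllegalArgumentException if no value has that name.
    static IntegerGrammar toNonterminal(String name) {
        return IntegerGrammar.valueOf(name.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(toNonterminal("sum"));     // name as written in the grammar
        System.out.println(toNonterminal("Primary")); // mixed case maps to the same value
    }
}
```

A name with no matching enum value makes the lookup fail, which is the kind of mismatch between grammar and enum that a check for missing definitions is meant to catch.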

The third step is to create a Parser from within your code by calling its static factory method compile.

IntegerExpressionParser.java

Parser<IntegerGrammar> parser = 
    Parser.compile(new File("src/intexpr/IntegerExpression.g"),
                   IntegerGrammar.EXPR);

This code opens the file IntegerExpression.g, using its path relative to the root project folder, and compiles it into a Parser object. Whether the grammar is in a string or a file, the compile method takes as a second argument the name of the nonterminal to use as the root nonterminal of the grammar. In this example, the root nonterminal we want is expr, so we pass IntegerGrammar.EXPR.

Assuming you don’t have any syntax errors in your grammar, the result will be a Parser object that can be used to parse text. Notice that the Parser is a generic type that is parameterized by the IntegerGrammar enum that you defined earlier.

Calling the parser

[Figure: the parse tree produced by parsing '54+(2+ 89)' with the IntegerExpression grammar]

Now that you’ve generated the parser object, you are ready to parse your own text. The parser has a method called parse that takes in the text to be parsed (in the form of either a String, an InputStream, a File or a Reader) and returns a ParseTree. Calling it produces a parse tree:

ParseTree<IntegerGrammar> tree = parser.parse("54+(2+ 89)");

Note that the ParseTree is also a generic type that is parameterized by the enum type IntegerGrammar.

For debugging, we can then print this tree out:

System.out.println(tree.toString());

You can also try calling the method Visualizer.showInBrowser(tree) which will attempt to open a browser window that will show you a visualization of your parse tree. If for any reason it is not able to open the browser window, the method will print a URL to the console which you can copy and paste to your browser to view the visualization.

See the corresponding code in IntegerExpressionParser.java.

Traversing the parse tree

So we’ve used the parser to turn a stream of characters into a parse tree, which shows how the grammar matches the stream. Now we need to do something with this parse tree. We’re going to translate it into a value of a recursive abstract data type.

Like the Parser itself, the ParseTree is parameterized by the type NT, an enum type that lists all the nonterminals in the grammar, like the IntegerGrammar enumeration we defined earlier.

The first step is to learn how to traverse the parse tree. ParseTree has four methods that you should know well. Three of them are fundamental observers:

public interface ParseTree<NT> {

  /**
   * Get this node's name.
   * @return the nonterminal corresponding to this node in the parse tree.
   */
  public NT name();

  /**
   * Get this node's children.
   * @return the children of this node, in order, excluding @skipped subtrees
   */
  public List<ParseTree<NT>> children();

  /**
   * Get this subtree's text.
   * @return the substring of the original string that this subtree matched
   */
  public String text();

Additionally, you can query the ParseTree for all children that match a particular production rule:

  /**
   * Get the children that correspond to a particular production rule 
   * @param name Name of the nonterminal corresponding to the desired production rule.
   * @return children that represent matches of name's production rule.
   */
  public List<ParseTree<NT>> childrenByName(NT name);
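To make the relationship between children() and childrenByName() concrete, here is a small self-contained sketch. It does not use ParserLib: Node is a hypothetical stand-in class with the same observer names, and it assumes that childrenByName() simply filters the direct children by name, which is our reading of the spec above.

```java
import java.util.List;
import java.util.stream.Collectors;

class ParseTreeSketch {
    enum IntegerGrammar { EXPR, SUM, PRIMARY, NUMBER, WHITESPACE }

    // Hypothetical stand-in for a parse tree node (not ParserLib's own class)
    static final class Node {
        private final IntegerGrammar name;
        private final String text;
        private final List<Node> children;

        Node(IntegerGrammar name, String text, List<Node> children) {
            this.name = name;
            this.text = text;
            this.children = List.copyOf(children);
        }

        IntegerGrammar name() { return name; }
        String text() { return text; }
        List<Node> children() { return children; }

        // Assumed behavior: keep only the direct children whose name matches
        List<Node> childrenByName(IntegerGrammar wanted) {
            return children.stream()
                           .filter(child -> child.name() == wanted)
                           .collect(Collectors.toList());
        }
    }

    // Builds the SUM subtree for "2+ 89" by hand
    static Node exampleSum() {
        Node two = new Node(IntegerGrammar.PRIMARY, "2",
                List.of(new Node(IntegerGrammar.NUMBER, "2", List.of())));
        Node eightyNine = new Node(IntegerGrammar.PRIMARY, "89",
                List.of(new Node(IntegerGrammar.NUMBER, "89", List.of())));
        return new Node(IntegerGrammar.SUM, "2+ 89", List.of(two, eightyNine));
    }
}
```

In this sketch, exampleSum().childrenByName(IntegerGrammar.PRIMARY) has two elements, while asking the same node for NUMBER children gives an empty list, because the NUMBER nodes are grandchildren rather than direct children.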

A good way to visit the nodes in a parse tree is to write a recursive function. For example, the recursive function below prints all nodes in the parse tree with proper indentation.

/**
 * Traverse a parse tree, indenting to make it easier to read.
 * @param node   parse tree to print.
 * @param indent indentation to use.
 */
static void printNodes(ParseTree<IntegerGrammar> node, String indent){
    System.out.println(indent + node.name() + ":" + node.text());
    for (ParseTree<IntegerGrammar> child: node.children()){
        printNodes(child, indent + "  ");
    }
}

Running this function on the parse tree for 54+(2+ 89) produces the output below. For reference, the grammar is repeated after the output.

EXPR:54+(2+ 89)
  SUM:54+(2+ 89)
    PRIMARY:54
      NUMBER:54
    PRIMARY:(2+ 89)
      SUM:2+ 89
        PRIMARY:2
          NUMBER:2
        PRIMARY:89
          NUMBER:89

// the IntegerExpression grammar
@skip whitespace {
    expr ::= sum ;
    sum ::= primary ('+' primary)* ;
    primary ::= number | '(' sum ')' ;
}
whitespace ::= [ \t\r\n]+ ;
number ::= [0-9]+ ;

reading exercises

Parse trees

Which of the following statements are true of a ParserLib parse tree, from careful examination of the example output just above?

the root node of the tree corresponds to the root nonterminal of the grammar


each node of the tree is named by a nonterminal in the grammar


terminals do not have their own separate nodes that can be retrieved by children()


only a grammar with recursive productions can generate a parse tree


a node’s immediate children must correspond to nonterminals mentioned in the node’s production


@skip N means that the skipped nonterminal N never appears among the children() of nodes whose rules are inside the @skip block


Snapshot diagram of a ParseTree

Let’s see how a ParserLib parse tree looks as a snapshot diagram, because it’s closer to how we will think about it and work with it in Java code.

The partial snapshot diagram at the right corresponds to part of the IntegerGrammar parse tree for "2+ 89", also shown.

The snapshot diagram shows the three correspondingly-colored nodes along the left branch of the parse tree: sum, primary, and number.

Fill in the gray rectangles in the snapshot diagram.

What should appear at location A?


What should appear at location B?


What should appear at all the locations marked C?


Constructing an abstract syntax tree

We need to convert the parse tree into a recursive data type. Here’s the definition of the recursive data type that we’re going to use to represent integer arithmetic expressions:

IntegerExpression = Number(n:int)
                    + Plus(left:IntegerExpression, right:IntegerExpression)

If this syntax is mysterious, review recursive data type definitions.

When a recursive data type represents a language this way, it is often called an abstract syntax tree. An IntegerExpression value captures the important features of the expression – its grouping and the integers in it – while omitting unnecessary details of the sequence of characters that created it.

By contrast, the parse tree that we just generated with the IntegerExpression parser is a concrete syntax tree. It’s called concrete, rather than abstract, because it contains more details about how the expression is represented in actual characters. For example, the strings 2+2, ((2)+(2)), and 0002+0002 would each produce a different concrete syntax tree, but these trees would all correspond to the same abstract IntegerExpression value: Plus(Number(2), Number(2)).
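One possible way to write this data type in Java is sketched below, using records (which assumes Java 16 or later). The wrapper class AstSketch exists only to keep the example self-contained, and the actual IntegerExpression code in the downloadable example may be organized differently.

```java
class AstSketch {
    // IntegerExpression = Number(n:int) + Plus(left:IntegerExpression, right:IntegerExpression)
    interface IntegerExpression { }

    // Number(n:int)
    record Number(int n) implements IntegerExpression { }

    // Plus(left:IntegerExpression, right:IntegerExpression)
    record Plus(IntegerExpression left, IntegerExpression right) implements IntegerExpression { }

    public static void main(String[] args) {
        // The abstract value for the string 2+2
        IntegerExpression plain = new Plus(new Number(2), new Number(2));
        // The abstract value for 0002+0002: leading zeros disappear when the text becomes an int
        IntegerExpression padded = new Plus(new Number(Integer.parseInt("0002")),
                                            new Number(Integer.parseInt("0002")));
        System.out.println(plain.equals(padded)); // records compare field by field
    }
}
```

Because records define equals() structurally, the two values compare as equal, which mirrors the point that different concrete strings can correspond to the same abstract value.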

Now, we can create a recursive function that walks the ParseTree to produce an IntegerExpression as follows:

IntegerExpressionParser.java

/**
 * Convert a parse tree into an abstract syntax tree.
 * 
 * @param parseTree constructed according to the grammar in IntegerExpression.g
 * @return abstract syntax tree corresponding to parseTree
 */
private static IntegerExpression makeAbstractSyntaxTree(final ParseTree<IntegerGrammar> parseTree) {

    switch (parseTree.name()) {
    case EXPR: // expr ::= sum;
        {
            final ParseTree<IntegerGrammar> child = parseTree.children().get(0);
            return makeAbstractSyntaxTree(child);
        }

    case SUM: // sum ::= primary ('+' primary)*;
        {
            final List<ParseTree<IntegerGrammar>> children = parseTree.children();
            IntegerExpression expression = makeAbstractSyntaxTree(children.get(0));
            for (int i = 1; i < children.size(); ++i) {
                expression = new Plus(expression, makeAbstractSyntaxTree(children.get(i)));
            }
            return expression;
        }

    case PRIMARY: // primary ::= number | '(' sum ')';
        {
            final ParseTree<IntegerGrammar> child = parseTree.children().get(0);
            // check which alternative (number or sum) was actually matched
            switch (child.name()) {
            case NUMBER:
                return makeAbstractSyntaxTree(child);
            case SUM:
                return makeAbstractSyntaxTree(child); // in this case, we do the
                                                      // same thing either way
            default:
                throw new AssertionError("should never get here");
            }
        }

    case NUMBER: // number ::= [0-9]+;
        {
            final int n = Integer.parseInt(parseTree.text());
            return new Number(n);
        }

    default:
        throw new AssertionError("should never get here");
    }

}

The function follows the structure of the grammar, handling each rule from the grammar in turn: expr, sum, primary, and number. The only rule we don’t need to handle here is whitespace, because the grammar uses whitespace only in a @skip block, and skipped subtrees are not returned by children() so they should never appear in the traversal.

Note that this code is tied very closely to the grammar in IntegerExpression.g. If you change the rules of the grammar in a significant way, this code will likely need to change too.

And remember: the execution of a switch statement starts at the matching case and falls through to subsequent cases. For example, if we neglected to return expression at the end of the SUM case, Java would simply continue on and run the code under PRIMARY. The function is careful to return or throw from every case, never falling through.
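The fall-through behavior is easy to observe in a small standalone example. The class FallThroughDemo, the enum Rule, and the method visited below are made up for illustration; they deliberately omit return and break to show what happens.

```java
import java.util.ArrayList;
import java.util.List;

class FallThroughDemo {
    enum Rule { SUM, PRIMARY, NUMBER }

    // Records which case bodies run when no case ends with return or break
    static List<String> visited(Rule rule) {
        List<String> log = new ArrayList<>();
        switch (rule) {
        case SUM:
            log.add("SUM");     // no return or break: execution continues into PRIMARY
        case PRIMARY:
            log.add("PRIMARY"); // likewise continues into NUMBER
        case NUMBER:
            log.add("NUMBER");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(visited(Rule.SUM));    // prints [SUM, PRIMARY, NUMBER]
        System.out.println(visited(Rule.NUMBER)); // prints [NUMBER]
    }
}
```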

reading exercises

String to AST 1

If the input string is "19+23+18", which abstract syntax tree would be produced by makeAbstractSyntaxTree above?

Plus(Number(19))

Plus(19, 23, 18)

Plus(Plus(19, 23), 18)

Plus(Plus(Number(19), Number(23)), Number(18))

Plus(Number(19), Plus(Number(23), Number(18)))


String to AST 2

Which of the following input strings would produce:

Plus(Plus(Number(1), Number(2)), 
     Plus(Number(3), Number(4)))

"(1+2)+(3+4)"

"1+2+3+4"

"(1+2)+3+4"

"(((1+2)))+(3+4)"


Binary parse tree

In the code we’re looking at, the Plus operator in the abstract syntax tree is binary. It represents the sum of exactly two expressions, a lefthand side and a righthand side.

Its corresponding grammar rule sum ::= primary ('+' primary)* is n-ary. It matches the sum of n primary expressions, where n ≥ 1. As a result, a SUM node in the parse tree may have one or more PRIMARY children, not just two.

Part of the complexity of the SUM case of makeAbstractSyntaxTree() comes from translating an n-ary node into binary nodes.

If we wanted the SUM node to be (at most) binary instead, which of the following grammar rules would do it correctly?

sum ::= primary '+' primary

sum ::= primary '+' primary*

sum ::= primary | primary '+' sum

sum ::= sum '+' sum


N-ary AST

Suppose instead that we keep the SUM node in the parse tree as n-ary, but we want to change the abstract syntax tree Plus node from strictly binary to n-ary. Which of these datatype definitions would do it correctly?

IntegerExpression = Number(n:int) + Plus(operands:List<Integer>)

IntegerExpression = Number(n:int) + Plus(operands:List<IntegerExpression>)

IntegerExpression = Numbers(list:List<Integer>) + Plus(expr:IntegerExpression)


Enumeration

For this grammar:

@skip spaces { 
  date ::= monthname day ',' year | day '/' monthnum '/' year ;
}
monthname ::= 'January' | 'February' ;
monthnum ::= [0-9]+ ; 
day ::= [0-9]+ ;
year ::= [0-9]+ ;
spaces ::= ' '+ ;

…what values would you need to have in the enumeration type passed to ParserLib?

Specifically, when you define enum DateGrammar { ... } and call Parser.compile() to create a Parser<DateGrammar>, what values should appear in the ... of the enum?


Using a parse tree

Consider this grammar, which is designed to match a list of numbers like 5,-8,3:

@skip spaces { 
  list ::= int (',' list)? ;
}
int ::= '-'? [0-9]+ ;
spaces ::= ' '+ ;

The root nonterminal is list.

Fill in the blanks of this function that converts a ParseTree for this grammar into a List.

static enum ListGrammar { LIST, INT, SPACES }
static List<Integer> makeList(ParseTree<ListGrammar> parseTree) {
    switch (parseTree.name()) {
    case LIST:
        List<Integer> list = new ArrayList<>();
        list.add(Integer.parseInt(parseTree.▶▶A◀◀));
        if (parseTree.▶▶B◀◀ > 1) {
            list.addAll(makeList(parseTree.▶▶C◀◀));
        }
        return list;
    default:
        throw new AssertionError("should never get here");
    }


}


To every thing…

Consider this grammar, with semester as its root nonterminal:

@skip spaces {
  semester ::= season year ;
  season ::= 'Fall' | 'Spring' ;
  year ::= [0-9] [0-9] ;
}
spaces ::= ' '+ ;

Suppose ParserLib is used to produce a parse tree, and you are writing code to convert the parse tree into abstract data types representing semesters and seasons. Here is part of that code intended to handle just the season node of the tree:

// nonterminals in the grammar
static enum SemesterGrammar { SEMESTER, SEASON, YEAR, SPACES }

// abstract datatype representing a season of the year
static enum Season { FALL, SPRING, WINTER, SUMMER }

/**
 * @param node must be a match to the season rule
 * @return corresponding Season value
 */
static Season convertToSeason(ParseTree<SemesterGrammar> node) {
  assert node.name() == SemesterGrammar.SEASON;
  return ▶▶A◀◀ ? Season.FALL : Season.SPRING;
}

What is the simplest correct code for ▶▶A◀◀? (You may want to look at the spec of ParseTree, specifically ParseTree.text().)

node.text() == ("Fall")

node.text().equals("Fall")

node.text().trim().equals("Fall")


…there is a season

Now suppose the grammar is instead:

@skip spaces {
  semester ::= season year ;
}
season ::= 'Fall' | 'Spring' ;
year ::= [0-9] [0-9] ;
spaces ::= ' '+ ;

In the same line of code in the previous exercise:

  return ▶▶A◀◀ ? Season.FALL : Season.SPRING;

…what is now the simplest correct code for ▶▶A◀◀?

node.text() == ("Fall")

node.text().equals("Fall")

node.text().trim().equals("Fall")


Handling errors

Several things can go wrong when parsing.

Your grammar file may not exist or may fail to open. In this case, the compile method will throw an IOException.

Your grammar may have a syntax error in it. In this case, compile will throw an UnableToParseException.

The string you are trying to parse may not be parseable with your given grammar. This might happen because your grammar is incorrect, or because your string is incorrect. Either way, the problem will be signaled by the parse method throwing an UnableToParseException.

The UnableToParseException contains some information about the possible location of the error. Parse errors are sometimes inherently difficult to localize, however, because the parser cannot know what string you intended to write, so you may need to search a little to find the true location of the error.

Left recursion and other ParserLib limitations

ParserLib works by generating a top-down recursive descent parser. This kind of parser has a few limitations in terms of the grammars that it can parse. Two in particular are worth pointing out.

Left recursion. A recursive descent parser can go into an infinite loop if the grammar involves left recursion. This is a case where a definition for a nonterminal involves that nonterminal as its leftmost symbol. For example, the grammar below includes left recursion because one of the possible definitions of sum is sum '+' number which has sum as its leftmost symbol.

sum ::= number | sum '+' number ;
number ::= [0-9]+ ;

This left-recursive definition is problematic because the recursive descent parser needs to try matching all alternatives for sum, both number and sum '+' number. But trying to match sum '+' number immediately requires matching sum as its first step. The parser keeps trying the rule recursively but never makes any progress through the string being parsed, so it never reduces the problem to a smaller subproblem, as recursion must do in order to terminate. By contrast, a rule like expr ::= number | '(' expr ')', which is recursive but not left-recursive, makes progress through the string by matching ( before recursively applying the expr rule again.

Left recursion can also happen indirectly. For example, changing the grammar above to the one below does not address the problem because the definition of sum still indirectly involves a symbol that has sum as its first symbol.

sum ::= number | thing number ;
thing ::= sum '+' ;
number ::= [0-9]+ ;

If you give any of these grammars to ParserLib and then try to use it to parse a string, ParserLib will fail with an UnableToParseException listing the offending nonterminal.

There are some general techniques to eliminate left recursion; for our purposes, the simplest approach will be to replace left recursion with repetition (*), so the grammar above becomes:

sum ::= (number '+')* number ;
number ::= [0-9]+ ;
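To see why repetition avoids the problem, here is a minimal hand-written recursive descent sketch for this grammar. It is not how ParserLib is implemented; the class SumParserSketch and its methods are hypothetical, it evaluates the sum directly instead of building a tree, and it uses the equivalent form number ('+' number)*, which accepts the same strings. The repetition becomes a loop that consumes a + on every iteration, so the parser always makes progress through the input.

```java
class SumParserSketch {
    private final String input;
    private int pos = 0; // index of the next unread character

    private SumParserSketch(String input) {
        this.input = input;
    }

    // Parse a string matching the sum rule and return the total
    static int evaluate(String input) {
        SumParserSketch p = new SumParserSketch(input);
        int total = p.number();
        while (p.pos < input.length() && input.charAt(p.pos) == '+') {
            p.pos++;             // consume '+', so each iteration makes progress
            total += p.number();
        }
        if (p.pos != input.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return total;
    }

    // number ::= [0-9]+
    private int number() {
        int start = pos;
        while (pos < input.length() && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at position " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }
}
```

For example, SumParserSketch.evaluate("1+2+3") returns 6. A method for the left-recursive rule would instead have to call itself before consuming any input, which is the infinite recursion described above.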

Greediness. This is an issue that you may not run into in this class, but it is a limitation of ParserLib you should be aware of. The ParserLib parsers are greedy in that at every point they try to match a maximal string for any rule they are currently considering. For example, consider the following grammar:

g ::= ab threeb ;
ab ::= 'a'*'b'* ;
threeb ::= 'bbb' ;

The string 'aaaabbb' clearly should match g, but a greedy parser cannot parse it because it will try to parse a maximal substring that matches the ab symbol, and then it will find that it cannot parse threeb because it has already consumed the entire string. Unlike left recursion, which is easy to fix, this is a more fundamental limitation of the type of parser implemented by ParserLib.
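A similar effect can be reproduced with java.util.regex, as an analogy rather than a description of ParserLib's internals. Java's possessive quantifiers (written *+) match as much as possible and never give characters back, while the ordinary * quantifier is allowed to backtrack. The class and method names below are made up for this sketch.

```java
import java.util.regex.Pattern;

class GreedinessDemo {
    // Possessive quantifiers: consume as much as possible and never give characters back
    static final Pattern NO_BACKTRACKING = Pattern.compile("a*+b*+bbb");
    // Ordinary quantifiers: may backtrack so that the rest of the pattern can match
    static final Pattern BACKTRACKING = Pattern.compile("a*b*bbb");

    static boolean matchesWithoutBacktracking(String s) {
        return NO_BACKTRACKING.matcher(s).matches();
    }

    static boolean matchesWithBacktracking(String s) {
        return BACKTRACKING.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWithoutBacktracking("aaaabbb")); // prints false
        System.out.println(matchesWithBacktracking("aaaabbb"));    // prints true
    }
}
```

In the possessive version, b*+ consumes all three b characters and leaves nothing for the final bbb, just as the ab rule leaves nothing for threeb.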

Summary

The topics of today’s reading connect to our three properties of good software as follows:

Safe from bugs. A grammar is a declarative specification for strings and streams, which can be implemented automatically by a parser generator. These specifications are often simpler, more direct, and less likely to be buggy than parsing code written by hand.
Easy to understand. A grammar captures the shape of a sequence in a form that is compact and easier to understand than hand-written parsing code.
Ready for change. A grammar can be easily edited, then run through a parser generator to regenerate the parsing code.