Reading 17: Regular Expressions & Grammars

Software in 6.031

Safe from bugs. Correct today and correct in the unknown future.
Easy to understand. Communicating clearly with future programmers, including future you.
Ready for change. Designed to accommodate change without rewriting.

Objectives

After today’s class, you should:

Understand the ideas of grammar productions and regular expression operators
Be able to read a grammar or regular expression and determine whether it matches a sequence of characters
Be able to write a grammar or regular expression to match a set of character sequences and parse them into a data structure

Introduction

Today’s reading introduces several ideas:

grammars, with productions, nonterminals, terminals, and operators
regular expressions

Some program modules take input or produce output in the form of a sequence of bytes or a sequence of characters, which is called a string when it’s simply stored in memory, or a stream when it flows into or out of a module. In today’s reading, we talk about how to write a specification for such a sequence. Concretely, a sequence of bytes or characters might be:

A file on disk, in which case the specification is called the file format
Messages sent over a network, in which case the specification is a wire protocol
A command typed by the user on the console, in which case the specification is a command line interface
A string stored in memory

For these kinds of sequences, we introduce the notion of a grammar, which allows us not only to distinguish between legal and illegal sequences, but also to parse a sequence into a data structure that a program can work with. The data structure produced from a grammar will often be a recursive data type like we talked about in the recursive data type reading.

We also talk about a specialized form of a grammar called a regular expression. In addition to being used for specification and parsing, regular expressions are a widely-used tool for many string-processing tasks that need to disassemble a string, extract information from it, or transform it.

The next reading will talk about parser generators, a kind of tool that translates a grammar automatically into a parser for that grammar.

Grammars

To describe a string of symbols, whether they are bytes, characters, or some other kind of symbol drawn from a fixed set, we use a compact representation called a grammar.

A grammar defines a set of strings. Suppose we want to write a grammar that represents URLs. Our grammar for URLs will specify the set of strings that are legal URLs in the HTTP protocol.

The literal strings in a grammar are called terminals. They’re called terminals because they are the leaves of a parse tree that represents the structure of the string. They don’t have any children, and can’t be expanded any further. We generally write terminals in quotes, like 'http' or ':'.

A grammar is described by a set of productions, where each production defines a nonterminal. You can think of a nonterminal like a variable that stands for a set of strings, and the production as the definition of that variable in terms of other variables (nonterminals), operators, and constants (terminals). Nonterminals are internal nodes of the tree representing a string.

A production in a grammar has the form

nonterminal ::= expression of terminals, nonterminals, and operators

One of the nonterminals of the grammar is designated as the root. The set of strings that the grammar recognizes are the ones that match the root nonterminal. This nonterminal is sometimes called root or start or even just S, but in the grammars below we will typically choose more readable names for the root, like url, html, and markdown.

So a grammar that represents a singleton set, allowing only one specific URL, would have just one production, with a terminal on the right:

url ::= 'http://mit.edu/'

Grammar operators

Productions can use operators to combine terminals and nonterminals on the righthand side. The three most important operators in a production expression are:

Repetition, represented by *:

x ::= y*        x matches zero or more y

Concatenation, represented not by a symbol, but just a space:

x ::= y z       x matches y followed by z

Union, also called alternation, represented by |:

x ::= y | z     x matches either y or z

By convention, postfix operators like * have highest precedence, which means they are applied first. Concatenation is applied next. Alternation | has lowest precedence, which means it is applied last. Parentheses can be used to override precedence:

m ::=  a (b|c) d      m matches a, followed by either b or c, followed by d
x ::=  (y z | a b)*   x matches zero or more yz or ab pairs

Let’s use these operators to generalize our url grammar to match some other hostnames, such as http://stanford.edu/ and http://google.com/.

url ::= 'http://' hostname '/'
hostname ::= 'mit.edu' | 'stanford.edu' | 'google.com'

The first rule of this grammar demonstrates concatenation. The url nonterminal matches strings that start with the literal string http://, followed by a match to the hostname nonterminal, followed by the literal string /.

The hostname rule demonstrates union. A hostname can match one of the three literal strings, mit.edu or stanford.edu or google.com.

So this grammar represents the set of three strings, http://mit.edu/, http://google.com/, and http://stanford.edu/.

Let’s take it one step further by allowing any lowercase word in place of mit, stanford, google, com and edu:

url ::= 'http://' hostname '/'
hostname ::= word '.' word
word ::= ('a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' 
              | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' 
              | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z')*

The new word rule matches a string of zero or more lowercase letters, so the overall grammar can now match http://alibaba.com/ and http://zyxw.edu/ as well. Unfortunately word can also match an empty string, so this url grammar also matches http://./, which is not a legal URL. Here’s a verbose way to fix that, which requires word to match at least one letter.

word ::= ('a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' 
              | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' 
              | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z')
         ('a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' 
              | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' 
              | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z')*

We’ll see a simpler way to write word in the next section.

More grammar operators

You can also use additional operators which are just syntactic sugar (i.e., they’re equivalent to combinations of the main three operators).

0 or 1 occurrence is represented by ?:

x ::=  y?      an x is a y or is the empty string

1 or more occurrences is represented by +:

x ::= y+       an x is one or more y
               (equivalent to  x ::= y y* )

A character class [...] represents the set of single-character strings matching any of the characters listed in the square brackets:

x ::= [aeiou]  is equivalent to  x ::= 'a' | 'e' | 'i' | 'o' | 'u'
x ::= [a-c]    is equivalent to  x ::= 'a' | 'b' | 'c'

An inverted character class [^...] represents the set of single-character strings matching a character not listed in the brackets:

x ::= [^a-c] is equivalent to  x ::= 'd' | 'e' | 'f' 
                                         | ... (all other characters)

These additional operators allow the word production to be expressed more compactly:

url ::= 'http://' hostname '/'
hostname ::= word '.' word
word ::= [a-z]+
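
To see concretely why switching from * to + matters, here is a small sketch that checks both versions of the grammar against the strings discussed above. It borrows Java's built-in regular expression syntax, which mirrors these grammar operators and is introduced later in this reading. The class name WordRuleDemo and its constants and methods are illustrative names, not part of the reading.

```java
import java.util.regex.Pattern;

class WordRuleDemo {
    // url with word ::= [a-z]*  (the first attempt, where word may be empty)
    static final Pattern STAR_URL = Pattern.compile("http://[a-z]*\\.[a-z]*/");
    // url with word ::= [a-z]+  (word must contain at least one letter)
    static final Pattern PLUS_URL = Pattern.compile("http://[a-z]+\\.[a-z]+/");

    static boolean matchesStar(String s) { return STAR_URL.matcher(s).matches(); }
    static boolean matchesPlus(String s) { return PLUS_URL.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] { "http://alibaba.com/", "http://./" }) {
            System.out.println(s + "  star=" + matchesStar(s) + "  plus=" + matchesPlus(s));
        }
    }
}
```

Both versions accept http://alibaba.com/, but only the * version accepts http://./.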

Recursion in grammars

How else do we need to generalize? Hostnames can have more than two components, and there can be an optional port number:

http://didit.csail.mit.edu:4949/

To handle this kind of string, the grammar is now:

url ::= 'http://' hostname (':' port)? '/' 
hostname ::= word '.' hostname | word '.' word
port ::= [0-9]+
word ::= [a-z]+

Notice how hostname is now defined recursively in terms of itself. Which part of the hostname definition is the base case, and which part is the recursive step? What kinds of hostnames are allowed?

Using the repetition operator, we could also write hostname without recursion, like this:

hostname ::= (word '.')+ word

Recursion can sometimes be eliminated from a grammar using operators like this, but not always.
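
As a rough illustration of how the two hostname definitions correspond, the sketch below hand-codes each one: the recursive rule becomes a recursive method, and the repetition rule becomes a loop. This is only one possible way to write such a matcher, and the class and method names (HostnameMatcher, word, recursiveHostname, repeatedHostname) are hypothetical.

```java
class HostnameMatcher {
    /** word ::= [a-z]+ ; returns the index just past the word starting at pos, or -1 if there is none. */
    static int word(String s, int pos) {
        int end = pos;
        while (end < s.length() && s.charAt(end) >= 'a' && s.charAt(end) <= 'z') {
            end++;
        }
        return end > pos ? end : -1;
    }

    /** hostname ::= word '.' hostname | word '.' word ; true if s from pos to the end is a hostname. */
    static boolean recursiveHostname(String s, int pos) {
        int afterWord = word(s, pos);
        if (afterWord < 0 || afterWord >= s.length() || s.charAt(afterWord) != '.') {
            return false;
        }
        int rest = afterWord + 1;
        // recursive step: word '.' hostname
        if (recursiveHostname(s, rest)) {
            return true;
        }
        // base case: word '.' word
        return word(s, rest) == s.length();
    }

    /** hostname ::= (word '.')+ word ; the same language, written with a loop instead of recursion. */
    static boolean repeatedHostname(String s) {
        int pos = 0;
        int dots = 0;
        while (true) {
            int afterWord = word(s, pos);
            if (afterWord < 0) return false;                     // expected a word here
            if (afterWord == s.length()) return dots >= 1;       // final word; need at least one (word '.')
            if (s.charAt(afterWord) != '.') return false;        // only '.' may follow a word
            dots++;
            pos = afterWord + 1;
        }
    }
}
```

Both methods should agree on every input, for example accepting didit.csail.mit.edu and rejecting mit on its own.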

Another thing to observe is that this grammar allows port numbers that are not technically legal, since port numbers can only range from 0 to 65535. We could write a more complex definition of port that would match only these integers, but that’s not typically done in a grammar. Instead, the constraint 0 ≤ port ≤ 65535 would be specified in the program that uses the grammar.
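
A minimal sketch of such a check in the surrounding program might look like this. It assumes the string passed in has already matched port ::= [0-9]+, and the names PortCheck and isLegalPort are made up for this example.

```java
import java.math.BigInteger;

class PortCheck {
    static final BigInteger MAX_PORT = BigInteger.valueOf(65535);

    /**
     * Assumes digits already matched port ::= [0-9]+ (nonempty, digits only).
     * Enforces the extra constraint 0 <= port <= 65535 that the grammar leaves out.
     */
    static boolean isLegalPort(String digits) {
        // BigInteger avoids overflow when the grammar accepts a very long digit string
        return new BigInteger(digits).compareTo(MAX_PORT) <= 0;
    }
}
```

Because the grammar already rules out signs and non-digits, the lower bound of 0 needs no separate test here.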

There are more things we should do to go farther:

generalizing http to support the additional protocols that URLs can have
generalizing the / at the end to a slash-separated path
allowing hostnames with the full set of legal characters instead of just a-z

reading exercises

Reading a Grammar 1

Which strings match the root nonterminal of this grammar?

root    ::= integer ('-' integer)+
integer ::= [0-9]+


Reading a Grammar 2

Which strings match the root nonterminal of this grammar?

root   ::= (A B)+
A      ::= [Aa]
B      ::= [Bb]

aaaBBB

abababab

aBAbabAB

AbAbAbA


Writing a Grammar

Suppose we want the url grammar to also match strings of the form:

https://websis.mit.edu/
ftp://ftp.athena.mit.edu/

but not strings of the form:

ptth://web.mit.edu/
mailto:bitdiddle@mit.edu

So we change the grammar to:

url ::= protocol '://' hostname (':' port)? '/' 
protocol ::= TODO
hostname ::= word '.' hostname | word '.' word
port ::= [0-9]+
word ::= [a-z]+

What could you put in place of TODO to match the desirable URLs but not the undesirable ones?

word

'ftp' | 'http' | 'https'

('http' 's'?) | 'ftp'

('f' | 'ht') 'tp' 's'?


Parse trees

Matching a grammar against a string can generate a parse tree that shows how parts of the string correspond to parts of the grammar.

The leaves of the parse tree are labeled with terminals, representing the parts of the string that have been parsed. They don’t have any children, and can’t be expanded any further. If we concatenate the leaves together, we get back the original string. A trivial example is the one-line URL grammar that we started with, whose (only possible) parse tree is shown at the right:

the parse tree produced by parsing 'http://mit.edu/' with the one-line URL grammar

url ::= 'http://mit.edu/'

Internal nodes of the parse tree are labeled with nonterminals. The immediate children of a nonterminal node must follow the pattern of the nonterminal’s production rule in the grammar. For example, in our more elaborate URL grammar that allows any two-part hostname, the children of a hostname node in the tree must follow the pattern of the hostname rule, word '.' word. The figure on the right shows the parse tree produced by matching this grammar against http://mit.edu/:

the parse tree produced by parsing 'http://mit.edu/' with a grammar with url, hostname, and word nonterminals

url ::= 'http://' hostname '/'
hostname ::= word '.' word
word ::= [a-z]+
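
One possible way to hold such a tree in a program is sketched below. The ParseNode class is a hypothetical, simplified representation (a real parser library would supply its own tree type), and the example tree is built by hand rather than by a parser. It illustrates the property that concatenating the leaves recovers the original string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParseNode {
    final String label;              // nonterminal name, or the matched text for a leaf
    final List<ParseNode> children;  // empty for a leaf (terminal)

    ParseNode(String label, ParseNode... children) {
        this.label = label;
        this.children = new ArrayList<>(Arrays.asList(children));
    }

    // Simplification: a node with no children is treated as a terminal leaf.
    boolean isLeaf() { return children.isEmpty(); }

    /** Concatenates the leaves from left to right, recovering the parsed string. */
    String text() {
        if (isLeaf()) return label;
        StringBuilder sb = new StringBuilder();
        for (ParseNode child : children) {
            sb.append(child.text());
        }
        return sb.toString();
    }

    /** Hand-built tree for http://mit.edu/ under the url / hostname / word grammar. */
    static ParseNode example() {
        return new ParseNode("url",
            new ParseNode("http://"),
            new ParseNode("hostname",
                new ParseNode("word", new ParseNode("mit")),
                new ParseNode("."),
                new ParseNode("word", new ParseNode("edu"))),
            new ParseNode("/"));
    }
}
```

Here the children of the hostname node follow the pattern word '.' word, as the hostname rule requires.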

For a more elaborate example, here is the parse tree for the recursive URL grammar. The tree has more structure now. The hostname and word nonterminals are labeling nodes of the tree whose subtrees match those rules in the grammar.

a parse tree produced by a grammar with a recursive hostname rule

url ::= 'http://' hostname (':' port)? '/' 
hostname ::= word '.' hostname | word '.' word
port ::= [0-9]+
word ::= [a-z]+

reading exercises

Parse trees 1

What string was matched against the grammar to produce the last parse tree above?


Parse trees 2

If the same string was matched against this grammar with a non-recursive hostname rule:

url ::= 'http://' hostname (':' port)? '/' 
hostname ::= (word '.')+ word
port ::= [0-9]+
word ::= [a-z]+

then how many internal nodes (labeled by nonterminals) would the resulting parse tree have? Hint: try to draw the parse tree on paper before counting.


Example: Markdown and HTML

Now let’s look at grammars for some file formats. We’ll be using two different markup languages that represent typographic style in text. Here they are:

Markdown

This is _italic_.

(To learn about Markdown, see the Markdown syntax documentation or Markdown on Wikipedia.)

HTML

Here is an <i>italic</i> word.

(To learn about HTML, see the W3C HTML specification, WHATWG HTML standard, or HTML on Wikipedia.)

For simplicity, our example HTML and Markdown grammars will only specify italics, but other text styles are of course possible. Also for simplicity, we will assume the plain text between the formatting delimiters isn’t allowed to use any formatting punctuation, like _ or <.

Here’s the grammar for our simplified version of Markdown:

a parse tree produced by the Markdown grammar

markdown ::=  ( normal | italic ) *
italic ::= '_' normal '_'
normal ::= text
text ::= [^_]*

Here’s the grammar for our simplified version of HTML:

a parse tree produced by the HTML grammar

html ::=  ( normal | italic ) *
italic ::= '<i>' html '</i>'
normal ::= text
text ::= [^<>]*

reading exercises

Recursive Grammars

Look at the markdown and html grammars above, and compare their italic productions. Notice that not only do they differ in delimiters (_ in one case, < > tags in the other), but also in the nonterminal that is matched between those delimiters. One grammar is recursive; the other grammar is not.

For each string below, if you match the specified grammar against it, which letters are inside matches to the italic nonterminal? Your answer should be some subset of the letters abcde.

markdown: a_b_c_d_e


html: a<i>b<i>c</i>d</i>e


Regular expressions

A regular grammar has a special property: by substituting every nonterminal (except the root one) with its righthand side, you can reduce it down to a single production for the root, with only terminals and operators on the right-hand side.

Our URL grammar is regular. By replacing nonterminals with their productions, it can be reduced to a single expression:

url ::= 'http://' ([a-z]+ '.')+ [a-z]+ (':' [0-9]+)? '/'

The Markdown grammar is also regular:

markdown ::= ([^_]* | '_' [^_]* '_' )*

But our HTML grammar can’t be reduced completely. By substituting righthand sides for nonterminals, you can eventually reduce it to something like this:

html ::=  ( [^<>]* | '<i>' html '</i>' )*

…but the recursive use of html on the righthand side can’t be eliminated, and can’t be simply replaced by a repetition operator either. So the HTML grammar is not regular.
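
To make the role of recursion concrete, here is a sketch of a hand-written matcher for this simplified HTML grammar, with one method per nonterminal. The recursive call from italic() back into html() is what lets it handle arbitrarily deep nesting. HtmlMatcher and its methods are illustrative names, and this is just one way such a matcher could be written, not the approach this course uses for parsing.

```java
class HtmlMatcher {
    private final String s;
    private int pos = 0;  // index of the next unread character

    private HtmlMatcher(String s) { this.s = s; }

    /** Returns true if the whole string matches the html nonterminal. */
    static boolean matches(String s) {
        HtmlMatcher m = new HtmlMatcher(s);
        return m.html() && m.pos == s.length();
    }

    // html ::= ( normal | italic )*
    private boolean html() {
        while (pos < s.length()) {
            char c = s.charAt(pos);
            if (s.startsWith("<i>", pos)) {
                if (!italic()) return false;
            } else if (c != '<' && c != '>') {
                text();
            } else {
                break;  // something like "</i>": leave it for the caller to check
            }
        }
        return true;
    }

    // italic ::= '<i>' html '</i>'
    private boolean italic() {
        pos += "<i>".length();
        if (!html()) return false;               // recursive use of html
        if (!s.startsWith("</i>", pos)) return false;
        pos += "</i>".length();
        return true;
    }

    // normal ::= text ; text ::= [^<>]*   (called only when at least one such character is next)
    private void text() {
        while (pos < s.length() && s.charAt(pos) != '<' && s.charAt(pos) != '>') {
            pos++;
        }
    }
}
```

For example, HtmlMatcher.matches("Here is an <i>italic</i> word.") succeeds, while an unclosed tag such as "<i>a" is rejected.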

The reduced expression of terminals and operators can be written in an even more compact form, called a regular expression. A regular expression does away with the quotes around the terminals, and the spaces between terminals and operators, so that it consists just of terminal characters, parentheses for grouping, and operator characters. For example, the regular expression for our markdown format is just

([^_]*|_[^_]*_)*
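
As a quick check, this regular expression can be handed to Java's java.util.regex library, which is covered later in this reading. The sketch below uses the expression exactly as written above; the class name MarkdownRegexDemo and method isMarkdown are made up for illustration.

```java
import java.util.regex.Pattern;

class MarkdownRegexDemo {
    // The Markdown regex from the text; it contains no characters that need escaping in a Java string.
    static final Pattern MARKDOWN = Pattern.compile("([^_]*|_[^_]*_)*");

    /** Returns true if the entire string matches the simplified Markdown format. */
    static boolean isMarkdown(String s) {
        return MARKDOWN.matcher(s).matches();
    }
}
```

A string with paired underscores, like "This is _italic_.", matches, while a string with a single unpaired underscore does not.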

Regular expressions are also called regexes for short. A regex is far less readable than the original grammar, because it lacks the nonterminal names that documented the meaning of each subexpression. But many programming languages have library support for regexes (and not for grammars), and matching a regex is typically much faster than parsing with a general grammar.

The regex syntax commonly implemented in programming language libraries has a few more special operators, in addition to the ones we used above in grammars. Here are some common useful ones:

.       any single character (but sometimes excluding newline, depending on the regex library)

\d      any digit, same as [0-9]
\s      any whitespace character, including space, tab, newline
\w      any word character including underscore, same as [a-zA-Z_0-9]

\., \(, \), \*, \+, ...
        backslash escapes an operator or special character so that it matches literally

Using backslashes is important whenever there are terminal characters that would be confused with special characters. Because our url regular expression has . in it as a terminal, we need to use a backslash to escape it:

http://([a-z]+\.)+[a-z]+(:[0-9]+)?/
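
The following sketch shows what goes wrong without the escape. An unescaped . matches any character, so the unescaped version accepts a string with no period in it at all. In Java source code the backslash is itself doubled, for reasons explained in the Java section below. The class DotEscapeDemo and its members are illustrative names.

```java
class DotEscapeDemo {
    // '.' left unescaped: it acts as the match-any-character operator
    static final String UNESCAPED = "http://([a-z]+.)+[a-z]+(:[0-9]+)?/";
    // '.' escaped as \. (written \\. inside a Java string literal): matches only a literal period
    static final String ESCAPED = "http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/";

    static boolean matchesUnescaped(String s) { return s.matches(UNESCAPED); }
    static boolean matchesEscaped(String s) { return s.matches(ESCAPED); }
}
```

Both versions accept http://mit.edu/, but only the unescaped one also accepts http://mitxedu/, where an x stands in for the period.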

reading exercises

Regular Expressions

Consider the following regular expression:

[A-G]+(♭|♯)?

Which of the following strings match the regular expression?


Using regular expressions in Java

Regexes are widely used in programming, and you should have them in your toolbox.

In Java, you can use regexes for manipulating strings (see String.split, String.matches, java.util.regex.Pattern). Regex support is also built into modern scripting languages like Python, Ruby, and JavaScript, and many text editors let you use regexes for find and replace. Regular expressions are your friend! Most of the time. Here are some examples.

Replace all runs of spaces with a single space:

String singleSpacedString = string.replaceAll(" +", " ");

Match a URL:

Pattern regex = Pattern.compile("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/");
Matcher m = regex.matcher(string);
if (m.matches()) {
    // then string is a url
}

Extract part of an HTML tag:

Pattern regex = Pattern.compile("<a href=\"([^\"]*)\">");
Matcher m = regex.matcher(string);
if (m.matches()) {
    String url = m.group(1); 
    // Matcher.group(n) returns the nth parenthesized part of the regex
}

Notice the backslashes in the URL and HTML tag examples. In the URL example, we want to match a literal period ., so we have to first escape it as \. to protect it from being interpreted as the regex match-any-character operator, and then we have to further escape it as \\. to protect the backslash from being interpreted as a Java string escape character. In the HTML example, we have to escape the quote mark " as \" to keep it from ending the string. The frequent necessity for backslash escapes makes regexes still less readable.
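
Note that Matcher.matches() succeeds only when the entire string matches the pattern, so the HTML tag example above works when string holds just the tag itself. To pull tags out of a longer piece of text, one option is Matcher.find(), which scans for successive matches anywhere in the string. The sketch below reuses the same regex; LinkExtractor and extractLinks are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Same regex as the HTML tag example: group 1 captures the href value
    static final Pattern LINK = Pattern.compile("<a href=\"([^\"]*)\">");

    /** Returns the href values of all matching <a> tags, in order of appearance. */
    static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {           // find() advances to the next match, if any
            urls.add(m.group(1));
        }
        return urls;
    }
}
```

Like the original, this regex only recognizes tags written in exactly this form, with href as the sole attribute and double quotes around its value.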

reading exercises

Using regexes in Java

Write the shortest regex you can to remove single-word, lowercase-letter-only HTML tags from a string:

String input = "The <b>Good</b>, the <i>Bad</i>, and the <strong>Ugly</strong>";
String regex = "TODO";
String output = input.replaceAll(regex, "");

The desired output for that example is "The Good, the Bad, and the Ugly". What is the shortest regex you can put in place of TODO? You may find it useful to run the code and try your answer.


Context-free grammars

In general, a language that can be expressed with our system of grammars is called context-free. Not all context-free languages are also regular; that is, some grammars can’t be reduced to single nonrecursive productions. Our HTML grammar is context-free but not regular.

The grammars for most programming languages are also context-free. In general, any language with nested structure (like nesting parentheses or braces) is context-free but not regular. That description applies to the Java grammar, shown here in part:

statement ::= 
  '{' statement* '}'
| 'if' '(' expression ')' statement ('else' statement)?
| 'for' '(' forinit? ';' expression? ';' forupdate? ')' statement
| 'while' '(' expression ')' statement
| 'do' statement 'while' '(' expression ')' ';'
| 'try' '{' statement* '}' ( catches | catches? 'finally' '{' statement* '}' )
| 'switch' '(' expression ')' '{' switchgroups '}'
| 'synchronized' '(' expression ')' '{' statement* '}'
| 'return' expression? ';'
| 'throw' expression ';' 
| 'break' identifier? ';'
| 'continue' identifier? ';'
| expression ';' 
| identifier ':' statement
| ';'

Summary

Machine-processed textual languages are ubiquitous in computer science. Grammars are the most popular formalism for describing such languages, and regular expressions are an important subclass of grammars that can be expressed without recursion.

The topics of today’s reading connect to our three properties of good software as follows:

Safe from bugs. Grammars and regular expressions are declarative specifications for strings and streams, which can be used directly by libraries and tools. These specifications are often simpler, more direct, and less likely to be buggy than parsing code written by hand.
Easy to understand. A grammar captures the shape of a sequence in a form that is easier to understand than hand-written parsing code. Regular expressions, alas, are often not easy to understand, because they are a one-line reduced form of what might have been a more understandable regular grammar.
Ready for change. A grammar can be easily edited, but regular expressions, unfortunately, are much harder to change, because a complex regular expression is cryptic and hard to understand.