This page describes "Big Hack," my submission to the 2001 ICFP Programming Contest.

**Update:**
I have posted the
results of running my program on
the judges' sample inputs.

My program is based on a dynamic programming algorithm that computes an optimal output for any input. It uses O(n^3) time and O(n^2) space, with some pretty nasty constant factors hidden by the big-O notation. Unfortunately, that is not good enough to handle large inputs. The algorithm is executed in iterations, and after each iteration the program uses the information calculated so far to prepare a candidate output. When time runs out, the program prints the most recent candidate output (or the input, if it is shorter).

The program was written in C++, for the simple reason that I have been writing C++ code at work for the past two years, so I am most comfortable working in that language for now.

My program first parses the input into an internal representation equivalent to the "meaning" of the document as described in the problem specification, in other words, a sequence of characters and associated contexts. My representation differs from that described in the spec in one significant way: I invented a 9th pseudo-color called the "root color" and an 11th pseudo-size called the "root size" to represent characters not enclosed in any color or size tags.

After parsing, my program combines groups of adjacent characters that have the same context, and combines spaces with adjacent characters having compatible contexts. The rest of the algorithm works on these groups of characters. For simplicity, I will use the term "character" from now on to mean a group of characters sharing the same context.

The main algorithm used is a dynamic programming algorithm. Briefly, a dynamic programming algorithm breaks a problem into subproblems such that the solution to any subproblem can be calculated from the solutions to smaller subproblems. Instead of implementing this recursively, all possible subproblems are arranged in a table, and the table is filled in order. This has the advantage that each subproblem only needs to be solved once, where the recursive approach may solve each subproblem many times.

The subproblems I used were the following. For each range of decorated characters (described by the range length and starting character position), and for each possible context, determine the shortest piece of SML text that produces the specified sequence of decorated characters when enclosed in the specified context.

One of these subproblems can be solved using the solutions to smaller subproblems as follows. The context is optionally switched to a different context using the appropriate tags, and the range of characters is split into two smaller ranges of characters to be produced in the new context. As an example, consider the problem of producing a bold X followed by a bold italic Y, starting in the root context. There is only one way to split this range into two smaller ranges, but there is still a choice of what context to switch to. Here are the results of switching to various contexts (with spaces added for clarity):

<B>X</B> <B><I>Y</I></B><B>X <I>Y</I></B><B><I><PL><B>X</B></PL> Y</I></B><U><U><EM><TT><PL><B>X</B></PL> <PL><B><I>Y</I></B></PL></TT></EM></U></U>

In this case the shortest result occurs when switching to the bold context before solving the subproblems.

It is a simple matter to look up the shortest way of producing each smaller range of characters when enclosed in the new context. The algorithm loops over all posible contexts to switch to, and all possible places to split the range into two pieces. It adds up the number of characters needed to produce each smaller range, plus the number of characters needed to switch from the enclosing context to the new context. The solution to the subproblem consists of the context and split point requiring the fewest total characters.

The algorithm solves the subproblems one by one, placing the results in a table. The table is indexed by range length, starting character, and enclosing context. Each entry in the table stores the number of characters needed, the new context to switch to, and the split point:

table[rangeLength, startChar, enclosingContext] : { bytesNeeded, newContext, splitPoint }

For `rangeLength = 1`, the table is filled in a
straightforward way by calculating the number of characters
needed to switch from `enclosingContext` to
the context of the particular character in the input. (This
must take into consideration that some of these are
sequences of input characters all of which are spaces, in which
case it is only necessary to switch to a compatible context.)
Then the main algorithm is used to fill the rest of the table.

In pseudocode, here is the main algorithm:

for rangeLength <- 2 to number of characters in the input for range <- all ranges of length rangeLength for enclosingContext <- all possible contexts bestLength <- infinity bestNewContext <- undefined bestSplitPoint <- undefined for newContext <- all possible contexts for splitPoint <- all ways of splitting range range1, range2 <- range split by splitPoint currentLength <- table[range1.length, range1.start, newContext].bytesNeeded + table[range2.length, range2.start, newContext].bytesNeeded + number of characters to switch from enclosingContext to newContext if currentLength < bestLength bestLength <- currentLength bestNewContext <- newContext bestSplitPoint <- splitPoint table[rangeLength, range.start, enclosingContext] <- { bestLength, bestNewContext, bestSplitPoint }

It is simple to traverse the completed table to produce an output document using the already computed solutions to the top-level problem and all subproblems.

I was able to reduce the space requirements of the table
by eliminating the `newContext` and
`splitPoint` fields, and only storing the number
of bytes needed to solve each subproblem. Notice that the
main algorithm doesn't need those two fields at all. They
are only used when reading off the final answer from the
table. They can easily be recovered by repeating the
inner two loops of the main algorithm. That makes it
slower to produce the final answer, but the extra time
required is minor compared to the other speed problems of
the algorithm.

I also was able to reduce the size of the `bytesNeeded`
field from 4 bytes to 1 byte by noticing that for a given
`rangeLength` and `startChar`, the values
in the table can only differ by a small amount, because
one solution can be reduced to another by switching
contexts. I count 76 characters for the longest
context switch, which fits easily in 1 byte, so I store
a 4 byte base value and a 1 byte offset for each
`enclosingContext`.

I was able to reduce both the table size and the algorithm
running time in some circumstances by taking advantage
of the root color and root size. For any range of
characters containing a character that has the root color,
it is silly to compute and store table entries for values
of `enclosingContext` that are in any color other
than the root color, since it is impossible to switch
from a context containing a normal color to a context
containing the root color. Similarly for the root size.
The savings can be significant: The number of table
entries for a range is reduced by about an order of
magnitude if the range contains even a single character
having the root color, and another order of magnitude if it
contains a character having the root size. The algorithm
contains two nested loops over all possible contexts, so
the effects of this optimization on running time can
be even more dramatic -- up to 4 orders of magnitude.

After each pass of the outermost loop, the program takes a break to compute a possible solution based on the information that has been computed so far.

Basically, after the nth pass, we know the shortest way of producing each range of output characters of length at most n when enclosed in the root context. This information is used to compute the shortest possible output where all tags are closed at least every n output characters. This is done with another dynamic programming algorithm that is left as an exercise to the reader. (I'm getting tired.)

Whenever we run out of time, the most recent partial solution is printed, unless the input is shorter, in which case the input is printed.

In this analysis I will use n to mean the size of the input, specifically, the number of sequences of characters sharing a common context (or compatible contexts). Also, I will use c to mean the total number of possible contexts. Since input files can be up to 5 megabytes, it is possible for n to be up to 10^6 or so, and in the worst case c is approximately 10^4, though this can sometimes decrease to 10^3 or 10^2 due to the optimization described above.

Looking at the loops used by the main algorithm, it is easy to see that the running time for the algorithm is O(n^3 * c^2). I will ignore the time needed to produce partial solutions. It is worth noting however that to do a small constant number of iterations of the algorithm (call it k), the time is O(n * k^2 * c^2). So, if you are interested in finding the optimal solution in which tags can contain at most, say, 5 output characters, the algorithm becomes linear time.

Space requirements for the main table are O(n^2 * c). For the worst case input, that is about 10^16. Yow. Even to do the first pass of the algorithm on a worst case input, it will take 10^10 bytes, or 10 gigabytes. Oh well. For the largest possible file that doesn't use colors or sizes, the first pass will take 100 megabytes, which should fit in memory. As with time requirements, for a constant number of algorithm iterations the space requirements are linear.

I set out with the intention of coding an algorithm that finds an optimal solution for any input. I did the best I could with it in 3 days, tacking on a mechanism for calculating partial solutions, and adding support for time limits, which turned it into a plausible contest entry. I'm sure it doesn't have what it takes to be a winning entry, though something like it could possibly be part of a winning entry (maybe as the last stage that consumes any remaining time). I imagine a winning program would require a combination of strategies, including some local optimizations, intelligent use of heuristics, etc. I'm interested in seeing what other creative ideas people came up with that didn't even cross my mind.

Finally, I had great fun working on this program. Thanks to the organizers, who did an excellent job running the contest.

You can download the final submission I made for the contest (pkg2.tar.gz) or take a quick peek at the source code (icfp2001.cpp). I also made an earlier submission just in case (pkg1.tar.gz).

And finally, a fully optimized version of example.txt (I hope) at 13058 bytes.

Jeremy Sawicki (jeremys AT mit DOT edu)