Design Project 1 Comments

Prof. Jackson put together the following comments, based on the design reports that he graded. They are particularly useful as a guide to structuring your ideas in a technical document, and should help you develop your writing skills for Design Project 2.

General Comments on Structure and Writing

Standalone Report

No report should start in a way that makes it seem like a continuation of another report, making the reader wonder if the first few pages have been lost. Ideally, any report you write should stand alone in its own right; any background needed can be explained briefly. If that's really too burdensome, you can cite another piece of work. But what matters most is to make it clear at the outset how the report is connected to what came before.

Use Familiar Ideas

Whenever possible, use ideas you've been taught in the course. This has many advantages: they're likely to be good ones; it saves work for you (since you can cite the course text and skip a lengthy explanation); and it saves work for the reader. Just make sure to use standard terminology. If your design doesn't use a standard idea when your reader would expect, explain why not.

Tell Me What You're Going to Tell Me -- Carefully

There's an old adage about public speaking that says "tell them what you're going to tell them, then tell them, then tell them what you told them". It's a good idea, because it gives structure to your presentation, sets expectations, and lets you reinforce what you think matters most. But if you just give a generic outline at the start ("First, I'll explain what the problem is. Then I'll explain my solution..."), and repeat yourself verbatim at the end, it's of little use -- and will annoy the reader. Instead of describing structure explicitly, try and convey it by carefully chosen section headers, and by the narrative flow of the report.

Problem before Solution

Before you explain the solution to a problem, say what the problem is. The problem may seem obvious to you, but it's rarely obvious to the reader. Moreover, if you can break the problem down into its elements, you can then show how the choices you made in the solution addressed these elements individually. For example, many of your reports said the aim was to run the dataflow graph "accurately and efficiently", but very few explained what these terms meant. For example, if "efficiently" means minimizing time, what time is involved? The time between starting the computation and the last output being produced? The time until the first output is produced? Does it include the setup time (which would matter if only one execution was planned, but might be ignored if many were planned)? And does maintaining a smooth rate of output matter, or is it acceptable to have throughput that is good on average but bursty?

Observations before Proposals

Before explaining how you plan to solve the problem you've set out, it helps to explain some observations that inspired your solution. For example, you might note that if the problem is just to get the entire computation completed as quickly as possible this can be accomplished by keeping the processors as busy as possible. Or you might explain that waking a thread only to discover that its input queue is empty wastes a context switch, so it would be good to avoid this (eg, by scheduling threads based on availability of inputs).

Specification before Implementation

When you decompose a design into components, explain the specifications of the components before you explain their implementations. In some cases, the implementation will be so obvious (or standard) that you can just skip it or reference a known solution. For example, many designs involved identifying pipeline stages, and presented algorithms for organizing operators into these stages. A better way to explain this is just to define the depth of an operator (for example, as the length of the shortest path to an input) and then to say that all operators with the same depth belong to the same pipeline stage. Implementing this specification is then a standard graph algorithm problem that anyone can look up in a textbook.

Likewise, many of you presented table layouts for representing the dataflow graph. This is a low-level representation detail; in this project, the memory usage of the graph itself is not an issue, so you could just skip all that and talk about the structure as an abstract datatype. Also, try to avoid using low-level datatypes in pseudocode: it's clearer and simpler to use types that match what you're trying to say conceptually. So, in general, use lists, sets and maps, and don't use arrays or tables. And don't ever use nulls!

Examples and General Formulations

Many reports explained ideas by way of examples, but never explained the general case. A report presented the graph transformation with two examples, one dividing the graph into pipeline stages and the other dividing it into parallel pipelines, but it never mentioned how to reconcile these. Other reports explained the general case but gave no examples (or only a very abstract, skeletal one), and because the explanation was not clear and precise enough, it was hard to tell exactly what you had in mind. You should do both: explain your design in its full generality, succinctly and precisely, and then give examples that illustrate the key points.

Separate Data from Control

Even though the control flow of a program depends on the data it encounters, it is usually best to explain how the data is organized separately from how control flows. And the order in which things happen at runtime is rarely the best order for explaining a program.

Beware of New Ideas in Conclusion

A conclusion should highlight the key ideas of the report. Of course there's no point just summarizing the contributions -- you've probably already done that in an abstract or introduction. A conclusion should take advantage of what the reader has learned. It can point to avenues for further work, but be careful introducing new ideas. It's fine to mention a well-known idea ("It may be possible to exploit threading to improve fairness") or to hint at an idea without explaining it in full ("Perhaps a new communication primitive could make implementing queues simpler"). But don't introduce a new idea out of the blue the way you would at the start of the paper and leave it hanging ("Stateful operators could be replicated; this is trivially done by using quantum correlation to maintain an invariant between the two copies of the state").

Explain All Figures, Tables and Code

Don't expect the reader to understand the content of figures, tables and pseudocode without help. Give a brief explanation of what the element is for, how it is organized, what any labels or special terms mean, and what conclusions can be drawn from it.

Shouldn't Say Should, Mustn't Say Must

Be careful when using "should" and "must" when describing something technical. If you say "to avoid races, the queue must be locked", your reader may assume you mean that locking is necessary, when in fact you just meant that it was sufficient. You probably meant "to avoid races, the queue is locked". Worse, many reports used "must" and "should" without giving any reason, so it wasn't clear if these just meant "it would be nice to do X" or "I've been asked to do X" or "there's nothing else to do but X". In short, avoid these words, and whenever one consideration implies another, say so explicitly.

Avoid Such That, So That You Won't Look Geeky

Computer scientists tend to use "such that" as much as possible, perhaps because it sounds precise. But it's usually not grammatically correct: "such that" modifies a noun, not a verb. Whenever you're tempted to use it, try "so that" instead. Thus: "the operators are allocated to processors so that utilization is maximized" -- not "such that utilization is maximized".

Technical Points Specific to this Project

Parallel Speedup from Processors, Not Threads

Many of you seemed to think, mistakenly, that threads in themselves are a resource. The computational power of the machine comes from its processors, and your aim is generally to keep them busy. If you can keep N processors busy, you may approach the best speedup possible -- a factor of N. Threads can help you do that, so long as you run different threads on different processors. Splitting a task into multiple threads on a single processor can improve performance by making better use of that processor, but there will be no advantage over running a single thread that keeps it equally busy.

Master Coordinator Bad Idea

MapReduce inspired many designs involving a master coordinator. Of course there needs to be some way to set up the computation, so some kind of master is inevitable. But having a single processor coordinate the others throughout the computation is almost certainly a bad idea, because it will likely become a bottleneck. Some of you had all the individual processors refer to data on the master every time a message was sent, to find out what queue it should be written to. This will exact a heavy cost, because every such reference will be a remote memory access. (Caching might help, but we're ignoring that.) Some of you had the master actually schedule threads on the other processors. This isn't feasible; each processor should be in charge of its own threads, and any effort to coordinate them will slow things down and complicate the design. Remember that in MapReduce the coordinator is there only to provide fault tolerance; it plays no role once the computation has started, unless some node fails.

Replicating Stateful Operators is Hard

The problem with stateful operators isn't that replicating them would get messages out of order. It's that the outputs depend on the state, so you can't replicate the state at all. And if you replicated the computation on multiple processors, but kept one copy of the state, reading and writing it would likely be a bottleneck and the replication would bring no improvement.

Don't Forget Simple Data Parallelism

Many of you exploited the two ideas mentioned in the project outline: replicating stateless operators and pipelining. You forgot about simple data parallelism: just running two different stateful operators in parallel, on independent flows.

Single Input Queue Per Operator

Here's a neat (but wrong) idea to ensure that an operator's thread is woken only when enough input is available: use only a single input queue for the operator, even though there are multiple incoming edges, and form tuples with one slot for each incoming edge that get consumed only when all slots are full. Unfortunately this misunderstands the semantics of the dataflow graph: an operator doesn't have to process one message from each incoming edge in a step. The merge operator, for example, takes a message from any one edge in each step, producing an output from it alone.