Reading 2: Testing

Software in 6.102

Safe from bugs	Easy to understand	Ready for change
Correct today and correct in the unknown future.	Communicating clearly with future programmers, including future you.	Designed to accommodate change without rewriting.

Objectives

After today’s class, you should:

understand the value of testing, and know the process of test-first programming;
be able to judge a test suite for correctness, thoroughness, and size;
be able to design a test suite for a function by partitioning its input space and choosing good test cases;
be able to judge a test suite by measuring its code coverage; and
understand and know when to use black box vs. glass box testing, unit tests vs. integration tests, and automated regression testing.

Validation

Testing is an example of a more general process called validation. The purpose of validation is to uncover problems in a program and thereby increase your confidence in the program’s correctness. Validation includes:

Formal reasoning about a program, usually called verification. Verification constructs a formal proof that a program is correct. Verification is tedious to do by hand, and automated tool support for verification is still an active area of research. Nevertheless, small, crucial pieces of a program may be formally verified, such as the scheduler in an operating system, or the bytecode interpreter in a virtual machine, or the filesystem in an operating system.
Code review. Having somebody else carefully read your code, and reason informally about it, can be a good way to uncover bugs. It’s much like having somebody else proofread an essay you have written. We discuss code review in another reading.
Testing. Running the program on carefully selected inputs and checking the results.

Why software testing is hard

Here are some approaches that unfortunately don’t work well in the world of software.

Exhaustive testing is infeasible. The space of possible test cases is generally too big to cover exhaustively. Imagine exhaustively testing a 32-bit floating-point multiply operation, a*b. There are 2⁶⁴ test cases!

Haphazard testing (“just try it and see if it works”) is less likely to find bugs, unless the program is so buggy that an arbitrarily-chosen input is more likely to fail than to succeed. It also doesn’t increase our confidence in program correctness.

Random or statistical testing doesn’t work as well for finding bugs in software, as it does for finding flaws in other kinds of engineered artifacts. Other engineering disciplines can test small random samples (e.g. 1% of hard drives manufactured) and infer the defect rate for the whole production lot. Physical systems can use many tricks to speed up time, like opening a refrigerator 1000 times in 24 hours instead of 10 years. These tricks give known failure rates (e.g. mean lifetime of a hard drive), but they assume continuity or uniformity across the space of defects. This is true for physical artifacts.

But it’s not true for software. Software behavior varies discontinuously and discretely across the space of possible inputs. The system may seem to work fine across a broad range of inputs, and then abruptly fail at a single boundary point. The famous Pentium division bug affected approximately 1 in 9 billion divisions. Stack overflows, out of memory errors, and numeric overflow bugs tend to happen abruptly, and always in the same way, not with probabilistic variation. That’s different from physical systems, where there is often visible evidence that the system is approaching a failure point (cracks in a bridge), or failures are distributed probabilistically near the failure point (so that statistical testing will observe some failures even before the point is reached).

Instead, test cases should be chosen carefully and systematically. Techniques for systematic testing are the primary focus of this reading. Note that random testing is used as a tool in software engineering (particularly to increase confidence that common-case behavior will be reliable or efficient enough, not just for finding bugs), but it’s best to use it in combination with systematic testing.

reading exercises

Needle in a haystack

Here is a buggy function:

/**
 * @param bits an array of 32 true/false values
 * @returns the Boolean AND of all values in the array
 */
function andAll(bits: Array<boolean>): boolean {
  let result = bits[0];
  for (let i = 1; i < 31; i++) {
    result = result && bits[i];
  }
  return result;
}

Louis Reasoner, who wrote this function and thinks it should work, tries it on a couple of haphazardly-chosen test cases shown below. What is the result of each test case?

andAll([true, true, true, ..., true, true]) // 32 true values

andAll([false, true, false, true, ..., false, true]) // 32 values alternating between false and true

Louis is satisfied. But unfortunately his code has an off-by-one error. Which expression has the bug?

bits[0]

i = 1

i < 31

i++

(missing explanation)

Which is closest to the number of test cases required to test this function exhaustively?

(missing explanation)

Which is closest to the probability of finding the off-by-one bug with a random test (i.e. picking a single input to try, uniformly at random from the possible inputs)?

1 in 32

1 in 2³¹

1 in 2³²

1 in 2⁶⁴

(missing explanation)

A real crash

In the 1990s, the Ariane 5 launch vehicle, designed and built for the European Space Agency, self-destructed 37 seconds after its first launch.

The reason was a control software bug that went undetected. The Ariane 5’s guidance software was reused from the Ariane 4, which was a slower rocket. When the velocity calculation converted from a 64-bit floating point number (the same as a number in TypeScript, though this software wasn’t written in TypeScript) to a 16-bit signed integer, it overflowed the small integer and caused an exception to be thrown. The exception handler had been disabled for efficiency reasons, so the guidance software crashed. Without guidance, the rocket crashed too. The cost of the failure was $1 billion.

What ideas does this story demonstrate?

Even high-quality safety-critical software may still have residual bugs.

Testing all possible inputs is the best solution to this problem.

Software exhibits discontinuous behavior, unlike many physically-engineered systems.

Static type checking could have detected this bug.

(missing explanation)

Test-first programming

Before we dive in, we need to define some terms:

A module is a part of a software system that can be designed, implemented, tested, and reasoned about separately from the rest of the system. In this reading, we’ll focus on modules that are functions. In future readings, we’ll broaden our view to think about larger modules, like a class with multiple interacting methods.
A specification (or spec) describes the behavior of a module. For a function, the specification gives the types of the parameters and any additional constraints on them (e.g. sqrt’s parameter must be nonnegative). It also gives the type of the return value and how the return value relates to the inputs. In TypeScript code, the specification consists of the function signature and the comment above it that describes what it does.
A module has an implementation that provides its behavior, and clients that use the module. For a function, the implementation is the body of the function, and the clients are other code that calls the function. The specification of the module constrains both the client and the implementation. We’ll have much more to say about specifications, implementations, and clients a few classes from now.
A test case is a particular choice of inputs, along with the expected output behavior required by the specification.
A test suite is a set of test cases for a module.

You have already started using these concepts on Problem Set 0. You were given some specifications for TypeScript functions and asked to write an implementation for each one. You were also given a test suite for each function that you could run to see if your implementation obeyed the spec.

It turns out that this is a good pattern to follow when designing a program from scratch. In test-first programming, you write the spec and the tests before you even write any code. The development of a single function proceeds in this order:

Spec: Write a specification for the function.
Test: Write tests that exercise the specification.
Implement: Write the implementation.

Once your implementation passes the tests you wrote, you’re done.

The biggest benefit of test-first programming is safety from bugs. Don’t leave testing until the end of development, when you have a big pile of unvalidated code. Leaving testing until the end only makes debugging longer and more painful, because bugs may be anywhere in your code. It’s far more pleasant to test your code as you develop it.

reading exercises

Test-first programming

In which stage of test-first programming for a function would you…

…first write the function signature:

(missing explanation)

…write the function’s final return statement:

(missing explanation)

…write the first client of the module:

(missing explanation)

…first write a comment:

(missing explanation)

Systematic testing

Rather than exhaustive, haphazard, or randomized testing, we want to test systematically. Systematic testing means that we are choosing test cases in a principled way, with the goal of designing a test suite with three desirable properties:

Correct. A correct test suite is a legal client of the specification, and it accepts all legal implementations of the spec without complaint. This gives us the freedom to change how the module is implemented internally without necessarily having to change the test suite. We will discuss in more detail about what legal means with respect to a specification in an upcoming reading.
Thorough. A thorough test suite finds actual bugs in the implementation, caused by mistakes that programmers are likely to make.
Small. A small test suite, with few test cases, is faster to write in the first place, and easier to update if the specification evolves. Small test suites are also faster to run. You will be able to run your tests more frequently if your test suites are small and fast.

By these criteria, exhaustive testing is thorough but infeasibly large. Haphazard testing tends to be small but not thorough. Randomized testing can achieve thoroughness only at the cost of large size.

Designing a test suite for both thoroughness and small size requires having the right attitude. Normally when you’re coding, your goal is to make the program work. But as a test suite designer, you want to make it fail. That’s a subtle but important difference. A good tester intentionally pokes at all the places the program might be vulnerable, so that those vulnerabilities can be eliminated.

The need to adopt a testing attitude is another argument for test-first programming. It is all too tempting to treat code you’ve already written as a precious thing, a fragile eggshell, and test it very lightly just to see it work. For thorough testing, though, you have to be brutal. Test-first programming allows you to put on your testing hat, and adopt that brutal perspective, before you’ve even written any code.

Choosing test cases by partitioning

Creating a good test suite is a challenging and interesting design problem. We want to pick a set of test cases that is small enough to be easy to write and maintain and quick to run, yet thorough enough to find bugs in the program.

To do this, we divide the program’s input space into subdomains, each consisting of a set of inputs. (The name subdomain comes from the fact that it is a subset of the domain, the input space of a mathematical function.) Taken together, the subdomains form a partition. A partition is a collection of non-empty disjoint sets that completely covers the input space, so that every input lies in exactly one subdomain (no overlap between subdomains). Then, we choose one test case from each subdomain, and that’s our test suite.

The idea behind subdomains is to divide the input space into sets of similar inputs on which the program has similar behavior, so that we only need to test one representative for each set. This approach makes the best use of limited testing resources by choosing dissimilar test cases (for smallness), and forcing the testing to explore areas of the input space that haphazard or random testing might not reach (for thoroughness).

Note that for the purpose of partitioning, a program’s input space only includes legal inputs, for which the correct behavior is specified (and therefore testable). Inputs for which the program does not have defined behavior are not part of the legal input space and should not be included in the partition. This distinction is subtle and will make more sense with some examples and further discussion of specifications in a future reading.

Example: `abs()`

Let’s start with a simple example from the JavaScript library, the Math.abs() function:

/**
 * ...
 * @param a  the argument whose absolute value is to be determined
 * @returns the absolute value of the argument.
 */
function abs(a: number): number

Mathematically, this is a function of the following type:

abs : number → number

The function has a one-dimensional input space, consisting of all the possible values of a. Thinking about how the absolute value function behaves, we might start by partitioning the input space into these two subdomains: { a | a ≥ 0 } and { a | a < 0 }. On the first subdomain, abs should return a unchanged. On the second subdomain, abs should negate it.

To write a partition compactly, we omit the { a | … } frame of each set description, and simply write a list of predicates like so:

// partition: a >= 0; a < 0

To choose test cases for our test suite, we pick an arbitrary value a from each subdomain of the partition, for example:

a = 17 to cover the subdomain a ≥ 0
a = -3 to cover the subdomain a < 0

Example: `max()`

Now let’s look at another example from the JavaScript library, Math.max(). Although max() can take any number of arguments, we will focus just on the two-argument form for now:

/**
 * ...
 * @param a  an argument
 * @param b  another argument
 * @returns the larger of a and b.
 */
function max(a: number, b: number): number

Mathematically, this is a function of two arguments:

max : number × number → number

So we have a two-dimensional input space, consisting of all pairs of numbers (a,b). Now let’s partition it. From the specification, it makes sense to choose the subdomains { (a,b) | a < b } and { (a,b) | a > b }, because the spec calls for different behavior on each one: inputs in the former subdomain should return b, and those in the latter subdomain should return a. But we can’t stop there, because these subdomains are not yet a partition of the input space. A partition must completely cover the set of possible inputs. So we need to add { (a,b) | a = b }, where the function could return either a or b. Expressed compactly, the partition looks like this:

// partition: a < b; a > b; a = b

Our test suite might then be:

(a,b) = (1, 2) to cover a < b
(a,b) = (10, -8) to cover a > b
(a,b) = (9, 9) to cover a = b

To summarize, to form a partition, the subdomains should have three desirable properties:

subdomains should be disjoint, with no intersections between them;
subdomains should be complete, so that their union covers the legal input space;
subdomains should be nonempty, so that a legal test case can be chosen from each one.

reading exercises

Square root

Suppose you want to partition the input space of this square root function:

/**
 * @param x  must be nonnegative
 * @returns square root of x
 */
function sqrt(x: number): number

Assess the quality of each of the following candidate partitions. Are the proposed subdomains disjoint, complete, and nonempty, thus forming a mathematical partition of the input space? A good partition should check all three boxes.

// partition: x < 0; x >= 0

subdomains are disjoint

subdomains are complete

subdomains are nonempty

(missing explanation)

// partition: x is a perfect square; x is > 0 but not a perfect square

subdomains are disjoint

subdomains are complete

subdomains are nonempty

(missing explanation)

// partition: x=0, x=1, x=7, x=16

subdomains are disjoint

subdomains are complete

subdomains are nonempty

(missing explanation)

The greatest of all common divisors

Now consider this specification:

/**
 * @param x an integer
 * @param y an integer, where x and y are not both 0
 * @returns the greatest common divisor of x and y
 */
function gcd(x: number, y: number): number

Assess each of the following candidate partitions for gcd.

// partition: x and y are not both 0

subdomains are disjoint

subdomains are complete

subdomains are nonempty

(missing explanation)

// partition: x is divisible by y; y is divisible by x; x and y are relatively prime

subdomains are disjoint

subdomains are complete

subdomains are nonempty

(missing explanation)

Include boundaries in the partition

Bugs often occur at boundaries between subdomains. Some examples:

0 is a boundary between positive numbers and negative numbers
the maximum and minimum values of numeric types, like Number.MAX_SAFE_INTEGER or Number.MAX_VALUE
emptiness for collection types, like the empty string, empty array, or empty set
the first and last element of a sequence, like a string or array

Why do bugs often happen at boundaries? One reason is that programmers often make off-by-one mistakes, like writing <= instead of <, or initializing a counter to 0 instead of 1. Another is that some boundaries may need to be handled as special cases in the code. Yet another is that boundaries may be places of discontinuity in the code’s behavior. When a number variable used as an integer grows beyond Number.MAX_SAFE_INTEGER, for example, it suddenly starts to lose precision.

We incorporate boundaries as single-element subdomains in the partition, so that the test suite will necessarily include the boundary value as a test case. For abs, we would add a subdomain for the relevant boundary:

a = 0, because abs behaves differently on positive and negative numbers

Our original two subdomains would then shrink to exclude the boundary values: { a | a > 0 } and { a | a < 0 }.

This is still a partition of the input space of abs: the three subdomains are disjoint and completely cover the input space. One way to write it compactly looks like this:

// partition:
//   a < 0
//   a = 0
//   a > 0

Our test suite might then be:

a = 0
a = 17 to cover the subdomain a > 0
a = -3 to cover the subdomain a < 0

reading exercises

Boundaries

For this function:

/**
 * @param winsAndLosses  a string of length at most 5 consisting of 'W' or 'L' characters
 * @returns the fraction of characters in winsAndLosses that are 'W'
 */
function winLossRatio(winsAndLosses: string): number

Which of the following are appropriate boundary values for testing this function?

(missing explanation)

Example: `BigInt` multiplication

Let’s look at a slightly more complicated example. BigInt is a type built into TypeScript that can represent integers of any size, unlike the type number that only has a limited range. BigInt supports the usual integer operations like +, -, *, /, and **. For example, here’s how it might be used:

let a = BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1
let b = BigInt(4);
let ab = a * b; // should be 2^55 - 4

We can think of * as an (overloaded) function taking two inputs, each of type BigInt, and producing one output of type BigInt:

multiply : BigInt × BigInt → BigInt

We again have a two-dimensional input space, consisting of all the pairs of integers (a,b). Thinking about how sign rules work with multiplication, we might start with these subdomains:

a and b are both positive
a and b are both negative
a is positive, b is negative
a is negative, b is positive

There are also some boundary values for multiplication that we should check:

a or b is 0, because the result is always 0
a or b is 1, the identity value for multiplication

Finally, because the purpose of BigInt is to represent arbitrarily-large integers, we should make sure to try integers that are very big – at least bigger than Number.MAX_SAFE_INTEGER, which is 2⁵³-1, a 16-digit decimal integer.

a or b is small or large in magnitude (i.e., small enough to represent exactly in a number value, or too large for a number)

Let’s bring all these observations together into a single partition of the whole (a,b) space. We’ll choose a and b independently from:

0
1
small positive integer (≤ Number.MAX_SAFE_INTEGER and > 1)
small negative integer (≥ Number.MIN_SAFE_INTEGER and < 0)
large positive integer (> Number.MAX_SAFE_INTEGER)
large negative integer (< Number.MIN_SAFE_INTEGER)

So this would produce 6 × 6 = 36 subdomains that partition the space of pairs of integers.

To produce the test suite from this partition, we would pick an arbitrary pair (a,b) from each square of the grid, for example:

(a,b) = (0, 0) to cover (0, 0)
(a,b) = (0, 1) to cover (0, 1)
(a,b) = (0, 8392) to cover (0, small positive integer)
(a,b) = (0, -7) to cover (0, small negative integer)
…
(a,b) = (-10⁶⁰, -10⁸¹⁰) to cover (large negative, large negative)

The figure at the right shows how the two-dimensional (a,b) space is divided by this partition, and the points are test cases that we might choose to completely cover the partition.

Using multiple partitions

The examples so far used only one partition – one collection of disjoint subdomains – across the entire input space. For functions with multiple parameters, this can become costly. Each parameter may have interesting behavior variation and several boundary values, so forming a single partition of the input space from the Cartesian product of the behavior on each parameter leads to a combinatorial explosion in the size of the resulting test suite. We saw this already with BigInt multiplication, in which the Cartesian product partition already had 6 × 6 = 36 subdomains, requiring 36 test cases to cover. For a function with n parameters, the Cartesian-product approach produces a test suite of size exponential in n, which quickly becomes infeasible for manual test authoring.

partitioning multiply() with separate a and b partitions

An alternative approach is to treat the features of each input a and b as two separate partitions of the input space. One partition only considers the value of a:

(a,b) such that a = 0, 1, small positive, small negative, large positive, large negative

And the other partition only considers the value of b:

(a,b) such that b = 0, 1, small positive, small negative, large positive, large negative

These two partitions are illustrated at the right. Every input pair (a,b) belongs to exactly one subdomain from each partition.

We might write the two partitions compactly as follows:

// partition on a:
//   a = 0
//   a = 1
//   a is small positive integer > 1
//   a is small negative integer
//   a is large positive integer
//   a is large negative integer
//      (where "small" fits in a TypeScript number, and "large" doesn't)
// partition on b:
//   b = 0
//   b = 1
//   b is small positive integer > 1
//   b is small negative integer
//   b is large positive integer
//   b is large negative integer

We still want to cover every subdomain with a test case, but now a single test case can cover multiple subdomains from different partitions, making the test suite more efficient. We can cover both partitions completely with just 6 test cases, as shown on the right.

partitioning multiply() on the signs of a and b

Partitioning a and b independently raises the risk that you’re no longer testing the interaction between them. For example, sign handling in multiplication is a possible source of bugs, and the sign of the result depends on the signs of both a and b. But we can add an additional partition that captures this interaction:

// partition on signs of a and b:
//   a and b are both positive
//   a and b are both negative
//   a positive and b negative
//   a negative and b positive
//   one or both are 0

Now we have three partitions, with 6, 6, and 5 subdomains each, but we don’t need the Cartesian product of 6 × 6 × 5 test cases to cover them. A test suite with 6 carefully-chosen test cases can cover the subdomains of all three partitions.

We can continue to add partitions this way, as we think more about the spec and observe other behavioral variations that might lead to bugs. With careful test case selection, additional partitions should require few (if any) additional test cases.

Sometimes we may want to use the Cartesian product approach on multiple partitions, to produce a more thorough test suite. But even in those cases, the Cartesian product may be smaller than we expect. When subdomains from different partitions turn out to be mutually exclusive, the Cartesian product won’t include a subdomain for that particular combination of subdomains. We’ll see an example of that in one of the exercises below.

As a starting point for test-first programming, however, a small test suite that covers each subdomain of several thoughtfully-chosen partitions strikes a good balance between size and thoroughness. The test suite may then grow further with glass box testing, code coverage measurement, and regression testing, which we’ll see later in this reading.

reading exercises

One partition vs. multiple partitions

Consider this partition on a from above:

// partition on a:
//   a = 0
//   a = 1
//   a is small integer > 1
//   a is small integer < 0
//   a is large positive integer
//   a is large negative integer
//      (where "small" fits in a TypeScript number, and "large" doesn't)

This partition actually combines several distinct concerns: the sign of a, the magnitude of a (small or large), and the boundary values 0 and 1.

We can instead think about these concerns as independent partitions. From among the choices below, choose a subset that would be legal partitions and that together would capture the same concerns:

partition on a: 0, 1

partition on a: 0

partition on a: 1

partition on a: 0, positive, negative

partition on a: positive, negative

partition on a: 1, !=1

partition on a: small (fits in a number), large (doesn't fit in a number)

(missing explanation)

Covering every subdomain vs. covering Cartesian product

Consider again this partition on a from above:

// partition on a:
//   a = 0
//   a = 1
//   a is small integer > 1
//   a is small integer < 0
//   a is large positive integer
//   a is large negative integer
//      (where "small" fits in a number, and "large" doesn't)

This partition has 6 subdomains, so 6 different values of a can cover it, one chosen for each subdomain.

Suppose we used these three partitions of a instead:

// partition on a: 0, positive, negative
// partition on a: 1, !=1
// partition on a: small (fits in a number), large (doesn't fit in a number)

If we just want to cover every subdomain of the three partitions, how many different values of a would we need?

(missing explanation)

If we want to cover the Cartesian product of these three partitions, how many different values of a would we need?

(missing explanation)

Partitions expressed on outputs

It is sometimes convenient to think about and write an input space partition in terms of the output of the function, because variations in behavior may be more visible there. For example:

// partition on a*b: 0, positive, negative

is shorthand for the three-subdomain partition consisting of:

{ (a,b) | a*b = 0 } { (a,b) | a*b > 0 } { (a,b) | a*b < 0 }

Using this approach, how many test cases are needed to cover the following three partitions?

// partition on a: 0, positive, negative
// partition on b: 0, positive, negative
// partition on a*b: 0, positive, negative

(missing explanation)

How many are needed to cover the following two partitions?

// partition on value of abs(x): same as x, different from x
// partition on sign of abs(x): 0, positive

(missing explanation)

Partitioning

Consider the following specification:

/**
 * Reverses the end of a string.
 *
 *                          012345                     012345
 * For example: reverseEnd("Hello, world", 5) returns "Hellodlrow ,"
 *                               <----->                    <----->
 *
 * With start === 0, reverses the entire text.
 * With start === text.length, reverses nothing.
 *
 * @param text    string that will have its end reversed
 * @param start   the index at which the remainder of the input is reversed.
 *                Requires start to be an integer, 0 <= start <= text.length
 * @returns input text with the substring from `start` to the end of the string reversed
 */
function reverseEnd(text: string, start: number): string

Which of the following are reasonable partitions involving the start parameter?

start = 0; start = 5; start = 100

start < 0; start = 0; start > 0

start = 0; 0 < start < text.length; start = text.length (assuming nonempty text)

start < text.length; start = text.length; start > text.length (assuming nonempty text)

(missing explanation)

Which of the following are reasonable partitions involving the text parameter?

text contains some letters; text contains no letters, but some numbers; text contains neither letters nor numbers

text.length = 0; text.length > 0

text.length-start = 0; text.length-start is odd; text.length-start is even (> 0)

text is every possible string from length 0 to 100

text is "" or "Hello, world"

text should be the empty string in some test case

(missing explanation)

Automated unit testing

A well-tested program will have tests for every individual module that it contains. A test that tests an individual module, in isolation if possible, is called a unit test.

Nothing makes unit tests easier to run, and more likely to be run, than complete automation. Automated testing means running the tests and checking their results automatically. The code that runs tests on a module is a test driver (also known as a test harness or test runner). A test driver should not be an interactive program that prompts you for inputs and prints out results for you to manually check. Instead, a test driver should invoke the module itself on fixed test cases and automatically check that the results are correct. The result of the test driver should be either “all tests OK” or “these tests failed: …”

Most programming languages have at least one popular library for writing automated unit tests, including JUnit for Java and unittest for Python. For JavaScript and TypeScript, a popular choice is Mocha.

A Mocha unit test is written as a call to the function it(), with the name of the test as its first argument, and the body of the test (written as a function expression) as the second argument. The test body typically contains one or more calls to the module being tested, and then checks the results using assertion functions like assert, assert.strictEqual, or assert.throws. For example, a single unit test for max might look like this:

it("covers a < b", function() {
  assert.strictEqual(Math.max(1, 2), 2);
});

Note that the order of the parameters to assert.strictEqual is important. The first parameter should be the actual result – what the code actually does. The second parameter should be the expected result – usually a constant – that the test wants to see. If you switch them around, then Mocha may produce a confusing error message when the test fails. An assertion can also take an optional message string as the last argument, which you can use to make the test failure clearer.

To collect a set of unit tests into a test suite, put them inside a call to describe(). For example, the tests we chose for max above might look like this when implemented for Mocha:

describe("Math.max", function() {
  it("covers a < b", function() {
    assert.strictEqual(Math.max(1, 2), 2);
  });

  it("covers a = b", function() {
    assert.strictEqual(Math.max(9, 9), 9);
  });

  it("covers a > b", function() {
    assert.strictEqual(Math.max(10, -9), 10);
  });
});

A test suite can contain any number of it() functions, which are run independently when you run the test suite with Mocha. If an assertion fails in a test case, then that test case returns immediately, and Mocha records a failure for that test. Even if one test fails, the others in the suite will still be run.

Note that automated testing frameworks like Mocha make it easy to run the tests, but you still have to come up with good test cases yourself. Automatic test generation is a hard problem, still a subject of active computer science research.

reading exercises

Asserting equal

Fill in the blank to finish this Mocha test case for Math.abs():

describe("Math.abs", function() {
  it("covers a < 0", function() {
    const result: number = Math.abs(-3);(missing answer)
  });
});

(missing explanation)

Asserting true

Consider this Mocha test, which is intended to test a function pickRandomly() that picks a random number from its input set:

it("should return a number found in the set", function () {
  const set = new Set<number>([293, 384, 10, 5, -3, 99]);
  const result: number = pickRandomly(set);
  /* ASSERTION HERE */
});

Which assertion, put where the code says /* ASSERTION HERE */, would be both correct and useful for debugging?

assert.strictEqual(result, -3);

assert(set.has(result));

assert(set.has(result), "expected result from {" + Array.from(set) + "} but actually was " + result);

(missing explanation)

The comparison used by strictEqual works as expected for built-in types like string and number, but it does not help us with data structures. For example, all of the following assertions fail:

strictEqual([], [])
strictEqual([1], [1])
strictEqual(new Set([1]), new Set([1]))

As we’ll see in later readings, they fail for good reason: the question of whether these mutable arrays and sets are “equal” is quite a thorny one. So when we’re testing a function that returns this kind of structure, we have two options:

Use assert.deepStrictEqual, which performs an equality comparison that considers arrays, sets, maps, and object literals with identical contents to be equal. This is most useful when we know exactly what the correct structure is. But assert.deepStrictEqual has a dangerous caveat: only use it to compare structures of built-in arrays, sets, maps, and object literals — not anything else. We’ll see why in future classes.
Use simple assert and assert.strictEqual assertions on properties of the structure. This is most useful when, as in the previous exercise, there are multiple correct outputs.

reading exercises

Asserting deep

If we want to assert that value actual is a Set with the single element "hello", we could use:

assert.deepStrictEqual(actual, new Set(["hello"]));

… or we could use: (select a set of correct assertions that are necessary and sufficient)

assert(actual.has("hello"));

assert( ! actual.has("goodbye"));

assert.strictEqual(actual.size, 1);

assert.strictEqual(actual.values(), ['hello']);

(missing explanation)

Documenting your testing strategy

It’s a good idea to write down the testing strategy you used to create a test suite: the partitions, their subdomains, and which subdomains each test case was chosen to cover. Writing down the strategy makes the thoroughness of your test suite much more visible to the reader, and easier for you to spot missing partitions or subdomain coverage.

Document the partitions and subdomains in a comment at the top of the describe() function. For example, to document our strategy for testing max, we would write this:

describe("max", function() {
  /*
   * Testing strategy
   *
   * partition:
   *   a < b
   *   a > b
   *   a = b
   */

   it(...); // test cases here
   it(...);
   it(...);
   ...
});

Each test case should be named by the subdomain(s) it covers, e.g.:

it("covers a < b", function() {
  assert.strictEqual(Math.max(1, 2), 2);
});

Most test suites will have more than one partition, and most test cases will cover multiple subdomains. For example, here’s a strategy for BigInt multiplication, using seven partitions:

describe("multiplication", function() {
  /*
   * Testing strategy
   *
   * cover the cartesian product of these partitions:
   *   partition on a: positive, negative, 0
   *   partition on b: positive, negative, 0
   *   partition on a: 1, !=1
   *   partition on b: 1, !=1
   *   partition on a: small (fits in a TypeScript number), or large (doesn't fit)
   *   partition on b: small, large
   *
   * cover the subdomains of these partitions:
   *   partition on signs of a and b:
   *      both positive
   *      both negative
   *      different signs
   *      one or both are 0
   */

  it("covers a = 1, b != 1, a and b have same sign", function() {
    assert.strictEqual(BigInt(1) * BigInt(33), BigInt(33));
  });

  it("covers a is positive, b is negative, "
      + "a fits in a number, b fits in a number, "
      + "a and b have different signs", function() {
    assert.strictEqual(BigInt(73) * BigInt(-2), BigInt(-146));
  });

  ...
});

Black box and glass box testing

Recall from above that the specification is the description of the function’s behavior — the types of parameters, type of return value, and constraints and relationships between them.

Black box testing means choosing test cases only from the specification, not the implementation of the function. That’s what we’ve been doing in our examples so far. We partitioned and looked for boundaries in abs, max, and BigInt multiplication without looking at the actual code for these functions. In fact, if we were following the test-first programming approach, we had not even written the code for these functions yet.

Glass box testing means choosing test cases with knowledge of how the function is actually implemented. For example, if the implementation selects different algorithms depending on the input, then you should partition around the points where different algorithms are chosen. If the implementation keeps an internal cache that remembers the answers to previous inputs, then you should test repeated inputs to see if the cache is working properly.

For the case of BigInt multiplication, when we finally implemented it, we may have decided to represent the integer as an array of decimal digits. This decision introduces new boundaries in the function’s behavior. Multiplying two single-digit numbers is trivial, but arrays of two or more digits demand the greater complexity of multiple-digit multiplication. So glass box testing of this implementation might introduce a new partition based on the length of the internal array. (A more efficient representation for BigInt might be an array of 32-bit unsigned integers like Uint32Array, but the principle is the same — a sequence of digits of base 2³² rather than base 10.)

When doing glass box testing, you must take care that your test cases don’t require specific implementation behavior that isn’t specifically called for by the spec. For example, if the spec says “throws an exception if the input is poorly formatted,” then your test shouldn’t check specifically for a TypeError exception just because that’s what the current implementation does. The specification in this case allows any exception to be thrown, so your test case should likewise be general in order to be correct, and preserve the implementor’s freedom. We’ll have much more to say about this in the class on specs.

reading exercises

Black box and glass box testing

Consider the following function:

/**
 * Sort an array of numbers in nondecreasing order.  Modifies the array so that
 * values[i] <= values[i+1] for all 0 <= i < values.length-1
 */
function sort(values: Array<number>): void {
    // choose a good algorithm for the size of the array
    if (values.length < 10) {
        radixSort(values);
    } else if (values.length < 1_000_000_000) {
        quickSort(values);
    } else {
        mergeSort(values);
    }
}

Which of the following test cases are likely to be boundary values produced by glass box testing?

values = [] (the empty array)

values = [1, 2, 3]

values = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

values = [0, 0, 1, 0, 0, 0, 0]

(missing explanation)

Coverage

One way to judge a test suite is to ask how thoroughly it exercises the program. This notion is called coverage. Here are three common kinds of coverage:

Statement coverage: is every statement run by at least one test case?
Branch coverage: for every control branch in the program (such as an if or while or ?: ternary expression), is each side of the branch taken by at least one test case?
Path coverage: is every possible combination of branches — every path through the program — taken by at least one test case?

example output from Istanbul code coverage tool

Branch coverage is stronger (i.e. requires more tests to achieve) than statement coverage, and path coverage is stronger than branch coverage. In industry, 100% statement coverage is a common goal, but even that is rarely achieved due to unreachable defensive code (like “should never get here” assertions). 100% branch coverage is highly desirable, and safety-critical industry code has even more arduous criteria (e.g., MC/DC, modified condition/decision coverage). Unfortunately 100% path coverage is infeasible, requiring exponential-size test suites to achieve.

A standard approach to testing is to add tests until the test suite achieves adequate statement coverage: i.e., so that every reachable statement in the program is executed by at least one test case. In practice, statement coverage is usually measured by a code coverage tool, which counts the number of times each statement is run by your test suite. With such a tool, glass box testing is easy; you just measure the coverage of your black box tests, and add more test cases until all important statements are logged as executed.

A good code coverage tool for JavaScript and TypeScript is c8, shown on the right. In the tool’s HTML output, a line that has been executed by the test suite has a green mark in the left margin (which also shows the number of times the line was executed). A line not yet covered is colored red. A line containing a branch that has been executed in only one direction – always true but never false, or vice versa – has an I or E icon, indicating which path was never taken (I for “if” or E for “else”). If you saw the result on the right from your coverage tool, your next step would be to come up with a test case that causes the if test to be true at least once, and add it to your test suite so that the red line becomes green.

reading exercises

Using a coverage tool

This exercise asks you to explore the output of the Istanbul/nyc coverage tool.

Download this small program, ex02-coverage.zip, and unpack it.
In your terminal prompt, cd into the ex02-coverage folder you just extracted, and run:
```
npm install
```
This will install Istanbul/nyc and other necessary packages.
Open ex02-coverage/hailstone.ts in Visual Studio Code. You should see this code:
```
let n: number = 3;
while (n !== 1) {
   if (n % 2 === 0) {
       n = n / 2;
   } else {
       n = 3 * n + 1;
   }
}
```
By changing the initial value of n in the first line, you will be able to observe how the code coverage tool would highlight different lines of code differently.
Back in your terminal prompt, run the program using the code coverage tool:
```
npm run coverage
```
Open the coverage report in your web browser: you will find it in the file ex02-coverage/coverage/index.html. Then click on hailstone.ts in the report to see a line-by-line coverage report for that source file.

When n=3 initially, what color is the line n = n/2 in the coverage report?

(missing explanation)

When n=16 initially, what color is the line n = 3 * n + 1 in the coverage report?

(missing explanation)

What initial value of n would make the entire body of the loop red in the coverage report?

(missing explanation)

Unit and integration testing

We’ve so far been talking about unit tests that test a single module in isolation. Testing modules in isolation leads to much easier debugging. When a unit test for a module fails, you can be more confident that the bug is found in that module, rather than anywhere in the program.

In contrast to a unit test, an integration test tests a combination of modules, or even the entire program. If all you have are integration tests, then when a test fails, you have to hunt for the bug. It might be anywhere in the program. Integration tests are still important, because a program can fail at the connections between modules. For example, one module may be expecting different inputs than it’s actually getting from another module. But if you have a thorough set of unit tests that give you confidence in the correctness of individual modules, then you’ll have much less searching to do to find the bug.

Suppose you’re building a document search engine. Two of your modules might be load(), which loads a file, and extract(), which splits a document into its component words:

/**
 * @returns the contents of the file
 */
function load(filename: string): string { ... }

/**
 * @returns the words in string s, in the order they appear,
 *         where a word is a contiguous sequence of
 *         non-whitespace and non-punctuation characters
 */
function extract(s: string): Array<string> { ... }

These functions might be used by another module index() to make the search engine’s index:

/**
 * @returns an index mapping a word to the set of filenames
 *         containing that word, for all files in the input set
 */
function index(filenames: Set<string>): Map<string, Set<string>> {
    ...
    for (const file of files) {
        const doc = load(file);
        const words = extract(doc);
        ...
    }
    ...
}

In our test suite, we would want:

unit tests just for load that test it on various files
unit tests just for extract that test it on various strings
unit tests for index that test it on various sets of files

One mistake that programmers sometimes make is writing test cases for extract in such a way that the test cases depend on load to be correct. For example, a test case might use load to load a file, and then pass its result as input to extract. But this is not a unit test of extract. If the test case fails, then we don’t know if the failure is due to a bug in load or extract.

It’s better to think about and test extract in isolation. Using test partitions that involve realistic file content might be reasonable, because that’s how extract is actually used in the program. But don’t actually call load from the test case, because load may be buggy! Instead, store file content as a literal string, and pass it directly to extract. That way you’re writing an isolated unit test, and if it fails, you can be more confident that the bug is in the module it’s actually testing.

Note that the unit tests for index can’t easily be isolated in this way. When a test case calls index, it is testing the correctness of not only the code inside index, but also all the functions called by index. If the test fails, the bug might be in any of those functions. That’s why we want separate tests for load and extract, to increase our confidence in those modules individually and localize the problem to the index code that connects them together.

Isolating a higher-level module like index is possible if we write stub versions of the modules that it calls. For example, a stub for load wouldn’t access the filesystem at all, but instead would return mock file content no matter what File was passed to it. A stub for a class is often called a mock object. Stubs are an important technique when building large systems, but we will generally not use them in 6.102.

reading exercises

Pizza pizza

Suppose you are developing a new pizza recipe. Making pizza involves three “modules” (subrecipes) for:

making the dough for the crust
making the sauce
preparing the cheese and other toppings

You make the dough for the crust, bake it by itself, and see how crunchy and tasty it comes out. This is:

(missing explanation)

You decide to buy a premade spice mix from a specialty store. You make a sauce using the spices, then taste it. This is:

(missing explanation)

You put sauce and toppings on a crust and bake it, to see whether the crust still cooks well with the moist stuff on top of it. This is:

(missing explanation)

Automated regression testing

Once you have test automation, it’s very important to rerun your tests when you modify your code. Software engineers know from painful experience that any change to a large or complex program is dangerous. Whether you’re fixing another bug, adding a new feature, or optimizing the code to make it faster, an automated test suite that preserves a baseline of correct behavior – even if it’s only a few tests – will save your bacon. Running the tests frequently while you’re changing the code prevents your program from regressing — introducing other bugs when you fix new bugs or add new features. Running all your tests after every change is called regression testing.

Whenever you find and fix a bug, take the input that elicited the bug and add it to your automated test suite as a test case. This kind of test case is called a regression test. This helps to populate your test suite with good test cases. Remember that a test is good if it elicits a bug — and every regression test did in one version of your code! Saving regression tests also protects against reversions that reintroduce the bug. The bug may be an easy error to make, since it happened once already.

This idea also leads to test-first debugging. When a bug arises, immediately write a test case for it that elicits it, and immediately add it to your test suite, with a name that identifies it as coming from a bug report rather than partitioning, such as it("covers bug #1079"). Once you find and fix the bug, all your test cases will be passing, and you’ll be done with debugging and have a regression test for that bug.

In practice, these two ideas, automated testing and regression testing, are almost always used in combination. Regression testing is only practical if the tests can be run often, automatically. Conversely, if you already have automated testing in place for your project, then you might as well use it to prevent regressions. So automated regression testing is a best-practice of modern software engineering.

reading exercises

Regression testing

Which of the following best defines regression testing?

You should run a suite of tests whenever you change the code.

Every component in your code should have an associated set of tests that exercises all the corner cases in its specification.

Tests should be written before you write the code as a way of checking your understanding of the specification.

When a new test exposes a bug, you should run it on all previous versions of the code until you find the version where the bug was introduced.

(missing explanation)

Running automated tests

Which of the following are good times to rerun all your tests?

Before doing git add/commit/push

After rewriting a function to make it faster

When using a code coverage tool

After you think you fixed a bug

(missing explanation)

Iterative test-first programming

Let’s revisit the test-first programming idea that we introduced at the start of this reading, and refine it. Effective software engineering does not follow a linear process. Practice iterative test-first programming, in which you are prepared to go back and revise your work in previous steps:

Write a specification for the function.
Write tests that exercise the spec. As you find problems, iterate on the spec and the tests.
Write an implementation. As you find problems, iterate on the spec, the tests, and the implementation.

Each step helps to validate the previous steps. Writing tests is a good way to understand the specification. The specification can be incorrect, incomplete, ambiguous, or missing corner cases. Trying to write tests can uncover these problems early, before you’ve wasted time working on an implementation of a buggy spec. Similarly, writing the implementation may help you discover missing or incorrect tests, or prompt you to revisit and revise the specification.

Since it may be necessary to iterate on previous steps, it doesn’t make sense to devote enormous amounts of time making one step perfect before moving on to the next step. Plan for iteration:

For a large specification, start by writing only one part of the spec, proceed to test and implement that part, then iterate with a more complete spec.
For a complex test suite, start by choosing a few important partitions, and create a small test suite for them. Proceed with a simple implementation that passes those tests, and then iterate on the test suite with more partitions.
For a tricky implementation, first write a simple brute-force implementation that tests your spec and validates your test suite. Then move on to the harder implementation with confidence that your spec is good and your tests are correct.

Iteration is a feature of every modern software engineering process (such as Agile and Scrum), with good empirical support for its effectiveness. Iteration requires a different mindset than you may be used to as a student solving homework and exam problems. Instead of trying to solve a problem perfectly from start to finish, iteration means reaching a rough solution as soon as possible, and then steadily refining and improving it, so that you have time to discard and rework if necessary. Iteration makes the best use of your time when a problem is difficult and the solution space is unknown.

Summary

In this reading, we saw these ideas:

Test-first programming. Write tests before you write code.
Systematic testing with partitioning and boundary values, to design a test suite that is correct, thorough, and small.
Glass box testing and statement coverage for filling out a test suite.
Unit-testing each module, in isolation as much as possible.
Automated regression testing to keep bugs from coming back.
Iterative development. Plan to redo some work.

The topics of today’s reading connect to our three key properties of good software as follows:

Safe from bugs. Testing is about finding bugs in your code, and test-first programming is about finding them as early as possible, right after you introduce them.
Easy to understand. Systematic testing with a documented testing strategy makes it easier to understand how test cases were chosen and how thorough a test suite is.
Ready for change. Correct test suites only depend on behavior in the spec, which allows the implementation to change within the confines of the spec. We also talked about automated regression testing, which helps keep bugs from coming back when changes are made to code.

Asking questions about the reading

Have a question about the reading, or about a reading exercise? Post your question on Piazza, and include a link that jumps directly to the part of the reading or the reading exercise that you’re asking about. You can get that link by moving your mouse cursor into the left margin until an octothorpe # link appears next to what you want to refer to. Click on that #, and your browser’s address bar now has the link you want.

reading exercises

Every question is a good question

Find the paragraph in this reading that defines glass box testing, get the link that jumps right to that paragraph, and paste the link here:

(missing explanation)

Jumping in

Find the link that jumps right to this exercise, and paste the link here:

Reading 2: Testing

Software in 6.102

Objectives

Validation

Why software testing is hard

reading exercises

Test-first programming

reading exercises

Systematic testing

reading exercises

Choosing test cases by partitioning

Example: abs()

Example: max()

reading exercises

Include boundaries in the partition

reading exercises

Example: BigInt multiplication

Using multiple partitions

reading exercises

Automated unit testing

reading exercises

reading exercises

Documenting your testing strategy

reading exercises

Black box and glass box testing

reading exercises

Coverage

reading exercises

Unit and integration testing

reading exercises

Automated regression testing

reading exercises

Iterative test-first programming

reading exercises

Summary

Asking questions about the reading

reading exercises

Example: `abs()`

Example: `max()`

Example: `BigInt` multiplication