6.031 — Software Construction
Fall 2017

Reading 26: Map, Filter, Reduce

Software in 6.031

  • Safe from bugs. Correct today and correct in the unknown future.
  • Easy to understand. Communicating clearly with future programmers, including future you.
  • Ready for change. Designed to accommodate change without rewriting.

Objectives

In this reading you’ll learn a design pattern for implementing functions that operate on sequences of elements, and you’ll see how treating functions themselves as first-class values that we can pass around and manipulate in our programs is an especially powerful idea.

  • Map/filter/reduce
  • Functional objects
  • Higher-order functions

Introduction: an example

Suppose we’re given the following problem: write a method that finds the words in the Java files in your project.

Following good practice, we break it down into several simpler steps and write a method for each one:

  • find all the files in the project, by scanning recursively from the project’s root folder
  • restrict them to files with a particular suffix, in this case .java
  • open each file and read it in line-by-line
  • break each line into words

Writing the individual methods for these substeps, we’ll find ourselves writing a lot of low-level iteration code. For example, here’s what the recursive traversal of the project folder might look like:

/**
 * Find all the files in the filesystem subtree rooted at folder.
 * @param folder root of subtree, requires folder.isDirectory() == true
 * @return list of all ordinary files (not folders) that have folder as
 *         their ancestor
 */
public static List<File> allFilesIn(File folder) {
    List<File> files = new ArrayList<>();
    for (File f : folder.listFiles()) {
        if (f.isDirectory()) {
            files.addAll(allFilesIn(f));
        } else if (f.isFile()) {
            files.add(f);
        }
    }
    return files;
}

And here’s what the filtering method might look like, which restricts that file list down to just the Java files. Imagine calling this like onlyFilesWithSuffix(files, ".java"):

/**
 * Filter a list of files to those that end with suffix.
 * @param files list of files
 * @param suffix string to test
 * @return a new list consisting of only those files whose names end with suffix
 */
public static List<File> onlyFilesWithSuffix(List<File> files, String suffix) {
    List<File> result = new ArrayList<>();
    for (File f : files) {
        if (f.getName().endsWith(suffix)) {
            result.add(f);
        }
    }
    return result;
}

→ full Java code for the example

In this reading we discuss map/filter/reduce, a design pattern that substantially simplifies the implementation of functions that operate over sequences of elements. In this example, we’ll have lots of sequences — lists of files; input streams that are sequences of lines; lines that are sequences of words; frequency tables that are sequences of (word, count) pairs. Map/filter/reduce will enable us to operate on those sequences with no explicit control flow — not a single for loop or if statement.

Along the way, we’ll also see another example of an important Big Idea: functions as first-class data values, meaning that they can be stored in variables, passed as arguments to functions, and created dynamically like other values.

Abstracting out control flow

We’ve already seen one design pattern that abstracts away from the details of iterating over a data structure: Iterator.

Iterator abstraction

Iterator gives you a sequence of elements from a data structure, without you having to worry about whether the data structure is a set or a hash table or a list or an array — the Iterator looks the same no matter what the data structure is.

For example, given a List<File> files, we can iterate using indices:

for (int ii = 0; ii < files.size(); ii++) {
    File f = files.get(ii);
    // ...

But this code depends on the size and get methods of List, which might be different in another data structure. Using an iterator abstracts away the details:

Iterator<File> iter = files.iterator();
while (iter.hasNext()) {
    File f = iter.next();
    // ...

Now the loop will be identical for any type that provides an Iterator. There is, in fact, an interface for such types: Iterable. Any Iterable can be used with Java’s enhanced for statement, for (File f : files), and under the hood, it uses an iterator.
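As a concrete sketch (the Library class and its list of titles are hypothetical, not part of the reading’s example), here is how a class of our own becomes usable with the enhanced for statement just by implementing Iterable:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical example: implementing Iterable is all it takes for a class
// to work with Java's enhanced for statement.
public class Library implements Iterable<String> {
    private final List<String> titles;

    public Library(List<String> titles) {
        this.titles = titles;
    }

    // The one method Iterable requires: produce an Iterator over the elements.
    public Iterator<String> iterator() {
        return titles.iterator();
    }

    public static void main(String[] args) {
        Library library = new Library(Arrays.asList("6.031 notes", "6.006 notes"));
        for (String title : library) {  // enhanced for calls iterator() under the hood
            System.out.println(title);
        }
    }
}
```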

Map/filter/reduce abstraction

The map/filter/reduce patterns in this reading do something similar to Iterator, but at an even higher level: they treat the entire sequence of elements as a unit, so that the programmer doesn’t have to name and work with the elements individually. In this paradigm, the control statements disappear: specifically, the for statements, the if statements, and the return statements in the code from our introductory example will be gone. We’ll also be able to get rid of most of the temporary names (i.e., the local variables files, f, and result).

Streams

Java has an abstract datatype Stream<E> that represents a sequence of elements of type E.

Collection types like List and Set provide a stream() operation that returns a Stream for the collection, and there’s an Arrays.stream function for creating a Stream from an array.

Here are a few examples of streams created in different ways:

List<Integer> intList = Arrays.asList(1, 4, 9, 16);
Stream<Integer> intStream = intList.stream();

String[] stringArray = new String[] {"a", "b", "c"};
Stream<String> stringStream = Arrays.stream(stringArray);

We’ll have three operations for streams: map, filter, and reduce. Let’s look at each one in turn, and then look at how they work together.

Map

Map applies a unary function to each element in the stream and returns a new stream containing the results, in the same order:

map : Stream<‍E> × (E → F) → Stream<‍F>

For example:

List<Integer> list = Arrays.asList(1, 4, 9, 16);
Stream<Integer> s = list.stream();
Stream<Double> t = s.map(x -> Math.sqrt(x));

This code starts with a stream s containing the integers 1, 4, 9, 16, and applies the square root function to each element to produce the double values 1.0, 2.0, 3.0, 4.0. We can write the expression more compactly as:

Arrays.asList(1, 4, 9, 16).stream()
    .map(x -> Math.sqrt(x))

which is the syntax that future examples will use. This kind of function-calling syntax – where the return value of a method is immediately used to call another method – is called method call chaining. The result of the asList() method is chained into the stream() method, and then the result of that is chained into the map method. Chaining is a very common pattern in map/filter/reduce code.

Another example of a map:

Arrays.asList("A", "b", "C").stream()
   .map(s -> s.toLowerCase())

which produces a stream containing "a", "b", "c".

This operation captures a common pattern for operating over sequences: doing the same thing to each element of the sequence.
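One practical aside (not part of the original example): a Stream is lazy and doesn’t show its elements directly, so a common way to inspect the result of a map is to collect the stream into a list. A sketch, using a hypothetical MapDemo class:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapDemo {
    // Collects the mapped stream into a list so the results can be inspected.
    public static List<String> lowercased() {
        return Arrays.asList("A", "b", "C").stream()
                .map(s -> s.toLowerCase())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(lowercased()); // prints [a, b, c]
    }
}
```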

Method references

A lambda expression like x -> Math.sqrt(x) has an unnecessary level of indirection – it takes an argument, calls sqrt on that argument, and directly returns the result of sqrt. So calling the lambda expression is effectively the same as calling sqrt directly.

Java lets us eliminate this level of indirection by referring to the sqrt method directly:

Arrays.asList(1, 4, 9, 16).stream()
    .map(Math::sqrt)

This syntax, Math::sqrt, is called a method reference. Note the :: between the class name and method name, rather than the usual ., which distinguishes the method reference from an ordinary method call or field lookup. The Java Tutorials have the details about the several kinds of method references that Java supports, including not only static methods like Math::sqrt but also instance methods like String::toLowerCase.

But let’s pause here for a second, because we’re doing something unusual with functions. The map method takes a reference to a function as its argument, not the result of calling that function. When we wrote map(Math::sqrt), we didn’t call sqrt, the way Math.sqrt(25) is a call. Instead we referred to the function itself by name. Math::sqrt is a reference to an object representing the sqrt function. The type of that object is Function<T,R>, which represents unary functions from T to R. The primary operation of a Function is apply(), which calls the function with an argument. So (Math::sqrt).apply(25.0) returns 5.0, just like Math.sqrt(25.0) would.

But you can also assign that function object to another variable if you like, and it still behaves like sqrt:

Function<Double,Double> mySquareRoot = Math::sqrt;
mySquareRoot.apply(16.0); // returns 4.0

You can also pass a reference to the function object as a parameter to another function, and that’s what we’re doing here with map. You can use function objects the same way you would use any other value in Java, like ints or string references or other object references. This is how functions are first-class in Java.
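To see how function objects flow as ordinary values, here is a small sketch (the ApplyTwice class and its twice method are hypothetical, not from the reading) of a method that, like map, takes a Function as a parameter:

```java
import java.util.function.Function;

public class ApplyTwice {
    // A higher-order method: it receives a function object and calls it twice.
    public static double twice(Function<Double, Double> f, double x) {
        return f.apply(f.apply(x));
    }

    public static void main(String[] args) {
        // Pass Math::sqrt as a value; twice computes sqrt(sqrt(16.0)) = 2.0.
        System.out.println(twice(Math::sqrt, 16.0)); // prints 2.0
    }
}
```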

More ways to use map

Map is useful even if you don’t care about the return value of the function. When you have a sequence of mutable objects, for example, you may want to map a mutator operation over them. Because mutator operations typically return void, however, Java requires using forEach instead of map. forEach applies the function to each element of the stream, but does not collect their return values into a new stream. For example, interrupting every thread in a stream:

Stream<Thread> threads = ...;
threads.forEach(Thread::interrupt);

Note that mutators that declare checked exceptions, like Thread::join or Socket::close, can’t be passed to forEach directly; they must be wrapped in a lambda expression that converts the checked exception into an unchecked one, as we’ll see later in this reading.

reading exercises

map 1

What sequence is the result of

Arrays.asList("1", "2", "3").stream()
    .map(String::length);

map 2

What sequence is the result of

Arrays.asList(1, 2, 3).stream()
    .map(String::length);

Filter

Our next important sequence operation is filter, which tests each element with a unary predicate, of type Predicate<T>, representing functions from T to boolean.

Elements that satisfy the predicate are kept; those that don’t are removed. A new sequence is returned; filter doesn’t modify its input sequence.

filter : Stream<‍E> × (E → boolean) → Stream<‍E>

Examples:

Arrays.asList('x', 'y', '2', '3', 'a').stream()
   .filter(Character::isLetter)
// returns the stream ['x', 'y', 'a']

Arrays.asList(1, 2, 3, 4).stream()
   .filter(x -> x%2 == 1)
// returns the stream [1, 3]

Arrays.asList("abc", "", "d").stream()
   .filter(s -> !s.isEmpty())
// returns the stream ["abc", "d"]
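Filter composes naturally with map in a single chain. A sketch (the FilterMapDemo class is hypothetical) that keeps only the odd numbers and then squares each survivor:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterMapDemo {
    // Keep the odd numbers, then square each one, collecting the results.
    public static List<Integer> oddSquares() {
        return Arrays.asList(1, 2, 3, 4, 5).stream()
                .filter(x -> x % 2 == 1)
                .map(x -> x * x)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(oddSquares()); // prints [1, 9, 25]
    }
}
```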

reading exercises

filter 1

Given:

String s1 = "Abel";
String s2 = "Baker";
String s3 = "Charlie";

What sequence is the result of

Arrays.asList(s1, s2, s3).stream()
    .filter(s -> s.startsWith("A"));

filter 2

Again given:

String s1 = "Abel";
String s2 = "Baker";
String s3 = "Charlie";

What sequence is the result of

Arrays.asList(s1, s2, s3).stream()
    .filter(s -> s.startsWith("Z"));

filter 3

What is the result of

Arrays.asList('a', '1', 'b', '2').stream()
    .filter(Character::isDigit);

filter 4

What is the result of

Arrays.asList('a', '1', 'b', '2').stream()
    .filter( ! Character::isDigit);

Reduce

Our final operator, reduce, combines the elements of the sequence together using a binary function. In addition to the function, it also takes an initial value that initializes the reduction, and that ends up being the return value if the sequence is empty:

reduce : Stream<‍E> × E × (E × E → E) → E

stream.reduce(init, f) combines the elements of the stream together. Here is one way that it might compute the result:

result_0 = init
result_1 = f(result_0, stream[0])
result_2 = f(result_1, stream[1])
...
result_n = f(result_{n-1}, stream[n-1])

result_n is the final result for an n-element sequence.

Adding numbers is probably the most straightforward example:

Arrays.asList(1,2,3).stream()
    .reduce(0, (x,y) -> x+y)
// computes (((0+1)+2)+3) to produce the integer 6
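As an aside, the lambda (x,y) -> x+y can also be written as a method reference to the standard library’s Integer.sum; a minimal sketch (the SumDemo class is hypothetical):

```java
import java.util.Arrays;

public class SumDemo {
    // Integer::sum is equivalent to the lambda (x,y) -> x+y.
    public static int sum() {
        return Arrays.asList(1, 2, 3).stream()
                .reduce(0, Integer::sum);
    }
}
```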

Initial value

There are three design choices in the reduce operation. First is whether to require an initial value. In Java, the initial value can be omitted, in which case reduce uses the first element of the stream as the initial value of the reduction. But if the stream is empty, then reduce has no value to return, so this version of the reduce operation has return type Optional<E>.

This makes it easier to use reducers like max, which have no well-defined initial value:

Arrays.asList(5, 8, 3, 1).stream()
    .reduce(Math::max)
// computes max(max(max(5,8),3),1) and returns an Optional<Integer> value containing 8
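Because this form of reduce returns an Optional<E>, the caller has to decide what an empty stream should yield; one sketch (the MaxDemo class is hypothetical) using Optional’s orElse:

```java
import java.util.List;

public class MaxDemo {
    // Reduce with no initial value returns an Optional; orElse supplies a
    // fallback value for the empty-stream case.
    public static int maxOrZero(List<Integer> list) {
        return list.stream()
                .reduce(Math::max)
                .orElse(0);
    }
}
```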

Order of operations

The second design choice is the order in which the elements are accumulated. Java’s reduce operation requires that the operator be associative, like + and max. For associative operators, the order of combination makes no difference, and this gives the Java implementation a great deal of flexibility, including the ability to automatically parallelize the computation, which will be discussed later.

But other languages allow non-associative operators to be used in reductions, and then the order of combination matters. Python’s reduce, called fold-left in other programming languages, combines the sequence starting from the left (the first element):

result_0 = init
result_1 = f(result_0, stream[0])
result_2 = f(result_1, stream[1])
...
result_n = f(result_{n-1}, stream[n-1])

Fold-right goes in the other direction:

result_0 = init
result_1 = f(stream[n-1], result_0)
result_2 = f(stream[n-2], result_1)
...
result_n = f(stream[0], result_{n-1})

to produce result_n as the final result.

Here are the two ways to reduce, from the left or from the right, applied to the same example. If the operator is non-associative (like subtraction, shown here), then the two directions can produce different answers:

fold-left([1, 2, 3], 0, -) = -6
fold-right([1, 2, 3], 0, -) = 2

In Java, because the operator is required to be associative, the implementation of reduce is free to choose any order of combination, including partial combinations from within the sequence. So the reduction

Arrays.asList(1,2,3).stream()
   .reduce(0, (x,y) -> x+y)

can be computed in any of these orders:

((0+1)+2)+3 = 6
(0+1)+(2+3) = 6
0+((1+2)+3) = 6
0+(1+(2+3)) = 6
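One consequence worth noting: because + is associative, a sequential and a parallel reduction are guaranteed to agree. A small sketch (the AssociativeDemo class is hypothetical) checking exactly that:

```java
import java.util.Arrays;
import java.util.List;

public class AssociativeDemo {
    // With an associative operator, sequential and parallel reductions are
    // guaranteed to produce the same result, whatever order Java chooses.
    public static boolean sameAnswer() {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5);
        int sequential = list.stream().reduce(0, (x, y) -> x + y);
        int parallel = list.parallelStream().reduce(0, (x, y) -> x + y);
        return sequential == parallel;
    }
}
```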

Reduction to another type

The third design choice is the return type of the reduce operation. It doesn’t necessarily have to match the type of the sequence elements. For example, we can use reduce to concatenate a sequence of integers (type E) into a string (type F). But because Java leaves the order of combination unspecified, this requires two binary operators with slightly different type signatures:

  • an accumulator ⊙ : F × E → F that adds an element from the sequence (of type E) into the growing result (of type F)

  • a combiner ⊗ : F × F → F that combines two partial results, each accumulated from part of the sequence, into a single result

So the most general form of reduce in Java looks like this:

reduce : Stream<‍E> × F × (F × E → F) × (F × F → F) → F

Here’s a simple example that concatenates a sequence of integers into a string:

Arrays.asList(1,2,3).stream()
   .reduce(
       "",                             // identity value
       (String s, Integer n) -> s+n,   // accumulator
       (String s, String t) -> s+t     // combiner
   )
// returns "123"

We’ve included parameter type declarations in the lambda expressions above in order to clarify the difference between the accumulator and the combiner. The combiner ⊗ must be associative, and the accumulator ⊙ must be consistent with it, in the sense that any order of applying accumulators and combiners to the initial value and the elements of the sequence must produce the same result:

(("" ⊙ 1) ⊙ 2) ⊙ 3 = "123"
("" ⊙ 1) ⊗ (("" ⊙ 2) ⊙ 3) = "123"
("" ⊙ 1) ⊗ (("" ⊙ 2) ⊗ ("" ⊙ 3)) = "123"
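As an aside, repeated string concatenation in a reduce recopies the growing string on every step; for this particular job Java also provides collectors such as Collectors.joining, which accumulate into a StringBuilder internally. A sketch (the JoinDemo class is hypothetical) producing the same "123":

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class JoinDemo {
    // Same result as the three-argument reduce: convert each integer to a
    // string, then join the strings together.
    public static String digits() {
        return Arrays.asList(1, 2, 3).stream()
                .map(String::valueOf)
                .collect(Collectors.joining());
    }
}
```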

reading exercises

reduce 1
Arrays.asList(e1, e2, e3).stream()
    .reduce(true, 
            (boolean x, String y) -> x && y.equals("true"),
            (boolean x, boolean y) -> x && y);

In this reduction, what should be the type of the elements e1,e2,e3?

Assuming e1,e2,e3 have that type, which is the best description of the behavior of this reduction?

reduce 2

What is the result of:

Arrays.asList(1, 2, 3).stream()
    .reduce(0, (a,b) -> a*b)

What is the result of:

Arrays.asList("oscar", "papa", "tango").stream()
    .reduce((a,b) -> a.length() > b.length() ? a : b)

Back to the intro example

Going back to the example we started with, where we want to find all the words in the Java files in our project, let’s try creating a useful abstraction for filtering files by suffix:

static Predicate<File> endsWith(String suffix) {
    return f -> f.getPath().endsWith(suffix);
}

endsWith returns functions that are useful as filters. It takes a filename suffix like .java and dynamically generates a function that we can use with filter to test for that suffix. Given a Stream<File> files, we can now write, e.g., files.filter(endsWith(".java")) to obtain a new filtered stream.

endsWith is a different kind of beast than our usual functions. It’s a higher-order function, meaning that it’s a function that takes another function as an argument, or returns another function as its result. Higher-order functions are operations on the datatype of functions; in this case, endsWith is a creator of functions.
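Because endsWith returns an ordinary Predicate object, we can also call the generated function directly, outside any stream. A sketch (the SuffixDemo class is hypothetical; endsWith is the method defined above):

```java
import java.io.File;
import java.util.function.Predicate;

public class SuffixDemo {
    // Higher-order function: dynamically generates a predicate that tests
    // whether a file's path ends with the given suffix.
    static Predicate<File> endsWith(String suffix) {
        return f -> f.getPath().endsWith(suffix);
    }

    public static void main(String[] args) {
        Predicate<File> isJava = endsWith(".java");
        System.out.println(isJava.test(new File("Main.java"))); // prints true
        System.out.println(isJava.test(new File("notes.txt"))); // prints false
    }
}
```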

Now let’s use map and filter to recursively traverse the folder tree:

static Stream<File> allFilesIn(File folder) {
    File[] children = folder.listFiles();
    Stream<File> descendants = Arrays.stream(children)
                                     .filter(File::isDirectory)
                                     .flatMap(Words2::allFilesIn);
    return Stream.concat(descendants,
                         Arrays.stream(children).filter(File::isFile));
}

The first line gets all the children of the folder, which might look like this:

["src/client", "src/server", "src/Main.java", ...]

The second line is the key bit: it filters the children for just the subfolders using the isDirectory method, and then recursively maps allFilesIn against this list of subfolders! The result might look like this:

[["src/client/MyClient.java", ...], ["src/server/MyServer.java", ...], ...]

So we have to flatten it to remove the nested structure, which is what flatMap does:

["src/client/MyClient.java", ..., "src/server/MyServer.java", ...]

Finally we add the immediate children that are plain files (not folders), and that’s our result.

We can also do the other pieces of the problem with map/filter/reduce. Once we have the list of all files underneath the current folder, we can filter it to just the Java files:

Stream<File> files = allFilesIn(new File("."))
                      .filter(endsWith(".java"));

Now that we have the files we want to extract words from, we’re ready to load their contents. Java has a useful utility function for this, Files.readAllLines, which requires first converting the File objects into Paths:

Stream<Path> paths = files.map(File::toPath);

and then calling readAllLines:

Stream<List<String>> fileContents = paths.map(path -> { 
    try { 
        return Files.readAllLines(path);
    } catch (IOException ioe) {
        throw new UncheckedIOException(ioe);
    }
});

Note that this step is complicated slightly by the fact that the functions passed to map/filter/reduce must not be declared to throw any checked exceptions, so readAllLines has to be wrapped with a lambda expression that converts the checked IOException into an unchecked exception.

Finally, we can flatten the stream of lists of lines into a simple stream of lines:

Stream<String> lines = fileContents.flatMap(List::stream);

and then extract the nonempty words from each line:

Stream<String> words = lines.flatMap(line -> Arrays.stream(line.split("\\W+"))
                                             .filter(s -> s.length() > 0));

And we’re done: we have our stream of all the words in the project’s Java files! As promised, the control statements have disappeared.

→ full code for the example

Benefits of abstracting out control

Map/filter/reduce can often make code shorter and simpler, and allow the programmer to focus on the heart of the computation rather than on the details of loops, branches, and control flow.

By arranging our program in terms of map, filter, and reduce, and in particular using immutable datatypes and pure functions (functions that do not mutate data) as much as possible, we’ve created more opportunities for safe concurrency. Maps and filters using pure functions over immutable datatypes are instantly parallelizable — invocations of the function on different elements of the sequence can be run in different threads, on different processors, even on different machines, and the result will still be the same.

Java’s map/filter/reduce implementation supports this concurrency automatically. We can take any Stream and create a parallelized version of it by calling parallel():

Stream<Path> paths = files.parallel().map(File::toPath);

Subsequent map/filter/reduce operations on the parallel stream may be executed in separate threads created automatically by Java. So this simple change allows the file loading and word splitting to happen in parallel.
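As a sketch of that claim (the ParallelDemo class is hypothetical): squaring is a pure function over immutable values, so the parallel answer is guaranteed to match the sequential one:

```java
import java.util.stream.IntStream;

public class ParallelDemo {
    // Sums the squares of 1..1000. parallel() lets Java split the work across
    // threads; because the mapped function is pure and the values immutable,
    // the result is the same as a sequential computation.
    public static long sumOfSquares() {
        return IntStream.rangeClosed(1, 1000)
                .parallel()
                .mapToLong(x -> (long) x * x)
                .sum();
    }
}
```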

MapReduce is a design pattern for parallelizing very large computations in this way, computations so large that they require a cluster of machines to compute.

reading exercises

map/filter/reduce

This function accepts a list of numbers and computes the product of all the odd numbers:

static int productOfOdds(List<Integer> list) {
    int result = 1;
    for (int x : list) {
        if (x % 2 == 1) {
            result *= x;
        }
    }
    return result;
}

Rewrite this code using map, filter, and reduce:

static int productOfOdds(List<Integer> list) {
    return ▶▶EXPR◀◀;
}
class C {
    static int x = 5;
    static int y = 8;

    static boolean isOdd(int x) {
      return x % 2 == 1;
    }

    static boolean xIsOdd = x % 2 == 1;

    static int oddOrIdentity(int x) {
      return isOdd(x) ? x : 1;
    }

    static int identityFunction(int x) {
        return x;
    }

    static int sum(int x, int y) {
      return x + y;
    }

    static Function<Integer, Integer> identity = x -> x;

    static int product(int x, int y) {
      return x * y;
    }

    static boolean alwaysTrue(int x) {
      return true;
    }

    static Function<Integer,Boolean> modulusTester(int i) {
      return x -> x % 2 == i;
    }  
    ...
}

Assuming this code is in the class C shown on the right, which of the following expressions can replace ▶▶EXPR◀◀?

list.stream()
.map(C::identityFunction)
.filter(C::isOdd)
.reduce(C::product)

list.stream()
.map(identity)
.filter(C::alwaysTrue)
.reduce(C::product)

list.stream()
.map(identity)
.filter(C::modulusTester)
.reduce(C::product)

list.stream()
.map(x)
.filter(C::oddOrIdentity)
.reduce(C::product)

list.stream()
.map(C::oddOrIdentity)
.filter(C::alwaysTrue)
.reduce(C::product)

list.stream()
.map(C::identityFunction)
.filter(xIsOdd)
.reduce(x*y)

More examples

Let’s look at a typical database query example. Suppose we have a database about digital cameras, in which each object is of type Camera with observer methods for its properties (brand(), pixels(), cost(), etc.). The whole database is in a list called cameras. Then we can describe queries on this database using map/filter/reduce:

// What's the highest resolution Nikon sells?
cameras.stream()
       .filter(camera -> camera.brand().equals("Nikon"))
       .map(Camera::pixels)
       .reduce(Math::max);

Relational databases use the map/filter/reduce paradigm (where it’s called project/select/aggregate). SQL (Structured Query Language) is the de facto standard language for querying relational databases. A typical SQL query looks like this:

select max(pixels) from cameras where brand = "Nikon"

cameras is a sequence (a list of rows, where each row has the data for one camera)

where brand = "Nikon" is a filter

pixels is a map (extracting just the pixels field from the row)

max is a reduce

Summary

This reading is about modeling problems and implementing systems with immutable data and operations that implement pure functions, as opposed to mutable data and operations with side effects. Functional programming is the name for this style of programming.

Functional programming is much easier to do when you have first-class functions in your language and you can build higher-order functions that abstract away control flow code.

Some languages — Haskell, Scala, OCaml — are strongly associated with functional programming. Many other languages — JavaScript, Swift, several .NET languages, Ruby, and so on — use functional programming to a greater or lesser extent. With Java’s recently added functional language features, if you continue programming in Java you should expect to see more functional programming there, too.