6.173 Reading Assignments

Readings

R09: An Analysis of Linux Scalability to Many Cores [PDF]. Assignment due Thurs. 11/18: Question: Sections 4.1 (in particular the 5 bullets) introduce a number of reasons why an application may not achieve good speedup, and Sections 4.3, 4.4, 4.5, and 4.6 describe fairly standard techniques for avoiding some of those problems for MOSbench applications. Which of the problems and techniques apply to your shared-memory TSP implementation on the Beehive?
R08: Tilera Tile Processor [PDF]. Assignment due Tues. 11/16: Question 1: The Tile64 processor has five distinct on-chip interconnection networks. Describe in a sentence or two the deadlock solutions employed in each of these networks. Do not worry if you are unsure about the answer, take your best shot. We will discuss in class. Question 2: Alyssa Hacker, a 6.173 student, reads the Tile processor paper and strongly disagrees with the premise in the "Logical versus Physical Networks" section that "If tiles shrink enough, the tile perimeter could eventually be so small that it would make wiring a challenge," thus seeming to imply that virtual channels may become important again in the future as VLSI technology shrinks to ever smaller dimensions. Your assignment is to write a short paragraph in support of, or against, Alyssa's position. Answer 1: All networks rely on X-Y routing to avoid network deadlock. Tile64 avoids "endpoint" deadlock through careful protocols. Memory traffic is spread across three networks to break cycles: the mdn, the tdn, and the vdn. The mdn carries cache-fill traffic between cores and the memory controllers, the tdn carray tile-to-tile memory accesses, and the vdn carries invalidation traffic. The mdn avoids endpoint deadlock via a contract between cores and the endpoint memory controllers: each of the n cores never generate more than m outstanding requests and the memory controller has enough buffering to store n*m requests. The udn carries user messages where the protocol is ad hoc and potentially not deadlock-free; the udn instead employs deadlock recovery: if fifos are full for a long time, they are drained into the cache. Answer 2: Adding networks rather than virtual channels may be sufficient because the added networks require buffers and the area for buffers creates space for wires above them. Furthermore, the number of metal layers is increasing with newer process technologies enabling more routing opportunities.
R07: Interconnection Networks [PDF]. Assignment due Thurs. 11/4: Please focus on the following parts of the reading assignment: E.2: pages 11, 13-15; E.3: pages 22-23; E.4: pages 30-40; E.5: pages 46-49, 51-52; E.6: pages 58-59. Suppose you are given a multicore chip with 256 cores arranged in a 16x16 direct mesh network configuration. Assume that the network has bidirectional channels (i.e., for any switch-to-switch link, assume there are separate channels in both directions). Assume further, that the network has no end-around connections, and that the switches have finite buffers (e.g., 4 flits). Deadlock avoidance and deadlock recovery are two approaches for dealing with deadlock. Describe one method for each approach that would apply to the given multicore chip. (Note that the reading you have been given does not describe the details of any method for deadlock recovery. We challenge you to devise a solution!) Answer: Deadlock avoidance: a dimension-ordered routing protocol (e.g. X-Y routing). Deadlock recovery: detect locally at each core if the switch buffer is full for more than some number of cycles; if so, drain the buffer into the core's cache.
R06: Scalable Synchronization on Shared Memory Multiprocessors [PDF]. Assignment due Tues. 11/2: The paper does not evaluate the performance of ticket locks on the Sequent Symmetry (Figure 6 is missing a line for ticket locks). As described in Section 4.1 and Figure 3, the Symmetry is a bus-based snooping cache-coherent machine. Suppose you had a version of the Symmetry whose atomic add instruction returned the old value of the target location, so that ticket locks could be implemented as in Algorithm 2. Would they perform better than test-and-test-and-set on such a machine? Why or why not? Answer: ticket lock should perform better because it avoids the "thundering herd" problem. The thundering herd problem refers to high contention situations in which many waiters realize the lock is free around the same time and attempt to perform atomic operations to acquire it.
R05: The MIT Alewife Machine [PDF]. Assignment due Thurs. 10/14: Think about the scalability of Alewife's directory-based cache coherence scheme. Describe in a few sentences how the storage requirements of such a directory scheme scale with the number of cores in the machine. Answer: Size of pointers grows as Log(n) per directory, and number of directories grows as n, since there is one directory per each node. Thus directory storage grows as n * Log(n) where n is number of nodes.
R04: Cache Coherence [PDF]. Assignment due Tues. 10/12: Coreta Systems is building a new multicore processor based on a bus and its product marketing team is adamant that the engineers make it cache coherent. Coreta engineers have read the Per Stenstrom paper which discusses two broad classes of snooping cache coherence protocols called write-invalidate protocols and write-update protocols, but cannot make up their mind as to which protocol to choose. They soon get deeply divided into two warring factions who call themselves the Invalidators and the Updaters. To resolve this situation they seek the advice of their CTO Sir Kit Hacker, an MIT alum, who has taken 6.173. Kit thinks for a moment and suggests that they use a hybrid protocol that first uses updates for any given cache line, and then switches that cache line over to an invalidate protocol when the number of updates exceeds a given THRESHOLD count. The engineers love this solution and dive into writing Verilog for the coherence state machines. They soon run into a second challenge -- how to choose the THRESHOLD count -- and hire you as a consultant to solve this problem for them. Describe in a few sentences how you might advise them to pick the THRESHOLD count. Answer: there is no single right answer. The suggestion presented in class was to model the cost of updating versus invalidating in terms of cycles and pick the threshold value for which the cost curves intersect. The threshold depends on system parameters like the cost of update messages, the cost of DRAM accesses, etc.
R03: The IBM RP3 [PDF]. No assignment due but you're encouraged to read it. It's an easy read.
R02: The Ultracomputer [PDF]. Assignment due Thurs. 10/7: In no more than 1-2 sentences, indicate one property of the Ultracomputer that favors a parallel blocked implementation of Jacobi Relaxation. Similarly, indicate one property that hinders the same application. Please turn in your answers on a sheet of paper in class like last week. Answer: The ultracomputer is suited to Jacobi because its shared memory system and fetch-and-add synchronization mechanism make parititioning and sharing boundary data simple. One drawback is that its Omega network and memory organization make memory accesses slow because memories are far away from the processors. Another drawback is that sharing data through memory is typically less power efficient than messaging.
R01: The Cosmic Cube [PDF]. Assignment due Thurs. 9/30: In no more than 1-2 sentences, indicate one property of the Cosmic Cube that favors a parallel blocked implementation of Jacobi Relaxation. Similarly, indicate one property that hinders the same application. Please turn in your answers on a sheet of paper in class. We will also discuss these papers in class. Answer: The cosmic cube is suited to Jacobi because it's a message-passing machine with a network topology upon which the algorithm's locality maps well. One drawback is that some care must be taken in the placement of threads/processes to exploit the locality of the hypercube network.