THE JELLYBEAN MACHINE
MIT Artificial Intelligence Laboratory
This is http://www.cva.stanford.edu/j-machine/cva_j_machine.html
Last updated July 7, 1998
The J-Machine is a fine-grained concurrent computer designed by the
MIT Concurrent VLSI Architecture group (now located at Stanford University)
in conjunction with Intel Corporation.
Pictures of J-machine Hardware
Note
At the moment, we're trying to bring a few more users on-line with creative
applications. If you have a cycle- or bandwidth-hungry application, please
contact Andrew Chang or Richard Lethin.
Overview
The J-machine project was started at MIT in about 1988 as an experiment in
message-passing computing based on work that Bill Dally did at Caltech for
his doctoral dissertation.
The work was driven by the VLSI philosophy "processors are cheap" and
"memory is expensive". This philosophy is based on an idealistic view
of VLSI economics, in which the cost of a function is based on the
VLSI area dedicated to it. Although the standard view is that
processors are much more expensive than memory (and this standard view
was very much true before levels of VLSI integration allowed
processors to be integrated on a single chip), if we look at a typical
workstation with 32 Mbyte of memory, the amount of silicon area
dedicated to memory is roughly 100 times that for the CPU. A bit of
DRAM is 100 lambda^2, so 32 Mbyte (256 Mbit) is roughly 27G lambda^2, versus the
arithmetic units in the CPU, which are about 300M lambda^2.
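The area comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (100 lambda^2 per DRAM bit, 32 Mbyte of memory, about 300M lambda^2 for the CPU's arithmetic units). The names are illustrative, and treating a Mbyte as 2^20 bytes is an assumption.

```python
# Back-of-envelope check of the silicon-area argument.
DRAM_BIT_AREA = 100       # lambda^2 per DRAM bit (figure from the text)
CPU_ARITH_AREA = 300e6    # lambda^2 for the CPU arithmetic units (from the text)


def memory_area(mbytes, bytes_per_mbyte=2**20):
    """Area in lambda^2 of a DRAM array holding `mbytes` Mbyte.

    bytes_per_mbyte=2**20 is an assumption; use 10**6 for decimal Mbytes.
    """
    bits = mbytes * bytes_per_mbyte * 8
    return bits * DRAM_BIT_AREA


mem_area = memory_area(32)
ratio = mem_area / CPU_ARITH_AREA
print(f"memory area: {mem_area / 1e9:.1f}G lambda^2")
print(f"memory/CPU area ratio: {ratio:.0f}x")
```

With these inputs the memory occupies on the order of a hundred times the area of the arithmetic units, which is all the argument needs.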
Of course, this ignores issues related to the relative production
volumes and process technologies for logic vs. memory, and runs against
the current "wisdom" that the best way to build a fast parallel processor
is to bolt a network-interface and coherent-cached-shared-memory hardware
onto a standard microprocessor. However, we're interested in technology
imperatives much more than market imperatives.
With CPUs so cheap, in the silicon-area sense, the J-machine project set
out to explore an architecture in which the processors are more "liberally
scattered" through the machine. We envisioned a component with economies
of scale like those of DRAMs: a "message-driven processor" with a small
processor and network interface integrated *with* the memory. The "J" in
"J-machine" stands for "Jellybean", in the sense that the processors would
be cheap and plentiful, like jellybean candies.
The design of the J-machine incorporated several novel technologies. The
machine architects immediately realized that a key to performance would be
fast, low-overhead messaging. The processor and network interface are
tightly coupled, so that user-level messages can be sent with
very little overhead for copying. The network is a 3-dimensional
deterministic wormhole-routed mesh. User-level message handlers dispatch
on message arrival, with a small amount of queuing in on-chip memory
at the destination. We also dispatch to handlers before the message
has completely arrived, trying to speed the dispatch process.
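To make "deterministic" concrete: in a deterministically routed mesh, the path a message takes is fixed by its source and destination coordinates alone. The sketch below assumes dimension-order routing (remove the offset in one dimension completely, then the next), which is one common deterministic scheme; the text above does not say which order the J-machine uses. The function name `route` and the coordinate tuples are illustrative.

```python
def route(src, dst):
    """Return the nodes visited from src to dst in a 3-D mesh, hop by hop.

    Assumes dimension-order routing: the offset in dimension 0 is
    eliminated first, then dimension 1, then dimension 2.
    """
    path = [tuple(src)]
    cur = list(src)
    for dim in range(3):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path


hops = route((0, 0, 0), (2, 1, 3))
print(len(hops) - 1, "hops:", hops)
```

Because the path depends only on the endpoints, two messages between the same pair of nodes always traverse the same links in the same order.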
The communication capacity of the J-machine is pattern-dependent, of course.
Each node can inject into the network at 2 words (72 bits) per clock.
Messages travel over the links in 18-bit "flits" at 12.5 MHz. Message
reception is into on-chip memory, buffered in "4-word (144-bit) queue row buffers"
that write in one clock cycle to minimize interference with processing and
instruction fetching. The major bisection of a 1024-node machine is 8 by 8 channels,
each 18 bits wide and running at 12.5 MHz, or 1.8 Gbyte/sec. Microbenchmark
studies have shown that random traffic patterns can achieve about 40% utilization
of this bandwidth.
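The 1.8 Gbyte/sec bisection figure follows directly from the channel count, flit width, and link rate given above. The sketch below reproduces it and applies the roughly 40% utilization reported for random traffic. It assumes one flit per channel per link clock and a Gbyte of 10^9 bytes; the variable names are illustrative.

```python
CHANNELS = 8 * 8      # channels crossing the major bisection
FLIT_BITS = 18        # bits carried per flit
LINK_HZ = 12.5e6      # flits per second on each channel

# Peak bisection bandwidth in bytes per second.
peak_bytes_per_sec = CHANNELS * FLIT_BITS * LINK_HZ / 8

# Bandwidth at the ~40% utilization observed for random traffic.
random_traffic = 0.40 * peak_bytes_per_sec

print(f"peak bisection: {peak_bytes_per_sec / 1e9:.2f} Gbyte/sec")
print(f"at 40% utilization: {random_traffic / 1e9:.2f} Gbyte/sec")
```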
Each processor node has about 4kb of on-chip memory and 1Mbyte of off-chip
external memory. The off-chip memory was added when we discovered the
amount of memory we could put on-chip in the available ~1989 process
technology was too small. We'd have preferred to have put the memory
on-chip and have had more processors. Furthermore, pinout constraints
on our packaging technology restricted us to a narrow interface to
external memory. This remote and narrow path results in an external
access latency of 6 clock cycles, vs. one clock for the on-chip memory.
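To see how much the 6-cycle external path can cost, a simple weighted average is enough. The sketch below is a first-order model only: it assumes accesses are independent and ignores any overlap or contention. The function name and the example fractions are hypothetical, not measurements.

```python
ON_CHIP_CYCLES = 1     # on-chip access latency (from the text)
EXTERNAL_CYCLES = 6    # external access latency (from the text)


def average_access_cycles(on_chip_fraction):
    """Mean cycles per memory access when the given fraction hits on-chip."""
    if not 0.0 <= on_chip_fraction <= 1.0:
        raise ValueError("on_chip_fraction must be between 0 and 1")
    off_chip_fraction = 1.0 - on_chip_fraction
    return (on_chip_fraction * ON_CHIP_CYCLES
            + off_chip_fraction * EXTERNAL_CYCLES)


for fraction in (1.0, 0.9, 0.5):
    cycles = average_access_cycles(fraction)
    print(f"{fraction:.0%} on-chip: {cycles:.2f} cycles per access")
```

Under this model, even a modest share of external accesses raises the average noticeably, which is why keeping hot code and data in the on-chip memory matters.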
The node's processor itself is very modest: it runs at a 10 MHz internal clock, with
a limited number of registers, and no floating point hardware, for performance
that's about equivalent to a 25 MHz 386.
The J-machine also incorporates mechanisms to support a concurrent
object-oriented programming model over a global object space. Instructions
are included for a quick hashed lookup in an on-chip table. This can be
used for caching, roughly equivalent to a TLB, except that it is managed by
the compiler/OS and policy is not set by the hardware. In a single
instruction, the lookup translates an "object identifier" (equivalent to a
segmented global virtual address) to the node which handles methods for the
object and to the memory address on the local node where the object resides.
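A software model may help picture the translation table. The sketch below is a toy, direct-mapped version: software installs entries, a lookup either hits or signals a miss, and what to do on a miss is left to the caller. The class name, method names, table size, hash function, and overwrite-on-collision behavior are all assumptions for illustration, not the hardware's actual design.

```python
class TranslationTable:
    """Toy model of a software-managed, hashed translation cache."""

    def __init__(self, rows=64):
        self.rows = rows
        self.slots = [None] * rows   # each slot holds (object_id, value) or None

    def _index(self, object_id):
        # Hypothetical hash: low-order bits of the identifier.
        return object_id % self.rows

    def enter(self, object_id, value):
        """Install a translation; a colliding entry is simply overwritten."""
        self.slots[self._index(object_id)] = (object_id, value)

    def lookup(self, object_id):
        """Return the cached translation, or raise KeyError on a miss."""
        entry = self.slots[self._index(object_id)]
        if entry is not None and entry[0] == object_id:
            return entry[1]
        # Stands in for the miss that software would handle.
        raise KeyError(object_id)


table = TranslationTable(rows=64)
table.enter(0x1234, (17, 0x0400))    # object 0x1234: node 17, local address 0x400
node, address = table.lookup(0x1234)
print(node, hex(address))
```

Since policy is in software, the miss path can do whatever the runtime chooses, such as consulting a larger table in memory and re-entering the result.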
Hardware also supports dynamic typing: each 32-bit word is tagged with 4 bits.
Hardware instructions like "add" will trap if they see a type other than integer.
The tag is also used to identify "futures"; these serve as
placeholders for the results of asynchronous method calls. An attempt to
use a future by an arithmetic instruction results in a fault; the OS
suspends the thread waiting for the arrival of the result.
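The tag checks described above can be sketched in a few lines. This is a behavioral model only: the tag names, the `Word` class, and the two exception types are hypothetical stand-ins for the hardware's 4-bit tags and its trap and fault mechanisms.

```python
from dataclasses import dataclass

# Two illustrative tag values; the real encoding uses 4 tag bits per word.
INT = "int"
FUTURE = "future"


@dataclass(frozen=True)
class Word:
    tag: str
    value: int = 0


class TypeTrap(Exception):
    """Stands in for the trap taken when an operand is not an integer."""


class FutureFault(Exception):
    """Stands in for the fault taken when an operand is an unresolved future."""


def add(a, b):
    """Tag-checked add: both operands must be integers."""
    for word in (a, b):
        if word.tag == FUTURE:
            raise FutureFault(word)
        if word.tag != INT:
            raise TypeTrap(word)
    return Word(INT, a.value + b.value)


print(add(Word(INT, 2), Word(INT, 3)))
try:
    add(Word(INT, 2), Word(FUTURE))
except FutureFault:
    # Here the OS would suspend the thread until the result arrives.
    print("future touched: thread would be suspended")
```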
We currently have two programming environments for the J-machine.
Three 1024-node J-machine systems have been built, and live at MIT,
Caltech, and Argonne National Laboratory.
The 1024-node J-machine at MIT is hosted by the machine jelly-donut.ai.mit.edu,
i.e., it's on the Internet. The 1024-node machine has a peak performance of
1G instructions/sec, peak memory bandwidth of 6 GB/sec to external memory,
and 1.28 GB/sec bandwidth across the central bisection. The J-machine also
includes a dedicated filesystem and a distributed graphics system.
lethin@ai.mit.edu