Design Project One:
Frequently-Asked Questions
Last updated: $Date: 2007/03/21 17:55:49 $
Q: Do I really need to stay within 2,500 words?
A: Yes. We expect you to adhere to the word limit. That said,
we're not going to penalize you for being just a few words over the
limit -- we're mostly concerned with the quality of your report. If
your report contains 2,500 + X words, and you're worried whether X
counts as "just a few," you should probably cut some words.
Remember to print the word count at the end of your final
report.
Q: Do the figure captions count against the word limit?
A: Yes.
Q: Do I need to worry about executable code size (fitting in memory)?
A: No, you may assume that the memory footprint of your executable code is negligible.
Q: Is there a specified behavior when a call to FIND asks for an
array of triplets larger than memory allows (e.g. FIND
("*","*","*", 0, e9))?
A: You don't need to address this case. It's fine to assume that an
error is returned, or that the system just crams the result into
virtual memory.
Q: From the design document:
You can also assume that ... the total storage required for your
database is about 100 GB ... and that each triplet occupies about 100
bytes -- hence the database contains about 1 billion triplets.
What's the primary measure? Are we forced to have a 100 GB database
or 1 billion triplets? Are you requiring us to use a given internal
representation for the triplets?
A: Assume that the data is 100 GB / 1 billion triplets before
compression. You are right that there are a number of techniques
that might reduce the size of this data.
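The arithmetic behind the "about 1 billion triplets" figure can be
sketched as follows. This is only a back-of-the-envelope check; it
assumes "GB" is read as 10^9 bytes, and the names DB_BYTES,
TRIPLET_BYTES, and triplet_count are made up for illustration.

```python
# Back-of-the-envelope check of the sizes quoted in the assignment.
# Assumption: "GB" is read as 10**9 bytes here.
DB_BYTES = 100 * 10**9   # about 100 GB of triplet data, before compression
TRIPLET_BYTES = 100      # about 100 bytes per triplet


def triplet_count(db_bytes: int = DB_BYTES,
                  triplet_bytes: int = TRIPLET_BYTES) -> int:
    """Return the approximate number of triplets in the database."""
    return db_bytes // triplet_bytes


print(triplet_count())  # 1000000000
```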
Q: How detailed should the pseudocode be?
A: Use the pseudocode in the course notes as a guide for pseudocode in
your report. Generally, each pseudocode example shouldn't be more
than about 10 lines of code, and it's fine to be relatively high
level. You'll need to exercise some discretion, glossing over
details that aren't relevant but showing enough details to get across
the point of the algorithm or approach you are describing. Make sure
you describe the pseudocode in English in the text as well!
Q: The API includes a FIND method in which the subject,
relationship, or object fields can be wild cards. The workloads,
however, only include calls of the form FIND(*, relationship, object)
and FIND(subject, relationship, *). Furthermore, the workload does
not appear to use all the relationships that are inserted.
Do we need to support wildcards anywhere? Is it preferable to have a
system which handles all permutations of the FIND method equally or
one that is more efficient with the two forms above, but resorts to
sequential scan of the triples for the unused forms?
A: You should support any type of relationship query, but you can
optimize for those given in the workload. Success is defined first by
designing a fast lookup behind the general API, assuming more or less
uniform access patterns, and detailed workload-based optimizations are
secondary.
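To make the trade-off concrete, here is a toy in-memory sketch of a
FIND that supports wildcards in any position, uses an index for the
two forms that appear in the workload, and falls back to a sequential
scan otherwise. It is an illustration under stated assumptions, not a
recommended design: the TripletStore class, the two dictionaries, and
the (start, count) paging arguments are hypothetical, and a real
design has to keep its structures on disk.

```python
from collections import defaultdict

WILDCARD = "*"


class TripletStore:
    """Toy in-memory store (hypothetical); real data would live on disk."""

    def __init__(self):
        self.triplets = []                    # every triplet, in insertion order
        self.by_rel_obj = defaultdict(list)   # (relationship, object) -> triplets
        self.by_subj_rel = defaultdict(list)  # (subject, relationship) -> triplets

    def insert(self, subject, relationship, obj):
        t = (subject, relationship, obj)
        self.triplets.append(t)
        self.by_rel_obj[(relationship, obj)].append(t)
        self.by_subj_rel[(subject, relationship)].append(t)

    def find(self, subject, relationship, obj, start=0, count=None):
        if subject == WILDCARD and WILDCARD not in (relationship, obj):
            # FIND(*, relationship, object): answered from an index.
            matches = self.by_rel_obj.get((relationship, obj), [])
        elif obj == WILDCARD and WILDCARD not in (subject, relationship):
            # FIND(subject, relationship, *): answered from an index.
            matches = self.by_subj_rel.get((subject, relationship), [])
        else:
            # Any other form: correct but slow sequential scan.
            pattern = (subject, relationship, obj)
            matches = [t for t in self.triplets
                       if all(p == WILDCARD or p == v
                              for p, v in zip(pattern, t))]
        end = None if count is None else start + count
        return matches[start:end]


store = TripletStore()
store.insert("alice", "wrote", "book1")
store.insert("bob", "wrote", "book2")
store.insert("alice", "owns", "book2")
print(store.find("*", "wrote", "book1"))  # [('alice', 'wrote', 'book1')]
print(len(store.find("alice", "*", "*")))  # 2
```

Every form of FIND returns correct results here; only the cost
differs between the indexed forms and the scan.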
Q: The assignment speaks of KB and GB and TB, but says that 4
KB = 4096 B and 1 TB = 1000 GB, and further that the disk is 1 TB and
main memory is 1 GB. I would like to know exactly how many bytes I
have to work with.
A: For some reason, computer scientists typically measure everything
in powers of two, but not always. For example, disk drives are usually
sized using the conventional metric system definition of
kilo/mega/giga, whereas operating systems use the conventional
computer science definition for reporting disk sizes, which is why
your OS thinks your 100 GB drive is less than 100 GB! So, 1 KB is
either 1024 bytes (2^10) or 1000 bytes. Similarly, 1 MB is either
2^20 bytes or 10^6 bytes, 1 GB is either 2^30 bytes or 10^9 bytes, and
1 TB is either 2^40 bytes or 10^12 bytes. We've used both definitions
in the DP. :-)
If this really matters to your design, you may assume that we meant
to write 1 TB = 1024 GB, such that you have 2^40 bytes of storage.
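The two readings of each prefix, and the size of the disk under the
1 TB = 1024 GB reading, can be worked out with a few lines of Python.
The dictionary and function names below are made up for illustration.

```python
# The two common readings of each unit prefix, in bytes.
BINARY = {"KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}
DECIMAL = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def reported_gb(decimal_gb: int) -> float:
    """Size in binary GB of a drive that is sold as `decimal_gb` metric GB."""
    return decimal_gb * DECIMAL["GB"] / BINARY["GB"]


disk_bytes = 1024 * BINARY["GB"]   # 1 TB = 1024 GB, as stated above
print(disk_bytes == 2**40)         # True
print(round(reported_gb(100), 1))  # 93.1
```

This is why a drive sold as 100 GB shows up as roughly 93 GB in an
OS that reports sizes in powers of two.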
Q: What is the int returned by READ_BLOCKS()?
A: The intention is that it's the number of bytes read, or some kind of
status code indicating failure if there's a problem. You can ignore
it, or assume that it can be used to tell when you are trying to read
blocks that don't exist, if you need that functionality.
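One way to use such a return value is sketched below. Everything here
is an assumption made for illustration: the FakeDisk class, the
read_blocks(first_block, n_blocks, buf) argument order, and the use
of -1 as the failure code are hypothetical stand-ins, not the
READ_BLOCKS() interface from the assignment.

```python
BLOCK_SIZE = 4096  # 4 KB = 4096 B, as in the assignment


class FakeDisk:
    """Hypothetical in-memory stand-in for the disk."""

    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.data = bytearray(n_blocks * BLOCK_SIZE)

    def read_blocks(self, first_block, n_blocks, buf):
        """Copy blocks into buf; return bytes read, or -1 on failure."""
        if first_block < 0 or n_blocks < 0:
            return -1
        if first_block + n_blocks > self.n_blocks:
            return -1  # caller asked for blocks that don't exist
        start = first_block * BLOCK_SIZE
        nbytes = n_blocks * BLOCK_SIZE
        buf[:nbytes] = self.data[start:start + nbytes]
        return nbytes


def read_or_none(disk, first_block, n_blocks):
    """Return the requested bytes, or None if the read did not succeed."""
    buf = bytearray(n_blocks * BLOCK_SIZE)
    status = disk.read_blocks(first_block, n_blocks, buf)
    if status != n_blocks * BLOCK_SIZE:
        return None
    return bytes(buf)


disk = FakeDisk(4)
print(len(read_or_none(disk, 0, 2)))  # 8192
print(read_or_none(disk, 3, 2))       # None
```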
Q: What do you mean by a 'System Diagram'?
A: No formal definitions here. Just any sort of diagram to help the reader
understand your design or particular aspects thereof. For instance, you
might illustrate some data structures to explain how your data is
usually laid out on disk, with boxes representing tuples, blocks, the
disk, memory, and whatever other logical/physical components are
relevant to your design, and arrows depicting the relationships among
the various components. It's important that you explain the diagrams
with words too, using labels, captions, your main text, etc.
Figures 5-2 and 5-3 from the text, which appeared in some recent
reading, are examples of such diagrams.
Q: Are we to assume that each application (library and flickr) has 1B
triplets that we need to deal with? And to clarify, is the same system
dealing with both sets of triplets at the same time, or are we
assuming two separate terabyte drives that run these applications
independently of each other?
A: You should think of each application as having an independent
database / disk, each of which holds about 1B triplets.