Spring 2007




Design Project One: Frequently-Asked Questions

Last updated: $Date: 2007/03/21 17:55:49 $


Q: Do I really need to stay within 2,500 words?

A: Yes. We expect you to adhere to the word limit. That said, we're not going to penalize you for being just a few words over the limit -- we're mostly concerned with the quality of your report. If your report contains 2,500 + X words, and you're worried whether X counts as "just a few," you should probably cut some words. Remember to print the word count at the end of your final report.

Q: Do the figure captions count against the word limit?

A: Yes.

Q: Do I need to worry about executable code size (fitting in memory)?

A: No, you may assume that the memory footprint of your executable code is negligible.

Q: Is there a specified behavior if one calls FIND and the resulting array of triplets is larger than memory allows (e.g. FIND("*", "*", "*", 0, 1e9))?

A: You don't need to address this case. It's fine to assume that an error is returned, or that the system just crams the result into virtual memory.

Q: From the design document:
You can also assume that ... the total storage required for your database is about 100 GB ... and that each triplet occupies about 100 bytes -- hence the database contains about 1 billion triplets.
What's the primary measure? Are we forced to have a 100 GB database or 1 billion triplets? Are you requiring us to use a given internal representation for the triplets?

A: Assume that the data is 100 GB / 1 billion triplets before compression. You are right that there are a number of techniques that might reduce the size of this data.
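The sizing arithmetic behind these figures can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 10^9 bytes), and the variable names are illustrative rather than part of the assignment.

```python
# Back-of-the-envelope sizing from the figures quoted in the question.
# Assumption: decimal units (1 GB = 10**9 bytes).
DB_BYTES = 100 * 10**9     # about 100 GB of data, before compression
TRIPLET_BYTES = 100        # about 100 bytes per triplet
MEMORY_BYTES = 1 * 10**9   # 1 GB of main memory, as stated in the assignment

triplets = DB_BYTES // TRIPLET_BYTES
memory_fraction = MEMORY_BYTES / DB_BYTES

print(triplets)         # 1000000000, i.e. about 1 billion triplets
print(memory_fraction)  # 0.01, i.e. memory holds about 1% of the data
```

The second number is the one that tends to drive design decisions: only about one percent of the uncompressed data fits in main memory at once.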

Q: How detailed should the pseudocode be?

A: Use the pseudocode in the course notes as a guide for pseudocode in your report. Generally, each pseudocode example shouldn't be more than about 10 lines of code, and it's fine to be relatively high level. You'll need to exercise some discretion, glossing over details that aren't relevant but showing enough detail to get across the point of the algorithm or approach you are describing. Make sure you describe the pseudocode in English in the text as well!

Q: The API includes a FIND method in which the subject, relationship, or object fields can be wildcards. The workloads, however, only include calls of the form FIND(*, relationship, object) and FIND(subject, relationship, *). Furthermore, the workload does not appear to use all the relationships that are inserted.

Do we need to support wildcards anywhere? Is it preferable to have a system which handles all permutations of the FIND method equally or one that is more efficient with the two forms above, but resorts to sequential scan of the triples for the unused forms?

A: You should support any type of relationship query, but you can optimize for those given in the workload. Success is defined first by designing a fast lookup behind the general API, assuming more or less uniform access patterns, and detailed workload-based optimizations are secondary.
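One way to picture this trade-off is a toy in-memory store that answers every wildcard pattern but keeps indexes only for the two forms that appear in the workload. This is a sketch under assumptions, not the expected design: the class and method names are hypothetical, the `start` and `count` parameters are guessed from the five-argument FIND call quoted earlier in this FAQ, and a real design would keep these structures on disk.

```python
from collections import defaultdict

WILDCARD = "*"

class TripleStore:
    """Toy in-memory triple store (hypothetical; for illustration only)."""

    def __init__(self):
        self.triples = []                     # every triple, in insertion order
        self.by_rel_obj = defaultdict(list)   # (relationship, object) -> triples
        self.by_subj_rel = defaultdict(list)  # (subject, relationship) -> triples

    def insert(self, subject, relationship, obj):
        triple = (subject, relationship, obj)
        self.triples.append(triple)
        self.by_rel_obj[(relationship, obj)].append(triple)
        self.by_subj_rel[(subject, relationship)].append(triple)

    def find(self, subject, relationship, obj, start=0, count=None):
        pattern = (subject, relationship, obj)
        if relationship != WILDCARD and obj != WILDCARD:
            # Indexed form: FIND(*, relationship, object)
            candidates = self.by_rel_obj.get((relationship, obj), [])
        elif subject != WILDCARD and relationship != WILDCARD:
            # Indexed form: FIND(subject, relationship, *)
            candidates = self.by_subj_rel.get((subject, relationship), [])
        else:
            # Any other pattern falls back to a sequential scan.
            candidates = self.triples
        matches = [t for t in candidates
                   if all(p == WILDCARD or p == v for p, v in zip(pattern, t))]
        end = None if count is None else start + count
        return matches[start:end]

store = TripleStore()
store.insert("Moby-Dick", "author", "Melville")
store.insert("Billy Budd", "author", "Melville")
store.insert("Moby-Dick", "year", "1851")

by_author = store.find("*", "author", "Melville")  # uses the (relationship, object) index
by_year = store.find("Moby-Dick", "year", "*")     # uses the (subject, relationship) index
by_subject = store.find("Moby-Dick", "*", "*")     # falls back to a scan
```

All three calls return correct results; only the last one pays for a full scan. Whether that cost is acceptable for the unused forms is exactly the judgment the answer above asks you to make and justify.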

Q: The assignment speaks of KB and GB and TB, but says that 4 KB = 4096 B and 1 TB = 1000 GB, and further that the disk is 1 TB and main memory is 1 GB. I would like to know exactly how many bytes I have to work with.

A: For some reason, computer scientists typically measure everything in powers of two, but not always. For example, disk drives are usually sized using the conventional metric definitions of kilo/mega/giga, whereas operating systems use the conventional computer science definitions when reporting disk sizes, which is why your OS thinks your 100 GB drive is smaller than 100 GB! So, 1 KB is either 1024 bytes (2^10) or 1000 bytes (10^3). Similarly, 1 MB is either 2^20 bytes or 10^6 bytes, 1 GB is either 2^30 bytes or 10^9 bytes, and 1 TB is either 2^40 bytes or 10^12 bytes. We've used both definitions in the DP. :-)

If this really matters to your design, you may assume that we meant to write 1 TB = 1024 GB, such that you have 2^40 bytes of storage.
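For anyone who wants the exact numbers, the two conventions can be compared directly. The snippet below is a small illustrative calculation; the constant names are ours and not part of the assignment.

```python
# Binary (power-of-two) and decimal (metric) unit definitions.
KIB, GIB, TIB = 2**10, 2**30, 2**40
KB, GB, TB = 10**3, 10**9, 10**12

# 1 TB read as 1024 GB with binary GB: exactly 2**40 bytes.
disk_bytes = 1024 * GIB
print(disk_bytes)                 # 1099511627776

# How much larger the binary terabyte is than the decimal one.
print(TIB - TB)                   # 99511627776 bytes, roughly 9.95% more

# Number of 4 KB (4096-byte) blocks on a 2**40-byte disk.
blocks = disk_bytes // (4 * KIB)
print(blocks)                     # 268435456, i.e. 2**28

# Why an OS reports a "100 GB" (decimal) drive as smaller than 100 GB (binary).
reported_gib = 100 * GB / GIB
print(round(reported_gib, 2))     # 93.13
```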

Q: What is the int returned by READ_BLOCKS()?

A: The intention is that it's the number of bytes read, or some kind of status code indicating failure if there's a problem. You can ignore it, or assume that it can be used to tell when you are trying to read blocks that don't exist, if you need that functionality.
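To make that convention concrete, here is a simulated stand-in. Everything in it is hypothetical: the argument list, the `-1` status code, and the in-memory "disk" are assumptions made for illustration, since the real READ_BLOCKS signature is defined in the design project handout and not here.

```python
BLOCK_SIZE = 4096       # 4 KB blocks, as stated in the assignment
ERR_OUT_OF_RANGE = -1   # hypothetical status code for a failed read

def read_blocks(disk: bytes, first_block: int, n_blocks: int, buf: bytearray) -> int:
    """Hypothetical stand-in for READ_BLOCKS.

    Copies n_blocks blocks starting at first_block into buf and returns the
    number of bytes read, or a negative status code if the range does not exist.
    """
    start = first_block * BLOCK_SIZE
    end = start + n_blocks * BLOCK_SIZE
    if first_block < 0 or n_blocks < 0 or end > len(disk):
        return ERR_OUT_OF_RANGE
    buf[:] = disk[start:end]
    return end - start

disk = bytes(4 * BLOCK_SIZE)          # a simulated four-block disk
buf = bytearray()

ok = read_blocks(disk, 1, 2, buf)     # blocks 1 and 2 exist
bad = read_blocks(disk, 3, 2, buf)    # block 4 does not exist

print(ok)    # 8192
print(bad)   # -1
```

Under this assumed convention a caller can distinguish a short or failed read from a successful one by checking the sign of the result, which is all the answer above asks you to be able to assume.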

Q: What do you mean by a 'System Diagram'?

A: No formal definitions here. Just any sort of diagram to help the reader understand your design or particular aspects thereof. For instance, you might illustrate some data structures to explain how your data is usually laid out on disk, with boxes representing tuples, blocks, the disk, memory, and whatever other logical/physical components are relevant to your design, and arrows depicting the relationships among the various components. It's important that you explain the diagrams with words too, using labels, captions, your main text, etc.

From some recent reading, Figures 5-2 and 5-3 from the text are examples of such diagrams.

Q: Are we to assume that each application (library and flickr) has 1B triplets that we need to deal with? And to clarify, is the same system dealing with both sets of triplets at the same time, or are we assuming two separate terabyte drives that run these applications independently of each other?

A: You should think of each application as having an independent database / disk, each of which is about 1B triples.

Questions or comments regarding 6.033? Send e-mail to the 6.033 staff or to the 6.033 TAs.

$Id: faq.html,v 1.7 2007/03/21 17:55:49 benmv Exp $