Massachusetts Institute of Technology - Department of Urban Studies and
Planning
11.521 |
Spatial
Database Management and Advanced Geographic Information Systems |
11.523 |
Fundamentals
of Spatial Database Management |
Referential Integrity and Relational Database Design
[Additional notes for use with Problem Set B]
Introduction to Relational Database Design (based on Lecture by Tom Grayson)
Review: The Relational Model
Review: Qualities
of a Good Database Design
Introduction to Entity-Relationship
Modeling
E-R Modeling Process
Database Normalization
Review:
Database Design Rules of Thumb
Enforcing Referential Integrity in Oracle
[Optional: Correlated
Updates - examples of UPDATEs only for SELECTed rows]
Review: The Relational
Model
- All data are represented as tables (relations)
- Tables are comprised of rows and columns (tuples)
- Rows are (officially) unordered (i.e., the order in which rows are referenced
does not matter)
- A proper relational table contains no duplicate rows.
- Each table has a primary key, a unique identifier constructed from
one or more columns
- Most primary keys are a single column (e.g., OWNERNUM for OWNERS)
- A table is linked to another by including the other table's primary key.
Such an included column is called a foreign key
Review:
Qualities of a Good Database Design
- Reflects real-world structure of the problem
- Can represent all expected data over time
- Avoids redundant storage of data items
- Provides efficient access to data
- Supports the maintenance of data integrity over time
- Clean, consistent, and easy to understand
- Note: These objectives are sometimes contradictory!
Introduction to Entity-Relationship Modeling
- Entity-Relationship (E-R) Modeling: A method for designing databases
- A simplified version is presented here
- Represents the data by entities that have attributes.
- An entity is a class of distinct identifiable objects or concepts
- Entities have relationships with one another
- Result of the process is a normalized database that facilitates
access and avoids duplicate data
- Here's an ER diagram of portions of the 'parcels' test database. It is drawn
by importing the tables from Oracle to MS-Access and then using the Tool/Relationships
capability of MS-Access to draw the diagrams:
E-R Modeling Process
- Identify the entities that your database must represent
- Determine the cardinality relationships among the entities and classify
them as one of
- One-to-one (e.g., a parcel has one address)
- One-to-many (e.g., a parcel may be involved in many fires)
- Many-to-many (e.g., parcel sales: a parcel may be sold many times by different owners,
and an individual owner may sell many parcels)
- Draw the entity-relationship diagram
- Determine the attributes of each entity
- Define the (unique) primary key of each entity
- Define the relationships between primary keys in one table and foreign keys in another
From E-R Model to Database Design
- Entities with one-to-one relationships can (and often should) be merged into a single entity
- Each remaining entity is modeled by a table with a primary key and attributes,
some of which may be foreign keys
- One-to-many relationships are modeled by a foreign key attribute in the
table representing the entity on the "many" side of the relationship (e.g., the
FIRES table has a foreign key that refers to the PARCELS table)
- Many-to-many relationships among two entities are modeled by defining a third table
that has foreign keys that refer to the entities in each original table. These foreign keys should
be included in the third table's primary key, if appropriate
- Commercially available tools can automate the process of converting a E-R
model to a database schema
Database Normalization
Much of formal database design is focused on normalizing the database
and ensuring that the design conforms to a level of normalization (e.g.,
first normal form, second normal form, etc.). Although there are higher
normal forms, Third Normal Form is generally considered good enough for
typical applications. Normalization generally involves taking a design with
fewer tables and many columns and transforming it into a design with more tables
with fewer columns -- after conducting some tests and applying some rules. First normal forms requires that there be no multi-valued attributes (e.g., for the URISA database, do not put multiple keyword codes in the same cell, and do not have keyword1, keyword2, etc. columns). Second Normal Form requires that non-key fields must be dependent upon the entire key (e.g., don't put the owner name as well as an owner ID in the parcel table). Third Normal Form prohibits transitive dependencies whereby a non-key attribute is dependent on another non-key attribute (e.g., only put a unique owner ID in the parcel table and not any other owner information such as their age or mailing address). Here are some online sites with useful database management concepts and tips including some good discussion of database normalization: and first, second, and third normal form: (In particular, PHPBuilder has understandable examples of the different levels of normalization.)
Review: Database
Design Rules of Thumb
- Keep data items atomic (e.g., first and last names are separate). Concatenating
columns together later on-the-fly is generally easy, but separating them is
not. (First Normal Form)
- What is an example of where parsing subfields from a column may go
awry?
- When might you want to include the combined fields in a column anyway?
- Define the primary key first. Use a descriptive name (PARCELID, not ID)
- In fact, use descriptive names that give a new user a decent chance of
guessing what they mean for all your columns! (E.g., use PARCEL_COUNT
rather than PACT)
- Use a single column for the primary key whenever possible; multi-column
primary keys are appropriate for many-to-many relationships
- Use lookup tables rather than storing long values
- Use numeric keys whenever possible (What about ZIP codes?)
- Avoid intelligent keys (exception: lookup tables)
- Avoid using multiple columns to represent a one-to-many relationship (e.g.,
columns such as CHILD1, CHILD2 in a table called PARENT rather than putting
the children in a separate table. (First Normal Form)
- For readability, use the primary key name for foreign keys unless the same
foreign key is used multiple times in the same table (e.g., state of work
and state of residence for a person might both be foreign keys that reference
a table of states)
- Do not include two columns whose values are linked together (e.g., county
name and county ID) unless one of the columns is the primary key of the table
(Third Normal Form)
- Avoid allowing NULL values in columns that have a discrete range of possible
values (e.g., integers between 1 and 10, inclusive)
- Not applicable to DBF files, which do not support NULLs
- Avoid using multiple tables with similar structures that represent minor
variants on the same entity (e.g., putting Boston parcels and Cambridge parcels
in separate tables).
- Why is this rule often hard to practice with GIS?
- Plan ahead for transferring data to a different database. For example,
you may want to move data from Oracle to DBF, or Microsoft Access to Oracle.
- Avoid column names with characters with other than UPPER CASE letters
(A-Z), digits (0-9), and the underscore (_). Other characters may not
be accepted by a database. Some database systems may be case sensitive
with regard to column names, while others are not.
- Keep your column names relatively short. Different databases support
different numbers of characters in column names (e.g., 30 for Oracle,
64 for Microsoft Access, 10 for DBF). Try to make column names differ
in the first few characters rather than at the end to avoid column name
duplication if the names are truncated during the conversion process (e.g.,
use COL1 and COL2, not LONG_COLUMN_NAME_1 and LONG_COLUMN_NAME_2).
- Note that keeping column names short may be at odds with keeping
your column names meaningful for neophytes. Be aware that you are making
a tradeoff!
- Remember that these are rules of thumb, not absolute laws! Bend
the rules if you must but have a justification for your decision. The limitations
of a GIS software package often provide a good reason.
See Tom Grayson's notes regarding the syntax of SQL statements to define primary and foreign keys and to constrain table manipulation in order to enforce referential integrity - that is, to keep you from making changes to tables that destroy their internal consistency.
For more information about this page, please
contact the 11.521 Staff Mailing List
<11.521staff@MIT.EDU>.
Written February, 2001 (Tom Grayson), Last Modified 11 March 2010 (jf)