Massachusetts Institute of Technology - Department of Urban Studies and Planning


11.521
Spatial Database Management and Advanced Geographic Information Systems
11.523
Fundamentals of Spatial Database Management

Advanced Queries, Referential Integrity, and Relational Database Design

25 February 2010


Administrative

Today


Practice thinking relationally

What SQL query answers the following questions:

In lab#2, when you ran q_eb05new to join your expanded t_noncondo table with MASS_LANDUSE, you may have noticed that q_eb05new produced a table with one fewer record than t_noncondo. Which record dropped out?

SELECT bos05eb.PTYPE 
  FROM bos05eb 
 WHERE bos05eb.ptype not in (select stclass_txt from mass_landuse); 

Or, equivalently:

SELECT bos05eb.PTYPE 
  FROM bos05eb
 WHERE NOT EXISTS(SELECT NULL from mass_landuse m
                   WHERE bos05eb.ptype = m.stclass_txt); 

Let's figure out how to do the same query in MS-Access - by pasting the SQL into MS-Access ...

Lab #4 - Mapping zoning variances


Lab #3: Evolution of City Databases

Zoning Variances

Schema of ZONING table (and related lookup tables with codes for Decision, Use, NSA, and Neighbrhd)

1980 Census data (by Boston NSA)

Schema of 1980 Boston Census data (grouped by neighborhood statistical area, NSA)

SQL examples using zoning variances

Annotated SQL queries of ZONING table 

Grouping zoning applicants via 'lookup' tables

Annotated SQL queries illustrating use of lookup tables to categorize ownership of properties seeking zoning variances 

Zoning Variance Database Evolution Chart

Stages of evolution of the ZONING variance database

The ZONING table and associated lookup tables are in our Oracle database and accessible to everyone in the class. In addition, the same tables are available as an MS-Access database in the class data locker: http://mit.edu/11.521/data/11.521_lab3.mdb.


Lab #3 lookup example: spelling corrections for Boston parcel ownership

Build lookup table for Boston parcel ownership

Using lookup tables to merge local data with large, 'read-only' official databases

The book chapter explains some of the complexities and technical strategies involved in trying to empower communities by providing local access to parcel-level city databases. We would like to facilitate distributed access to centrally maintained databases - while allowing some capacity on the part of end-users to make changes in the 'official' data. Our example uses parcel-level databases of land use and ownership - a very detailed data layer that has recently become much more standardized and accessible.  To emphasize how often it is necessary to do extensive data processing before such data can be utilized effectively, we consider the problem of addressing  the 'spelling errors' in owner names that complicate efforts to determine ovnership patterns from typical parcel databases.  For example, there are at least 17 different spellings of the Boston Redevelopment Authority in the 'owner' field of the official parcel database for Boston. 

In parts of Lab #3 and Problem Set A, we use the methods discussed in the book chapter (i.e. local 'lookup' tables and SQL 'update' queries) to categorize the owners of the properties in the zoning variance database. After making sure we understand these techniques, we discus how these methods can provide a useful alternative to 'top-down' and 'bottom up' strategies for fixing the spelling errors.  In particular, they provide a 'middle-out' alternative that allows decentralized accumulation of useful knowledge that can be readily linked to official datasets outside the user's control.  Here are some queries that can will run in class and the powerpoint slides that outline the methods and ideas in the book chapter.

A set of notes illustrating these types of queries using the zoning variance database is available within the SQL Notes section of the class web page at this link: Grouping zoning applicants via 'lookup' tables.


More Advanced SQL Query Construction Techniques.

Review: SELECT Statement Syntax

Basic Syntax :

  SELECT expr1, expr2, expr3, ...
    FROM object1, object2, ...
   WHERE conditions
GROUP BY expr4, expr5, expr6, ...
  HAVING conditions
ORDER BY expr7, expr8, expr9, ...
Note that the order of the clauses matters! The clauses, if included, must appear in the order shown! Oracle will report an error if you make a mistake, but the error message (e.g., "ORA-00933: SQL command not properly ended") may not be very informative.

Ordering Rows Returned by a Query

First of all, almost every SQL statement you write should specify the way the rows will be sorted. That means you should include an ORDER BY clause in nearly every SQL SELECT statement. While examples are in the notes, we haven't emphasized how to sort in descending order. For this you can use the 'DESC' keyword after the expression you want to sort that way. (SQL also has an 'ASC' keyword for ascending order. Since it is the default, it can be omitted.)

The syntax looks like this:

    SELECT ...
    ORDER BY expr1 [ [ ASC | DESC ] , expr2 [ ASC | DESC ] ... ]
For example, let's sort the parcels by land use in ascending order and the square footage in descending order:
  SELECT parcelid, landuse, sqft
    FROM parcels
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        15          34800
        10          10900
         7 A        14000
        14 A        10000
        16 A         9600
        18 A         9500
        11 C       210000
         2 C       100000
        19 C        40000
         8 C        24800
         1 C        20000
         4 CL
         6 CM        2100
        20 E        50000
         5 E        25000
         9 R1        1800
        17 R1        1500
        13 R2
         3 R3        5500
        12 R3        5300

20 rows selected.
Notice that the query can mix the ASC and DESC keywords in a single ORDER BY clause.

Order of Operations with 'AND' and 'OR'

The Boolean operators 'AND' and 'OR' can do unexpected things when you combine them in a query and you're not careful. Suppose we want to find parcels that have a square footage of 5300 or 10000, and, of those, find the ones with land use code 'A'.

We might (incorrectly) write a query like this:

  SELECT parcelid, landuse, sqft
    FROM parcels
   WHERE sqft = 5300
      OR sqft = 10000
     AND landuse = 'A'
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        14 A        10000
        12 R3        5300
Notice that it returned a row with LANDUSE = 'R3'. That's not what we wanted! The problem is that the 'AND' and 'OR' operators, when mixed, are not processed in the sequence written, but rather follow an order of operations much as in algebra (exponentation before everything, then multiplication and division before addition and subtraction). In Boolean logic, 'AND' is like multiplication and 'OR' is like addition, and Oracle orders their processing accordingly. Hence, the query above is actually equivalent to this one:
  SELECT parcelid, landuse, sqft
    FROM parcels
   WHERE sqft = 5300
      OR (    sqft = 10000
          AND landuse = 'A')
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        14 A        10000
        12 R3        5300
Since the order of operations can surprise you at inconvenient times, you should always use parentheses to force the correct order whenever you mix 'AND' and 'OR' in a WHERE clause. Here is the correct way to write the query:
  SELECT parcelid, landuse, sqft
    FROM parcels
   WHERE (   SQFT = 10000
          OR SQFT = 5300)
     AND LANDUSE = 'A'
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        14 A        10000

SELECT DISTINCT

Normally, a query may return duplicate rows. For example, if we query the FIRES table to list the parcels that had fires, we'll find that the parcels that had more than one fire (parcels 2 and 3) show up multiple times:

  SELECT parcelid
    FROM fires
ORDER BY parcelid;
 
  PARCELID
----------
         2
         2
         3
         3
         7
        20
 
6 rows selected.
If we don't want to see the duplicates, we can add the keyword DISTINCT right after SELECT:
  SELECT DISTINCT parcelid
    FROM fires
ORDER BY parcelid;
 
  PARCELID
----------
         2
         3
         7
        20
 
4 rows selected.
Now parcels 2 and 3 show up only once. You only use the DISTINCT once, right after SELECT, to apply to the entire row. You do not not apply it to each column. Hence, this query is valid:
  SELECT DISTINCT p.onum, o.oname
    FROM parcels p, owners o
   WHERE p.onum = o.ownernum
ORDER BY p.onum;
 
     ONUM ONAME
---------- -----------------------
         9 PATRICK KING
        10 METHUINON  TRUST
        11 FERNANDO MARTINEZ
        18 JOHN MCCORMACK
        29 FRANK O'BRIEN
        32 GERALD RAPPAPORT
        38 BAY STATE, INC.
        55 THOMAS KELLIHER
        89 JOSEPH NOONAN
       100 MGH, INC.
       200 VANDELAY INDUSTRIES
 
11 rows selected.
However, the following incorrect query, with two DISTINCT keywords, generates the cryptic error message "ORA-00936: missing expression":
  SELECT DISTINCT p.onum, DISTINCT o.oname
    FROM parcels p, owners o
   WHERE p.onum = o.ownernum
ORDER BY p.onum;
Note also that you can use DISTINCT with the group functions (e.g., COUNT, AVG, STDDEV) to get these functions to consider only distinct values within a group: (do you see why the counts would not all be the same if you replace 'wpb' with 'sqft'?)
  SELECT COUNT(wpb) wpb_ct,
         COUNT(DISTINCT wpb) wpb_ct_distinct,
         COUNT(*) row_ct
    FROM parcels;
 
    WPB_CT WPB_CT_DISTINCT     ROW_CT
---------- --------------- ----------
        20              20         20
 
1 row selected.
Finally, note that COUNT(DISTINCT *) does not work. This result is not surprising iif you think about it. Why do you think this is the case?

Cartesian Products

What's wrong with this query?
SELECT p.parcelid, p.onum, o.ownernum, o.oname
  FROM parcels p, owners o;
This query returns 200 rows!

There are only 20 rows in PARCEL and 10 rows in OWNERS, so what's going on? The problem here is that this query has no WHERE clause that specifies how the PARCEL table relates to the OWNERS table. Without this information, Oracle does not know how to match a row in PARCEL to a corresponding row in OWNERS. What does it do instead? It matches every row in PARCEL to every row in OWNERS. Hence, we end up with:

(20 rows in PARCEL) matched to (10 rows in OWNERS) = 20 x 10 = 200 rows returned
This kind of unconstrained join is called a Cartesian product. This result is desireable only under rare circumstances. If you have queries that are returning a suspiciously large number of rows, you have probably unwittingly requested a Cartesian product. Note that for tables of even modest size the number of rows returned by a Cartesian product can be explosive. If you generate a Cartesian product of one table with 1,000 rows with another table with 2,000 rows, your query will return 1,000 x 2,000 = 2,000,000 rows! That's right--two million rows! Hence, you should very careful to avoid unintentional Cartesian products.

To fix this query, we need to specify how the owner numbers stored in the PARCEL table should be matched to owner numbers in the OWNERS table. In PARCEL, the owner numbers are stored in ONUM, while in OWNERS they are stored in OWNERNUM. Here is the repaired query with the appropriate join condition in the WHERE clause:

SELECT p.parcelid, p.onum, o.ownernum, o.oname
  FROM parcels p, owners o
 WHERE p.onum = o.ownernum;

This query returns 20 rows, which is definitely an improvement.

Note that a Cartesian product can easily be hidden in a query that requires multiple joins. Suppose we want to find all the papers, with associated keywords, written by authors with the last name WALKER. We could try this query:

COLUMN keyword FORMAT A20 TRUNC
COLUMN title   FORMAT A25 TRUNC

  SELECT a.lastname, a.fnamemi, k.keyword, t.title
    FROM keywords k, match m, titles t, authors a
   WHERE m.code = k.code
     AND t.paper = a.paper
     AND a.lastname = 'WALKER'
ORDER BY a.lastname, a.fnamemi, k.keyword, t.title;

This query returns a whopping 6174 rows! What's wrong? All of the tables appear to be involved in a join condition. The problem is that while the MATCH and KEYWORDS tables are tied together, and the TITLES and AUTHORS tables are linked to each other, nothing links these two sets of tables together. Adding another join condition fixes it:

  SELECT a.lastname, a.fnamemi, k.keyword, t.title
    FROM keywords k, match m, titles t, authors a
   WHERE m.code = k.code
     AND t.paper = a.paper
     AND m.paper = t.paper
     AND a.lastname = 'WALKER'
ORDER BY a.lastname, a.fnamemi, k.keyword, t.title;

This query returns a much more reasonable 14 rows. Note that we could have also specified "m.paper = a.paper" and it would have worked too. Why?

A Cartesian product can easily be hidden by a GROUP BY, since it will aggregate all the spurious rows, and you will not see the rows that made up the groups. Here's a variation on the earlier, broken example, now with a GROUP BY:

  SELECT a.lastname, a.fnamemi, count(k.keyword) keywords, count(t.title) titles
    FROM keywords k, match m, titles t, authors a
   WHERE m.code = k.code
     AND t.paper = a.paper
     AND a.lastname = 'WALKER'
GROUP BY a.lastname, a.fnamemi
ORDER BY a.lastname, a.fnamemi;

Because this query returns only counts, it's not obvious that the query is defective--unless you have an idea about what results are reasonable! Always scrutinize your results!

NULL Values

NULL is a special value that means "missing" or "unknown." It can be placed in a column of any type, unless the column has been declared NOT NULL, in which case NULL values are forbidden. A NULL is not the same as zero or any other value.

Special rules apply when NULL values are involved:

For more information on NULL values and how Oracle treats them, consult the Oracle documentation. With luck, their link to the Oracle 8i documentation hasn't changed much and you can find the SQL*Plus documentation here: http://download-west.oracle.com/docs/cd/A87860_01/doc/server.817/index.htm (Note also the link to an explanation of oracle error messages).

Outer Joins

Take a look at this query:

  SELECT p.parcelid, f.fdate
    FROM parcels p, fires f
   WHERE p.parcelid = f.parcelid
ORDER BY p.parcelid, f.fdate;

The query above returned 6 rows, but there are 20 parcels. Fourteen parcels seem to be missing. Where did they go? The answer is that Oracle will only list the parcels that occur in both tables, PARCELS and FIRES.

How do we get around this problem if we want to see all 20 rows in the parcel table, whether they match a record in FIRES or not? The answer is an outer join. The standard join is also known as an "inner join," meaning that the default behavior of not matching NULL values occurs. In an outer join, we explicitly tell Oracle that we want it to display NULL values that would otherwise be excluded by adding the (+) outer join operator to each column in the WHERE clause where the additional NULLs should appear.

    SELECT p.parcelid, f.fdate
    FROM parcels p, fires f
   WHERE p.parcelid = f.parcelid (+)
ORDER BY p.parcelid, f.fdate;
Note that the position of the outer join operator (+) is significant! This query will run but return a different result:
    SELECT p.parcelid, f.fdate
    FROM parcels p, fires f
   WHERE p.parcelid (+) = f.parcelid
ORDER BY p.parcelid, f.fdate;
Some additional class notes on outer joins are available. If you wish, you can peruse the Oracle 8i documentation. Be aware that, while the concept remains the same, the syntax for outer joins varies somewhat across vendors and, in particular, is different in MS-Access and SQL-server.

Inner vs. Outer Joins with GROUP BY

List all the parcels that had a fire, including the address and date of the fire (this requires an inner join): Repeat the same query, except list all the parcels, whether they had a fire or not (this requires an outer join):
  SELECT p.parcelid, p.add1, p.add2, f.fdate
    FROM parcels p, fires f
   WHERE p.parcelid = f.parcelid (+)
ORDER BY p.parcelid, f.fdate;
List the count of fires for the parcels that had a fire (this requires an inner join with grouping): Note that the query above lists the parcels in descending order of the count of fires. When specifying a group function or other expression in the ORDER BY clause, you must use the full expression, even if you defined a column alias for it in the SELECT list. In this case, we must use COUNT(FDATE) rather than FIRE_COUNT in the ORDER BY clause. The DESC keyword after COUNT(FDATE) indicates that we want the fire counts shown in descending order, rather than the default ascending order. You need to apply the DESC keyword to every expression in the ORDER BY clause that you want in descending order.

List the count of fires for all parcels, whether it experienced a fire or not (this requires an outer join with grouping):

Note that the query above uses an outer join on the FIRES table -- indicated by the outer join symbol (+) -- to include the parcels that are not listed in the FIRES table in the count. The outer join symbol indicates where a NULL should replace a real value if none is available in the table. Note that we need to use the outer join symbol with both columns in FIRES that we are joining with PARCEL.

The query below runs but returns the wrong result--some parcels that had no fires show up with one fire in the count. Why?

Views and Other Table-Like Database Objects

Views are the most useful of these alternate object types for our purposes. A view is a query that is stored in the database, then treated like a table. Unlike a table you create using CREATE TABLE ... AS SELECT, which creates a one-time snapshot of the data returned by the query, a view will reflect the current state of the tables in the underlying query. Hence, if the tables in the database are changing over time, the same query on a view may return different results at different times. Creating a view is similar to the CREATE TABLE ... AS SELECT statement; instead, use CREATE VIEW ... AS SELECT. For example:
CREATE VIEW parcel_owners
  AS SELECT p.parcelid, o.oname, P.ADD1, P.ADD2, P.ZIP
       FROM parcels p, owners o
      WHERE p.onum = o.ownernum(+);

CREATE VIEW owner_sqft
  AS SELECT o.oname, SUM(SQFT) TOTAL_SQFT
       FROM parcels p, owners o
      WHERE p.onum = o.ownernum
   GROUP BY o.oname;
Note that the column alias TOTAL_SQFT in the example above is required because Oracle needs to know what to name the column in the view. Do not include an ORDER BY clause in the SELECT statement that you use to create the view.

Once the view is created, it can be treated for (almost) all intents and purposes as a true table. You can describe them to see their structure:

DESCRIBE parcel_owners
DESCRIBE owner_sqft

The Data Dictionary

Information describing all the Oracle objects in the database is stored in the Oracle data dictionary, which you can access through a large number of data dictionary views. We can query data dictionary views just like any view or table. You're already familiar with the view CAT which provides you with a catalog of objects that you own:
SELECT * FROM cat;
Many other such views are available. The USER_SYNONYMS view mentioned above is one of them. The data dictionary view USER_VIEWS includes information about the views we just created. To see its structure, we can use the DESCRIBE statement in SQL*Plus:
SQL> DESCRIBE user_views
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 VIEW_NAME                                 NOT NULL VARCHAR2(30)
 TEXT_LENGTH                                        NUMBER
 TEXT                                               LONG
 TYPE_TEXT_LENGTH                                   NUMBER
 TYPE_TEXT                                          VARCHAR2(4000)
 OID_TEXT_LENGTH                                    NUMBER
 OID_TEXT                                           VARCHAR2(4000)
 VIEW_TYPE_OWNER                                    VARCHAR2(30)
 VIEW_TYPE                                          VARCHAR2(30)
To see the definition of the views we just created, we can use the following statements:
SET LONG 5000
SELECT view_name, text
  FROM user_views
 WHERE view_name IN ('PARCEL_OWNERS', 'OWNER_SQFT');
Note that the column TEXT has type "LONG". In order to ensure that SQL*Plus displays this LONG column properly, we used the "SET LONG 5000" statement before running the query on USER_VIEWS.

Data Manipulation Language

SELECT statements view or query the contents of tables. With Data Manipulation Language (DML) statements, we can alter the contents of the tables. DML statements include:

INSERT: Add Rows to a Table

General syntax:

Example: Add a row to the FIRES table

DELETE: Delete Rows from a Table

General syntax: Example: Delete fires with losses less than $50000 from the database

UPDATE: Modify Data in a Table

General syntax:

Example: Change the building value of a particular parcel

Note: You can also update a table using a subquery. This typically involves one or more correlated subqueries. However, correlated subqueries are beyond this overview and not standardized in the SQL syntax used by the major vendors.

Transaction Control

Transaction control statements allow several SQL statements to be grouped together into a unit (a transaction) that are either processed or rejected as a whole. These statements include:

Normally, we are not particularly concerned about transaction control since we are focused on analyzing and cross-referencing existing databases to which we have read-only access and we are generating new data only in temporary tables or personal lookup tables that do not need to be shared. However, we will need to concern ourselves with COMMIT and ROLLBACK for exercises involving the use of data manipulation language terms such as UPDATE. If we run an UPDATE query that fails due to a SQL error, then any pending changes to the database tables that have occurred since the last COMMIT will be rolled back.

Data Definition Language

Data Definition Language (DDL) statements affect the structure and security provisions of the database, among other things.

Create and Drop Objects

Create new objects

Drop objects permanently

DROP TABLE table1;

DROP VIEW view1;

DROP INDEX index1;

Modify existing objects

ALTER TABLE table1;

ALTER VIEW view1;

ALTER INDEX index1;
 

Access Privileges for Objects

Specific privileges on tables can be given to individual users

Privileges can be revoked

 

More examples of advanced query construction

Examining Views and other system tables: Suppose we have taken advantage of Oracle to save complex queries as views that can be imported into ArcMap. How can we find the definition of a view?

describe all_views
set linesize 100 
column text format A40
      

select substr(view_name,1,12) as name, text_length, text from ALL_VIEWS where owner = 'JF';
  grant select on madensity to public;
  create public synonym v_madensity for madensity

Computing the percentage of East Boston parcels that are triple-deckers (LU=R3)

Compute same percentage in a single SQL query:

select 100 * n.r3_count / d.all_lu 
  from (select count(LU) R3_count from bos05eb where LU = 'R3') n,
       (select count(LU) all_lu from bos05eb) d; 

 


Home | Syllabus | Lectures | Labs | CRN | 37-312 lab | MIT

Written February, 2001-2006 (Tom Grayson, Joe Ferreira), Last Modified 26 February 2009 (jf)