Massachusetts Institute of Technology - Department of Urban Studies and Planning


11.521
Spatial Database Management and Advanced Geographic Information Systems
11.523
Fundamentals of Spatial Database Management

Advanced Queries, Referential Integrity, and Intro to Relational Database Design

1 March 2018


Administrative

Today


Practice thinking relationally

What SQL query answers the following questions:

In lab#2, when you ran q_eb05new to join your expanded t_noncondo table with MASS_LANDUSE, you may have noticed that q_eb05new produced a table with one fewer record than t_noncondo. Which record dropped out?

SELECT bos05eb.PTYPE 
  FROM bos05eb 
 WHERE bos05eb.ptype not in (select stclass_txt from mass_landuse); 

Or, equivalently:

SELECT bos05eb.PTYPE 
  FROM bos05eb
 WHERE NOT EXISTS(SELECT NULL from mass_landuse m
                   WHERE bos05eb.ptype = m.stclass_txt); 

Let's figure out how to do the same query in MS-Access - by pasting the SQL into MS-Access ...

Lab #4 - Mapping zoning variances - Helpful Notes

Geocoding location

Creating pin-map of locations (data conversion, coordinate system, pin-map from data table)

Symbolization

Spatial join (point-in-polygon operation)

Exporting data tables back to Postgres or MS-Access from ArcGIS

Miscellaneous other Lab #4 Notes


More Advanced SQL Query Construction Techniques.

Review: SELECT Statement Syntax

Basic Syntax :

  SELECT expr1, expr2, expr3, ...
    FROM object1, object2, ...
   WHERE conditions
GROUP BY expr4, expr5, expr6, ...
  HAVING conditions
ORDER BY expr7, expr8, expr9, ...

Note that the order of the clauses matters! The clauses, if included, must appear in the order shown! Oracle will report an error if you make a mistake, but the error message (e.g., "ORA-00933: SQL command not properly ended") may not be very informative.

Nested Queries: SELECT statements in the FROM clause

Compute the percentage of East Boston parcels that are triple-deckers (LU=R3)

Method 1: 'Brute Force' but reliable and easy to understand

create view v_num as
(select count(ptype) as R3_count 
   from bos05eb 
  where ptype = '105');
create view v_denom as 
(select count(ptype) as all_ptype 
   from bos05eb); 

Ordering Rows Returned by a Query

First of all, almost every SQL statement you write should specify the way the rows will be sorted. That means you should include an ORDER BY clause in nearly every SQL SELECT statement. While examples are in the notes, we haven't emphasized how to sort in descending order. For this you can use the 'DESC' keyword after the expression you want to sort that way. (SQL also has an 'ASC' keyword for ascending order. Since it is the default, it can be omitted.)

The syntax looks like this:

    SELECT ...
    ORDER BY expr1 [ [ ASC | DESC ] , expr2 [ ASC | DESC ] ... ]
For example, let's sort the parcels by land use in ascending order and the square footage in descending order:
  SELECT parcelid, landuse, sqft
    FROM parcels
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        15          34800
        10          10900
         7 A        14000
        14 A        10000
        16 A         9600
        18 A         9500
        11 C       210000
         2 C       100000
        19 C        40000
         8 C        24800
         1 C        20000
         4 CL
         6 CM        2100
        20 E        50000
         5 E        25000
         9 R1        1800
        17 R1        1500
        13 R2
         3 R3        5500
        12 R3        5300

20 rows selected.
Notice that the query can mix the ASC and DESC keywords in a single ORDER BY clause.

Order of Operations with 'AND' and 'OR'

The Boolean operators 'AND' and 'OR' can do unexpected things when you combine them in a query and you're not careful. Suppose we want to find parcels that have a square footage of 5300 or 10000, and, of those, find the ones with land use code 'A'.

We might (incorrectly) write a query like this:

  SELECT parcelid, landuse, sqft
    FROM parcels
   WHERE sqft = 5300
      OR sqft = 10000
     AND landuse = 'A'
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        14 A        10000
        12 R3        5300
Notice that it returned a row with LANDUSE = 'R3'. That's not what we wanted! The problem is that the 'AND' and 'OR' operators, when mixed, are not processed in the sequence written, but rather follow an order of operations much as in algebra (exponentation before everything, then multiplication and division before addition and subtraction). In Boolean logic, 'AND' is like multiplication and 'OR' is like addition, and Oracle orders their processing accordingly. Hence, the query above is actually equivalent to this one:
  SELECT parcelid, landuse, sqft
    FROM parcels
   WHERE sqft = 5300
      OR (    sqft = 10000
          AND landuse = 'A')
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        14 A        10000
        12 R3        5300
Since the order of operations can surprise you at inconvenient times, you should always use parentheses to force the correct order whenever you mix 'AND' and 'OR' in a WHERE clause. Here is the correct way to write the query:
  SELECT parcelid, landuse, sqft
    FROM parcels
   WHERE (   SQFT = 10000
          OR SQFT = 5300)
     AND LANDUSE = 'A'
ORDER BY landuse ASC, sqft DESC;

  PARCELID LAN       SQFT
---------- --- ----------
        14 A        10000

SELECT DISTINCT

Normally, a query may return duplicate rows. For example, if we query the FIRES table to list the parcels that had fires, we'll find that the parcels that had more than one fire (parcels 2 and 3) show up multiple times:

  SELECT parcelid
    FROM fires
ORDER BY parcelid;
 
  PARCELID
----------
         2
         2
         3
         3
         7
        20
 
6 rows selected.
If we don't want to see the duplicates, we can add the keyword DISTINCT right after SELECT:
  SELECT DISTINCT parcelid
    FROM fires
ORDER BY parcelid;
 
  PARCELID
----------
         2
         3
         7
        20
 
4 rows selected.
Now parcels 2 and 3 show up only once. You only use the DISTINCT once, right after SELECT, to apply to the entire row. You do not not apply it to each column. Hence, this query is valid:
  SELECT DISTINCT p.onum, o.oname
    FROM parcels p, owners o
   WHERE p.onum = o.ownernum
ORDER BY p.onum;
 
     ONUM ONAME
---------- -----------------------
         9 PATRICK KING
        10 METHUINON  TRUST
        11 FERNANDO MARTINEZ
        18 JOHN MCCORMACK
        29 FRANK O'BRIEN
        32 GERALD RAPPAPORT
        38 BAY STATE, INC.
        55 THOMAS KELLIHER
        89 JOSEPH NOONAN
       100 MGH, INC.
       200 VANDELAY INDUSTRIES
 
11 rows selected.
However, the following incorrect query, with two DISTINCT keywords, generates the cryptic error message "ORA-00936: missing expression":
  SELECT DISTINCT p.onum, DISTINCT o.oname
    FROM parcels p, owners o
   WHERE p.onum = o.ownernum
ORDER BY p.onum;
Note also that you can (in Postgres, Oracle, SQL-server, etc. but *not* in MS-Access) use DISTINCT within the group functions (e.g., COUNT, AVG, STDDEV) to consider only the distinct values within a group: (do you see why the counts would not all be the same if you replace 'wpb' with 'sqft'?)
  SELECT COUNT(wpb) wpb_ct,
         COUNT(DISTINCT wpb) wpb_ct_distinct,
         COUNT(*) row_ct
    FROM parcels;
 
    WPB_CT WPB_CT_DISTINCT     ROW_CT
---------- --------------- ----------
        20              20         20
 
1 row selected.
Finally, note that COUNT(DISTINCT *) does not work. This result is not surprising if you think about it. Why do you think this is the case?

Cartesian Products

What's wrong with this query?
SELECT p.parcelid, p.onum, o.ownernum, o.oname
  FROM parcels p, owners o;
This query returns 200 rows!

There are only 20 rows in PARCEL and 10 rows in OWNERS, so what's going on? The problem here is that this query has no WHERE clause that specifies how the PARCEL table relates to the OWNERS table. Without this information, Oracle does not know how to match a row in PARCEL to a corresponding row in OWNERS. What does it do instead? It matches every row in PARCEL to every row in OWNERS. Hence, we end up with:

(20 rows in PARCEL) matched to (10 rows in OWNERS) = 20 x 10 = 200 rows returned
This kind of unconstrained join is called a Cartesian product. This result is desireable only under rare circumstances. If you have queries that are returning a suspiciously large number of rows, you have probably unwittingly requested a Cartesian product. Note that for tables of even modest size the number of rows returned by a Cartesian product can be explosive. If you generate a Cartesian product of one table with 1,000 rows with another table with 2,000 rows, your query will return 1,000 x 2,000 = 2,000,000 rows! That's right--two million rows! Hence, you should very careful to avoid unintentional Cartesian products.

To fix this query, we need to specify how the owner numbers stored in the PARCEL table should be matched to owner numbers in the OWNERS table. In PARCEL, the owner numbers are stored in ONUM, while in OWNERS they are stored in OWNERNUM. Here is the repaired query with the appropriate join condition in the WHERE clause:

SELECT p.parcelid, p.onum, o.ownernum, o.oname
  FROM parcels p, owners o
 WHERE p.onum = o.ownernum;

This query returns 20 rows, which is definitely an improvement.

Note that an unintended Cartesian product can easily be hidden in a query that joins three or more tables or in a query with joins plus a GROUP BY expression (because it will not be obvious that too many rows were collapsed into the rows with the unique 'group by' values).

NULL Values

NULL is a special value that means "missing" or "unknown." It can be placed in a column of any type, unless the column has been declared NOT NULL, in which case NULL values are forbidden. A NULL is not the same as zero or any other value.

Special rules apply when NULL values are involved:

For more information on NULL values and how Postgrestreats them, use Google or another search engine with keywords such as 'postgres SQL NULL in group by'. 

Outer Joins

Take a look at this query:

  SELECT p.parcelid, f.fdate
    FROM parcels p, fires f
   WHERE p.parcelid = f.parcelid
ORDER BY p.parcelid, f.fdate;

The query above returned 6 rows, but there are 20 parcels. Fourteen parcels seem to be missing. Where did they go? The answer is that the SQL query only listed the parcels that occur in both tables, PARCELS and FIRES.

How do we get around this problem if we want to see all 20 rows in the parcel table, whether they match a record in FIRES or not? The answer is an outer join. The standard join is also known as an "inner join," meaning that the default behavior of not matching NULL values occurs. In an outer join, we explicitly tell our SQL engine that we want it to display NULL values that would otherwise be excluded to each column in the WHERE clause where the additional NULLs should appear.The syntax is the same as MS-Access uses to specify joins:

    SELECT p.parcelid, f.fdate
      FROM parcels p left outer join fires f on p.parcelid = f.parcelid
  ORDER BY p.parcelid, f.fdate;
Note the difference between 'left' and 'right' outer joins.  This query will run but return a different result:
    SELECT p.parcelid, f.fdate
      FROM parcels p right outer join fires f on p.parcelid = f.parcelid
  ORDER BY p.parcelid, f.fdate;

Inner vs. Outer Joins with GROUP BY

List all the parcels that had a fire, including the address and date of the fire (this requires an inner join): Redo with inner join explicitly required in the 'from' clause:  Repeat the same query, except list all the parcels, whether they had a fire or not (this requires an outer join):
  SELECT p.parcelid, p.add1, p.add2, f.fdate
    FROM parcels p left outer join fires f on p.parcelid = f.parcelid
ORDER BY p.parcelid, f.fdate;
List the count of fires for the parcels that had a fire (this requires an inner join with grouping): Note that the query above lists the parcels in descending order of the count of fires. When specifying a group function or other expression in the ORDER BY clause, you must use the full expression, even if you defined a column alias for it in the SELECT list. In this case, we must use COUNT(FDATE) rather than FIRE_COUNT in the ORDER BY clause. The DESC keyword after COUNT(FDATE) indicates that we want the fire counts shown in descending order, rather than the default ascending order. You need to apply the DESC keyword to every expression in the ORDER BY clause that you want in descending order.

List the count of fires for all parcels, whether it experienced a fire or not (this requires an outer join with grouping):

Note that the query above uses an outer join on the FIRES table -- indicated by the outer join symbol (+) -- to include the parcels that are not listed in the FIRES table in the count. The outer join symbol indicates where a NULL should replace a real value if none is available in the table. Note that, if joining PARCELS and FIRES required matching two columns instead of just one, then we would need to use the outer join symbol on the FIRES side of the join for both columns.

The query below runs but returns the wrong result--some parcels that had no fires show up with one fire in the count. Why?

Views and Other Table-Like Database Objects

Views are the most useful of these alternate object types for our purposes. A view is a query that is stored in the database, then treated like a table. Unlike a table you create using CREATE TABLE ... AS SELECT, which creates a one-time snapshot of the data returned by the query, a view will reflect the current state of the tables in the underlying query. Hence, if the tables in the database are changing over time, the same query on a view may return different results at different times. Creating a view is similar to the CREATE TABLE ... AS SELECT statement; instead, use CREATE VIEW ... AS SELECT. For example:

CREATE VIEW parcel_owners
  AS SELECT p.parcelid, o.oname, P.ADD1, P.ADD2, P.ZIP
       FROM parcels p, owners o
      WHERE p.onum = o.ownernum(+);

CREATE VIEW owner_sqft
  AS SELECT o.oname, SUM(SQFT) TOTAL_SQFT
       FROM parcels p, owners o
      WHERE p.onum = o.ownernum
   GROUP BY o.oname;

Data Manipulation Language

SELECT statements view or query the contents of tables. With Data Manipulation Language (DML) statements, we can alter the contents of the tables. DML statements include:

INSERT: Add Rows to a Table

General syntax:

Example: Add a row to the FIRES table

DELETE: Delete Rows from a Table

General syntax: Example: Delete fires with losses less than $50000 from the database

UPDATE: Modify Data in a Table

General syntax:

Example: Change the building value of a particular parcel

Note: You can also update a table using a subquery. This typically involves one or more correlated subqueries. However, correlated subqueries are beyond this overview and not standardized in the SQL syntax used by the major vendors.

Transaction Control

Transaction control statements allow several SQL statements to be grouped together into a unit (a transaction) that are either processed or rejected as a whole. These statements include:

Normally, we are not particularly concerned about transaction control since we are focused on analyzing and cross-referencing existing databases to which we have read-only access and we are generating new data only in temporary tables or personal lookup tables that do not need to be shared. However, we will need to concern ourselves with COMMIT and ROLLBACK for exercises involving the use of data manipulation language terms such as UPDATE. If we run an UPDATE query that fails due to a SQL error, then any pending changes to the database tables that have occurred since the last COMMIT will be rolled back.

Data Definition Language

Data Definition Language (DDL) statements affect the structure and security provisions of the database, among other things.

Create and Drop Objects

Create new objects

Drop objects permanently

DROP TABLE table1;

DROP VIEW view1;

DROP INDEX index1;

Modify existing objects

ALTER TABLE table1 ADD COLUMN columname integer;

ALTER VIEW IF EXISTS view1 OWNER TO new_owner;

ALTER INDEX IF EXISTS index1 RENAME TO index2;
 

Access Privileges for Objects

Specific privileges on tables can be given to individual users

Privileges can be revoked


Home | Syllabus | Lectures | Labs | CRN | 37-312 lab | MIT

Written February, 2001-2006 (Tom Grayson, Joe Ferreira), Last Modified 28 February 2018 (jf)