| 11.521 | Spatial
Database Management and Advanced Geographic Information Systems |
| 11.523 | Fundamentals
of Spatial Database Management |
What SQL query answers the following questions:
- Does every Assessing record in BOS05EB have a corresponding 'ground' parcel in ebos_parcels05.shp?
- Does every ground parcel have an Assessing record in BOS05EB?
- Is every property type code in BOS05EB explained in the MASS_LANDUSE lookup table?
- Which land uses in MASS_LANDUSE are not found within East Boston?
In lab#2, when you ran q_eb05new to join your expanded t_noncondo table with MASS_LANDUSE, you may have noticed that q_eb05new produced a table with one fewer record than t_noncondo. Which record dropped out?
SELECT bos05eb.PTYPE FROM bos05eb WHERE bos05eb.ptype not in (select stclass_txt from mass_landuse);Or, equivalently:
SELECT bos05eb.PTYPE FROM bos05eb WHERE NOT EXISTS(SELECT NULL from mass_landuse m WHERE bos05eb.ptype = m.stclass_txt);Let's figure out how to do the same query in MS-Access - by pasting the SQL into MS-Access ...
set linesize 100
set pagesize 40
set pause on
set pause more...
select * from cat;
Schema of ZONING table (and related lookup tables with codes for Decision, Use, NSA, and Neighbrhd) |
|
Schema of 1980 Boston Census data (grouped by neighborhood statistical area, NSA) |
|
Annotated SQL queries of ZONING table |
|
Annotated SQL queries illustrating use of lookup tables to categorize ownership of properties seeking zoning variances |
|
Stages of evolution of the ZONING variance database |
The ZONING table and associated lookup tables are in our Oracle database and accessible to everyone in the class. In addition, the same tables are available as an MS-Access database in the class data locker: http://mit.edu/11.521/data/11.521_lab3.mdb.
Build lookup table for Boston parcel ownership
- Book Chapter describing 'lookup table' example: http://mit.edu/11.521/papers/techcity_7ferreira.pdf
- Queries to produce results discussed in book chapter: lookup_inclass_01feb22.html
Using lookup tables to merge local data with large, 'read-only' official databases
The book chapter explains some of the complexities and technical strategies involved in trying to empower communities by providing local access to parcel-level city databases. We would like to facilitate distributed access to centrally maintained databases - while allowing some capacity on the part of end-users to make changes in the 'official' data. Our example uses parcel-level databases of land use and ownership - a very detailed data layer that has recently become much more standardized and accessible. To emphasize how often it is necessary to do extensive data processing before such data can be utilized effectively, we consider the problem of addressing the 'spelling errors' in owner names that complicate efforts to determine ovnership patterns from typical parcel databases. For example, there are at least 17 different spellings of the Boston Redevelopment Authority in the 'owner' field of the official parcel database for Boston.
In parts of Lab #3 and Problem Set A, we use the methods discussed in the book chapter (i.e. local 'lookup' tables and SQL 'update' queries) to categorize the owners of the properties in the zoning variance database. After making sure we understand these techniques, we discus how these methods can provide a useful alternative to 'top-down' and 'bottom up' strategies for fixing the spelling errors. In particular, they provide a 'middle-out' alternative that allows decentralized accumulation of useful knowledge that can be readily linked to official datasets outside the user's control. Here are some queries that can will run in class and the powerpoint slides that outline the methods and ideas in the book chapter.
A set of notes illustrating these types of queries using the zoning variance database is available within the SQL Notes section of the class web page at this link: Grouping zoning applicants via 'lookup' tables.
SELECT expr1, expr2, expr3, ...
FROM object1, object2, ...
WHERE conditions
GROUP BY expr4, expr5, expr6, ...
HAVING conditions
ORDER BY expr7, expr8, expr9, ...
Note that the order of the clauses matters! The clauses, if included, must appear
in the order shown! Oracle will report an error if you make a mistake, but the
error message (e.g., "ORA-00933: SQL command not properly ended") may not be very
informative.
The syntax looks like this:
SELECT ...
ORDER BY expr1 [ [ ASC | DESC ] , expr2 [ ASC | DESC ] ... ]
For example, let's sort the parcels by land use in ascending order
and the square footage in descending order:
SELECT parcelid, landuse, sqft
FROM parcels
ORDER BY landuse ASC, sqft DESC;
PARCELID LAN SQFT
---------- --- ----------
15 34800
10 10900
7 A 14000
14 A 10000
16 A 9600
18 A 9500
11 C 210000
2 C 100000
19 C 40000
8 C 24800
1 C 20000
4 CL
6 CM 2100
20 E 50000
5 E 25000
9 R1 1800
17 R1 1500
13 R2
3 R3 5500
12 R3 5300
20 rows selected.
Notice that the query can mix the ASC and DESC keywords in a single ORDER BY clause.
We might (incorrectly) write a query like this:
SELECT parcelid, landuse, sqft
FROM parcels
WHERE sqft = 5300
OR sqft = 10000
AND landuse = 'A'
ORDER BY landuse ASC, sqft DESC;
PARCELID LAN SQFT
---------- --- ----------
14 A 10000
12 R3 5300
Notice that it returned a row with LANDUSE = 'R3'. That's not what we wanted!
The problem is that the 'AND' and 'OR' operators, when mixed, are not processed
in the sequence written, but rather follow an order of operations much as in algebra
(exponentation before everything, then multiplication and division before addition
and subtraction). In Boolean logic, 'AND' is like multiplication and 'OR' is like
addition, and Oracle orders their processing accordingly. Hence, the query above
is actually equivalent to this one:
SELECT parcelid, landuse, sqft
FROM parcels
WHERE sqft = 5300
OR ( sqft = 10000
AND landuse = 'A')
ORDER BY landuse ASC, sqft DESC;
PARCELID LAN SQFT
---------- --- ----------
14 A 10000
12 R3 5300
Since the order of operations can surprise you at inconvenient times, you should always use parentheses to force the correct order whenever you mix 'AND'
and 'OR' in a WHERE clause. Here is the correct way to write the query:
SELECT parcelid, landuse, sqft
FROM parcels
WHERE ( SQFT = 10000
OR SQFT = 5300)
AND LANDUSE = 'A'
ORDER BY landuse ASC, sqft DESC;
PARCELID LAN SQFT
---------- --- ----------
14 A 10000
Normally, a query may return duplicate rows. For example, if we query the FIRES table to list the parcels that had fires, we'll find that the parcels that had more than one fire (parcels 2 and 3) show up multiple times:
SELECT parcelid
FROM fires
ORDER BY parcelid;
PARCELID
----------
2
2
3
3
7
20
6 rows selected.
If we don't want to see the duplicates, we can add the keyword DISTINCT right
after SELECT:
SELECT DISTINCT parcelid
FROM fires
ORDER BY parcelid;
PARCELID
----------
2
3
7
20
4 rows selected.
Now parcels 2 and 3 show up only once.
You only use the DISTINCT once, right after SELECT, to apply to the entire
row. You do not not apply it to each column. Hence, this query is valid:
SELECT DISTINCT p.onum, o.oname
FROM parcels p, owners o
WHERE p.onum = o.ownernum
ORDER BY p.onum;
ONUM ONAME
---------- -----------------------
9 PATRICK KING
10 METHUINON TRUST
11 FERNANDO MARTINEZ
18 JOHN MCCORMACK
29 FRANK O'BRIEN
32 GERALD RAPPAPORT
38 BAY STATE, INC.
55 THOMAS KELLIHER
89 JOSEPH NOONAN
100 MGH, INC.
200 VANDELAY INDUSTRIES
11 rows selected.
However, the following incorrect query, with two DISTINCT keywords, generates
the cryptic error message "ORA-00936: missing expression":
SELECT DISTINCT p.onum, DISTINCT o.oname
FROM parcels p, owners o
WHERE p.onum = o.ownernum
ORDER BY p.onum;
Note also that you can use DISTINCT with the group functions (e.g., COUNT,
AVG, STDDEV) to get these functions to consider only distinct values within a
group: (do you see why the counts would not all be the same if you replace 'wpb' with 'sqft'?)
SELECT COUNT(wpb) wpb_ct,
COUNT(DISTINCT wpb) wpb_ct_distinct,
COUNT(*) row_ct
FROM parcels;
WPB_CT WPB_CT_DISTINCT ROW_CT
---------- --------------- ----------
20 20 20
1 row selected.
Finally, note that COUNT(DISTINCT *) does not work. This result is not surprising iif
you think about it. Why do you think this is the case?
This query returns 200 rows!SELECT p.parcelid, p.onum, o.ownernum, o.oname FROM parcels p, owners o;
There are only 20 rows in PARCEL and 10 rows in OWNERS, so what's going on? The problem here is that this query has no WHERE clause that specifies how the PARCEL table relates to the OWNERS table. Without this information, Oracle does not know how to match a row in PARCEL to a corresponding row in OWNERS. What does it do instead? It matches every row in PARCEL to every row in OWNERS. Hence, we end up with:
(20 rows in PARCEL) matched to (10 rows in OWNERS) = 20 x 10 = 200 rows returnedThis kind of unconstrained join is called a Cartesian product. This result is desireable only under rare circumstances. If you have queries that are returning a suspiciously large number of rows, you have probably unwittingly requested a Cartesian product. Note that for tables of even modest size the number of rows returned by a Cartesian product can be explosive. If you generate a Cartesian product of one table with 1,000 rows with another table with 2,000 rows, your query will return 1,000 x 2,000 = 2,000,000 rows! That's right--two million rows! Hence, you should very careful to avoid unintentional Cartesian products.
To fix this query, we need to specify how the owner numbers stored in the PARCEL table should be matched to owner numbers in the OWNERS table. In PARCEL, the owner numbers are stored in ONUM, while in OWNERS they are stored in OWNERNUM. Here is the repaired query with the appropriate join condition in the WHERE clause:
SELECT p.parcelid, p.onum, o.ownernum, o.oname FROM parcels p, owners o WHERE p.onum = o.ownernum;
This query returns 20 rows, which is definitely an improvement.
Note that a Cartesian product can easily be hidden in a query that requires multiple joins. Suppose we want to find all the papers, with associated keywords, written by authors with the last name WALKER. We could try this query:
COLUMN keyword FORMAT A20 TRUNC
COLUMN title FORMAT A25 TRUNC
SELECT a.lastname, a.fnamemi, k.keyword, t.title
FROM keywords k, match m, titles t, authors a
WHERE m.code = k.code
AND t.paper = a.paper
AND a.lastname = 'WALKER'
ORDER BY a.lastname, a.fnamemi, k.keyword, t.title;
This query returns a whopping 6174 rows! What's wrong? All of the tables appear to be involved in a join condition. The problem is that while the MATCH and KEYWORDS tables are tied together, and the TITLES and AUTHORS tables are linked to each other, nothing links these two sets of tables together. Adding another join condition fixes it:
SELECT a.lastname, a.fnamemi, k.keyword, t.title
FROM keywords k, match m, titles t, authors a
WHERE m.code = k.code
AND t.paper = a.paper
AND m.paper = t.paper
AND a.lastname = 'WALKER'
ORDER BY a.lastname, a.fnamemi, k.keyword, t.title;
This query returns a much more reasonable 14 rows. Note that we could have also specified "m.paper = a.paper" and it would have worked too. Why?
A Cartesian product can easily be hidden by a GROUP BY, since it will aggregate all the spurious rows, and you will not see the rows that made up the groups. Here's a variation on the earlier, broken example, now with a GROUP BY:
SELECT a.lastname, a.fnamemi, count(k.keyword) keywords, count(t.title) titles
FROM keywords k, match m, titles t, authors a
WHERE m.code = k.code
AND t.paper = a.paper
AND a.lastname = 'WALKER'
GROUP BY a.lastname, a.fnamemi
ORDER BY a.lastname, a.fnamemi;
Because this query returns only counts, it's not obvious that the query is
defective--unless you have an idea about what results are reasonable! Always
scrutinize your results!
NULL is a special value that means "missing" or "unknown." It can be placed in a column of any type, unless the column has been declared NOT NULL, in which case NULL values are forbidden. A NULL is not the same as zero or any other value.
Special rules apply when NULL values are involved:
For more information on NULL values and how Oracle treats them, consult the Oracle documentation. With luck, their link to the Oracle 8i documentation hasn't changed much and you can find the SQL*Plus documentation here: http://download-west.oracle.com/docs/cd/A87860_01/doc/server.817/index.htm (Note also the link to an explanation of oracle error messages).
Take a look at this query:
SELECT p.parcelid, f.fdate
FROM parcels p, fires f
WHERE p.parcelid = f.parcelid
ORDER BY p.parcelid, f.fdate;
The query above returned 6 rows, but there are 20 parcels. Fourteen parcels seem to be missing. Where did they go? The answer is that Oracle will only list the parcels that occur in both tables, PARCELS and FIRES.
How do we get around this problem if we want to see all 20 rows in the parcel table, whether they match a record in FIRES or not? The answer is an outer join. The standard join is also known as an "inner join," meaning that the default behavior of not matching NULL values occurs. In an outer join, we explicitly tell Oracle that we want it to display NULL values that would otherwise be excluded by adding the (+) outer join operator to each column in the WHERE clause where the additional NULLs should appear.
SELECT p.parcelid, f.fdate
FROM parcels p, fires f
WHERE p.parcelid = f.parcelid (+)
ORDER BY p.parcelid, f.fdate;
Note that the position of the outer join operator (+) is significant! This
query will run but return a different result:
SELECT p.parcelid, f.fdate
FROM parcels p, fires f
WHERE p.parcelid (+) = f.parcelid
ORDER BY p.parcelid, f.fdate;
Some additional class notes on outer joins are available. If you wish, you can peruse the Oracle 8i documentation. Be aware
that, while the concept remains the same, the syntax for outer joins varies somewhat
across vendors and, in particular, is different in MS-Access and SQL-server.
SELECT p.parcelid, p.add1, p.add2, f.fdate
FROM parcels p, fires f
WHERE p.parcelid = f.parcelid
ORDER BY p.parcelid, f.fdate;
Repeat the same query, except list all the parcels, whether they
had a fire or not (this requires an outer join):
SELECT p.parcelid, p.add1, p.add2, f.fdate
FROM parcels p, fires f
WHERE p.parcelid = f.parcelid (+)
ORDER BY p.parcelid, f.fdate;
List the count of fires for the parcels that had a fire (this requires
an inner join with grouping):
SELECT p.parcelid, COUNT(fdate) fire_count
FROM parcels p, fires f
WHERE p.parcelid = f.parcelid
GROUP BY P.PARCELID
ORDER BY COUNT(fdate) DESC, p.parcelid;
Note that the query above lists the parcels in descending order of the count of
fires. When specifying a group function or other expression in the ORDER BY clause,
you must use the full expression, even if you defined a column alias for
it in the SELECT list. In this case, we must use COUNT(FDATE) rather than FIRE_COUNT
in the ORDER BY clause. The DESC keyword after COUNT(FDATE) indicates that we
want the fire counts shown in descending order, rather than the default ascending
order. You need to apply the DESC keyword to every expression in the ORDER
BY clause that you want in descending order.
List the count of fires for all parcels, whether it experienced a fire or not (this requires an outer join with grouping):
SELECT p.parcelid, COUNT(fdate) fire_count
FROM parcels p, fires f
WHERE p.parcelid = f.parcelid (+)
GROUP BY p.parcelid
ORDER BY COUNT(fdate) DESC, p.parcelid;
Note that the query above uses an outer join on the FIRES table -- indicated
by the outer join symbol (+) -- to include the parcels that are not listed
in the FIRES table in the count. The outer join symbol indicates where a NULL
should replace a real value if none is available in the table. Note that we need
to use the outer join symbol with both columns in FIRES that we are joining
with PARCEL.
The query below runs but returns the wrong result--some parcels that had no fires show up with one fire in the count. Why?
SELECT p.parcelid, COUNT(*) bogus_fire_count
FROM parcels p, fires f
WHERE p.parcelid = f.parcelid (+)
GROUP BY p.parcelid
ORDER BY COUNT(*) DESC, p.parcelid;
A synonym is simply a second name for an existing object. These are particularly convenient when a table is owned by another user (or, stated differently, stored in a different schema). You have been using synonyms all along for the objects in the PARCELS, URISA, and ZONING databases. To see the synonyms, use this query:
SELECT synonym_name, table_owner, table_name FROM user_synonyms ORDER BY synonym_name;
CREATE VIEW parcel_owners
AS SELECT p.parcelid, o.oname, P.ADD1, P.ADD2, P.ZIP
FROM parcels p, owners o
WHERE p.onum = o.ownernum(+);
CREATE VIEW owner_sqft
AS SELECT o.oname, SUM(SQFT) TOTAL_SQFT
FROM parcels p, owners o
WHERE p.onum = o.ownernum
GROUP BY o.oname;
Note that the column alias TOTAL_SQFT in the example above is required because Oracle needs to know what to name the column in the view. Do not include an ORDER BY clause in the SELECT statement that you use to create the
view.
Once the view is created, it can be treated for (almost) all intents and purposes as a true table. You can describe them to see their structure:
DESCRIBE parcel_ownersDESCRIBE owner_sqft
Many other such views are available. The USER_SYNONYMS view mentioned above is one of them. The data dictionary view USER_VIEWS includes information about the views we just created. To see its structure, we can use the DESCRIBE statement in SQL*Plus:SELECT * FROM cat;
To see the definition of the views we just created, we can use the following statements:SQL> DESCRIBE user_viewsName Null? Type ----------------------------------------- -------- ---------------------------- VIEW_NAME NOT NULL VARCHAR2(30) TEXT_LENGTH NUMBER TEXT LONG TYPE_TEXT_LENGTH NUMBER TYPE_TEXT VARCHAR2(4000) OID_TEXT_LENGTH NUMBER OID_TEXT VARCHAR2(4000) VIEW_TYPE_OWNER VARCHAR2(30) VIEW_TYPE VARCHAR2(30)
SET LONG 5000
SELECT view_name, text
FROM user_views
WHERE view_name IN ('PARCEL_OWNERS', 'OWNER_SQFT');
Note that the column TEXT has type "LONG". In order to ensure that SQL*Plus
displays this LONG column properly, we used the "SET LONG 5000" statement
before running the query on USER_VIEWS.
INSERT INTO table1 (col1, col2, ...) VALUES (value1, value2, ...)
or
INSERT INTO table1 (col1, col2, ...) SELECT ...
INSERT INTO fires (parcelid, fdate, ignfactor, estloss) VALUES (12, '17-JAN-96', 2, 35000)
DELETE FROM table1 WHERE conditions;Example: Delete fires with losses less than $50000 from the database
DELETE FROM fires WHERE estloss < 50000;
UPDATE table1 SET col1 = value1, col2 = value2, ... WHERE conditions;
or
UPDATE table1
SET col1 =
(SELECT ...)
WHERE conditions;
UPDATE tax SET bldval = 200000 WHERE parcelid = 11;Note: You can also update a table using a subquery. This typically involves one or more correlated subqueries. However, correlated subqueries are beyond this overview and not standardized in the SQL syntax used by the major vendors.
Normally, we are not particularly concerned about transaction control since we are focused on analyzing and cross-referencing existing databases to which we have read-only access and we are generating new data only in temporary tables or personal lookup tables that do not need to be shared. However, we will need to concern ourselves with COMMIT and ROLLBACK for exercises involving the use of data manipulation language terms such as UPDATE. If we run an UPDATE query that fails due to a SQL error, then any pending changes to the database tables that have occurred since the last COMMIT will be rolled back.
CREATE TABLE table1 ... ;
CREATE VIEW view1 ... ; CREATE INDEX index1 ON table1 (col1, ... );
DROP TABLE table1; DROP VIEW view1; DROP INDEX index1;
ALTER TABLE table1; ALTER VIEW view1; ALTER INDEX index1;
GRANT SELECT ON table1 TO user1; GRANT SELECT, INSERT, UPDATE (col1) ON table1 TO user1 WITH GRANT OPTION;
REVOKE ALL
ON table1
FROM user1;
Examining Views and other system tables: Suppose we have taken advantage of Oracle to save complex queries as views that can be imported into ArcMap. How can we find the definition of a view?
- Info about VIEWS is in the system table ALL_VIEWS
describe all_views set linesize 100 column text format A40 select substr(view_name,1,12) as name, text_length, text from ALL_VIEWS where owner = 'JF';
- Set environment to read long text strings and list MADENSITY view:
set long 685 set linesize 180 select view_name, text from all_views where owner = 'JF' and view_name = 'MADENSITY';- This query computes the total area of each Mass town and then calculates population density
select town_id, town, count(*) as parts, avg(pop90) as pop90, sum(area) as area, 10000*(avg(pop90)/sum(area)) as density from matown2000 group by town_id, town order by town;- Now allow anyone to select from this view as v_madensity (instead of jf.madensity)
grant select on madensity to public; create public synonym v_madensity for madensityComputing the percentage of East Boston parcels that are triple-deckers (LU=R3)
- Numerator: create view v_num as (select count(LU) as R3_count from bos05eb where LU = 'R3');
- Denominator: create view v_denom as (select count(LU) as all_lu from bos05eb);
- Ratio: select 100 * r3_count / all_lu from v_num, v_denom;
Compute same percentage in a single SQL query:
select 100 * n.r3_count / d.all_lu from (select count(LU) R3_count from bos05eb where LU = 'R3') n, (select count(LU) all_lu from bos05eb) d;
Written February, 2001-2006 (Tom Grayson, Joe Ferreira), Last Modified 26 February 2009 (jf)