11.521 | Spatial
Database Management and Advanced Geographic Information Systems |
11.523 | Fundamentals
of Spatial Database Management |
What SQL query answers the following questions:
- Does every Assessing record in BOS05EB have a corresponding 'ground' parcel in ebos_parcels05.shp?
- Does every ground parcel have an Assessing record in BOS05EB?
- Is every property type code in BOS05EB explained in the MASS_LANDUSE lookup table?
- Which land uses in MASS_LANDUSE are not found within East Boston?
In lab#2, when you ran q_eb05new to join your expanded t_noncondo table with MASS_LANDUSE, you may have noticed that q_eb05new produced a table with one fewer record than t_noncondo. Which record dropped out?
SELECT bos05eb.PTYPE FROM bos05eb WHERE bos05eb.ptype not in (select stclass_txt from mass_landuse);Or, equivalently:
SELECT bos05eb.PTYPE FROM bos05eb WHERE NOT EXISTS(SELECT NULL from mass_landuse m WHERE bos05eb.ptype = m.stclass_txt);Let's figure out how to do the same query in MS-Access - by pasting the SQL into MS-Access ...
set linesize 100
set pagesize 40
set pause on
set pause more...
select * from cat;
Schema of ZONING table (and related lookup tables with codes for Decision, Use, NSA, and Neighbrhd) |
|
Schema of 1980 Boston Census data (grouped by neighborhood statistical area, NSA) |
|
Annotated SQL queries of ZONING table |
|
Annotated SQL queries illustrating use of lookup tables to categorize ownership of properties seeking zoning variances |
|
Stages of evolution of the ZONING variance database |
The ZONING table and associated lookup tables are in our Oracle database and accessible to everyone in the class. In addition, the same tables are available as an MS-Access database in the class data locker: http://mit.edu/11.521/data/11.521_lab3.mdb.
Build lookup table for Boston parcel ownership
- Book Chapter describing 'lookup table' example: http://mit.edu/11.521/papers/techcity_7ferreira.pdf
- Queries to produce results discussed in book chapter: lookup_inclass_01feb22.html
Using lookup tables to merge local data with large, 'read-only' official databases
The book chapter explains some of the complexities and technical strategies involved in trying to empower communities by providing local access to parcel-level city databases. We would like to facilitate distributed access to centrally maintained databases - while allowing some capacity on the part of end-users to make changes in the 'official' data. Our example uses parcel-level databases of land use and ownership - a very detailed data layer that has recently become much more standardized and accessible. To emphasize how often it is necessary to do extensive data processing before such data can be utilized effectively, we consider the problem of addressing the 'spelling errors' in owner names that complicate efforts to determine ovnership patterns from typical parcel databases. For example, there are at least 17 different spellings of the Boston Redevelopment Authority in the 'owner' field of the official parcel database for Boston.
In parts of Lab #3 and Problem Set A, we use the methods discussed in the book chapter (i.e. local 'lookup' tables and SQL 'update' queries) to categorize the owners of the properties in the zoning variance database. After making sure we understand these techniques, we discus how these methods can provide a useful alternative to 'top-down' and 'bottom up' strategies for fixing the spelling errors. In particular, they provide a 'middle-out' alternative that allows decentralized accumulation of useful knowledge that can be readily linked to official datasets outside the user's control. Here are some queries that can will run in class and the powerpoint slides that outline the methods and ideas in the book chapter.
A set of notes illustrating these types of queries using the zoning variance database is available within the SQL Notes section of the class web page at this link: Grouping zoning applicants via 'lookup' tables.
SELECT expr1, expr2, expr3, ... FROM object1, object2, ... WHERE conditions GROUP BY expr4, expr5, expr6, ... HAVING conditions ORDER BY expr7, expr8, expr9, ...Note that the order of the clauses matters! The clauses, if included, must appear in the order shown! Oracle will report an error if you make a mistake, but the error message (e.g., "ORA-00933: SQL command not properly ended") may not be very informative.
The syntax looks like this:
For example, let's sort the parcels by land use in ascending order and the square footage in descending order:SELECT ... ORDER BY expr1 [ [ ASC | DESC ] , expr2 [ ASC | DESC ] ... ]
Notice that the query can mix the ASC and DESC keywords in a single ORDER BY clause.SELECT parcelid, landuse, sqft FROM parcels ORDER BY landuse ASC, sqft DESC; PARCELID LAN SQFT ---------- --- ---------- 15 34800 10 10900 7 A 14000 14 A 10000 16 A 9600 18 A 9500 11 C 210000 2 C 100000 19 C 40000 8 C 24800 1 C 20000 4 CL 6 CM 2100 20 E 50000 5 E 25000 9 R1 1800 17 R1 1500 13 R2 3 R3 5500 12 R3 5300 20 rows selected.
We might (incorrectly) write a query like this:
Notice that it returned a row with LANDUSE = 'R3'. That's not what we wanted! The problem is that the 'AND' and 'OR' operators, when mixed, are not processed in the sequence written, but rather follow an order of operations much as in algebra (exponentation before everything, then multiplication and division before addition and subtraction). In Boolean logic, 'AND' is like multiplication and 'OR' is like addition, and Oracle orders their processing accordingly. Hence, the query above is actually equivalent to this one:SELECT parcelid, landuse, sqft FROM parcels WHERE sqft = 5300 OR sqft = 10000 AND landuse = 'A' ORDER BY landuse ASC, sqft DESC; PARCELID LAN SQFT ---------- --- ---------- 14 A 10000 12 R3 5300
Since the order of operations can surprise you at inconvenient times, you should always use parentheses to force the correct order whenever you mix 'AND' and 'OR' in a WHERE clause. Here is the correct way to write the query:SELECT parcelid, landuse, sqft FROM parcels WHERE sqft = 5300 OR ( sqft = 10000 AND landuse = 'A') ORDER BY landuse ASC, sqft DESC; PARCELID LAN SQFT ---------- --- ---------- 14 A 10000 12 R3 5300
SELECT parcelid, landuse, sqft FROM parcels WHERE ( SQFT = 10000 OR SQFT = 5300) AND LANDUSE = 'A' ORDER BY landuse ASC, sqft DESC; PARCELID LAN SQFT ---------- --- ---------- 14 A 10000
Normally, a query may return duplicate rows. For example, if we query the FIRES table to list the parcels that had fires, we'll find that the parcels that had more than one fire (parcels 2 and 3) show up multiple times:
SELECT parcelid FROM fires ORDER BY parcelid; PARCELID ---------- 2 2 3 3 7 20 6 rows selected.If we don't want to see the duplicates, we can add the keyword DISTINCT right after SELECT:
SELECT DISTINCT parcelid FROM fires ORDER BY parcelid; PARCELID ---------- 2 3 7 20 4 rows selected.Now parcels 2 and 3 show up only once. You only use the DISTINCT once, right after SELECT, to apply to the entire row. You do not not apply it to each column. Hence, this query is valid:
SELECT DISTINCT p.onum, o.oname FROM parcels p, owners o WHERE p.onum = o.ownernum ORDER BY p.onum; ONUM ONAME ---------- ----------------------- 9 PATRICK KING 10 METHUINON TRUST 11 FERNANDO MARTINEZ 18 JOHN MCCORMACK 29 FRANK O'BRIEN 32 GERALD RAPPAPORT 38 BAY STATE, INC. 55 THOMAS KELLIHER 89 JOSEPH NOONAN 100 MGH, INC. 200 VANDELAY INDUSTRIES 11 rows selected.However, the following incorrect query, with two DISTINCT keywords, generates the cryptic error message "ORA-00936: missing expression":
SELECT DISTINCT p.onum, DISTINCT o.oname FROM parcels p, owners o WHERE p.onum = o.ownernum ORDER BY p.onum;Note also that you can use DISTINCT with the group functions (e.g., COUNT, AVG, STDDEV) to get these functions to consider only distinct values within a group: (do you see why the counts would not all be the same if you replace 'wpb' with 'sqft'?)
SELECT COUNT(wpb) wpb_ct, COUNT(DISTINCT wpb) wpb_ct_distinct, COUNT(*) row_ct FROM parcels; WPB_CT WPB_CT_DISTINCT ROW_CT ---------- --------------- ---------- 20 20 20 1 row selected.Finally, note that COUNT(DISTINCT *) does not work. This result is not surprising iif you think about it. Why do you think this is the case?
This query returns 200 rows!SELECT p.parcelid, p.onum, o.ownernum, o.oname FROM parcels p, owners o;
There are only 20 rows in PARCEL and 10 rows in OWNERS, so what's going on? The problem here is that this query has no WHERE clause that specifies how the PARCEL table relates to the OWNERS table. Without this information, Oracle does not know how to match a row in PARCEL to a corresponding row in OWNERS. What does it do instead? It matches every row in PARCEL to every row in OWNERS. Hence, we end up with:
(20 rows in PARCEL) matched to (10 rows in OWNERS) = 20 x 10 = 200 rows returnedThis kind of unconstrained join is called a Cartesian product. This result is desireable only under rare circumstances. If you have queries that are returning a suspiciously large number of rows, you have probably unwittingly requested a Cartesian product. Note that for tables of even modest size the number of rows returned by a Cartesian product can be explosive. If you generate a Cartesian product of one table with 1,000 rows with another table with 2,000 rows, your query will return 1,000 x 2,000 = 2,000,000 rows! That's right--two million rows! Hence, you should very careful to avoid unintentional Cartesian products.
To fix this query, we need to specify how the owner numbers stored in the PARCEL table should be matched to owner numbers in the OWNERS table. In PARCEL, the owner numbers are stored in ONUM, while in OWNERS they are stored in OWNERNUM. Here is the repaired query with the appropriate join condition in the WHERE clause:
SELECT p.parcelid, p.onum, o.ownernum, o.oname FROM parcels p, owners o WHERE p.onum = o.ownernum;
This query returns 20 rows, which is definitely an improvement.
Note that a Cartesian product can easily be hidden in a query that requires multiple joins. Suppose we want to find all the papers, with associated keywords, written by authors with the last name WALKER. We could try this query:
COLUMN keyword FORMAT A20 TRUNC COLUMN title FORMAT A25 TRUNC SELECT a.lastname, a.fnamemi, k.keyword, t.title FROM keywords k, match m, titles t, authors a WHERE m.code = k.code AND t.paper = a.paper AND a.lastname = 'WALKER' ORDER BY a.lastname, a.fnamemi, k.keyword, t.title;
This query returns a whopping 6174 rows! What's wrong? All of the tables appear to be involved in a join condition. The problem is that while the MATCH and KEYWORDS tables are tied together, and the TITLES and AUTHORS tables are linked to each other, nothing links these two sets of tables together. Adding another join condition fixes it:
SELECT a.lastname, a.fnamemi, k.keyword, t.title FROM keywords k, match m, titles t, authors a WHERE m.code = k.code AND t.paper = a.paper AND m.paper = t.paper AND a.lastname = 'WALKER' ORDER BY a.lastname, a.fnamemi, k.keyword, t.title;
This query returns a much more reasonable 14 rows. Note that we could have also specified "m.paper = a.paper" and it would have worked too. Why?
A Cartesian product can easily be hidden by a GROUP BY, since it will aggregate all the spurious rows, and you will not see the rows that made up the groups. Here's a variation on the earlier, broken example, now with a GROUP BY:
SELECT a.lastname, a.fnamemi, count(k.keyword) keywords, count(t.title) titles FROM keywords k, match m, titles t, authors a WHERE m.code = k.code AND t.paper = a.paper AND a.lastname = 'WALKER' GROUP BY a.lastname, a.fnamemi ORDER BY a.lastname, a.fnamemi;
Because this query returns only counts, it's not obvious that the query is
defective--unless you have an idea about what results are reasonable! Always
scrutinize your results!
NULL is a special value that means "missing" or "unknown." It can be placed in a column of any type, unless the column has been declared NOT NULL, in which case NULL values are forbidden. A NULL is not the same as zero or any other value.
Special rules apply when NULL values are involved:
For more information on NULL values and how Oracle treats them, consult the Oracle documentation. With luck, their link to the Oracle 8i documentation hasn't changed much and you can find the SQL*Plus documentation here: http://download-west.oracle.com/docs/cd/A87860_01/doc/server.817/index.htm (Note also the link to an explanation of oracle error messages).
Take a look at this query:
SELECT p.parcelid, f.fdate FROM parcels p, fires f WHERE p.parcelid = f.parcelid ORDER BY p.parcelid, f.fdate;
The query above returned 6 rows, but there are 20 parcels. Fourteen parcels seem to be missing. Where did they go? The answer is that Oracle will only list the parcels that occur in both tables, PARCELS and FIRES.
How do we get around this problem if we want to see all 20 rows in the parcel table, whether they match a record in FIRES or not? The answer is an outer join. The standard join is also known as an "inner join," meaning that the default behavior of not matching NULL values occurs. In an outer join, we explicitly tell Oracle that we want it to display NULL values that would otherwise be excluded by adding the (+) outer join operator to each column in the WHERE clause where the additional NULLs should appear.
Note that the position of the outer join operator (+) is significant! This query will run but return a different result:SELECT p.parcelid, f.fdate FROM parcels p, fires f WHERE p.parcelid = f.parcelid (+) ORDER BY p.parcelid, f.fdate;
Some additional class notes on outer joins are available. If you wish, you can peruse the Oracle 8i documentation. Be aware that, while the concept remains the same, the syntax for outer joins varies somewhat across vendors and, in particular, is different in MS-Access and SQL-server.SELECT p.parcelid, f.fdate FROM parcels p, fires f WHERE p.parcelid (+) = f.parcelid ORDER BY p.parcelid, f.fdate;
SELECT p.parcelid, p.add1, p.add2, f.fdate FROM parcels p, fires f WHERE p.parcelid = f.parcelid ORDER BY p.parcelid, f.fdate;Repeat the same query, except list all the parcels, whether they had a fire or not (this requires an outer join):
List the count of fires for the parcels that had a fire (this requires an inner join with grouping):SELECT p.parcelid, p.add1, p.add2, f.fdate FROM parcels p, fires f WHERE p.parcelid = f.parcelid (+) ORDER BY p.parcelid, f.fdate;
SELECT p.parcelid, COUNT(fdate) fire_count FROM parcels p, fires f WHERE p.parcelid = f.parcelid GROUP BY P.PARCELID ORDER BY COUNT(fdate) DESC, p.parcelid;Note that the query above lists the parcels in descending order of the count of fires. When specifying a group function or other expression in the ORDER BY clause, you must use the full expression, even if you defined a column alias for it in the SELECT list. In this case, we must use COUNT(FDATE) rather than FIRE_COUNT in the ORDER BY clause. The DESC keyword after COUNT(FDATE) indicates that we want the fire counts shown in descending order, rather than the default ascending order. You need to apply the DESC keyword to every expression in the ORDER BY clause that you want in descending order.
List the count of fires for all parcels, whether it experienced a fire or not (this requires an outer join with grouping):
SELECT p.parcelid, COUNT(fdate) fire_count FROM parcels p, fires f WHERE p.parcelid = f.parcelid (+) GROUP BY p.parcelid ORDER BY COUNT(fdate) DESC, p.parcelid;Note that the query above uses an outer join on the FIRES table -- indicated by the outer join symbol (+) -- to include the parcels that are not listed in the FIRES table in the count. The outer join symbol indicates where a NULL should replace a real value if none is available in the table. Note that we need to use the outer join symbol with both columns in FIRES that we are joining with PARCEL.
The query below runs but returns the wrong result--some parcels that had no fires show up with one fire in the count. Why?
SELECT p.parcelid, COUNT(*) bogus_fire_count FROM parcels p, fires f WHERE p.parcelid = f.parcelid (+) GROUP BY p.parcelid ORDER BY COUNT(*) DESC, p.parcelid;
A synonym is simply a second name for an existing object. These are particularly convenient when a table is owned by another user (or, stated differently, stored in a different schema). You have been using synonyms all along for the objects in the PARCELS, URISA, and ZONING databases. To see the synonyms, use this query:
SELECT synonym_name, table_owner, table_name FROM user_synonyms ORDER BY synonym_name;
Note that the column alias TOTAL_SQFT in the example above is required because Oracle needs to know what to name the column in the view. Do not include an ORDER BY clause in the SELECT statement that you use to create the view.CREATE VIEW parcel_owners AS SELECT p.parcelid, o.oname, P.ADD1, P.ADD2, P.ZIP FROM parcels p, owners o WHERE p.onum = o.ownernum(+); CREATE VIEW owner_sqft AS SELECT o.oname, SUM(SQFT) TOTAL_SQFT FROM parcels p, owners o WHERE p.onum = o.ownernum GROUP BY o.oname;
Once the view is created, it can be treated for (almost) all intents and purposes as a true table. You can describe them to see their structure:
DESCRIBE parcel_ownersDESCRIBE owner_sqft
Many other such views are available. The USER_SYNONYMS view mentioned above is one of them. The data dictionary view USER_VIEWS includes information about the views we just created. To see its structure, we can use the DESCRIBE statement in SQL*Plus:SELECT * FROM cat;
To see the definition of the views we just created, we can use the following statements:SQL> DESCRIBE user_viewsName Null? Type ----------------------------------------- -------- ---------------------------- VIEW_NAME NOT NULL VARCHAR2(30) TEXT_LENGTH NUMBER TEXT LONG TYPE_TEXT_LENGTH NUMBER TYPE_TEXT VARCHAR2(4000) OID_TEXT_LENGTH NUMBER OID_TEXT VARCHAR2(4000) VIEW_TYPE_OWNER VARCHAR2(30) VIEW_TYPE VARCHAR2(30)
Note that the column TEXT has type "LONG". In order to ensure that SQL*Plus displays this LONG column properly, we used the "SET LONG 5000" statement before running the query on USER_VIEWS.SET LONG 5000 SELECT view_name, text FROM user_views WHERE view_name IN ('PARCEL_OWNERS', 'OWNER_SQFT');
INSERT INTO table1 (col1, col2, ...) VALUES (value1, value2, ...)
or
INSERT INTO table1 (col1, col2, ...) SELECT ...
INSERT INTO fires (parcelid, fdate, ignfactor, estloss) VALUES (12, '17-JAN-96', 2, 35000)
DELETE FROM table1 WHERE conditions;Example: Delete fires with losses less than $50000 from the database
DELETE FROM fires WHERE estloss < 50000;
UPDATE table1 SET col1 = value1, col2 = value2, ... WHERE conditions;
or
UPDATE table1 SET col1 = (SELECT ...) WHERE conditions;
UPDATE tax SET bldval = 200000 WHERE parcelid = 11;Note: You can also update a table using a subquery. This typically involves one or more correlated subqueries. However, correlated subqueries are beyond this overview and not standardized in the SQL syntax used by the major vendors.
Normally, we are not particularly concerned about transaction control since we are focused on analyzing and cross-referencing existing databases to which we have read-only access and we are generating new data only in temporary tables or personal lookup tables that do not need to be shared. However, we will need to concern ourselves with COMMIT and ROLLBACK for exercises involving the use of data manipulation language terms such as UPDATE. If we run an UPDATE query that fails due to a SQL error, then any pending changes to the database tables that have occurred since the last COMMIT will be rolled back.
CREATE TABLE table1 ... ;
CREATE VIEW view1 ... ; CREATE INDEX index1 ON table1 (col1, ... );
DROP TABLE table1; DROP VIEW view1; DROP INDEX index1;
ALTER TABLE table1; ALTER VIEW view1; ALTER INDEX index1;
GRANT SELECT ON table1 TO user1; GRANT SELECT, INSERT, UPDATE (col1) ON table1 TO user1 WITH GRANT OPTION;
REVOKE ALL ON table1 FROM user1;
Examining Views and other system tables: Suppose we have taken advantage of Oracle to save complex queries as views that can be imported into ArcMap. How can we find the definition of a view?
- Info about VIEWS is in the system table ALL_VIEWS
describe all_views set linesize 100 column text format A40 select substr(view_name,1,12) as name, text_length, text from ALL_VIEWS where owner = 'JF';
- Set environment to read long text strings and list MADENSITY view:
set long 685 set linesize 180 select view_name, text from all_views where owner = 'JF' and view_name = 'MADENSITY';- This query computes the total area of each Mass town and then calculates population density
select town_id, town, count(*) as parts, avg(pop90) as pop90, sum(area) as area, 10000*(avg(pop90)/sum(area)) as density from matown2000 group by town_id, town order by town;- Now allow anyone to select from this view as v_madensity (instead of jf.madensity)
grant select on madensity to public; create public synonym v_madensity for madensityComputing the percentage of East Boston parcels that are triple-deckers (LU=R3)
- Numerator: create view v_num as (select count(LU) as R3_count from bos05eb where LU = 'R3');
- Denominator: create view v_denom as (select count(LU) as all_lu from bos05eb);
- Ratio: select 100 * r3_count / all_lu from v_num, v_denom;
Compute same percentage in a single SQL query:
select 100 * n.r3_count / d.all_lu from (select count(LU) R3_count from bos05eb where LU = 'R3') n, (select count(LU) all_lu from bos05eb) d;
Written February, 2001-2006 (Tom Grayson, Joe Ferreira), Last Modified 26 February 2009 (jf)