11.521/11.523 SQL Exam, Spring 2011

Massachusetts Institute of Technology - Department of Urban Studies and Planning

11.521	Spatial Database Management and Advanced Geographic Information Systems
11.523	Fundamentals of Spatial Database Management

In-Lab Exam - 15 March 2016 ...with Solutions

Instructions

This is an open-book, open-note examination. You are free to use the web (especially the class web pages and Oracle documentation) to help you. Like any examination, however, we expect your submission to reflect your work exclusively. Hence any dialogue via any means (e.g., voice, paper, e-mail, IM, texting) with anyone other than the class instructors is prohibited.

The exam starts at 4:05 PM in the Room W31-301 Computing Lab. It is intended to be finished by 5:30 PM but you can continue until 7:00 PM so that there is less time pressure. All the tables needed for the exam questions are accessible in Oracle via SQL*Plus using your personal Oracle account that we have been using for lab exercises and homework. These tables are also available in a personal geodatabase that can be opened using MS-Access (as well as ArcMap).

The exam is worth a total of 100 points.

An effective way for you to prepare your exam answers is to use the same familiar methods you have used for the lab assignments and homework. Create an ASCII text or HTML or Word file in your Athena locker to contain your answers. Name your exam username_exam.txt if you use text format or username_exam.html if you use HTML format. (An MS-Word formatted document is also okay.) For your own security, we suggest that you save it in the "Private" subdirectory of your locker (i.e., the user smith might create the text file on her I:\ drive from a WinAthena PC and name it I:\Private\smith_exam.txt). We strongly recommend that you retain the file containing your answers until we return the graded exams to you.

Remember to turn in your exam (in plain ASCII text or MS-Word or HTML format or PDF format) before leaving the room.

Make sure to include your name and Athena user ID near the beginning of your file (failure to include this will cost you 2 points)!

Please confirm with an instructor that your exam has been received.

To turn in your exam, upload it to Stellar: https://stellar.mit.edu/S/course/11/sp16/11.521/homework

Finally, don't spend all your time on one or two questions. Start by looking over all the questions to get a general sense of what, and how much, is being asked, and the grade points associated with each question. Then start work on your answers but move on to the next question if you've spent more than 10 minutes on any one of them.

Good luck!

Data Preparation

Before beginning your work, copy the 'exam16_data' folder from the 'data' folder in the class locker into a local drive such as C:\TEMP. All the data that you will need for the exam can be found either in your local copy of this folder or in the Oracle database. Unless specified otherwise, all tables referenced in this document are the names of tables in the Oracle database CRL that we have used for the class. All the queries required for the test can be done either in Oracle (via SQL*Plus) or in MS-Access (in 11.521_exam16_data..mdb). The 'exam16_data' folder contains the following:

Type of Data	File or Table Name	Description
Shapefile	MA_TOWNS00.SHP	Shapefile of Mass municipalities (which we call Towns). This shapefile is a copy of the same one that we have used in class exercises.
Shapefile	TAZ2727.SHP	Shapefile of 2,727 Traffic Analysis Zones (TAZ) in the Eastern Mass region included in the MetroFuture regional plan of Boston's Metropolitan Area Planning Council (MAPC).
Oracle Table	TAZ2727SHP	Attribute table for traffic analysis zone shapefile of TAZ in MetroFuture area.
Oracle Table	TAZ2727_MFDATA	Current population and housing unit estimates and projections to 2030 from alternative MetroFuture scenarios for 2,727 TAZ .
Oracle Table	COMTYPE	Community type names and codes used by MAPC to categorize each community included in the MetroFuture scenario modeling.
Oracle Table	BOS05EB	The same East Boston table of 2005 assessing data that we have used for lab exercises.
Oracle Table	STCLASS	The same lookup table of State land use codes STCLASS that we have used in class exercises. In the MS-Access database the table is called MASS_LANDUSE and has an additional column, STCLASS_TXT, with the state class code expressed as a text string instead of as an integer.
Spreadsheet	11.521_exam16_dictionary.xlsx	Spreadsheet containing data dictionary and code sheets for the above tables.
MS-Access database	11.521_exam16_data..mdb	MS-Access database containing the five Oracle tables listed above, TAZ2727, TAZ2727_MFDATA, COMTYPE, BOS05EB, STCLASS.

For your information, the Oracle tables used in this exam are owned by 'jf' (as in, JF.COMTYPE). These SQL commands are used to create a public synonym (so you do not have to prefix the table name with 'jf.') and then grant read-only (select) permission on the table to the public. For example:

create public synonym comtype for comtype;
grant select on comtype to public;

Part I: Short Answer and 'Fix SQL' Questions (26 points)

Question I-1. (6 points each, 12 points total)

Question I-1a: The following query is intended to list the Massachusetts municipalities along with their 1980-1990 change in population for those municipalities that have more than ten polygons associated with them in the shapefile. However, the query does not work properly. Briefly explain why and show the corrected SQL query. (Note: the Oracle table jf.matown2000 is the attribute table associated with the ma_towns00.shp shapefile that we have used in lab exercises.)
SELECT town, count(town) tcount, (pop90 - pop80) pop_change
  FROM jf.matown2000
 WHERE count(town) > 10 
 GROUP BY town
 ORDER BY tcount desc;
There are two problems with the query: (a) the population change computation needs to use aggregation functions since it is not part of the 'group by', and (b) the count(town)>10 condition should be in a 'having clause' rather than the 'where' clause. A correct query is:
SELECT town, count(town) tcount, (min(pop90) - min(pop80)) pop_change
  FROM jf.matown2000
 GROUP BY town
HAVING count(town) > 10
 ORDER BY tcount desc;

TOWN                      TCOUNT POP_CHANGE
--------------------- ---------- ----------
BOSTON                        19       3564
WESTPORT                      17         -9
BARNSTABLE                    15       5869
GLOUCESTER                    15        624
GOSNOLD                       15         40
IPSWICH                       12        307
SCITUATE                      12       -487

7 rows selected. 
Question I-1b: Consider the following query:
select count(*), count(pop80), count(town), count(distinct town)
  from jf.matown2000;
The query returns the counts: 632, 632, 611, and 351 respectively. Explain briefly why all four numbers are not the same and what each number means.

The first two counts of 632 represent respectively the number of rows in the table and the number of rows with a non-null value in the 'pop80' column. The 611 count is the number of rows with a non-null value in the 'Town' column. The 351 count is the number of unique Town names in the 'Town' column.

Note, this result is common whereby numeric fields might have a 'zero' in cases that really should be null but strings, like Town name, will be null for those cases (polygons in the MassGIS shapefile in this case) that are not associated with a particular Town.

Question I-2. (8 points)

Question I-2: Explain briefly what is meant by the phrase 'many-to-many relationship.' How are many-to-many relationships typically handled when designing relational databases? Illustrate your points using one of the database schemas that we have considered in class lectures and exercises. (A verbal explanation is sufficient. No need for an ER diagram or table definitions.)

Many-to-many relationships exist when trying to relate two sets of data where particular records in one set can be associated with more than one row in the other set and vice versa. For instance, in the Websis example, students can take more than one class and any class can have more than one student. Many-to-many relationships are handled by creating two one-to-many relationships through the use of a matching table. For example, a student-class matching table is used with one row for every student-class pairing that occurs. The matching table pairs (as local foreign keys) the primary keys from each of the tables you wish to relate.

Question I-3. (6 points total)

Consider the seven illustrative PARCELS tables that we have used in lecture and lab exercises.

Question I-3: What would you suggest as the primary key for the SALES table? Explain briefly. What foreign key in the SALES table would you use when joining that table with the PARCELS table?

For the SALES table, the primary key should be the combination of the parcel identifier (parcelid) and sale date (sdate) columns. The parcelid field in the sales table is the foreign key that matches the primary key parcelid in the parcels table. Note: In practice, a property can - and sometimes does - sell more than once within a single day. So we may want the sale date to be a 'date' field that measures milleseconds past midnight as well as the Julian date. Our toy sales table only includes the day.

Part II: GIS and Thematic Mapping of SQL results (58 points)

In Part II, we will examine population and housing estimates for the metro Boston area that were generated by Boston's Metropolitan Area Planning Council (MAPC) as part of the recent MetroFuture planning process. The estimates were made for several development scenarios including a business-as-usual scenario called "Let It Be" (LIB) and an alternative that became the MetroFuture plan (abbreviated MF). The relevant data are available in the exam data directory (exam16_data) and contain the shapefiles and data tables listed earlier in this exam. Note that the shapefile, TAZ2727, showing the TAZ boundaries does not always line up precisely with Town boundaries in MA_TOWNS00. For questions involving which Town contains which TAZ, rely on the tabular attribute information in TAZ2727. All the data tables are stored in the MS-Access database, 11.521_exam16_data.mdb, and in Oracle tables that you can access from your individual Oracle accounts.

Question II-1a. (6 points total)

How many Mass Towns are included among the 2727 traffic analysis zones? ___164_____ Show your SQL query to count the number of unique Towns.

select count(distinct town_name) 
  from taz2727shp
 order by town_name;


COUNT(DISTINCTTOWN_NAME)
------------------------
                     164


Since MS-Access does not support 'distinct' within a count() function, we can also do this:


select count(*)
  from (select distinct town_name from taz2727shp);


  COUNT(*)
----------
       164  


We could also apply the 'summarize' command within ArcMap to the 'Town_Name' column of taz2727bos shapefile.

Question II-1b. (6 points total)

Which Town contains the second largest number of traffic analysis zones? ____Cambridge_______ How many TAZ does that Town contain? ___104________ Show your SQL query.

select town_name, count(*) as taz_count
  from taz2727shp
 group by town_name
having count(*) >50
 order by count(*) desc;

TOWN_NAME             TAZ_COUNT
-------------------- ----------
Boston                      447
Cambridge                   104
Newton                       53
Quincy                       52
Brockton                     52

Note that Boston has four times as many as Cambridge, which has twice as many as the next town, Newton!

Question II-1c. (6 points total)

Which TAZ is expected to see the largest increase in housing units between 2000 and 2030 according to the LIB scenario? ____625___
How many housing units are projected to be added to this TAZ? ____3458______

For this LIB scenario, which Town is this TAZ within? ___Cambridge!!_____

Show your SQL queries.
select taz, lib_hu2030-hu2000 as lib_delta_hu
  from taz2727_mfdata
 where mf_hu2030-hu2000 > 2000
 order by mf_hu2030-hu2000 desc;

       TAZ LIB_DELTA_HU
---------- ------------
       625      3458.03
       594      1448.53
      1847      2372.76
       757        129.5
       145       666.25
      1864      2088.52
	  
select t.taz, town_name
  from taz2727_mfdata m, taz2727shp t
 where m.taz = t.taz
   and lib_hu2030-hu2000 in 
      (select max(lib_hu2030-hu2000) from taz2727_mfdata);

       TAZ TOWN_NAME
---------- --------------------
       625 Cambridge 
	   
Note that it was also okay just to look up 625 in the TAZ table to see that it was Cambridge:

select * from taz2727shp where taz = 625;
       TAZ   SQMTRTOT TOWN_NAME            TOWN_CAPS
---------- ---------- -------------------- ----------
       625   425135.8 Cambridge            CAMBRIDGE
	   
Note also that this same TAZ gets the most new housing units under the MF scenario:

select taz, mf_hu2030-hu2000 as mf_delta_hu
  from taz2727_mfdata
 where mf_hu2030-hu2000 > 2000
 order by mf_hu2030-hu2000 desc;

       TAZ MF_DELTA_HU
---------- -----------
       625     3902.71
       594     2187.92
      1847     2182.73
       757     2165.84
       145     2117.69
      1864        2020
      
      
This development occurs near Lechmere Square and the Science Museum at the edge of Cambridge, Somerville, and Boston and along the transportation corridor heading north out of North Station and including the planned Green Line extension into Somerville and Medford. These area is slated for significant mixed use development during this decade.

Question II-1d. (6 points total)

Now, let's examine the TAZ-level differences between scenarios in the estimated housing unit growth. Write a query that computes the difference in expected housing unit growth between the MF and LIB scenarios (from 2000 to 2030) for each TAZ. Call this difference MF_less_LIB. Show your SQL queries.

Which TAZ has the largest value of MF_less_LIB? ___757_____ For that TAZ, what is MF_less_LIB? __2036_________
select taz, (mf_hu2030 - lib_hu2030) as MF_less_LIB
  from taz2727_mfdata
 where (mf_hu2030 - lib_hu2030) > 500
 order by (mf_hu2030 - lib_hu2030) desc;
       TAZ MF_LESS_LIB
---------- -----------
       757     2036.34
       136     1794.32
        20     1509.69
        19     1509.69
       145     1451.44
       172      1257.7
...
Which TAZ has the smallest (most negative) value of MF_less_LIB? ___2715________ For that TAZ, what is MF_less_LIB? __-1054_________
select taz, (mf_hu2030 - lib_hu2030) as MF_less_LIB
  from taz2727_mfdata
 where (mf_hu2030 - lib_hu2030) < -500
 order by (mf_hu2030 - lib_hu2030) asc;
       TAZ MF_LESS_LIB
---------- -----------
      2715    -1053.87
      1223     -872.14
      2233     -779.74
      1596     -737.71
      1256     -593.77
      2149     -531.17
      1592     -530.74

Question II-1e. (6 points total)

Next, let's aggregate these differences across TAZ that are considered suitable for development. The MetroFuture scenario (MF) targeted housing additions toward those communities that were considered to have areas suitable for development (in terms of infrastructure, access, and redevelopment opportunities). The attribute CODA in TAZ2727_MFDATA indicates which TAZ are considered to be suitable areas (CODA=1) for further development. Write a SQL query that reports the net difference in projected housing unit growth (MF_less_LIB) for each CODA group. Show your SQL query.
One way to write this query is to insert your previous answer as a sub-query in the 'from' clause:
select coda, count(d.taz) tcount, sum(MF_less_LIB) as delta
  from taz2727_mfdata t,
      (select taz, (mf_hu2030 - lib_hu2030) as MF_less_LIB
         from taz2727_mfdata) d 
 where t.taz=d.taz
 group by coda
 order by coda;
 
       CODA     TCOUNT      DELTA
---------- ---------- ----------
         0        909  -64248.07
         1       1818  105980.32
Alternatively, and even simpler in this case, you can write the query this way:
select coda, count(taz), 
       sum(mf_hu2030 - lib_hu2030) as MF_less_LIB
  from taz2727_mfdata
 group by coda order by coda;
 
      CODA COUNT(TAZ) MF_LESS_LIB
---------- ---------- -----------
         0        909   -64248.07
         1       1818   105980.32
We see that the MetroFuture scenario allocated 64 thousand fewer housing units to the 909 TAZ that had CODA=0 TAZ, and 106 thousand more housing units to the 1818 TAZ in CODA=1 areas.

Question II-1f. (6 points total)

Now, let's aggregate these differences by Town. Write a query that computes the net difference in projected housing unit growth by Town from 2000-2030 between the two scenarios. Call this difference MF_less_LIB_byTown. Show your SQL queries..

Which Town has the largest value of MF_less_LIB_byTown? ___Boston____ For that Town, what is MF_less_LIB_byTown? ___31,588________

Which Town has the smallest (most negative) value of MF_less_LIB_byTown? __Plymouth_________ For that TAZ, what is MF_less_LIB_byTown? ___-1558________
select t.town_name, sum(mf_hu2030 - lib_hu2030) as MF_less_LIB_byTown
  from taz2727_mfdata d, taz2727shp t
 where d.taz = t.taz
 group by t.town_name
 order by sum(mf_hu2030 - lib_hu2030) desc;



TOWN_NAME            MF_LESS_LIB_BYTOWN
-------------------- ------------------
Boston                         31588.05
Somerville                      3652.88
Lynn                            3061.97
Cambridge                       2778.33
Taunton                         2498.16
...
Littleton                       -923.93
Dracut                          -926.38
Upton                           -967.24
Wrentham                        -968.93
Ashland                         -996.04
Plymouth                       -1558.37

Question II-1g. (8 points total)

Redo the previous question, but this time compute MF_less_LIB_byTown as a percentage of the Town's 2000 housing unit count (HU2000). Show your SQL queries and call this percentage difference MF_diff_pct_byTown

Which Town has the largest value of MF_diff_pct_byTown? ___Attleboro________ For that Town, what is MF_diff_pct_byTown? __14.5%_________

Which Town has the smallest (most negative) value of MF_diff_pct_byTown? __Upton_________ For that TAZ, what is MF_diff_pct_byTown? ____-47.4%_______
select t.town_name, 100*(sum(mf_hu2030 - lib_hu2030))/sum(HU2000) as pct_MF_less_LIB_byTown
  from taz2727_mfdata d, taz2727shp t
 where d.taz = t.taz and HU2000 > 0
 group by t.town_name
 order by sum(mf_hu2030 - lib_hu2030)/sum(hu2000) desc;

TOWN_NAME            PCT_MF_LESS_LIB_BYTOWN
-------------------- ----------------------
Attleboro                         14.438015
Chelsea                          12.0911003
Taunton                          11.2149694
Boston                           10.5025278
Lynn                             9.13720868
...
Wrentham                         -28.481188
Middleton                        -28.645553
Littleton                        -31.213851
Mendon                           -34.047934
Bolton                           -42.938904
Upton                            -47.367287
The above query is almost right. We need the "HU2000 > 0" criterion in order to avoid dividing by zero. While we know that each town had some population in 2000, it is possible for HU2000 to be zero in one or another TAZ but then to have growth during 2000-2030 in that TAZ. In that case some of the growth might not be counted in the numerator total for that town. To be sure to avoid such a situation, you can use the following syntax which also has the advantage that it works in MS-Access (where a ratio of sums is not allowed). For this query, we determine the net difference by town before computing the percentage so we can then compute a housing total for all towns that had some housing in 2000. In this case, we also converted the Town names to upper case to facilitate matching with the town names in ma_town00.shp. The upper case operator in Oracle is Upper() and in MS-Access is UCase():
select upper(n.town_name) as Town, 100*(MF_less_LIB_byTown / hu2000_bytown) 
       as pct_mf_less_lib_bytown
  from (select town_name, sum(mf_hu2030 - lib_hu2030) as mf_less_lib_bytown
	      from taz2727_mfdata m, taz2727shp t
		 where t.taz = m.taz
         group by town_name) n,
       (select town_name, sum(hu2000) as hu2000_bytown
	      from taz2727_mfdata m, taz2727shp t
		 where t.taz = m.taz
 		 group by town_name) d
 where n.town_name = d.town_name 
   and HU2000_bytown > 0
 order by 2 desc;

TOWN_NAME            PCT_MF_LESS_LIB_BYTOWN
-------------------- ----------------------
Attleboro                         14.438015
Boston                           13.1872378
Chelsea                          12.4579408
Somerville                        11.576232
Taunton                          11.3320934
Lynn                             9.13720868
...
Wrentham                         -28.481188
Middleton                        -28.645553
Littleton                        -31.213851
Mendon                           -34.047934
Bolton                           -42.938904
Upton                            -47.367287
The sub-queries can be avoided by creating views for the numerator and denominator construction and then writing a third query that divides them. Note that there are a few changes in the results. If you look at the TAZ-level detail, you will find several dozen TAZ with no housing units in 2000 and non-zero growth projected. There are many more TAZ with no housing units whatsoever. These might be industrial areas or transportation hubs like the airport where no housing units exist but they are meaningful traffic analysis zones since they are travel destinations for employment, mode transfer, etc. Note also, that some TAZ have a population but no housing units. These might be dormatories, prisons, hospitals, or other such institutions.

Question II-1h. (14 points total)

Develop a thematic map of metro Boston Towns that is shaded to indicate the percentage difference, MF_diff_pct_byTown, computed in the previous question. Briefly discuss any pattern that you see and explain briefly your choice of classification method and color scheme to visualize the differences appropriately. (Note, if you had trouble with Question II-1f, then plot MF_less_LIB_byTown (or even HU2000 by town) so you can get partial credit for showing you can pull data into ArcMap for thematic mapping. Also, the town name is listed in upper case in some of the tables but with only leading capital letters in others. The TAZ2727SHP tables in Oracle and MS_Access includes both forms of the town name in order to facilitate table joins.)

On top of the Town-level map, overlay a TAZ level map that highlights only those two TAZ from Question II-1d that had the highest and lowest differences in added housing units, MF_less_LIB. Submit a PDF-formatted map with sufficient readability to see the highlighted TAZ and interpret whatever pattern you see in the Town-level map.

The map below is a good, readable map from submitted student work. Ten quantile categories are used so that about the same number of towns fall in each category and a color ramp is used that highlights high and low deviations from the norm. The most negative areas (new-MF << new-LIB) are primarily the outer suburbs near the second ring road (495) where the let-it-be scenario forecasts much more single-family, large lot housing development. The most positive areas (new-MF >> new-Lib) are the inner city towns and those in sub-centers and along the major transportation corridors. The highlighted TAZ are the one in Lynn (TAZ=757) with the largest value of MF_diff_pct_byTown and the TAZ in Plymouth (TAZ=1592) that had the lowest value.

Part III: SQL Queries (16 points)

This question uses the East Boston assessing data in BOS05EB that we used in labs #2 and #5. This table is available in Oracle as BOS05EB along with the lookup table for the state land use codes, STCLASS. The variable PTYPE in BOS05EB indicates the state land use code and matches the STATECLASS column in STCLASS. The column OWN_OCC in BOS05EB indicates 'Y' if the assessing record is for an owner-occupied unit and 'N" if it is not. We are interested in computing the percentage of assessing records that are for owner-occupied units and we would like to break this percentage down by PTYPE.

Question III-1a. (8 points total)

Write a SQL query that counts, for each value of PTYPE, the number of non-condo assessing records in BOS05EB that have OWN_OCC equal to 'Y'. Restrict your results to show only those PTYPE values that have at least 20 owner-occupied units associated with them. Show your SQL query.
select ptype, own_occ, count(*) ptotal
  from bos05eb
 where ptype is not null and own_occ ='Y'
   and cm_id is null
 group by ptype, own_occ
 having count(*) >= 20
  order by ptype;
  
PTY O     PTOTAL
--- - ----------
013 Y         34
101 Y        783
104 Y        908
105 Y       1274
111 Y         65

Question III-1b. (8 points total)

Write one or more queries that counts the number of non-condo, OWN_OCC='Y' and OWN_OCC='N' assessing records for each PTYPE and computes the percentage of cases that are owner-occupied (OWN_OCC='Y') for each PTYPE. Restrict the results to show only those PTYPE values with at least 40 cases. Show your SQL query (or queries).

select y.ptype, (ycount+ncount) as ptotal, 
       ycount, ncount, 
       100*ycount/(ycount+ncount) as pct_ownocc
  from (select ptype, count(*) as ycount
          from bos05eb
         where own_occ = 'Y'and cm_id is null
         group by ptype) y,
       (select ptype, count(*) as ncount
          from bos05eb
         where own_occ = 'N'and cm_id is null
         group by ptype) n
 where y.ptype = n.ptype
   and ycount +ncount >= 40
 order by y.ptype;


PTY     PTOTAL     YCOUNT     NCOUNT PCT_OWNOCC
--- ---------- ---------- ---------- ----------
013        196         34        162 17.3469388
031         64         11         53    17.1875
101        982        783        199 79.7352342
104       1246        908        338 72.8731942
105       2199       1274        925 57.9354252
111        230         65        165 28.2608696
112         48          8         40 16.6666667

This same result can be reached without sub-queries by creating one view 
for 'Y' and another for 'N' and then combining them into a thrid query:

create view v_ylist as
select ptype, count(*) as ycount
  from bos05eb
 where own_occ = 'Y'and cm_id is null
 group by ptype;

create view v_nlist as
select ptype, count(*) as ncount
  from bos05eb
 where own_occ = 'N'and cm_id is null
 group by ptype;

select y.ptype, (ycount+ncount) as ptotal, 
       ycount, ncount, 
       100*ycount/(ycount+ncount) as pct_ownocc
  from v_ylist y, v_nlist n
 where y.ptype = n.ptype
   and ycount +ncount >= 40
 order by y.ptype;

Exam Turnin Checklist

Is your name and Athena ID at the top of your exam? If not, you'll lose 2 points!

Is your completed test saved in a plain text (ASCII) file named username_exam.txt or HTML (web page) file named username_exam.html, or MS-Word file name username_exam.doc?

Did you upload your answers to Stellar: https://stellar.mit.edu/S/course/11/sp16/11.521/homework/

Did you confirm with an instructor that your exam was received?