MIT MiniClass on Urban Analytics

MIT Department of Urban Studies and Planning

Lab Exercise #2

for Fall 2012 Mini-Classes on
Urban Analytics & Spatial Data Management

This Lab Exercise builds on the results of Lab Exercise #1 and emphasizes the use of ArcGIS to map model results and explore spatial patterns including spatial autocorrelatoin issues. You do not need to finish all the R-project exercises from Lab #1 before doing this Lab. We have saved the necessary results from Lab #1 in the class locker. But we will want to make use of R to explore some of the data and results. There are three purposes to this Lab Exercise:

Setup your computer to facilitate data exchange among GIS / Database / Statistics tools

Just as for Lab #1, we will need to install our own version of 32-bit R. We need a specific workaround for permission and version skew issues on lab machines to exchange data among ArcGIS, R, and MS-Access The necessary steps from Lab #1 are repeated here. Note that we have already downloaded and saved in the class locker the 47 MB zip file (/R-2.15.1-win.exe) for installing R.
Planners often need to acquire and integrate data from different sources using different computing systems and software tools, so there is a learning opportunity to seeing what it takes to coordinate software on machines where you do not have administrative privileges.

Examine the spatial pattern of VMT, factors, and model results in Prof. Mi Diao's PhD dissertation on VMT and the built environment
- Gain experience with spatially detailed 'big data' and some of the tools needed to extract components, georeference them, and develop analytic models
- Improve spatial analysis skills by learning to take advantage of the particular strengths of GIS, relational database management, and statistical tools
Experiment with GeoStatistics tools and visualization techniques that help in understanding our models of the VMT / built environment relationship
- Visualize and explore 'big data' relationships and what they do (and do not) say about the travel implications of urban form

Gain experience with sensitivity analysis and model exploration for spatially varying phenomena

During today's lab, you should focus on Parts 1 and 2 to make sure that you know how to set up the computing environment and are sufficiently familiar with the data and tools to be able to work on your own during the next two weeks. Prior to the next Thursday session (on Nov. 8), you should finish Part 2 and try at least some of Part 3. Those taking the miniclass for credit should prepare a few powerpoint or PDF slides and plan a 5 minute talk explaining their 'Part 3' exploration. This presentation should be a 'work in progress' that explains ideas or hypotheses of interest, methods and results to date, and any questions about how to interpret the results and continue the exploration. We will have a 'homework' location on the miniclass Stellar site to upload your slides in advance of the presentation.

Part 1: Setting up R on your computer:

The R package installed on lab machines in 37-312 and CRON runs the '64-bit' version of R. That is good and forward thinking since this version can handle datasets larger than a few gigabytes. However, this version does not work with the 'open database connect' (ODBC) drivers that are currently installed on the lab machines to facilitate data exchange between R and MS-Access or PostgreSQL. We can use plain text files to exchange data between R and MS-Access, but that can be less flexible and involve other complications. Since planners often need to acquire and integrate data across platforms and packages, it will be instructive to see some of the setup issues involved in downloading and installing 'open source' software on machines - including machines where you do not have administrative privileges.

The lab machines run the 64-bit version of Windows (so that the operating system can address arrays and memory spaces bigger than 2^32 = 4 GB). However, the MS-Access software on the machines is a 32-bit application. It is possible to run 64-bit R and exchange data with 32-bit MS-Access but not with the ODBC drivers that are currently setup on the lab machines. The easiest workaround is to download and install a 32-bit version of R. In order to do this, we cannot use the default installation settings since we do not have administrative privileges on the machines and cannot write to the folders where software is normally installed.

(1) Create your own workspace

In order to install R, we need to specify a few folders to which you have read/write access. Use Windows Explorer to create: c:\temp\rinstall and c:\temp\rlib32. The first sub-directory is for the software installation and the second is for additional software libraries that you will need. Alternatively, you can create the two folders by using Start/Run and then entering this command to create the directory:

mkdir c:\temp\rinstall
mkdir c:\temp\rlib32

(2) Download the latest version of R for Windows-7 machines from: http://www.r-project.org/ For you convenience, we have saved the current Windows 32/64-bit version (R-2.15.1-win.exe) in the class locker. You can find it here: http://mit.edu/11.523/R-project/R-2.15.1-win.exe

Choose any of the CRAN mirror sites in the US to download the software. For example,

http://streaming.stat.iastate.edu/CRAN/ at Iowa State University

Click 'Download R for Windows', choose the 'base' option, and the click on 'download R 2.15.1 for Windows' to get the 47 MB executable that contains the binaries for both the 32 and 64-bit versions of R.

(3) Install 32-bit R by running the file you just downloaded (R-2.15.1-win.exe) using the defaults except for these cases:

Change the installation directory to be: C:\temp\rinstall

Install only the 'core' and the '32-bit' version. (The 64-bit version is already installed on the machine and you do not need another copy.)

Choose 'yes' to do a 'customized startup' rather than the default installation option

Check "don't create a start menu folder" since you 32-bit R installation will only work temporarily on this one machine

Uncheck the option to save the version number in the Registry (since you do not have permission to do this)

(4) Run 32-bit R and load some additional software packages into your personal library

Locate and run R by either finding the executable in Windows Explorer or entering this command in the Start/Run Window:

C:\temp\rinstall\bin\i386\Rgui.exe

A large Window labeled 'RGui (32-bit)' will open with a smaller 'R-Console' window where you can enter commands following the '>' prompt.

(5) Run the following commands to install several optional library packages: (a) 'RODBC' to provide ODBC connectivity with database managers, (b) 'hexbin' to support improved visualization of scattergrams, and (c) 'psych' and 'GPArotation' to support additional principal component analysis. You could do this from the m Packages tab, but we need to specify a particular personal library so it is better to use the command line approach.

install.packages("RODBC", lib="C:/temp/rlib32")

install.packages("hexbin", lib="C:/temp/rlib32")

install.packages("psych", lib="C:/temp/rlib32")

install.packages("GPArotation", lib="C:/temp/rlib32")

When you enter the first command, you will see a popup window asking you to specify which CRAN mirror site to use to get the package. Choose a US site such as USA(IN) for Indiana University. Each of these commands will locate and copy a zipped file of binaries and from the CRAN mirror site into a local workspace - the default workspace on the lab machines is on your 'H' drive in:

C:\Users\jf\AppData\Local\Temp\RtmpYH6bl2\downloaded_packages

You do not need to care where the binaries are installed, except to realize that it is on your networked H drive which could slow the download. If you have trouble downloading from a CRAN mirror site, use the Packages/set-CRAN-mirror... tab to choose another site. Once all the packages are downloaded to your machine they need to be unzipped and installed as a usable library. We want these libraries to be local and in one of the writeable directories that we created on C:\. Here are the appropriate commands:

library("RODBC",lib.loc="C:\\TEMP\\Rlib32")
library("hexbin", lib="C:/temp/rlib32")
library("psych",lib.loc="C:\\TEMP\\Rlib32")
library("GPArotation",lib.loc="C:\\TEMP\\Rlib32")

Note that some commands use forward slash '/' and others use double backward slashes '\\'. Either option gets converted to the same thing internally in R - namely, into the single forward slash '/' convention for unix paths. The double '\\' is needed since unix uses the single '\' as a escape character.

Part 2: Visualizing Factors and Residuals and Exporting Data to MS-Access and ArcMap

Here is a summary of the R commands (after the initial installation of your local copy of 32-bit R) that we used in Lab Exercise #1 to read the BE, demographic, community type, and VMT data from our MS-Access database, do the principal component analysis, and ordinary least squares modeling:
Step R command
Load custom libraries
library("RODBC",lib.loc="C:\\TEMP\\Rlib32")
library("hexbin", lib="C:/temp/rlib32")
library("psych",lib.loc="C:\\TEMP\\Rlib32")
library("GPArotation",lib.loc="C:\\TEMP\\Rlib32")
Open channel to MS-Access database
UAchannel <- odbcConnect("MS Access Database",uid="")
Read the 4 tables from MS-Access into R
demdata <- sqlFetch(UAchannel, "demographic_bg")
bedata <- sqlFetch(UAchannel, "built_environment_250m")
vmtdata <- sqlFetch(UAchannel, "vmt_250m")
ctdata <- sqlFetch(UAchannel, "ct_grid")
Remove unwanted columns
outlist <- c("BG_ID","CT_ID","F1_wealth","F2_children","F3_working")
demdata2 <- demdata[,setdiff(names(demdata),outlist)]
demdata3 <- demdata2[demdata2[,'INC_MED_HS']>0,]
Remove rows if no income
demdata3 <- demdata2[demdata2[,'INC_MED_HS']>0,] 
Remove incomplete rows
demdata4 <- demdata2[complete.cases(demdata2),]
Remove another column
outlist <- c("pct_ov2car")
demdata5 <-demdata4[,setdiff(names(demdata4),outlist)]
Find principal components
prin5 <- principal(demdata5, nfactors=3, rotate="varimax",scores=TRUE);
Get 4-way joined table from MS-Access
vmt_join <- sqlFetch(UAchannel, "t_vmt_join4")
Remove incomplete rows
vmtall <- vmt_join[complete.cases(vmt_join),]     
Regress 9-cell VMT average on the 8 factors
run_ols1 <- lm(formula = (vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect
                        + f3_transit + f4_auto + f5_walk +f1_wealth
                        + f2_kids + f3_work, data=vmtall)
(1) Rerun these commands to be sure to start off with the same results as those we had at the end of the Tuesday session.

(2) Visualize the principal components that we identified (they are similar to the 3 demographic factors that Prof. Diao constructed). The histograms show the distribution of each component across all block groups. The scatterplots use pairs of the 3 components along the X-Y axes and superimpose the axes showing high/low directions for each of the original variables.

biplot(prin5)

(3) Let's examine the residuals from our ordinary least squares (OLS) model. These residuals are the actual values minus the values predicted by the regression model. We can use the residual function to extract the residuals from our regression model results and plot them against the relevant VMT estimate (which is ]vmt_tot_mv / vin_mv] for our model). The scatterplot works but has so many points that it is hard to interpret. To get a better visualization, use the 'hexbin' package.

res1 <- resid(run_ols1)
plot(vmtall$vmt_tot_mv / vmtall$vin_mv,res1,ylab="Residuals",xlab="VMT",main="Errors in VMT prediction")
bin <- hexbin(vmtall$vmt_tot_mv / vmtall$vin_mv, res1)
plot(bin,ylab="Residuals",xlab="VMT",main="Errors in VMT prediction")

The scatterplots suggest that residuals and VMT are correlated - they are high or low together. This suggests that the adjustments to annual mileage that are associated with out factors may be better modeled as percentage changes rather than absolute value differences. With this in mind, we might consider fitting a model of log(VMT).

We might also want to use additional R tools to plot fitted values and residuals in order to check some of the model assumptions. The left quantile-quantile plot of fits and residuals assumes a uniform distribution and the right pair assumes a normal distribution. If the data follow the assumed distribution, then the plots will be linear. The fitted values look close to normal but the residuals are both widely varying and less normal.

rfs(run_ols1)

rfs(run_ols1,distribution=qnorm)

(4) Let's see if the residuals exhibit a different pattern within each community type. The community type is specified in one of the columns of 'vmtall'. We can stratify the block groups by community type and then examine the residuals for each group. To get the residuals into the same array as our original data, let's first add the residuals as an additional column to 'vmtall' - the large matrix used for our regression model. While we are at it, we use the fitted() function to pull the fitted values into a vector as well. We use the 'cbind' command (to combine columns) to merge the data, then we plot the residuals for each community type as a 'box and whiskers plot'

fit1 <- fitted(run_ols1)

vmtall2 <- cbind(vmtall,fit1,res1)

boxplot(res1 ~ ct, data=vmtall2,ylab="Residuals",main="VMT Residuals by Community Type")

The median and interquartile range are fairly similar but their might be some heteroscedasticity since the residuals seem to be much larger in the developing suburbs than in the inner core (the second column). Also, the inner core has a negative median residual suggesting that city cars tend to be driven even less than the model would suggest.

(5) Next, let's move some of the R results into MS-Access tables so we can get them into ArcGIS for spatial analysis and mapping. R provides functions that can save our data objects so we do not have to recreate them every time. We can save data in R's native format or move specific data arrays into external tables as plain text files or tables in a database manager such as MS-Access. The following commands save vmtall2 in R's format (indicated by the Rdata suffix), as a CSV text file, and as a table called 'vmtall2' in MS-Access. We also save in R format the three original tables that we imported from MS-Access.

save(vmtall2,file="c:/temp/data_ua/rdata/vmtall2.rdata")

save(bedata,file="c:/temp/data_ua/rdata/bedata.rdata")

save(demdata,file="c:/temp/data_ua/rdata/demdata.rdata")

save(vmtdata,file="c:/temp/data_ua/rdata/vmtdata.rdata")

write.csv(vmtall2,file="c:/temp/data_ua/rdata/vmtall2b.csv",row.names=FALSE) ## do not add row numbers to the output

sqlSave(UAchannel, vmtall2, rownames=FALSE)

Use Windows Explorer and MS-Access to check that these files and tables are written into the expected location.

There are many ways to save and retrieve data in R. Use help(save) to look into the variations. Sometimes, it is helpful to save your entire R workspace so that you can pick up where you left off, and the command save.image() bundles a particular flavor of save() options in order to do this. However, you may have trouble with save.image() on lab machines since you do not have write permission for certain default locations. Also, your workspace may includes external connections such as your channel to MS-Access. You are left on your own to review the help files and experiment with various 'save' options to find the variation of 'save' that works best for you.

(6) Examining the model results in ArcMap: If you have not already done so, find and double-click on the saved ArcMap document in c:\temp\data_ua\miniclass_lab2start.mxd to open the shapefiles for the MiniClass:

Click the 'add data' icon, navigate to your MS-Access database and add the 'vmtall2' table that we just create in R. Join this table to the 'all_stategrid' shapefile based on the grid cell ID, 'g250m_id'. Say 'yes' when asked about creating an index. Next, create a thematic map of the residuals from our OLS model (res1) and classify the data by 'standard deviation' so we can easily spot with high/med/low residuals. The red cells with high residuals are the places where the model underestimated the actual VMT. At this point, you will notice that the map takes a long time to redraw. This is because there are 356K grid cells in Massachusetts - but only 53250 have estimates (because they were in metro Boston and met the criteria regarding VINs, households, and estimated annual mileage). To speed up the processing, you may want to create a new shapefile, bos_vmt_grid.shp, with only those 53250 grid cells that have data. You will also want to set the grid cell outline color to 'none' so the boundaries do not obscure the shading. When you set the thematic shading in the classification window, click the 'sampling' button and change the maximum sample size to be 54000. Otherwise, the breakpoints for the shading will be determine only using the first 10000 grid cells. Finally, we have added two web mapping services (one from MassGIS and three from ESRI). You can turn these on to overlay images, road maps and shaded relief maps in order to help you interpret metro Boston areas at different scales. The result for the fitted values and the residuals should look something like this:

Take advantage of having the vmtall2 table joined to the all_stategrid to examine other thematic maps of factor values, fitted VMT values, and the like. From the plot of residuals, it does look like some spatial patterns have been missed. The residuals are way negative (i.e., red - implying that we overestimated mileage) for inner core grid cells and it looks like they are also low for some of the regional centers and transportation corridors. Alternatively, the residuals are relatively high (blue) for some of the suburban grid cells that are mixed in with the 'null' grid cells (where no one lives because it is a pond or open space). These impressions may help us think of additional factors or relationships that could capture some of the remaining spatial effects. For your convenience, the above ArcMap document has been saved as the miniclass_lab2_end.mxd in the 'data_ua' sub-directory of the class locker. The newly created bos_vmt_grid.shp shapefile has also been added to the 'shapefile' folder within the 'data_ua' sub-directory of the class locker so the ArcMap document can find this new shapefile. Be aware, however, that the 'vmtall2' table that we generated in R has not been added to the MS-Access database saved in the class locker. Finally, we also save the ArcMap document in version 10.0 format as, miniclass_lab2_end_v10.0.mxd in case some of you are still running ArcGIS ver. 10.0 on your personal machines.

(7) Open the ArcMap help files and search for spatial autocorrelation - that is, the extent to which a spatial pattern is non-random. You may want to look at the Introduction to the ArcGIS Geostatistical Analyst Tutorial. At least read about 'how spatial autocorrelation (Moran's I) works'. You may want to try some of the geostatistical tools to in the spatial statistics section to examine the nature and extent of spatial autocorrelation in the data.

(8) A popular free tool for analyzing spatial autocorrelation is Open GeoDa (developed by Lou Anselin, et al., now at Arizona State). Some of his tools are integrated into ArcGIS. We do not have GeoDa loaded on the lab machines, but you can download your own copy from here:

Home page: https://geodacenter.asu.edu/

Download site: https://geodacenter.asu.edu/software/downloads

You can use this tool to read shapefiles such as the bos_vmt_grid.shp that you just created and do spatial statistics. If you have time, load the package onto your personal machine and try it out on bos_vmt_grid.

Part 3: Experiment with alternative indicators and models

Now that we have a feel for basic R commands and model building, we can begin a more open-ended inquiry into the data. We will quickly want to learn some additional commands but, hopefully, we have enough of a feel for R and the connections between ArcGIS, R and MS-Access that we can teach ourselves many of the additional commands that we need by using the online help files and the various references that we have provided. (Depending upon our statistics background, we may also need to do some significant reading of a statistics text in addition to the references we give for the software tools!)

Here are some suggestions for exploring the data further:

Use the Community Type category to subdivide the grid cells into those within the four community types. Refit the OLS model for each of the types of community. How different are the results? How might you dig into the data further to sort out what might be happening and what transform or redefinition of variables might improve the model fit?

We can use the regression model at the end of Part 2 to estimate VMT for each grid cell and compute the 'residuals' - that is the difference between the actual and predicted VMT values for each grid cell. Then we can import these residuals back into ArcGIS, join them to the grid cell layer and see whether the residuals seem to vary systematically across metro Boston. We can even do some 'Lisa' plots and compute spatial statistics such as Moran's I to identify hot spots and test how strong is the evidence of spatial autocorrelation. One of the R commands that you may find helpful here is the sqlSave() command to write tables back to MS-Access. use the online help function in R to learn about the sqlSave() command: help(sqlSave). Also, here is a webpage with a tutorial on plotting residuals from a regression model: http://www.r-tutor.com/elementary-statistics/simple-linear-regression/residual-plot

The pairwise scattergrams that we produced seemed to suggest skewness in some of the distributions and non-linearities in some of the relationships. Select a few variables where this appears problematic and experiment with some transformations and then refit the VMT model. Do the results suggest different effects? Do they suggest ways of identifying sub-markets within which the VMT / BE / Demographic relationships might be different.

From the data, it is pretty clear that annual mileage per vehicle and per household is relatively low in the central city and relatively high in the outer suburbs. The BE factors tend to measure local effects and they do seem to explain some of the within-town variation in VMT that is evident in the suburbs. What other measures might you consider to capture built environment effects that are at a different scale from those which Prof. Diao considered? Take a look at the often-cited paper by Cervero and Kockelman on "Travel Demand and the 3Ds: Density, Diversity, and Design" (Transportation Research, part D, Vol. 2, No. 3, pp. 199-219, 1997). A copy, Cervero_1997.pdf, has been placed in the 'readings' folder of the 11.523 class locker for your convenience.

For those of you who are taking the MiniClass for credit, pick one or more of these suggestions and see what you can learn prior to the next Tuesday session on Nov. 6 and the Thursday session on Nov. 8. Prepare a view powerpoint or PDF slides and be prepared to spend 5-10 minutes explaining and discussing your preliminary results. Those taking both the Tuesday 'urban analytics' track and the Thursday 'spatial data management and visualization' track should plan short presentations for each day with the Tuesday track focusing on statistics and modeling (with R) and the Thursday track focusing on geostats and visualization using ArcGIS, relational databases, and/or GeoDa.

Last modified: 8 November 2012 [jf]

Back to DUSP | CRON | MIT | MiniClass Homepage