This Lab Exercise builds on the results of Lab Exercise #1 and emphasizes the use of QGIS to map model results and explore spatial patterns. Specifically, we will:
Set up your computer to facilitate data exchange among GIS / Database / Statistics tools.
Examine the spatial pattern of VMT, factors, and model results in Prof. Mi Diao’s PhD dissertation on VMT and the built environment.
Experiment with geostatistics tools and visualization techniques that help in understanding our models of the VMT / built environment relationship.
You do not need to finish all the R-project exercises from Lab #1 before doing this Lab. We have saved the necessary results from Lab #1 in the class locker. But we will want to make use of R to explore some of the data and results.
During today’s lab, you should focus on Parts 1 and 2 to make sure that you know how to set up the computing environment and are sufficiently familiar with the data and tools to be able to work on your own during the next two weeks. Prior to the next Thursday session (on Nov. 8), you should finish Part 2 and try at least some of Part 3. Those taking the miniclass for credit should prepare a few slides and plan a 5 minute talk explaining their ‘Part 3’ exploration. This presentation should be a ‘work in progress’ that explains ideas or hypotheses of interest, methods, and results to date, and any questions about how to interpret the results and continue the exploration. We will have a ‘homework’ location on the miniclass Stellar site where you will upload your slides in advance of the presentation.
Open RStudio and run the following ‘install’ commands in the console if you have not already done so on your computer. (The packages need only be downloaded and locally installed once on a particular computer.) The libraries required for these labs are: (a) RPostgreSQL, which provides connectivity with PostgreSQL databases; (b) hexbin, which supports improved visualization of scattergrams; (c) psych and GPArotation, which support the principal component analysis (with rotation); and (d) lattice, which supports the plotting of residuals. You could also do this installation by using the ‘Install Packages’ function found in the Packages tab.
install.packages("RPostgreSQL")
install.packages("hexbin")
install.packages("psych")
install.packages("GPArotation")
install.packages("lattice")
Each of these install commands will locate a zipped file of binaries on a CRAN mirror site and copy it into your local package library. Once all the packages are downloaded to your machine, they need to be loaded each time you plan to use them. Place the following at the top of an R script and execute it.
library("DBI")
library("RPostgreSQL")
library("hexbin")
library("psych")
library("GPArotation")
library("lattice")
Here is a summary of the R commands that we used in Lab Exercise #1 to read the built environment, demographic, community type, and VMT data from our Postgres database, do the principal component analysis, and fit the ordinary least squares model. Rerun these commands to be sure you start off with the same results as those we had at the end of the Tuesday session.
# Load required libraries
library("DBI")
library("RPostgreSQL")
library("hexbin")
library("psych")
library("GPArotation")
library("lattice")
# Connect to the Postgres database ('name' and 'pw' should already hold
# your Kerberos username and class password as strings)
db <- dbConnect(PostgreSQL(), dbname = "class946",
                host = "cronpgsql.mit.edu", port = 5432,
                user = name, password = pw)
# Read the 4 tables from Postgres into R
dem <- dbGetQuery(db, "SELECT * FROM vmtdata.demographic_bg")
be <- dbGetQuery(db, "SELECT * FROM vmtdata.built_environment_250m")
vmt <- dbGetQuery(db, "SELECT * FROM vmtdata.vmt_250m")
ct <- dbGetQuery(db, "SELECT * FROM vmtdata.ct_grid")
# Remove unwanted columns
outlist <- c("bg_id", "ct_id", "f1_wealth", "f2_children",
"f3_working", "pct_ov2car")
dem2 <- dem[,setdiff(names(dem), outlist)]
# Remove rows with no reported income
dem3 <- dem2[dem2[,'inc_med_hs'] > 0,]
# Remove incomplete rows
dem4 <- dem3[complete.cases(dem3),]
# Find principal components
prin5 <- principal(dem4, nfactors = 3,
rotate = "varimax", scores = TRUE)
# Get 4-way joined table from Postgres and remove incomplete rows.
vmt_join <- dbGetQuery(db, "SELECT * FROM vmtdata.v_vmt_join4")
vmtall <- vmt_join[complete.cases(vmt_join),]
# Regress 9-cell VMT average on the 8 factors
run_ols1 <- lm(formula = (vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect
+ f3_transit + f4_auto + f5_walk +f1_wealth
+ f2_kids + f3_work, data = vmtall)
Visualize the principal components that we identified (they are similar to the 3 demographic factors that Prof. Diao constructed).
biplot(prin5)
The histograms show the distribution of each component across all block groups. The scatterplots use pairs of the 3 components along the X-Y axes and superimpose the axes showing high/low directions for each of the original variables.
Let’s examine the residuals from our ordinary least squares (OLS) model. These residuals are the actual values minus the values predicted by the regression model. We can use the resid() function to extract the residuals from our regression model results and plot them against the relevant VMT estimate (which is vmt_tot_mv / vin_mv for our model). The scatterplot works but has so many points that it is hard to interpret. To get a better visualization, use the hexbin package.
res1 <- resid(run_ols1)
plot(vmtall$vmt_tot_mv / vmtall$vin_mv, res1, ylab = "Residuals",
     xlab = "VMT", main = "Errors in VMT prediction")
bin <- hexbin(vmtall$vmt_tot_mv / vmtall$vin_mv, res1)
plot(bin, ylab = "Residuals", xlab = "VMT",
main = "Errors in VMT prediction")
The scatterplots suggest that residuals and VMT are correlated—they are high or low together. This suggests that the adjustments to annual mileage that are associated with our factors may be better modeled as percentage changes rather than absolute differences. With this in mind, we might consider fitting a model of log(VMT).
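A minimal sketch of such a log-linear variant (added here for illustration; it assumes the same predictors as run_ols1 and strictly positive per-vehicle VMT values) might look like this:
# Log-linear variant (a sketch, not part of the original lab): the factor
# effects now act as approximate percentage changes in VMT rather than
# absolute mileage differences.
run_log <- lm(log(vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit
              + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
              data = vmtall)
summary(run_log)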
We might also want to use additional R tools to plot fitted values and residuals in order to check some of the model assumptions. The first pair of quantile-quantile plots of fits and residuals assumes a uniform distribution; the second pair assumes a normal distribution. If the data follow the assumed distribution, then the plots will be linear. The fitted values look close to normal but the residuals are both widely varying and less normal.
# Assumes a uniform distribution
rfs(run_ols1)
# Assumes a normal distribution
rfs(run_ols1, distribution = qnorm)
Let’s see if the residuals exhibit a different pattern within each community type. The community type is specified in one of the columns of vmtall. We can stratify the block groups by community type and then examine the residuals for each group. To get the residuals into the same array as our original data, let’s first add the residuals as an additional column to vmtall—the large matrix used for our regression model. While we are at it, we use the fitted() function to pull the fitted values into a vector as well. We use the cbind command (to combine columns) to merge the data, then we plot the residuals for each community type as a box-and-whiskers plot.
# Box-and-whiskers plot of VMT residuals by community type
fit1 <- fitted(run_ols1)
vmtall <- cbind(vmtall, fit1, res1)
boxplot(res1 ~ ct, data = vmtall, ylab = "Residuals",
        main = "VMT Residuals by Community Type")
The median and interquartile range are fairly similar across community types, but there might be some heteroscedasticity since the residuals seem to be much larger in the developing suburbs than in the inner core (the second column). Also, the inner core has a negative median residual, suggesting that city cars tend to be driven even less than the model predicts.
Thus far, we have only examined the model that estimates VMT per vehicle. Formulate a second model for VMT per household. Turn in, as part of your submission for Lab #2, the R statement for your model along with the corresponding hexbin-based scatterplot of residuals and the box plot of residuals by community type.
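As a starting point, your statement will have the same shape as run_ols1 with the denominator swapped. The household-count column name below (hh_mv) is hypothetical; check names(vmtall) for the actual column:
# Template only: 'hh_mv' is a hypothetical name for the households-per-cell
# column; substitute the real column from names(vmtall).
run_ols2 <- lm((vmt_tot_mv / hh_mv) ~ f1_nw + f2_connect + f3_transit
               + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
               data = vmtall)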
Next, let’s move some of the R results into Postgres tables so we can pass them into QGIS for spatial analysis and mapping. R provides functions that can save our data objects so we do not have to recreate them every time. We can save data in R’s native format or move specific data arrays into external tables as plain text files or tables in a database manager such as Postgres. The following commands save vmtall in R’s native format (indicated by the .rdata suffix) and as a table in Postgres. Push the table using a table name that follows the convention "<your Kerberos username>_vmtall" (e.g., ehuntley_vmtall). We also save, in R format, the three original tables that we imported from Postgres.
# Write data to the Postgres database; do not add row numbers to the output
dbWriteTable(db, "ehuntley_vmtall", value = vmtall, row.names = FALSE)
# Save vmtall and the three original tables in R's native format
save(vmtall, file = "C:/temp/data_ua/rdata/vmtall.rdata")
save(be, file = "C:/temp/data_ua/rdata/bedata.rdata")
save(dem, file = "C:/temp/data_ua/rdata/demdata.rdata")
save(vmt, file = "C:/temp/data_ua/rdata/vmtdata.rdata")
Use Windows Explorer (or macOS Finder) and pgAdmin to check that these files and tables were written to the expected locations.
There are many ways to save and retrieve data in R. Use help(save) to look into the variations. Sometimes, it is helpful to save your entire R workspace so that you can pick up where you left off; the command save.image() bundles a particular flavor of save() options in order to do this. However, you may have trouble with save.image() on lab machines since you do not have write permission for certain default locations. Also, your workspace may include external connections, such as your connection to Postgres, that cannot be restored automatically. You are left on your own to review the help files and experiment with the various ‘save’ options to find the variation that works best for you.
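For example, here is a sketch that saves and later restores the whole workspace from an explicitly writable location (the path is an assumption; adjust it for your machine):
# Save the entire workspace to a location where you have write permission
# (path is an assumption; adjust for your machine)
save.image(file = "C:/temp/data_ua/rdata/lab2_workspace.RData")
# Restore it in a later session with:
# load("C:/temp/data_ua/rdata/lab2_workspace.RData")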
Create a new document in QGIS. First, you will have to connect QGIS to our class Postgres/PostGIS database. Click the ‘Add PostGIS Layers’ icon along the left side of the window. Click ‘New’ to make a new connection. Name your connection ‘class946’. Leave the ‘Service’ field blank. The host is ‘cronpgsql.mit.edu’. Connect on port ‘5432’. Specify the ‘class946’ database. Under ‘Authentication’, use your Kerberos username and your class password, saving only your username. Ensure that ‘Also list tables with no geometry’ is selected. Your window should look something like the following:
Create a New PostGIS Connection in QGIS.
Connect to the database. You should see the public and vmtdata schemas appear in the Add PostGIS Table(s) window. Close the window. In your QGIS browser panel, navigate to your Postgres database connection and add the <your Kerberos username>_vmtall table that we just pushed to the database using R—it will appear under the public schema. Join this table to the g_all_stategrid table based on the grid cell ID column, g250m_id. QGIS lets you specify a column prefix, allowing you to dodge the nasty, long column names that can result from joins in ArcMap—I like to use an underscore (_).
Joining with QGIS.
You’ve probably noticed that this layer takes a long time to draw. This is because there are 355,728 grid cells in Massachusetts, but only 53,250 have estimates (because they were in metro Boston and met the criteria regarding VINs, households, and estimated annual mileage). To speed up the processing, you may want to create a new table that only includes cells with values. To do this, select features using a SQL expression like "_res1" IS NOT NULL (any of the joined columns would work). Then, open QGIS’s DB Manager (Database > DB Manager), select ‘Import layer/file’, and choose g_all_stategrid, ensuring that ‘Import only selected features’ is checked. Name the table <your Kerberos username>_g_bos and add it to the public schema. Your window should look like the screenshot below.
Importing selected features with QGIS’s DB Manager.
We’ve now read and written data tables to a Postgres database using R and QGIS!
Next, create a thematic map of the residuals from our OLS model (the res1 column) using the table you just created. Classify the data by ‘standard deviation’ so we can easily spot high/med/low residuals. Change the outline around the Simple Fill to a transparent border. The red cells with high residuals are the places where the model underestimated the actual VMT.
Much like ArcMap, QGIS allows you to load map tiles from web mapping services. These will help provide some context for our grid. To add a basemap, you’ll need to add the QuickMapServices plug-in (Plugins > Manage and Install Plugins…). Once you’ve installed the plugin, you can expand the number of basemaps available to you by opening the QuickMapServices settings (Web > QuickMapServices > Settings), selecting ‘More services’ and clicking the ‘Get contributed pack’ button. In the screenshots below, I’ve added the CartoDB Positron (No Labels) basemap. You can turn these on to overlay images, road maps and shaded relief maps in order to help you interpret metro Boston areas at different scales. The result for the fitted values and the residuals should look something like this:
Map of fitted values.
Map of residuals.
Examine other thematic maps of factor values, fitted VMT values, and the like. From the plot of residuals, it does look like some spatial patterns have been missed. The residuals are way negative (implying that we overestimated mileage) for inner core grid cells, and it looks like they are also low for some of the regional centers and transportation corridors. In contrast, the residuals are relatively high for some of the suburban grid cells that are mixed in with the NULL grid cells (where no one lives because it is a pond or open space). These impressions may help us think of additional factors or relationships that could capture some of the remaining spatial effects. For your convenience, the table that results from a successful join and selection has been added to the Postgres database as the table bos_vmt_grid. Be aware, however, that the vmtall table that we generated in R has not been added to the Postgres database.
Turn in, as part of your submission for Lab #2, two new maps showing your fitted values of VMT-per-household and the residuals (using the VMT-per-household model that you specified earlier). You may use either QGIS or ArcGIS to prepare the maps.
Open ArcMap. In your catalog pane, expand the ‘database connections’ folder and click ‘Add Database Connection.’ Fill out the information in the Database Connection prompt such that it matches the screenshot below.
Add database in ArcMap.
Add the <your Kerberos username>_vmtall table that we just created in R (e.g., class946.public.ehuntley_vmtall). Join this table to the g_all_stategrid table (i.e., class946.vmtdata.g_all_stategrid) based on the grid cell ID, g250m_id. Say ‘yes’ when asked about creating an index. Next, create a thematic map of the residuals from our OLS model (res1) and classify the data by ‘standard deviation’ so we can easily spot high/med/low residuals. The cells with high residuals are the places where the model underestimated the actual VMT. At this point, you will notice that the map takes a long time to redraw. This is because there are 355,728 grid cells in Massachusetts, but only 53,250 have estimates (because they were in metro Boston and met the criteria regarding VINs, households, and estimated annual mileage). To speed up the processing, you may want to create a new shapefile, bos_vmt_grid.shp, with only those 53,250 grid cells that have data. You will also want to set the grid cell outline color to ‘none’ so the boundaries do not obscure the shading. When you set the thematic shading in the classification window, click the ‘sampling’ button and change the maximum sample size to 54,000. Otherwise, the breakpoints for the shading will be determined using only the first 10,000 grid cells.
Finally, we have added four web mapping services (one from MassGIS and three from ESRI). You can turn these on to overlay images, road maps and shaded relief maps in order to help you interpret metro Boston areas at different scales. The result for the fitted values and the residuals should look something like this:
Map of fitted values.
Map of residuals.
Take advantage of having the vmtall table joined to the g_all_stategrid to examine other thematic maps of factor values, fitted VMT values, and the like. From the plot of residuals, it does look like some spatial patterns have been missed. The residuals are way negative (implying that we overestimated mileage) for inner core grid cells, and it looks like they are also low for some of the regional centers and transportation corridors. In contrast, the residuals are relatively high for some of the suburban grid cells that are mixed in with the ‘null’ grid cells (where no one lives because it is a pond or open space). These impressions may help us think of additional factors or relationships that could capture some of the remaining spatial effects.
There are a number of popular free, open-source tools for analyzing spatial autocorrelation—that is, the extent to which a spatial pattern is non-random. These include Open GeoDa (developed by Luc Anselin et al., now at the University of Chicago). We do not have GeoDa loaded on the lab machines, but you can download your own copy from the GeoDa Center. The R spatial (sp) package is also becoming quite popular and can be used to analyze spatial autocorrelation; see this presentation by Roger Bivand, one of the creators of the R spatial package.
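For instance, here is a minimal sketch of a global Moran’s I test using the spdep companion package (an assumption on our part: spdep is not part of this lab’s setup, and ‘grid’ stands for a hypothetical polygon layer of the 53,250 Boston grid cells with the residuals joined):
# A sketch (not part of the original lab) using the spdep package.
# 'grid' is a hypothetical SpatialPolygonsDataFrame of the Boston grid
# cells with the OLS residuals attached as a 'res1' column.
library(spdep)
nb <- poly2nb(grid, queen = TRUE)                    # contiguity neighbors
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)  # row-standardized weights
moran.test(grid$res1, lw, zero.policy = TRUE)        # global Moran's I
lisa <- localmoran(grid$res1, lw)                    # local (LISA) statistics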
ArcMap also contains tools for spatial autocorrelation analysis. You may want to look at the Introduction to the ArcGIS Geostatistical Analyst Tutorial. At least read about ‘how spatial autocorrelation (Moran’s I) works’. You may want to try some of the geostatistical tools in the spatial statistics section to examine the nature and extent of spatial autocorrelation in the data.
Now that we have a feel for basic R commands and model building, we can begin a more open-ended inquiry into the data. We will quickly want to learn some additional commands but, hopefully, we have enough of a feel for R and the connections between QGIS, R and Postgres that we can teach ourselves many of the additional commands that we need by using the online help files and the various references that we have provided. (Depending upon our statistics background, we may also need to do some significant reading of a statistics text in addition to the references we give for the software tools!)
Here are some suggestions for exploring the data further:
Use the Community Type category to subdivide the grid cells into those within the four community types. Refit the OLS model for each of the types of community; a starter sketch appears after this list of suggestions. How different are the results? How might you dig into the data further to sort out what might be happening and what transform or redefinition of variables might improve the model fit?
We can use the regression model at the end of Part 2 to estimate VMT for each grid cell and compute the ‘residuals’, that is, the difference between the actual and predicted VMT values for each grid cell. Then we can import these residuals back into QGIS, join them to the grid cell layer, and see whether the residuals seem to vary systematically across metro Boston. Here is a tutorial on plotting residuals from a regression model. We can even do some ‘LISA’ plots and compute spatial statistics such as Moran’s I to identify hot spots and test how strong the evidence of spatial autocorrelation is. One of the R commands that you may find helpful here is the dbWriteTable() command to write tables back to Postgres. Use the online help function in R to learn about it: help(dbWriteTable). Note that this command requires the RPostgreSQL library that we loaded at the start of this lab to facilitate communication with PostgreSQL.
The pairwise scattergrams that we produced seemed to suggest skewness in some of the distributions and non-linearities in some of the relationships. Select a few variables where this appears problematic, experiment with some transformations, and then refit the VMT model. Do the results suggest different effects? Do they suggest ways of identifying sub-markets within which the VMT / BE / demographic relationships might be different?
From the data, it is pretty clear that annual mileage per vehicle and per household is relatively low in the central city and relatively high in the outer suburbs. The BE factors tend to measure local effects, and they do seem to explain some of the within-town variation in VMT that is evident in the suburbs. What other measures might you consider to capture built environment effects that are at a different scale from those which Prof. Diao considered? Take a look at the often-cited paper by Cervero and Kockelman on “Travel Demand and the 3Ds: Density, Diversity, and Design” (Transportation Research Part D, Vol. 2, No. 3, pp. 199-219, 1997). A copy, Cervero_1997.pdf, has been placed in the ‘readings’ folder of the 11.523 class locker for your convenience.
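As promised in the first suggestion above, here is a minimal sketch of refitting the OLS model within each community type. It assumes only that vmtall (with its ct column) is still loaded from Part 2; everything else mirrors run_ols1.
# A sketch (not part of the original lab): loop over the community types,
# refit the Part 2 model within each, and report the goodness of fit.
for (type in unique(as.character(vmtall$ct))) {
  sub <- vmtall[vmtall$ct == type, ]
  fit <- lm((vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit
            + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
            data = sub)
  cat("Community type:", type, " R-squared:",
      round(summary(fit)$r.squared, 3), "\n")
}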
For those of you who are taking the MiniClass for credit, pick one or more of these suggestions and see what you can learn prior to the next formal Tuesday session on Nov. 6 and the Thursday session on Nov. 8. By the end of lab on Tuesday, October 31, turn in a brief project proposal explaining, in one or two paragraphs, the exploration of indicators and models that you plan to pursue for your project. If you plan to work in a group (of 2 or 3), please indicate who is in your group. You should then plan to prepare some slides and a 5-10 minute presentation for the next week (Nov. 6 or 8) explaining and discussing your preliminary results.
Remember also to submit to Stellar on Oct. 31 your Lab #1 exercise and your brief project proposal.
Created by Joe Ferreira (2012); modified by Joe Ferreira and Eric Huntley (2017)
Last modified: 24 October 2017 [Eric Huntley]