This Lab Exercise builds on the results of Lab Exercise #1 and emphasizes the use of QGIS to map model results and explore spatial patterns. Specifically, we will:
Set up your computer to facilitate data exchange among GIS / Database / Statistics tools.
Examine the spatial pattern of VMT, factors, and model results in Prof. Mi Diao’s PhD dissertation on VMT and the built environment.
Experiment with geostatistics tools and visualization techniques that help in understanding our models of the VMT / built environment relationship.
You do not need to finish all the R-project exercises from Lab #1 before doing this Lab. We have saved the necessary results from Lab #1 in the class locker. But we will want to make use of R to explore some of the data and results.
During today’s lab, you should focus on Parts 1 and 2 to make sure that you know how to set up the computing environment and are sufficiently familiar with the data and tools to be able to work on your own during the next two weeks. Prior to the next Thursday session (on Nov. 8), you should finish Part 2 and try at least some of Part 3. Those taking the miniclass for credit should prepare a few slides and plan a 5 minute talk explaining their ‘Part 3’ exploration. This presentation should be a ‘work in progress’ that explains ideas or hypotheses of interest, methods, and results to date, and any questions about how to interpret the results and continue the exploration. We will have a ‘homework’ location on the miniclass Stellar site where you will upload your slides in advance of the presentation.
Open RStudio and run the following ‘install’ commands in the console if you have not already done so on your computer. (The packages need only be downloaded and locally installed once on a particular computer.) The libraries required for these labs are: (a) RPostgreSQL, which provides connectivity with PostgreSQL databases; (b) hexbin, which supports improved visualization of scattergrams; (c) psych and GPArotation, which support the principal component analysis (with rotation); and (d) lattice, which supports the plotting of residuals. You could also do this installation by using the ‘Install Packages’ function found in the Packages tab.
install.packages("RPostgreSQL")
install.packages("hexbin")
install.packages("psych")
install.packages("GPArotation")
install.packages("lattice")
Each of these install commands will locate a zipped file of binaries on a CRAN mirror site and copy it into your local package library. Once all the packages are downloaded to your machine, they need to be loaded each time you plan to use them. Place the following at the top of an R script and execute it.
library("DBI")
library("RPostgreSQL")
library("hexbin")
library("psych")
library("GPArotation")
library("lattice")
Here is a summary of the R commands that we used in Lab Exercise #1 to read the built environment, demographic, community type, and VMT data from our Postgres database, do the principal component analysis, and fit the ordinary least squares model. Rerun these commands to be sure you start off with the same results as those we had at the end of the Tuesday session.
# Load required libraries
library("DBI")
library("RPostgreSQL")
library("hexbin")
library("psych")
library("GPArotation")
library("lattice")
# Connect to the Postgres database ('name' and 'pw' should already hold
# your Kerberos username and class password as strings)
db <- dbConnect(PostgreSQL(), dbname = "class946",
                host = "cronpgsql.mit.edu", port = 5432,
                user = name, password = pw)
# Read the 4 tables from Postgres into R
dem <- dbGetQuery(db, "SELECT * FROM vmtdata.demographic_bg")
be <- dbGetQuery(db, "SELECT * FROM vmtdata.built_environment_250m")
vmt <- dbGetQuery(db, "SELECT * FROM vmtdata.vmt_250m")
ct <- dbGetQuery(db, "SELECT * FROM vmtdata.ct_grid")
# Remove unwanted columns
outlist <- c("bg_id", "ct_id", "f1_wealth", "f2_children",
"f3_working", "pct_ov2car")
dem2 <- dem[,setdiff(names(dem), outlist)]
# Remove rows with no reported income
dem3 <- dem2[dem2[,'inc_med_hs'] > 0,]
# Remove incomplete rows
dem4 <- dem3[complete.cases(dem3),]
# Find principal components
prin5 <- principal(dem4, nfactors = 3,
rotate = "varimax", scores = TRUE)
# Get 4-way joined table from Postgres and remove incomplete rows.
vmt_join <- dbGetQuery(db, "SELECT * FROM vmtdata.v_vmt_join4")
vmtall <- vmt_join[complete.cases(vmt_join),]
# Regress 9-cell VMT average on the 8 factors
run_ols1 <- lm(formula = (vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect
+ f3_transit + f4_auto + f5_walk +f1_wealth
+ f2_kids + f3_work, data = vmtall)
Visualize the principal components that we identified (they are similar to the 3 demographic factors that Prof. Diao constructed).
biplot(prin5)
The histograms show the distribution of each component across all block groups. The scatterplots use pairs of the 3 components along the X-Y axes and superimpose the axes showing high/low directions for each of the original variables.
Let’s examine the residuals from our ordinary least squares (OLS) model. These residuals are the actual values minus the values predicted by the regression model. We can use the resid() function to extract the residuals from our regression model results and plot them against the relevant VMT estimate (which is vmt_tot_mv / vin_mv for our model). The scatterplot works but has so many points that it is hard to interpret. To get a better visualization, use the hexbin package.
res1 <- resid(run_ols1)
plot(vmtall$vmt_tot_mv / vmtall$vin_mv, res1, ylab = "Residuals",
     xlab = "VMT", main = "Errors in VMT prediction")
bin <- hexbin(vmtall$vmt_tot_mv / vmtall$vin_mv, res1)
plot(bin, ylab = "Residuals", xlab = "VMT",
main = "Errors in VMT prediction")
The scatterplots suggest that residuals and VMT are correlated—they are high or low together. This suggests that the adjustments to annual mileage that are associated with our factors may be better modeled as percentage changes rather than absolute differences. With this in mind, we might consider fitting a model of log(VMT).
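A minimal sketch of such a log-linear variant (added here for illustration; it assumes the same predictors as run_ols1 and strictly positive per-vehicle VMT values) might look like this:
# Log-linear variant (a sketch, not part of the original lab): the factor
# effects now act as approximate percentage changes in VMT rather than
# absolute mileage differences.
run_log <- lm(log(vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit
              + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
              data = vmtall)
summary(run_log)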
We might also want to use additional R tools to plot fitted values and residuals in order to check some of the model assumptions. The first pair of quantile-quantile plots of fits and residuals assumes a uniform distribution; the second pair assumes a normal distribution. If the data follow the assumed distribution, then the plots will be linear. The fitted values look close to normal but the residuals are both widely varying and less normal.
# Assumes a uniform distribution
rfs(run_ols1)
# Assumes a normal distribution
rfs(run_ols1, distribution = qnorm)
Let’s see if the residuals exhibit a different pattern within each community type. The community type is specified in one of the columns of vmtall. We can stratify the block groups by community type and then examine the residuals for each group. To get the residuals into the same array as our original data, let’s first add the residuals as an additional column to vmtall—the large matrix used for our regression model. While we are at it, we use the fitted() function to pull the fitted values into a vector as well. We use the cbind command (to combine columns) to merge the data, then we plot the residuals for each community type as a box-and-whiskers plot.
# Box-and-whiskers plot of VMT residuals by community type
fit1 <- fitted(run_ols1)
vmtall <- cbind(vmtall, fit1, res1)
boxplot(res1 ~ ct, data = vmtall, ylab = "Residuals",
        main = "VMT Residuals by Community Type")
The median and interquartile range are fairly similar across community types, but there might be some heteroscedasticity since the residuals seem to be much larger in the developing suburbs than in the inner core (the second column). Also, the inner core has a negative median residual, suggesting that city cars tend to be driven even less than the model predicts.
Thus far, we have only examined the model that estimates VMT per vehicle. Formulate a second model for VMT per household. Turn in, as part of your submission for Lab #2, the R statement for your model along with the corresponding hexbin-based scatterplot of residuals and the box plot of residuals by community type.
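As a starting point, your statement will have the same shape as run_ols1 with the denominator swapped. The household-count column name below (hh_mv) is hypothetical; check names(vmtall) for the actual column:
# Template only: 'hh_mv' is a hypothetical name for the households-per-cell
# column; substitute the real column from names(vmtall).
run_ols2 <- lm((vmt_tot_mv / hh_mv) ~ f1_nw + f2_connect + f3_transit
               + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
               data = vmtall)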
Next, let’s move some of the R results into Postgres tables so we can pass them into QGIS for spatial analysis and mapping. R provides functions that can save our data objects so we do not have to recreate them every time. We can save data in R’s native format or move specific data arrays into external tables as plain text files or tables in a database manager such as Postgres. The following commands save vmtall in R’s native format (indicated by the .rdata suffix) and as a table in Postgres. Push the table using a table name that follows the convention "<your Kerberos username>_vmtall" (e.g., ehuntley_vmtall). We also save, in R format, the three original tables that we imported from Postgres.
# Write data to the Postgres database; do not add row numbers to the output
dbWriteTable(db, "ehuntley_vmtall", value = vmtall, row.names = FALSE)
# Save vmtall and the three original tables in R's native format
save(vmtall, file = "C:/temp/data_ua/rdata/vmtall.rdata")
save(be, file = "C:/temp/data_ua/rdata/bedata.rdata")
save(dem, file = "C:/temp/data_ua/rdata/demdata.rdata")
save(vmt, file = "C:/temp/data_ua/rdata/vmtdata.rdata")
Use Windows Explorer (or macOS Finder) and pgAdmin to check that these files and tables were written to the expected locations.
There are many ways to save and retrieve data in R. Use help(save) to look into the variations. Sometimes, it is helpful to save your entire R workspace so that you can pick up where you left off; the command save.image() bundles a particular flavor of save() options in order to do this. However, you may have trouble with save.image() on lab machines since you do not have write permission for certain default locations. Also, your workspace may include external connections, such as your connection to Postgres, that cannot be restored automatically. You are left on your own to review the help files and experiment with the various ‘save’ options to find the variation that works best for you.
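For example, here is a sketch that saves and later restores the whole workspace from an explicitly writable location (the path is an assumption; adjust it for your machine):
# Save the entire workspace to a location where you have write permission
# (path is an assumption; adjust for your machine)
save.image(file = "C:/temp/data_ua/rdata/lab2_workspace.RData")
# Restore it in a later session with:
# load("C:/temp/data_ua/rdata/lab2_workspace.RData")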
Create a new document in QGIS. First, you will have to connect QGIS to our class Postgres/PostGIS database. Click the ‘Add PostGIS Layers’ icon along the left side of the window. Click ‘New’ to make a new connection. Name your connection ‘class946’. Leave the ‘Service’ field blank. The host is ‘cronpgsql.mit.edu’. Connect on port ‘5432’. Specify the ‘class946’ database. Under ‘Authentication’, use your Kerberos username and your class password, saving only your username. Ensure that ‘Also list tables with no geometry’ is selected. Your window should look something like the following:
Create a New PostGIS Connection in QGIS.
Connect to the database. You should see the public and vmtdata schemas appear in the Add PostGIS Table(s) window. Close the window. In your QGIS browser panel, navigate to your Postgres database connection and add the <your Kerberos username>_vmtall table that we just pushed to the database using R—it will appear under the public schema. Join this table to the g_all_stategrid table based on the grid cell ID column, g250m_id. QGIS lets you specify a column prefix, allowing you to dodge the nasty, long column names that can result from joins in ArcMap—I like to use an underscore (_).
Joining with QGIS.
You’ve probably noticed that this layer takes a long time to draw. This is because there are 355,728 grid cells in Massachusetts, but only 53,250 have estimates (because they were in metro Boston and met the criteria regarding VINs, households, and estimated annual mileage). To speed up the processing, you may want to create a new table that only includes cells with values. To do this, select features using a SQL expression like "_res1" IS NOT NULL (any of the joined columns would work). Then, open QGIS’s DB Manager (Database > DB Manager), select ‘Import layer/file’, and choose g_all_stategrid, ensuring that ‘Import only selected features’ is checked. Name the table <your Kerberos username>_g_bos and add it to the public schema. Your window should look like the screenshot below.
Importing selected features with QGIS’s DB Manager.
We’ve now read and written data tables to a Postgres database using R and QGIS!
Next, create a thematic map of the residuals from our OLS model (the res1 column) using the table you just created. Classify the data by ‘standard deviation’ so we can easily spot high/med/low residuals. Change the outline around the Simple Fill to a transparent border. The red cells with high residuals are the places where the model underestimated the actual VMT.
Much like ArcMap, QGIS allows you to load map tiles from web mapping services. These will help provide some context for our grid. To add a basemap, you’ll need to add the QuickMapServices plug-in (Plugins > Manage and Install Plugins…). Once you’ve installed the plugin, you can expand the number of basemaps available to you by opening the QuickMapServices settings (Web > QuickMapServices > Settings), selecting ‘More services’ and clicking the ‘Get contributed pack’ button. In the screenshots below, I’ve added the CartoDB Positron (No Labels) basemap. You can turn these on to overlay images, road maps and shaded relief maps in order to help you interpret metro Boston areas at different scales. The result for the fitted values and the residuals should look something like this:
Map of fitted values.
Map of residuals.
Examine other thematic maps of factor values, fitted VMT values, and the like. From the plot of residuals, it does look like some spatial patterns have been missed. The residuals are way negative (implying that we overestimated mileage) for inner core grid cells, and it looks like they are also low for some of the regional centers and transportation corridors. In contrast, the residuals are relatively high for some of the suburban grid cells that are mixed in with the NULL grid cells (where no one lives because it is a pond or open space). These impressions may help us think of additional factors or relationships that could capture some of the remaining spatial effects. For your convenience, the table that results from a successful join and selection has been added to the Postgres database as the table bos_vmt_grid. Be aware, however, that the vmtall table that we generated in R has not been added to the Postgres database.
Turn in, as part of your submission for Lab #2, two new maps showing your fitted values of VMT-per-household and the residuals (using the VMT-per-household model that you specified earlier). You may use either QGIS or ArcGIS to prepare the maps.
Open ArcMap. In your catalog pane, expand the ‘database connections’ folder and click ‘Add Database Connection.’ Fill out the information in the Database Connection prompt such that it matches the screenshot below.
Add database in ArcMap.
Add the <your Kerberos username>_vmtall table that we just created in R (e.g., class946.public.ehuntley_vmtall). Join this table to the g_all_stategrid table (i.e., class946.vmtdata.g_all_stategrid) based on the grid cell ID, g250m_id. Say ‘yes’ when asked about creating an index. Next, create a thematic map of the residuals from our OLS model (res1) and classify the data by ‘standard deviation’ so we can easily spot high/med/low residuals. The cells with high residuals are the places where the model underestimated the actual VMT. At this point, you will notice that the map takes a long time to redraw. This is because there are 355,728 grid cells in Massachusetts, but only 53,250 have estimates (because they were in metro Boston and met the criteria regarding VINs, households, and estimated annual mileage). To speed up the processing, you may want to create a new shapefile, bos_vmt_grid.shp, with only those 53,250 grid cells that have data. You will also want to set the grid cell outline color to ‘none’ so the boundaries do not obscure the shading. When you set the thematic shading in the classification window, click the ‘sampling’ button and change the maximum sample size to 54,000. Otherwise, the breakpoints for the shading will be determined using only the first 10,000 grid cells.
Finally, we have added four web mapping services (one from MassGIS and three from ESRI). You can turn these on to overlay images, road maps and shaded relief maps in order to help you interpret metro Boston areas at different scales. The result for the fitted values and the residuals should look something like this:
Map of fitted values.
Map of residuals.
Take advantage of having the vmtall table joined to the g_all_stategrid to examine other thematic maps of factor values, fitted VMT values, and the like. From the plot of residuals, it does look like some spatial patterns have been missed. The residuals are way negative (implying that we overestimated mileage) for inner core grid cells, and it looks like they are also low for some of the regional centers and transportation corridors. In contrast, the residuals are relatively high for some of the suburban grid cells that are mixed in with the ‘null’ grid cells (where no one lives because it is a pond or open space). These impressions may help us think of additional factors or relationships that could capture some of the remaining spatial effects.
There are a number of popular free, open-source tools for analyzing spatial autocorrelation—that is, the extent to which a spatial pattern is non-random. These include Open GeoDa (developed by Luc Anselin et al., now at the University of Chicago). We do not have GeoDa loaded on the lab machines, but you can download your own copy from the GeoDa Center. The R spatial (sp) package is also becoming quite popular and can be used to analyze spatial autocorrelation; see this presentation by Roger Bivand, one of the creators of the R spatial package.
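For instance, here is a minimal sketch of a global Moran’s I test using the spdep companion package (an assumption on our part: spdep is not part of this lab’s setup, and ‘grid’ stands for a hypothetical polygon layer of the 53,250 Boston grid cells with the residuals joined):
# A sketch (not part of the original lab) using the spdep package.
# 'grid' is a hypothetical SpatialPolygonsDataFrame of the Boston grid
# cells with the OLS residuals attached as a 'res1' column.
library(spdep)
nb <- poly2nb(grid, queen = TRUE)                    # contiguity neighbors
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)  # row-standardized weights
moran.test(grid$res1, lw, zero.policy = TRUE)        # global Moran's I
lisa <- localmoran(grid$res1, lw)                    # local (LISA) statistics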
ArcMap also contains tools for spatial autocorrelation analysis. You may want to look at the Introduction to the ArcGIS Geostatistical Analyst Tutorial. At least read about ‘how spatial autocorrelation (Moran’s I) works’. You may want to try some of the geostatistical tools in the spatial statistics section to examine the nature and extent of spatial autocorrelation in the data.
Now that we have a feel for basic R commands and model building, we can begin a more open-ended inquiry into the data. We will quickly want to learn some additional commands but, hopefully, we have enough of a feel for R and the connections between QGIS, R and Postgres that we can teach ourselves many of the additional commands that we need by using the online help files and the various references that we have provided. (Depending upon our statistics background, we may also need to do some significant reading of a statistics text in addition to the references we give for the software tools!)
Here are some suggestions for exploring the data further:
Use the Community Type category to subdivide the grid cells into those within the four community types. Refit the OLS model for each of the types of community; a starter sketch appears after this list of suggestions. How different are the results? How might you dig into the data further to sort out what might be happening and what transform or redefinition of variables might improve the model fit?
We can use the regression model at the end of Part 2 to estimate VMT for each grid cell and compute the ‘residuals’, that is, the difference between the actual and predicted VMT values for each grid cell. Then we can import these residuals back into QGIS, join them to the grid cell layer, and see whether the residuals seem to vary systematically across metro Boston. Here is a tutorial on plotting residuals from a regression model. We can even do some ‘LISA’ plots and compute spatial statistics such as Moran’s I to identify hot spots and test how strong the evidence of spatial autocorrelation is. One of the R commands that you may find helpful here is the dbWriteTable() command to write tables back to Postgres. Use the online help function in R to learn about it: help(dbWriteTable). Note that this command requires the RPostgreSQL library that we loaded at the start of this lab to facilitate communication with PostgreSQL.
The pairwise scattergrams that we produced seemed to suggest skewness in some of the distributions and non-linearities in some of the relationships. Select a few variables where this appears problematic, experiment with some transformations, and then refit the VMT model. Do the results suggest different effects? Do they suggest ways of identifying sub-markets within which the VMT / BE / demographic relationships might be different?
From the data, it is pretty clear that annual mileage per vehicle and per household is relatively low in the central city and relatively high in the outer suburbs. The BE factors tend to measure local effects, and they do seem to explain some of the within-town variation in VMT that is evident in the suburbs. What other measures might you consider to capture built environment effects that are at a different scale from those which Prof. Diao considered? Take a look at the often-cited paper by Cervero and Kockelman on “Travel Demand and the 3Ds: Density, Diversity, and Design” (Transportation Research Part D, Vol. 2, No. 3, pp. 199-219, 1997). A copy, Cervero_1997.pdf, has been placed in the ‘readings’ folder of the 11.523 class locker for your convenience.
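As promised in the first suggestion above, here is a minimal sketch of refitting the OLS model within each community type. It assumes only that vmtall (with its ct column) is still loaded from Part 2; everything else mirrors run_ols1.
# A sketch (not part of the original lab): loop over the community types,
# refit the Part 2 model within each, and report the goodness of fit.
for (type in unique(as.character(vmtall$ct))) {
  sub <- vmtall[vmtall$ct == type, ]
  fit <- lm((vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit
            + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
            data = sub)
  cat("Community type:", type, " R-squared:",
      round(summary(fit)$r.squared, 3), "\n")
}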
For those of you who are taking the MiniClass for credit, pick one or more of these suggestions and see what you can learn prior to the next formal Tuesday session on Nov. 6 and the Thursday session on Nov. 8. By the end of lab on Tuesday, October 31, turn in a brief project proposal explaining, in one or two paragraphs, the exploration of indicators and models that you plan to pursue for your project. If you plan to work in a group (of 2 or 3), please indicate who is in your group. You should then plan to prepare some slides and a 5-10 minute presentation for the next week (Nov. 6 or 8) explaining and discussing your preliminary results.
Remember also to submit to Stellar on Oct. 31 your Lab #1 exercise and your brief project proposal.
Created by Joe Ferreira (2012); modified by Joe Ferreira and Eric Huntley (2017)
Last modified: 24 October 2017 [Eric Huntley]