There are three purposes to this Lab Exercise:

  1. Setup your computer to facilitate data exchange among QGIS / Database / Statistics tools

    • Connect QGIS, R, and PostgreSQL to exchange data among these three platforms and take advantage of their unique capabilities for handling statistical and spatial data
    • Planners often need to acquire and integrate data from different sources using different computing systems and software tools, so there is a learning opportunity to see what it takes to coordinate software on machines where you do not have administrative privileges.
  2. Reproduce some of the indicators and models in Prof. Mi Diao’s PhD dissertation on VMT and the built environment

    • Gain experience with spatially detailed ‘big data’ and some of the tools needed to extract components, build indicators, and develop analytic models
    • Improve spatial analysis skills by learning to take advantage of the particular strengths of GIS, relational database management, and statistical tools
  3. Experiment with alternative indicators and models of the VMT / built environment relationship

    • Visualize and explore ‘big data’ relationships and what they do (and do not) say about the travel implications of urban form
    • Gain experience with sensitivity analysis and model exploration for spatially varying phenomena

During today’s lab, you should focus on Parts 1 and 2 to make sure that you know how to set up the computing environment and are sufficiently familiar with the data and tools to be able to work on your own during the next two weeks. Prior to the next Tuesday session (on Oct. 24), you should finish Part 2 and try at least some of Part 3. Those taking the miniclass for credit should prepare a few Powerpoint or PDF slides and plan a 5 minute talk explaining the ‘Part 3’ exploration. This presentation should be a ‘work in progress’ that explains ideas or hypotheses of interest, methods and results to date, and any questions about how to interpret the results and continue the exploration. We will have a ‘homework’ location on the miniclass Stellar site to upload your slides in advance of the presentation.


Part 1: Accessing class datasets and setting up R on your computer:

Class datasets: All the data that you will need for the miniclass is available in the class locker (http://mit.edu/11.s946/www/ClassData/data.zip).

The ClassData folder contains 238 MB, but several tables are duplicated in different formats; in fact, you will only need the data dictionary and the QGIS startup project, since all the required tables and geometry files are accessible from a PostgreSQL/PostGIS database server. The CSV files and MS-Access database contain the same tables that we will access from Postgres, but may be helpful if you are familiar with MS-Access or, say, with reading CSV files into MS-Excel. You may also need the shapefiles folder if you plan to use ArcGIS instead of QGIS for geoprocessing.

At this point, we suggest that you download and open only the data_dictionary.xls spreadsheet so you can browse through the dictionary for the four non-spatial data tables. They contain some Year-2000 block group census data for Massachusetts plus the data from Prof. Mi Diao’s PhD dissertation organized by the 250x250m grid cells for Massachusetts.

To examine the tables, we suggest that you access the Postgres tables from a browser here: http://cronpgsql.mit.edu/phppgadmin.

Log in using your personal Postgres account for this database server as explained in class and emails. Remember that all the class datasets are saved in 9 tables within the vmtdata schema of the class946 database on our PostgreSQL database server cronpgsql.mit.edu. The five tables beginning with g_ are spatial boundary files that we will not use today. Review the data dictionary and browse the tables to get a feel for what data are provided for each grid cell and census block group in the metro Boston area.

Can you find the grid cell that has the highest road density value? See the rdden variable in the built_environment_250m table. This variable measures the total length of all major road segments that fall within the grid cell. Does the distribution of values make sense? Can you find the grid cell in table vmt_250m that is associated with the largest number of registered private passenger vehicles (no_vin)? What about the highest estimated annual mileage per vehicle (vmt_vin)? Note that every grid cell in the vmt_250m table has at least one registered vehicle (no_vin > 0), since the grid cells with no vehicles are omitted - but they are included in the geometry tables that we will map later on. Note that one grid cell (g250m_id = 88461) has 10602 geocoded vehicles! This is certainly an outlier (perhaps the location of a frequently used post office box) and many times higher than the next most populated grid cell. We will want to omit it from VMT modeling later on. Use simple SQL queries in the SQL pane of phpPgAdmin to explore the tables. For example, try this:

select no_vin, count(*)
  from vmtdata.vmt_250m
  group by no_vin
  order by no_vin;                      

Once you are familiar with the data dictionary and browser-based Postgres access to the four tables with grid cell and block group data, you are ready to begin using the open source statistical software, R, to examine and interpret the data in more detail.

Running R: The lab machines run the 64-bit version of Windows (so that the operating system can address arrays and memory spaces bigger than 2^32 bytes = 4 GB). The R installation on CRON machines is the 64-bit version of R. That is good and forward thinking, since this version can handle datasets larger than a few gigabytes.

This is the list of software that you will need to complete the labs: (Click here for tips on installation.)

ArcGIS and MS-Access are optional; you can rely on them if you feel more comfortable working with them instead of QGIS and pgAdmin. However, we encourage students to rely on the open source tools listed above and to use this lab as an exercise to become more familiar with such tools.

In today’s lab, we will focus only on R and data exploration using phpPgAdmin. Let’s start, either on a lab machine, on your personal laptop running Windows or macOS natively, or with the VM setup provided by CRON:

Open RStudio and create a new R script (via File > New File > R Script) to launch a convenient script-editing pane in the upper left of your RStudio window. You can construct your command lines in this script window and run them in the ‘console’ pane in the lower left of your RStudio window. Results and other messages and options will appear in the right-side panes.

Before accessing the class datasets, we need to install five optional library packages that allow us to connect to the PostgreSQL database server, and run several other data processing steps that we will find useful. In order to load and use these packages, we need to do two things:

Install R Packages

Installing R packages is tricky on CRON machines because you do not have administrative privileges. Hence, you do not have permission to install extra library packages in the default location that R expects. If you try to install packages without permission to write files to the default location, R will complain and put them instead in a ‘personal library’ that it creates in your user workspace. (E.g., on Windows this will be somewhere like C:\Users\jf\AppData\Local\Temp\Rtmpo3eAeW\downloaded_packages, which appears to be on Drive C: but is actually saved on your H: drive on the network so that it travels with you when you log onto any CRON machine.)

The five libraries that we need are: (a) DBI and RPostgreSQL to allow R to connect to PostgreSQL databases, (b) hexbin to support improved visualization of scattergrams, and (c) psych and GPArotation to support principal components analysis with factor rotation.

install.packages("DBI")
install.packages("RPostgreSQL")
install.packages("hexbin")
install.packages("psych")
install.packages("GPArotation")

You could do this installation by using the RStudio Packages tab (in the lower right pane), but we have chosen to use the command line approach.

When you enter the first command, you may see a popup window asking you to specify which CRAN mirror site to use to get the package. Choose a US site such as USA(IN) for Indiana University. Each of these commands will locate and copy a zipped file of binaries from the CRAN mirror site into a local workspace, where they are unzipped and verified to make them ready for use. If you wish, you can install the libraries into a different location on your personal computer. Suppose, for example, you use Windows and wish to put them into ‘C:\temp\Rlib’; in this case, the first install command would be: install.packages("DBI", lib="C:/temp/Rlib"), and you will have to create the folder before executing the command or you will get an error message. If you are working on your personal machine, you will not need to create C:\temp\Rlib unless you lack administrative privileges, but you may choose to do so anyway if you want to avoid disturbing the default R installation directory.
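
If you take that approach, you can install all five packages in one call by giving install.packages() a vector of package names (a sketch using the same C:/temp/Rlib example; adjust the path to wherever you created your folder):

# Install all five packages into a custom library folder (create the folder first)
install.packages(c("DBI", "RPostgreSQL", "hexbin", "psych", "GPArotation"),
                 lib = "C:/temp/Rlib")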

You do not need to care where the binaries are installed, except to realize that, for CRON machines, it is on your networked H drive which could slow the download. However, you need to do this installation only once and the installed files are not that large, so it is worth saving them once on your H: drive and having them available from any CRON machine. If you have trouble downloading from a CRAN mirror site, use the Packages/set-CRAN-mirror… tab to choose another site.

Load R Library

Packages only need to be installed once. However, every time you start an R or RStudio session, you need to load these optional libraries so they can be used during your session. Here are the appropriate commands:

library("DBI")
library("RPostgreSQL")
library("hexbin")
library("psych")
library("GPArotation")

Note: If you installed the libraries into C:/temp/Rlib you may need to specify that location, as in: library("DBI", lib.loc="C:/temp/Rlib"). Note also that some commands use a forward slash ‘/’ and others use double backward slashes ‘\\’ to move up and down among directories (or folders). Either option gets converted to the same thing internally in R - namely, the single forward slash ‘/’ convention for unix paths. The double ‘\\’ is needed because R treats a single ‘\’ as an escape character inside character strings.
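
As an alternative to adding lib.loc to every library() call, you can prepend your personal folder to R’s library search path once per session and then load the packages normally (a sketch, again using the C:/temp/Rlib example):

# Add the personal library folder to R's search path for this session
.libPaths(c("C:/temp/Rlib", .libPaths()))
library("DBI")
library("RPostgreSQL")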

Part 2: Reproducing Indicators and Models from Prof. Diao’s Dissertation

In this part of the exercise, you can cut-and-paste the R commands into the R-Console window.

Copy the Data to a Local Drive (Optional)

Copy the data (S:\11.s946\ClassData) to a local drive: use Windows Explorer or Finder to copy the entire MiniClass data locker (238 MB). Since we can leave the big tables on the Postgres server and access them as needed from R, QGIS, or ArcGIS, there is no need to copy and save the tables locally, but they are available if you want to examine them while offline.

Open a Channel to PostgreSQL

Open a data channel to PostgreSQL using the username and password that you received from the class instructor. Be sure to use double quotes for the parameter values, without any spaces between the quotes and the values.
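
The dbConnect() call below refers to name and pw; if you have not already created those objects, define them first as quoted character strings (placeholders shown below; substitute your own credentials):

# Store your Postgres credentials as character strings (substitute your own values)
name <- "your_username"
pw <- "your_password"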

# Connect to the Postgres database (name and pw hold your username and password)
db <- dbConnect(dbDriver("PostgreSQL"), dbname = "class946",
    host = "cronpgsql.mit.edu", port = 5432,
    user = name, password = pw)

# List the tables in the "vmtdata" schema:
dbGetQuery(db, "SELECT table_name FROM information_schema.tables WHERE table_schema='vmtdata'")
##                table_name
## 1  built_environment_250m
## 2         g_taz2727_codas
## 3             v_vmt_join4
## 4                 ct_grid
## 5          demographic_bg
## 6         g_all_stategrid
## 7          g_bos_vmt_grid
## 8            g_ma_towns00
## 9           v_vmt_join_pg
## 10  g_ma2000bg_stateplane
## 11               vmt_250m
# To close the connection, run the command below (you can also simply exit R)
# dbDisconnect(db)

Pull data tables from PostgreSQL using R

Next, use this db connection to pull each of the four data tables from PostgreSQL into data frames in R:

# Read the 4 tables from Postgres into R
demdata <- dbGetQuery(db, "SELECT * FROM vmtdata.demographic_bg")
bedata <- dbGetQuery(db, "SELECT * FROM vmtdata.built_environment_250m")
vmtdata <- dbGetQuery(db, "SELECT * FROM vmtdata.vmt_250m")
ctdata <- dbGetQuery(db, "SELECT * FROM vmtdata.ct_grid")

Check that Tables were read properly

Use the matrix manipulation commands of R to check that the tables were read in properly and make sense.

# Retrieve the dimensions of 'demdata' object.
dim(demdata)
## [1] 3323   18
# Retrieve the dimensions of 'bedata' object.
dim(bedata)
## [1] 117196     33
# Retrieve the dimensions of 'vmtdata' object.
dim(vmtdata)
## [1] 60895    21
# Retrieve the dimensions of 'ctdata' object.
dim(ctdata)
## [1] 118668      6
# Display the first row of the 'ctdata' data frame.
ctdata[1,]
##   join_count target_fid g250m_id      community_ town_name codas
## 1          1         19       34 Regional Cities  Amesbury     1
# Display descriptive statistics for the 'ctdata' data frame.
summary(ctdata)
##    join_count   target_fid        g250m_id       community_
##  Min.   :1    Min.   :    19   Min.   :    34   Length:118668
##  1st Qu.:1    1st Qu.: 75668   1st Qu.: 75118   Class :character
##  Median :1    Median :174344   Median :173760   Mode  :character
##  Mean   :1    Mean   :166050   Mean   :165844
##  3rd Qu.:1    3rd Qu.:261092   3rd Qu.:261147
##  Max.   :2    Max.   :314601   Max.   :314374
##   town_name             codas
##  Length:118668      Min.   :0.0000
##  Class :character   1st Qu.:0.0000
##  Mode  :character   Median :0.0000
##                     Mean   :0.2595
##                     3rd Qu.:1.0000
##                     Max.   :1.0000
# Display descriptive statistics for the demographic data table.
summary(demdata)
##     bg_id              ct_id             pct_sch13        pct_owner
##  Length:3323        Length:3323        Min.   :0.2689   Min.   :0.0000
##  Class :character   Class :character   1st Qu.:0.8022   1st Qu.:0.3869
##  Mode  :character   Mode  :character   Median :0.8922   Median :0.6458
##                                        Mean   :0.8543   Mean   :0.6125
##                                        3rd Qu.:0.9431   3rd Qu.:0.8759
##                                        Max.   :1.0000   Max.   :1.0000
##                                        NA's   :6        NA's   :8
##    inc_med_hs        pct_pov          pct_white        pct_under5
##  Min.   :     0   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000
##  1st Qu.: 41544   1st Qu.:0.02464   1st Qu.:0.7586   1st Qu.:0.05283
##  Median : 55284   Median :0.05382   Median :0.9027   Median :0.07476
##  Mean   : 59057   Mean   :0.09154   Mean   :0.8171   Mean   :0.07528
##  3rd Qu.: 71745   3rd Qu.:0.11828   3rd Qu.:0.9633   3rd Qu.:0.09645
##  Max.   :200001   Max.   :0.84164   Max.   :1.1642   Max.   :0.32477
##                   NA's   :6         NA's   :4        NA's   :4
##    pct_old65         pct_ov2car        inc_capita      pct_elsc_sch
##  Min.   :0.00000   Min.   :0.00000   Min.   :     0   Min.   :0.0000
##  1st Qu.:0.07864   1st Qu.:0.05466   1st Qu.: 20045   1st Qu.:0.1329
##  Median :0.11993   Median :0.11227   Median : 25548   Median :0.1821
##  Mean   :0.13000   Mean   :0.12512   Mean   : 28052   Mean   :0.1803
##  3rd Qu.:0.16688   3rd Qu.:0.18182   3rd Qu.: 32804   3rd Qu.:0.2277
##  Max.   :0.62932   Max.   :0.52863   Max.   :122938   Max.   :0.6148
##  NA's   :4         NA's   :8                          NA's   :4
##   pct_less3hsh      pct_labor        pct_unempl        f1_wealth
##  Min.   :0.1385   Min.   :0.0000   Min.   :0.00000   Min.   :-4.0844
##  1st Qu.:0.4688   1st Qu.:0.6194   1st Qu.:0.01570   1st Qu.:-0.4908
##  Median :0.5632   Median :0.6844   Median :0.03187   Median : 0.2355
##  Mean   :0.5718   Mean   :0.6705   Mean   :0.04446   Mean   : 0.0000
##  3rd Qu.:0.6610   3rd Qu.:0.7370   3rd Qu.:0.05609   3rd Qu.: 0.6952
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.76060   Max.   : 2.6100
##  NA's   :8        NA's   :4        NA's   :6         NA's   :9
##   f2_children         f3_working
##  Min.   :-3.48684   Min.   :-6.90175
##  1st Qu.:-0.59101   1st Qu.:-0.50150
##  Median : 0.06503   Median : 0.08234
##  Mean   : 0.00000   Mean   : 0.00000
##  3rd Qu.: 0.72343   3rd Qu.: 0.61345
##  Max.   : 3.43180   Max.   : 3.92767
##  NA's   :9          NA's   :9

Exclude certain rows and columns

We want to exclude certain rows and columns from our analysis. We do this both to ensure that the factor analysis works and so that the results are more meaningful.

For a concise discussion of Factor analysis see: Wikipedia. For a more extensive discussion of factor analysis, principal component analysis, and the interpretation of results, take a look at Chapter 6 of Sumeeta Srinivasan’s PhD dissertation (DUSP/UIS PhD, 2000), "Linking land use and transportation: measuring the impact of neighborhood-scale spatial patterns on travel behavior," Massachusetts Institute of Technology: Urban Studies and Planning, PhD Dissertation.

We should exclude the first and last few columns from the data frame before doing a factor analysis. (The first two are the block group and census tract IDs, and the last three are the first three demographic factors computed in Mi Diao’s thesis.)

Create a list of columns to remove from consideration and apply it to demdata to create demdata2.

outlist <- c("bg_id", "ct_id", "f1_wealth", "f2_children", "f3_working")
demdata2 <- demdata[,setdiff(names(demdata),outlist)]

Double check the dimensions and a few rows of the result in demdata2 and try a factor analysis (using varimax by default for the rotation of the factors).

fac3 <- factanal(demdata2, factors = 3)    
## Error in cov.wt(z): 'x' must contain finite values only

Oops, there are still problems with missing values in the data. Let’s only consider block groups where household income > 0.

demdata3 <- demdata2[demdata2[,'inc_med_hs'] > 0,]
dim(demdata3)
## [1] 3315   13
fac3 <- factanal(demdata3, factors = 3)
## Error in cov.wt(z): 'x' must contain finite values only

Hmm, the right number of rows and columns, but still some problems. Let’s use the ‘complete.cases’ command to retain only those rows that are complete cases:

demdata4 <- demdata2[complete.cases(demdata2),]
dim(demdata4)
## [1] 3314   13

Note: Building demdata4 from demdata3 instead of demdata2 will yield the same result since all the complete cases happen to have inc_med_hs > 0.

# We've eliminated one more row and now the factor analysis works.
fac4 <- factanal(demdata4, factors = 3)
fac4
##
## Call:
## factanal(x = demdata4, factors = 3)
##
## Uniquenesses:
##    pct_sch13    pct_owner   inc_med_hs      pct_pov    pct_white
##        0.246        0.183        0.321        0.289        0.384
##   pct_under5    pct_old65   pct_ov2car   inc_capita pct_elsc_sch
##        0.688        0.005        0.389        0.518        0.375
## pct_less3hsh    pct_labor   pct_unempl
##        0.083        0.412        0.725
##
## Loadings:
##              Factor1 Factor2 Factor3
## pct_sch13     0.815  -0.202   0.220
## pct_owner     0.818   0.380
## inc_med_hs    0.778   0.217   0.163
## pct_pov      -0.842
## pct_white     0.764  -0.170
## pct_under5            0.552
## pct_old65     0.205  -0.324  -0.921
## pct_ov2car    0.623   0.470
## inc_capita    0.649  -0.216   0.116
## pct_elsc_sch -0.145   0.775
## pct_less3hsh -0.109  -0.946
## pct_labor     0.452           0.619
## pct_unempl   -0.518
##
##                Factor1 Factor2 Factor3
## SS loadings      4.593   2.438   1.351
## Proportion Var   0.353   0.188   0.104
## Cumulative Var   0.353   0.541   0.645
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 4112.4 on 42 degrees of freedom.
## The p-value is 0

This output is okay and looks similar to Mi Diao’s results but has one extra variable (pct_ov2car). Create a list with that column heading and use it to toss that column.

outlist <- c("pct_ov2car")
demdata5 <- demdata4[,setdiff(names(demdata4),outlist)]
dim(demdata5)
## [1] 3314   12

Okay, that column is eliminated; now try the factor analysis again.

fac5 <- factanal(demdata5, factors = 3) # varimax is the default
fac5
##
## Call:
## factanal(x = demdata5, factors = 3)
##
## Uniquenesses:
##    pct_sch13    pct_owner   inc_med_hs      pct_pov    pct_white
##        0.241        0.193        0.315        0.284        0.395
##   pct_under5    pct_old65   inc_capita pct_elsc_sch pct_less3hsh
##        0.679        0.005        0.510        0.363        0.095
##    pct_labor   pct_unempl
##        0.410        0.724
##
## Loadings:
##              Factor1 Factor2 Factor3
## pct_sch13     0.819  -0.198   0.219
## pct_owner     0.811   0.382
## inc_med_hs    0.781   0.221   0.163
## pct_pov      -0.846
## pct_white     0.756  -0.171
## pct_under5            0.560
## pct_old65     0.206  -0.323  -0.921
## inc_capita    0.659  -0.207   0.115
## pct_elsc_sch -0.144   0.783
## pct_less3hsh -0.105  -0.940  -0.102
## pct_labor     0.454           0.620
## pct_unempl   -0.519
##
##                Factor1 Factor2 Factor3
## SS loadings      4.210   2.225   1.350
## Proportion Var   0.351   0.185   0.113
## Cumulative Var   0.351   0.536   0.649
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 3792.42 on 33 degrees of freedom.
## The p-value is 0

This works and is similar to Prof. Diao’s (except that loadings lower than 0.35 are not omitted). We can get even closer to Prof. Diao’s results by using a slightly different procedure: principal components analysis (PCA).

prin5 <- principal(demdata5, nfactors = 3, rotate = "varimax")
prin5 # print results
## Principal Components Analysis
## Call: principal(r = demdata5, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                RC1   RC2   RC3   h2   u2 com
## pct_sch13     0.82 -0.20  0.27 0.78 0.22 1.3
## pct_owner     0.82  0.39 -0.05 0.82 0.18 1.4
## inc_med_hs    0.81  0.22  0.19 0.75 0.25 1.3
## pct_pov      -0.86 -0.03 -0.04 0.75 0.25 1.0
## pct_white     0.80 -0.18 -0.07 0.67 0.33 1.1
## pct_under5   -0.02  0.73  0.06 0.53 0.47 1.0
## pct_old65     0.23 -0.32 -0.86 0.89 0.11 1.4
## inc_capita    0.71 -0.19  0.15 0.56 0.44 1.2
## pct_elsc_sch -0.14  0.87  0.04 0.78 0.22 1.1
## pct_less3hsh -0.10 -0.91 -0.13 0.85 0.15 1.1
## pct_labor     0.43 -0.05  0.79 0.81 0.19 1.5
## pct_unempl   -0.61  0.04  0.13 0.39 0.61 1.1
##
##                        RC1  RC2  RC3
## SS loadings           4.52 2.53 1.54
## Proportion Var        0.38 0.21 0.13
## Cumulative Var        0.38 0.59 0.72
## Proportion Explained  0.53 0.29 0.18
## Cumulative Proportion 0.53 0.82 1.00
##
## Mean item complexity =  1.2
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is  0.07
##  with the empirical chi square  2092.03  with prob <  0
##
## Fit based upon off diagonal values = 0.97

These results match those of Prof. Diao quite closely in Table 3 on page 45 (but our results include loadings with low weights, below 0.35). Try a similar factor and principal components analysis for the other data table with built environment measures. Turn in a copy of the output of your Principal Components Analysis of the five BE factors (similar to the above output for the demographic factors).
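
For reference, a minimal sketch of that exercise is below; the column to drop from bedata (the grid-cell ID) is an assumption, so check the data dictionary for the actual ID column name and for any precomputed factor columns in built_environment_250m that should also be dropped:

# Drop the grid-cell ID column (name assumed from the data dictionary) and keep
# only complete cases before extracting 5 rotated principal components
be_outlist <- c("g250m_id")
bedata2 <- bedata[,setdiff(names(bedata), be_outlist)]
bedata3 <- bedata2[complete.cases(bedata2),]
prin_be <- principal(bedata3, nfactors = 5, rotate = "varimax")
prin_be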

Visualize data patterns using scattergrams and box plots

A useful reference regarding scatterplots in R can be found here.

Let’s plot scattergrams for pairs of the variables: pct_sch13, pct_owner, and pct_pov.

pairs(~pct_sch13 + pct_owner + pct_pov, data = demdata5, main="Simple Scatterplot Matrix")

Hmm, some of these relationships look somewhat non-linear - an issue since factor analysis is searching for linear combinations of variables that identify clusters of block groups with similar characteristics. We may want to transform some of the variables before doing the factor analysis. But, first, there are so many data points that it is hard to see what is happening in the point clouds.

Let’s use the ‘hexbin’ procedure to ‘bin’ the data into 50 groups along each dimension before we generate the scatterplots. That way, we can use different symbols on the graph depending on whether a ‘bin’ has a few or very many cases that fall within it.

bin <- hexbin(demdata5$pct_sch13, demdata5$pct_owner, xbins=50)
plot(bin, main = "Percent School by Ownership")

We see that homeownership and schooling are high together (a very common case) in the upper right. Where homeownership is not so high (below about 80%), the relationship between the homeownership percentage of a block group and the percent with 13+ years of education is much weaker.

Try examining other pairs of variables to get a feel for the data and to begin thinking about outliers and sub-markets where the relationships and relevant factors might be quite different. We will get back to some of these considerations in Part 3 when we consider alternative models and explore the possibility that the VMT/built-environment relationship may be different for different community types. Use the hexbin tool to plot the scattergram of the pct_pov x pct_owner relationship. Comment briefly on the extent to which the relationship looks linear and what type of transformation you might suggest for one or another of the variables to make it more linear.
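
The assigned plot follows the same pattern as the example above (a quick sketch):

# Hexbin scattergram of percent poverty vs. percent homeownership
bin2 <- hexbin(demdata5$pct_pov, demdata5$pct_owner, xbins=50)
plot(bin2, main = "Percent Poverty by Ownership")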

Modeling VMT as a function of BE and Demographics

Next, let’s fit the ordinary least squares (OLS) model in Prof. Diao’s dissertation (Tables 5 and 6) that models the average annual mileage per vehicle (VMT) for vehicles registered within a 250x250m grid cell as a function of the 5 built environment (BE) factors and the 3 demographic factors.

In order to fit the models, we need to join the three tables in demdata, bedata, and vmtdata so each row (for a unique grid cell) has that grid cell’s mileage estimate plus the 5 BE factors and 3 demographic factors. While we are doing this join, we might as well include one additional column - the community type associated with each grid cell. Recall that, in last Thursday’s spatial data management session, we used the TAZ_attr shapefile that had the ‘community type’ designation by the Metropolitan Area Planning Council (MAPC) for every traffic analysis zone. (Actually, the type is designated for each of the 164 municipalities in the metro Boston region.) In our miniclass, we showed how to do a ‘spatial join’ in QGIS to tag each grid cell with the type of community that it was in. The results have been saved into a table called ct_grid in our PostgreSQL database. We have already read this table into an R data frame called ctdata.

We could use R tools to merge the four tables, but complex relational joins in R can be cumbersome and prone to errors. Instead, let’s join the four tables in PostgreSQL, bring over to R only the 10 columns of data (plus grid cell and block group IDs) for each grid cell, and do our modeling with that table. Actually, we will want to bring over a few more columns indicating the number of households estimated to be in each grid cell, the quality of the VMT estimate, and the like. (See page 44 of the dissertation regarding which grid cells are included in the modeling.)
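
If you were to build such a join yourself rather than use the saved query described below, it might look roughly like the following (a sketch only, not the actual saved query; the join keys, especially which grid-cell table carries the bg_id link to block groups, and the BE factor column names are assumptions to verify against the data dictionary):

# Hypothetical sketch only: a 4-table join done on the Postgres side.  It assumes
# vmt_250m carries a bg_id column linking each grid cell to its census block group,
# and that the five BE factor columns live in built_environment_250m -- check the
# data dictionary before using anything like this.
vmt_join_sketch <- dbGetQuery(db, "
    SELECT v.g250m_id, v.bg_id, v.vmt_vin, v.vmt_tot_mv, v.vin_mv,
           b.f1_nw, b.f2_connect, b.f3_transit, b.f4_auto, b.f5_walk,
           d.f1_wealth, d.f2_children AS f2_kids, d.f3_working AS f3_work,
           c.community_ AS ct
      FROM vmtdata.vmt_250m v
      JOIN vmtdata.built_environment_250m b ON b.g250m_id = v.g250m_id
      JOIN vmtdata.ct_grid c ON c.g250m_id = v.g250m_id
      JOIN vmtdata.demographic_bg d ON d.bg_id = v.bg_id")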

For your convenience, we have already built and saved the query that joins the 4 tables as the view v_vmt_join4 in the vmtdata schema. (Note that, to be a row in this table, a grid cell had to have estimates of VMT and all 8 factors.) Pull this table into a new data frame in R and use ‘complete.cases’ to omit rows with missing values:

vmt_join <- dbGetQuery(db, "SELECT * FROM vmtdata.v_vmt_join4")
dim(vmt_join)
## [1] 53429    19
vmtall <- vmt_join[complete.cases(vmt_join),]
dim(vmtall)
## [1] 53250    19
vmtall[1,]
##    g250m        bg_id  vmt_vin  f1_nw f2_connect f3_transit  f4_auto
## 1 249360 250277461002 15601.36 -0.565   -0.38752    1.45719 -0.22446
##   f5_walk f1_wealth   f2_kids    f3_work                 ct hshld_2000
## 1  0.3978 0.3406922 0.0225408 0.08234025 Developing Suburbs      6.844
##   vmt_pop vmt_hh no_vin no_good vmt_tot_mv vin_mv
## 1   29026  72946     32      28     825864     70

Note that my SQL query to join the tables also changed the names of a few columns. Now we are ready to run a simple ordinary least squares regression of VMT per vehicle (vmt_vin) on the 5 BE and 3 demographic factors:

run_ols1 <- lm(formula = vmt_vin ~ f1_nw + f2_connect +
                 f3_transit + f4_auto + f5_walk +f1_wealth +f2_kids +
                 f3_work, data=vmtall)
summary(run_ols1)
##
## Call:
## lm(formula = vmt_vin ~ f1_nw + f2_connect + f3_transit + f4_auto +
##     f5_walk + f1_wealth + f2_kids + f3_work, data = vmtall)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -13434  -1438   -164   1227  60885
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13054.22      22.09 590.979   <2e-16 ***
## f1_nw         715.84      14.47  49.464   <2e-16 ***
## f2_connect   -446.43      11.89 -37.561   <2e-16 ***
## f3_transit    856.47      12.04  71.141   <2e-16 ***
## f4_auto       156.68      17.85   8.775   <2e-16 ***
## f5_walk      -126.59      12.84  -9.856   <2e-16 ***
## f1_wealth    -357.94      21.90 -16.345   <2e-16 ***
## f2_kids       220.92      17.52  12.611   <2e-16 ***
## f3_work       214.45      14.00  15.320   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2543 on 53241 degrees of freedom
## Multiple R-squared:  0.2009, Adjusted R-squared:  0.2008
## F-statistic:  1673 on 8 and 53241 DF,  p-value: < 2.2e-16

Hmm, the model coefficients have similar signs and values but a much lower R-squared than reported in the dissertation. That is because vmt_vin is the estimated mileage for individual grid cells, and there are many grid cells with few vehicles and noisy estimates. Instead of using individual grid cells, Prof. Diao averaged the numbers in the 9-cell window around each grid cell. We can do this by dividing vmt_tot_mv by the vehicle count for the 9-cell window (the vmt_mv column in the vmt_250m table, which appears as vin_mv in our joined table). The vmt_tot_mv column is the total mileage for the 9-cell window and vin_mv is the total number of vehicles in that window. So let’s regress vmt_tot_mv / vin_mv on the left hand side (taking care to do the division only for grid cells that have vin_mv > 0).

run_ols1 <- lm(formula = (vmt_tot_mv / vin_mv)
               ~ f1_nw + f2_connect + f3_transit + f4_auto + f5_walk
               + f1_wealth + f2_kids + f3_work, data = vmtall)
summary(run_ols1)
##
## Call:
## lm(formula = (vmt_tot_mv/vin_mv) ~ f1_nw + f2_connect + f3_transit +
##     f4_auto + f5_walk + f1_wealth + f2_kids + f3_work, data = vmtall)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11701.6   -727.7    -34.7    704.2  10887.1
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12596.789     10.807 1165.63   <2e-16 ***
## f1_nw         698.115      7.080   98.60   <2e-16 ***
## f2_connect   -472.713      5.815  -81.29   <2e-16 ***
## f3_transit    858.899      5.890  145.82   <2e-16 ***
## f4_auto       164.073      8.735   18.78   <2e-16 ***
## f5_walk      -146.136      6.284  -23.26   <2e-16 ***
## f1_wealth    -246.211     10.714  -22.98   <2e-16 ***
## f2_kids       201.393      8.571   23.50   <2e-16 ***
## f3_work       198.404      6.849   28.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1244 on 53241 degrees of freedom
## Multiple R-squared:  0.5183, Adjusted R-squared:  0.5182
## F-statistic:  7160 on 8 and 53241 DF,  p-value: < 2.2e-16

Okay, this looks much better, and gives an indication of how much noise there is in the individual grid cell estimates, at least in the suburbs where the density is much lower. We still have only produced the ordinary least squares results and have not adjusted for spatial autocorrelation (so the coefficients are not directly comparable to any of the results shown in the dissertation). Rerun this OLS model separately for those grid cells that are / are not in the suburbs (as determined by the community type of the town that they are in). Include the R output for each OLS and discuss briefly the differences that you see in the fits of the two models.
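
A minimal sketch of that split is below; the community-type labels (and the ‘Suburb’ pattern used to split them) are assumptions, so check table(vmtall$ct) for the actual MAPC categories before filtering:

# Inspect the community-type labels, then refit the OLS model separately for
# suburban and non-suburban grid cells (the "Suburb" matching pattern is a guess)
table(vmtall$ct)
suburb <- vmtall[grepl("Suburb", vmtall$ct),]
nonsuburb <- vmtall[!grepl("Suburb", vmtall$ct),]
ols_sub <- lm((vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit + f4_auto
              + f5_walk + f1_wealth + f2_kids + f3_work, data = suburb)
ols_non <- lm((vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit + f4_auto
              + f5_walk + f1_wealth + f2_kids + f3_work, data = nonsuburb)
summary(ols_sub)
summary(ols_non)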

Good references for further work with R

  1. An online ‘Quick-R’ reference: www.statmethods.net
  2. A paperback (or e-book) that I find to be a useful reference for data manipulation and basic stats:

Part 3: Experiment with alternative indicators and models

Now that we have a feel for basic R commands and model building, we can begin a more open-ended inquiry into the data. We will quickly want to learn some additional commands but, hopefully, we have enough of a feel for R and the connections between QGIS, R and PostgreSQL that we can teach ourselves many of the additional commands that we need by using the online help files and the various references that we have provided. (Depending upon our statistics background, we may also need to do some significant reading of a statistics text in addition to the references we give for the software tools!)

Here are some suggestions for exploring the data further:

  1. Use the Community Type category to subdivide the grid cells into those within the four community types. Refit the OLS model for each of the types of community. How different are the results? How might you dig into the data further to sort out what might be happening and what transform or redefinition of variables might improve the model fit?
  2. We can use the regression model at the end of Part 2 to estimate VMT for each grid cell and compute the ‘residuals’ - that is, the difference between the actual and predicted VMT values for each grid cell. Then we can import these residuals back into QGIS, join them to the grid cell layer, and see whether the residuals seem to vary systematically across metro Boston. We can even do some LISA plots and compute spatial statistics such as Moran’s I to identify hot spots and test how strong the evidence of spatial autocorrelation is. One of the R commands that you may find helpful here is dbWriteTable(), which writes tables back to PostgreSQL; use the online help function in R to learn about it: help(dbWriteTable). A sketch of this workflow appears after this list. Also, here is a webpage with a tutorial on plotting residuals from a regression model.
  3. The pairwise scattergrams that we produced seemed to suggest skewness in some of the distributions and non-linearities in some of the relationships. Select a few variables where this appears problematic, experiment with some transformations, and then refit the VMT model. Do the results suggest different effects? Do they suggest ways of identifying sub-markets within which the VMT / BE / Demographic relationships might be different?
  4. From the data, it is pretty clear that annual mileage per vehicle and per household is relatively low in the central city and relatively high in the outer suburbs. The BE factors tend to measure local effects and they do seem to explain some of the within-town variation in VMT that is evident in the suburbs. What other measures might you consider to capture built environment effects that are at a different scale from those which Prof. Diao considered? Take a look at the often-cited paper by Cervero and Kockelman on “Travel Demand and the 3Ds: Density, Diversity, and Design” (Transportation Research, part D, Vol. 2, No. 3, pp. 199-219, 1997). A copy, Cervero_1997.pdf, has been placed in the ‘readings’ folder of the 11.523 class locker for your convenience.
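
For suggestion 2, here is a minimal sketch of computing the OLS residuals and writing them back to PostgreSQL (the output table name is a placeholder):

# Attach the OLS residuals to their grid-cell IDs and write the result back to
# Postgres so it can be joined to the grid geometry in QGIS (table name is a placeholder)
resid_df <- data.frame(g250m = vmtall$g250m, ols_resid = residuals(run_ols1))
dbWriteTable(db, "my_vmt_residuals", resid_df, row.names = FALSE)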

For those of you who are taking the MiniClass for credit, pick one or more of these suggestions and see what you can learn prior to the next Tuesday session on Nov. 6. Prepare a few powerpoint or PDF slides and be prepared to spend 5-10 minutes explaining and discussing your preliminary results.

What to turn in for Lab #1 (by Tuesday, Oct. 31)

Prior to the start of class on Oct. 31, turn in on the class Stellar website the three items that were requested in italics during the exercise. Also, submit to Stellar your brief project proposal before the end of the lab session on Tuesday, Oct. 31.

Created by Joe Ferreira (2012); Modified by Joe Ferreira, Eric Huntley, and Roberto Ponce Lopez (2017) Last modified: October 26, 2017 [jf]
