There are three purposes to this Lab Exercise:
Set up your computer to facilitate data exchange among QGIS / Database / Statistics tools
Reproduce some of the indicators and models in Prof. Mi Diao’s PhD dissertation on VMT and the built environment
Experiment with alternative indicators and models of the VMT / built environment relationship
During today’s lab, you should focus on Parts 1 and 2 to make sure that you know how to set up the computing environment and are sufficiently familiar with the data and tools to be able to work on your own during the next two weeks. Prior to the next Tuesday session (on Oct. 24), you should finish Part 2 and try at least some of Part 3. Those taking the miniclass for credit should prepare a few PowerPoint or PDF slides and plan a 5-minute talk explaining the ‘Part 3’ exploration. This presentation should be a ‘work in progress’ that explains ideas or hypotheses of interest, methods and results to date, and any questions about how to interpret the results and continue the exploration. We will have a ‘homework’ location on the miniclass Stellar site where you can upload your slides in advance of the presentation.
Class datasets: All the data that you will need for the miniclass is available in the class locker (http://mit.edu/11.s946/www/ClassData/data.zip). This locker contains:
./data/csv_files: Folder with plain-text versions of the four tables in CSV (comma-separated values) format.
./data/msaccess/vmt_data.mdb: MS-Access database containing the four tables plus one query.
./data/shapefiles: Folder with five shapefiles.
./data_dictionary.xls: Spreadsheet with the data dictionary for the tables and shapefiles.
./11s946_qgs_startup.qgs: QGIS project file to get you started.
This ClassData folder contains 238 MB, but several tables are
duplicated in different formats and, in fact, you will only need the
data dictionary and the QGIS startup project since all the required
tables and geometry files are accessible from a PostgreSQL/PostGIS
database server. The CSV files and MS-Access database are the same
tables that we will access from Postgres but may be helpful if you are
familiar with MS-Access or, say, with reading CSV files into MS-Excel.
You may also need the shapefiles folder if you plan to
use ArcGIS instead of QGIS for geoprocessing.
At this point, we suggest that you download and open only the data_dictionary.xls
spreadsheet so you can browse through the dictionary for the four
non-spatial data tables. They contain some Year-2000 block group
census data for Massachusetts plus the data from Prof. Mi Diao’s PhD
dissertation organized by the 250x250m grid cells for Massachusetts.
To examine the tables, we suggest that you access the Postgres tables
from a browser here: http://cronpgsql.mit.edu/phppgadmin.
Log in using your personal Postgres account for this database server
as explained in class and emails. Remember that all the class datasets
are saved in 9 tables within the vmtdata schema of the class946
database on our PostgreSQL database server cronpgsql.mit.edu.
The five tables beginning with g_ are spatial boundary
files that we will not use today. Review the data dictionary and
browse the tables to get a feel for what data are provided for each
grid cell and census block group in the metro Boston area.
Can you find the grid cell that has the highest road density value?
See the rdden variable in the built_environment_250m
table. This variable measures the total length of all major road
segments that fall within the grid cell. Does the distribution of
values make sense? Can you find the grid cell in table vmt_250m
that is associated with the largest number of registered private passenger
vehicles (no_vin)? What about the highest estimated
annual mileage per vehicle (vmt_vin)? Note that every
grid cell in the vmt_250m table has at least one
registered vehicle (no_vin > 0) since the grid cells
with no vehicles are omitted - but they are included in the geometry
tables that we will map later on. Note that one grid cell (g250m_id = 88461)
has 10,602 geocoded vehicles! This is certainly an outlier (perhaps the
location of a frequently used post office box) and many times higher
than the next most populated grid cell. We will want to omit it from
VMT modeling later on. Use simple SQL queries in the SQL pane of
phpPgAdmin to explore the tables. For example, try this:
select no_vin, count(*)
from vmtdata.vmt_250m
group by no_vin
order by no_vin;
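If you would rather let the database answer the 'highest road density' question directly, a query along these lines will do it. This is only a sketch: it assumes the grid-cell ID column in built_environment_250m is named g250m_id, as it is in the other grid tables, so check the data dictionary first.
select g250m_id, rdden
from vmtdata.built_environment_250m
order by rdden desc nulls last
limit 1;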
Once you are familiar with the data dictionary and browser-based Postgres access to the four tables with grid cell and block group data, you are ready to begin using the open source statistical software, R, to examine and interpret the data in more detail.
Running R: The lab machines run the 64-bit version of Windows (so that the operating system can address arrays and memory spaces bigger than 2^32 bytes = 4 GB). The R installation on CRON machines is the 64-bit version of R. That is good and forward thinking, since this version can handle datasets larger than a few gigabytes.
This is the list of software that you will need to complete the labs: (Click here for tips on installation.)
ArcGIS and MS-Access are optional; you can rely on them if you feel more comfortable working with them instead of QGIS and PgAdmin. However, we encourage students to rely on the open source tools listed above and to use this lab as an exercise to become more familiar with such tools.
In today’s lab, we will focus only on R and data exploration using phpPgAdmin. Let’s start, either on a lab machine, on your personal laptop running Windows or macOS natively, or with the VM setup provided by CRON:
Open RStudio and create a new R script (via File > New File > R Script) to launch a convenient script editing pane in the upper left of your RStudio window. You can construct your command lines in this script window and run them in the ‘Console’ pane in the lower left of your RStudio window. Results, messages, and other options will appear in the right-side panes.
Before accessing the class datasets, we need to install five optional library packages that allow us to connect to the PostgreSQL database server, and run several other data processing steps that we will find useful. In order to load and use these packages, we need to do two things:
Installing R packages is tricky on CRON machines because you do not
have administrative privileges. Hence, you do not have permission to
install extra library packages in the default location that R
expects. If you try to install packages without permission to write
files to the default location, R will complain and put them instead
in a ‘personal library’ that it creates in your user workspace.
(E.g., on Windows this will be somewhere like C:\Users\jf\AppData\Local\Temp\Rtmpo3eAeW\downloaded_packages,
which appears to be on drive C: but is actually saved on your H:
drive on the network so that it travels with you when you log onto
any CRON machine.)
The five libraries that we need are: (a) DBI and RPostgreSQL
to allow R to connect to PostgreSQL databases, (b) hexbin
to support improved visualization of scattergrams, and (c) psych
and GPArotation to support additional principal
component analysis.
install.packages("DBI")
install.packages("RPostgreSQL")
install.packages("hexbin")
install.packages("psych")
install.packages("GPArotation")
You could do this installation by using the RStudio Packages tab (in the lower right pane), but we have chosen to use the command line approach.
When you enter the first command, you may see a popup window asking
you to specify which CRAN mirror site to use to get the package.
Choose a US site such as USA (IN) for Indiana University. Each of
these commands will locate and copy a zipped file of binaries from
the CRAN mirror site into a local workspace, where they are unzipped
and verified to make them ready for use. If you wish, you can
install the libraries into a different location on your personal
computer. Suppose, for example, you use Windows and wish to put them
into ‘C:\temp\Rlib’; in this case, the first install command would
be: install.packages("DBI", lib="C:/temp/Rlib"), and
you will have to create the folder before executing the
command or you will get an error message. If you are working on your
personal machine, you will not need a separate library folder like C:\temp\Rlib
unless you lack administrative privileges, but you may choose to use
one anyway if you want to avoid disturbing the default R installation
directory.
You do not need to care where the binaries are installed, except to realize that, for CRON machines, they end up on your networked H: drive, which could slow the download. However, you need to do this installation only once, and the installed files are not that large, so it is worth saving them once on your H: drive and having them available from any CRON machine. If you have trouble downloading from a CRAN mirror site, use the Packages / Set CRAN mirror… option to choose another site.
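If you are ever unsure where your packages ended up, you can ask R which library folders it searches; the first writable entry is normally where install.packages() put them:
# Show the library locations that R searches for installed packages
.libPaths()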
Packages only need to be installed once. However, every time you start an R or RStudio session, you need to load these optional libraries so they can be used during your session. Here are the appropriate commands:
library("DBI")
library("RPostgreSQL")
library("hexbin")
library("psych")
library("GPArotation")
Note: If you installed the libraries into C:/temp/Rlib you may need
to specify that location as in: library("DBI", lib.loc="C:/temp/Rlib").
Note also that some commands use a forward slash ‘/’ and others use
double backward slashes ‘\\’ to move up and down among directories
(or folders). Either option refers to the same path: the ‘\\’ inside
a quoted R string is just a single backslash, and R on Windows treats
it the same as the forward slash ‘/’ used in unix-style paths. The
double ‘\\’ is needed because the single ‘\’ is an escape character
inside R strings.
In this part of the exercise, you can cut-and-paste the R commands into the R-Console window.
Copy the data (S:\11.s946\ClassData) to a local
drive: use Windows Explorer or Finder to copy the entire MiniClass
data locker (238 MB). Since we can leave the big tables on the
Postgres server and access them as needed from R, QGIS, or ArcGIS,
there is no need to copy and save the tables locally, but they are
available if you want to examine them while offline.
Open a data channel to PostgreSQL using the username and password that you received from the class instructor. Be sure to use double quotes around the parameter values, without any spaces between the quotes and the values.
# Connect to the Postgres database
# (fill in your own credentials; the two placeholder strings below are not real)
name <- "your_username"   # your Postgres username from the class instructor
pw <- "your_password"     # your Postgres password
db <- dbConnect("PostgreSQL", dbname = "class946",
                host = "cronpgsql.mit.edu", port = 5432,
                user = name, password = pw)
# List the tables in the "vmtdata" schema:
dbGetQuery(db, "SELECT table_name FROM information_schema.tables WHERE table_schema='vmtdata'")
## table_name
## 1 built_environment_250m
## 2 g_taz2727_codas
## 3 v_vmt_join4
## 4 ct_grid
## 5 demographic_bg
## 6 g_all_stategrid
## 7 g_bos_vmt_grid
## 8 g_ma_towns00
## 9 v_vmt_join_pg
## 10 g_ma2000bg_stateplane
## 11 vmt_250m
# To close the connection later, run the command below (or simply exit R)
# dbDisconnect(db)
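If you just want a quick peek at a table before pulling everything across, you can send any SQL through the connection; for example, this small sketch previews a few rows of the vmt_250m table:
# Preview the first five rows of one table without downloading it all
dbGetQuery(db, "SELECT * FROM vmtdata.vmt_250m LIMIT 5")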
Next, use this db connection to pull each of the four data tables from PostgreSQL into data frames in R:
# Read the 4 tables from Postgres into R
demdata <- dbGetQuery(db, "SELECT * FROM vmtdata.demographic_bg")
bedata <- dbGetQuery(db, "SELECT * FROM vmtdata.built_environment_250m")
vmtdata <- dbGetQuery(db, "SELECT * FROM vmtdata.vmt_250m")
ctdata <- dbGetQuery(db, "SELECT * FROM vmtdata.ct_grid")
Use the matrix manipulation commands of R to check that the tables are properly read in and make sense.
# Retrieve the dimensions of 'demdata' object.
dim(demdata)
## [1] 3323 18
# Retrieve the dimensions of 'bedata' object.
dim(bedata)
## [1] 117196 33
# Retrieve the dimensions of 'vmtdata' object.
dim(vmtdata)
## [1] 60895 21
# Retrieve the dimensions of 'ctdata' object.
dim(ctdata)
## [1] 118668 6
# Display the first row of the 'ctdata' data frame.
ctdata[1,]
## join_count target_fid g250m_id community_ town_name codas
## 1 1 19 34 Regional Cities Amesbury 1
# Display descriptive statistics for the 'ctdata' data frame.
summary(ctdata)
## join_count target_fid g250m_id community_
## Min. :1 Min. : 19 Min. : 34 Length:118668
## 1st Qu.:1 1st Qu.: 75668 1st Qu.: 75118 Class :character
## Median :1 Median :174344 Median :173760 Mode :character
## Mean :1 Mean :166050 Mean :165844
## 3rd Qu.:1 3rd Qu.:261092 3rd Qu.:261147
## Max. :2 Max. :314601 Max. :314374
## town_name codas
## Length:118668 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.2595
## 3rd Qu.:1.0000
## Max. :1.0000
# Display descriptive statistics for the demographic data table.
summary(demdata)
## bg_id ct_id pct_sch13 pct_owner
## Length:3323 Length:3323 Min. :0.2689 Min. :0.0000
## Class :character Class :character 1st Qu.:0.8022 1st Qu.:0.3869
## Mode :character Mode :character Median :0.8922 Median :0.6458
## Mean :0.8543 Mean :0.6125
## 3rd Qu.:0.9431 3rd Qu.:0.8759
## Max. :1.0000 Max. :1.0000
## NA's :6 NA's :8
## inc_med_hs pct_pov pct_white pct_under5
## Min. : 0 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 41544 1st Qu.:0.02464 1st Qu.:0.7586 1st Qu.:0.05283
## Median : 55284 Median :0.05382 Median :0.9027 Median :0.07476
## Mean : 59057 Mean :0.09154 Mean :0.8171 Mean :0.07528
## 3rd Qu.: 71745 3rd Qu.:0.11828 3rd Qu.:0.9633 3rd Qu.:0.09645
## Max. :200001 Max. :0.84164 Max. :1.1642 Max. :0.32477
## NA's :6 NA's :4 NA's :4
## pct_old65 pct_ov2car inc_capita pct_elsc_sch
## Min. :0.00000 Min. :0.00000 Min. : 0 Min. :0.0000
## 1st Qu.:0.07864 1st Qu.:0.05466 1st Qu.: 20045 1st Qu.:0.1329
## Median :0.11993 Median :0.11227 Median : 25548 Median :0.1821
## Mean :0.13000 Mean :0.12512 Mean : 28052 Mean :0.1803
## 3rd Qu.:0.16688 3rd Qu.:0.18182 3rd Qu.: 32804 3rd Qu.:0.2277
## Max. :0.62932 Max. :0.52863 Max. :122938 Max. :0.6148
## NA's :4 NA's :8 NA's :4
## pct_less3hsh pct_labor pct_unempl f1_wealth
## Min. :0.1385 Min. :0.0000 Min. :0.00000 Min. :-4.0844
## 1st Qu.:0.4688 1st Qu.:0.6194 1st Qu.:0.01570 1st Qu.:-0.4908
## Median :0.5632 Median :0.6844 Median :0.03187 Median : 0.2355
## Mean :0.5718 Mean :0.6705 Mean :0.04446 Mean : 0.0000
## 3rd Qu.:0.6610 3rd Qu.:0.7370 3rd Qu.:0.05609 3rd Qu.: 0.6952
## Max. :1.0000 Max. :1.0000 Max. :0.76060 Max. : 2.6100
## NA's :8 NA's :4 NA's :6 NA's :9
## f2_children f3_working
## Min. :-3.48684 Min. :-6.90175
## 1st Qu.:-0.59101 1st Qu.:-0.50150
## Median : 0.06503 Median : 0.08234
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.72343 3rd Qu.: 0.61345
## Max. : 3.43180 Max. : 3.92767
## NA's :9 NA's :9
We want to exclude certain rows and columns from our analysis. We do this both to ensure that the factor analysis works and to make the results more meaningful.
For a concise discussion of factor analysis, see Wikipedia. For a more extensive discussion of factor analysis, principal component analysis, and the interpretation of results, take a look at Chapter 6 of Sumeeta Srinivasan's PhD dissertation, "Linking land use and transportation: measuring the impact of neighborhood-scale spatial patterns on travel behavior," Massachusetts Institute of Technology, Department of Urban Studies and Planning, 2000.
We should exclude the first and last few columns from the data frame before doing a factor analysis. (The first two are the block group and census tract IDs, and the last three are the first three demographic factors computed in Mi Diao’s thesis.)
Create a list of the columns to remove from consideration and apply it to
demdata to create demdata2.
outlist <- c("bg_id", "ct_id", "f1_wealth", "f2_children", "f3_working")
demdata2 <- demdata[,setdiff(names(demdata),outlist)]
Double check the dimensions and a few rows of the result in demdata2
and try a factor analysis (using varimax by default for the rotation
of the factors).
fac3 <- factanal(demdata2, factors = 3)
## Error in cov.wt(z): 'x' must contain finite values only
Oops, there are still problems with missing values in the data. Let’s only consider block groups where household income > 0.
demdata3 <- demdata2[demdata2[,'inc_med_hs'] > 0,]
dim(demdata3)
## [1] 3315 13
fac3 <- factanal(demdata3, factors = 3)
## Error in cov.wt(z): 'x' must contain finite values only
Hmm, the right number of rows and columns, but still some problems. Let’s use the ‘complete.cases’ command to retain only those rows that are complete cases:
demdata4 <- demdata2[complete.cases(demdata2),]
dim(demdata4)
## [1] 3314 13
Note: Building demdata4 from demdata3
instead of demdata2 will yield the same result since
all the complete cases happen to have inc_med_hs > 0.
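If you want to see the counts behind that note, you can tally the problem rows directly (a quick optional check):
# How many rows have at least one missing value?
sum(!complete.cases(demdata2))
# How many rows have a nonpositive median household income?
sum(demdata2$inc_med_hs <= 0, na.rm = TRUE)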
# We've eliminated one more row and now the factor analysis works.
fac4 <- factanal(demdata4, factors = 3)
fac4
##
## Call:
## factanal(x = demdata4, factors = 3)
##
## Uniquenesses:
## pct_sch13 pct_owner inc_med_hs pct_pov pct_white
## 0.246 0.183 0.321 0.289 0.384
## pct_under5 pct_old65 pct_ov2car inc_capita pct_elsc_sch
## 0.688 0.005 0.389 0.518 0.375
## pct_less3hsh pct_labor pct_unempl
## 0.083 0.412 0.725
##
## Loadings:
## Factor1 Factor2 Factor3
## pct_sch13 0.815 -0.202 0.220
## pct_owner 0.818 0.380
## inc_med_hs 0.778 0.217 0.163
## pct_pov -0.842
## pct_white 0.764 -0.170
## pct_under5 0.552
## pct_old65 0.205 -0.324 -0.921
## pct_ov2car 0.623 0.470
## inc_capita 0.649 -0.216 0.116
## pct_elsc_sch -0.145 0.775
## pct_less3hsh -0.109 -0.946
## pct_labor 0.452 0.619
## pct_unempl -0.518
##
## Factor1 Factor2 Factor3
## SS loadings 4.593 2.438 1.351
## Proportion Var 0.353 0.188 0.104
## Cumulative Var 0.353 0.541 0.645
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 4112.4 on 42 degrees of freedom.
## The p-value is 0
This output is okay and looks similar to Mi Diao’s results but has
one extra variable (pct_ov2car). Create a list with
that column heading and use it to toss that column.
outlist <- c("pct_ov2car")
demdata5 <-demdata4[,setdiff(names(demdata4),outlist)]
dim(demdata5)
## [1] 3314 12
Okay, that column is eliminated; now try the factor analysis again.
fac5 <- factanal(demdata5, factors = 3) # varimax is the default
fac5
##
## Call:
## factanal(x = demdata5, factors = 3)
##
## Uniquenesses:
## pct_sch13 pct_owner inc_med_hs pct_pov pct_white
## 0.241 0.193 0.315 0.284 0.395
## pct_under5 pct_old65 inc_capita pct_elsc_sch pct_less3hsh
## 0.679 0.005 0.510 0.363 0.095
## pct_labor pct_unempl
## 0.410 0.724
##
## Loadings:
## Factor1 Factor2 Factor3
## pct_sch13 0.819 -0.198 0.219
## pct_owner 0.811 0.382
## inc_med_hs 0.781 0.221 0.163
## pct_pov -0.846
## pct_white 0.756 -0.171
## pct_under5 0.560
## pct_old65 0.206 -0.323 -0.921
## inc_capita 0.659 -0.207 0.115
## pct_elsc_sch -0.144 0.783
## pct_less3hsh -0.105 -0.940 -0.102
## pct_labor 0.454 0.620
## pct_unempl -0.519
##
## Factor1 Factor2 Factor3
## SS loadings 4.210 2.225 1.350
## Proportion Var 0.351 0.185 0.113
## Cumulative Var 0.351 0.536 0.649
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 3792.42 on 33 degrees of freedom.
## The p-value is 0
This works and is similar to Prof. Diao’s results (except that loadings lower than 0.35 are not omitted here). We can get even closer to Prof. Diao’s results by using a slightly different procedure: principal components analysis (PCA).
prin5 <- principal(demdata5, nfactors = 3, rotate = "varimax")
prin5 # print results
## Principal Components Analysis
## Call: principal(r = demdata5, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 h2 u2 com
## pct_sch13 0.82 -0.20 0.27 0.78 0.22 1.3
## pct_owner 0.82 0.39 -0.05 0.82 0.18 1.4
## inc_med_hs 0.81 0.22 0.19 0.75 0.25 1.3
## pct_pov -0.86 -0.03 -0.04 0.75 0.25 1.0
## pct_white 0.80 -0.18 -0.07 0.67 0.33 1.1
## pct_under5 -0.02 0.73 0.06 0.53 0.47 1.0
## pct_old65 0.23 -0.32 -0.86 0.89 0.11 1.4
## inc_capita 0.71 -0.19 0.15 0.56 0.44 1.2
## pct_elsc_sch -0.14 0.87 0.04 0.78 0.22 1.1
## pct_less3hsh -0.10 -0.91 -0.13 0.85 0.15 1.1
## pct_labor 0.43 -0.05 0.79 0.81 0.19 1.5
## pct_unempl -0.61 0.04 0.13 0.39 0.61 1.1
##
## RC1 RC2 RC3
## SS loadings 4.52 2.53 1.54
## Proportion Var 0.38 0.21 0.13
## Cumulative Var 0.38 0.59 0.72
## Proportion Explained 0.53 0.29 0.18
## Cumulative Proportion 0.53 0.82 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 2092.03 with prob < 0
##
## Fit based upon off diagonal values = 0.97
These results match those of Prof. Diao quite closely (Table 3 on page 45 of the dissertation), although our output also shows loadings with low weights (< 0.35). Try a similar factor and principal component analysis for the other data table with built environment measures; a starting sketch follows. Turn in a copy of the output of your principal components analysis of the five BE factors (similar to the above output for the demographic factors).
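As a starting point for that exercise, here is a minimal sketch. The columns to drop are assumptions: use names(bedata) and the data dictionary to identify the grid-cell ID column and any precomputed factor-score columns before running it.
# Keep only the raw built-environment measures: drop the ID column and any
# precomputed factor scores (adjust this list to match names(bedata))
be_outlist <- c("g250m_id", "f1_nw", "f2_connect", "f3_transit", "f4_auto", "f5_walk")
bedata2 <- bedata[, setdiff(names(bedata), be_outlist)]
bedata2 <- bedata2[complete.cases(bedata2), ]
# Five components with varimax rotation, as in the demographic example above
prin_be <- principal(bedata2, nfactors = 5, rotate = "varimax")
prin_be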
A useful reference regarding scatterplots in R can be found here.
Let’s plot scattergrams for pairs of the variables: pct_sch13,
pct_owner, and pct_pov.
pairs(~pct_sch13 + pct_owner + pct_pov, data = demdata5, main="Simple Scatterplot Matrix")

Hmm, some of these relationships look somewhat non-linear - an issue since factor analysis is searching for linear combinations of variables that identify clusters of block groups with similar characteristics. We may want to transform some of the variables before doing the factor analysis. But, first, there are so many data points that it is hard to see what is happening in the point clouds.
Let’s use the ‘hexbin’ procedure to ‘bin’ the data into 50 groups along each dimension before we generate the scatterplots. That way, we can use different symbols on the graph depending on whether a ‘bin’ has a few or very many cases that fall within it.
bin <- hexbin(demdata5$pct_sch13, demdata5$pct_owner, xbins=50)
plot(bin, main = "Percent School by Ownership")

We see that homeownership and schooling are high together (a very common case) in the upper right. Where homeownership is not so high (below 80%), the relationship between the homeownership percentage of a block group and the percent with 13+ years of education is much weaker.
Try examining other pairs of variables to get a feel for the data and to begin thinking about outliers and sub-markets where the relationships and relevant factors might be quite different. We will get back to some of these considerations in Part 3 when we consider alternative models and explore the possibility that the VMT/built-environment relationship may be different for different community types. Use the hexbin tool to plot the scattergram of the pct_pov x pct_owner relationship. Comment briefly on the extent to which the relationship looks linear and on what type of transformation you might suggest for one or another of the variables to make it more linear.
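The call for that plot is directly parallel to the one above; a sketch:
bin2 <- hexbin(demdata5$pct_pov, demdata5$pct_owner, xbins = 50)
plot(bin2, main = "Percent Poverty by Ownership")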
Next, let’s fit the ordinary least squares (OLS) model in Prof. Diao’s dissertation (Tables 5 and 6) that models the average annual mileage per vehicle (VMT) for vehicles registered within a 250x250m grid cell as a function of the 5 built environment (BE) factors and the 3 demographic factors.
In order to fit the models, we need to join the three tables
demdata, bedata, and vmtdata so each row (for a unique grid cell)
has that grid cell's mileage estimate plus the 5 BE factors and 3
demographic factors. While we are doing this join, we might as well
include one additional column - the community type associated with
each grid cell. Recall that, in last Thursday’s spatial data
management session, we used the TAZ_attr shapefile
that had the ‘community type’ designation by the Metropolitan Area
Planning Council (MAPC) for every traffic analysis zone. (Actually,
the type is designated for each of the 164 municipalities in the
metro Boston region.) In our miniclass, we showed how to do a
‘spatial join’ in QGIS to tag each grid cell with the type of
community that it was in. The results have been saved into a table
called ct_grid in our PostgreSQL database. We have
already read this table into an R data frame called ctdata.
We could use R tools to merge the four tables, but complex relational joins in R can be cumbersome and conducive to errors. Instead, let’s join the four tables in PostgreSQL, bring over to R only the 10 columns of data (plus grid cell and block group IDs) for each grid cell, and do our modeling with that table. Actually, we will want to bring over a few more columns indicating the number of households estimated to be in each grid cell, the quality of the VMT estimate, and the like. (See page 44 of the dissertation regarding which grid cells are included in the modeling.)
For your convenience, we have already built and saved the query in MS-Access that joins the 4 tables to produce a new table, t_vmt_join4, with 60377 rows. (Note that, to be a row in this table, a grid cell had to have estimates of VMT and all 8 factors.) Pull this table into a new data frame in R and use ‘complete.cases’ to omit rows with missing values:
vmt_join <- dbGetQuery(db, "SELECT * FROM vmtdata.v_vmt_join4")
dim(vmt_join)
## [1] 53429 19
vmtall <- vmt_join[complete.cases(vmt_join),]
dim(vmtall)
## [1] 53250 19
vmtall[1,]
## g250m bg_id vmt_vin f1_nw f2_connect f3_transit f4_auto
## 1 249360 250277461002 15601.36 -0.565 -0.38752 1.45719 -0.22446
## f5_walk f1_wealth f2_kids f3_work ct hshld_2000
## 1 0.3978 0.3406922 0.0225408 0.08234025 Developing Suburbs 6.844
## vmt_pop vmt_hh no_vin no_good vmt_tot_mv vin_mv
## 1 29026 72946 32 28 825864 70
Note that my SQL query to join the tables also changed the names of a few columns. Now we are ready to run a simple ordinary least squares regression of VMT per vehicle (vmt_vin) on the 5 BE and 3 demographic factors:
run_ols1 <- lm(formula = vmt_vin ~ f1_nw + f2_connect +
    f3_transit + f4_auto + f5_walk + f1_wealth + f2_kids +
    f3_work, data = vmtall)
summary(run_ols1)
##
## Call:
## lm(formula = vmt_vin ~ f1_nw + f2_connect + f3_transit + f4_auto +
## f5_walk + f1_wealth + f2_kids + f3_work, data = vmtall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13434 -1438 -164 1227 60885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13054.22 22.09 590.979 <2e-16 ***
## f1_nw 715.84 14.47 49.464 <2e-16 ***
## f2_connect -446.43 11.89 -37.561 <2e-16 ***
## f3_transit 856.47 12.04 71.141 <2e-16 ***
## f4_auto 156.68 17.85 8.775 <2e-16 ***
## f5_walk -126.59 12.84 -9.856 <2e-16 ***
## f1_wealth -357.94 21.90 -16.345 <2e-16 ***
## f2_kids 220.92 17.52 12.611 <2e-16 ***
## f3_work 214.45 14.00 15.320 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2543 on 53241 degrees of freedom
## Multiple R-squared: 0.2009, Adjusted R-squared: 0.2008
## F-statistic: 1673 on 8 and 53241 DF, p-value: < 2.2e-16
Hmm, the model coefficients have similar signs and values but a much
lower R-squared than reported in the dissertation. That is because
vmt_vin is the estimated mileage for individual grid cells and there
are many grid cells with few vehicles and noisy estimates. Instead
of using individual grid cells, Prof. Diao averaged the numbers in the
9-cell window around each grid cell. We can do this by dividing vmt_tot_mv
by vin_mv in the vmt_250m table. The vmt_tot_mv
column is the total mileage for the 9-cell window and vin_mv
is the total number of vehicles in the 9-cell region. So let’s
regress vmt_tot_mv / vin_mv on the left hand side
(taking care to do the division only for grid cells that had vin_mv > 0).
run_ols1 <- lm(formula = (vmt_tot_mv / vin_mv)
~ f1_nw + f2_connect + f3_transit + f4_auto + f5_walk
+ f1_wealth + f2_kids + f3_work, data = vmtall)
summary(run_ols1)
##
## Call:
## lm(formula = (vmt_tot_mv/vin_mv) ~ f1_nw + f2_connect + f3_transit +
## f4_auto + f5_walk + f1_wealth + f2_kids + f3_work, data = vmtall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11701.6 -727.7 -34.7 704.2 10887.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12596.789 10.807 1165.63 <2e-16 ***
## f1_nw 698.115 7.080 98.60 <2e-16 ***
## f2_connect -472.713 5.815 -81.29 <2e-16 ***
## f3_transit 858.899 5.890 145.82 <2e-16 ***
## f4_auto 164.073 8.735 18.78 <2e-16 ***
## f5_walk -146.136 6.284 -23.26 <2e-16 ***
## f1_wealth -246.211 10.714 -22.98 <2e-16 ***
## f2_kids 201.393 8.571 23.50 <2e-16 ***
## f3_work 198.404 6.849 28.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1244 on 53241 degrees of freedom
## Multiple R-squared: 0.5183, Adjusted R-squared: 0.5182
## F-statistic: 7160 on 8 and 53241 DF, p-value: < 2.2e-16
Okay, this looks much better, and gives an indication of how much noise there is in the individual grid cells, at least in the suburbs where the density is much lower. We still have only produced the ordinary least squares results and have not adjusted for spatial autocorrelation (so the coefficients are not directly comparable to any of the results shown in the dissertation). Rerun this OLS model separately for those grid cells that are and are not in the suburbs (as determined by the community type of the town that they are in); a sketch of the setup follows. Include the R output for each OLS and discuss briefly the differences that you see in the fits of the two models.
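One way to set up that comparison is sketched below. It assumes that the suburban community types in the ct column all contain the word "Suburbs" (as in the "Developing Suburbs" value shown in the sample row above); run table(vmtall$ct) first to confirm the labels.
# Flag grid cells whose MAPC community type is a suburban category
suburb <- grepl("Suburbs", vmtall$ct)
ols_suburb <- lm((vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit +
                   f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
                 data = vmtall[suburb, ])
ols_other <- lm((vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit +
                  f4_auto + f5_walk + f1_wealth + f2_kids + f3_work,
                data = vmtall[!suburb, ])
summary(ols_suburb)
summary(ols_other)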
A paperback (or e-book) that I find to be a useful reference for data manipulation and basic stats:
Now that we have a feel for basic R commands and model building, we can begin a more open-ended inquiry into the data. We will quickly want to learn some additional commands but, hopefully, we have enough of a feel for R and the connections between QGIS, R and PostgreSQL that we can teach ourselves many of the additional commands that we need by using the online help files and the various references that we have provided. (Depending upon our statistics background, we may also need to do some significant reading of a statistics text in addition to the references we give for the software tools!)
Here are some suggestions for exploring the data further:
Use the dbWriteTable() command to write tables
back to PostgreSQL. Use the online help function in R to learn about
the dbWriteTable() command: help(dbWriteTable). (A minimal sketch of dbWriteTable() appears at the end of this section.)
Also, here is a webpage with a tutorial on plotting residuals from a regression model.
For those of you who are taking the MiniClass for credit, pick one or more of these suggestions and see what you can learn prior to the next Tuesday session on Nov. 6. Prepare a few PowerPoint or PDF slides and be prepared to spend 5-10 minutes explaining and discussing your preliminary results.
Created by Joe Ferreira (2012); Modified by Joe Ferreira, Eric Huntley, and Roberto Ponce Lopez (2017) Last modified: October 26, 2017 [jf]