There are three purposes to this Lab Exercise:
During today's lab, you should focus on Parts 1 and 2 to make sure that you know how to set up the computing environment and are sufficiently familiar with the data and tools to be able to work on your own during the next two weeks. Prior to the next Tuesday session (on Nov. 6), you should finish Part 2 and try at least some of Part 3. Those taking the miniclass for credit should prepare a few powerpoint or PDF slides and plan a 5 minute talk explaining the 'Part 3' exploration. This presentation should be a 'work in progress' that explains ideas or hypotheses of interest, methods and results to date, and any questions about how to interpret the results and continue the exploration. We will have a 'homework' location on the miniclass Stellar site to upload your slides in advance of the presentation.
The R package installed on lab machines in 37-312 and CRON runs the '64-bit' version of R. That is good and forward thinking since this version can handle datasets larger than a few gigabytes. However, this version does not work with the 'open database connect' (ODBC) drivers that are currently installed on the lab machines to facilitate data exchange between R and MS-Access or PostgreSQL. We can use plain text files to exchange data between R and MS-Access, but that can be less flexible and involve other complications. Since planners often need to acquire and integrate data across platforms and packages, it will be instructive to see some of the setup issues involved in downloading and installing 'open source' software on machines - including machines where you do not have administrative privileges.
The lab machines run the 64-bit version of Windows (so that the operating system can address arrays and memory spaces bigger than 2^32 = 4 GB). However, the MS-Access software on the machines is a 32-bit application. It is possible to run 64-bit R and exchange data with 32-bit MS-Access but not with the ODBC drivers that are currently setup on the lab machines. The easiest workaround is to download and install a 32-bit version of R. In order to do this, we cannot use the default installation settings since we do not have administrative privileges on the machines and cannot write to the folders where software is normally installed.
(1) Create your own workspace
In order to install R, we need to specify a few folders to which you have read/write access. Use Windows Explorer to create: c:\temp\rinstall and c:\temp\rlib32. The first sub-directory is for the software installation and the second is for additional software libraries that you will need. Alternatively, you can create the two folders by using Start/Run and then entering this command to create the directory:
mkdir c:\temp\rinstall
mkdir c:\temp\rlib32(2) Download the latest version of R for Windows-7 machines from: http://www.r-project.org/ For you convenience, we have saved the current Windows 32/64-bit version (R-2.15.1-win.exe) in the class locker. You can find it here: http://mit.edu/11.523/R-project/R-2.15.1-win.exe
- Choose any of the CRAN mirror sites in the US to download the software. For example,
- http://streaming.stat.iastate.edu/CRAN/ at Iowa State University
- Click 'Download R for Windows', choose the 'base' option, and the click on 'download R 2.15.1 for Windows' to get the 47 MB executable that contains the binaries for both the 32 and 64-bit versions of R.
(3) Install 32-bit R by running the file you just downloaded (R-2.15.1-win.exe) using the defaults except for these cases:
- Change the installation directory to be: C:\temp\rinstall
- Install only the 'core' and the '32-bit' version. (The 64-bit version is already installed on the machine and you do not need another copy.)
- Choose 'yes' to do a 'customized startup' rather than the default installation option
- Check "don't create a start menu folder" since you 32-bit R installation will only work temporarily on this one machine
- Uncheck the option to save the version number in the Registry (since you do not have permission to do this)
(4) Run 32-bit R and load some additional software packages into your personal library
- Locate and run R by either finding the executable in Windows Explorer or entering this command in the Start/Run Window:
- C:\temp\rinstall\bin\i386\Rgui.exe
- A large Window labeled 'RGui (32-bit)' will open with a smaller 'R-Console' window where you can enter commands following the '>' prompt.
(5) Run the following commands to install several optional library packages: (a) 'RODBC' to provide ODBC connectivity with database managers, (b) 'hexbin' to support improved visualization of scattergrams, and (c) 'psych' and 'GPArotation' to support additional principal component analysis. You could do this from the m Packages tab, but we need to specify a particular personal library so it is better to use the command line approach.
- install.packages("RODBC", lib="C:/temp/rlib32")
- install.packages("hexbin", lib="C:/temp/rlib32")
- install.packages("psych", lib="C:/temp/rlib32")
- install.packages("GPArotation", lib="C:/temp/rlib32")
When you enter the first command, you will see a popup window asking you to specify which CRAN mirror site to use to get the package. Choose a US site such as USA(IN) for Indiana University. Each of these commands will locate and copy a zipped file of binaries and from the CRAN mirror site into a local workspace - the default workspace on the lab machines is on your 'H' drive in:
C:\Users\jf\AppData\Local\Temp\RtmpYH6bl2\downloaded_packages
You do not need to care where the binaries are installed, except to realize that it is on your networked H drive which could slow the download. If you have trouble downloading from a CRAN mirror site, use the Packages/set-CRAN-mirror... tab to choose another site. Once all the packages are downloaded to your machine they need to be unzipped and installed as a usable library. We want these libraries to be local and in one of the writeable directories that we created on C:\. Here are the appropriate commands:
library("RODBC",lib.loc="C:\\TEMP\\Rlib32")
library("hexbin", lib="C:/temp/rlib32")
library("psych",lib.loc="C:\\TEMP\\Rlib32")
library("GPArotation",lib.loc="C:\\TEMP\\Rlib32")Note that some commands use forward slash '/' and others use double backward slashes '\\'. Either option gets converted to the same thing internally in R - namely, into the single forward slash '/' convention for unix paths. The double '\\' is needed since unix uses the single '\' as a escape character.
This part of the exercise repeats the example provided last week but uses the newly installed 32-bit R and adds some additional options and steps. You can cut-and-paste the R commands into the R-Console window.
(1) Copy the data to a local driver: Use Windows Explorer to copy the entire MiniClass data locker (180 MB)
- From: Z:\course\11\11.523\data_ua
- To: C:\TEMP
(2) Use the RODBC package to open a data channel to MS-Access via the built-in ODBC 'machine data source' named 'MS Access Database'
- library("RODBC",lib.loc="c:\\temp\\rlib")
- UAchannel <- odbcConnect("MS Access Database",uid="")
- In the popup window, navigate to your copy of the miniclass data that you have copied from the class data locker
into c:\temp\rlib and select the MS-Access database: vmt_data.mdb- To close this connection exit from R or use the command: odbcClose(UAchannel)
(3) Next, use this UAchannel to pull each of the four data tables from MS-Access into data frames in R
- demdata <- sqlFetch(UAchannel, "demographic_bg")
- bedata <- sqlFetch(UAchannel, "built_environment_250m")
- vmtdata <- sqlFetch(UAchannel, "vmt_250m")
- ctdata <- sqlFetch(UAchannel, "ct_grid")
(4) Use the matrix manipulation commands of R to check that the tables are properly read in and make sense
Command Result Comment
dim(demdata) [1] 3323 18 3323 rows and 18 columns dim(bedata)
[1] 117196 33 117196 rows and 22 columns dim(vmtdata) [1] 60895 21 117196 rows and 21 columns dim(ctdata) [1] 118668 6 118668 rows and 6 columns ctdata[1,] Join_Count TARGET_FID G250M_ID Community_
1 1 19 34 Regional Cities
TOWN_NAME CODAs
1 Amesbury 1
xxx[1,] displays all the columns of row=1 from the array 'xxx' summary(ctdata) Join_Count TARGET_FID G250M_ID
Min. :1 Min. : 19 Min. : 34
1st Qu.:1 1st Qu.: 75668 1st Qu.: 75118
Median :1 Median :174345 Median :173760
Mean :1 Mean :166050 Mean :165844
3rd Qu.:1 3rd Qu.:261092 3rd Qu.:261147
Max. :2 Max. :314601 Max. :314374
Community_ TOWN_NAME
Developing Suburbs:65288 Plymouth : 4276
Inner Core : 6125 Middleborough: 3003
Maturing Suburbs :30513 Boston : 2099
Regional Cities :16742 Taunton : 1985
Carver : 1644
Ipswich : 1514
(Other) :104147
CODAs
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.2595
3rd Qu.:1.0000
Max. :1.0000(5) Get descriptive statistics for the demographic data table:
summary(demdata) ## note the missing values (NA's) for some of the variables BG_ID CT_ID PCT_sch13 Min. :2.501e+11 Min. :2.501e+10 Min. :0.2689 1st Qu.:2.502e+11 1st Qu.:2.502e+10 1st Qu.:0.8022 Median :2.502e+11 Median :2.502e+10 Median :0.8922 Mean :2.502e+11 Mean :2.502e+10 Mean :0.8543 3rd Qu.:2.502e+11 3rd Qu.:2.502e+10 3rd Qu.:0.9431 Max. :2.503e+11 Max. :2.503e+10 Max. :1.0000 NA's :6 PCT_owner INC_MED_HS PCT_pov PCT_white Min. :0.0000 Min. : 0 Min. :0.00000 Min. :0.0000 1st Qu.:0.3869 1st Qu.: 41544 1st Qu.:0.02464 1st Qu.:0.7586 Median :0.6458 Median : 55284 Median :0.05382 Median :0.9027 Mean :0.6125 Mean : 59057 Mean :0.09154 Mean :0.8171 3rd Qu.:0.8759 3rd Qu.: 71745 3rd Qu.:0.11828 3rd Qu.:0.9633 Max. :1.0000 Max. :200001 Max. :0.84164 Max. :1.1642 NA's :8 NA's :6 NA's :4 PCT_under5 PCT_old65 pct_ov2car INC_CAPITA Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. : 0 1st Qu.:0.05283 1st Qu.:0.07864 1st Qu.:0.05466 1st Qu.: 20045 Median :0.07476 Median :0.11993 Median :0.11227 Median : 25548 Mean :0.07528 Mean :0.13000 Mean :0.12512 Mean : 28052 3rd Qu.:0.09645 3rd Qu.:0.16688 3rd Qu.:0.18182 3rd Qu.: 32804 Max. :0.32477 Max. :0.62932 Max. :0.52863 Max. :122938 NA's :4 NA's :4 NA's :8 pct_elsc_sch pct_less3hsh pct_labor pct_unempl Min. :0.0000 Min. :0.1385 Min. :0.0000 Min. :0.00000 1st Qu.:0.1329 1st Qu.:0.4688 1st Qu.:0.6194 1st Qu.:0.01570 Median :0.1821 Median :0.5632 Median :0.6844 Median :0.03187 Mean :0.1803 Mean :0.5718 Mean :0.6705 Mean :0.04446 3rd Qu.:0.2277 3rd Qu.:0.6610 3rd Qu.:0.7370 3rd Qu.:0.05609 Max. :0.6148 Max. :1.0000 Max. :1.0000 Max. :0.76060 NA's :4 NA's :8 NA's :4 NA's :6 F1_wealth F2_children F3_working Min. :-4.0844 Min. :-3.48684 Min. :-6.90175 1st Qu.:-0.4908 1st Qu.:-0.59101 1st Qu.:-0.50150 Median : 0.2355 Median : 0.06503 Median : 0.08234 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 3rd Qu.: 0.6952 3rd Qu.: 0.72343 3rd Qu.: 0.61345 Max. : 2.6100 Max. : 3.43180 Max. : 3.92767 NA's :9 NA's :9 NA's :9(6) Exclude certain rows and columns so that the factor analysis works and is more meaningful
For a nice discussion of Factor analysis see: http://en.wikipedia.org/wiki/Factor_analysis For a more extensive discussion of factor analysis, principal component analysis, and the interpretation of results, take a look at Chapter 6 of Sumeeta Srinivasan's PhD dissertation (DUSP/UIS PhD, 2000), "Linking land use and transportation: measuring the impact of neighborhood-scale spatial patterns on travel behavior," Massachusetts Institute of Technology: Urban Studies and Planning, PhD Dissertation: http://mit.dspace.org/handle/1721.1/70335?show=full
We should exclude the first and last few columns from the data frame before doing a factor analysis. (The first two are the block group and census tract IDs and the last three are the first three demographic factors computed in Mi Diao's thesis.
Create list of columns to remove from consideration and apply it to demdata to create demdata2
- outlist <- c("BG_ID","CT_ID","F1_wealth","F2_children","F3_working")
- demdata2 <- demdata[,setdiff(names(demdata),outlist)]
Double check the dimensions and a few rows of the result in demdata2 and try a factor analysis (using varimax by default for the rotation of the factors)
- fac3 <- factanal(demdata2, factors=3)
Error in cov.wt(z) : 'x' must contain finite values onlyOops, there are still problems with missing values in the data. Let's only consider block groups where household income >0.
- demdata3 <- demdata2[demdata2[,'INC_MED_HS']>0,]
- dim(demdata3)
[1] 3315 13- fac3 <- factanal(demdata2, factors=3)
Error in cov.wt(z) : 'x' must contain finite values onlyHmm, the right number of rows and columns, but still some problems. Let's use the 'complete.cases' command to retain only those rows that are complete cases:
- demdata4 <- demdata2[complete.cases(demdata2),]
dim(demdata4)
[1] 3314 13- << Note: building demdata4 from demdata3 instead of demdata2 will yield the same result since all the complete cases happen to have INC_MED_HS > 0 >>
Okay, one more row is eliminated and now the factor analysis works:
- fac4 <- factanal(demdata4, factors=3)
- fac4
Call: factanal(x = demdata4, factors = 3)Uniquenesses: PCT_sch13 PCT_owner INC_MED_HS PCT_pov PCT_white 0.246 0.183 0.321 0.289 0.384 PCT_under5 PCT_old65 pct_ov2car INC_CAPITA pct_elsc_sch 0.688 0.005 0.389 0.518 0.375 pct_less3hsh pct_labor pct_unempl 0.083 0.412 0.725Loadings: Factor1 Factor2 Factor3 PCT_sch13 0.815 -0.202 0.220 PCT_owner 0.818 0.380 INC_MED_HS 0.778 0.217 0.163 PCT_pov -0.842 PCT_white 0.764 -0.170 PCT_under5 0.552 PCT_old65 0.205 -0.324 -0.921 pct_ov2car 0.623 0.470 INC_CAPITA 0.649 -0.216 0.116 pct_elsc_sch -0.145 0.775 pct_less3hsh -0.109 -0.946 pct_labor 0.452 0.619 pct_unempl -0.518Factor1 Factor2 Factor3 SS loadings 4.593 2.438 1.351 Proportion Var 0.353 0.188 0.104 Cumulative Var 0.353 0.541 0.645Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 4112.4 on 42 degrees of freedom. The p-value is 0This output is okay and looks similar to Mi Diao's results but has one extra variable (pct_ov2car). Create a list with that column heading and use it to toss that column
- outlist <- c("pct_ov2car")
- demdata5 <-demdata4[,setdiff(names(demdata4),outlist)]
- dim(demdata5)
[1] 3314 12Okay, that column is eliminated, now try the factor analysis again
- fac5 <- factanal(demdata5, factors=3) # varimax is the default
- fac5
Call: factanal(x = demdata5, factors = 3)Uniquenesses: PCT_sch13 PCT_owner INC_MED_HS PCT_pov PCT_white PCT_under5 0.241 0.193 0.315 0.284 0.395 0.679 PCT_old65 INC_CAPITA pct_elsc_sch pct_less3hsh pct_labor pct_unempl 0.005 0.510 0.363 0.095 0.410 0.724Loadings: Factor1 Factor2 Factor3 PCT_sch13 0.819 -0.198 0.219 PCT_owner 0.811 0.382 INC_MED_HS 0.781 0.221 0.163 PCT_pov -0.846 PCT_white 0.756 -0.171 PCT_under5 0.560 PCT_old65 0.206 -0.323 -0.921 INC_CAPITA 0.659 -0.207 0.115 pct_elsc_sch -0.144 0.783 pct_less3hsh -0.105 -0.940 -0.102 pct_labor 0.454 0.620 pct_unempl -0.519Factor1 Factor2 Factor3 SS loadings 4.210 2.225 1.350 Proportion Var 0.351 0.185 0.113 Cumulative Var 0.351 0.536 0.649Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 3792.42 on 33 degrees of freedom. The p-value is 0This works and is similar to Prof. Diao's (except that factors lower than 0.35 are not omitted). We can get even closer to Prof. Diao's results by using a slightly different procedure: principal components analysis (PCA).
- prin5 <- principal(demdata5, nfactors=3, rotate="varimax")
prin5 # print results
Principal Components Analysis Call: principal(r = demdata5, nfactors = 3, rotate = "varimax") Standardized loadings (pattern matrix) based upon correlation matrix PC1 PC2 PC3 h2 u2 PCT_sch13 0.82 -0.20 0.27 0.78 0.22 PCT_owner 0.82 0.39 -0.05 0.82 0.18 INC_MED_HS 0.81 0.22 0.19 0.75 0.25 PCT_pov -0.86 -0.03 -0.04 0.75 0.25 PCT_white 0.80 -0.18 -0.07 0.67 0.33 PCT_under5 -0.02 0.73 0.06 0.53 0.47 PCT_old65 0.23 -0.32 -0.86 0.89 0.11 INC_CAPITA 0.71 -0.19 0.15 0.56 0.44 pct_elsc_sch -0.14 0.87 0.04 0.78 0.22 pct_less3hsh -0.10 -0.91 -0.13 0.85 0.15 pct_labor 0.43 -0.05 0.79 0.81 0.19 pct_unempl -0.61 0.04 0.13 0.39 0.61PC1 PC2 PC3 SS loadings 4.52 2.53 1.54 Proportion Var 0.38 0.21 0.13 Cumulative Var 0.38 0.59 0.72 Proportion Explained 0.53 0.29 0.18 Cumulative Proportion 0.53 0.82 1.00Test of the hypothesis that 3 components are sufficient.The degrees of freedom for the null model are 66 and the objective function was 8.01 The degrees of freedom for the model are 33 and the objective function was 1.61 The number of observations was 3314 with Chi Square = 5308.97 with prob < 0Fit based upon off diagonal values = 0.97These results match those of Prof Diao quite closely in Table 3 on page 45 (but our results include factors with low weights < 0.35). Try similar factor and principal component analysis for the other data table with built environment measures.
(7) GRAPHICS: Visualize data patterns using scattergrams and box plots.
Here is a useful reference regarding scatterplots in R: http://www.statmethods.net/graphs/scatterplot.html
Let's plot scattergrams for pairs of the variables: pct_sch13, pct_owner, and pct_pov
- pairs(~PCT_sch13+PCT_owner+PCT_pov,data=demdata5,main="Simple Scatterplot Matrix")
Hmm, some of these relationships look somewhat non-linear - an issue since factor analysis is searching for linear combinations of variables that identify clusters of block groups with similar characteristics. We may want to transform some of the variables before doing the factor analysis. But, first, there are so many data points that it is hard to see what is happening in the point clouds.Let's use the 'hexbin' procedure to 'bin' the data into 50 groups along each dimension before we generate the scatterplots. That way, we can use different symbols on the graph depending on whether a 'bin' has a few or very many cases that fall within it.
- bin <- hexbin(demdata5$PCT_sch13,demdata5$PCT_owner,xbins=50)
- plot(bin,main="Percent School by Ownership")
We see that homeownership and schooling are high together - a very common case in the upper right. Where homeownership is not so high (80+%), the relationship is much weaker between the home ownership percentage of a block group and the percent with some 13+ years of education.
Try examining other paris of variables to get a feel for the data and to being thinking about outliers and sub-markets where the relationships and relevant factors might be quite different. We will get back to some of these considerations in Part 3 when we consider alternative models and explore the possibility that the VMT/built-environment relationship may be different for different community types.
(8) Modeling VMT as a function of BE and Demographics
Next, lets fit the ordinary least squares (OLS) model in Prof. Diao's dissertation (Tables 5 and 6) that models the average annual mileage per vehicle (VMT) for vehicles registered within a 250x250m grid cell as a function of the 5 built environment (BE) factors and the 3 demographic factors.
In order to fit the models, we need to join the three tables in demdata, bedata, and vmtdata so each row (for a unique grid cell) has that grid cells mileage estimate plus the 5 BE factors and 3 demographic factors. While we are doing this join, we might as well include one additional column - the community type associated with each grid cell. Recall that, in last Thursday's spatial data management session, we used the TAZ_attr shapefile that had the 'community type' designation by the Metropolitan Area Planning Council (MAPC) for every traffic analysis zone. (Actually, the type is designated for each of the 164 municipalities in the metro Boston region.) In our miniclass, we showed how to do a 'spatial join' in ArcGIS to tag each grid cell with the type of community that it was in. The results have been saved into a table called ct_grid in our vmt_data.mdb database. We have already read this table into an R data frame called, ctdata.
We could use R tools to merge the four tables, but complex relational joins in R can be combersome and condusive to errors. Instead, let's join the four tables in MS-Access, bring over to R only the 10 columns of data (plus grid cell and block group IDs) for each grid cell, and doing our modeling with that table. Actually, we will want to bring over a few more columns indicating the number of households estimated to be in each grid cell, the quality of VMT estimate, and the like. (See page 44 of the dissertation regarding which grid cells are included in the modeling.)
For your convenience, we have already built and saved the query in MS-Access that joins the 4 tables to produce a new table, t_vmt_join4, with 60377 rows. (Note that, to be a row in this table, a grid cell had to have estimates of VMT and all 8 factors.) Pull this table into a new data frame in R and use 'complete.cases' to omit rows with missing values:
- vmt_join <- sqlFetch(UAchannel, "t_vmt_join4")
- dim(vmt_join)
[1] 53429 19- vmtall <- vmt_join[complete.cases(vmt_join),]
- dim(vmtall)
[1] 53250 19- vmtall[1,]
g250m bg_id vmt_vin f1_nw f2_connect f3_transit f4_auto
1 65 250092671024 13270.79 0.4439 -0.16947 1.10299 -0.13811
f5_walk f1_wealth f2_kids f3_work ct hshld_2000
1 -0.55755 0.8841909 1.669427 1.004693 Developing Suburbs 8.727
vmt_pop vmt_hh no_vin no_good vmt_tot_mv vin_mv
1 20971 59306 39 33 1186400 105Note that my MS-Access query to join the tables also changed the names of a few columns to make them lower case and shorter. Now we are ready to run a simple ordinary least squares regression of VMT per vin on the 8 BE and 3 Demographic factors:
- run_ols1 <- lm(formula = vmt_vin ~ f1_nw + f2_connect + f3_transit + f4_auto + f5_walk +f1_wealth +f2_kids + f3_work, data=vmtall)
- summary(run_ols1)
Call: Call: lm(formula = (vmt_tot_mv/vin_mv) ~ f1_nw + f2_connect + f3_transit + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work, data = vmtall) Residuals: Min 1Q Median 3Q Max -11701.6 -727.7 -34.7 704.2 10887.1 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12596.789 10.807 1165.63 <2e-16 *** f1_nw 698.115 7.080 98.60 <2e-16 *** f2_connect -472.713 5.815 -81.29 <2e-16 *** f3_transit 858.899 5.890 145.82 <2e-16 *** f4_auto 164.073 8.735 18.78 <2e-16 *** f5_walk -146.136 6.284 -23.26 <2e-16 *** f1_wealth -246.211 10.714 -22.98 <2e-16 *** f2_kids 201.393 8.571 23.50 <2e-16 *** f3_work 198.404 6.849 28.97 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1244 on 53241 degrees of freedom Multiple R-squared: 0.5183, Adjusted R-squared: 0.5182 F-statistic: 7160 on 8 and 53241 DF, p-value: < 2.2e-16Hmm, the model coefficients have similar signs and values but much lower R-squared than reported in the dissertation. That is because vmt_vin is the estimated mileage for individual grid cells and there are many grid cells with few vehicles and noisy estimates. Instead of using individual grid cells, Prof Diao average the numbers in the 9-cell window around each grid cell. We can do this by dividing vmt_tot_mv by vmt_mv in the vmt_250m table. The vmt_tot_mv column is the total mileage for the 9-cell window and vmt_mv is the total number of vehicles in the 9-cell region. So let's regress vmt_tot_mv / vmt_mv on the left hand side (taking care to do the division only for grid cells that had vmt_mv > 0.
- run_ols1 <- lm(formula = (vmt_tot_mv / vin_mv) ~ f1_nw + f2_connect + f3_transit + f4_auto + f5_walk +f1_wealth +f2_kids + f3_work, data=vmtall)
- summary(run_ols1)
Call: lm(formula = (vmt_tot_mv/vin_mv) ~ f1_nw + f2_connect + f3_transit + f4_auto + f5_walk + f1_wealth + f2_kids + f3_work, data = vmt_join)Residuals: Min 1Q Median 3Q Max -11701.7 -727.9 -35.4 704.4 10883.1Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12596.906 10.786 1167.88 <2e-16 *** f1_nw 698.492 7.071 98.78 <2e-16 *** f2_connect -472.531 5.815 -81.26 <2e-16 *** f3_transit 859.571 5.885 146.07 <2e-16 *** f4_auto 165.860 8.723 19.01 <2e-16 *** f5_walk -145.796 6.284 -23.20 <2e-16 *** f1_wealth -245.697 10.701 -22.96 <2e-16 *** f2_kids 201.831 8.572 23.55 <2e-16 *** f3_work 196.183 6.841 28.68 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 1246 on 53412 degrees of freedom (8 observations deleted due to missingness) Multiple R-squared: 0.5177, Adjusted R-squared: 0.5177 F-statistic: 7167 on 8 and 53412 DF, p-value: < 2.2e-16
Okay, this looks much better, and gives an indication of how much noise there is in the individual grid cells - at least in the suburbs where the density is much lower. We still have only produced the ordinary least squares results and have not adjusted for spatial autocorrelation (so the coefficients are not directly comparable to any of the results shown in the dissertation).
(9) Good references for further work with R
- An online 'Quick-R' reference: www.statmethods.net
- A paperbook (or e-book) that I find to be a useful reference for data manipulation and basic stats
oreilly.com
R in a Nutshell: A Desktop Quick Reference
Joseph Adler, O'Reilly
2nd edition, 2012 (Sept.)paperbook $49.99 at O'Reilly:
http://shop.oreilly.com/product/0636920022008.do
($35.99 as Ebook)paperbook $37.05 at Amazon
http://www.amazon.com/R-Nutshell-OReilly-Joseph-Adler/dp/144931208X/ref=sr_1_2?s=books&ie=UTF8&qid=1350371375&sr=1-2&keywords=r+in+a+nutshellkindle version: $19.79
Now that we have a feel for basic R commands and model building, we can begin a more open-ended inquiry into the data. We will quickly want to learn some additional commands but, hopefully, we have enough of a feel for R and the connections between ArcGIS, R and MS-Access that we can teach ourselves many of the additional commands that we need by using the online help files and the various references that we have provided. (Depending upon our statistics background, we may also need to do some significant reading of a statistics text in addition to the references we give for the software tools!)
Here are some suggestions for exploring the data further:
- Use the Community Type category to subdivide the grid cells into those within the four community types. Refit the OLS model for each of the types of community. How different are the results? How might you dig into the data further to sort out what might be happening and what transform or redefinition of variables might improve the model fit?
- We can use the regression model at the end of Part 2 to estimate VMT for each grid cell and compute the 'residuals' - that is the difference between the actual and predicted VMT values for each grid cell. Then we can import these residuals back into ArcGIS, join them to the grid cell layer and see whether the residuals seem to vary systematically across metro Boston. We can even do some 'Lisa' plots and compute spatial statistics such as Moran's I to identify hot spots and test how strong is the evidence of spatial autocorrelation. One of the R commands that you may find helpful here is the sqlSave() command to write tables back to MS-Access. use the online help function in R to learn about the sqlSave() command: help(sqlSave). Also, here is a webpage with a tutorial on plotting residuals from a regression model: http://www.r-tutor.com/elementary-statistics/simple-linear-regression/residual-plot
- The pairwise scattergrams that we produced seemed to suggest skewness in some of the distributions and non-linearities in some of the relationships. Select a few variables where this appears problematic and experiment with some transformations and then refit the VMT model. Do the results suggest different effects? Do they suggest ways of identifying sub-markets within which the VMT / BE / Demographic relationships might be different.
- From the data, it is pretty clear that annual mileage per vehicle and per household is relatively low in the central city and relatively high in the outer suburbs. The BE factors tend to measure local effects and they do seem to explain some of the within-town variation in VMT that is evident in the suburbs. What other measures might you consider to capture built environment effects that are at a different scale from those which Prof. Diao considered? Take a look at the often-cited paper by Cervero and Kockelman on "Travel Demand and the 3Ds: Density, Diversity, and Design" (Transportation Research, part D, Vol. 2, No. 3, pp. 199-219, 1997). A copy, Cervero_1997.pdf, has been placed in the 'readings' folder of the 11.523 class locker for your convenience.
For those of you who are taking the MiniClass for credit, pick one or more of these suggestions and see what you can learn prior to the next Tuesday session on Nov. 6. Prepare a few powerpoint or PDF slides and be prepared to spend 5-10 minutes explaining and discussing your preliminary results.
Last modified: 5 November 2012 [jf]
Back to DUSP | CRON | MIT | MiniClass Homepage