Massachusetts Institute of Technology
Department of Urban Studies and Planning

11.220 Quantitative Reasoning and Statistical Methods for Planning

Computing Session #7


Topic

More statistical methods using SPSS.

  1. Testing association between two ordinal or nominal variables (Crosstabs - Chi-square test)
  2. Linear regression
  3. Linear regression with a dummy variable


Contents


SPSS > Linear Regression

Linear Regression estimates the coefficients of a linear equation, involving one or more independent variables, that best predict the value of the dependent variable.

In this example, we are interested in regressing wage/salary income on years of education. We ask the questions, "Is there a significant linear relationship between income and education? If so, how much of the variance in income can be explained by education?"

1. Go to 'Regression' under 'Analyze', then click 'Linear...'.

2. Set 'incws' as 'Dependent' and 'educ_yr' as 'Independent(s)'.

3. Interpreting the output...

The "R" in the Model Summary table above is the Correlation Coefficient. The range of "R" is from -1 to +1 and indicates a negative or positive relationship between two variables. If r = 0 then no relationship exists between the two variables. Since we are conducting a bivariate regression (only two variables, X and Y) we may interpret "R" in the usual way. When we add more explanatory variables in a multiple regression the interpretation of "R" is less clear. The "R Square" is interpreted as the percentage of the variance in Y that can be explained by X. An "R Square" of .141 in the table above indicates that only 14.2% of the variance in income can be explained by the number of years of education. We should consider adding more explanatory variables to predict income.

The "Adjusted R Square" takes into account the number of explanatory variables and the sample size, i.e., it is adjusted based on the degrees of freedom. The value of R Square is .141, while the value of Adjusted R Square is .119. Adjusted R Square is computed using the formula 1 - ( (1-R Square)(N-1 / N - k - 1) ), where N is the number of observations in the sample and k is the number of explanatory variables. From this formula, you can see that when the number of observations is small and the number of predictors is large, there will be a much greater difference between R-square and adjusted R-square (because the ratio of (N-1 / N - k - 1) will be much less than 1). By contrast, when the number of observations is very large compared to the number of predictors, the value of R-square and adjusted R-square will be much closer because the ratio of (N-1)/(N-k-1) will approach 1.

As explanatory variables are added to the model, each one (the X's) will explain some of the variance in the dependent variable (Y) simply due to chance. One could continue to add predictors to the model which would continue to improve the ability of the predictors to explain the dependent variable, although some of this increase in R Square would be simply due to chance variation in that particular sample. The adjusted R Square attempts to yield a more honest value to estimate the R Squared for the population.

The "Std. Error of the Estimate," also called the root mean square error, is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).

SPSS allows you to specify multiple models in a single regression command and the "Model" column in the ANOVA table above tells you the number of the model being reported.

The next column under "Model" gives you the source of variance, Regression, Residual and Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Regression) and the variance which is not explained by the independent variables (Residual, sometimes called Error). Note that the Sums of Squares for the Regression and Residual add up to the Total, reflecting the fact that the Total is partitioned into Regression and Residual variance.

Next are the Sum of Squares associated with the three sources of variance, Total, Regression and Residual. These can be computed in many ways. Conceptually, these formulas can be expressed as:
SSTotal = the total variability around the mean, Σ(Y - Ybar)^2.
SSResidual = the sum of squared errors in prediction, Σ(Y - Ypredicted)^2.
SSRegression = the improvement in prediction from using the predicted value of Y rather than just the mean of Y. This is the sum of squared differences between the predicted value of Y and the mean of Y, Σ(Ypredicted - Ybar)^2. Another way to think of this is that SSRegression = SSTotal - SSResidual; equivalently, SSTotal = SSRegression + SSResidual. Also note that SSRegression / SSTotal equals .141, the value of R Square. This is because R Square is the proportion of the variance explained by the independent variables, and hence can be computed as SSRegression / SSTotal.
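
To make these formulas concrete, here is a minimal Python sketch using a small made-up data set (not the sample analyzed in this session); the numbers are purely illustrative:

    import numpy as np

    # Hypothetical data: years of education and income (in $1000s)
    x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
    y = np.array([15, 22, 24, 30, 28, 40, 36, 45], dtype=float)

    b1, b0 = np.polyfit(x, y, 1)   # slope and intercept of the fitted line
    y_hat = b0 + b1 * x            # predicted values of Y

    ss_total = np.sum((y - y.mean()) ** 2)            # Σ(Y - Ybar)^2
    ss_residual = np.sum((y - y_hat) ** 2)            # Σ(Y - Ypredicted)^2
    ss_regression = np.sum((y_hat - y.mean()) ** 2)   # Σ(Ypredicted - Ybar)^2

    print(np.isclose(ss_total, ss_regression + ss_residual))   # True: SSTotal = SSRegression + SSResidual
    print(ss_regression / ss_total)                            # R Square for these made-up data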

The "df" column are the degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom. In this case, there were N=41 individuals, so the df for total is 40. The model degrees of freedom corresponds to the number of predictors minus 1 or (k-1). You may think this would be 1-1 (since there were 1 independent variable in the model, i.e., education). But, the intercept is automatically included in the model (unless you explicitly omit the intercept). Including the intercept, there are 2 predictors, so the model has 2-1=1 degrees of freedom. The Residual degrees of freedom is the df total minus the df model, 40 - 1 is 39.

The Mean Squares are the Sums of Squares divided by their respective df. For the Regression, 2.34E+09 / 1 = 2335276982. For the Residual, 1.42E+10 / 39 = 363865155.6. These are computed so that the F ratio can be formed: dividing the Mean Square Regression by the Mean Square Residual tests the significance of the predictors in the model.

The F-value is the Mean Square Regression (2335276982) divided by the Mean Square Residual (363865155.6), yielding F = 6.418. The p-value associated with this F-value is small (0.015). These values are used to answer the question "Do the independent variables reliably predict the dependent variable?" The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable." In this case, you could say that years of education can be used to reliably predict income. If the p-value were greater than 0.05, you would say that the independent variable does not show a statistically significant relationship with the dependent variable, or that it does not reliably predict the dependent variable.

Note that in multiple regression this is an overall significance test assessing whether all of the independent variables used together reliably predict the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable. The ability of each individual independent variable to predict the dependent variable is addressed in the table below, where each of the individual variables is listed.
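
As a check, the F ratio, its p-value, and the Std. Error of the Estimate can all be reproduced from the mean squares quoted above; the Python sketch below assumes scipy is available:

    import math
    from scipy import stats

    ms_regression = 2335276982.0    # Mean Square Regression from the ANOVA table
    ms_residual = 363865155.6       # Mean Square Residual from the ANOVA table
    df_regression, df_residual = 1, 39

    f_value = ms_regression / ms_residual
    p_value = stats.f.sf(f_value, df_regression, df_residual)
    print(round(f_value, 3), round(p_value, 3))    # -> 6.418 0.015

    # The Std. Error of the Estimate (root mean square error) is the square
    # root of the Mean Square Residual, roughly 19075 here.
    print(round(math.sqrt(ms_residual), 2))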

Under the "Model" column is a list of the predictor variables (Constant, educ_yr). The first variable (Constant) represents the constant, also referred to in textbooks as the Y intercept, the height of the regression line when it crosses the Y axis. In other words, this is the predicted value of income when all other variables are 0.

In the column labeled "B" are the values for the regression equation for predicting the dependent variable from the independent variable. These are called unstandardized coefficients because they are measured in their natural units. As such, the coefficients cannot be compared with one another to determine which one is more influential in the model, because they might be measured on different scales. The regression equation can be presented as:

Ypredicted = b0 + b1*x1

The coefficients provide the values for b0 and b1 for this equation. Expressed in terms of the variables used in this example, the regression equation is:

incwsPredicted = -9473.852 + 3349.145*educ_yr

These estimates tell you about the relationship between the independent variable(s) and the dependent variable. The coefficient for educ_yr gives the increase in income predicted for a 1-unit (one-year) increase in years of education. Note: for independent variables that are not significant, the coefficients are not significantly different from 0, which should be taken into account when interpreting them. (See the columns with the t-value and p-value on testing whether the coefficients are significant.)
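
To see what the fitted equation implies, you can plug values of educ_yr into it yourself. The short Python sketch below does this for two illustrative values (12 and 16 years of education); the function name is only for this example:

    # Fitted coefficients from the Coefficients table above
    b0, b1 = -9473.852, 3349.145

    def predicted_income(years_of_education):
        """Predicted wage/salary income from the fitted bivariate equation."""
        return b0 + b1 * years_of_education

    print(round(predicted_income(12), 2))   # about 30715.89
    print(round(predicted_income(16), 2))   # about 44112.47

Each additional year of education raises the predicted income by 3349.145, the slope coefficient.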

In the "Std. Error" column are the standard errors associated with the coefficients. The standard error is used for testing whether the parameter is significantly different from 0 by dividing the parameter estimate by the standard error to obtain a t-value (see the column with t-values and p-values). The standard errors can also be used to form a confidence interval for the parameter, as shown in the last two columns of this table.

SPSS calls the Standardized Coefficients "Beta." These are the coefficients that you would obtain if you standardized all of the variables in the regression, including the dependent and all of the independent variables, and ran the regression. By standardizing the variables before running the regression, you have put all of the variables on the same scale, and you can compare the magnitude of the coefficients to see which one has more of an effect. You will also notice, especially when you do multiple regression, that the larger Betas are associated with the larger t-values.
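
To see what "standardizing all of the variables first" amounts to, the Python sketch below uses the same made-up data as earlier; in a bivariate regression the standardized slope (Beta) works out to be Pearson's r:

    import numpy as np

    x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
    y = np.array([15, 22, 24, 30, 28, 40, 36, 45], dtype=float)

    # Standardize both variables: subtract the mean, divide by the standard deviation
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)

    beta, _ = np.polyfit(zx, zy, 1)            # slope of the standardized regression
    print(round(beta, 4))
    print(round(np.corrcoef(x, y)[0, 1], 4))   # Pearson's r: the same value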

The t-values and 2-tailed p-values in the next columns are used in testing the null hypothesis that the coefficient/parameter is 0. If you use a 2-tailed test, you compare each p-value to your preselected value of alpha. Coefficients with p-values less than alpha are statistically significant. For example, if you chose alpha to be 0.05, coefficients with a p-value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0). With a 2-tailed test and alpha of 0.05, you may reject the null hypothesis that the coefficient for educ_yr equals 0, i.e., years of education has an effect on income.
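
To reproduce a coefficient's p-value by hand, divide the coefficient ("B") by its standard error to get t and compare it to a t distribution with the residual degrees of freedom. The sketch below (assuming scipy) uses the fact that in a bivariate regression the slope's t-value is the square root of the F reported earlier:

    from scipy import stats

    t_value = 6.418 ** 0.5    # t for educ_yr, about 2.533 (in general, t = B / Std. Error)
    df_residual = 39

    p_value = 2 * stats.t.sf(abs(t_value), df_residual)   # two-tailed p-value
    print(round(t_value, 3), round(p_value, 3))            # about 2.533 and 0.015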

The constant is not significantly different from 0 at the 0.05 alpha level. However, whether the intercept differs from 0 is seldom of substantive interest, and in this model it is in any case plausible that someone with educ_yr = 0 would have a wage/salary income near 0.



Created by Myounggu Kang on May 3, 2004. Modified by Rhonda Ryznar on April 28, 2005.