Reading and Using STATA Output

This handout is designed to explain the STATA readout you get when doing regression. If you need help getting data into STATA or doing basic operations, see the earlier STATA handout.

I begin with an example.

In the following statistical model, I regress 'Depend1' on three independent variables. Depend1 is a composite variable that measures perceptions of success in federal advisory committees. The 'balance' variable measures the degree to which membership is balanced, the 'express' variable measures the opportunity for the general public to express opinions at meetings, and the 'prior' variable measures the amount of preparatory information committee members received prior to meetings. I get the following readout.

. reg Depend1 balance express prior

  Source |       SS       df       MS                  Number of obs =     337
---------+------------------------------               F(  3,   333) =  101.34
   Model |  129.990394     3  43.3301314               Prob > F      =  0.0000
Residual |  142.381532   333  .427572167               R-squared     =  0.4773
---------+------------------------------               Adj R-squared =  0.4725
   Total |  272.371926   336  .810630731               Root MSE      =  .65389

------------------------------------------------------------------------------
 Depend1 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 balance |   .3410554   .0498259      6.845   0.000       .2430421    .4390686
 express |  -.3248486   .0878143     -3.699   0.000      -.4975894   -.1521079
   prior |   .4562601   .0443442     10.289   0.000       .3690301    .5434901
   _cons |  -3.047717   .2280971    -13.361   0.000      -3.496409   -2.599024
------------------------------------------------------------------------------

A quick glance at the t-statistics reveals that something is likely going on in these data. Do we know for certain? As close to certain as statistics will allow. Look at the F(3, 333) = 101.34 line, and then below it the Prob > F = 0.0000. STATA is very nice to you: it automatically conducts an F-test of the null hypothesis that nothing is going on here (in other words, that all of the coefficients on your independent variables are equal to zero). We reject this null hypothesis with extremely high confidence - above 99.99%, in fact.
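
If you ever want to run this joint test yourself - say, on just a subset of the coefficients - STATA's test command will do it after any regression. A minimal illustration, run right after the regression above:

. test balance express prior

This tests the hypothesis that all three coefficients are zero at once, and should report the same F(3, 333) = 101.34 as the readout.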

So now that we are pretty sure something is going on, what now?

Generally, we begin with the coefficients, which are the 'beta' estimates, or the slope coefficients in a regression line. In this case the 'line' is actually a 3-D hyperplane, but the meaning is the same.
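
Concretely, the coefficient column gives the fitted regression equation. Reading the estimates straight out of the table above:

predicted Depend1 = -3.048 + (.341 x balance) - (.325 x express) + (.456 x prior)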

First, consider the coefficient on the constant term, '_cons'. It is obviously large and significant. This is the intercept for the regression line (in this case, the regression hyperplane). It is the predicted value of Depend1 when all of the independent variables equal zero. Does this have any intuitive meaning? Well, consider the following summary table:

. sum Depend1 balance express prior

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 Depend1 |     359   -6.39e-09   .9367157  -4.089263   1.008503  
 balance |     584    4.239726   .7780637          1          5  
 express |     597    .5678392   .4276565          0          1  
   prior |     580    4.149425   .8828358          1          5  

Two of the three independent variables (balance and prior) never equal zero - their minimum is 1 - so the point where every independent variable equals zero lies outside the observed data, and we are left wondering what meaning the intercept has. By itself, not much. In some regressions, the intercept would have a lot of meaning. Here it does not, and I wouldn't spend too much time writing about it in the paper.
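
If you do want an intercept with an intuitive meaning, one common trick is to center the independent variables at their means before running the regression; the intercept then becomes the predicted value of Depend1 for a committee that is average on all three measures. A minimal sketch, assuming your version of STATA stores r(mean) after summarize (the _c names are just my own labels):

. quietly summarize balance
. generate balance_c = balance - r(mean)
. quietly summarize express
. generate express_c = express - r(mean)
. quietly summarize prior
. generate prior_c = prior - r(mean)
. reg Depend1 balance_c express_c prior_c

The slope coefficients will be identical; only the intercept changes.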

I'm much more interested in the other three coefficients. How do I begin to think about them?

There are two important concepts here. One is magnitude, and the other is significance. Magnitude is the size of the effect: the coefficient on prior is .456, so a one-unit increase in preparatory information is associated with a .456-unit increase in Depend1, holding balance and express constant. Significance is how confident we are that the effect is not really zero: the t-statistic is just the coefficient divided by its standard error, and P>|t| is the probability of seeing a t that large in magnitude if the true coefficient were zero. All three variables here are significant well beyond the 99% level.
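
You can pull these quantities out of STATA directly. After a regression, _b[varname] holds the coefficient and _se[varname] its standard error, so, for instance, run right after the first regression:

. display _b[prior]/_se[prior]

This should print approximately 10.289, the t-statistic reported for 'prior' in the table.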

So what does all the other stuff in that readout mean?

The ANOVA table has four columns: the Source of the variation, the Sum of Squares (SS), the degrees of freedom (df), and the Mean Square (MS), which is just SS divided by df.
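
These columns fit together with the summary statistics on the right in a fixed way: MS = SS/df, the F-statistic is the Model MS divided by the Residual MS, and R-squared is the Model SS divided by the Total SS. You can check this with display, using the numbers from the first readout:

. display 129.990394/3
. display (129.990394/3)/(142.381532/333)
. display 129.990394/272.371926

These give 43.33 (the Model MS), 101.34 (the F-statistic), and .4773 (the R-squared), matching the table.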

In the results table for my paper, I present two models side by side, in two columns. So why a second column, Model 2? Because I have a fourth variable I haven't used yet. Here is the second regression:

  Source |       SS       df       MS                  Number of obs =     336
---------+------------------------------               F(  4,   331) =   86.27
   Model |  138.541532     4  34.6353831               Prob > F      =  0.0000
Residual |   132.89241   331  .401487644               R-squared     =  0.5104
---------+------------------------------               Adj R-squared =  0.5045
   Total |  271.433943   335  .810250575               Root MSE      =  .63363

------------------------------------------------------------------------------
 Depend1 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 express |  -.0285837   .1058492     -0.270   0.787      -.2368057    .1796383
 balance |   .3267551    .048381      6.754   0.000        .231582    .4219282
   prior |   .4277068   .0433962      9.856   0.000       .3423397    .5130739
openmeet |  -.1737696   .0363659     -4.778   0.000      -.2453069   -.1022322
   _cons |  -2.627165   .2374349    -11.065   0.000      -3.094236   -2.160093
------------------------------------------------------------------------------

This is the regression for my second model, the model which uses an additional variable - whether the committee had meetings open to the public. Note that when the openmeet variable is included, the coefficient on 'express' falls nearly to zero and becomes insignificant. In other words, controlling for open meetings, opportunities for expression have no effect. But if we fail to control for open meetings, then 'express' picks up the effect of open meetings, because opportunities for expression are highly correlated with open meetings. This is an important piece of interpretation - you should point it out to the reader.
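
If you suspect this kind of overlap between two independent variables, check the correlation directly before interpreting either coefficient. For example (output not shown here):

. pwcorr express openmeet, sig

A large, significant correlation is exactly the situation in which adding one variable can knock out the other's coefficient.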

Why did I combine both these models into a single table? Because it is more concise, neater, and allows for easy comparison. Generally, you should try to get your results down to one table or a single page's worth of data. Too much data is as bad as too little data.
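
Newer versions of STATA can assemble a side-by-side table like this for you with the estimates commands; a sketch, assuming your version supports them:

. reg Depend1 balance express prior
. estimates store model1
. reg Depend1 express balance prior openmeet
. estimates store model2
. estimates table model1 model2, se

Otherwise, build the table by hand in your word processor - coefficients with standard errors beneath them, one column per model.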

A word about graphs:

In your writing, try to use graphs to illustrate your work. Numbers say a lot, but graphs can often say a lot more. You might use graphs to demonstrate the skew in an interesting variable, the slope of a regression line, or some weird irregularity that may be confounding your linear model. Always keep graphs simple and avoid making them overly fancy.

Inserting Graphs Into MS Word:


In STATA, when you type the graph command as follows:

. graph Y X, saving(mygraph)

STATA will create a file "mygraph.gph" in your current directory. Unfortunately, only STATA can read this file. In order to make it useful to other programs, you need to convert it into a postscript file. To do this, in STATA, type:

. translate mygraph.gph mygraph.ps

STATA then creates a file called "mygraph.ps" inside your current directory. You can now print this file on Athena by exiting STATA and printing from the Athena prompt.
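
For example, from the Athena prompt you might print with lpr (the printer name here is just a placeholder - substitute whichever printer you actually use):

athena% lpr -Pprinter-name mygraph.ps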

Alternatively, you could type:

. translate mygraph.gph mygraph.eps

This creates an encapsulated postscript file, which can be imported into MS Word. In MS Word, click on the "Insert" tab, go to "Picture", and set the file type to "*.eps" (encapsulated postscript) files. You should be able to find "mygraph.eps" in the browsing window and insert it into your MS Word file without too much difficulty.

Final Word:

Find a professionally written paper or two from one of the many journals in Dewey Library, and read them. Make sure you find a paper that uses a lot of data. You don't have to be as sophisticated about the analysis, but look at how the paper uses the data and results. Get a feel for what you are doing by looking at what others have done.