An Introduction to Stata

written by Aimee Chin
February 7, 2000

Stata is a statistical analysis software package. You will need Stata to complete the empirical exercises in the problem sets.

This handout provides an introduction to Stata. It consists of five parts:

A. The Three Components of Your Stata Session
B. The Most Basic Commands: getting into Stata, getting help, opening a data set, getting out of Stata
C. The Structure of a Typical Stata do-file
D. An Example of a Stata do-file, including Stata Commands for Getting Descriptive Statistics and Running Regressions
E. A Few Tips

A. The Three Components of Your Stata Session

Although it is possible to use Stata interactively (i.e., you enter the command at the Stata prompt, Stata performs it, you enter another command, etc.), in this course you will be required to write Stata do-files. The advantage of writing a do-file is that you do not have to type the same commands again and again before you get the correct sequence of commands. Also, if you do not complete your problem set in a single session, you can easily pick up where you left off.

Note that Stata will stop running at a line that it cannot execute. When a Stata do-file stops running in the middle, you will need to fix your do-file. In your text editor, edit the part of the program that Stata stopped at. Run the program again. When Stata executes the whole program, you will have a clean log file.

B. The Most Basic Commands: getting into Stata, getting help, opening a data set, getting out of Stata

1. Getting Started

To start Stata, at the Athena prompt, you type

and then start Stata by typing

stata

2. Tutorial and Help

Stata provides on-line help. For a menu of choices, type

help

and press Enter. You can obtain help on any command in Stata by typing help followed by the command's name. For example, to learn about the sort command, type

help sort

Stata includes an on-line tutorial that will help you learn about Stata. To run the tutorial, type

tutorial intro

3. Opening a Data Set

You can call up a Stata data set for use by typing

use filename

Stata will automatically look for filename.dta in your current directory (if the file is located elsewhere, or if its extension is different, you will need to specify the full path or extension). If there is something already in memory, you first need to type

clear

All the contents in the current Stata workspace will be erased, and Stata will be ready to load up a new data set.
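As a minimal sketch of loading data, the lines below assume the example data set auto.dta that ships with recent versions of Stata and the sysuse command that locates it; neither is part of the course files, and the name auto_copy is made up for illustration. The sketch also shows that clear can be given as an option to use, which has the same effect as typing clear first. (The save command is described further down in this section.)

```stata
* Load Stata's shipped example data (assumed to be installed)
sysuse auto, clear

* Write a copy to the current directory so that "use" can find it by name
save auto_copy, replace

* The clear option erases whatever is in memory before loading the file
use auto_copy, clear
describe, short
```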

I will always provide you with data sets that are already in Stata format. However, in the future, you may encounter data in other formats (e.g., ASCII, Excel spreadsheets) or you may have to input data yourself. You can find more information about this by typing

help infile
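For comma-separated text files, more recent versions of Stata also offer the import delimited command (older versions used insheet for the same job). The following is only a sketch under that assumption: the file name scores.csv and the variables id and score are invented for illustration, and the file is written by the sketch itself so that the example is self-contained.

```stata
* Write a small comma-separated file (hypothetical name and contents)
tempname fh
file open `fh' using "scores.csv", write replace text
file write `fh' "id,score" _n
file write `fh' "1,90" _n
file write `fh' "2,85" _n
file close `fh'

* Read it in, taking variable names from the first row
import delimited using "scores.csv", varnames(1) clear
list
```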

Suppose you have made changes to your data set and wish to save the changes. Type

save filename

and Stata will name the file filename.dta. If filename.dta already exists, Stata will not perform the save. You can either specify a new file name or, to overwrite the old data file, type

save filename, replace
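The behavior of save with and without replace can be checked with a short sketch. It assumes the shipped auto data set (loaded with sysuse) and uses the made-up names auto_modified and price_thousands; capture is used so that the refused save does not stop the run.

```stata
sysuse auto, clear
generate price_thousands = price / 1000

* First save (replace makes the sketch safe to rerun)
save auto_modified, replace

* Saving again under the same name without replace is refused;
* capture traps the error and stores a nonzero return code in _rc
capture save auto_modified
display _rc

* With replace, the existing file is overwritten
save auto_modified, replace
```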

4. Exiting Stata

To exit Stata, type

exit

Stata will not let you exit if there is unsaved data. If you don't wish to save the modified data set, just type

exit, clear

C. The Structure of a Typical Stata do-file

Here is the structure of a typical Stata do-file:

***

cap log close

set more 1

clear

cd ~/your_directory/your_14.31_subdirectory

log using filename, replace

use filename

<insert stata commands here>

log close

***

The first line closes any log files that you might have accidentally left open.

Line 2 tells Stata not to wait for keyboard input before displaying the next screen of output -- you will not be able to read what Stata is doing as it scrolls by, but you can read the output in the log file.

Line 3 tells Stata to erase everything in the current workspace memory. A do-file containing a command to open a data file cannot be executed if there is something in current memory that has not been saved.

Line 4 tells Stata the default location of files to be used and files to be created.

Line 5 tells Stata to start a log file named filename.log to echo the session. Appending ", replace" overwrites any existing log file of the same name. Line 8 closes this log file.

Line 6 opens up a Stata data file named filename.dta.

Line 7 is the meaty part of the program, where you issue the commands for Stata to perform with the data. You can learn about specific commands in the next section of this handout.
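For readers on a more recent version of Stata, the same skeleton might look like the sketch below. It is an illustration rather than the required course template: it spells the pause setting as set more off, adds the text option so that the log is written as plain text with the .log extension (newer versions otherwise default to a .smcl log), and loads the shipped auto data set with sysuse in place of a course data file. The name skeleton is hypothetical.

```stata
* skeleton.do -- a hypothetical do-file name
capture log close
set more off
clear

* text option: write a plain-text log named skeleton.log
log using skeleton, replace text

* Shipped example data, standing in for a course data set
sysuse auto

* <insert stata commands here>
summarize price mpg

log close
```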

D. An Example of a Stata do-file, including Stata Commands for Getting Descriptive Statistics and Running Regressions

Here is a more detailed example of a Stata do-file, complete with some commands you will likely be using. Note that the comments in parentheses are my comments to you; they are NOT part of the Stata do-file.

***

/* STARTING THE SESSION */

cap log close

set more 1

clear

log using sample, replace

* Sample do file

(A star at the beginning of a line indicates a comment; Stata ignores this line because there is nothing to execute.)

/* Created by Aimee Chin on 2/7/00 */

(You can make comments in places other than the beginning of the line by using "/* comment */" )

use data

/* COMMANDS FOR DESCRIBING THE DATA */

describe

(Gives a description of your data file, including variable names, any variable labels and the way the data are sorted.)

summarize var1 var2

(Gives the summary statistics, including mean and s.d., of the variables specified. If you just type "summarize", the summary statistics for ALL variables will be given. Append ", detail" and more details about the specified variable will be given.)

tabulate var1

(Tabulates the values of a categorical variable. You can do a cross-tabulation by typing "tabulate var1 var2".)

correlate var1 var2

(Computes the correlation between var1 and var2. You can specify more variables, and the whole correlation matrix will be displayed.)

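correlate var1 var2, covariance

(Computes the covariance. There is no separate covariance command; the covariance option tells correlate to display covariances instead of correlations.)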

list var1 var2

(Lists var1 and var2 of each observation. If you just type "list", all variables for each observation will be listed -- basically the raw data.)
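(A runnable sketch of the describing commands above, assuming the auto data set that ships with Stata and is loaded with sysuse. The variable names make, price, mpg, weight, rep78 and foreign belong to that data set, not to the course data. Like the other comments in parentheses, this sketch is not part of the sample do-file.)

```stata
sysuse auto, clear

* Variable names, labels and sort order
describe

* Mean, s.d., min and max; then percentiles and more for one variable
summarize price mpg
summarize price, detail

* One-way and two-way tabulations of categorical variables
tabulate foreign
tabulate foreign rep78

* Correlation matrix, then covariances via the covariance option
correlate price mpg weight
correlate price mpg, covariance

* Raw data for the first five observations
list make price mpg in 1/5
```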

/* COMMANDS FOR QUALIFYING THE DATA */

/* Sometimes, you are only interested in a part of your sample meeting certain qualifications, e.g., observations after 1990 or individuals living in Massachusetts. Here are ways you can run the commands for a subsample.*/

list in 1/5

(Lists the first five observations. The "in" qualifier requires that you know the observation numbers of the observations you are interested in. Observation numbers can change when you sort and re-sort the data, so be careful when you rely on "in".)

summarize if year > 1990

(The "if" qualifier is very powerful. I would advise you to type "help if" to learn more.)

sort year

(Sorts all the observations by year.)

by year: summarize

(The sort, by combination is also very useful. If you had only two years, you could alternatively type "summarize if year==1990" and "summarize if year==1991", but the sort, by combination saves work when there are more years. You can sort by more than one variable and use by for more than one variable.)
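(A sketch of the sort, by combination on the shipped auto data set, which is an assumption of this example and not a course file. The 0/1 variable foreign plays the role that year plays above. The last line uses bysort, which newer versions of Stata provide as a one-line shorthand for sort followed by by.)

```stata
sysuse auto, clear

* Sort first, then repeat the command within each group
sort foreign
by foreign: summarize price

* The same results, one group at a time, with the if qualifier
summarize price if foreign == 0
summarize price if foreign == 1

* Shorthand that sorts and groups in one step
bysort foreign: summarize mpg
```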

/* You may put as many conditions as you want using and ("&") and or ("|"). Consider the following examples. */

keep if year > 1990 & state == "MA"

(Keeps only the post-1990 observations in which the individual lives in Massachusetts; removes the rest of the observations.)

keep if year > 1990 & (state == "MA" | state == "NY")

(Keeps only the post-1990 observations in which the individual lives in Massachusetts or in New York. This is different from the following example.)

keep if year > 1990 & state == "MA" | state == "NY"

(Keeps the observations that are either (1) post-1990 with the individual living in MA or (2) from any year with the individual living in NY, because Stata evaluates "&" before "|". This is different from the previous example.)
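(The effect of the parentheses can be verified on a tiny made-up data set. The five observations below are invented for illustration, and count is used in place of keep so that no data are removed.)

```stata
clear
input year str2 state
1989 "MA"
1992 "MA"
1989 "NY"
1992 "NY"
1992 "CA"
end

* With parentheses: post-1990 AND (MA or NY); counts 2 observations
count if year > 1990 & (state == "MA" | state == "NY")

* Without parentheses, & is evaluated before |:
* (post-1990 AND MA) OR NY; counts 3 observations
count if year > 1990 & state == "MA" | state == "NY"
```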

/* COMMANDS FOR MANIPULATING THE DATA */

generate newvar = something

(Generates a new variable called newvar which is whatever you specify. It can be a function of existing variables.)

replace var1 = something if blah == 1

(Replaces existing value of var1 with something if blah is equal to 1. Something may be an expression. Replace can also be used without the "if" qualifier. Notice there's a distinction between "=" and "==" throughout Stata.)

replace varname = 1 in 100/120

(Sets the variable varname to 1 for observations 100 to 120.)

drop var1

(Removes var1 from the data set in memory.)

label variable newvar "description of newvar"

(Gives newvar a label so when you use the describe command, you will be reminded of what newvar is.)
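(A sketch of generate, replace and label variable on the shipped auto data set, which is assumed here and is not a course file. The new variable names weight_tons and heavy are made up. Note the single "=" for assignment and the double "==" or other comparison operators inside the if qualifier.)

```stata
sysuse auto, clear

* New variable as a function of an existing one
generate weight_tons = weight / 2000

* Start an indicator at 0, then switch it on where the condition holds
generate byte heavy = 0
replace heavy = 1 if weight > 3000

* Attach a description that describe will display
label variable heavy "1 if weight exceeds 3,000 lbs"
describe heavy weight_tons
```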

/* COMMANDS FOR GRAPHING THE DATA */

graph y x, saving(filename, replace)

(Stata saves the graph to a file called filename.gph. There are many options for making your graph look pretty, including axis labels and such, that you can learn about by typing "help graph".)

gphpen filename.gph

(Stata now writes the graph saved in filename.gph to a postscript file named filename.ps that you can print by typing "lpr -Pprintername filename.ps" at the Athena prompt.)
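(The graph syntax above is that of the version of Stata in use when this handout was written. In later versions the graphics commands were redesigned and graph export took over the job of gphpen. The lines below are a sketch under that assumption, using the shipped auto data set and the made-up file name mpg_weight; check "help graph" in your own version before relying on them.)

```stata
sysuse auto, clear

* Scatterplot of mpg against weight, saved as mpg_weight.gph
scatter mpg weight, saving(mpg_weight, replace)

* Write the current graph to an Encapsulated PostScript file
graph export mpg_weight.eps, replace
```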

/* COMMANDS FOR REGRESSIONS */

regress y var1 var2

(This computes the ordinary least squares estimates. y is the dependent variable, all others are independent variables. Stata automatically includes a constant in the regression unless you type ",noconstant" after the command.)

predict yhat

(Creates a variable yhat that contains the predicted values of y based on the regression just run.)

predict ehat, resid

(Creates a variable ehat that contains the residual (equals y minus yhat) based on the regression just run.)
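(A sketch that runs a regression on the shipped auto data set, an assumption of this example, and checks that the residual equals the actual value minus the fitted value. The names price_hat, price_resid and gap are made up; double asks predict to store the results in double precision so the check is not blurred by rounding.)

```stata
sysuse auto, clear

* OLS of price on mpg and weight, constant included by default
regress price mpg weight

* Fitted values and residuals from the regression just run
predict double price_hat
predict double price_resid, residuals

* price - price_hat - price_resid should be zero up to rounding
generate double gap = price - price_hat - price_resid
summarize gap
```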

/* WRAPPING UP THE SESSION */

save data, replace

(You may or may not want to save changes to data.dta. If you've generated a lot of new variables and expect to use these variables again, perhaps you should save it. But if you only plan to use data.dta in the context of this do file, then I wouldn't bother saving it.)

log close

(You can now find your completed masterpiece, sample.log, in the current directory or the directory you specified. You can view and edit it using a text editor.)

set more 0

(This turns back on pausing after a screen's worth of information is displayed.)

***

E. A Few Tips

1. Be sure to document your programs well (e.g., write comments about what each section of the program is trying to accomplish, what question you're answering). This will help remind you what you did when you look at the program at a later point. Additionally, it will guide me, your problem set grader, through your work.

2. Stata is case sensitive. It expects commands to be in lower case.

3. Be careful with subdirectories. I would put all my 14.31 empirical files in one directory and change to that directory at the beginning of every session (or in the beginning of every do file).

4. Be careful with "=" and "==". For example, after the if qualifier, Stata expects "==" for a test of equality; "=" produces an error in this case.

5. I generally give my program and log files the same name (of course with different extensions). This way, I'll know exactly which do file is associated with which log file.

6. Stata cannot start a new log file while one is already open. This sounds obvious but it is important. Thus, when a run of your program fails partway, and you go to Emacs to fix it and run it again, you may get an error message (this is not a problem if you open your do files with "cap log close"). You may need to type "log close" to close the old log file (created under the bad program, and never closed because the program was not completely executed) and then type "do filename" to run your revised do file.

7. Prefer working on PCs? Stata exists in PC form in the Economics Department computer clusters. You might find the Windows NT version easier to use than the Unix version -- it has nice little windows and menus, and provides you with an Excel-like data editor.
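Tip 6 can be seen in action with the short sketch below. The log names tips_demo and tips_demo2 are hypothetical, and the text option is an assumption for newer versions of Stata, where it requests a plain-text .log file. The capture prefix ("cap" for short) runs a command, suppresses any error, and stores the return code in _rc, which is why "cap log close" is safe even when no log is open.

```stata
* Safe whether or not a log was left open by an earlier failed run
capture log close

log using tips_demo, replace text

* Trying to open a second log while one is open fails;
* capture traps the error, and _rc holds a nonzero return code
capture log using tips_demo2, replace text
display _rc

log close
```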
