


Washington & Lee’s

Guide to Stata

Niels-Hugo Blunch

and

Carol Hansen Karsch

Table of Contents

Stata Resources
Components of Stata:
Stata Basics:
Opening a Stata File
. use
. set memory – no longer needed
Command Syntax
= Versus ==
. set more off
Making Stata Stop
“Help” in Stata
Interactive vs. Batch Mode
Interactive Mode
Batch Mode Using Do Files
Best of Both Approach
Documenting A Stata Session – The Log - File
Importing Data into Stata
Descriptive Statistics
. describe
. codebook varname
. summarize varname, detail
. tabulate varname
. tabulate varone vartwo
. tabstat varone vartwo, etc.
. histogram varname
. graph box varname
. tsline varname
. correlate - Checking For Correlation
. scatterplot
Verifying the Data
. assert
Modifying & Creating Variables
MODIFYING VARIABLES
. destring
. tostring
. encode
. rename
. replace
. recode
CREATING NEW VARIABLES
Rules for Variable Names:
. generate
Addition/Subtraction
Division
Multiplication (Interaction)
Exponentiation
Lag
Log:
. egen
Dummy/Indicator Variables
The Easy Way
Second Method
Third Method
Dates
Modifying and Combining Files
Modifying a File
. sort
. keep & . drop
. duplicates list & duplicates drop
. move and . order: reordering the list of variables
. reshape (wide → long or long → wide)
Combining Files
. append
. merge
Time Series and Panel Data
Time Series Operators
Regression Analysis
OLS (Studenmund, pp. 35 – 40)
Restricting Stata commands to observations with certain conditions satisfied
Ensuring Consistency Among the Number of Observations in the various estimation samples
Check for Violations of the Classical Assumptions
Omitted Variable Test (Studenmund, pp. 202 - 204)
Comparing Alternative Specifications for Model (Studenmund, p. 204 - 206)
Detection of Multicollinearity – VIF Score (Studenmund, p. 259 - 261)
Detecting Serial Correlation
• Durbin-Watson d Test for First-Order Autocorrelation (Studenmund, p. 315)
• Dealing With Autocorrelation
o GLS Using Cochrane-Orcutt Method (Studenmund, p. 322)
o Correct Serial Correlation Using Newey-West Standard Errors (Studenmund, p. 324)
Check for Homoscedasticity
• Park Test for Heteroskedasticity (Studenmund, p. 356)
• White test for Heteroskedasticity (Studenmund, p. 360)
• Weighted-Least Squares (WLS) (Studenmund, p. 363)
• Heteroskedasticity-corrected (HC) standard errors (Studenmund, p. 365)
Hypothesis Testing (Studenmund p.559)
t-tests (Studenmund p.561)
f-tests
Tests for joint statistical significance of explanatory variables
Testing for equality of coefficients
Making “Nice” Tables of Regression Results
. outreg2
The “estimates store” and “estimates table” Commands:
Other Recently Used Regression Models
Models with Binary Dependent Variables
Logit with marginal effects reported
Probit with marginal effects reported
Multinomial logit
Appendix A
Do File Example
Appendix B
Importing an Excel file into Stata
Appendix C
Downloading Data from ICPSR
Using a Stata Setup File to Import ASCII Data into Stata

Stata Resources

Stata’s website is the natural starting place – the following link leads you to a wealth of info on Stata, including Tutorials, FAQs, and textbooks:



One of the most helpful links found on the Stata site is UCLA’s Stata Resources website. It includes a set of learning modules:



Components of Stata:

Stata opens several windows when it launches. A window can be opened and made active either by clicking on it with the mouse or by selecting it from the “Window” menu. Right-clicking on a window displays options for that window.

1) Results are displayed in the Results window. The results can be searched by typing Ctrl + F.

2) Commands are entered into the command line in the Command window.

3) Previously entered commands are displayed in the Review window. Click the magnifying glass to filter the commands.

4) The variables for the current dataset are displayed in the Variables window.

a) The Variables window can be used to enter variable names into the Command window (simply double click on a variable name)

b) Right clicking on a variable in the window gives options to keep or drop selected variables (Ctrl-click to select discontinuous variables.)

5) The Properties window contains two areas: one pertains to Variables and the other to the Data file as a whole. The Variables section displays attributes of the variable highlighted in the Variables window. If the area is unlocked, you can edit the attributes, e.g. rename the variable, create or edit a label, change its format, apply a value label or add notes. The Data section allows you to label the dataset and add notes. It also tells you the number of variables and the number of observations in the file and the amount of memory being allocated to the data.

The Stata toolbar provides access to commonly used features:

1) Holding the mouse pointer over an icon displays a description of the icon.

2) A window can be opened by clicking its icon on the toolbar.

Menus and dialogs provide access to commands in Stata until the time comes when you want to directly enter the command in the command window.

The working directory is displayed on the status bar in the lower left-hand corner.

Another important component of Stata is the Data Editor. It allows you to see the data that you are working on. It is available in two different modes: the edit mode and the browse mode. The edit mode allows you to manipulate the data. It can be accessed by clicking on the toolbar icon that looks like a pencil on a spreadsheet. All work done while in the Data Editor that changes the data is documented by commands sent to the Review window. These commands can be copied and pasted into a do file, ensuring that all your work is documented and reproducible. They are also captured in an open log file.

The browse mode allows you to see the data but it cannot be changed. It can be accessed by clicking on the icon in the toolbar that looks like a magnifying glass examining a spreadsheet. The browse mode is a convenient way to keep an eye on the results of your work without the danger of accidentally editing something.

Stata Basics:

In a lot of Stata documentation, commands are preceded by a “.” and that convention is followed in this guide. Do NOT type the period!!!!

Stata is case sensitive. Commands are written in lower-case. Variable and file names can be in upper and/or lower case, but consistency is necessary. Stata interprets GDP and gdp as two different variables that can co-exist in the same file.

There are three basic variable formats in Stata: string, numeric and date. Strings are alphanumeric. They may consist solely of numerals but if a variable is declared a string, mathematical operations cannot be performed on it. In Stata’s Data Editor, the default color for string variables is red. The default color for numeric data, including dates, is black. Numeric variables with value labels, where a label such as “male” or “female” is displayed but the underlying value is really a number, are blue.

Opening a Stata File

. use

The easiest way to open a Stata file is to double click on it in the directory. If you are already in the program, the menu system is the next easiest: click on the File menu, select Open and then browse to your file. Alternatively, type the command: . use filename in the command window or in a batch file. A file cannot be opened when another file is already open. Either close the first file, type . clear in the command line or use the clear option in the . use command.

. use filename, clear

. set memory – no longer needed

Stata 12 automatically takes care of allocating a sufficient amount of memory for the dataset. So, the . set memory command is no longer needed, unless you are running an older version of Stata. Problems still arise if you have more variables than Stata IC can handle. If your dataset exceeds 2,047 variables, you will need to use Stata SE, which is available in a room in the Williams School. Contact: ITS Support Supervisor for the Williams School, Lloyd Goad (goadl@wlu.edu), for more information.

Command Syntax

Stata commands have a common syntax, which is written in the following manner:

. prefix_cmd: command varlist if exp in range, options

(Note: in a lot of Stata documentation, commands are preceded by a “.” and that convention is followed in this guide. Do NOT type the period when entering a command!!!!)

Commands can be extremely simple. The command:

. list

instructs Stata to list the entire file that is currently in memory. No need to use a menu to issue that command!

Obviously, commands can be more complex and powerful. Following the standard syntax, the list command presented below instructs Stata to break the list up into groups based on the rep78 variable, include only the mpg, price and weight variables, and only list observations where the price is greater than 10,000. The option “clean” instructs the program to list the observations without a box around the variables, which makes the listing nice and compact.

. by rep78, sort: list mpg price weight if price > 10000, clean

The if qualifier is extremely useful and works with many commands. It restricts a command to the observations for which a condition is true. To specify a particular string value, enclose it in double quotes. Stata is case-sensitive and treats blanks as characters, so each of the following is evaluated as a unique string. Also note the use of the double equal sign. See the next heading for details.

if gender == "male"

if gender == "Male"

if gender == " male"

if gender == "male "
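As a minimal sketch of how strictly string comparisons are evaluated, the following do-file fragment builds a four-observation toy dataset (the values are made up for illustration) and counts the matches with and without cleaning the string first. The lower() and trim() functions are standard Stata string functions.

```stata
* Toy data: four spellings that look alike but are different strings
clear
set obs 4
generate str6 gender = "male"
replace gender = "Male"  in 2
replace gender = " male" in 3
replace gender = "male " in 4

* Exact match: only values identical to "male" are counted
count if gender == "male"

* Normalize case and strip leading/trailing blanks before comparing
count if lower(trim(gender)) == "male"
```

If a tabulation shows variants like these, cleaning the variable once with . replace is usually safer than repeating the functions in every if qualifier.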

= Versus ==

A single equal sign, =, is used in assignment expressions. Use = with the . generate and . egen commands.

A double equal sign, ==, is a logical operator that returns a 0 if an expression is false and a 1 if it is true. Use == with an if qualifier to test for equality. If you could substitute a >= (greater than or equal to) operator or a […]

[…] >= r(mean) + 3*r(sd) & calories < . | calories […]

Because Stata treats a missing numeric value as larger than any number, a condition such as group>=3 is also true when group is missing. For example, . replace group=3 if group>=3 will replace any group with a missing value with a 3. To exclude missing values from the replacement, the command must be written as:

. replace group=3 if group>=3 & group < .
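The effect of the missing-value guard can be seen on a small made-up dataset. This is only an illustrative sketch; the variable names unsafe and safe are hypothetical.

```stata
* Five observations, the last one missing
clear
set obs 5
generate group = _n
replace group = . in 5

* Without the guard, the missing value is overwritten with 3
generate unsafe = group
replace unsafe = 3 if unsafe >= 3

* With the guard, the missing value stays missing
generate safe = group
replace safe = 3 if safe >= 3 & safe < .

list group unsafe safe
```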

The . egen rowmiss() function (called rmiss() in older versions of Stata) can be used to create a new variable that stores a count of the number of missing numeric values in each observation. An extension to the egen command, rmiss2(), counts the number of missing values in both numeric and string variables. It can be downloaded by typing the command: . findit rmiss2.

The . tabmiss program creates a frequency table of the number of missing values by variable. The program works for both numeric and string variables. It can be downloaded by typing: . findit tabmiss.

See: for more information about tabmiss and rmiss2().

. recode

This is a useful command to collapse a continuous variable into categorical groups. While the . replace command can be used, it is quicker with . recode. To be safe, generate a new variable with the same values as the variable to be recoded, then work with the NEW variable, just in case a mistake is made. In the following commands, categories for the continuous variable “age” are being created:

. gen AgeGrps=age

. recode AgeGrps (min/18=1) (19/21=2) (22/29=3) (30/49=4) (50/66=5) (67/max=6)

The recode command is also used to swap the values of a categorical variable. This is often done to match the coding of similar variables so an index can be created. For example, if two questions have a value of “1” for “Very Dissatisfied” and a value of “5” for “Very Satisfied,” but one question, quest3, has the coding reversed, recode can be used to change the values in quest3 to match those in questions 1 and 2. In the example code, the value of “3” remains unchanged, so it does not need to be specified. Again, for safety’s sake, work is being done with a copy of the variable.

. gen quest3_recode=quest3

. recode quest3_recode (1=5) (2=4) (4=2) (5=1)
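Both uses of . recode can be tried on invented data. The sketch below assumes integer ages and a five-point scale; it also shows recode's generate() option, which creates the copy and recodes it in one step as an alternative to the separate . gen command.

```stata
* Made-up data: six ages and six survey answers on a 1-5 scale
clear
set obs 6
generate age = 15 + 10*_n            // 25, 35, 45, 55, 65, 75
generate quest3 = mod(_n, 5) + 1     // 2, 3, 4, 5, 1, 2

* Collapse age into groups, working on a copy
generate AgeGrps = age
recode AgeGrps (min/18=1) (19/21=2) (22/29=3) (30/49=4) (50/66=5) (67/max=6)

* Reverse the scale; generate() leaves quest3 itself untouched
recode quest3 (1=5) (2=4) (4=2) (5=1), generate(quest3_recode)

* Compare old and new values side by side
tabulate age AgeGrps
list quest3 quest3_recode
```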

CREATING NEW VARIABLES

Rules for Variable Names:

1) Variable names are case-sensitive. Total, TOTAL and total are seen as three different variables by Stata. So, it is important to be consistent.

2) Names must begin with a letter.

3) Spaces are not allowed; use an underscore instead.

4) Names can be up to 32 characters in length. Shorter names (eight characters or less) are preferred, because longer names are often truncated in result outputs.

. generate

To create a new variable, use the . generate (abbreviated as . gen) or . egen command. Spacing is not important; operators can have spaces before and/or after or none at all. Constants and variables can both be used to create new variables. For a complete list of operators, type: . help operators.

Addition/Subtraction

. gen sum = varone + vartwo

. gen sum2 = varone + 10 // constants can also be used to create variables

. gen net_pay = gross - deductions

Division

. gen mpg = mileage/gallons // mpg is equal to mileage divided by gallons

Multiplication (Interaction)

. gen total = price * quantity // total is equal to price times quantity

Exponentiation

. gen x_sq = x^2 OR . gen x_sq = x*x

. gen x_cubed = x^3 OR . gen x_cubed=x*x*x

Lag

Useful with time-series data.

. gen price_lag1 = price[_n-1] // price_lag1 equals the price in the previous observation

. gen price_lag2 = price[_n-2] // lag two years and so on….

_n is a system variable that refers to the current observation. _n-1 refers to the previous observation. For a complete list of system variables, type: . help _variables in the command window.

If the dataset is panel data, then a by prefix will be needed. This is so the first observation in one group does not use a previous group’s value when creating the lag.

. bysort country (year): gen price_lag1=price[_n-1] // (year) keeps the observations in time order within each country

Note: If the dataset has been declared to be time-series with the . tsset command or panel data with the . xtset command, then the L.varname operator can be used in a regression equation. This method works without creating new variables. See “Time Series” in “Modifying and Combining Files” for more information.
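The difference between a plain lag and a lag taken within each panel can be checked on a tiny invented panel. The variable naive_lag is hypothetical and exists only to show the problem the by prefix avoids.

```stata
* Invented panel: two countries, two years each
clear
input str9 country year price
"Mexico"    2000 10
"Mexico"    2001 12
"Venezuela" 2000 20
"Venezuela" 2001 21
end

* Plain lag: Venezuela's first year picks up Mexico's last price
sort country year
generate naive_lag = price[_n-1]

* Lag within country; the first year of each country is missing
bysort country (year): generate price_lag1 = price[_n-1]

list, sepby(country)
```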

Log:

Often used when the relationship between the dependent variable and the independent variable is not constant. The ln function returns the natural log.

. gen log_x = ln(x)

. egen

The . egen command provides extensions to the generate command and offers very powerful capabilities for creating new variables with summary statistics. The . egen command works by observation (row) or by variable (column.) In the next two commands the new variables will have values based upon other variables in the same row/observation.

. egen AvgScore=rowmean(test1 test2 test3 test4)

. egen answered=rownonmiss(question1-question25)

The following code works across observations. It generates a unique id number for each country in a panel dataset.

. sort country

. egen cntryid=group(country) // This assigns the values of 1, 2, 3, etc to the various countries.

In the next command, a summary statistic for a column of data will be added to each observation. The same value will be added, unless subsets are created using the by option.

. egen TotalPrice = total(price)
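A short sketch on invented test scores shows the row-wise and column-wise uses side by side. The variable CountryTotal is a hypothetical name; the by prefix is what produces one total per country instead of one for the whole file.

```stata
* Invented scores; the first student skipped test3
clear
input str1 country test1 test2 test3
"A"  80 90  .
"A"  70 60 50
"B" 100 90 80
end

* Row-wise: statistics across variables within an observation
egen AvgScore = rowmean(test1 test2 test3)     // missing values are ignored
egen answered = rownonmiss(test1-test3)

* Column-wise: an id per country and a total per country
egen cntryid = group(country)
bysort country: egen CountryTotal = total(test1)

list, sepby(country)
```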

Dummy/Indicator Variables

There are a number of different methods that can be used to create dummy (indicator) variables which take on a value of 0 if the condition evaluates to false, and 1 if it is true.

The easiest way ONLY works if there are no coding errors, so it is important to check first using either the . tab varname, missing command (tabulate with the missing option specified) or the . assert command.

The Easy Way

For a variable such as “educat” with five categories:

1) no education completed

2) primary completed

3) secondary completed

4) tertiary completed

5) technical/vocational

First, confirm that the variable really does have only those five categories.

. tab educat, missing

Second, use the following command to create the five dummy variables which by default will be named: educat_1, educat_2, educat_3, educat_4, educat_5; one for each of the “educat” categories.

. tab educat, gen(educat_)

Finally, . rename the dummy variables to something more descriptive. It is crucial to know EXACTLY how the “educat” categories were coded. If they are coded “0” for “No education”, “1” for “Primary completed”, “2” for “Secondary completed”, “3” for “Tertiary completed”, and “4” for “Technical/vocational education completed”, then issue the following (five) commands to rename the variables:

. rename educat_1 Noedu

. rename educat_2 Pri

. rename educat_3 Sec

. rename educat_4 Ter

. rename educat_5 Voc

However, if they were coded “0” for “No education”, “1” for “Technical/vocational education completed”, “2” for “Primary completed”, “3” for “Secondary completed”, and “4” for “Tertiary completed”, then you would issue the following commands:

. rename educat_1 Noedu

. rename educat_2 Voc

. rename educat_3 Pri

. rename educat_4 Sec

. rename educat_5 Ter

Note how the subsequent analysis would be completely thrown off by a renaming mistake. If the dummies were renamed the first way when the second way really corresponded to the coding of the underlying “educat” variable, then every category except “No education” would be mislabeled. For instance, tertiary education would be treated as technical/vocational education – not good, especially if doing a wage regression!!!
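One way to guard against a renaming mistake is to check each dummy against the underlying code right after renaming. The sketch below uses made-up data and assumes the first coding convention (0 = no education through 4 = technical/vocational); the value label name educlbl is hypothetical.

```stata
* Made-up data: codes 0 through 4, two observations each
clear
set obs 10
generate educat = mod(_n, 5)
label define educlbl 0 "No education" 1 "Primary" 2 "Secondary" 3 "Tertiary" 4 "Technical/vocational"
label values educat educlbl

* Confirm the categories, then create and rename the dummies
tabulate educat, missing
tabulate educat, generate(educat_)
rename educat_1 Noedu
rename educat_2 Pri
rename educat_3 Sec
rename educat_4 Ter
rename educat_5 Voc

* Stata stops with an error if a dummy does not match its code
assert Noedu == (educat == 0)
assert Pri   == (educat == 1)
assert Sec   == (educat == 2)
assert Ter   == (educat == 3)
assert Voc   == (educat == 4)
```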

Second Method

This way is more tedious but it is also more foolproof than the first approach. It’s foolproof because it explicitly takes into account the EXACT definition of the underlying variable from which the dummy variables were created. The dummy variables are created one at a time. Assuming that the coding of “educat” follows the first convention explained in the previous method, issue the following commands to create the dummy for “Noedu”: The dummies for “Pri” through “Voc” are created using the same approach.

. generate Noedu = .

. replace Noedu = 0 if educat >=0 & educat […]

[…]

. keep if mpg >= 25 & mpg < . // missing values are not being kept

. drop if price < 10000

To drop specific observations, use the in qualifier. This command tells Stata to drop the first five observations.

. drop in 1/5

. duplicates list & duplicates drop

Use these commands to look for and potentially drop duplicate observations. This can be done by looking for duplicates for all variables, if no variable list is specified, or by duplicates for specified variables. See . help duplicates for more information.

Another way of identifying duplicates creates a variable that sequentially numbers the duplicates.

. bysort varname (or varlist): gen dup = cond(_N==1,0,_n)

This approach uses the cond logical expression programming function. The bysort prefix (if desired) sorts the file by the specified variable (or variables). Next it generates a new variable called dup which will be equal to 0 if no duplicate observations exist for the observation or 1 for the first observation that has duplicate(s). Additional duplicates for the observation will be numbered sequentially.
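The numbering scheme can be verified on a small invented file in which id 1 is unique, id 2 appears twice and id 3 appears three times. This is a sketch only; dropping with . drop if dup > 1 is one possible way to keep a single copy of each observation.

```stata
* Invented data with duplicated rows
clear
input id visit
1 1
2 1
2 1
3 1
3 1
3 1
end

* Summary of how many copies exist
duplicates report id visit

* Number the duplicates: 0 = unique, 1, 2, 3... = copies
bysort id visit: generate dup = cond(_N==1, 0, _n)
list, sepby(id)

* Keep the unique rows and the first copy of each duplicated row
drop if dup > 1
```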

. move and . order: reordering the list of variables

The order of variables in the Variables window, and thereby in the Data Editor, can be changed. In the examples below, . move relocates varone to the position occupied by vartwo and shifts the variables below down one place. The . order command places the variable(s) specified at the top of the variable list. In recent versions of Stata, . order varone, before(vartwo) does the same job as . move.

. move varone vartwo

. order newvar oldvar

. reshape (wide → long or long → wide)

Frequently time-series data is obtained in a “wide” format. The identifying variable is given and the time-series data values (yearly, quarterly, monthly, etc.) are spread across the columns of a single row. In many cases, the data needs to be rewritten so that the data values are separated into multiple observations. In this example, country gdp values for the years 1980-2008 are separate variables along a single observation. (Aside: country is a string variable which, by default, is red.)

[pic]

To reshape the data, the identifying variable (i) and the stub name must be determined. In this case, country is the i variable, and gdp is the “stub name” (gdp1980, gdp1981, etc.) The j name is the name of the new variable. The command becomes:

. reshape long gdp, i(country) j(year)

and it issues this result:

[pic]

The number of observations went from 172 to 4988 and there are now 3 variables as opposed to the 30 variables there were when the data were “wide.” Open the Data Editor to see the results. Here’s an image of a few of the observations:

[pic]

Here’s the command that would change the data back from “long” to “wide”:

. reshape wide gdp, i(country) j(year)

and the result:

[pic]

It is possible to have multiple identifying variables, e.g. state and city, and the j variable can be a string. If Stata detects any mistakes in the data that affect its ability to execute the reshape command, an error message instructs you to type “reshape error” to get the details.
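The round trip between wide and long can be tried on a miniature version of the gdp file. The numbers below are invented placeholders, and only three years are used to keep the listing short.

```stata
* Miniature wide file: one row per country, one gdp variable per year
clear
input str7 country gdp1980 gdp1981 gdp1982
"Bolivia" 1 2 3
"Mexico"  4 5 6
end

* Wide to long: gdp is the stub, year is created from the suffixes
reshape long gdp, i(country) j(year)
list, sepby(country)

* Long back to wide
reshape wide gdp, i(country) j(year)
list
```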

Combining Files

There are three ways to combine files. Additional observations from one datafile can be added to the end of another with the . append command. Additional variables contained in one file can be added to corresponding observations in another with the . merge command. The . joinby command forms all pairwise combinations within groups. It keeps only the observations that are found in both files. All three methods use a similar approach. The datafile in memory (the one that is currently open) is referred to as the “master” file. The file that is to be joined with the “master” is known as the “using” datafile. Both files must be Stata files.

. append

Append adds observations from the using file to the end of the master file. The files are stacked vertically.

. append using AddObsFile // the master file is the one already in memory, so it is not named
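Because the master file is whatever is in memory, . append names only the using file. The following do-file sketch stacks two invented files; the temporary file is created with . tempfile so nothing is left on disk, and the local macro name AddObsFile is just an illustration.

```stata
* Build and save the file to be added (the "using" file)
clear
input id score
3 30
4 40
end
tempfile AddObsFile
save `AddObsFile'

* Build the master file in memory
clear
input id score
1 10
2 20
end

* Stack the using file underneath the master file
append using `AddObsFile'
list
```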

. merge

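Merge adds additional variables to observations. If the two datafiles are in EXACTLY the same order, a matching variable contained in both files is not needed. But usually, both files have a matching variable (or variables) that is used to associate an observation from the master file with an observation in the using file. With the older merge syntax (Stata 10 and earlier), both files had to be SORTED by the matching variable(s) before they could be merged. The . merge 1:1, 1:m, m:1 and m:m syntax shown below sorts the files itself when necessary, so sorting beforehand is harmless but not required.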

Merge creates a system variable named _merge that is added to the bottom of the list in the variable window. The _merge variable has five possible values, but in most cases, unless the using file is being used to update the master file, only the first three are of interest:

_merge==1 observation found in master file only

_merge==2 observation found in using file only

_merge==3 observation built from both master & using files. Normally, this is the desired value.

Before another merge can be executed the system generated _merge variable must be dropped or renamed. Otherwise, Stata will not allow you to do another merge because the _merge variable already exists.

There are a few different types of merge. In a one to one match (. merge 1:1) each observation in the master file has a corresponding observation in the using file. In a one to many merge (. merge 1:m) the using file has multiple observations for each unique key value in the master file. There are also many to one (m:1) and many to many (m:m) merges.

Options available with the merge command are update and replace. Update replaces missing values in the master file with values from the using file. Replace, which is used in conjunction with update, replaces missing and non-missing values in the master file with values found in the using file.

Here is some example code for a 1:m merge:

* Load, sort and save master file

use patient.dta, clear

sort id //Note, each patient has a unique id, there are no duplicates.

save patient.dta, replace // replace is required because patient.dta already exists

//

/* Load, sort and save "using" file

Note: patients may have made many visits to the doctor's office, so there might be multiple observations for each id. */

use visits, clear

sort id, stable // stable option maintains the relative order of observations

save visits, replace // replace is required because visits.dta already exists

//

* Reload master file then merge with using file by id

use patient.dta, clear

merge 1:m id using visits // joins observations in the patient file with obs in the visit file by id.

// Stata will give you a frequency table for the system generated _merge variable.

drop _merge // If the merge worked as planned, the variable could either be dropped or renamed.

save AllInfo, replace
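The same 1:m merge can be tried without any files on disk. This is a self-contained sketch with invented patient and visit records; it keeps the _merge variable so the three outcomes can be inspected.

```stata
* Invented "using" file: several visits per patient id
clear
input id visitdate
1 100
1 105
2 110
4 120
end
tempfile visits
save `visits'

* Invented master file: one row per patient
clear
input id age
1 30
2 40
3 50
end

* id 1 and 2 match, id 3 is in the master only, id 4 in the using only
merge 1:m id using `visits'
tabulate _merge
list, sepby(id)
```

Patient 3 has no visits (_merge==1) and visit id 4 has no patient record (_merge==2); whether to keep such rows is a decision for the analyst, for example with . keep if _merge==3.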

Time Series and Panel Data

A data file that contains observations on a single entity over multiple time periods is known as time series data; multiple years of gdp data for Bolivia is an example. If there are groups, such as multiple countries, for which there is data over a number of time periods, then it is known as panel data. To use Stata’s time-series functions and capabilities, the dataset must be declared to be time-series data. There are a number of items that need to be accomplished before that can be done. First, the dataset must contain a date variable and it must be in Stata date format. If the time variable is yearly then no additional work needs to be done. But if it is monthly, quarterly, daily or some other time period, see . help date. Second, the data must be sorted by the date variable. If it is panel data, the data must be sorted by date within the panel variable(s). Finally, Stata must be told that the file contains time-series or panel data. This is accomplished with the . tsset command for time-series data, or the . xtset command for panel data. Actually, the . tsset command can be used for either time series or panel data; . xtset is used exclusively for panel data.

Stata can create a time variable for a dataset that does not have one, but the data must be in the correct time sequence and there can be NO gaps in the time series. To create a year variable that begins in 1975, issue the commands:

. generate year = 1974 + _n

To generate a monthly variable beginning in July of 1975 for a panel dataset, the commands would be:

. sort country

. by country: generate month = tm(1975m7) +_n-1

. format month %tm
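A quick way to see what the monthly variable contains is to build it for a small invented panel. The sketch uses tm(), the name current Stata documentation gives to the monthly date function, and assumes three consecutive months per country with no gaps.

```stata
* Invented panel: two countries, three monthly observations each
clear
set obs 6
generate str1 country = cond(_n <= 3, "A", "B")

* Month counter restarts at July 1975 for each country
sort country
by country: generate month = tm(1975m7) + _n - 1

* Without the format, month displays as the number of months since 1960m1
list, sepby(country)
format month %tm
list, sepby(country)
```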

To declare a file with yearly data to be a time-series dataset and then to plot the data, issue these commands:

. sort year // Data must be sorted before the . tsset command can be used.

. tsset year, yearly // Declares data to be time series.

. tsline gdp // Plots the time series data. See “Descriptive Statistics.”

Stata’s response would be:

[pic]

To . xtset a panel data file, that contains gdp values for multiple countries where each country is identified by an identification number, the commands would be:

. sort cntryid year

. xtset cntryid year, yearly

. xtline gdp // Plots a time series graph for each country in the dataset.

Note, the panel variable comes before the time variable and it must be numeric! So the above commands would not work if the country variable is a string variable. First, a numeric id number for each country must be generated.

This code will generate an id number for each country and then declare that the file contains panel data.

. sort country year

. egen cntryid=group(country) // This assigns the values of 1, 2, 3, etc to the various countries.

/* or

. encode country, generate(cntryid) Use this OR the egen method-not both!!! */

. xtset cntryid year, yearly

Here are the results:

[pic]

Frequently, in time series analysis, it is desirable to “lag,” “lead,” or compute the difference between the value of a variable and adjacent observations. Since the data has been declared to be a time series or panel data, the time series operators (L., F., and D.) can be used. There are several advantages to this approach, the most important being that it is less error prone than other techniques. The operator is aware of changes in panel. So, for example, it will not lag to the previous observation if the country has switched from Mexico to Venezuela. Another positive is that this method uses temporary variables, so new variables do not have to be created.

Time Series Operators

• L.x is equivalent to x[_n-1] when the data are sorted by time and have no gaps, i.e., the value of x in the previous time period. L2.x is the value of x two time periods in the past. (The lag can be any desired amount.)

• F.x is equivalent to x[_n+1] under the same conditions, i.e., the value of x in the next time period.

• D.x is equivalent to the difference between the current value and the previous value, i.e., x minus L.x.

All these commands follow the same syntax, so just a lag example will be demonstrated. In a regression equation, specify the variable to lag or lead and the number of time periods desired.

. regress wage L.wage L2.wage

The command to clear time series settings is . tsset, clear.
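The panel-awareness of the operators can be checked on a small invented panel. The variables wage_lag and wage_diff are created here only so the values can be listed; in a regression the operators can be used directly.

```stata
* Invented panel: two countries, three years each
clear
input str9 country year wage
"Mexico"    2000 10
"Mexico"    2001 12
"Mexico"    2002 15
"Venezuela" 2000 20
"Venezuela" 2001 21
"Venezuela" 2002 25
end

* The panel variable must be numeric
egen cntryid = group(country)
xtset cntryid year, yearly

* The first year of each country is missing: no lag across countries
generate wage_lag  = L.wage
generate wage_diff = D.wage
list, sepby(country)
```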

Regression Analysis

OLS (Studenmund, pp. 35 – 40)

The Stata command to run an OLS regression where Y is the dependent variable and X1, X2 and X3 are the independent (predictor) variables is:

. regress Y X1 X2 X3

Restricting Stata commands to observations with certain conditions satisfied

You may want to calculate descriptive statistics and/or estimate a regression, for the full sample, as well as for females and males separately (micro data) or full sample, OECD-countries, Non-OECD countries, Sub-Saharan African Countries (macro data).

One thing to notice, is that when it comes to conditions like these, Stata needs two equality signs, rather than one – for the previous two examples:

. summarize Y X1 X2 X3

. summarize Y X1 X2 X3 if X3==1

. summarize Y X1 X2 X3 if X3==0

. regress Y X1 X2 X3

. regress Y X1 X2 X3 if X3==1

. regress Y X1 X2 X3 if X3==0

//

. summarize Y X1 X2 X3

. summarize Y X1 X2 X3 if oecd==1

. summarize Y X1 X2 X3 if oecd==0

. summarize Y X1 X2 X3 if ssa==1

. regress Y X1 X2 X3

. regress Y X1 X2 X3 if oecd==1

. regress Y X1 X2 X3 if oecd==0

. regress Y X1 X2 X3 if ssa==1

Note that in the regression case, it does not matter whether the variable you restrict upon is included as an explanatory variable or not. Stata will detect that it is collinear (since there will be no variation in it) and automatically exclude it from the explanatory variables.
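The same pattern can be run on the auto dataset that ships with Stata. In this sketch price plays the role of Y, mpg and weight are the explanatory variables, and the 0/1 variable foreign is the restriction; these variable choices are illustrative only.

```stata
* Example dataset installed with Stata
sysuse auto, clear

* Full sample, then foreign and domestic cars separately
summarize price mpg weight
summarize price mpg weight if foreign == 1
summarize price mpg weight if foreign == 0

regress price mpg weight
regress price mpg weight if foreign == 1
regress price mpg weight if foreign == 0
```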

Ensuring Consistency Among the Number of Observations in the various estimation samples

In descriptive analyses, the number of observations often differs across the variables. Similarly, the number of observations will likely differ for regressions using various specifications of explanatory variables. This is due to some observations having missing values for some variables but not for others, thus creating “holes” in the number of observations.

You want:

1) The number of observations for each variable within the descriptive analysis to match up.

2) The number of observations used in the different regression specifications to match up.

3) The number of observations between (1) and (2) to match up.

To ensure this, do the following:

1) Load your Stata dataset.

2) Run the model specification that gives rise to the biggest drop in sample size:

. regress Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10

3) Use . keep to retain only the observations that have non-missing values for all variables, dependent and explanatory, from the above regression:

. keep if e(sample)

4) Stata will then drop all the observations that have one or more missing (dependent and/or explanatory) variables – so that all subsequent analyses will be performed on one, consistent dataset.
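The steps above can be sketched with the auto dataset that ships with Stata, where the variable rep78 is missing for a few cars. The choice of variables is illustrative only.

```stata
* Example dataset installed with Stata: 74 cars, rep78 missing for some
sysuse auto, clear
count

* The largest specification determines the estimation sample
regress price mpg weight rep78

* Keep only the observations used in that regression
keep if e(sample)
count

* Smaller specifications and descriptive statistics now use the same sample
summarize price mpg weight rep78
regress price mpg
```

Because . keep changes the data in memory, it is safest to run this in a do file that reloads the original dataset at the top.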

Check for Violations of the Classical Assumptions

Omitted Variable Test (Studenmund, pp. 202 - 204)

using Ramsey’s Regression Specification Error Test (RESET)

Run the regression of interest, then:

. regress Y X1 X2 X3

. estat ovtest // estat => postEstimation Statistics ovtest => omitted variable test

Comparing Alternative Specifications for Model (Studenmund, p. 204 - 206)

using Akaike’s Information Criterion and Schwarz’s Bayesian Information Criterion

Run the regression of interest, then:

. regress Y X1 X2 X3

. estat ic // estat => postEstimation Statistics ic => information criterion
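To compare two specifications side by side, one option is to store each set of results and list their information criteria together. The sketch below assumes the auto dataset that ships with Stata and uses hypothetical model names (small, large); neither comes from the original text.

```stata
* Example dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear

* First specification
regress price mpg weight
estimates store small

* Second specification, estimated on the same observations
regress price mpg weight length foreign
estimates store large

* One table with N, log likelihood, AIC and BIC for both models
estimates stats small large
```

The criteria are only comparable when both models use the same observations; the specification with the lower AIC/BIC is the preferred one.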

Detection of Multicollinearity – VIF Score (Studenmund, p. 259 - 261)

Run the regression of interest, then:

. regress Y X1 X2 X3

. estat vif // estat => postEstimation Statistics vif => variance inflation factors

Note, the . estat vif command CANNOT be used after an xtreg regression. Instead, run a standard OLS regression and include all dummies.

Detecting Serial Correlation

In order to use any time-series analysis related commands in Stata, including running a Durbin-Watson test, you must first define the dataset to be a time-series dataset in order to tell Stata which variable defines the time dimension (in the example below, it is “qtrs.”) For more information about the . tsset command, see “Time Series” in the previous section (Modifying & Combining Files.)

Durbin-Watson d Test for First-Order Autocorrelation (Studenmund, p. 315)

If it has not been done already, tell Stata it is dealing with time series data, run the regression of interest then check for serial correlation:

. tsset qtrs, quarterly //qtrs is the variable that contains the time-frequency of the dataset

. regress Y X1 X2 X3

. estat dwatson //estat => postEstimation Statistics

Dealing With Autocorrelation

GLS Using Cochrane-Orcutt Method (Studenmund, p. 322)

Where Y is the dependent variable and X1, X2, and X3 are explanatory variables, issue the command:

. prais Y X1 X2 X3, corc // corc option specifies that Cochrane-Orcutt transformation be used

Correct Serial Correlation Using Newey-West Standard Errors (Studenmund, p. 324)

Again where Y is the dependent variable and X1, X2, and X3 are explanatory variables:

. newey Y X1 X2 X3, lag(#)

where “lag(#)” specifies the maximum order of autocorrelation to allow for. Typically we think first-order autocorrelation is relevant, and therefore specify “lag(1)”, but if you think second-order autocorrelation might be relevant, you could specify “lag(2)” instead.
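The detection and correction commands above can be tried on simulated data. The following is a minimal sketch, assuming a made-up quarterly series with first-order autocorrelated errors; the seed, sample size, coefficients, and the helper variables shock and u are all illustrative assumptions.

```stata
* Build a small quarterly dataset with AR(1) errors (purely illustrative)
clear
set seed 12345
set obs 80
generate qtrs = tq(1990q1) + _n - 1     // quarterly date variable
format qtrs %tq
tsset qtrs, quarterly

generate X1 = rnormal()
generate shock = rnormal()
generate u = shock
replace u = 0.7*u[_n-1] + shock if _n > 1   // errors depend on the previous quarter
generate Y = 1 + 2*X1 + u

* OLS, then the Durbin-Watson d statistic
regress Y X1
estat dwatson

* Two ways of dealing with the autocorrelation
prais Y X1, corc        // Cochrane-Orcutt GLS
newey Y X1, lag(1)      // OLS coefficients with Newey-West standard errors
```

Note that . newey leaves the OLS coefficient estimates unchanged and only adjusts the standard errors, whereas . prais re-estimates the coefficients.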

Check for Homoscedasticity

One of the classical assumptions of OLS is constant variance of the error term. After the model is run, check the residuals to see if this assumption is violated. Stata has a built-in command that compares the residuals against the fitted values.

. rvfplot, yline(0) // Plots residuals vs. fitted values. The yline option puts a reference line at 0.

Look for patterns in the graph; there shouldn’t be any if the residuals are homoskedastic. If you see a trumpet-shaped pattern, try to determine which of the predictor variable(s) is the cause. One way of doing this is the residual-versus-predictor plot, which plots the residuals against a predictor variable of your choosing.

. rvpplot X3 // Plots residuals vs. a predictor variable (X3) in the model

Another, perhaps more flexible, method of checking the residuals against predictor variables is to calculate the residuals and save the results as a variable called “Residuals” (or any other name you like). The Residuals variable can be used in subsequent tests.

. predict Residuals, resid

Next create a scatter plot to check for the possibility of “trumpet-shape” residuals. Here are the steps to plot the residuals against a suspected proportionality factor. First run the regression of interest. Then estimate the residuals and save as “Residuals”. Finally, create the scatterplot.

. regress Y X1 X2 X3

. predict Residuals, resid

. twoway scatter Residuals X3 // Residuals on the Y axis; “X3” is the suspected proportionality factor

Park Test for Heteroskedasticity (Studenmund, p. 356)

First run the regression of interest, then estimate the residuals and save as “Residuals.” Generate the dependent variable in the Park test regression, namely the log of the squared residuals. Finally, run the Park test regression & perform the relevant t-test.

. regress Y X1 X2 X3

. predict Residuals, resid

. generate ln_Residuals_sq = ln(Residuals*Residuals)
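The commands above stop before the Park test regression itself. A minimal sketch of the full sequence follows, assuming the auto dataset that ships with Stata and treating weight as the suspected proportionality factor; both choices, and the variable name ln_weight, are illustrative assumptions.

```stata
* Example dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear

* Regression of interest and its residuals
regress price mpg weight
predict Residuals, resid

* Dependent variable of the Park test: log of the squared residuals
generate ln_Residuals_sq = ln(Residuals^2)

* Log of the suspected proportionality factor (assumed here to be weight)
generate ln_weight = ln(weight)

* Park test regression; examine the t-statistic on ln_weight
regress ln_Residuals_sq ln_weight
```

A statistically significant coefficient on the logged proportionality factor is evidence of heteroskedasticity related to that variable.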

White test for Heteroskedasticity (Studenmund, p. 360)

Run the regression of interest, then:

. regress Y X1 X2 X3

. estat imtest, preserve white // imtest => information matrix test

//white specifies that White's original heteroskedasticity test be performed

Weighted-Least Squares (WLS) (Studenmund, p. 363)

At the time of writing, Stata does not contain a direct WLS procedure, but you can download the user-written “wls0” command and then use that from within Stata (see for more details and an example.)

Heteroskedasticity-corrected (HC) standard errors (Studenmund, p. 365)

Run the regression of interest, specifying the “robust” option:

. regress Y X1 X2 X3, robust

Hypothesis Testing (Studenmund p.559)

t-tests (Studenmund p.561)

The t-statistics and the p-value of the t-statistics are automatically provided – for the two-sided alternative. If you want to test one-sided, you’ll need to divide the p-value by two – AND check that the sign is in the expected direction!!
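If you would rather have Stata compute the one-sided p-value than divide by hand, the stored results from . regress can be used. This is a sketch under the assumption that the alternative hypothesis is a positive coefficient; the auto dataset and the variable weight are only stand-ins.

```stata
* Example dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear
regress price mpg weight

* t-statistic for the coefficient on weight
local t = _b[weight] / _se[weight]

* Two-sided p-value, as reported in the regression table
display "two-sided p-value: " 2*ttail(e(df_r), abs(`t'))

* One-sided p-value for H1: coefficient > 0
display "one-sided p-value: " ttail(e(df_r), `t')
```

Because ttail() returns the upper-tail probability, the one-sided value equals half the two-sided value only when the estimated sign is in the expected direction; otherwise it exceeds 0.5, which is the sign check done automatically.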

F-tests

The t-test is a partial test, i.e., it tests the statistical significance of each individual regressor/explanatory variable in turn. If you want to perform tests that involve a group of variables, you need something else; one possibility is the F-test.

There are two “flavors” to consider:

Tests for joint statistical significance of explanatory variables

1) Run the regression of interest:

. regress Y X1 X2 X3 X4 X5

2) Perform (an) F-test(s) of the group(s) of explanatory variables you are interested in, say:

. test X1 X2

and/or:

. test X1 X2 X3

Stata will give you the F-statistic and its associated p-value for a test of the null hypothesis that the group of explanatory variables is jointly insignificant (all of their coefficients are zero) against the alternative hypothesis that the group as a whole is jointly significant (recall the decision rule for statistical tests when using the p-value!)

Testing for equality of coefficients

Alternatively, we may be interested in testing whether the coefficients for a group of explanatory variables are equal. For example, whether the coefficients for a set of dummy variables for different ethnicities in a wage regression are the same.

1) Run the regression of interest:

. regress Y X1 X2 X3 X4 X5

2) Perform (an) F-test(s) on the group(s) of explanatory variables whose coefficients you want to test for equality, say:

. test X1 = X2

and/or:

. test X1 = X2 = X3

Here, the null hypothesis is that all the coefficients are equal to each other and the alternative hypothesis is that at least one of the coefficients is different from the other coefficient(s).
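Both flavors of the test can be tried on the auto dataset that ships with Stata; the dataset and variables are assumptions chosen only so that the commands run as written. The sketch also shows that . test leaves the F-statistic and p-value behind in r(F) and r(p), in case you want to use them later in a do file.

```stata
* Example dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear
regress price mpg weight length foreign

* Joint significance: H0 is that both coefficients are zero
test mpg weight
display "F = " r(F) "   p-value = " r(p)

* Equality of coefficients: H0 is that the two coefficients are equal
test mpg = weight
display "F = " r(F) "   p-value = " r(p)
```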

Making “Nice” Tables of Regression Results

The default in Stata is to report results in a “wide” format, i.e., the estimated parameters in the first column, the estimated standard errors in the second column, etc. There are two different approaches to creating “nice” tables – similar to how results are reported in journal articles, i.e., with the standard errors below the parameter estimates, along with fit measures (R2, adjusted R2, Akaike Criterion, etc.,) and the number of observations, etc., at the bottom.

The first gives you the brackets around the standard errors, as in most journal articles. The second is more comprehensive, allowing you to include t-statistics, p-values, several different fit measures (rather than merely adj-R2), etc., but you have to make the brackets around the standard errors yourself. The two methods work as follows:

. outreg2

1) Install the command by typing:

. ssc install outreg2, all replace

2) Run the first regression that you want included in the table.

. regress Y X1 X2 X3

3) Next run the . outreg2 command, replacing “filename” with the name you want to give your file. The file will be saved to your current working directory, or you can specify a different location by including the entire path.

. outreg2 using filename, word bdec(3) bracket addstat(Adj R-Squared, `e(r2_a)') replace

The word option saves your file in rich text format (rtf) which can be opened directly in MS Word. If you specify excel instead of word, the file will be saved as an xml file which can be directly opened in Excel. The replace option causes the program to overwrite any existing file with the same name.

4) Run the additional regressions that you want to include in the table, appending them to the previous results:

. regress Y X1 X2 X3 if X3 == 1

. outreg2 using filename, word bdec(3) bracket addstat(Adj R-Squared, `e(r2_a)') append

. regress Y X1 X2 X3 if X3 == 0

. outreg2 using filename, word bdec(3) bracket addstat(Adj R-Squared, `e(r2_a)') append

5) Each time outreg2 is run, Stata responds with your filename in quotes in blue text. Click on it and, if you included the word option, your file will open automatically in MS Word.

The “estimates store” and “estimates table” Commands:

1) Run the regression(s), that you want to be part of the table, saving the results after each regression:

. regress Y X1 X2 X3

. estimates store model1

. regress Y X1 X2 X3 if X3 == 1

. estimates store model2

. regress Y X1 X2 X3 if X3 == 0

. estimates store model3

2) Create the table by calling the model results that you saved previously, and add the statistics, etc., that you want to include (here, we ask for coefficients, standard errors, t-statistics, and p-values, plus R2, adjusted R2, the Akaike Information Criterion, the Bayesian Information Criterion, and the number of observations):

. estimates table model1 model2 model3, stats(r2 r2_a aic bic N) b(%7.3g) se(%6.3g) t(%6.3g) p(%4.3f)

NOTE: if there are parts of the previous table that you don’t want to include, just modify the previous Stata command accordingly. Say you are not really interested in the t-statistics or p-values but only want to include the coefficients and their standard errors (plus the fit measures and N from before):

. estimates table model1 model2 model3, stats(r2 r2_a aic bic N) b(%7.3g) se(%6.3g)

Note, if you want “stars” to indicate the level of statistical significance, you CANNOT combine this with the “se”, “t”, and/or “p” options from above. An example of getting “stars” in the fashion used in economic journals, etc, similar to the previous, is:

. estimates table model1 model2 model3, stats(r2 r2_a aic bic N) b(%7.3g) star(0.1 0.05 0.01)

3) Highlight the table in the Results window with the cursor (i.e., from the top edge of the table down to and including the bottom line of the table; do NOT highlight the legend, as that will mess up the formatting in Excel!!), right-click on it and choose “Copy Table” (NOT “Copy Text”), paste it into Excel to create the table, then copy and paste it into Word.

Other Recently Used Regression Models

Models with Binary Dependent Variables

The . probit & . logit commands are relevant whenever the dependent variable is a binary (i.e. “dummy” or (0,1) variable). Consult an econometrics textbook for details.

. logit Y X1 X2 X3

. probit Y X1 X2 X3

Since the logit and probit models are non-linear in the estimated parameters (unlike the OLS model), the estimated coefficients are not directly interpretable as marginal effects (again, unlike in the OLS model). The reason for this is that due to the non-linearity, the marginal effects of a given explanatory variable will depend on the value of ALL other explanatory variables, as well. We therefore need to evaluate the marginal effects of a given explanatory variable at some value of the other explanatory variables. Typically we’ll set all the other explanatory variables at their mean value.

To run logit and/or probit models in Stata and report the results as marginal effects:

Logit with marginal effects reported

1) Run the logit of interest:

. logit Y X1 X2 X3

2) Calculate the marginal effects, evaluated at the means of all other explanatory variables:

. mfx

Probit with marginal effects reported

1) Run the probit of interest:

. probit Y X1 X2 X3

2) Calculate the marginal effects, evaluated at the means of all other explanatory variables:

. mfx
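In Stata 11 and later, the . margins command can produce the same kind of marginal effects as . mfx. The sketch below is an illustration only, assuming the auto dataset that ships with Stata, with the binary variable foreign as the dependent variable.

```stata
* Example dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear

* Logit, then marginal effects of every regressor at the means
logit foreign mpg weight
margins, dydx(*) atmeans

* Probit, then the same marginal effects
probit foreign mpg weight
margins, dydx(*) atmeans
```

Dropping the atmeans option gives average marginal effects instead, i.e., the marginal effect is computed for each observation and then averaged.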

Multinomial logit

Relevant whenever the dependent variable is a qualitative variable with more than two outcomes that cannot be ordered/ranked (if it could be ordered/ranked, we would use an ordered probit – and if it had only two outcomes, we would use instead the (simple) probit and/or logit model, discussed previously.) Examples include transportation choice (car, bus, train, bike, etc), health provider (doctor, healer, etc).

NOTE: Consult an econometrics textbook for details.

In a do file, type:

mlogit Y X1 X2 X3 X4 X5, basecategory(#)

mfx, predict(outcome(#))

mfx, predict(outcome(#))

mfx, predict(outcome(#))

mfx, predict(outcome(#))

where Y is the dependent variable and the Xs are the explanatory variables.

The "#" after “basecategory” sets the category you want to be the reference category (all results are relative to this category). Which one you choose matters mostly for the interpretation, although it is better to have a reference group with a relatively high number of observations. This yields relatively more precise estimates of the coefficients, since the coefficients, again, are relative to the base category.

Again we have a problem with the estimated coefficients not being interpretable as marginal effects (since the multinomial logit model is non-linear in the estimated parameters – as were also the (simple) logit and probit models from before) – and again the “mfx” command calculates the marginal effects (again, here these are to be interpreted as the marginal probability, ceteris paribus, for the outcome in question – since the dependent variable is qualitative). Again, we will typically set all the other explanatory variables at their mean value (which is also the default for the “mfx” command).

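Aside: unlike the coefficients, which are only defined relative to the base category, marginal effects can be calculated for ALL outcomes, including the base category. Because the predicted probabilities of the outcomes add up to one, the marginal effects of a given variable sum to zero across the outcomes. The "#" after “outcome” refers to the particular outcome of the dependent variable for which you want to calculate the marginal effects.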

If you want to know more about the command, type . help mlogit in Stata.

Appendix A

Do File Example

* Note: actual commands are in bold face type and all 3 types of comment markers are used as examples.

/* Three good commands to include no matter what the rest of the program is doing. */

version 12.0

/* Set the version of Stata (you can see the current version number by typing “version” in Stata’s command window.) (NOTE: Sometimes user-written commands may work only under an older

version of Stata, say, version 9 – in that case, one would type “version 9.0”, instead.)

*/

capture log close // Close any open log-files.

clear // Clear memory allocated to Stata in order to start with a clean slate.

/*

The next two commands may be necessary when working with large datasets and an older version of Stata.

The amount of memory that can be assigned depends upon the memory of the computer being used and the number of other programs that are running.

*/

set memory 300m // increase memory assigned to Stata.

set matsize 200 // set the maximum matrix size, i.e., the maximum number of variables in an estimated model

/*

Open a log file to document the session. Specify the relevant path rather than the “X”s, “Y”s, and “Z”s. The replace option permits overwriting of a file with the same name.

*/

log using "C:\XXXX\YYYYY\ZZZZZ.log", replace

/*

Load the Stata file by specifying the complete path rather than the “X”s, “Y”s, and “Z”s.

There are many methods for creating a Stata datafile: copying and pasting, data entry in the Data Editor, and the . insheet command are a few examples.

*/

use "C:\XXXX\YYYYY\ZZZZZ.dta", clear

/*

Get descriptive statistics, run a regression, save residuals and plot residuals to look for heteroskedasticity.

*/

summarize // Get basic descriptive statistics for all variables. Specify specific variables, if desired.

histogram X1 // Check variable X1 for outliers.

graph box X2 // Create a box plot for variable X2 to check for outliers.

twoway scatter X1 X2 // Create scatter plot with X1 on Y axis and X2 on X axis

correlate X1 X2 // Estimate the simple (pairwise) correlation between X1 and X2

regress Y X1 X2 X3 // Estimate an OLS regression of Y on X1, X2, and X3

predict Residuals, resid // Predict residuals and save as a new variable called, “Residuals”

twoway scatter Residuals X3 /* Plot residuals with suspected proportionality factor, X3,

(to detect heteroskedasticity)

*/

/*

Wrap up work: save file with new name, close log file and exit do file.

*/

save "C:\XXXX\YYYYY\newfile.dta" // Save file with Residuals variable under new name, if desired.

capture log close // Close log file.

exit // Stop running the do file and return control to Stata; not necessary, but good practice.

Appendix B

Importing an Excel file into Stata

1. Prepare Excel file

a. Delete titles and footers

b. Assign short variable names, NO SPACES

i. Names can be as long as 32 characters, but long names will likely be truncated when displayed and that can cause identification problems.

ii. Use an underscore rather than a space, e. g., net_inc

2. Do a “Save as” to keep the original file intact.

3. Import into Stata

a. Use the import option under the File menu.

b. Data can usually be copied in Excel and pasted into the Stata Data Editor

i. Stata asks if the first row contains data or variable names

ii. a pitfall with this approach is that no documentation of how file was created is retained

c. open Data Editor to look at file; by default, string variables are red and numeric variables are black

4. Fixing problems

a. . destring command to change a variable from string to numeric

i. sometimes coding can cause numeric variables to be imported as string variables.

ii. examples include the use of commas as thousand separators or “n/a” to represent missing values.

iii. . destring varname, ignore("," "n/a") replace fixes this situation.

iv. The ignore option drops the “,” and/or changes the “n/a” to “.” and makes the variable a numeric.

b. Changing non-Stata missing values to Stata’s missing value

i. many data files use “-9” to indicate a missing value; this is NOT a missing value in Stata

ii. Stata uses “.” (and .a, .b, ... .z for more extensive tracking of missing values)

iii. . recode varlist ( -9 = .)

5. Run descriptive statistics: . count, . codebook, . describe, . summarize to check your work.
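The clean-up in step 4 can be practiced on a few typed-in observations. The sketch below is an assumption-laden illustration: the data values and the variable names net_inc and age are invented, and it uses a slightly more conservative route than the ignore("," "n/a") shortcut by blanking out “n/a” first and then ignoring only the comma.

```stata
* Type in a tiny dataset that mimics a messy Excel import (values are made up)
clear
input str8 net_inc age
"1,200" 34
"n/a"   -9
"3,450" 51
end

* Turn the "n/a" entries into empty strings, which destring reads as missing
replace net_inc = "" if net_inc == "n/a"

* Strip the thousands separators and convert the variable to numeric
destring net_inc, ignore(",") replace

* Convert the -9 code to Stata's missing value
recode age (-9 = .)

* Check the result
count
describe
summarize
```

Blanking out “n/a” explicitly avoids removing the individual characters n, / and a from other values that might contain them.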

Appendix C

Downloading Data from ICPSR

Washington & Lee is a member institution of ICPSR, so you can download files from that site. You will be asked to log in, or if you are a first time user, to register, when you try to download data.

The download process has three steps:

Step 1:

❖ If there is a Stata System File option, use that column to select the desired file(s). These are Stata *.dta files.

[pic]

❖ If there isn’t a Stata system file, select the ASCII Data File + Stata Setup Files option if one exists.

[pic]

If neither a *.dta nor a Stata setup file is available, ask Carol Karsch, the Data & Statistical Support Specialist at Leyburn Library. She can work with both SAS and SPSS and transform the data to Stata.

Step 2: Login or register, if asked. (It’s free. The University Library maintains a subscription.)

Step 3: Select “Save File” option

You’ll be sent a “zipped” folder that needs to be decompressed before you can work with it. In addition to the data and setup files, there is usually a “Codebook”, a “descriptioncitation” file, which gives a brief synopsis of the study’s methodology, a manifest, which gives technical details, and a related_literature document.

The codebook should be read carefully before work with the data begins. A codebook describes the study and gives details about the data that are extremely useful. If the variables in the dataset are not labeled, the codebook is indispensable. The codebook describes the variables, their locations in the file, what the values mean, and any other information about the variables that is necessary for using them correctly. It helps you verify that the data will be useful for the purpose you intend. It may not! The data might not have the level of detail required. If the analysis is a time series, the questions important to the research may not have been asked in the years under consideration. Groups important to the analysis may be dropped because of privacy issues. A careful reading of the codebook(s) can prevent a lot of wasted effort and much frustration.

Using a Stata Setup File to Import ASCII Data into Stata

This ICPSR webpage gives thorough instructions for using Stata setup files:

Basically, the *.do file must be edited with a text editor, e.g., Notepad. The paths and filenames of the ASCII data file (*.txt), the Stata dictionary file (*.dct) and the output file, i.e., the Stata dataset, have to be specified with COMPLETE paths, including extensions. If there is an embedded space in the filename, it must be enclosed in double quotes. Here is an example for ICPSR Study # 07644.

Here is the edited version:

[pic]

The code to replace missing values is usually commented out, so it isn’t executed. To change missing values to a “.” which Stata recognizes as a missing value, delete the comment markers ( /* and */) at the beginning and end of the section.

Finally, it is good practice to type the . exit command at the bottom of the file.

[pic]

Save and close your edited version of the *.do file. Opening the file will launch Stata, run the do file, and if all goes well, the Stata dataset will be created!

[Callouts from the screenshots above: “This is the section that needs to be edited. The program will use the filenames supplied in the sections below.” and “Deleting these will change all missing values to ‘.’”]
