Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012
In this lab, several Stata commands are introduced that will allow you to execute some very useful data manipulations, including collapsing data, reshaping data, merging datasets, and appending datasets. The lab also introduces some Stata date functions before giving you more practice with regression analysis and producing graphs.
COLLAPSING DATA
Suppose you have a cross-section dataset with a single line of data for each of n distinct observations. For example, you might have a dataset with 3,417 observations where each observation is a different person or a different school or a different zip code. Suppose too that there is a classification variable in the dataset that groups individual observations in some way. For example, race may equal 1 if asian, 2 if black, 3 if white, and 4 if other; schooltype may equal ‘elementary’ for an elementary school, ‘middle’ for a middle school, and ‘high’ for a high school; or urbanicity might equal 1 if a zip code is in an urban center and equal 0 if a zip code is not in an urban center.
Regression analysis would likely produce coefficient estimates for a regression model using all 3,417 observations, with the classification variable included in the model as a series of 0/1 dummy variables. Before running the regression, however, the researcher may want to calculate statistics for subgroups of observations. For example, one might want to know average income by race, average days of school by school type, or average commute times by urbanicity. Producing such simple descriptive statistics is easy in Stata with the bysort command.
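As a quick sketch of the syntax, using the hypothetical classification variables from the examples above (the outcome variable names here are invented for illustration and do not come from any dataset in this lab):

```stata
* Illustrative only -- these variables are hypothetical.
bysort race: sum income           // average income (and other stats) by race
bysort schooltype: sum days       // days of school by school type
bysort urbanicity: sum commute    // commute times by urbanicity
```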
To more fully understand what is being discussed, consider wisconsin98data.dta, which is a cross-section dataset. It contains data on each of the 373 public school districts in Wisconsin from the late 1990s. Using wisconsin98data.dta, what are the minimum and maximum 1996, 1997, and 1998 salaries paid to teachers with B.A. and M.A. degrees in the various MSA regions of Wisconsin? To answer this, first save wisconsin98data.dta to the Stata folder of your U drive. Then launch Stata and type:
use wisconsin98data.dta
desc
sum
bysort msa: sum bamin96 bamin97 bamin98
bysort msa: sum bamax96 bamax97 bamax98
bysort msa: sum mamin96 mamin97 mamin98
bysort msa: sum mamax96 mamax97 mamax98
The collapse command in Stata performs this sort of manipulation for the researcher, replacing the data in memory with a new dataset of group-level statistics. To learn more about the collapse command, use Stata's on-line help:
help collapse
The description of the collapse command that appears on screen covers not only the syntax for the collapse command but also provides a complete description of the group-level statistics that can be calculated within it.
There are two somewhat tricky issues to understand when using the collapse command. First, as soon as a collapse command has been executed, the data fundamentally change. Thus, you need to be aware of what data are in Stata's memory at all times. Many times you will need to save the original data and collapse it, and then either drop the collapsed dataset from memory and call up the original data again, or merge the collapsed data back into the original data. Second, the new dataset needs variable names. Stata has a default for this as long as only one statistic is being created. If two or more statistics are being created, however, the researcher needs to specify variable names.
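One convenient way to handle the first issue is Stata's preserve and restore commands, which snapshot the data in memory and bring them back afterward. A minimal sketch:

```stata
preserve                              // snapshot the current (district-level) data
collapse (mean) bamin96, by(county)   // memory now holds county-level data
list in 1/5                           // inspect the collapsed data
restore                               // the original district-level data return
```

Saving the original dataset to disk and merging the collapsed data back, as done later in this lab, accomplishes the same bookkeeping when the collapsed statistics are needed alongside the original data.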
To emphasize both of these points along with the syntax of the collapse command, enter the following commands:
clear
use wisconsin98data
sum bamin96 bamin97 bamin98
collapse (mean) bamin96 bamin97 bamin98, by(county)
desc
sum
Notice how the sample sizes from the sum command were reduced from 373 district-level observations to 72 county-level observations. Notice too that any variable not specified in the collapse command is dropped from the collapsed dataset.
Looking above, notice that the syntax is to specify in parentheses the desired statistic, followed by the desired variables; then, following a comma, the group that will define the new observation level is identified in by( ). Above we included by(county), and so the county variable is included in the new dataset. Instead, we could have used by(countyid), in which case the countyid variable would have appeared in the new dataset.
To emphasize the second tricky issue mentioned above, clear the data in memory and call wisconsin98data.dta back into memory:
clear
use wisconsin98data
This time, we want to create a county-level dataset that includes the county average of mamin98 plus the county minimum and maximum of mamin98. To do this, we must take steps to name all of the new variables:
desc
sum
collapse (mean) mamin98avg=mamin98 (min) mamin98min=mamin98 (max) mamin98max=mamin98, by(county)
desc
sum
Make sure you understand the difference between the new variables mamin98avg, mamin98min, and mamin98max. All three pertain to the minimum salary paid in 1998 in the county to a teacher with a master's degree. The first, mamin98avg, is the average of the districts' minimum MA salaries in the county. The second, mamin98min, is the lowest of all minimum MA salaries paid in the county. The third, mamin98max, is the highest of all minimum MA salaries paid in the county.
For practice, enter the following commands into Stata. Try to anticipate what each does.
clear
use wisconsin98data
collapse (mean) enrollavg=enroll lunchavg=lunch actavg=act ppexpavg=ppexp (min) enrollmin=enroll lunchmin=lunch actmin=act ppexpmin=ppexp (max) enrollmax=enroll lunchmax=lunch actmax=act ppexpmax=ppexp, by(msa)
desc
sum
list msa enrollavg lunchavg actavg ppexpavg
list msa enrollmin lunchmin actmin ppexpmin
list msa enrollmax lunchmax actmax ppexpmax
RESHAPING DATA
The need to reshape data most often (though not exclusively) arises when some variables occur over time. For example, wisconsin98data.dta is a district-level dataset. Among other similar salary variables, it includes the minimum salary paid to a teacher with a B.A. degree in 1996, 1997, and 1998. Ignoring the other variables, therefore, there is one row of data for each unit of observation (a school district in this case) with multiple variables for the same idea over time (bamin96, bamin97, and bamin98 in this case). To see this a bit more clearly, enter the following:
clear
use wisconsin98data
list name bamin96 bamin97 bamin98 if _n<10
This form of the data (one observation per unit with multiple variables across time) is called the wide form of the data: there are "a lot" of variables (making the dataset wide) but "few" observations, as there is only one row per unit.
In contrast, we might want to condense the variables into one variable. In this case, the variable bamin would include all of the data from bamin96, bamin97, and bamin98. Of course, one cannot simply stack these values in an arbitrary way. Rather, the way to do this is to associate every unit of observation (i.e., each school district in this case) with three rows of data: one for 1996, one for 1997, and one for 1998. Thus, where there were originally 373 observations with 3 bamin variables each, the reshaped dataset will have 373 x 3 = 1,119 observations but only one bamin variable. This is called the long form of the data: there are "few" variables, but there are multiple rows of data for each observation.
To recap, consider a dataset with a district name (name), enrollment (enroll), and three bamin salary variables (bamin96, bamin97, bamin98). These data can be presented in wide or long form:
Wide Form

name                    enroll   bamin96   bamin97   bamin98
Abbotsford                 660   $24,410   $25,170   $26,130
Adams-Friendship Area    2,032   $25,296   $25,702   $26,357
Albany                     495   $22,800   $23,155   $23,565

Long Form

name                    enroll   year   bamin
Abbotsford                 660     96   $24,410
Abbotsford                 660     97   $25,170
Abbotsford                 660     98   $26,130
Adams-Friendship Area    2,032     96   $25,296
Adams-Friendship Area    2,032     97   $25,702
Adams-Friendship Area    2,032     98   $26,357
Albany                     495     96   $22,800
Albany                     495     97   $23,155
Albany                     495     98   $23,565
Notice how the long form of the data has a new variable, year, that indicates which bamin variable is being listed.
Reshaping data requires care. First, it requires that the time-varying variables (bamin96, bamin97, and bamin98 from above) include numbers in their names that indicate the year. The programmer then needs to inform Stata of those years, and to specify which variables vary over time and which do not. Once the researcher indicates all of this to Stata, a single command, reshape, converts the data from wide form to long form or from long form back to wide form. To read more about this in Stata, enter
help reshape
Now consider the following commands:
clear
use wisconsin98data
keep name enroll bamin96 bamin97 bamin98 bamax96 bamax97 bamax98 mamin96 mamin97 mamin98 mamax96 mamax97 mamax98
desc
sum
list if _n<4
Notice that the data are currently in wide form. Now enter:
reshape long bamin bamax mamin mamax, i(name) j(year)
The previous command is reshape long, which reshapes the data from wide form to long form. The command uses name to identify each school district observation; the variable inside i() must uniquely identify each observation in the wide form of the data. Finally, the researcher names the new time variable inside j(), which above is accomplished with j(year). From this command, Stata knows to look for numeric suffixes on bamin, bamax, mamin, and mamax and to store those suffixes in year. To make sure you know what has happened, enter the following commands:
desc
sum
sort name year
list name enroll bamin bamax mamin mamax if _n<50
Lastly, one can always return the data from the long form to the wide form:
reshape wide bamin bamax mamin mamax, i(name) j(year)
desc
sum
sort name
list name enroll bamin96 bamin97 bamin98 if _n<10
Notice that the command to return the data from long form to wide form is reshape wide.
MERGING DATA
Frequently researchers need to combine datasets in one of two ways: merging or appending. Two datasets are merged together if both datasets have rows of data for the same observations but the two datasets have different variables. For example, finances.dta might have finance data for each of the 32 NFL teams, such as revenue, ticket sales, and total salary. The dataset would also have an identifier variable for the 32 teams, such as teamid, which equals 1 for the Arizona Cardinals, 2 for the Atlanta Falcons, …, and 32 for the Washington Redskins. At the same time, teamstats.dta might have team statistics for each of the 32 NFL teams, such as wins, offensive rank, defensive rank, and quarterback rating. And, again, teamstats.dta must have the same identifier variable. The researcher now wants to combine, or merge, these two datasets to form one dataset that includes finance data and team statistics for each team. Of course one could do this by hand or in Excel, but that would be a huge waste of time and prone to errors. Moreover, combining two datasets like this is trivial in Stata. Don't actually type this into Stata, as neither NFL dataset is available, but the commands would be:
use finances
sort teamid
save finances, replace
clear
use teamstats
sort teamid
merge 1:1 teamid using finances
There are two things to note. First, it is good practice to sort both datasets on the merge variable and save them before merging (older versions of Stata's merge command required this), which is the reason for the first four lines of code. Second, as you can see in the merge 1:1 command, there are more complicated types of merges than this simple example. The example here is one in which one and only one observation in each dataset is matched with one and only one observation in the other dataset.
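As a preview of those more complicated merges, suppose there were also a hypothetical player-level dataset, players.dta (invented here for illustration; it is not available either), with many rows per teamid. Attaching each team's finance data to every one of its players would be a many-to-one merge:

```stata
* players.dta is hypothetical -- do not type this in.
use players
merge m:1 teamid using finances   // many player rows match one finances row
```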
Regardless of the type of merge that takes place, following any merge command, Stata generates a new variable called _merge that indicates how the data were merged together. The researcher should always tab _merge following any merging of data just to be sure that the merge occurred as one thought it would. Also, Stata will not perform another merge until _merge has been dropped from the dataset, because Stata needs to be able to create _merge again. Therefore, the second set of commands from above should actually be:
use teamstats
sort teamid
merge 1:1 teamid using finances
tab _merge
drop _merge
One common application in Stata is the combination of the collapse and merge commands. Consider again wisconsin98data.dta. Suppose one speculates that test scores depend on the maximum salary a district pays its teachers with a Masters degree. In this simple form, one would use Stata in the following way:
clear
use wisconsin98data
reg act mamax98
That was easy enough, but suppose one’s model posits that the market for teachers is not geographically broad, but rather that school districts within the same MSA compete for teachers. In this model, absolute pay does not matter, but rather relative pay within the MSA matters. To carry out this analysis:
clear
use wisconsin98data
keep name msa mamax98 act
sort msa
save WI_all, replace
collapse (mean) mamax98msa=mamax98, by(msa)
save WI_msa, replace
sort msa
merge 1:m msa using WI_all
tab _merge
drop _merge
desc
sum
sort name
list if _n<40
gen relsal=mamax98/mamax98msa
sum relsal
reg act relsal
Notice the merge 1:m command. In this case, WI_all.dta has 373 observations, one for each school district. The data in memory (after the collapse command) have just 13 observations, one for each MSA. The merge command then says to take the dataset in memory and merge it with WI_all.dta. Therefore, for every one observation in WI_msa.dta, there are many observations in WI_all.dta. The "m" in "1:m" stands for "many". It is up to the researcher to make sure the merge command is specified correctly.
As for the regression results, notice that relative maximum pay is a statistically significant predictor of ACT scores. Of course, one would want to explore the empirical relationship between salary and ACT scores much more than we have done here, but the point of the exercise is to understand how the collapse command can be used with a merge command to allow one to regress an individual-level variable (ACT scores) on a group average or on a variable that is defined off of a group average (salary relative to the MSA average).
APPENDING DATA
In other circumstances, data from two different datasets need to be appended together rather than merged. Appending datasets extends the dataset (downward) to include more rows of data, but with exactly the same variables.
For example, ATT8796.dta contains daily closing prices for ATT stock from January 1, 1987 through December 31, 1996. It is a very simple dataset. Download the dataset and enter:
clear
use ATT8796.dta
desc
sum
list if _n<15
Likewise, ATT9711.dta contains daily closing prices for ATT stock from January 1, 1997 through December 31, 2011. This dataset has exactly the same format as the previous one. Download the dataset and enter:
clear
use ATT9711.dta
desc
sum
list if _n<15
It is important to realize that ATT8796.dta and ATT9711.dta have exactly the same structure. The researcher would most likely want to combine these data into a single dataset by placing the data in ATT9711.dta beneath the data in ATT8796.dta. The following commands do this:
clear
use ATT8796.dta
append using ATT9711.dta
rename price price_A
desc
sum
sort datemdy
save ATT.dta, replace
For practice (and to allow some work with dates in Stata), append similar files for Ford and Intel share prices after saving the four datasets (Ford8796.dta, Ford9711.dta, Intel8796.dta, and Intel9711.dta) to your Stata folder:
clear
use Ford8796.dta
append using Ford9711.dta
rename price price_F
desc
sum
sort datemdy
save Ford.dta, replace
clear
use Intel8796.dta
append using Intel9711.dta
rename price price_I
desc
sum
sort datemdy
save Intel.dta, replace
USING DATES IN STATA
Stata, like most software programs, can use and manipulate dates in many different ways. Whenever using dates as variables, be sure to understand how your software works and how it keeps track of dates.
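One concrete fact about how Stata keeps track of dates: internally, a date is just an integer counting days since January 1, 1960. The following display commands use no dataset, so they are safe to type at any time:

```stata
display date("1jan1960", "DMY")       // 0 -- Stata's day zero
display date("2jan1960", "DMY")       // 1
display %td date("15mar1987", "DMY")  // shows the stored integer as 15mar1987
```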
To illustrate just one of Stata's best date features, let's first merge the three share price datasets, as each dataset has one observation for each date from January 1, 1987 through December 31, 2011.
clear
use ATT
sort datemdy
merge 1:1 datemdy using Ford
sum
tab _merge
drop _merge
sort datemdy
merge 1:1 datemdy using Intel
tab _merge
drop _merge
desc
sum
label variable datemdy "Date of Close: MDY"
label variable price_A "ATT Share Price at Close"
label variable price_F "Ford Share Price at Close"
label variable price_I "Intel Share Price at Close"
desc
sum
sort datemdy
save SharePrices.dta, replace
Now let’s see what Stata’s day, month, and year functions do:
clear
use SharePrices
desc
sum
list if _n<30
gen daynum=day(date(datemdy, "MDY", 2000))
label variable daynum "Day of Close: 1-31"
gen monthnum=month(date(datemdy, "MDY", 2000))
label variable monthnum "Month of Close: 1-12"
gen yearnum=year(date(datemdy, "MDY", 2000))
label variable yearnum "Year of Close: 1987-2011"
sort yearnum monthnum daynum
desc
order datemdy monthnum daynum yearnum price_A price_F price_I
sum
list if _n<30
save SharePriceDates.dta, replace
Notice the syntax of the day, month, and year functions. Each is wrapped around the date() function: day(date( ___ )), month(date( ___ )), and year(date( ___ )). The date() function takes three elements, separated by commas. The first element is the variable that holds the date; in the above case, the date variable is always datemdy. The second element tells Stata the format of the date variable; in the above cases, we enter "MDY" because datemdy is recorded in month-day-year format. Finally, the third element gives the largest year that a two-digit year could represent, which lets Stata think clearly about the difference between March 1, 1904 and March 1, 2004, both of which could be written as 3/1/04.
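A quick sketch of how the third element resolves two-digit years (again safe to type directly, as no dataset is involved):

```stata
display year(date("3/1/04", "MDY", 2000))   // 1904 -- two-digit years assumed to end by 2000
display year(date("3/1/04", "MDY", 2050))   // 2004 -- two-digit years assumed to end by 2050
```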
REGRESSION ANALYSIS AND SCATTER PLOTS
This lab ends with another analysis of the 1998 Wisconsin school data.
clear
use wisconsin98data.dta
First, regress ACT score on the minimum district salary paid in 1998:
regress act bamin98
Notice that the estimated coefficient is extremely small at 0.0000733 because it gives the expected effect of a one dollar increase in salary. To avoid such small estimated coefficients, it is the researcher's burden to scale the variables so that the estimated coefficients are easier to read (regardless of whether they are large or small, statistically significant or not). In this case, instead of measuring salaries in dollars, the researcher should measure salaries in thousands of dollars.
gen bamin98z = bamin98/1000
regress act bamin98z
Notice that the results are exactly the same in terms of statistical significance (p-values, t-stats) and in terms of interpretation: a one thousand dollar increase in salary has the same expected effect in both sets of results!
The final model would likely include a variety of variables, but for now, let’s investigate the relationships between salaries and ACT scores.
gen mamax98z=mamax98/1000
regress act bamin98z
regress act mamax98z
regress act bamin98z mamax98z
regress act enroll lunch bamin98z
Notice again that the estimated coefficient on enrollment is extremely small: if enrollment increases by 1 student, district-average ACT scores are expected to increase by 0.000033 points. To rectify this, rescale enrollment:
gen enrollz = enroll/1000
regress act enrollz lunch bamin98z
The interpretation now is that if enrollment increases by 1,000 students, then district-average ACT scores are expected to increase by 0.033 points.
The results in the above regression are interesting. If one wants to “convince” people of a relationship, however, it is useful to do so graphically. If the data don’t show the behavior with basic tools (such as graphs), people may remain skeptical regardless of the regression output.
scatter act enroll
The above scatter plot, however, is not very useful. Apparently there are some, though not many, very large school districts, and these school districts are preventing the scatter plot from being useful.
sum enroll, detail
scatter act enroll if enroll<40000
scatter act enroll if enroll<10000
scatter act enroll if enroll<5000
Of these three, notice that the last condition (enroll < 5000) seems to be the best at showing the basic relationship to the reader. So let’s limit our analysis to this group and provide a scatter plot with the regression line.
keep if enroll<5000
scatter act enroll || lfit act enroll
Notice that the single variable regression line can be put on any scatter plot by including || lfit yvariable xvariable after the scatter plot command.
Lastly, let’s make the scatter plot even more informative.
scatter act enroll || lfit act enroll, legend(off) t1title("ACT Scores vs. Enrollment") xtitle("Enrollment") ytitle("District's 1998-99 Average ACT Score") ylabel(18 20 22 24 26, grid)