Stata Lab 3: Managing Data

Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012


In this lab, several Stata commands are introduced that will allow you to execute some very useful data manipulations, including collapsing data, reshaping data, merging datasets, and appending datasets. The lab also introduces some Stata date functions before giving you more practice with regression analysis and producing graphs.

COLLAPSING DATA

Suppose you have a cross-section dataset with a single line of data for each of n distinct observations. For example, you might have a dataset with 3,417 observations where each observation is a different person or a different school or a different zip code. Suppose too that there is a classification variable in the dataset that groups individual observations in some way. For example, race may equal 1 if Asian, 2 if Black, 3 if White, and 4 if other; schooltype may equal ‘elementary’ for an elementary school, ‘middle’ for a middle school, and ‘high’ for a high school; or urbanicity might equal 1 if a zip code is in an urban center and equal 0 if a zip code is not in an urban center.

Regression analysis would likely produce coefficient estimates for a regression model using all 3,417 observations, with the classification variable included in the model as a series of 0/1 dummy variables. Before executing the regression, however, the researcher may want to calculate statistics for subgroups of observations. For example, one might want to know the average income by race, average days of school by school type, or average commute times by urbanicity. Producing such simple descriptive statistics is easy in Stata with the bysort command:
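The general pattern is bysort groupvar: followed by a summary command. For example (the variable names income and race are purely illustrative here):

```stata
* Summary statistics for income within each race category
bysort race: summarize income
```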

To more fully understand what is being discussed, consider wisconsin98data.dta, which is a cross-section dataset. It contains data on each of the 373 public school districts in Wisconsin from the late 1990s. Using wisconsin98data.dta, what are the minimum and maximum 1996, 1997, and 1998 salaries paid to teachers with B.A. and M.A. degrees in the various MSA regions of Wisconsin? To answer this, first save wisconsin98data.dta to the Stata folder of your U drive. Then launch Stata and type:
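A sketch of the commands, assuming the MSA identifier in wisconsin98data.dta is named msa (your dataset may use a different name) and that the salary variables follow the bamin96/mamin96 pattern used later in this lab:

```stata
use "U:\Stata\wisconsin98data.dta", clear

* Summary statistics by MSA region; summarize reports the
* minimum and maximum of each variable within each group
bysort msa: summarize bamin96 bamin97 bamin98 bamax96 bamax97 bamax98
bysort msa: summarize mamin96 mamin97 mamin98 mamax96 mamax97 mamax98
```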

The bysort command is very useful for deriving by-group summary statistics such as the ones determined above. In other situations, however, one does not want to calculate statistics but rather wants to transform the data so that the unit of observation actually changes. It would be rare to want to do this for the above examples (race, schooltype, and urbanicity) as the categories are very limited. However, there are 72 counties in Wisconsin. As each county levies taxes, one might want to conduct an empirical analysis of some behavior in which counties (not school districts) are the unit of observation. To carry out such an analysis, the researcher needs to collapse the dataset with 373 school-level observations to a dataset with 72 county-level observations.

The collapse command in Stata performs this manipulation of the data for the researcher. To learn more about the collapse command, use Stata's online help:
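At the command line, enter:

```stata
help collapse
```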

The description of the collapse command that appears on screen describes not only the syntax for the collapse command, but it also provides a complete description of the group-level statistics that can be calculated within the collapse command.

There are two somewhat tricky issues to understand when using the collapse command. First, as soon as a collapse command has been executed, the data fundamentally change. Thus, you need to be aware of what data are in Stata’s memory at all times. Many times you will need to save the original data, collapse it, and then either drop the collapsed dataset from memory and call up the original data again or you might have to merge the collapsed data back to the original data. Second, the new dataset needs variable names. Stata has a default for this as long as there is only one statistic being created. If two or more statistics are being created, however, the researcher needs to specify variable names.

To emphasize both of these points along with the syntax of the collapse command, enter the following commands:
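A sketch of the commands, assuming the county identifier is named county (as discussed below) and using the bamin98 salary variable; the summarize commands before and after show how the sample changes:

```stata
use "U:\Stata\wisconsin98data.dta", clear
summarize bamin98

* Collapse the 373 district-level observations to 72 county averages
collapse (mean) bamin98, by(county)
summarize bamin98
```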

Notice how the sample size reported by the sum command was reduced from 373 school-level observations to 72 county-level observations. Notice too that any variable not specified in the collapse command is dropped from the collapsed dataset.

Looking above, notice that the syntax is to specify in parentheses the desired statistic followed by the desired variables; then, following a comma, the group that will define the new observation level is identified in by( ). Above we included by(county), and so the county variable is included in the new dataset. Instead, we could have used by(countyid), in which case the countyid variable would have appeared in the new dataset.

To emphasize the second tricky issue mentioned above, clear the data in memory and call wisconsin98data.dta back into memory:
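```stata
clear
use "U:\Stata\wisconsin98data.dta"
```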

This time, we want to create a county-level dataset that includes the county average of mamin98 plus the county minimum and maximum of mamin98. To do this, we must take steps to define all of the new variables:
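With more than one statistic, each new variable must be named explicitly using the newname=oldname syntax. A sketch, assuming the underlying district-level variable is named mamin98:

```stata
* County average, minimum, and maximum of the district-level
* minimum MA salary in 1998
collapse (mean) mamin98avg=mamin98 (min) mamin98min=mamin98 ///
         (max) mamin98max=mamin98, by(county)
```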

Make sure you understand the difference between the new variables mamin98avg, mamin98min, and mamin98max. They all pertain to the minimum salary paid in 1998 in the county to a teacher with a master's degree. The first, mamin98avg, is the average minimum MA salary paid in the county. The second, mamin98min, is the lowest of all minimum MA salaries paid in the county. The third, mamin98max, is the highest of all minimum MA salaries paid in the county.

For practice, enter the following commands into Stata. Try to anticipate what each does.
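One possible practice sequence (variable names follow the conventions used above):

```stata
clear
use "U:\Stata\wisconsin98data.dta"
collapse (mean) enroll bamin98 (max) bamin98max=bamin98, by(county)
summarize
list in 1/10
```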


RESHAPING DATA

The need to reshape data most often (though not exclusively) arises when some variables occur over time. For example, wisconsin98data.dta is a district-level dataset. It includes the minimum salary paid to a teacher with a B.A. degree in 1996, 1997, and 1998, among other similar salary variables. Ignoring the other variables, therefore, there is one row of data for each unit of observation (a school district in this case) with multiple variables for the same concept over time (bamin96, bamin97, and bamin98 in this case). To see this a bit more clearly, enter the following:
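```stata
use "U:\Stata\wisconsin98data.dta", clear
list name enroll bamin96 bamin97 bamin98 in 1/3
```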

This form of the data (one observation per unit with multiple variables across time) is called the wide form of the data as there are "a lot" of variables (making the dataset wide) but "few" observations as there is only one observation per unit.

In contrast, we might want to condense the variables into one variable. In this case, the variable bamin would include all of the data from bamin96, bamin97, and bamin98. Of course, one cannot just do this in any meaningful way. Rather, the way to do this is to associate every unit of observation (i.e., each school district in this case) with three rows of data — one for 1996, one for 1997, and one for 1998. Thus, where there were originally 373 observations with 3 bamin variables each, this reshaped dataset will have 373 x 3 = 1,119 observations but only one bamin variable. This is called the long form of the data as there are "few" variables, but there are multiple rows of data for each observation.

To recap, consider a dataset with a district name (name), enrollment (enroll), and three bamin salary variables (bamin96, bamin97, bamin98). These data can be presented in wide or long form:

Wide Form

name                     enroll    bamin96    bamin97    bamin98
Abbotsford                  660    $24,410    $25,170    $26,130
Adams-Friendship Area     2,032    $25,296    $25,702    $26,357
Albany                      495    $22,800    $23,155    $23,565

Long Form

name                     enroll    year    bamin
Abbotsford                  660      96    $24,410
Abbotsford                  660      97    $25,170
Abbotsford                  660      98    $26,130
Adams-Friendship Area     2,032      96    $25,296
Adams-Friendship Area     2,032      97    $25,702
Adams-Friendship Area     2,032      98    $26,357
Albany                      495      96    $22,800
Albany                      495      97    $23,155
Albany                      495      98    $23,565

Notice how the long form of the data has a new variable, year, that indicates which bamin variable is being listed.

Reshaping data requires care. First, it requires that the time-varying variables (bamin96, bamin97, and bamin98 from above) include numbers in the variable name that indicate the year. The programmer then needs to inform Stata of the years. The programmer also needs to specify to Stata which variables are the year variables and which are not. Once the researcher indicates all of this to Stata, a single command, reshape, can be used to move the data from wide form to long form or from long form to wide form. To read more about this in Stata, enter
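```stata
help reshape
```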

Now consider the following command:
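One possibility is a command that displays the current structure of the data, such as:

```stata
describe
```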

Notice that the data is currently in wide form. Now enter:
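```stata
reshape long bamin bamax mamin mamax, i(name) j(year)
```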

The previous command is reshape long, which reshapes the data from wide form to long form. The above command also uses name to identify each school district observation. Notice that the variable inside i() must uniquely identify each observation in the wide form of the data. Finally, the researcher indicates a variable for the different years inside j(). In the above command, this is accomplished with j(year). Stata knows from the syntax of the command to look for numeric suffixes on bamin, bamax, mamin, and mamax and to store them in year. To make sure you know what has happened, enter the following commands:
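```stata
describe
list name enroll year bamin in 1/6
```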

Lastly, one can always return the data from the long form to the wide form:
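```stata
reshape wide bamin bamax mamin mamax, i(name) j(year)
```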

Notice that the command to return the data from long form to wide form is reshape wide.

MERGING DATA

Frequently researchers need to combine datasets in one of two ways: merging or appending. Two datasets are merged together if both datasets have rows of data for the same observations but the two datasets have different variables. For example, NFLfinances.dta might have finance data for each of the 32 NFL teams, such as revenue, ticket sales, and total salary. The dataset would also have an identifier variable for the 32 teams, such as teamid, which equals 1 for the Arizona Cardinals, 2 for the Atlanta Falcons, …, and 32 for the Washington Redskins. At the same time, NFLteamstats.dta might have team statistics for each of the 32 NFL teams, such as wins, offensive rank, defensive rank, and quarterback rating. And, again, NFLteamstats.dta must have the same identifier variable. The researcher now wants to combine, or merge, these two datasets to form one dataset that includes finance and team statistics for each team. Of course one could do this by hand or in Excel, but that would be a huge waste of time that is prone to errors. Moreover, combining two datasets like this is trivial in Stata. Don't actually type this into Stata as neither NFL dataset is available, but the commands would be:
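A sketch (again, these datasets are hypothetical, so do not run this):

```stata
* Sort and re-save each dataset on the merge variable
use NFLfinances.dta, clear
sort teamid
save NFLfinances.dta, replace

use NFLteamstats.dta, clear
sort teamid
save NFLteamstats.dta, replace

* Merge the finance data into the team-statistics data
merge 1:1 teamid using NFLfinances.dta
```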

There are two things to note. First, Stata requires that the user sort both datasets on the identifier variable before the merge; this is the reason for the sort and save commands that precede the merge command. Second, as you can see in the merge 1:1 command, there are more complicated types of merges than this simple example. The example here is one in which one and only one observation in each dataset needs to be matched with one and only one observation in the other dataset.

Regardless of the type of merge that takes place, following any merge command, Stata generates a new variable called _merge that indicates how the data were merged together. The researcher should always tab _merge following any merging of data just to be sure that the merge occurred as one thought it would. Also, Stata will not perform another merge until _merge has been dropped from the dataset, because Stata needs to be able to create _merge again. Therefore, the second set of commands from above should actually be:
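That is, with _merge checked and then dropped before any subsequent (still hypothetical) merge:

```stata
tab _merge
drop _merge
merge 1:1 teamid using NFLteamstats.dta
```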

One common application in Stata is the combination of the collapse and merge commands. Consider again wisconsin98data.dta. Suppose one speculates that test scores depend on the maximum salary a district pays its teachers with a Masters degree. In this simple form, one would use Stata in the following way:
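A sketch, assuming the district-average ACT score variable is named act and the maximum M.A. salary in 1998 is named mamax98 (check your dataset's actual variable names):

```stata
use "U:\Stata\wisconsin98data.dta", clear
regress act mamax98
```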

That was easy enough, but suppose one’s model posits that the market for teachers is not geographically broad, but rather that school districts within the same MSA compete for teachers. In this model, absolute pay does not matter, but rather relative pay within the MSA matters. To carry out this analysis:
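A sketch of the steps, where the variable names act, msa, and mamax98 are assumptions and msapay is an illustrative name for the MSA-average salary:

```stata
use "U:\Stata\wisconsin98data.dta", clear
save WI_all.dta, replace

* Average maximum MA salary within each MSA (13 MSA-level observations)
collapse (mean) msapay=mamax98, by(msa)
save WI_msa.dta, replace

* Merge the 13 MSA observations back onto the 373 district observations
merge 1:m msa using WI_all.dta
tab _merge
drop _merge

* Pay relative to the MSA average, then the regression
generate relpay = mamax98 / msapay
regress act relpay
```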

Notice the merge 1:m command. In this case, WI_all.dta has 373 observations, one for each school district. The data in memory (after the collapse command) has just 13 observations, one for each MSA. The merge then says to use the dataset in memory and merge to WI_all.dta. Therefore, for every one observation in WI_msa.dta, there are many observations in WI_all.dta. The "m" in "1:m" stands for "many". It is up to the researcher to make sure the merge command is specified correctly.

As for the regression results, notice that relative maximum pay is a statistically significant predictor of ACT scores. Of course, one would want to explore the empirical relationships between salary and ACT scores much more than what we have done here, but the point of the exercise is to understand how the collapse command can be used with a merge command to allow one to regress individual variables (ACT scores) on a group average or on a variable that is defined off of a group average (salary relative to the MSA average).


APPENDING DATA

In other circumstances, data from two different datasets need to be appended together rather than merged. Appending datasets extends the dataset (downward) to include more rows of data, but with exactly the same variables.

For example, ATT8796.dta contains daily closing prices for ATT stock from January 1, 1987 through December 31, 1996. It is a very simple dataset. Download the dataset and enter:
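```stata
use "U:\Stata\ATT8796.dta", clear
describe
summarize
```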

Likewise, ATT9711.dta contains daily closing prices for ATT stock from January 1, 1997 through December 31, 2011. This dataset has the exact same format as the previous one. Download the dataset and enter:
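```stata
use "U:\Stata\ATT9711.dta", clear
describe
summarize
```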

It is important to realize that ATT8796.dta and ATT9711.dta have exactly the same structure. The researcher would most likely want to combine these data into a single dataset by placing the data in ATT9711.dta beneath the data in ATT8796.dta. The following commands do this:
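(The combined file name ATT8711.dta below is illustrative.)

```stata
use "U:\Stata\ATT8796.dta", clear
append using "U:\Stata\ATT9711.dta"
save "U:\Stata\ATT8711.dta", replace
```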

For practice (and to allow some work with dates in Stata), append similar files for Ford and Intel share prices after saving four datasets (Ford8796.dta, Ford9711.dta, Intel8796.dta, and Intel9711.dta):
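The same pattern applies to the Ford and Intel files (combined file names are again illustrative):

```stata
use "U:\Stata\Ford8796.dta", clear
append using "U:\Stata\Ford9711.dta"
save "U:\Stata\Ford8711.dta", replace

use "U:\Stata\Intel8796.dta", clear
append using "U:\Stata\Intel9711.dta"
save "U:\Stata\Intel8711.dta", replace
```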


USING DATES IN STATA

Stata (like all statistical software packages) has the ability to use and manipulate dates in many different ways. Whenever using dates as variables, be sure to understand how your software works and how it keeps track of dates.

To illustrate just one of Stata’s best date features, let’s first merge the three share price datasets, as each dataset has one observation for each date from January 1, 1987 through December 31, 2011.
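A sketch, assuming the combined files were saved as in the previous section and share the date variable datemdy:

```stata
use "U:\Stata\ATT8711.dta", clear
sort datemdy
merge 1:1 datemdy using "U:\Stata\Ford8711.dta"
drop _merge

sort datemdy
merge 1:1 datemdy using "U:\Stata\Intel8711.dta"
drop _merge
```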

Now let’s see what the day, month, and year commands in Stata do:
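(The cutoff year 2050 below is an assumed choice; its role is explained in the next paragraph.)

```stata
generate day   = day(date(datemdy, "MDY", 2050))
generate month = month(date(datemdy, "MDY", 2050))
generate year  = year(date(datemdy, "MDY", 2050))
```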

Notice the syntax of the day, month, and year functions. Each requires the date() function: day(date( ___ )), month(date( ___ )), and year(date( ___ )). For each, date( ___ ) requires three items, separated by commas. The first element is the variable that holds the date. In the above case, the date variable is always datemdy. The second element of date( ___ ) tells Stata the form of the date variable. In the above cases, we enter “MDY” because datemdy is recorded in M-D-Y format. Finally, the third element of date( ___ ) gives Stata the largest year it should infer when a date is recorded with a two-digit year. Basically this allows Stata to think clearly about the difference between March 1, 1904 and March 1, 2004, both of which could be written as 3/1/04.


REGRESSION ANALYSIS AND SCATTER PLOTS

This lab ends with another analysis of the 1998 Wisconsin school data.

First, regress ACT score on the minimum district salary paid in 1998:
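A sketch, again assuming the ACT variable is named act and the 1998 minimum B.A. salary is bamin98:

```stata
use "U:\Stata\wisconsin98data.dta", clear
regress act bamin98
```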

Notice that the estimated coefficient is extremely small at 0.0000733, because the estimated coefficient is for a one dollar increase in salary. To avoid such small estimated coefficients, it is the researcher’s burden to scale the variables so that the estimated coefficients make more sense (regardless of whether they are large or small, statistically significant or not). In this case, instead of measuring salaries in dollars, the researcher should measure salaries in thousands of dollars.
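For example (bamin98k is an illustrative name for the rescaled variable):

```stata
* Salary in thousands of dollars
generate bamin98k = bamin98 / 1000
regress act bamin98k
```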

Notice that the results are exactly the same in terms of statistical significance (p-values, t-stats) and in terms of interpretation: a one thousand dollar increase in salary has the same expected effect in both sets of results!

The final model would likely include a variety of variables, but for now, let’s investigate the relationships between salaries and ACT scores.
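One such regression, with variable names assumed as above:

```stata
regress act bamin98 enroll
```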

Notice again that the estimated coefficient on enrollment is extremely small: if enrollment increases by 1 student, district-average ACT scores are expected to increase by 0.000033 points. To rectify this, rescale enrollment:
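For example (enrollk is an illustrative name for enrollment in thousands of students):

```stata
* Enrollment in thousands of students
generate enrollk = enroll / 1000
regress act bamin98 enrollk
```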

The interpretation now is that if enrollment increases by 1,000 students, then district-average ACT scores are expected to increase by 0.033 points.

The results in the above regression are interesting. If one wants to “convince” people of a relationship, however, it is useful to do so graphically. If the data don’t show the behavior with basic tools (such as graphs), people may remain skeptical regardless of the regression output.
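A basic scatter plot of the two variables (names assumed as above):

```stata
scatter act enroll
```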

The above scatter plot, however, is not very useful. Apparently there are some, though not many, very large school districts, and these school districts are preventing the scatter plot from being useful.
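Three possible restrictions; only the enroll < 5000 cutoff is named in the text, so the other two cutoffs are illustrative:

```stata
scatter act enroll if enroll < 20000
scatter act enroll if enroll < 10000
scatter act enroll if enroll < 5000
```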

Of these three, notice that the last condition (enroll < 5000) seems to be the best in terms of showing the base relationship to the reader. So let’s limit our analysis to this group, and provide a scatter plot with the regression line.
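```stata
scatter act enroll if enroll < 5000 || lfit act enroll if enroll < 5000
```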

Notice that the single variable regression line can be put on any scatter plot by including || lfit yvariable xvariable after the scatter plot command.

Lastly, let’s make the scatter plot even more informative.
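One way to do so is to add a title, axis titles, and suppress the legend (the titles chosen here are illustrative):

```stata
scatter act enroll if enroll < 5000 || lfit act enroll if enroll < 5000, ///
    title("District ACT Scores and Enrollment") ///
    xtitle("Enrollment") ytitle("Average ACT Score") legend(off)
```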