Stata Lab 4: Regression Analysis

Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012


Residual Analysis

The data for the first part of this lab are in an Excel file: residuals.xls. The data set contains four variables: x, e2, e20, and e50. The variable x can take on values between -15 and 34, with a mean of 10. The variable e2 contains randomly drawn observations from a normal distribution with mean 0 and a standard deviation of 2. Likewise, e20 and e50 contain randomly drawn observations from normal distributions with mean 0 and standard deviations of 20 and 50, respectively.

  1. Copy and paste the data into Stata. Save the dataset as residuals.dta. Summarize the data. Notice the mean, standard deviation, minimum, and maximum for each variable.
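
    One way to do this, after pasting the data into Stata's Data Editor:

      * save the data in Stata format, then look at the descriptive statistics
      save residuals.dta
      summarize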

    Of the three error variables e2, e20, and e50, which has a sample mean closest to its true mean? Why would you expect this?

    Answer. All three variables were drawn from a distribution with a mean of zero, but the mean of e2 is closer to zero than is the mean of e20 or e50. This result is expected because e2 has the smallest standard deviation, although it did not have to turn out this way. In fact, notice that the mean of e2 (with a standard deviation of 2) is extremely close to the mean of e20 (which has a standard deviation of 20).

  2. The true regression model that we will investigate takes the form y = 12 + 8x + e. Generate three dependent variables: y2, y20, and y50 where each conforms to the regression model (i.e., y2 = 12 + 8x + e2, y20 = 12 + 8x + e20, y50 = 12 + 8x + e50).
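
    Using the generate command:

      generate y2 = 12 + 8*x + e2
      generate y20 = 12 + 8*x + e20
      generate y50 = 12 + 8*x + e50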

  3. Summarize the newly created y variables:
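
    A single summarize command does this:

      summarize y2 y20 y50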

    Discussion. Notice that the mean of each of the y variables is exactly 12 plus 8 * (10) plus the mean of the corresponding error term. This precise relationship must hold because averages, like expected values, pass through linear expressions: the mean of 12 + 8x + e equals 12 plus 8 times the mean of x plus the mean of e.

  4. Regress y2 on x.
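
    In Stata this is a one-line command (parts 5 and 6 below work the same way, with y20 and y50 in place of y2):

      regress y2 x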

    Discussion. Notice that the estimated intercept is close to 12, as 12 is in the 95% confidence interval. Likewise, the estimated slope coefficient is close to 8, as 8 is in the 95% confidence interval. The r-squared for the regression is 0.9997, which means that the x variable explains almost all of the variance in the dependent variable, y2. Looking at the Root MSE, we see that the regression estimates the standard deviation of the error terms (also called the standard error of the regression) to be 1.9626, which is close to the true standard deviation of e2, namely 2.

  5. Regress y20 on x.

    Discussion. Notice that the estimated intercept is close to 12, as 12 is in the 95% confidence interval. Likewise, the estimated slope coefficient is close to 8, as 8 is in the 95% confidence interval. The r-squared for the regression is 0.9690, which means that the x variable explains over 96% of the variance in the dependent variable, y20. Looking at the Root MSE, we see that the regression estimates the standard deviation of the error terms (also called the standard error of the regression) to be 20.246, which is close to the true standard deviation of e20, namely 20.

  6. Regress y50 on x.

    Discussion. Notice that the estimated intercept is close to 12, as 12 is in the 95% confidence interval. Likewise, the estimated slope coefficient is close to 8, as 8 is in the 95% confidence interval. The r-squared for the regression is 0.8340, which means that the x variable explains over 83% of the variance in the dependent variable, y50. Looking at the Root MSE, we see that the regression estimates the standard deviation of the error terms (also called the standard error of the regression) to be 48.948, which is close to the true standard deviation of e50, namely 50.

  7. Why are some of the results different between parts 4 - 6, and why should one expect these differences?

    Answer. The point is that the econometrician cannot control the variance of the error terms. All three regression models are specified correctly: the three error terms were drawn randomly from their respective distributions, and the three y variables were constructed to equal 12 + 8x plus the appropriate error term. Regression analysis therefore recovered the coefficients, and fairly precisely at that. Still, the degree of precision depends on the variance of the underlying error terms. This is why the goal of regression analysis is model building and coefficient estimation; the goal is NOT to maximize the r-squared of the regression, to maximize t-statistics, or to minimize p-values.

  8. Regress y2 on x again. Create y2hat and error2 where y2hat is the predicted value from the regression for each observation and error2 is the regression error for each observation. Then plot the residuals using Stata's histogram command, and summarize all of the variables. All three tasks are easily done in Stata with the following sequence of commands:
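
    A sketch of the command sequence, consistent with the hist command and bin(50) option described in the discussion below (after regress, predict with no option stores the fitted values, and predict with the residuals option stores the residuals):

      regress y2 x
      predict y2hat
      predict error2, residuals
      hist error2, bin(50)
      summarize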

    Discussion. The hist command tells Stata to plot a histogram, and the bin(50) option tells Stata to use 50 bins (or classes) in the histogram. According to the histogram, the average error term is roughly zero, and the errors are roughly normally distributed. The mean predicted value of y2, which is the mean of y2hat, equals 92.0784, which is also the mean of y2 itself. This is a general property of linear regression with an intercept: the estimated coefficients put the regression line through the point of sample means, so the mean of the dependent variable equals the mean of its predicted values, and the mean of the residuals equals zero. We see that this is the case here, as the mean of error2 is essentially 0.

  9. Regress y20 on x. Create y20hat and error20. Plot the residuals using Stata's histogram command, and summarize all of the variables. All three tasks are easily done in Stata with the following sequence of commands:
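
    The parallel sequence for y20:

      regress y20 x
      predict y20hat
      predict error20, residuals
      hist error20, bin(50)
      summarize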

    Discussion. According to the histogram, the average error term is roughly zero, and the errors are roughly normally distributed. The mean predicted value of y20, which is the mean of y20hat, equals 92.08923, which is also the mean of y20 itself. The mean of error20 is also essentially 0.

  10. Regress y50 on x. Create y50hat and error50. Plot the residuals using Stata's histogram command, and summarize all of the variables. All three tasks are easily done in Stata with the following sequence of commands:
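
    And for y50:

      regress y50 x
      predict y50hat
      predict error50, residuals
      hist error50, bin(50)
      summarize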

    Discussion. According to the histogram, the average error term is roughly zero, and the errors are roughly normally distributed. The mean predicted value of y50, which is the mean of y50hat, equals 97.1069, which is also the mean of y50 itself. The mean of error50 is also essentially 0.

  11. To get a better idea of the residual plots, replot the residuals so that each of the three graphs has the same x-axis range. Moreover, give each graph a title and save it. The three commands are:
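
    A sketch of the three commands; the common x-axis range of -150 to 150, the titles, and the graph names used here are illustrative choices rather than required ones:

      hist error2, bin(50) xlabel(-150(50)150) title("Residuals, sd = 2") saving(resid2, replace)
      hist error20, bin(50) xlabel(-150(50)150) title("Residuals, sd = 20") saving(resid20, replace)
      hist error50, bin(50) xlabel(-150(50)150) title("Residuals, sd = 50") saving(resid50, replace)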

    Now graph all three together with the following command:
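
    Assuming the graphs were saved under the names used in the sketch above:

      graph combine resid2.gph resid20.gph resid50.gph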

  12. Finally, it is frequently of interest to see the actual data plotted on the same graph as the predicted regression line. To do this for the y50 variable, type in Stata:
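
    A minimal version of the command, where the lfit plot overlays the least-squares line of y50 on x on top of the scatter of the actual data:

      twoway (scatter y50 x) (lfit y50 x)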

    You could have used some graphing options as well to spice up this graph a bit, such as adding a title, controlling the x-axis or y-axis values, etc.


Best Fit Lines and Rescaling Variables

The data for this problem are already in a Stata file: WI2001.dta. The data set contains information on 330 public school districts in Wisconsin for the 2001-2002 school year. In this part of the lab, we are interested in the relationship between expenditures on public education and test scores. Describe and summarize the data to learn more about the data set.
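
For example:

      use WI2001.dta, clear
      describe
      summarize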

  1. We want to define a variable (called size) that equals 1 if the district is below the 20th percentile in enrollment, equals 2 if it is between the 20th and 80th percentiles in enrollment (i.e., equal to or above the 20th percentile and equal to or below the 80th percentile), and equals 3 if it is above the 80th percentile of enrollment. To create this variable, consider the following steps:
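
    First, find the two enrollment percentiles. The enrollment variable is assumed here to be named enroll; use describe to check its actual name in WI2001.dta:

      centile enroll, centile(20 80)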

    From the centile results, we see that the 20th percentile of enrollment is 628 while the 80th percentile of enrollment is 2,852.2. So now, to create the variable size:
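
    Using those two numbers directly (again with the assumed variable name enroll):

      generate size = 2
      replace size = 1 if enroll < 628
      * the "& enroll < ." condition keeps missing enrollments out of group 3
      replace size = 3 if enroll > 2852.2 & enroll < .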

    Creating the variable size as we did above is fine, but it does require looking at the results from the centile command in order to write the generate command. If one is programming such a variable, it would be best not to have to look at the results before writing later lines of code. Stata usually has ways of doing this. Specifically, most of Stata's commands save their results in something called r(). Don't do this now, but we could have defined size with the following set of commands:
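
    A sketch of that approach, using the r(c_1) and r(c_2) results saved by centile (the variable name enroll is again an assumption):

      centile enroll, centile(20 80)
      local p20 = r(c_1)
      local p80 = r(c_2)
      generate size = 2
      replace size = 1 if enroll < `p20'
      replace size = 3 if enroll > `p80' & enroll < .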

  2. For each size of district, we want to find the average enrollment, dropout rate, percent of students receiving free lunch, and ACT score.
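
    One way to do this is with tabstat; the variable names enroll, dropout, and lunch are assumptions here and should be checked with describe:

      tabstat enroll dropout lunch act, by(size) statistics(mean)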

  3. Create a new variable called ppexp that is the district's per pupil expenditure.
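
    A sketch, assuming the data set contains a total-expenditure variable named expend along with the enrollment variable enroll (both names are assumptions):

      generate ppexp = expend / enroll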

  4. Generate the descriptive statistics (mean, standard deviation, minimum, and maximum) for the four salary variables and per pupil expenditure.
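
    For example, if the four salary variables are named bamin, bamax, mamin, and mamax (only bamin appears later in this lab; the other three names are assumptions):

      summarize bamin bamax mamin mamax ppexp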

  5. Make a graph using the scatter command that plots each district's per pupil expenditure (x-axis) against its average ACT score (y-axis). Make this graph as nice as possible -- i.e., at a minimum, control the numbers listed on each axis, give each axis a descriptive label, and give the graph a title. The following command does all of these.
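
    A sketch of such a command; the axis numbers shown here are illustrative and should be adjusted to the actual range of the data:

      scatter act ppexp, xlabel(6000(2000)16000) ylabel(18(2)26) xtitle("Per Pupil Expenditure ($)") ytitle("Average ACT Score") title("ACT Scores and Per Pupil Expenditures")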

    It is difficult to discern the relationship between per pupil expenditures and ACT scores from this graph. To make the relationship clearer, create another graph that includes the estimated regression line.
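
    One way to do this is to overlay a linear fit on the scatter:

      twoway (scatter act ppexp) (lfit act ppexp), xtitle("Per Pupil Expenditure ($)") ytitle("Average ACT Score") title("ACT Scores and Per Pupil Expenditures")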

    From this second graph, we see that there is a negative relationship between expenditures and test scores.

  6. Make another graph using the scatter command that plots each district's maximum salary paid to a teacher with a Master's degree (x-axis) against the district's average ACT score (y-axis). Make this graph as nice as possible, and include the estimated regression line.
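
    A sketch, assuming the maximum M.A. salary variable is named mamax:

      twoway (scatter act mamax) (lfit act mamax), xtitle("Maximum M.A. Salary ($)") ytitle("Average ACT Score") title("ACT Scores and Maximum M.A. Salaries")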

    In this graph we see a clear positive relationship between maximum salaries and test scores.

  7. Model #1. We are interested in estimating a model in which ACT score is regressed on the minimum BA salary:

    act = B0 + B1 * bamin.

    Estimate this model.
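
    In Stata:

      regress act bamin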

    Discussion. The regression results suggest a positive and statistically significant (p-value = 0.048) relationship between starting B.A. salaries and ACT scores. Specifically, a $1 increase in minimum salary paid is expected to increase a district's average ACT score by 0.0000848 points, with a standard error of 0.0000427. Put differently, a $1,000 increase in minimum salary paid is expected to increase a district's average ACT score by 0.0848 points. The r-squared of the regression is 0.0119, and the estimated standard deviation of the error terms is 1.0078.

  8. Model #2. We are interested in estimating a model in which ACT score is regressed on the per pupil expenditure:

    act = B0 + B1 * ppexp.

    Estimate this model.
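
    In Stata:

      regress act ppexp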

    Discussion. The regression results suggest a negative but statistically insignificant (p-value = 0.116) relationship between per pupil expenditures and ACT scores. Specifically, a $1 increase in per pupil expenditures is expected to decrease a district's average ACT score by 0.0001 points, with a standard error of 0.0000636. Put differently, a $1,000 increase in per pupil expenditures is expected to decrease a district's average ACT score by 0.1 points. The r-squared of the regression is 0.0075 (meaning that per pupil expenditure explains less than 1 percent of the variance in average district ACT scores), and the estimated standard deviation of the error terms is 1.01.

  9. Rescale bamin and ppexp so they are both measured in $1,000's of dollars.
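
    The replace command does this; Models #3 and #4 below then re-use exactly the same regress commands as Models #1 and #2:

      replace bamin = bamin / 1000
      replace ppexp = ppexp / 1000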

  10. Model #3. Again we are interested in estimating a model in which ACT score is regressed on the minimum BA salary:

    act = B0 + B1 * bamin.

    Estimate this model.

    Discussion. Now that bamin has been rescaled by 1,000, the regression results reflect the new units. They still suggest a positive and statistically significant (p-value = 0.048) relationship between starting B.A. salaries and ACT scores: a $1,000 increase in minimum salary paid is expected to increase a district's average ACT score by 0.0848 points, with a standard error of 0.0427. The r-squared of the regression is 0.0119, and the estimated standard deviation of the error terms is 1.0078.

  11. Model #4. Again we are interested in estimating a model in which ACT score is regressed on the per pupil expenditure:

    act = B0 + B1 * ppexp.

    Estimate this model.

    Discussion. Now that ppexp has been rescaled by 1,000, the regression results reflect the new units. They still suggest a negative but statistically insignificant (p-value = 0.116) relationship between per pupil expenditures and ACT scores: a $1,000 increase in per pupil expenditures is expected to decrease a district's average ACT score by 0.1 points, with a standard error of 0.0636. The r-squared of the regression is 0.0075, and the estimated standard deviation of the error terms is 1.01.