Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012
Residual Analysis
The data for the first part of this lab are in an Excel file: residuals.xls. The data set contains four variables: x, e2, e20, and e50. The variable x can take on values between -15 and 34, with a mean of 10. The variable e2 contains randomly drawn observations from a normal distribution with mean 0 and standard deviation of 2. Likewise, e20 and e50 contain randomly drawn observations from normal distributions with mean 0 and standard deviations of 20 and 50, respectively.
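Assuming residuals.xls sits in Stata's current working directory, one way to load and inspect it is the following sketch (import excel requires Stata 12 or later; on older versions, export the sheet to CSV and use insheet instead):

```stata
* Load the Excel data; firstrow treats the first row as variable names
import excel using residuals.xls, firstrow clear

* Take a first look at the four variables
describe x e2 e20 e50
summarize x e2 e20 e50
```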
Of the three error variables e2, e20, and e50, which has a sample mean closest to its true mean? Why would you expect this?
Answer. All three variables were drawn from a distribution with a mean of zero, but the mean of e2 is closer to zero than is the mean of e20 or e50. This result is expected as e2 has the smallest standard deviation, but it wouldn't have had to turn out this way. In fact, notice that the mean of e2 (with a standard deviation of 2) is extremely close to the mean of e20 (which has a standard deviation of 20).
gen y2 = 12 + 8*x + e2
gen y20 = 12 + 8*x + e20
gen y50 = 12 + 8*x + e50
sum y2 y20 y50
Discussion. Notice that the mean of each y variable is exactly 12 plus 8 * (10) plus the mean of the corresponding error term. This precise relationship must hold because sample averages, like expected values, pass through linear expressions.
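As a quick check of this relationship for y2, the following sketch recomputes the implied mean from its pieces, using the r(mean) result that summarize saves:

```stata
* mean(y2) should equal 12 + 8*mean(x) + mean(e2) exactly
quietly summarize x
scalar mx = r(mean)
quietly summarize e2
scalar me2 = r(mean)
quietly summarize y2
display "implied mean = " 12 + 8*mx + me2
display "actual mean  = " r(mean)
```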
reg y2 x
Discussion. Notice that the estimated intercept is close to 12 as 12 is in the 95% confidence interval. Likewise, the estimated slope coefficient is close to 8 as 8 is in the 95% confidence interval. The r-squared for the regression is 0.9997, which means that the x variable explains almost all of the variance in the dependent variable, y2. Looking at the Root MSE, we see that the regression results estimate the standard deviation of the error terms (also called the standard error of the regression) to be 1.9626, which is close to the true standard deviation of e2, which is 2.
reg y20 x
Discussion. Notice that the estimated intercept is close to 12 as 12 is in the 95% confidence interval. Likewise, the estimated slope coefficient is close to 8 as 8 is in the 95% confidence interval. The r-squared for the regression is 0.9690, which means that the x variable explains over 96% of the variance in the dependent variable, y20. Looking at the Root MSE, we see that the regression results estimate the standard deviation of the error terms (also called the standard error of the regression) to be 20.246, which is close to the true standard deviation of e20, which is 20.
reg y50 x
Discussion. Notice that the estimated intercept is close to 12 as 12 is in the 95% confidence interval. Likewise, the estimated slope coefficient is close to 8 as 8 is in the 95% confidence interval. The r-squared for the regression is 0.8340, which means that the x variable explains over 83% of the variance in the dependent variable, y50. Looking at the Root MSE, we see that the regression results estimate the standard deviation of the error terms (also called the standard error of the regression) to be 48.948, which is close to the true standard deviation of e50, which is 50.
Answer. The point is that the econometrician cannot control the variance of the error terms. The three regression models are correctly specified, as the three error terms were drawn randomly from their respective distributions and the three y variables were constructed to equal 12 + 8x plus the appropriate error term. Regression analysis therefore recovered the coefficients, and fairly precisely at that. Still, the degree of precision depends on the variance of the underlying error terms. This is why model building and coefficient estimation are the goals of regression analysis. The goal is NOT to maximize the r-squared of the regression, to maximize t-stats, or to minimize p-values.
reg y2 x
predict y2hat
predict error2, resid
hist error2, bin(50)
sum y2 y2hat error2
Discussion. The hist command tells Stata to plot a histogram, while the bin(50) option tells Stata to use up to 50 bins or classes in the histogram. According to the histogram, the average residual is roughly zero, and the residuals are roughly normally distributed. The mean predicted value of y2, which is the mean of y2hat, equals 92.0784, which is also the mean of y2 itself. This is always a property of linear regression when the model includes an intercept. Put differently, the estimated coefficients from linear regression put the regression line through the mean value of the dependent variable, so the mean of y equals the mean of predicted y. It follows that the mean residual must equal zero, and indeed the mean of error2 is essentially 0.
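One way to verify the zero-mean property of the residuals directly is the following sketch (the 1e-6 tolerance is an arbitrary choice to allow for rounding):

```stata
* OLS residuals average to zero whenever the model includes an intercept
quietly summarize error2
assert abs(r(mean)) < 1e-6
```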
reg y20 x
predict y20hat
predict error20, resid
hist error20, bin(50)
sum y20 y20hat error20
Discussion. According to the histogram, the average residual is roughly zero, and the residuals are roughly normally distributed. The mean predicted value of y20, which is the mean of y20hat, equals 92.08923, which is also the mean of y20 itself. The mean of error20 is also essentially 0.
reg y50 x
predict y50hat
predict error50, resid
hist error50, bin(50)
sum y50 y50hat error50
Discussion. According to the histogram, the average residual is roughly zero, and the residuals are roughly normally distributed. The mean predicted value of y50, which is the mean of y50hat, equals 97.1069, which is also the mean of y50 itself. The mean of error50 is also essentially 0.
hist error2, bin(50) xlabel(-100 -50 0 50 100) t1title(Residuals of Y2 Regression) saving(graph_e2, replace)
hist error20, bin(50) xlabel(-100 -50 0 50 100) t1title(Residuals of Y20 Regression) saving(graph_e20, replace)
hist error50, bin(50) xlabel(-100 -50 0 50 100) t1title(Residuals of Y50 Regression) saving(graph_e50, replace)
Now graph all three together with the following command:
graph combine graph_e2.gph graph_e20.gph graph_e50.gph
scatter y50 x || lfit y50 x
You could have used some graphing options as well to spice up this graph a bit, such as adding a title, controlling the x-axis or y-axis values, etc.
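For example, a dressed-up version of the same graph might look like the following sketch (the title and axis values are illustrative choices, not requirements):

```stata
* Same scatter plus fitted line, with a title and customized axes
scatter y50 x || lfit y50 x, title(Y50 and Its Fitted Regression Line) ///
    xlabel(-15(10)35) ylabel(, angle(horizontal)) legend(off)
```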
Best Fit Lines and Rescaling Variables
The data for this problem are already in a Stata file: WI2001.dta. The data set contains information on 330 public school districts in Wisconsin for the 2001-2002 school year. In this part of the lab, we are interested in the relationship between expenditures on public education and test scores. Describe and summarize the data to learn more about the data set.
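Assuming WI2001.dta is in the current working directory, the following commands load and take a first pass at the data:

```stata
* Load the Wisconsin school district data
use WI2001.dta, clear

* List the variables and their types, then summarize them
describe
summarize
```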
centile enroll, centile(20 80)
From the centile results, we see that the 20th percentile of enrollment is 628 while the 80th percentile of enrollment is 2,852.2. So now, to create the variable size:
gen size=1*(enroll<628) + 2*(enroll>=628 & enroll<=2852.2) + 3*(enroll>2852.2)
label define sizes 1 Small 2 Middle 3 Large
label values size sizes
tab size
Creating the variable size as we did above is fine, but it requires looking at the results from the centile command in order to write the generate command. If one is programming such a variable, it would be best not to have to look at intermediate results in order to write later lines of code. Stata usually has ways of doing this. Specifically, most of Stata's commands save their results in something called r(). Don't do this now, but we could have defined size by the following set of commands:
_pctile enroll, p(20 80)
gen size = 1*(enroll < r(r1)) + 2*(enroll >= r(r1) & enroll <= r(r2)) + 3*(enroll > r(r2))
bysort size: sum enroll dropout lunch act
gen ppexp = totalexp/enroll
sum bamin bamax mamin mamax ppexp
scatter act ppexp, xlabel(5000 7500 10000 12500 15000) ylabel(16 18 20 22 24 26) t1title(ACT Scores by Per Pupil Expenditures) saving(act_v_ppexp_a, replace)
It is difficult to discern the relationship between per pupil expenditures and ACT scores from this graph. To make the relationship clearer, create another graph that includes the estimated regression line.
scatter act ppexp || lfit act ppexp, xlabel(5000 7500 10000 12500 15000) ylabel(16 18 20 22 24 26) t1title(ACT Scores by Per Pupil Expenditures) saving(act_v_ppexp_b, replace)
From this second graph, we see that there is a negative relationship between expenditures and test scores.
scatter act mamax || lfit act mamax, xlabel(30000 40000 50000 60000) ylabel(16 18 20 22 24 26) t1title(ACT Scores by Maximum Salary to MA Teacher) saving(act_v_mamax, replace)
In this graph we see a clear positive relationship between maximum salaries and test scores.
act = B0 + B1 * bamin.
Estimate this model.
reg act bamin
Discussion. The regression results suggest a positive and statistically significant (p-value = 0.048) relationship between starting B.A. salaries and ACT scores. Specifically, a $1 increase in minimum salary paid is expected to increase a district's average ACT score by 0.0000848 points, with a standard error of 0.0000427. Put differently, a $1,000 increase in minimum salary paid is expected to increase a district's average ACT score by 0.0848 points. The r-squared of the regression is 0.0119, and the estimated standard deviation of the error terms is 1.0078.
act = B0 + B1 * ppexp.
Estimate this model.
reg act ppexp
Discussion. The regression results suggest a negative but statistically insignificant (p-value = 0.116) relationship between per pupil expenditures and ACT scores. Specifically, a $1 increase in per pupil expenditures is expected to decrease a district's average ACT score by 0.0001 points, with a standard error of 0.0000636. Put differently, a $1,000 increase in per pupil expenditures is expected to decrease a district's average ACT score by 0.1 points. The r-squared of the regression is 0.0075 (meaning that per pupil expenditures explain less than 1 percent of the variance in average district ACT scores), and the estimated standard deviation of the error terms is 1.01.
sum bamin ppexp
replace bamin=bamin/1000
replace ppexp=ppexp/1000
sum bamin ppexp
act = B0 + B1 * bamin.
Estimate this model.
reg act bamin
Discussion. Now that bamin has been rescaled by 1,000, the regression results must take this into account. Specifically, the regression results suggest a positive and statistically significant (p-value = 0.048) relationship between starting B.A. salaries and ACT scores. Specifically, a $1,000 increase in minimum salary paid is expected to increase a district's average ACT score by 0.0848 points, with a standard error of 0.0427. The r-squared of the regression is 0.0119, and the estimated standard deviation of the error terms is 1.0078.
act = B0 + B1 * ppexp.
Estimate this model.
reg act ppexp
Discussion. Now that ppexp has been rescaled by 1,000, the regression results must take this into account. Specifically, the regression results suggest a negative but statistically insignificant (p-value = 0.116) relationship between per pupil expenditures and ACT scores. Specifically, a $1,000 increase in per pupil expenditures is expected to decrease a district's average ACT score by 0.1 points, with a standard error of 0.0636. The r-squared of the regression is 0.0075, and the estimated standard deviation of the error terms is 1.01.
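As a quick sketch of the rescaling relationship: undoing the rescaling and re-estimating should shrink the slope back down by exactly 1,000, while t-statistics, p-values, and the r-squared are unchanged.

```stata
quietly regress act ppexp
scalar b_rescaled = _b[ppexp]        // slope in $1,000 units

replace ppexp = ppexp*1000           // back to dollars
quietly regress act ppexp
display "ratio of slopes: " b_rescaled/_b[ppexp]   // should be 1000

replace ppexp = ppexp/1000           // restore the rescaled units
```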