Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012
The data for this problem are in Stata format: SpecErrors.dta. The data set contains information on 952 counties in the United States.
THE LAB
reg quality abc boras costpm exitspm carmls truckmls latitude age
Look at the regression results. Which results look good to you? Which ones are concerning? For any concerning results, is your gut response to eliminate the variable from the regression or to do something else? We will handle the concerning results in parts 5 - 7.
Having concluded that the number of exits per mile (exitspm) is an irrelevant variable that does not belong in the model, re-estimate the regression without it:
reg quality abc boras costpm carmls truckmls latitude age
The next concerning results are the estimates on the number of car miles and the number of truck miles driven daily on the county's highways (carmls and truckmls). Unlike the number of exits, however, the problem is not that these variables do not belong in the model. We know that traffic wears down roads, so the more traffic, the lower the road quality should be. Thus, these variables belong in the model. Both, however, are estimated to be statistically insignificant. What do you think the problem is?
These two variables, though not perfectly related, capture many of the same things. Not only do the variables measure similar things, but both are statistically insignificant. This is a classic case of potential multi-collinearity. The first thing to look at when multi-collinearity is a concern is the set of simple correlation coefficients. Have Stata produce these for all variables in the regression, including the dependent variable (be sure to include petes in the list of variables):
corr quality abc boras petes costpm carmls truckmls latitude age
Be sure you understand how to read these results. Notice the diagonal of 1.0000's. This is because each variable has a correlation coefficient equal to 1 with itself. The off-diagonal terms provide the correlation coefficients between the row variable and the column variable; e.g., corr(latitude, costpm) = 0.0509.
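The statistic behind Stata's corr command is just the sample Pearson correlation coefficient. Here is a minimal Python sketch of that formula (toy data for illustration, not the SpecErrors.dta variables):

```python
import math

def pearson_corr(x, y):
    """Sample Pearson correlation coefficient, the statistic corr reports."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Every variable correlates perfectly with itself -- the diagonal of 1.0000's.
x = [1.0, 2.0, 4.0, 7.0]
print(pearson_corr(x, x))   # -> 1.0 (up to floating-point rounding)
```

A perfectly negative relationship gives -1; unrelated variables give a value near 0.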
Although there really is no general rule for detecting multi-collinearity, certainly the correlation coefficient of 0.9993 between car miles and truck miles indicates that these two variables are highly collinear. At this point we may decide not to change the model, or to drop one of the two highly collinear variables. To help us make the decision, re-estimate the model twice, once with only car miles and once with only truck miles. That is:
reg quality abc boras costpm carmls latitude age
reg quality abc boras costpm truckmls latitude age
Notice that the two regressions produce very similar results for car and truck miles: both coefficients are negative and highly statistically significant. The fact that the point estimates are not the same is not surprising because there are many more car miles (in all counties) than there are truck miles. Thus, we would expect the estimate on truck miles to be larger than the estimate on car miles across the two regressions. Notice too that both coefficient estimates (in their separate equations) have become statistically significant because the standard errors have been dramatically reduced (from about 3 and 12 to .12 and .44 for car and truck miles, respectively). That such a large change in the standard errors occurs when either variable is removed from the specification is a strong indication of multi-collinearity.
A second way to investigate multi-collinearity is with a Variance Inflation Factor (VIF) test. Conducting VIF tests in Stata is very easy as it is simply a post-estimation command. Re-estimate the equation with both car and truck miles in the model, and follow this with the post-estimation command for a VIF:
reg quality abc boras costpm carmls truckmls latitude age
estat vif
Recall from class that there is no well-accepted threshold for VIFs -- under 3 is no problem; most people won't worry at 5 or even 7; greater than 11 is probably a reason to worry. The results from the estat vif post-estimation command show that carmls and truckmls are a problem, as their VIFs are over 700. None of the remaining variables is a problem at all.
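Behind estat vif is the formula VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing the j-th regressor on all of the other regressors. A quick Python sketch of that formula (the 0.9986 below is an illustrative auxiliary R-squared -- roughly what a pairwise correlation of 0.9993 implies -- not Stata output):

```python
def vif(r_squared_j):
    """Variance inflation factor for regressor j, where r_squared_j is the
    R-squared from regressing x_j on all the other regressors."""
    return 1.0 / (1.0 - r_squared_j)

# An auxiliary R-squared near 0.9986 is enough to push a VIF over 700,
# consistent with the carmls/truckmls results.
print(round(vif(0.9986)))   # -> 714
print(vif(0.75))            # -> 4.0, comfortably in the "don't worry" range
```

The formula makes clear why collinearity inflates standard errors: as R_j^2 approaches 1, the denominator shrinks toward zero and the variance of the coefficient estimate blows up.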
What should be done? Nothing? Remove a variable? If so, which one? The best answer is rarely obvious. In this case, though, because the two variables measure the same thing -- the amount of traffic -- because the VIFs are so extraordinarily high, and because the coefficients are estimated much more precisely when either variable is removed, most econometricians would probably opt to remove one of the two variables. So which one?
One might omit car miles, because trucks are known to do more damage on a per-vehicle basis than cars. On the other hand, one might omit truck miles, because many, many more miles are due to cars than to trucks. (To be honest, it probably doesn't matter which variable is removed from the specification.) To keep everyone on the same page, let's remove truck miles and keep car miles in the regression.
At this point, the estimated specification is:
reg quality abc boras costpm carmls latitude age
The results from this regression look really good. All of the variables are statistically significant at the 5% level, and all but one are significant at the 1% level. Things are looking good, but ....
We have dealt with an irrelevant variable (exitspm) and highly collinear variables (carmls and truckmls), but we might still have mis-specification if we have omitted a variable or if any of our variables are correlated with the error term.
Ramsey's Regression Specification Error Test (RESET, for short) provides a test that can indicate the presence of an omitted variable.
To execute this test, do the following:
reg quality abc boras costpm carmls latitude age
predict yhat
gen yhat2 = yhat*yhat
gen yhat3 = yhat2*yhat
gen yhat4 = yhat3*yhat
reg quality abc boras costpm carmls latitude age yhat2 yhat3 yhat4
Notice that RSSM = 538,473.538 from the first (restricted) regression, as it is restricted to not include the fitted values, and that RSS = 531,058.515 from the second (unrestricted) regression. Notice too from this second regression that N - K - 1 = 942. Calculating the F-statistic using the above formula, therefore, yields F-stat = [(538,473.538 - 531,058.515) / 3] / (531,058.515 / 942) = 4.38. The p-value associated with this F-stat can be found in Excel by entering =FDIST(4.38, 3, 942) = 0.0045. Thus, the conclusion is that the null hypothesis that there are no omitted variables is rejected in favor of the hypothesis that there are omitted variables.
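To see the arithmetic laid out, the same F-statistic can be computed in a few lines of Python from the two residual sums of squares reported by the regressions:

```python
# RESET F-statistic from the restricted and unrestricted regressions.
rss_restricted = 538473.538    # without yhat2, yhat3, yhat4
rss_unrestricted = 531058.515  # with yhat2, yhat3, yhat4
q = 3                          # restrictions: the three fitted-value terms
df = 942                       # N - K - 1 in the unrestricted regression

f_stat = ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df)
print(round(f_stat, 2))   # -> 4.38
```

The p-value then comes from the F(3, 942) distribution, as in the Excel FDIST call.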
All of this can be done much more easily in Stata. Re-estimate the model (without the fitted values), and follow this with the post-estimation command ovtest, meaning "omitted variable test":
reg quality abc boras costpm carmls latitude age
ovtest
Notice that the ovtest post-estimation command produces exactly the same F-stat and p-value that we calculated above. And thus, the result again is to reject the claim that there are no omitted variables. Note: the way Stata presents its results for the ovtest command can be confusing. The results let the user determine, based on the p-value, whether the null hypothesis should be rejected or not. However, the results always state the null hypothesis in words: "Ho: model has no omitted variables". Be sure to read this as a statement of the null hypothesis, not as the test's judgment of the null hypothesis.
For this lab, we will try to address the second issue. The age variable is suspect. Older highways should be of worse quality, but the causation can run both ways -- worse highways are unlikely to get much older, as counties repair them. What we would like is a measure of age that does not reflect policy makers' decisions to repair roads. Such a variable would be a valid instrumental variable, as it is correlated with the problem variable (age) but is not correlated with the error term. Coming up with valid instruments is a big part of econometrics. For our purposes, let me suggest the percent of mileage signs that exhibit damage. Mileage signs are damaged by cars and trucks that run astray, by debris that flies off the road, and by vandals. They are also replaced when roads are repaired, but a road would never be repaired because of damaged signs. Thus, it would seem that Corr(age, signdam) > 0 but that Corr(signdam, errors) = 0. There is no way to test this second claim, but we can investigate the first:
corr age signdam
The two variables are highly correlated, with Corr(age, signdam) = 0.9407. Believing the argument given above, we are now willing to use signdam as an instrument for age. In Stata:
ivreg quality abc boras costpm carmls latitude (age = signdam)
This last set of results, the instrumental variables (IV) estimates, is the best empirical model for this question -- or at least that is what we are claiming.
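For intuition about what ivreg is doing: in the just-identified case with a single regressor, the IV slope estimate is simply Cov(z, y) / Cov(z, x), where z is the instrument. With the other exogenous regressors included, ivreg carries out the same idea via two-stage least squares. A toy Python sketch of the single-regressor case (made-up numbers, not the lab's data):

```python
def iv_slope(z, x, y):
    """Just-identified IV slope estimate: Cov(z, y) / Cov(z, x).
    z plays the role of the instrument (signdam), x the suspect
    regressor (age), and y the outcome (quality)."""
    n = len(z)
    mz, mx, my = sum(z) / n, sum(x) / n, sum(y) / n
    cov_zy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    cov_zx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    return cov_zy / cov_zx

# Toy data in which y responds to x with a slope of exactly 3:
z = [1.0, 2.0, 3.0, 4.0]
x = [2.0, 4.0, 6.0, 8.0]
y = [6.0, 12.0, 18.0, 24.0]
print(iv_slope(z, x, y))   # -> 3.0
```

Because the estimator uses only the variation in x that is predicted by z, it is unaffected by any correlation between x and the error term -- which is exactly why we instrument age with signdam.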
THE PROGRAM
#delimit;
set more off;
log using program.log, replace;
* Rob Lemke;
* Fall 2012;
use SpecErrors.dta;
* PROBLEM ONE;
desc;
sum;
* PROBLEM TWO;
tab firm;
gen abc=(firm=="ABC Construction");
gen boras=(firm=="Boras Bros");
gen petes=(firm=="Pete's Paving");
* PROBLEM THREE;
reg quality abc boras costpm exitspm carmls truckmls latitude age;
* PROBLEM FOUR;
* No Stata work;
* PROBLEM FIVE;
* No Stata work;
* PROBLEM SIX;
*Sample Correlations in Stata;
reg quality abc boras costpm carmls truckmls latitude age;
corr quality abc boras petes costpm carmls truckmls latitude age;
reg quality abc boras costpm carmls latitude age;
reg quality abc boras costpm truckmls latitude age;
* VIF test in Stata;
reg quality abc boras costpm carmls truckmls latitude age;
estat vif;
* RESET results using Stata output;
reg quality abc boras costpm carmls latitude age;
predict yhat;
gen yhat2=yhat*yhat;
gen yhat3=yhat2*yhat;
gen yhat4=yhat3*yhat;
reg quality abc boras costpm carmls latitude age yhat2 yhat3 yhat4;
* RESET results in Stata;
reg quality abc boras costpm carmls latitude age;
ovtest;
* PROBLEM SEVEN;
corr age signdam;
ivreg quality abc boras costpm carmls latitude (age=signdam);
save SpecErrors_edited, replace;
log close;