Stata Lab 6: Specification Errors and Multi-Collinearity

Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012


The data for this problem are in Stata format: SpecErrors.dta. The data set contains information on 952 counties in the United States.


THE LAB

  1. Summarize and describe the data. Make sure you understand each of the variables. The quality variable is measured by an independent agency. Higher numbers (max theoretical score = 300) mean that the county's highway roads are in better shape. Lower numbers (min theoretical score = 0) mean that the county's roads are in worse shape.
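
    A minimal sketch of the commands for this step, assuming SpecErrors.dta is in the current working directory:

        * Load the data and inspect the variables
        use SpecErrors.dta, clear
        describe
        summarize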

  2. The only classification variable in the data set is the name of the firm that the county uses to lay its highway pavement. Tabulate the firm names. Make three dummy variables (called abc, boras, and petes) to capture the three firms' potential influence on highway road quality.
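
    A sketch of the commands, assuming the classification variable is named firm and takes the string values "ABC", "Boras", and "Petes" (check your tabulate output for the actual variable name and values):

        * Tabulate the paving firms, then create one dummy per firm
        tabulate firm
        generate abc = (firm == "ABC")
        generate boras = (firm == "Boras")
        generate petes = (firm == "Petes")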

  3. The purpose of this regression project is to estimate highway road quality as a function of the relevant explanatory variables. Without giving much thought to this question, regress quality on all of the variables in the dataset except signdam (for now, forget that signdam exists). Also, to keep everyone on the same page, include abc and boras as the two firm dummies, omitting petes. That is:
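
    A sketch of the command, assuming the non-dummy regressors are the ones referenced later in this lab (latitude, exitspm, carmls, truckmls, costpm, and age):

        * Regress quality on everything except signdam, omitting petes
        regress quality latitude exitspm carmls truckmls costpm age abc boras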

    Look at the regression results. Which results look good to you? Which ones are concerning? For any concerning results, is your gut response to eliminate the variable from the regression or to do something else? We will handle the concerning results in parts 5 - 7.

  4. What does the latitude variable measure? Do you know what the value would be for Lake County, IL? How about Miami-Dade County, FL? Does the estimated coefficient on latitude make sense to you? Do you think latitude is a good variable for the model?

  5. The first concerning result is the estimate on the number of exits per 10 miles of highway roads in the county (exitspm). This estimate is concerning for several reasons:

    1. The estimate is statistically insignificant, with a p-value of almost 0.87.

    2. The point estimate is extremely small in economic terms. Looking at the summary statistics, this variable ranges from 1 to 5.9. The quality variable, recall, can theoretically range from 0 to 300 and in fact ranges from 90 to 288 in our sample. Therefore, given the point estimate of -0.09 on exitspm, the county with the fewest exits sees its expected quality fall by 0.09 while the county with the most exits sees its expected quality fall by 0.531 (which equals 5.9 x 0.09). This is an extremely small difference given the range of the quality variable.

    3. Why would the number of exits affect highway road quality? Is it clear whether this variable's effect should be positive or negative? It seems as if this variable made its way into the equation because it was in the dataset, not because there was a theoretical reason to include it. This leads us to conclude that exitspm should be dropped from the specification.

  6. Rerun the new specification, this time without exits per 10 miles of road:
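
    For example, under the same variable-name assumptions as above:

        regress quality latitude carmls truckmls costpm age abc boras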

    The next concerning results are the estimates on the number of car miles and the number of truck miles driven daily on the county's highways (carmls and truckmls). Unlike the number of exits, however, the problem is not that these variables do not belong in the model. We know that traffic wears down roads, so the more traffic, the lower quality the roads should be. Thus, these variables belong in the model. Both, however, are estimated to be statistically insignificant. What do you think the problem is?

    These two variables, though not perfectly related, capture many of the same things. Not only do the variables measure similar things, but both are statistically insignificant. This is a classic case of potential multi-collinearity. The first thing to look at when multi-collinearity is a concern is the simple correlation coefficients. Have Stata produce these for you for all variables in the regression, including the dependent variable (and be sure to put petes in the list of variables):
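
    A sketch of the command, under the same variable-name assumptions:

        * Pairwise correlations among the regression variables, quality, and petes
        correlate quality latitude carmls truckmls costpm age abc boras petes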

    Be sure you understand how to read these results. Notice the diagonal of 1.0000's. This is because each variable has a correlation coefficient equal to 1 with itself. The off-diagonal terms provide the correlation coefficients between the row variable and the column variable; e.g., corr(latitude, costpm) = 0.0509.

    Although there really is no general rule for finding multi-collinearity, certainly the correlation coefficient of 0.9993 between car miles and truck miles indicates that these two variables are highly collinear. At this point we may decide to not change the model or to drop one of the two highly collinear variables. In order to help us make the decision, re-estimate the model, once with car miles and once with truck miles. That is:
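
    For example:

        * Once with car miles only, then once with truck miles only
        regress quality latitude carmls costpm age abc boras
        regress quality latitude truckmls costpm age abc boras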

    Notice that the two regressions deliver very similar results for car and truck miles: both coefficients are negative and highly statistically significant. The fact that the point estimates are not the same is not surprising because there are many more car miles (for all counties) than there are truck miles. Thus, we would expect the estimate on truck miles to be larger than the estimate on car miles across the two regressions. Notice too that both coefficient estimates (in their separate equations) have become statistically significant because the standard errors have been dramatically reduced (from about 3 and 12 to .12 and .44 for car and truck miles respectively). That such a large change in the standard errors occurs when either variable is removed from the specification strongly suggests multi-collinearity.

    A second way to investigate multi-collinearity is with a Variance Inflation Factor (VIF) test. Conducting VIF tests in Stata is very easy as it is simply a post-estimation command. Re-estimate the equation with both car and truck miles in the model, and follow this with the post-estimation command for a VIF:
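
    For example:

        * Re-estimate with both traffic variables, then compute VIFs
        regress quality latitude carmls truckmls costpm age abc boras
        estat vif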

    Recall from class that there is no well-accepted threshold for VIFs -- under 3 is no problem; most people won't worry at 5 or even 7; greater than 11 is probably a reason to worry. The results from the estat vif post-estimation command show that carmls and truckmls are a problem as their VIFs are over 700. None of the remaining variables are a problem at all.

    What should be done? Nothing? Remove a variable? If so, which one? The best answer is rarely obvious. In this case, though, because the two variables measure the same thing (the amount of traffic), because the VIFs are so extraordinarily high, and because the coefficients become much more precisely estimated when either variable is removed, most econometricians would probably opt to remove one of the two variables. So which one?

    One might omit car miles, because trucks are known to do more damage on a per-vehicle basis than are cars. On the other hand, one might omit truck miles, because many, many more miles are driven by cars than by trucks. (To be honest, it probably doesn't matter which variable is removed from the specification.) To keep everyone on the same page, let's remove truck miles and keep car miles in the regression.

    At this point, the estimated specification is:
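
    That is, under the same variable-name assumptions:

        regress quality latitude carmls costpm age abc boras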

    The results from this regression look really good. All of the variables are statistically significant at the 5% level, and all but one are significant at the 1% level. Things are looking good, but...

    We have dealt with an irrelevant variable (exitspm) and highly collinear variables (carmls and truckmls), but we might still have mis-specification if we have omitted a variable or if any of our variables are correlated with the error term.

    Ramsey's Regression Specification Error Test (RESET for short) can indicate the presence of an omitted variable. The test proceeds in four steps:

    1. Estimate your model, and call the residual sum of squares RSSM.

    2. Create fitted values, and from them create the 2nd, 3rd, and 4th powers of the fitted values.

    3. Re-estimate your model, this time including the 2nd, 3rd, and 4th powers of the fitted values. Call the residual sum of squares from this regression RSS.

    4. The F-statistic = [(RSSM - RSS) / 3] / [RSS / (N - K - 1)]. This statistic is distributed according to an F-distribution with 3 and N - K - 1 degrees of freedom if there are no omitted variables. If the F-statistic is too large (so that the p-value is small), then there is evidence of omitted variable bias.

    To execute this test, do the following:
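
    A sketch of the four steps in Stata, under the same variable-name assumptions (e(rss) and e(df_r) are Stata's stored results for the residual sum of squares and the residual degrees of freedom):

        * Step 1: estimate the model and store the restricted RSS
        regress quality latitude carmls costpm age abc boras
        scalar rssm = e(rss)

        * Step 2: fitted values and their 2nd, 3rd, and 4th powers
        predict yhat, xb
        generate yhat2 = yhat^2
        generate yhat3 = yhat^3
        generate yhat4 = yhat^4

        * Step 3: re-estimate with the powers and store the unrestricted RSS
        regress quality latitude carmls costpm age abc boras yhat2 yhat3 yhat4
        scalar rss = e(rss)

        * Step 4: the F-statistic and its p-value
        display ((rssm - rss) / 3) / (rss / e(df_r))
        display Ftail(3, e(df_r), ((rssm - rss) / 3) / (rss / e(df_r)))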

    Notice that RSSM = 538,473.538 from the first (restricted) regression, as it is restricted to not include the fitted values, and that RSS = 531,058.515 from the second (unrestricted) regression. Notice too from this second regression that N - K - 1 = 942. Calculating the F-statistic using the above formula, therefore, yields F-stat = [(538,473.538 - 531,058.515) / 3] / (531,058.515 / 942) = 4.38. The p-value associated with this F-stat can be found in Excel by entering =FDIST(4.38, 3, 942), which returns 0.0045. Thus, the conclusion is that the null hypothesis that there are no omitted variables is rejected in favor of the hypothesis that there are omitted variables.

    All of this can be done much more easily in Stata. Re-estimate the model (without the fitted values), and follow it with the post-estimation command ovtest, meaning "omitted variable test":
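
    For example:

        regress quality latitude carmls costpm age abc boras
        ovtest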

    Notice that the ovtest post-estimation command produces exactly the same F-stat and p-value that we calculated above. And thus, the result again is to reject the claim that there are no omitted variables. Note: the way Stata presents the results of the ovtest command can be confusing. The results let the user determine whether the test should be rejected or not rejected based on the p-value. However, the results always present the null hypothesis in words: "Ho: model has no omitted variables". Be sure to read this as the null hypothesis, and not as the test's judgment of the null hypothesis.

  7. So we have an omitted variable problem; what can we do? Addressing omitted variable problems is not easy. One can rethink the theory behind the model to see if a new variable comes to mind that should be included in the regression. Alternatively, one may think harder about the error term and wonder whether any of the independent variables are correlated with it (e.g., recall the standard wage example where schooling is correlated with the error term because the model specification omits motivation).

    For this lab, we will try to address the second issue. The age variable is suspect. Older highways should be of worse quality, but the causation can run both ways -- worse highways are unlikely to get much older as counties repair them. What we would like is a measure of age that does not reflect policy makers' decisions to repair roads. Such a variable would be a valid instrumental variable, as it is correlated with the problem variable (age) but is not correlated with the error term (it affects road quality only through age). Coming up with valid instruments is a big part of econometrics. For our purposes, let me suggest the percent of mileage signs that exhibit damage. Mileage signs are damaged by cars and trucks that run astray, by debris that flies off the road, and by vandals. They are also replaced when roads are repaired, but a road would never be repaired because of damaged signs. Thus, it would seem that Corr(age, signdam) > 0 but that Corr(signdam, errors) = 0. There is no way to test this second claim, but we can investigate the first:
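
    For example:

        correlate age signdam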

    The two variables are highly correlated, with Corr(age, signdam) = 0.9407. Believing in the argument given above, we are now willing to use signdam as an instrument for age. In Stata:
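
    One way to do this, assuming the same control variables as before, is with the ivregress 2sls command, instrumenting age with signdam:

        * Two-stage least squares: age instrumented by signdam
        ivregress 2sls quality latitude carmls costpm abc boras (age = signdam)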

    This last set of results, the instrumental variables (IV) results, is the best empirical model for this question -- or at least that is what we are claiming.


THE PROGRAM