Stata Lab 7: Heteroskedasticity, Dummy Dependent Variables, and Panel Data

Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2013


PART ONE. HETEROSKEDASTICITY

The data for this problem are in Stata format: weighting.dta. The data set contains 7 variables and 119 observations.

  1. Summarize and describe the data.

    desc
    sum

  2. Regress y1 on the four x variables. Calculate the error terms (call them err1). Plot the error terms against the population variable "pop". What does the scatter plot tell you about the error structure of the data generating process?

    reg y1 x1 x2 x3 x4
    predict err1, resid
    scatter err1 pop

    The variance of the error terms increases as pop increases. Thus, the y variable likely captures a total value (rather than an averaged value). Put differently, because the error terms are more and more spread out as pop increases, it is likely that the data generating process for y1 is proportionally related to population.
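
    One way to back up the visual impression with a formal test is a Breusch-Pagan test against pop. A minimal sketch, using Stata's built-in estat hettest post-estimation command (run immediately after the regression):

    reg y1 x1 x2 x3 x4
    estat hettest pop

    A small p-value rejects the null of constant variance, consistent with the fanning-out pattern in the scatter plot.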

  3. Generate popsqinv, the inverse of the squared population. Using weighted least squares with popsqinv as the weight, re-estimate the y1 equation. Generate the new residuals (called err1wgt), and plot them against the population variable.

    gen popsqinv=1/(pop^2)
    reg y1 x1 x2 x3 x4 [aweight=popsqinv]
    predict err1wgt, resid
    scatter err1wgt pop
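
    Note that err1wgt is still in the original units of y1, so it will still fan out with pop. If the variance of the errors is proportional to pop^2, the rescaled residuals err1wgt/pop should have roughly constant spread. A quick sketch (err1std is a made-up variable name):

    gen err1std = err1wgt/pop
    scatter err1std pop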

  4. Regress y2 on the four x variables. Calculate the error terms (call them err2). Plot the error terms against the population variable "pop". What does the scatter plot tell you about the error structure of the data generating process?

    reg y2 x1 x2 x3 x4
    predict err2, resid
    scatter err2 pop

    The variance of the error terms decreases as pop increases. Thus, the y variable likely captures an average value (rather than a total value). Put differently, because the error terms are less and less spread out as pop increases, it is likely that the data generating process for y2 is inversely related to population.

  5. Using weighted least squares with pop as the weight, re-estimate the y2 equation. Generate the new residuals (called err2wgt), and plot them against the population variable.

    reg y2 x1 x2 x3 x4 [aweight=pop]
    predict err2wgt, resid
    scatter err2wgt pop
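
    The same rescaling check applies here, but in the opposite direction. If the variance of the errors is proportional to 1/pop, then err2wgt*sqrt(pop) should have roughly constant spread (again a sketch, with err2std a made-up variable name):

    gen err2std = err2wgt*sqrt(pop)
    scatter err2std pop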

  6. In what ways are the OLS and WLS regression results for y1 and for y2 similar or wildly different? Did this have to be the case? How do the scatter plots of residuals against population compare? Why?

    The WLS results differ from the OLS results: the estimated coefficients differ a bit, but the standard errors can differ wildly. This is the point of WLS. The data points are still the data points, so the best-fit line is not going to change much, which is why the coefficient estimates are close (though not identical) across OLS and WLS. The purpose of WLS, however, is to produce better estimates of the standard errors, and the results show substantial differences there (not always for the better).

    Because the point estimates do not change much, the scatter plots of the residuals look similar across the two types of regressions. WLS cannot really change the error terms themselves, so do not expect the residuals to have constant variance after WLS; the WLS residuals will, more or less, replicate the OLS residuals. The point, again, is that WLS provides better standard errors, so hypothesis testing is more reliable under WLS.
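
    To see the comparison side by side, one can store each set of estimates and tabulate them together. A sketch using Stata's estimates store and estimates table commands:

    reg y1 x1 x2 x3 x4
    estimates store ols1
    reg y1 x1 x2 x3 x4 [aweight=popsqinv]
    estimates store wls1
    estimates table ols1 wls1, b se

    The b and se options print each coefficient with its standard error beneath it, making it easy to see that the coefficients barely move while the standard errors change substantially.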



PART TWO. DUMMY DEPENDENT VARIABLES

The data for this problem are in Stata format: CCmetrics.dta. The data set contains 379 completed rides in the Cash Cab, a game show that airs on the Discovery Channel. At the end of each completed ride, the contestants are given the option to gamble all of their winnings on a single bonus question for double-or-nothing. The variable risk equals 1 if the contestants accept the gamble, and it equals 0 if they keep their current winnings without taking on the gamble.

  1. Describe and summarize the data.

    desc
    sum

  2. To begin, we want to use the linear probability model to see which characteristics of the contestants are correlated with the risk variable. Consider the following linear probability model regression, but don't estimate it yet.

    reg risk male white riders avgage pcorradj sbvbc hconp lstreakc

    What sign do you expect on the estimated coefficient for each variable?

    Positive on male as men are more risk-loving than women.

    Unclear on white as I don't know if whites are more risk-loving than blacks.

    Positive on riders as more riders are more likely to know the answer.

    Negative on avgage as older people are more risk-averse than younger people.

    Positive on pcorradj as better performance during the game makes one more confident.

    Negative on sbvbc as people are less risk-loving when more is at risk.

    Positive on hconp as people are more willing to gamble the more confident they are in their ability.

    Positive on lstreakc as people are more willing to gamble when they end the ride on a streak of correct answers.
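
    Before running the regression, a quick look at the raw pairwise correlations offers a first check on these predictions. A sketch using Stata's pwcorr command (the sig option adds p-values):

    pwcorr risk male white riders avgage pcorradj sbvbc hconp lstreakc, sig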

    Now estimate the following linear probability model:

    reg risk male white riders avgage pcorradj sbvbc hconp lstreakc

    The regression indicates that risk attitudes do not vary by race. Drop this variable from the regression. Although it is not statistically significant, keep the percent of questions answered correctly in the regression. Estimate the "preferred" linear probability model:

    reg risk male riders avgage pcorradj sbvbc hconp lstreakc

    Make sure you can interpret each estimated coefficient. Think again about whether the signs match your intuition. For example: when the number of riders increases by 1, the riders are 8.5 percentage points more likely to risk their winnings; and a group with a primary respondent who is male is 6.67 percentage points more likely to risk their winnings, though this estimate is not statistically different from zero.

    The regression results from the preferred model look good, but as knowledgeable econometricians we know that there are problems with the linear probability model.
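
    For instance, the fitted values of a linear probability model are not constrained to lie between 0 and 1. A quick check, run right after the preferred regression (a sketch; phat is a made-up variable name):

    predict phat, xb
    count if phat < 0 | phat > 1

    Any observations flagged by count have fitted "probabilities" outside [0,1], which is one motivation for the non-linear models below.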

  3. One problem is that the standard errors of the linear probability model suffer from heteroskedasticity. This is easily addressed by having Stata produce robust standard errors:

    reg risk male riders avgage pcorradj sbvbc hconp lstreakc, robust

    Notice that the robust option did not change any of the estimated coefficients. Rather, the heteroskedasticity embedded in the linear probability model affects only the standard errors, so correcting for the problem leaves the estimated coefficients unchanged while it can increase or decrease the standard errors.
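
    To see this directly, store and tabulate the two sets of estimates (a sketch using estimates store and estimates table):

    reg risk male riders avgage pcorradj sbvbc hconp lstreakc
    estimates store lpm
    reg risk male riders avgage pcorradj sbvbc hconp lstreakc, robust
    estimates store lpmrob
    estimates table lpm lpmrob, b se

    The coefficient columns are identical; only the standard errors differ.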

  4. When one chooses to use the linear probability model, one must include the robust option. But the more important question is whether a different (non-linear) model should be estimated. The two leading contenders are the binary logistic regression and the binary probit regression. Moreover, when using Stata, one can report estimated marginal effects at the means of the independent variables by using the 'mfx' post-estimation command. Estimate each of these models in Stata:

    logit risk male riders avgage pcorradj sbvbc hconp lstreakc
    mfx

    probit risk male riders avgage pcorradj sbvbc hconp lstreakc
    mfx

    Notice several things about these results:

    The logit and probit coefficient estimates are on different scales, both from each other and from the linear probability model, so the raw coefficients cannot be compared directly across models.

    The marginal effects reported by mfx, however, are comparable, and they tend to be close to the linear probability model coefficients.

    The signs and statistical significance of the estimates are broadly consistent across all three models.
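
    In more recent versions of Stata, the margins command has replaced mfx. An equivalent sketch for the logit model (the atmeans option evaluates the marginal effects at the sample means, matching mfx):

    logit risk male riders avgage pcorradj sbvbc hconp lstreakc
    margins, dydx(*) atmeans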



PART THREE. PANEL DATA - FIXED EFFECTS

The data for this problem are in Stata format: MLBwins.dta. This baseball dataset lends itself to panel data estimation, as the homeid variable uniquely identifies each of the 30 different home teams.

  1. Describe and summarize the data.

    desc
    sum

  2. Estimate the number of homeruns hit by the home team as a function of its hits and walks. Repeat this where each home team receives its own fixed effect.

    reg hhrs hhits hwalks
    areg hhrs hhits hwalks, absorb(homeid)
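
    An equivalent way to estimate the fixed-effects model is Stata's panel estimator. A sketch (xtset declares the panel dimension and xtreg, fe runs the within estimator):

    xtset homeid
    xtreg hhrs hhits hwalks, fe

    The slope estimates from xtreg, fe match those from areg; areg simply absorbs the team dummies rather than reporting them.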

  3. Using a linear probability model, estimate whether the home team won the game as a function of its hits, walks, and homeruns. Repeat this where each home team receives its own fixed effect.

    reg hwin hhits hwalks hhrs, robust
    areg hwin hhits hwalks hhrs, robust absorb(homeid)

Notice that the regression results did not vary much across the OLS and fixed-effects models. This is because, in this case, hits, walks, and homeruns are strong predictors of wins, leaving little for the fixed effects to explain. Put differently, being the New York Yankees does not by itself generate wins. Rather, because the New York Yankees spend money on good players, they are in effect buying hits, walks, and homeruns, and those things lead to wins. In other examples, the fixed effects capture almost all of the action, in which case the coefficient estimates can move a lot (typically toward zero and statistical insignificance).
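
One way to gauge how much the fixed effects matter is to include the team dummies explicitly and test them jointly. A sketch using Stata's factor-variable notation:

    reg hwin hhits hwalks hhrs i.homeid, robust
    testparm i.homeid

A large p-value on the joint test says the team effects add little once hits, walks, and homeruns are controlled for, consistent with the discussion above.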