Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2013
PART ONE. HETEROSKEDASTICITY
The data for this problem are in Stata format: weighting.dta. The data set contains 7 variables and 119 observations.
desc
sum
reg y1 x1 x2 x3 x4
predict err1, resid
scatter err1 pop
The variance of the error terms increases as pop increases. Thus, the y variable likely captures a total value (rather than an averaged value). Put differently, because the error terms become more and more spread out as pop increases, it is likely that the data generating process for y1 is proportional to population. If y1 is a population total, the error scales with pop, so the error variance is proportional to pop squared and the appropriate analytic weight is 1/pop^2.
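The visual impression from the scatter plot can be checked with a formal Breusch-Pagan test against pop (a quick sketch; estat hettest must immediately follow the regression it tests):
quietly reg y1 x1 x2 x3 x4
estat hettest pop
A small p-value rejects constant variance, consistent with the fanning-out pattern in the residual plot.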
gen popsqinv=1/(pop^2)
reg y1 x1 x2 x3 x4 [aweight=popsqinv]
predict err1wgt, resid
scatter err1wgt pop
reg y2 x1 x2 x3 x4
predict err2, resid
scatter err2 pop
The variance of the error terms decreases as pop increases. Thus, the y variable likely captures an average value (rather than a total value). Put differently, because the error terms become less and less spread out as pop increases, it is likely that the data generating process for y2 is inversely related to population. If y2 is an average over pop individuals, the error variance is proportional to 1/pop, so the appropriate analytic weight is pop.
reg y2 x1 x2 x3 x4 [aweight=pop]
predict err2wgt, resid
scatter err2wgt pop
The WLS results are different from the OLS results--the estimated coefficients differ a bit, but the standard errors can differ substantially. This is the point of WLS. The data points are still the data points, so the best fit line is not going to change much, which is why the coefficient estimates are close (though not identical) between OLS and WLS. The purpose of WLS, however, is to produce better estimates of the standard errors, and the results show substantial differences here (not always for the better). Lastly, because the point estimates do not change much, the scatter plots of the residuals look similar across the two regressions. WLS cannot really change the error terms themselves (so do not expect the residuals to have constant variance after WLS); the WLS residuals will, more or less, replicate the OLS residuals. The point, again, is that WLS provides better standard errors, so hypothesis testing is more reliable under WLS.
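To see the coefficient and standard error differences side by side, the two fits can be stored and tabulated (a sketch using Stata's estimates commands; the names ols and wls are arbitrary labels):
quietly reg y2 x1 x2 x3 x4
estimates store ols
quietly reg y2 x1 x2 x3 x4 [aweight=pop]
estimates store wls
estimates table ols wls, b se
Reading down each column makes it easy to confirm that the point estimates barely move while the standard errors change noticeably.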
PART TWO. DUMMY DEPENDENT VARIABLES
The data for this problem are in Stata format: CCmetrics.dta. The data set contains 379 completed rides in the Cash Cab, a game show that airs on the Discovery Network. At the end of each completed ride, the contestants are given the option to gamble all of their winnings on a single bonus question for double-or-nothing. The variable risk equals 1 if the contestants accept the gamble, and it equals 0 if they keep their current winnings without taking on the gamble.
desc
sum
reg risk male white riders avgage pcorradj sbvbc hconp lstreakc
What sign do you expect on the estimated coefficient for each variable?
Positive on male as men are more risk-loving than women.
Unclear on white as I don't know if whites are more risk-loving than blacks.
Positive on riders as more riders are more likely to know the answer.
Negative on avgage as older people are more risk-averse than younger people.
Positive on pcorradj as better performance during the game makes one more confident.
Negative on sbvbc as people are less risk-loving when more is at risk.
Positive on hconp as people are more willing to gamble the more confident they are in their ability.
Positive on lstreakc as people are more willing to gamble when they are on a streak of correct answers.
Now estimate the following linear probability model:
reg risk male white riders avgage pcorradj sbvbc hconp lstreakc
The regression indicates that risk attitudes do not vary by race. Drop this variable from the regression. Although it is not statistically significant, keep the percent of questions answered correctly in the regression. Estimate the "preferred" linear probability model:
reg risk male riders avgage pcorradj sbvbc hconp lstreakc
Make sure you can interpret each estimated coefficient. Think again about whether the signs match your intuition. For example: when the number of riders increases by 1, the group is 8.5 percentage points more likely to risk their winnings; and a group whose primary respondent is male is 6.67 percentage points more likely to risk their winnings, though this estimate is not statistically different from zero.
The regression results from the preferred model look good, but as knowledgeable econometricians we know that there are problems with the linear probability model.
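One of those problems is easy to check directly: the LPM's fitted values are not constrained to lie in [0,1] (a quick sketch; phat is just an illustrative variable name):
quietly reg risk male riders avgage pcorradj sbvbc hconp lstreakc
predict phat
count if phat < 0 | phat > 1
Any observations counted here are assigned nonsensical predicted probabilities, which is one motivation for the logit and probit models below.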
reg risk male riders avgage pcorradj sbvbc hconp lstreakc, robust
Notice that the robust option did not change any of the estimated coefficients. The heteroskedasticity embedded in the linear probability model affects only the standard errors, so correcting for it leaves the estimated coefficients unchanged while it can increase or decrease the standard errors.
logit risk male riders avgage pcorradj sbvbc hconp lstreakc
mfx
probit risk male riders avgage pcorradj sbvbc hconp lstreakc
mfx
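mfx is the older syntax; in Stata 11 and later the same calculation is done with margins (a sketch: the atmeans option reproduces mfx's marginal effects at the means, while omitting it gives average marginal effects):
logit risk male riders avgage pcorradj sbvbc hconp lstreakc
margins, dydx(*) atmeans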
Notice several things about these results. The logit and probit coefficient estimates are on different scales, so they are not directly comparable to each other or to the LPM coefficients; the marginal effects reported by mfx are the comparable quantities. Compare the signs, significance patterns, and marginal effects across the three models against your earlier expectations.
PART THREE. PANEL DATA - FIXED EFFECTS
The data for this problem are in Stata format: MLBwins.dta. This baseball dataset lends itself to panel data estimation as the homeid variable uniquely identifies each of the 30 different home teams.
desc
sum
reg hhrs hhits hwalks
areg hhrs hhits hwalks, absorb(homeid)
reg hwin hhits hwalks hhrs, robust
areg hwin hhits hwalks hhrs, robust absorb(homeid)
Notice that the regression results did not vary much between the OLS and fixed-effects models. This is because, in this case, hits, walks, and home runs--not the fixed effects--are the strong predictors of wins. Put differently, being the New York Yankees does not by itself generate wins. Rather, when the New York Yankees spend money on good players, they are buying hits, walks, and home runs, and those things lead to wins. In other examples, the fixed effects will capture almost all of the action, in which case the coefficient estimates would move a lot (toward zero and toward statistical insignificance).
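An equivalent way to estimate the fixed-effects model uses Stata's panel commands (a sketch; xtset only needs the panel identifier here since we are not exploiting the time dimension):
xtset homeid
xtreg hwin hhits hwalks hhrs, fe robust
xtreg, fe reproduces the areg coefficient estimates; areg's absorb() option is simply a convenience when the fixed effects themselves are nuisance parameters. The standard errors can differ, because in modern Stata the robust option to xtreg, fe clusters by panel rather than applying the plain heteroskedasticity-robust correction.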