Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2013
The data for this problem are in Stata format: wages.dta. The data set contains five variables on 704 individuals: race (1 = Hispanic, 2 = black, 3 = white), age, school (years of schooling), sex (F = female, M = male), and wage (annual labor income).
The lab has 10 questions. To learn the most, try to work through all 10 by writing the Stata commands and answers yourself. If you get stuck, all 10 questions are repeated below with their Stata commands, followed by a Stata program that executes the commands for all 10 questions.
Lab Instructions Without Stata Commands
Lab Instructions With Stata Commands and Answers
desc
sum
tab race
tab school
tab sex
gen hispanic=(race==1)
gen black=(race==2)
gen white=(race==3)
gen female=(sex=="F")
gen male=(sex=="M")
gen age2=age*age
gen lnwage=ln(wage)
keep wage lnwage age age2 school hispanic black white female male
order wage lnwage age age2 school hispanic black white female male
save wages_edited, replace
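As an optional check that is not part of the original lab, the dummy variables created above can be verified before any regressions are run. The sketch below assumes race and sex have no missing values in wages.dta, so that every person belongs to exactly one race group and one sex group.

```stata
* Optional check (not in the original lab): each person should fall in
* exactly one race category and exactly one sex category.
* Assumes race and sex have no missing values; -assert- stops with an
* error if any observation violates the condition.
assert hispanic + black + white == 1
assert female + male == 1
* The mean of each dummy is the sample share of that group, which should
* match the percentages reported earlier by -tab race- and -tab sex-.
sum hispanic black white female male
```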
reg wage age age2 school hispanic black female
predict errors1, resid
sum errors1
scatter errors1 age, ti(Wage Regression) saving(e1_age, replace)
Given what you know about wages, do the results generally make sense? Explain why the residuals should make one question the model specification.
These results generally make sense: wages first rise and then fall with age, schooling has a positive effect, black and Hispanic workers earn less than white workers (with the smaller differential for Hispanics), and there is a substantial gender differential against women.
The summary statistics of the residuals are immediately troubling: the smallest residual is about -$45,000, while the largest is almost six times that in magnitude, at almost $260,000. The worry, of course, is possible heteroskedasticity. To look at this further, consider the graph of the residuals against age. The residuals are much more dispersed at some ages than at others, and the positive residuals are much more dispersed than the negative residuals. Both patterns are troublesome.
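The visual impression of heteroskedasticity can be backed up with a formal test. The commands below are a sketch that goes slightly beyond what the lab asks for: estat hettest and rvfplot are standard post-estimation commands for regress, and their use here is a suggestion rather than a required step.

```stata
* Optional follow-up (not in the original lab): a formal check for
* heteroskedasticity after the wage regression.
quietly reg wage age age2 school hispanic black female
* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant
* error variance, so a small p-value points to heteroskedasticity.
estat hettest
* Residuals plotted against fitted values, with a reference line at zero.
rvfplot, yline(0)
```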
reg lnwage age age2 school hispanic black female
predict errors2, resid
scatter errors2 age, ti(Ln(Wage) Regression) saving(e2_age, replace)
Explain why the residuals might give one more confidence in this model than in the model from the previous question.
The summary statistics of the residuals are much improved: the minimum residual is -2.2 and the maximum is 2.73. The graph further shows that the residuals have roughly the same variance across all age groups, and that positive and negative residuals are similarly dispersed.
Describe the predicted relationship between age and ln(wages) as completely as possible.
Because of the negative sign on age-squared, log wages increase with age at a decreasing rate, up to a point. After that, log wages decrease at an increasing rate. Specifically, log wages increase up to age 44.6, where 44.6 = 0.290922 / (2 x 0.003262).
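The turning point quoted above comes from setting the derivative of the fitted quadratic to zero. The symbols below are notation introduced here for illustration: the estimated coefficient on age is written as 0.290922 and the estimated coefficient on age2 as -0.003262, as reported in the answer.

```latex
\[
\frac{\partial\,\widehat{\ln(\text{wage})}}{\partial\,\text{age}}
  = \hat\beta_{\text{age}} + 2\,\hat\beta_{\text{age2}}\,\text{age} = 0
\quad\Longrightarrow\quad
\text{age}^{*}
  = -\frac{\hat\beta_{\text{age}}}{2\,\hat\beta_{\text{age2}}}
  = \frac{0.290922}{2 \times 0.003262}
  \approx 44.6 .
\]
```

Because the coefficient on age2 is negative, the second derivative is negative and this point is a maximum. In Stata, the same quantity could be obtained with a standard error by typing nlcom -_b[age]/(2*_b[age2]) after the regression.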
reg lnwage age age2 school hispanic black female
test hispanic=black
test hispanic=black=0
test school=.09
test female=-0.10
test age=age2=0
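For a test of a single restriction, such as school=.09, the F statistic reported by test is the square of the usual t statistic. The sketch below is not part of the original lab; the scalar name t_school is made up for this illustration, and it relies on the _b[], _se[] and e(df_r) results that regress leaves behind.

```stata
* Sketch (not in the original lab): a single linear restriction can also
* be checked with a t statistic, whose square equals the F from -test-.
quietly reg lnwage age age2 school hispanic black female
scalar t_school = (_b[school] - 0.09)/_se[school]
display "t = " t_school "   t^2 = " t_school^2
display "two-sided p-value = " 2*ttail(e(df_r), abs(t_school))
* Point estimate and standard error of the Hispanic-black difference,
* which corresponds to -test hispanic=black-.
lincom hispanic - black
```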
gen new_y=lnwage-.09*school
reg lnwage age age2 school hispanic black female
gen rssu=_result(4)
gen dendf=_result(5)
gen numdf=1
reg new_y age age2 hispanic black female
gen rssr=_result(4)
gen fstat=((rssr-rssu)/numdf)/(rssu/dendf)
gen pval=Ftail(numdf,dendf,fstat)
list fstat pval if _n==1
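The commands above compute F = [(RSSr - RSSu)/q] / [RSSu/dendf] by hand, where RSSr and RSSu are the restricted and unrestricted residual sums of squares and q is the number of restrictions. As a hedged alternative, the same calculation can be sketched with scalars and the e() results stored by regress, which avoids creating full-length variables. The scalar names rss_u, df_u, rss_r and F6 are made up for this illustration.

```stata
* Sketch (not in the original lab): the restricted-versus-unrestricted
* F test for school=.09, using scalars instead of variables.
* Assumes new_y has already been generated as lnwage-.09*school.
quietly reg lnwage age age2 school hispanic black female
scalar rss_u = e(rss)     // unrestricted residual sum of squares
scalar df_u  = e(df_r)    // residual degrees of freedom
quietly reg new_y age age2 hispanic black female
scalar rss_r = e(rss)     // restricted residual sum of squares
scalar F6 = ((rss_r - rss_u)/1)/(rss_u/df_u)
display "F = " F6 "   p-value = " Ftail(1, df_u, F6)
```

The result should agree with test school=.09 from the previous question, since both impose the same single restriction.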
reg lnwage age age2 school hispanic black female
replace rssu=_result(4)
replace numdf=_result(3)
replace dendf=_result(5)
reg lnwage
replace rssr=_result(4)
replace fstat=((rssr-rssu)/numdf)/(rssu/dendf)
replace pval=Ftail(numdf,dendf,fstat)
list fstat pval if _n==1
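The statistic computed above is the overall F statistic of the regression, which regress also reports in its header. As a quick cross-check, and assuming a Stata version in which regress stores its results in e(), the reported value and its p-value can be displayed directly:

```stata
* Sketch (not in the original lab): compare the hand-computed overall
* F statistic with the one stored by -regress-.
quietly reg lnwage age age2 school hispanic black female
display "F = " e(F) "   p-value = " Ftail(e(df_m), e(df_r), e(F))
```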
reg lnwage age age2 school female
replace rssr=_result(4)
replace numdf=(3-1)*(_result(3)+1)
replace dendf=_result(1)-3*(_result(3)+1)
reg lnwage age age2 school female if white==1
gen rssw=_result(4)
reg lnwage age age2 school female if hispanic==1
gen rssh=_result(4)
reg lnwage age age2 school female if black==1
gen rssb=_result(4)
replace fstat=((rssr-rssw-rssh-rssb)/numdf)/((rssw+rssh+rssb)/dendf)
replace pval=Ftail(numdf,dendf,fstat)
list fstat pval if _n==1
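The calculation above is a Chow test of whether all coefficients, including the intercept, are the same across the three race groups. Written out, with G = 3 groups, k = 4 regressors (age, age2, school, female), and N observations in the pooled regression (symbols introduced here for illustration):

```latex
\[
F =
\frac{\bigl[\mathrm{RSS}_{r} - (\mathrm{RSS}_{w} + \mathrm{RSS}_{h} + \mathrm{RSS}_{b})\bigr] \,/\, \bigl[(G-1)(k+1)\bigr]}
     {(\mathrm{RSS}_{w} + \mathrm{RSS}_{h} + \mathrm{RSS}_{b}) \,/\, \bigl[N - G(k+1)\bigr]},
\qquad
(G-1)(k+1) = 10,
\quad
N - G(k+1) = N - 15 .
\]
```

Here RSS_r comes from the pooled regression and RSS_w, RSS_h and RSS_b from the three race-specific regressions, matching the variables rssr, rssw, rssh and rssb in the commands.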
gen schoolb=school*black
gen schoolh=school*hispanic
gen schoolw=school*white
reg lnwage age age2 school schoolb schoolh hispanic black female
test schoolh=schoolb=0
test schoolh=schoolb
Alternatively:
reg lnwage age age2 schoolw schoolb schoolh hispanic black female
test schoolw=schoolh=schoolb
test schoolh=schoolb
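The two regressions above are the same model written in two ways, which is why the paired tests are equivalent. Using notation introduced here for illustration, let gamma and delta denote the schooling coefficients in the first regression and theta the schooling coefficients in the alternative regression:

```latex
\[
\begin{aligned}
\text{First form:}\quad
  &\gamma\,\text{school} + \delta_b\,\text{schoolb} + \delta_h\,\text{schoolh}, \\
\text{Alternative form:}\quad
  &\theta_w\,\text{schoolw} + \theta_b\,\text{schoolb} + \theta_h\,\text{schoolh}, \\
\text{with}\quad
  &\theta_w = \gamma, \qquad \theta_b = \gamma + \delta_b, \qquad \theta_h = \gamma + \delta_h .
\end{aligned}
\]
```

The hypothesis that the return to schooling is the same for all three groups is therefore delta_b = delta_h = 0 in the first form and theta_w = theta_h = theta_b in the alternative form. The hypothesis that it is the same for blacks and Hispanics is delta_h = delta_b in the first form and theta_h = theta_b in the alternative form.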
gen agew=age*white
gen ageb=age*black
gen ageh=age*hispanic
gen age2w=age2*white
gen age2b=age2*black
gen age2h=age2*hispanic
gen femalew=female*white
gen femaleb=female*black
gen femaleh=female*hispanic
reg lnwage age ageb ageh age2 age2b age2h school schoolb schoolh female femaleb femaleh hispanic black
test femaleb=0
test femaleh=0
test femaleb=femaleh=0
Alternatively:
reg lnwage agew ageb ageh age2w age2b age2h schoolw schoolb schoolh femalew femaleb femaleh hispanic black
test femalew=femaleb
test femalew=femaleh
test femalew=femaleb=femaleh
save wages_edited, replace
Stata Program to Execute All Commands
# delimit;
set more 1;
log using lab5.log, replace;
* STATA LAB FIVE;
use wages;
* Question 1;
desc;
sum;
tab race;
tab school;
tab sex;
* Question 2;
gen hispanic=(race==1);
gen black=(race==2);
gen white=(race==3);
gen female=(sex=="F");
gen male=(sex=="M");
gen age2=age*age;
gen lnwage=ln(wage);
keep wage lnwage age age2 school hispanic black white female male;
order wage lnwage age age2 school hispanic black white female male;
save wages_edited, replace;
*Question 3;
reg wage age age2 school hispanic black female;
predict errors1, resid;
sum errors1;
scatter errors1 age, ti(Wage Regression) saving(e1_age, replace);
* Question 4;
reg lnwage age age2 school hispanic black female;
predict errors2, resid;
scatter errors2 age, ti(Ln(Wage) Regression) saving(e2_age, replace);
* Question 5;
reg lnwage age age2 school hispanic black female;
test hispanic=black;
test hispanic=black=0;
test school=.09;
test female=-0.10;
test age=age2=0;
* Question 6;
gen new_y=lnwage-.09*school;
reg lnwage age age2 school hispanic black female;
gen rssu=_result(4);
gen dendf=_result(5);
gen numdf=1;
reg new_y age age2 hispanic black female;
gen rssr=_result(4);
gen fstat=((rssr-rssu)/numdf)/(rssu/dendf);
gen pval=Ftail(numdf,dendf,fstat);
list fstat pval if _n==1;
* Question 7;
reg lnwage age age2 school hispanic black female;
replace rssu=_result(4);
replace numdf=_result(3);
replace dendf=_result(5);
reg lnwage;
replace rssr=_result(4);
replace fstat=((rssr-rssu)/numdf)/(rssu/dendf);
replace pval=Ftail(numdf,dendf,fstat);
list fstat pval if _n==1;
* Question 8;
reg lnwage age age2 school female;
replace rssr=_result(4);
replace numdf=(3-1)*(_result(3)+1);
replace dendf=_result(1)-3*(_result(3)+1);
reg lnwage age age2 school female if white==1;
gen rssw=_result(4);
reg lnwage age age2 school female if hispanic==1;
gen rssh=_result(4);
reg lnwage age age2 school female if black==1;
gen rssb=_result(4);
replace fstat=(
(rssr-rssw-rssh-rssb)/numdf)/((rssw+rssh+rssb)/dendf);
replace pval=Ftail(numdf,dendf,fstat);
list fstat pval if _n==1;
* Question 9;
gen schoolb=school*black;
gen schoolh=school*hispanic;
gen schoolw=school*white;
reg lnwage age age2 school schoolb schoolh hispanic black female;
test schoolh=schoolb=0;
test schoolh=schoolb;
* Alternatively;
reg lnwage age age2 schoolw schoolb schoolh hispanic black female;
test schoolw=schoolh=schoolb;
test schoolh=schoolb;
* Question 10;
gen agew=age*white;
gen ageb=age*black;
gen ageh=age*hispanic;
gen age2w=age2*white;
gen age2b=age2*black;
gen age2h=age2*hispanic;
gen femalew=female*white;
gen femaleb=female*black;
gen femaleh=female*hispanic;
reg lnwage age ageb ageh age2 age2b age2h school schoolb schoolh female femaleb femaleh hispanic black ;
test femaleb=0;
test femaleh=0;
test femaleb=femaleh=0;
* Alternatively;
reg lnwage agew ageb ageh age2w age2b age2h schoolw schoolb schoolh femalew femaleb femaleh hispanic black;
test femalew=femaleb;
test femalew=femaleh;
test femaleb=femaleh;
test femalew=femaleb=femaleh;
save wages_edited, replace;
clear;