Stata Project 2
Econ 330: Econometrics
Professor Lemke
Fall 2008
Due: Thursday, October 2
Directions: Everyone must write their own Stata program that produces answers to the following questions. You can talk to your classmates, but everyone's work must be his or her own. You may find it useful or necessary to learn new Stata commands. The lab write-up is due at the start of class on Oct. 2. At that time, turn in a hard copy of your program and a separate hard copy of your answers to the specific questions. Your answers must be written clearly, in plain language, without relying on Stata commands or including unnecessary Stata output.
The data for this problem are in Stata format: project2.dta. The data set contains information on 330 public school districts in Wisconsin for the 2001-2002 school year. Be sure to label each variable that you are asked to create.
- Summarize and describe the data. Make sure you understand each of the variables.
- Define a variable, size, that equals 1 if the district is at or below the 30th percentile in enrollment, equals 2 if it is between the 30th and 70th percentiles in enrollment, and equals 3 if it is at or above the 70th percentile in enrollment. (Make sure every district is in exactly one of these three categories.) Label the values of size: Small, Medium, and Large. Tabulate size to see how many districts are in each group. You should find that 99 districts are small, 132 are medium, and 99 are large. Create three dummy variables (called small, medium, and large) for the three classes of size. Describe and summarize your four newly created variables.
- Create a new variable called ppexp that is the district's per pupil expenditure. Generate five new variables that are rescaled versions of the four salary variables and per pupil expenditure. In particular, rescale these five variables so they are measured in thousands of dollars. Give these variables the same names as they currently have, but with a z at the end. Thus, the rescaled version of bamin will be called baminz. Describe and summarize the five original variables plus the five rescaled variables. Verify that you have done the rescaling correctly.
- We are interested in estimating a model in which ACT score is explained by district expenditures (use ppexpz), the dropout rate, and the percent of students in poverty (as measured by the percent of students eligible for free or reduced-price lunch). Estimate this model. Which variables are statistically significant predictors of ACT? How should the coefficient on ppexpz be interpreted? Economically, what does this say about the relationship between test scores and public expenditures on education?
- A union advocate sees the results from question 4 and responds by saying that total expenditures may not be related to test scores, but teacher salaries are. Thus, we are now interested in estimating a model in which ACT score is explained by the minimum salary for a teacher with a BA degree, the maximum salary for a teacher with an MA degree, enrollment, the dropout rate, and the percent of students in poverty. Estimate this model using bamin and mamax. Which variables are statistically significant predictors of ACT?
- Re-estimate the previous model, but use the rescaled salary variables, baminz and mamaxz, in place of bamin and mamax. How do the coefficient estimates, standard errors, and t-values for each of the variables change? Has the R-squared or adjusted R-squared changed?
- Continue to estimate the same basic model (using bamin & mamax or baminz & mamaxz, your choice). However, instead of including enrollment only linearly, allow enrollment to have a non-linear effect by also including its square. Characterize, as best you can, the shape of the relationship between enrollment and ACT scores suggested by the regression results. Do the results support the claim that a squared term in enrollment is warranted?
- Now, instead of including enrollment and its square, include the dummy variables for size. First, include all three size dummies, and notice that Stata omits one of the variables. Second, repeat the regression, but omit medium from the regression. Interpret the coefficients on small and large. Comment on their statistical significance. From these results, what is the expected difference in average ACT score between a large and a small district? Confirm this difference with the results from another regression. (Include the results from all three regressions in your answers.)
- In all of the regressions so far, the general result has been that the coefficient on bamin (or baminz) is negative, while the coefficient on mamax (or mamaxz) is positive. Most people think either that spending money on teachers is a waste (in which case the coefficients should be zero) or that spending more on teachers creates better students and higher test scores (in which case the coefficients should be positive). What is the best explanation as to why one coefficient is negative and one is positive?
- Run the regression from problem 8 in which small is the omitted size variable two more times: omit the MA salary variable in the first regression, and omit the BA salary variable in the second. Are the coefficients of the same sign across regressions? Are they statistically significant?
- Suppose a labor economist proposes that what matters in terms of producing high quality students who do well on standardized tests is keeping teachers motivated to teach at a high level. Moreover, this is done in two stages. First, it requires paying a high salary to the best teachers coming out of college. Second, it requires giving current teachers high annual pay increases. To capture pay increases, generate a variable called payinc which equals the difference between the maximum salary paid to a teacher with an MA degree and the minimum salary paid to a teacher with a BA degree. (Generate payincz, measured in thousands of dollars, as well.) Estimate the relationship between ACT scores and baminz, payincz, medium, large, dropout, and lunch. Is there any evidence supporting the labor economist's theory? Quantify the estimated effect of payincz. If you were going to report these results in a paper, would you report the results from the model that included bamin and payinc or from the model that included baminz and payincz?
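Syntax note: Several of the questions above call for commands that may be new to you (percentile cutoffs, value labels, dummy variables, rescaling, squared terms). The sketch below shows one possible way to write each of these in Stata. It deliberately uses Stata's built-in auto data rather than project2.dta, and every variable name it creates (pricecat, pricecatlbl, pc1-pc3, pricez, weight2) is made up for illustration. It is not the required approach, and other commands will work equally well.

```stata
* Syntax sketch on Stata's built-in auto data (NOT the project data).
* All names created here are illustrative only.
sysuse auto, clear

* Store the 30th and 70th percentiles of price in r(r1) and r(r2).
_pctile price, percentiles(30 70)
local p30 = r(r1)
local p70 = r(r2)

* Build a three-level category; the three conditions are mutually exclusive.
generate byte pricecat = 1 if price <= `p30'
replace pricecat = 2 if price > `p30' & price < `p70'
replace pricecat = 3 if price >= `p70' & !missing(price)
label variable pricecat "Price category"

* Attach value labels to the category codes, then tabulate.
label define pricecatlbl 1 "Low" 2 "Middle" 3 "High"
label values pricecat pricecatlbl
tabulate pricecat

* Create one dummy variable per category (pc1, pc2, pc3).
tabulate pricecat, generate(pc)

* Rescale a dollar variable to thousands of dollars and label it.
generate pricez = price / 1000
label variable pricez "Price (thousands of dollars)"
describe price pricez
summarize price pricez

* Regression with a squared term.
generate weight2 = weight^2
regress pricez weight weight2 mpg
```

Comparing the means reported by summarize for price and pricez is a quick check that the rescaling worked as intended.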