PROJECT 3: REGRESSION ANALYSIS
Due: Thursday, November 13
PART I.
The data for the first part of this lab are in an Excel file: residuals.xls. The data set contains four variables: x, e2, e20, and e50. The variable x can take on values between -15 and 34, with a mean of 10. The variable e2 contains randomly drawn observations from a normal distribution with mean 0 and standard deviation 2. Likewise, e20 and e50 contain randomly drawn observations from normal distributions with mean 0 and standard deviations of 20 and 50, respectively. Put the data into Stata.
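One way to read the spreadsheet into Stata is sketched below. It assumes Stata 12 or later (where import excel is available), that residuals.xls is in the current working directory, and that the first row of the sheet holds the variable names. None of these is stated in this handout, so adjust as needed.

```stata
* Sketch only: assumes Stata 12+, the file in the working directory,
* and the variable names (x, e2, e20, e50) in the first spreadsheet row.
import excel using "residuals.xls", firstrow clear

* Confirm that the four variables arrived and are numeric.
describe
```

In older versions of Stata, the data can instead be copied from Excel and pasted into the Data Editor.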
- Summarize the four variables. What are the mean and standard deviation of each? Of the three error variables e2, e20, and e50, which has a sample mean closest to its true mean? Why would you expect this?
- The true regression model that we will investigate takes the form y = 12 + 8x + e. Generate three dependent variables, y2, y20, and y50, where each conforms to the regression model (i.e., y2 = 12 + 8x + e2, y20 = 12 + 8x + e20, y50 = 12 + 8x + e50). What are the mean and standard deviation of each of the y variables? Do they make sense? Explain.
- Regress y2 on x. Is the estimated intercept close to 12? Is the estimated slope close to 8? What is the R-squared for the regression? Does the regression explain most of the variance in the dependent variable, y2? How large is the standard error of the regression? (That is, what is the regression estimate of the standard deviation of the error terms?)
- Regress y20 on x. Is the estimated intercept close to 12? Is the estimated slope close to 8? What is the R-squared for the regression? Does the regression explain most of the variance in the dependent variable, y20? How large is the standard error of the regression?
- Regress y50 on x. Is the estimated intercept close to 12? Is the estimated slope close to 8? What is the R-squared for the regression? Does the regression explain most of the variance in the dependent variable, y50? How large is the standard error of the regression?
- Explain why some of the results differ across parts 3 - 5 (the regressions of y2, y20, and y50 on x) and why one should expect these differences.
- Again, regress y2 on x. Create y2hat and error2 where y2hat is the predicted value from the regression for each observation and error2 is the regression error for each observation. This is easily done in Stata with the following sequence of commands:
reg y2 x
predict y2hat
predict error2, resid
Plot the residuals using Stata's histogram command. In particular:
hist error2, bin(50)
The hist command tells Stata to plot a histogram, while the bin(50) option tells Stata to use up to 50 bins or classes in the histogram. According to the histogram, roughly what is the average error term, and are the errors roughly normally distributed? Summarize the variables y2, y2hat, and error2. What is the mean predicted y value? What must it equal? What is the mean error? What must it equal?
- Regress y20 on x. Create y20hat and error20. Plot the residuals. According to the histogram, roughly what is the average error term, and are the errors roughly normally distributed? Summarize the variables y20, y20hat, and error20. What is the mean predicted y value? What is the mean error?
- Regress y50 on x. Create y50hat and error50. Plot the residuals. According to the histogram, roughly what is the average error term, and are the errors roughly normally distributed? Summarize the variables y50, y50hat, and error50. What is the mean predicted y value? What is the mean error?
- To get a better idea of the residual plots, replot the residuals so that all three graphs have the same x-axis range. Moreover, give each a title and save each graph. The commands are:
hist error2, bin(50) xlabel(-100 -50 0 50 100) title("Residuals of Y2 Regression") saving(graph_e2, replace)
hist error20, bin(50) xlabel(-100 -50 0 50 100) title("Residuals of Y20 Regression") saving(graph_e20, replace)
hist error50, bin(50) xlabel(-100 -50 0 50 100) title("Residuals of Y50 Regression") saving(graph_e50, replace)
Now graph all three together with the following command:
graph combine graph_e2.gph graph_e20.gph graph_e50.gph
Some of the answers in parts 7 - 9 (the residual analyses for y2, y20, and y50) are the same, while some are different. Explain why.
- Finally, it is frequently of interest to see the actual data plotted on the same graph as the predicted regression line. To do this for the y50 variable, type
scatter y50 x || lfit y50 x
You could also have used any of the other graph options to spice up this graph a bit, such as adding a title, controlling the x-axis or y-axis values, etc.
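As an illustration of such options, the sketch below adds a title, axis titles, and explicit x-axis labels to the same plot. The title text, axis wording, and label spacing are arbitrary choices made here for illustration, not requirements of the assignment.

```stata
* Scatter of y50 against x with the fitted regression line overlaid.
* The title, axis titles, and tick spacing are illustrative choices only.
twoway (scatter y50 x) (lfit y50 x), ///
    title("Y50 and Fitted Regression Line") ///
    xtitle("x") ytitle("y50") ///
    xlabel(-20(10)40) ///
    legend(off)
```

The /// line continuation works inside a do-file. If the command is typed interactively, it must be entered on a single line without the /// markers.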
PART II.
The data for this problem are already in a Stata file: coefficients.dta. The data set contains information on 330 public school districts in Wisconsin for the 2001-2002 school year. In this assignment, we are interested in the relationship between expenditures on public education and test scores. Describe and summarize the data to learn more about the data set. For this assignment, write a program that executes all of the tasks below. Turn in a hard copy of your program and a hard copy of the actual answers.
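A do-file for this part might be organized along the following lines. This is only a skeleton: the log file name project3_part2 is made up for illustration, and the data set is assumed to be in the current working directory.

```stata
* Skeleton do-file (sketch). The log file name is a hypothetical choice.
capture log close                 // close any log left open by an earlier run
log using project3_part2, replace text

use coefficients.dta, clear       // assumes the file is in the working directory
describe
summarize

* ... commands for each task below go here ...

log close
```

With this layout, the do-file itself is the program, and the text log it writes can serve as a record of the output.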
- Define a variable, size, that equals 1 if the district is below the 20th percentile in enrollment, equals 2 if it is between the 20th and 80th percentiles in enrollment (i.e., equal to or above the 20th percentile and equal to or below the 80th percentile), and equals 3 if it is above the 80th percentile in enrollment. (Make sure every district is in one, and only one, of these three categories.) Label the values of size: Small, Middle, and Large. Tabulate size to see how many districts are in each group. For each size of district, what are the average enrollment, dropout rate, percent of students receiving free lunch, and ACT score?
- Create a new variable called ppexp that is the district's per pupil expenditure. What are the descriptive statistics (mean, standard deviation, minimum, and maximum) for the four salary variables and per pupil expenditure?
- Make a graph using the scatter command that plots each district's per pupil expenditure (x-axis) against its average ACT score (y-axis). Make this graph as nice as possible: at a minimum, control the numbers listed on each axis, name each axis, and give the graph a title. Do you see a clear relationship between expenditures and test scores? Create another graph that includes the estimated regression line. What is implied by the estimated regression line?
- Make another graph using the scatter command that plots each district's maximum salary paid to a teacher with a master's degree (x-axis) against the district's average ACT score (y-axis). Make this graph as nice as possible, and include the estimated regression line. What is implied by the estimated regression line?
- MODEL 1. We are interested in estimating a model in which ACT score is regressed on the minimum BA salary:
act = B0 + B1 * bamin.
Estimate this model. What is the estimated coefficient on minimum BA salary? Interpret this coefficient. What are its standard error, t-stat, and p-value? What is the R-squared of the regression?
- MODEL 2. We are interested in estimating a model in which ACT score is regressed on the per pupil expenditure:
act = B0 + B1 * ppexp.
Estimate this model. What is the estimated coefficient on per pupil expenditure? Interpret this coefficient. What are its standard error, t-stat, and p-value? What is the R-squared of the regression?
- Rescale the four salary variables and per pupil expenditure so they are measured in thousands of dollars. What are the descriptive statistics (mean, standard deviation, minimum, and maximum) for these five variables? How are they different from those in Question 2 (the descriptive statistics computed before rescaling)?
- MODEL 3. Again we are interested in estimating a model in which ACT score is regressed on the minimum BA salary:
act = B0 + B1 * bamin.
Estimate this model. What is the estimated coefficient on minimum BA salary? What are its standard error, t-stat, and p-value? What is the R-squared for the regression? Interpret the t-stat. Would you reject or fail to reject the claim that ACT score is unrelated to minimum BA salary at the 5 percent level? Interpret the p-value. At what level of confidence would you reject the claim that ACT score is unrelated to minimum BA salary? How do the regression results compare to those in MODEL 1? Can you make any conjectures about how regression results change when the scale of the independent variable is changed?
- MODEL 4. Now, estimate the reverse model from MODEL 3. That is, estimate:
bamin = B0 + B1 * act.
How do the coefficient estimate on ACT, its t-stat, and its p-value compare to the results in MODEL 3? How does the R-squared of the regression compare to that for MODEL 3? Can you make any conjectures about which regression results change, and which do not, when the dependent and independent variables are swapped?
- MODEL 5. Now we are interested in estimating a model in which ACT score is regressed on per pupil expenditures:
act = B0 + B1 * ppexp.
Estimate this model. What is the estimated coefficient on per pupil expenditures? What are its standard error, t-stat, and p-value? At what level of confidence would you reject the claim that ACT score is unrelated to expenditures?
- Using the results from Models 3 and 5, what might anti-tax organizations conclude from these regressions? How might the teachers' unions and pro-education spending advocates respond by using economic and/or education arguments? How might an econometrician respond - i.e., how might the regressions in Models 3 and 5 violate the Classical Regression Model?
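The size classification in the first task of Part II can be sketched as follows. The variable name enroll is hypothetical (this handout does not give the name of the enrollment variable, so check the describe output), and p20, p80, and sizelbl are arbitrary names chosen for illustration. Only act is a name taken from the models above.

```stata
* Sketch: enroll is a HYPOTHETICAL name for the enrollment variable.
_pctile enroll, percentiles(20 80)
scalar p20 = r(r1)                // 20th percentile of enrollment
scalar p80 = r(r2)                // 80th percentile of enrollment

* Start everyone with nonmissing enrollment in the middle group,
* then move the districts strictly below or strictly above the cutoffs.
generate size = 2 if !missing(enroll)
replace  size = 1 if enroll < scalar(p20)
replace  size = 3 if enroll > scalar(p80) & !missing(enroll)

label define sizelbl 1 "Small" 2 "Middle" 3 "Large"
label values size sizelbl

* Check that every district falls in exactly one group.
tabulate size, missing

* Group means; add the other variables once their names are known.
tabstat enroll act, by(size) statistics(mean)
```

Because the second and third replace conditions are strict inequalities, districts exactly at the 20th or 80th percentile stay in the Middle group, which matches the definition in that task.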