Stata Lab 2: Introduction to the Data Ferrett

Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012


The data ferrett is a data portal run by the U.S. Census Bureau that lets anyone download a vast amount of data. One of the many data sets available through the data ferrett may be a good source of data for your econometrics project. This page walks you through using the data ferrett, getting the data into Stata, and manipulating it a bit once it is there.

You may want to print this webpage before beginning. By the end of this lab, you will have created and saved a dataset called cps_may_2011_workers.dta which you will use in your first Stata project.

  1. To begin, the data ferrett must be loaded on your machine. Go to http://dataferrett.census.gov.

  2. You should now see the Data Ferrett icon on your desktop (if so, double-click on it), or the Data Ferrett should have launched automatically.

  3. You could download all 67 variables, but this is unnecessary and would result in an enormous data set. Instead:

  4. You are now ready to start downloading your data.

  5. The window should change to show a link to your data set. Right-click on the link.

  6. At this point, the data could be read into Excel. (YOU DON'T NEED TO DO THIS, BUT YOU SHOULD BE AWARE THAT IT IS AN OPTION.)

  7. To read your data directly into Stata, open Stata.

  8. What is the difference between marital status (pemaritl) and the marital status recode (prmarsta)? There are a couple of ways to look for the difference, but the first step is always to look at the codebook. Open your codebook in a word-processing program (I prefer WordPad), and you will see that the difference concerns some extra categories for spouses. To determine how big a difference there is between the variables, tabulate both variables:
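    A minimal sketch of that tabulation, assuming the two variables were downloaded under the names pemaritl and prmarsta used above. The cross-tabulation on the last line is an addition for illustration; it shows directly how the categories of one variable map into the other.

    ```stata
    * Frequency table for each marital status variable
    tabulate pemaritl
    tabulate prmarsta

    * Cross-tabulate the two to see how their categories line up
    tabulate pemaritl prmarsta, missing
    ```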

    Using this tabulation along with the codebook, we see that the recoded variable distinguishes between a person with a civilian spouse and a person with a non-civilian (armed forces) spouse. Suppose we are interested in distinguishing among people who marry and stay married, those who marry and then separate, and those who never marry. To do this, we could use either marital status variable. Enter the following commands (and try to predict what each will do to the data set):
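    One way these commands could look is sketched below. The numeric codes for pemaritl, and the choice to group widowed respondents with the married, are assumptions made for illustration; confirm every code against your codebook before relying on the result.

    ```stata
    * Drop OBSERVATIONS (rows) with no usable marital status
    drop if pemaritl==-1

    * status: 1 = married, 2 = divorced, 3 = single
    * ASSUMPTION: pemaritl codes 1-3 = married or widowed,
    * 4-5 = divorced or separated, 6 = never married.
    generate status = .
    replace status = 1 if inlist(pemaritl, 1, 2, 3)
    replace status = 2 if inlist(pemaritl, 4, 5)
    replace status = 3 if pemaritl == 6

    * One dummy variable per category (left missing if status is missing)
    generate married  = (status == 1) if !missing(status)
    generate divorced = (status == 2) if !missing(status)
    generate single   = (status == 3) if !missing(status)

    * Drop the original VARIABLES (columns) now that they are recoded
    drop pemaritl prmarsta
    ```

    Running tabulate status afterward is a quick check that every remaining observation landed in exactly one category.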

    Notice that we have generated four variables: status takes on one of three values to indicate married, divorced, or single, and there are three dummy variables for married, divorced, and single. With these variables defined, we then dropped the original variables pemaritl and prmarsta. For the record, notice that the first drop command (i.e., drop if pemaritl==-1) drops observations with "strange" data on marital status, while the second drop command (i.e., drop pemaritl prmarsta) drops variables. You always need to understand whether you are dropping variables or dropping observations.

  9. We now want to adjust our variable for sex. Notice that if you tabulate pesex, you can't tell which observations are male and which are female. This is why you must have a codebook. The codebook tells us that males are classified with a 1 while females are classified with a 2. Enter the following commands:
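    A minimal sketch of these commands, using the codebook values given above (1 = male, 2 = female). The final tabulation is an added check, not necessarily part of the original lab.

    ```stata
    * pesex: 1 = male, 2 = female (from the codebook)
    generate male   = (pesex == 1) if !missing(pesex)
    generate female = (pesex == 2) if !missing(pesex)

    * Check the new dummy against the original coding
    tabulate pesex male
    ```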

    Notice that the dummy variables for male and female carry identical information: female always equals 1 minus male.

  10. Now consider the education variable. According to the codebook, there are five classifications: less than a high school diploma, high school graduates with no college, high school graduates with some college, associate degree holders, and bachelor's degree holders. We don't want to treat associate degree holders like college graduates, so we will group them with the some-college group. And once the 4's (associate degrees) have been recoded as 3's (some college), we might as well recode the 5's (college degree holders) to have a value of 4. To do all of this, enter the following commands:
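    The recoding described above can be sketched as follows. The text does not name the education variable, so educ5 is a hypothetical placeholder (as are educ and educlbl); substitute the name from your own codebook. The sketch builds a new variable rather than overwriting the original, so the two can be compared.

    ```stata
    * ASSUMPTION: educ5 is a placeholder name for the five-category
    * education variable in your download; replace it with the real name.
    generate educ = educ5

    * Associate degree (4) joins some college (3)
    replace educ = 3 if educ5 == 4

    * Bachelor's degree (5) becomes category 4
    replace educ = 4 if educ5 == 5

    * Attach value labels so tabulations are readable
    label define educlbl 1 "less than HS" 2 "HS graduate" 3 "some college" 4 "college graduate"
    label values educ educlbl
    tabulate educ
    ```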

  11. If you tab ptdtrace, you will notice that there are a bunch of different categories for race. We will keep things simple and classify people as white, black, native american, asian, or other. To do this, however, we must look to the codebook for guidance. Enter the following commands:
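    A sketch of one possible set of commands. The ptdtrace codes used here are assumptions (1 = white only, 2 = black only, 3 = American Indian or Alaskan Native only, 4 = Asian only, everything else treated as other), and race and racelbl are names chosen for illustration. Verify each code in your codebook.

    ```stata
    * ASSUMPTION: ptdtrace codes 1-4 are the single-race categories listed
    * in the paragraph above; all remaining codes are grouped as "other".
    generate race = 5 if !missing(ptdtrace)
    replace race = 1 if ptdtrace == 1
    replace race = 2 if ptdtrace == 2
    replace race = 3 if ptdtrace == 3
    replace race = 4 if ptdtrace == 4

    * Define the value labels and attach them to the new variable
    label define racelbl 1 white 2 black 3 "native american" 4 asian 5 other
    label values race racelbl
    tabulate race
    ```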

    Notice that you only need quotes around the actual label when there is a space in the label.

  12. The final variable we want to create is the hourly wage of the respondent. Summarizing the wage variable with detail immediately shows that a lot of negative numbers are reported. Checking with the codebook suggests that these are simply people who didn't report an hourly wage (for various reasons), so there are a lot of non-respondents to this question. Enter the following commands:
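    A minimal sketch of the summary and the cleanup. The text does not name the wage variable, so hrwage_raw is a hypothetical placeholder, and wage is a name chosen for illustration. The sketch sets unreported (negative) wages to missing; dropping those observations instead is an equally reasonable choice.

    ```stata
    * ASSUMPTION: hrwage_raw is a placeholder name for the hourly wage
    * variable in your download; replace it with the real name.
    summarize hrwage_raw, detail

    * Count the negative (not reported) values before changing anything
    count if hrwage_raw < 0

    * Keep reported wages; negative codes become missing (.)
    generate wage = hrwage_raw if hrwage_raw >= 0
    summarize wage, detail
    ```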

  13. At this point, save your current data set, but change its name so that you don't overwrite your original data.
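    For example, using the file name given at the top of this lab. The replace option is an assumption added so that the command can be re-run; it overwrites only a file of this same name, not your original download.

    ```stata
    * Save under a new name so the original download is left untouched
    save "cps_may_2011_workers.dta", replace
    ```

    The file is written to Stata's current working directory, which the pwd command displays.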

  14. Lastly, we will do some data analysis. Enter the following:
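    The sketch below shows the kind of analysis that is now possible. It is an illustration, not necessarily the commands intended for this step. It uses the status and male variables created above and assumes the cleaned hourly wage variable is named wage.

    ```stata
    * Distribution of hourly wages
    summarize wage, detail

    * Mean, standard deviation, and count of wages by marital status and sex
    tabstat wage, by(status) statistics(mean sd n)
    tabstat wage, by(male) statistics(mean sd n)

    * Simple regression of the hourly wage on the male dummy
    regress wage male
    ```

    In the regression, the coefficient on male is the difference in mean hourly wage between men and women in the sample.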