Stata Lab 2: Introduction to the Data Ferrett

Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012

The data ferrett is a data portal sponsored by the Bureau of Labor Statistics that allows people to download a vast amount of data. Many of you may find one of the many data sets available to you via the data ferrett to be a good source of data for your econometrics project. This page will walk you through the data ferrett as well as getting the data into STATA and manipulating it a bit once you have it there.

You may want to print this webpage before beginning. By the end of this lab, you will have created and saved a dataset called cps_may_2011_workers.dta which you will use in your first Stata project.

To begin, the data ferrett must be loaded on your machine. Go to http://dataferrett.census.gov.
- Left click on "Get Data ~ Run" at the top of the right side pane.
- Hold down the CTRL button and then left click on Data Ferrett in the middle of the page. You might be asked something about trusting the website. Please check that you do.
- If you are given options of loading the program or removing the program, this means that an ECON 330 student loaded last year's program on your machine already. Click on the remove tag, run the program to remove what is there. Then begin again. This time when you click on run you will not be asked any questions as it is clear you want to load the program to your machine.
You should now see the Data Ferrett Icon on your desktop (if so, double-click on it) or the Data Ferrett should have launced automatically.
- You will need to enter your email address.
- When the program is fully loaded, the main window will give you access to many helpful options, including a tutorial, examples, descriptions of data, etc. We will skip these features today, but if you use the data ferrett to obtain data for your project, you should play around with them enough to be certain that you are doing things correctly.
- Click on Get Data Now in the bottom right corner.
- On the left hand side is a list of all of the data surveys that the ferrett gives access to. You can expand that box by clicking and dragging on the right hand side border to more fully read all of the survey names.
- Click on the + tag to the right of Current Population Survey.
- You will immediately see more folders, starting with Basic and continuing through Work Schedules. The Current Population Survey (CPS) collects data each month. Some of the questions are the same each month (the Basic survey). Other months and years the CPS focuses on particular behavior (such as fertility) or on a particular population of people (such as veterans). Click on the + tag to the right of Basic.
- The monthly lists of data, from July 2012 through January 1994, are now listed. The May version of each survey contains detailed information on the labor market. For our purposes, therefore, right click on May 2011 and then left click on View Variables.
- At this point, you can have access to all of the variables contained in the May 2011 CPS by clicking on Select All Topics. If you do this, a list of almost 300 variables will pop up. In order to keep things managable, check the boxes next to Demographic Variables and Earnings Variables. Then click on Search Variables.
- A list of 67 variables, along with a variable name, brief description, and the months for which they are avaiblable, appear in the window.
You could download all 67 variables, but this is unnecessary and would result in an enormous data set. Instead:
- Click on the Name tag to sort the variables alphabetically.
- Then, while holding down the control button, click on PEMARITL, PESEX, PREDUCA5, PRMARSTA, PTDTRACE, and PTERNH10.
- Once these six variables are highlighted, release the control button and click on Browse/Select Variables & Values.
- At the top-left of the window, you will see a list of the six variables. Below that is an opportunity to check a box that says Select ALL Variables. Check that box, and then click on the OK box to the right of the list of variables. A box will appear saying: You have Added 6 variables for your Data Basket. Click on OK. Upon doing this, you will return to the list of variables.
You are now ready to start downloading your data.
- At the top of the page, click on Step 2: Data Basket / Download / Make a Table.
- The page will change to include a list of all of your variables as well as some tab options on the right hand side of the page. The first thing to do is to click on: Save Selected Variable(s) Codebook.
- After clicking on the tab, you will be given an option as to what to call your codebook and where to save it. I would save it on your U: drive, and I would call it cps_may_2011_codebook.txt. Wherever you save it and whatever you name it, don't forget it.
- Next, click on the Download icon toward the top-center of the window. A new box will pop up.
- Make sure the box for Download Data is checked as this is what you want to do.
- You can check the box for STATA data if you wish, however I would recommend against it. Rather, whenever I download data, I always try to save it as a comma delimited text file. The benefit of this choice is that all software programs understand ASCII. So check the box next to Comma Delimited.
- If you were downloading a huge data set, then I would select one of the zip options. As this one won't be too big, have No Compression checked.
- Make sure you are not running the download in batch mode. (Again, this is something you might consider if the data set was going to be absolutely huge.)
- Click on Get Extract.
The window should change so that you are now given a link to your data set. Right click on the link.
- Click on Save Target As.
- Using the browse features, save the data somewhere. I would change the name and extension to something like cps_may_2011.txt and save it on your U: drive.
- When the data is saved, kill the IE/Ferrett windows that have been opened. The ferrett will ask you if you want to save your work. I always check No. If you need to re-do this process, I find it is always better to begin anew rather than trying to rely on an extraction that you have already begun.
- Everyone should be able to use the Data Ferrett. Just in case someone is having severe problems during the in-class lab experience, however, one can right click here on the text data file and on the codebook and save them to one's U drive.
At this point, the data could be read into Excel. (YOU DON'T NEED TO DO THIS, BUT YOU SHOULD BE AWARE THAT IT IS AN OPTION.)
- Open Excel.
- Open a file, remembering to change to all file types. Search for your downloaded data (e.g., I would find cps_may_2011.txt).
- A window will pop-up. Make sure Delimited is checked. Click Next.
- Make sure the delimit symbol is a comma. Click Next.
- Click Finish.
- Kill your Excel window.
To read your data directly into STATA, open STATA.
- Change the directory so that you are in the same directory as the data. Type "dir" to make sure that you see your data (e.g., in my case, when I type dir I should see cps_may_2011.txt as one of the files.
- Open a log file:
- Enter in the command line of STATA:
- After the data has been loaded, enter describe and summarize in the command line of STATA to see the format and summary statistics of your data. You should notice something odd. Before addressing the oddity, save your data. In my case, I would type:
- The oddity is that there are now 10 variables, not 6. Looking at your data, you should notice four new variables: hrhhid, hrhhid2, occurnum, and yyyymm. It turns out that the data ferrett exports CPS data, it always includes the household id number(s), the occurance of the observation within the household, and the month and year of the observation. In some cases, these are important variables. As we only care about the first observence, enter into STATA:
  Notice the double equal sign as this is not generating a new variable. Moreover, of the original 134,703 observations, only 53,512 are kept while 81,191 are deleted as they are not associated with the first occurance. Now enter:
  Describe the data again, and you will see that we have 6 variables and 53,512 observations.
- Save the current data set with a new name just in case you need to return to the original data sometime:
What is the difference between marital status (pemaritl) and the marital status recode (prmarsta)? There are a couple of ways to go about looking for the difference, but the first step is to always look at the codebook. Open your codebook in a wordprocessing program (I prefer WordPad), and you will see that the difference concerns some extra categories for spouses. To determine how big of a difference there is between the variables, tabulate both variables:
Using this tabulation along with the codebook, we see that the difference is that the recoded variable discerns between a person with a civilian spouse vs. a non-civilian (armed forces) spouse. Suppose we are interested in distinguishing between people who get and stayed married from those who get married and then separate from those who never marry. To do this, we could use either marital status variable. Enter the following commands (and try to predict what each will do to the data set):
Notice that we have generated four variables: status takes on one of three values to indicate married, divorced, or single. We also created three dummy variables for married, divorced, and single. With these variables defined, we then dropped the original variables of pemaritl and prmarst. For the record, notice that the first drop command (i.e., drop if pemaritl==-1) drops observations with "strange" data on marital status while the second drop command (i.e., drop pemaritl prmarst) drops variables. You always need to understand if you are dropping variables or dropping observations.
We now want to adjust our variable for sex. Notice that if you tabulate pesex, you can't tell which observations are male and which are female. This is why you must have a codebook. The codebook tells us that males are classified with a 1 while females are classified with a 2. Enter the following commands:
Notice that the dummy variables for male and female represent identical information.
Now consider the education variable. According to the codebook, there are five classifications: less than a high school diploma, high school graduates with no college, high school graduates with some college, associate degree holders, and bachelor degree holders. We don't want to treat associate degree holders like college graduates, so we will group them with the some college folks. And once the 4's (associate degrees) have been recoded as 3's (some college), we might as well recode the 5's (college degree holders) to have a value of 4. To do all of this, enter the following commands:
If you tab ptdtrace, you will notice that there are a bunch of different categories for race. We will keep things simple and classify people as white, black, native american, asian, or other. To do this, however, we must look to the codebook for guidance. Enter the following commands:
Notice that you only need quotes around the actual label when there is a space in the label.
The final variable we want to create is the hourly wage of the respondent. By summarizing the wage variable with detail:
one can immediately see that there are a lot of non-respondents to this question (for various reasons). Actually, one immediately sees that there are a lot of negative numbers reported. Checking with the codebook suggests that these are simply people who didn't report an hourly wage. Enter the following commands:
At this point, save your current data set, but change its name so that you don't over-write your original data.
Lastly, we will do some data analysis. Enter the following: