Introduction to the Data Ferrett
Professor Robert J. Lemke
Department of Economics and Business
Lake Forest College
Fall 2008
The data ferrett is a data portal sponsored by the Bureau of Labor Statistics that allows people to download a vast amount of individual data. Many of you may find one of the many data sets available to you via the data ferrett to be a good source of data for your econometrics project. This page will walk you through the data ferrett as well as getting the data into STATA and manipulating it a bit once you have it there.
You may want to print this webpage before beginning.
- To begin, the data ferrett must be loaded on your machine. Go to http://dataferrett.census.gov.
- Right click on "Install Data Ferrett?".
- Click on Run.
- If you are given options of loading the program or removing the program, this means that an ECON 330 student loaded last year's program on your machine already. Click on the remove tag, run the program to remove what is there. Then begin again. This time when you click on run you will not be asked any questions as it is clear you want to load the program to your machine.
- While running the download, you will have to click on Run, Next, and Finish a number of times.
- Kill your IE window(s)
- You should now see the Data Ferrett Icon on your desktop. Double-click on it.
- You will need to enter your email address.
- When the program is fully loaded, the main window will give you access to many helpful options, including a tutorial, examples, descriptions of data, etc. We will skip these features today, but if you use the data ferrett to obtain data for your project, you should play around with them enough to be certain that you are doing things correctly.
- Click on Get Data Now in the bottom right corner.
- On the left hand side is a list of all of the data surveys that the ferrett gives access to. You can expand that box by clicking and dragging on the right hand side boarder to more fully read all of the survey names.
- Click on the + tag to the right of Current Population Survey.
- You will immediately see more folders, starting with Basic and continuing through Work Schedules. The Current Population Survey (CPS) collects data each month. Some of the questions are the same each month (the Basic survey). Other months and years the CPS focuses on particular behavior (such as fertility) or on a particular population of people (such as veterans). Click on the + tag to the right of Basic.
- The monthly lists of data, from June 2008 through January 1994, are now listed. The May version of each survey contains detailed information on the labor market. For our purposes, therefore, right click on May 2006 and then right click on View Variables.
- At this point, you can have access to all of the variables contained in the May 2006 CPS by clicking on Select All Topics. If you do this, a list of almost 300 variables will pop up. In order to keep things managable, check the boxes next to Demographic Variables and Earnings Variables. Then click on Search Variables.
- A list of 62 variables, along with a variable name, brief description, and the months for which they are avaiblable, appear in the window.
- You could download all 62 variables, but this is unnecessary and would result in an enormous data set. Instead:
- Click on the Name tag to sort the variables alphabetically.
- Then, while holding down the control button, click on PEMARITL, PESEX, PREDUCA5, PRMARSTA, PTDTRACE, and PTERNH10.
- Once these six variables are highlighted, release the control button and click on Browse/Select Variables & Values.
- At the top-left of the window, you will see a list of the six variables. Below that is an opportunity to check a box that says Select ALL Variables. Check that box, and then click on the OK box to the right of the list of variables. A box will appear saying: You have Added 6 variables for your Data Basket. Click on OK. Upon doing this, you will return to the list of variables.
- You are now ready to start downloading your data.
- At the top of the page, click on Step 2: Data Basket / Download / Make a Table.
- The page will change to include a list of all of your variables as well as some tab options on the right hand side of the page. The first thing to do is to click on: Save Selected Variable(s) Codebook.
- After clicking on the tab, you will be given an option as to what to call your codebook and where to save it. I would save it on your u: drive, and I would call it cps_may_2006_codebook.txt. Wherever you save it and whatever you name it, don't forget it.
- Next, click on the Download icon toward the top-center of the window. A new box will pop up.
- Unclick the check on Display data on the Screen. Sometimes it is helpful to see the first 50 rows of the data, but usually this just causes problems, so uncheck the box.
- Check the box for Download Data as this is what you want to do.
- You can check the box for STATA data if you wish, however I would recommend against it. Rather, whenever I download data, I always try to save it as a comma delimited text file. The benefit of this choice is that all software programs understand ASCII. So I would check the box next to Comma Delimited.
- If you were downloading a huge data set, then I would select one of the zip options. As this one won't be too big, have No Compression checked.
- Make sure you are not running the download in batch mode. (Again, this is something you might consider if the data set was going to be absolutely huge.)
- Click on Get Extract.
- The window should change so that you are now given a link to your data set. Right click on the link.
- Click on Save Target As.
- Using the browse features, save the data somewhere. I would change the name and extension to something like cps_may_2006.txt and save it on your u: drive.
- When the data is saved, kill the IE/Ferrett windows that have been opened. The ferrett will ask you if you want to save your work. I always check No. If you need to re-do this process, I find it is always better to begin anew rather than trying to rely on an extraction that you have already begun.
- At this point, the data could be read into Excel.
- Open Excel.
- Open a file, remembering to change to all file types. Search for your downloaded data (e.g., I would find cps_may_2006.txt).
- A window will pop-up. Make sure Delimited is checked. Click Next.
- Make sure the delimit symbol is a comma. Click Next.
- Click Finish.
- You will encounter an error that says: File not fully loaded. The problem is that this data set has more than 65,536 observations, which is all that Excel can hold. With a smaller data set, this would have worked just fine. So, we can either find another way to import so many observations into Excel, or we can try to read the data directly into STATA. We will do the later.
- Kill your Excel window.
- To read your data directly into STATA, open STATA.
- Change the directory so that you are in the same directory as the data. Type "dir" to make sure that you see your data (e.g., in my case, when I type dir I should see cps_may_2006.txt as one of the files.
- Open a log file: "log using cps_may_2006.log, replace".
- Enter in the command line of STATA: set mem 100000k. This will give STATA enough working memory to handle your data.
- Enter in the command line of STATA: insheet using cps_may_2006.txt, or whatever your file is named.
- After the data has been loaded, enter describe and summarize in the command line of STATA to see the format and summary statistics of your data. You should notice something odd. Before addressing the oddity, save your data. In my case, I would type: save cps_may_2006.dta, replace.
- The oddity is that there are now 9 variables and not 6. Looking at your data, you should notice three new variables: hrhhid, hrhhid2, and occurnum. It turns out that the data ferrett exports CPS data, it always includes the household id number and the occurance of the observation within the household. In some cases, these are important variables. As we only care about the first observence, enter into STATA: keep if occurnum==1. Notice the double equal sign as this is not generating a new variable. Moreover, of the original 136,109 observations, only 53,710 are kept while 82,399 are deleted as they are not associated with the first occurance. Now enter: drop hrhhid hrhhid2 occurnum. Describe the data again, and you will see that we have 6 variables and 53,710 observations.
- What is the difference between marital status (PEMARITL) and the marital status recode (PRMARSTA)? There are a couple of ways to go about looking for the difference, but the first step is to always look at the codebook. Open your codebook in a wordprocessing program (I prefer WordPad), and you will see that the difference concerns some extra categories for spouses. To determine how big of a difference there is between the variables, tabulate both variables: tab pemaritl prmarst. Using this tabulation along with the codebook, we see that the difference is that the recoded variable discerns between a person with a civilian spouse vs. a non-civilian (armed forces) spouse. Suppose we are interested in distinguishing between people who get and stayed married from those who get married and then separate from those who never marry. To do this, we could use either marital status variable. Enter the following commands (and try to predict what each will do to the data set):
- drop if pemaritl==-1
- gen status=1*(pemaritl<=3)+2*(pemaritl==4|pemaritl==5)+3*(pemaritl==6)
- label define marstats 1 Married 2 Divorced 3 Single
- label values status marstats
- tab status
- gen married=(status==1)
- gen divorced=(status==2)
- gen single=(status==3)
- drop pemaritl prmarsta
Notice that we have generated four variables: status takes on one of three values to indicate married, divorced, or single. We also created three dummy variables for married, divorced, and single. With these variables defined, we then dropped the original variables of pemaritl and prmarst. For the record, notice that the first drop command (i.e., drop if pemaritl==.) drops observations with missing data on marital status while the second drop command (i.e., drop pemaritl prmarst) drops variables. You always need to understand if you are dropping variables or dropping observations.
- We now want to adjust our variable for sex. Notice that if you tabulate pesex, you can't tell which observations are male and which are female. This is why you must have a codebook. The codebook tells us that males are classified with a 1 while females are classified with a 2. Enter the following commands:
- tab pesex
- gen male=(pesex==1)
- gen female=(pesex==2)
- tab male female
- gen sex=male
- label define sexes 1 Male 0 Female
- label values sex sexes
- drop pesex
Notice that the dummy variables for male and female represent identical information.
- Now consider the education variable. According to the codebook, there are five classifications: less than a high school diploma high school graduates with no college, high school graduates with some college, associate degree holders, and bachelor degree holders. Enter the following commands:
- gen educ=preduca5
- recode educ 4=3
- recode educ 5=4
- label define edclass 1 "No HSD" 2 "HSD" 3 "Some Col" 4 "Col Grad"
- label values educ edclass
- tab educ
- gen nohsd=(educ==1)
- gen hsd=(educ==2)
- gen somecol=(educ==3)
- gen college=(educ==4)
- drop preduca5
- If you tab ptdtrace, you will notice that there are a bunch of different categories for race. We will keep things simple and classify people as white, black, native american, asian, or other. To do this, however, we must look to the codebook for guidance. Enter the following commands:
- gen race=1*(ptdtrace==1)+2*(ptdtrace==2)+3*(ptdtrace==3)+4*(ptdtrace==4)+5*(ptdtrace>=5)
- tab race
- gen white=(race==1)
- gen black=(race==2)
- gen natam=(race==3)
- gen asian=(race==4)
- gen othrc=(race==5)
- label define races 1 White 2 Black 3 "Nat Am" 4 Asian 5 Other
- label values race races
- drop ptdtrace
- The final variable we want to create is the hourly wage of the respondent. By summarizing the wage variable with detail -- "sum pternh1o, detail" -- one can immediately see that there are a lot of non-respondents to this question (for various reasons). Actually, one immediately sees that there are a lot of negative numbers reported. Checking with the codebook suggests that these are simply people who didn't report an hourly wage. Enter the following commands:
- gen wage=pternh1o
- sum wage, d
- keep if wage>0
- gen lnwage=log(wage)
- drop pternh1o
- order wage lnwage status married single divorced sex male female educ nohsd hsd somecol college race white black natam asian othrc
- describe
- summarize
- At this point, save your current data set, but change its name so that you don't over-write your original data.
- save cps_may_2006_workers, replace
- Lastly, we will do some data analysis. Consider the following:
- tab educ sex, col
- tab educ race, col
- tab educ status, col
- tab race status, row
- tab race sex
- reg wage divorced single female nohsd somecol college black natam asian othrc
- reg lnwage divorced single female nohsd somecol college black natam asian othrc
- log close