Stata Project #2: Merging and Moving Datasets

Robert J. Lemke
Department of Economics and Business
Lake Forest College
Copyright 2012

Due: Start of class on Thursday, October 10

Consider the following situation. Professor Lemke has Stata SE on his office computer. Stata SE can handle datasets with as many as 20,000 variables. Students, however, have access to Stata IE on the computers in room 237 of the library. Stata IE, unfortunately, can handle datasets with at most 1,600 variables.

Katie is working on her senior thesis under the direction of Professor Lemke. Katie’s research requires her to use the 2007 Survey of Consumer Finances (called the 2007 SCF for short). The 2007 SCF contains over 5,000 variables. About 3,000 variables are called Jzzzz where zzzz denotes up to a 4-digit number (such as J5, J43, J703, and J9102). The 2007 SCF also contains about 2,000 variables called Xzzzz where zzzz denotes up to a 4-digit number (such as X3, X15, X507, and X9913). The variable numbers are spread essentially uniformly across 0 to 9999. That is, although only 30% of the potential zzzz extensions for the J variables and only 20% of the potential extensions for the X variables are used, the numbers that are used are distributed evenly across the range of potential numbers. The 2007 SCF contains one more variable called YY, which is a unique identification number for each observation.

Notice that Professor Lemke’s version of Stata can handle all 5,000 variables of the 2007 SCF raw data, called SCF2007.dta, but the version of Stata available to Katie in the computer lab cannot. Luckily, Katie’s project does not require all 5,000 variables. In fact, her analysis will use only about 200 variables. Unfortunately, she isn’t quite sure which variables she will need. She does know that she won’t need any of the Jzzzz variables.

Not wanting to bother Professor Lemke every time she finds a new Xzzzz variable she needs to use, Katie devised a two-step plan. She will write two Stata programs called programA.do and programB.do. Once they are written, she will run programA.do on Professor Lemke’s computer. Running programA.do will produce two new (smaller) datasets called SCF2007A1.dta and SCF2007A2.dta. After they are created, Professor Lemke will email both files (SCF2007A1.dta and SCF2007A2.dta) to Katie.

The second step of Katie’s plan is to write a Stata program called programB.do that creates the dataset SCF2007EXT.dta, which contains all of the Xzzzz variables that Katie will need in her research. Moreover, programB.do will be easy to edit, so that if Katie later finds that she needs more Xzzzz variables, she will be able to edit programB.do and re-run it to get a new version of SCF2007EXT.dta. Note too that when she re-runs programB.do, she will not need to re-run programA.do on Professor Lemke's computer. Thus, every Xzzzz variable must be in either SCF2007A1.dta or SCF2007A2.dta when Professor Lemke emails both files to Katie. At the same time, neither of these files can contain more than 1,600 variables; otherwise Katie will not be able to use Stata IE in the computer lab in the library.
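One way to meet both constraints on SCF2007A1.dta and SCF2007A2.dta is to split the Xzzzz variables at the midpoint of the number range. Because about 2,000 X variables are spread evenly over 0 to 9999, roughly 1,000 of them have numbers below 5000 and roughly 1,000 have numbers of 5000 or above, so each file (with YY added) holds about 1,001 variables, well under 1,600. The do-file below is only a sketch of that idea, not the required solution. The cutoff of 5000, the macro names lowX and highX, and the maxvar setting of 6000 are assumptions made for illustration.

```stata
* programA.do (sketch): run on the office computer with Stata SE.
clear all

* Assumed setting: room for the 5,000+ variables in SCF2007.dta.
set maxvar 6000
use SCF2007.dta, clear

* Sort every Xzzzz variable into one of two lists by its number.
* lowX collects numbers 0 to 4999; highX collects 5000 to 9999.
local lowX
local highX
foreach v of varlist X* {
    * Numeric part of the name, e.g. 507 for X507.
    local num = real(substr("`v'", 2, .))
    if `num' < 5000 {
        local lowX `lowX' `v'
    }
    else {
        local highX `highX' `v'
    }
}

* First file: YY plus the low-numbered X variables.
preserve
keep YY `lowX'
quietly describe
* Stop here if the file would be too wide for the lab computers.
assert r(k) <= 1600
save SCF2007A1.dta, replace
restore

* Second file: YY plus the high-numbered X variables.
keep YY `highX'
quietly describe
assert r(k) <= 1600
save SCF2007A2.dta, replace
```

Keeping YY in both files is what allows them to be matched observation by observation later. With this rule, the file that holds any given X variable can be read directly from its number.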

The requirement of this Stata project is for you to write both programs. That is:

Write a program called programA.do to be run on Professor Lemke's computer that uses SCF2007.dta to create two Stata datasets called SCF2007A1.dta and SCF2007A2.dta that can be used on the library’s computers. There are two requirements here. First, neither dataset can contain more than 1,600 variables, so that both can be used on the library’s computers. Second, every Xzzzz variable must be included in at least one of the two datasets, and it must be easy to tell which variables are in which dataset so that Katie can extract them at a future date.
Write a program called programB.do that uses SCF2007A1.dta and SCF2007A2.dta to create a single dataset called SCF2007EXT.dta that contains any and all of the Xzzzz variables that Katie decides she needs. (As mentioned above, Katie will never need more than 200 of the Xzzzz variables, but at this stage in her research she doesn’t know which of the Xzzzz variables she will need.)
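
To illustrate the second program, here is a minimal sketch of what programB.do might look like. It assumes that SCF2007A1.dta holds YY and the X variables numbered below 5000, and that SCF2007A2.dta holds YY and the X variables numbered 5000 and above. That split rule, the macro names want, fromA1, and fromA2, and the four variables in the list (taken from the example names above) are illustrative assumptions, and other designs would work as well.

```stata
* programB.do (sketch): run on a library computer.
clear all

* Katie edits only this list when she needs more variables.
local want X3 X15 X507 X9913

* Work out which file holds each requested variable.
local fromA1
local fromA2
foreach v of local want {
    * Numeric part of the name, e.g. 9913 for X9913.
    local num = real(substr("`v'", 2, .))
    if `num' < 5000 {
        local fromA1 `fromA1' `v'
    }
    else {
        local fromA2 `fromA2' `v'
    }
}

* Load YY and the requested variables from the first file.
use YY `fromA1' using SCF2007A1.dta, clear

* Add the requested variables from the second file, matching on YY.
* The merge is skipped when nothing is requested from that file.
if "`fromA2'" != "" {
    merge 1:1 YY using SCF2007A2.dta, keepusing(`fromA2') assert(match) nogenerate
}

save SCF2007EXT.dta, replace
```

Because YY uniquely identifies each observation and both files come from the same raw data, every observation should match, which is what the assert(match) option checks. To add a variable later, Katie would append its name to the want list and re-run programB.do, without re-running programA.do.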