Design Options for SESTAT for the Current Decade: Statistical Issues
Overview of 1990s SESTAT Design
This section describes the SESTAT suite of surveys of the 1990s. It starts with a description of the target population and the three survey components and then reviews the coverage gaps and response rates that have been achieved. The section closes with a brief review of the weighting and variance estimation procedures used in SESTAT.
The target population for the SESTAT database of the 1990s is residents of the United States who have at least a bachelor's degree and who, as of a specified reference date, were noninstitutionalized, age 75 or younger, and either had a degree in science and engineering (S&E) or were working as a scientist or an engineer. S&E comprises the broad categories of computer and mathematical sciences, life and related sciences, physical and related sciences, social and related sciences, and engineering (for a complete definition, see NSF/SRS 1999). The SESTAT definition includes the following two groups:
The definition of a scientist or engineer requires a college degree (i.e., a bachelor's degree or higher). Those working in S&E occupations who do not have college degrees (e.g., individuals with associate's degrees in any field) are not covered in the SESTAT surveys and database.
Some analysts may prefer to restrict their definition to a subset of the target population, such as individuals with S&E degrees. As discussed below, coverage gaps will be less of a concern when the definition is restricted to those with S&E degrees. The occupational mobility of college graduates without S&E degrees into and out of S&E occupations causes significant coverage problems.
Components of SESTAT
The SESTAT database includes three components, each designed to represent a different part of the target population. The NSCG represents U.S. scientists and engineers who were already in the population at the previous round of data collection, the NSRCG represents new S&E bachelor's and master's graduates from U.S. institutions since the last round of data collection, and the SDR represents the population of those who earned U.S. doctorates. The three components are described briefly in the following sections. Detailed descriptions can be found on the NSF/SRS website (http://www.nsf.gov/statistics/survey.cfm).
National Survey of College Graduates
The largest component of the SESTAT database in the 1990s was the NSCG, which was conducted in 1993. The mode of collection for the NSCG was mostly mail with computer-assisted telephone interviewing (CATI) and personal interviewing followup. The sample for the 1993 NSCG was drawn from the 1990 census long form records, which contained information on degree level, field of occupation, country of origin, and date of entry into the United States, along with demographic information. The sample from the long form was restricted to individuals who had at least a bachelor's degree. However, because the long form did not include a question about degree field, the 1993 NSCG sample included individuals with S&E degrees and those with only non-S&E degrees. The sample thus consisted of three components: those with S&E degrees, whether or not they were working in an S&E occupation; those with degrees only in non-S&E fields who were working in S&E occupations; and those with degrees only in non-S&E fields who were not working in S&E occupations. The third group is not part of the S&E target population. In the 1993 NSCG, there were 148,932 respondents plus an additional 19,244 cases deemed to be ineligible (e.g., deceased, over 75 years old, or no longer in the United States) out of a total sample of 214,643. Of the 148,932 responses, 74,693 (50%) fit the definition of the S&E target population.
The NSCG was also conducted in 1995, 1997, and 1999 (the NSCG was not fielded in 2001). The 1995 NSCG was administered to all 1993 NSCG respondents who were classified in 1993 as being in the target population and to a sample of individuals from the 1993 NSRCG sample (as described below) to represent recent graduates. Individuals without S&E degrees who were working in non-S&E occupations were eliminated from the 1995 NSCG sample along with individuals who aged out of the sample when they became older than 75 and those who could have been selected for the current NSRCG or SDR components of SESTAT—the latter called overlap cases. The 1997 and 1999 NSCG samples were similar to the 1995 sample except they consisted of only a sample of the respondents from the previous NSCG cycle.
National Survey of Recent College Graduates
The NSRCG is designed to identify recent college graduates who are not represented in the previous round of the SESTAT surveys. The NSRCG was conducted in 1993, 1995, 1997, 1999, and 2001. The sample for each cycle covered individuals who graduated in the previous 2 academic years with a bachelor's or master's degree and a major in an S&E field. Subsamples of NSRCG respondents from a given round were carried forward for inclusion in the following NSCG survey. The mode of collection for the NSRCG was mostly CATI with mail followup.
The NSRCG employed a two-stage sample design for the 1993 through 2001 surveys. The first stage was a stratified sample of colleges and universities in the United States that awarded bachelor's and/or master's degrees in science or engineering. The institutions were selected with probability proportional to the number of graduates in the S&E reporting categories from a sampling frame constructed from the Integrated Postsecondary Education Data System (IPEDS), which is maintained by the National Center for Education Statistics. The 1993 NSRCG institution sample design was based on the institution sample used during the previous decade. The sample design was revised for the 1995 cycle; it used a new institution sample selected using the 1991–92 IPEDS Completions file. The 1995 first-stage sample was used again in 1997. During the 1999 survey cycle, this sample was evaluated and supplemented with additional institutions from the 1994–95 IPEDS. Similarly, in the 2001 survey cycle, the sample was supplemented with institutions from the 1996–97 IPEDS. The second sampling stage involved the selection of S&E bachelor's and master's graduates from lists provided by the sampled institutions for each cycle of the survey.
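The first-stage selection with probability proportional to size can be illustrated with a minimal sketch. The text does not say which selection algorithm was used, so systematic selection within a single stratum is assumed here purely for illustration; the function name and the graduate counts are hypothetical.

```python
import random
from itertools import accumulate


def pps_systematic(sizes, n, start=None, rng=None):
    """Select n units with probability proportional to size.

    Assumption: systematic PPS within one stratum. A unit whose size
    exceeds the sampling interval can be hit more than once; in practice
    such units are usually set aside as certainty selections.
    """
    total = sum(sizes)
    interval = total / n
    if start is None:
        rng = rng or random.Random()
        start = rng.random() * interval  # random start in [0, interval)
    cumulative = list(accumulate(sizes))
    selected = []
    i = 0
    for k in range(n):
        point = start + k * interval
        # Advance to the unit whose cumulative size range contains the point.
        while cumulative[i] <= point:
            i += 1
        selected.append(i)
    return selected


# Hypothetical counts of S&E graduates at four institutions.
graduates = [100, 300, 50, 550]
picked = pps_systematic(graduates, n=2, start=120.0)
```

With a fixed `start`, the selection is reproducible; omitting it draws a random start, which is what gives each institution a selection probability proportional to its graduate count.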
Survey of Doctorate Recipients
The SDR represents individuals with earned doctorates from a U.S. degree-granting institution. Doctorate recipients are sampled separately in the SESTAT surveys because of a desire to increase the sample of earned doctorates and to maintain comprehensive information on this group. The primary source of information for the frame of doctorate recipients is the Survey of Earned Doctorates (SED), which is a census of newly granted research doctorates in the United States that has been collected each year since 1957. Before 1957, the National Academy of Sciences maintained a register of highly qualified scientists and engineers assembled from a variety of sources, and such recipients are represented in the SDR panel until they age out of the sample. During the 1990s, the SDR sample included prior samples (with small maintenance reductions each cycle) and a stratified sample of recent doctorate recipients. The SDR was conducted in 1993, 1995, 1997, 1999, and 2001. The mode of collection for the SDR has mostly been mail with CATI followup.
The SESTAT target population includes residents of the United States who, as of the survey reference period, were noninstitutionalized, age 75 or younger, had at least a bachelor's degree, and either had a bachelor's degree or higher in an S&E field or were working as a scientist or an engineer. However, certain groups that are intended to be in the target population are either covered only partially or not covered at all in the SESTAT database.
One main group only partially covered is referred to as the "foreign degreed." Those not covered include individuals who were not residents of the United States as of 1 April 1990 (except those serving in the U.S. Armed Forces overseas) and who received a degree from a foreign degree-granting institution but not from a U.S. institution. Also not covered are those who were residents of the United States at the time of the 1990 decennial census and at that time had no degree but later received a degree from a foreign institution. Under the 1990s design, the foreign degreed are included in the SESTAT database only if they were included in the 1990 decennial census and already had at least a bachelor's degree. The undercoverage of the foreign-degreed group increased over the decade because of immigration and because of individuals who received foreign degrees after the census was conducted. However, some of these individuals subsequently obtained U.S. S&E degrees and so became part of the sampled population in the NSRCG or the SDR.
Another group that is only partially covered comprises individuals working in an S&E occupation who do not have an S&E degree. Individuals working in S&E occupations who initially graduated after 1 April 1990 and who have only non-S&E degrees are not covered. Also, among those who obtained a degree before 1 April 1990 and who have only non-S&E degrees, only those who were working in an S&E occupation in 1993 are covered. Again, for this group, the undercoverage increased over the course of the decade.
In general, the populations that are partially covered by the SESTAT database fall into one of the two groups described above. Additional details about the undercoverage can be found on the SESTAT website (http://sestat.nsf.gov/docs/techinfo.html#targetpop). These details are briefly summarized in the following sections.
Groups Not Covered in 1993
Within the coverage defined for the SESTAT integrated database, the following individuals with bachelor's and master's degrees were not included in the 1993 surveys.
Doctorate-level individuals with S&E degrees who were not surveyed in 1993 were predominately U.S. residents who (1) received an S&E doctorate after June 1992 or (2) earned that degree at a foreign institution and
Groups Not Covered in 1995
The following individuals with bachelor's and master's degrees were not surveyed and therefore are not represented in the 1995 SESTAT integrated database.
Doctorate-level individuals with S&E degrees who were not surveyed in 1995 were predominately U.S. residents who (1) received an S&E doctorate after June 1994 or (2) earned that degree at a foreign institution and
Groups Not Covered in 1997
The following individuals with bachelor's and master's degrees were not surveyed and therefore are not included in the 1997 SESTAT integrated database.
Doctorate-level individuals with S&E degrees who were not surveyed in 1997 were predominately U.S. residents who (1) received an S&E doctorate after June 1996 or (2) earned that degree at a foreign institution and
It is important to consider the impact of attrition on sample size when designing the surveys that populate the SESTAT database. Even when there is a relatively high response rate in any given survey year, the cumulative response rates of a longitudinal survey will deteriorate over time. Table 1 summarizes unweighted published response rates for the three survey components. The response rates in the table are not directly comparable. The NSCG response rate for 1993 is the response rate for the initial (full coverage) sample as selected from the census long form records, whereas the response rates for the later years of NSCG are "conditional" response rates pertaining to the sample of respondents from previous cycles (including supplemental cases from the NSRCG). The response rates for the NSRCG, on the other hand, are "unconditional" response rates pertaining to the cross-sectional samples that were selected for the particular cycles (years). The response rates for the SDR are also unconditional but, as noted in the table, a subsample of prior nonrespondents was selected for followup in the 1995 SDR (unlike the procedures for other years, in which all nonrespondents were followed up).
The three components show somewhat different patterns of response. The initial response rate for the NSCG was the response rate for the sample selected from the 1990 census long form. In succeeding years, however, the NSCG response rate shown in the table is the rate of response among those who responded in the previous cycle. Assume that the response rates for the three NSCG cycles shown in the table are multiplicative. Then, with an initial response rate of 80% and a conditional response rate of 95% in the second cycle, the overall response rate in the second cycle would be 76% (80% times 95%). After another cycle, the overall response rate would decrease to 71% in 1997 and, carrying the computations forward, would continue to decrease in subsequent rounds.
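The multiplicative calculation can be sketched in a few lines. The 80% and 95% rates come from the text; the 94% conditional rate for the third cycle is an assumption, chosen only because it is consistent with the 71% figure quoted above, and is not stated in this section.

```python
def cumulative_response_rates(conditional_rates):
    """Return the overall response rate after each cycle, assuming the
    per-cycle (conditional) rates simply multiply."""
    overall = []
    running = 1.0
    for rate in conditional_rates:
        running *= rate
        overall.append(running)
    return overall


# 0.80 and 0.95 are from the text; 0.94 is an assumed value (see lead-in).
nscg_overall = cumulative_response_rates([0.80, 0.95, 0.94])
nscg_percent = [round(100 * r) for r in nscg_overall]
```

Extending the list with further conditional rates shows how quickly the overall rate erodes even when each cycle's own rate stays in the mid-90s.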
The NSRCG response rates in Table 1, which are based only on the new sample of recent graduates for a given cycle, declined slightly from 1993 to 1997 but were 82% or higher in all years surveyed. Because this sample "feeds" into the NSCG sample in the following round, the response rate for the NSRCG has a direct bearing on the overall response rate for the NSCG.
The SDR response rates in Table 1 are the overall (unweighted) response rates for the given cycle. In 1993, the unweighted SDR response rate was 87%. The lower unweighted rate of 77% in 1995 reflects the subsampling of nonrespondents as the end of the survey approached; such subsampling was conducted in the 1995 cycle only. The 1995 SDR subsampling took about 60% of the mail nonrespondents prior to CATI followup. The unweighted response rate for 1995 was nevertheless calculated using the original sample size as a base, even though 40% of the mail nonrespondents were not included in the CATI data collection. The weighted response rate that reflects the subsampling of nonrespondents was 85%, which is consistent with the 84% response rate achieved in the 1997 SDR. Thus, it may be reasonable to speculate that SDR response rates will remain at about 80% in subsequent rounds.
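The gap between the unweighted and weighted 1995 rates can be illustrated with a small sketch. All counts below are hypothetical, chosen only so that the arithmetic lands on 77% and 85%; they are not the actual SDR counts, and ineligible cases are ignored for simplicity.

```python
def response_rates(sample_size, mail_respondents, cati_respondents,
                   subsample_fraction):
    """Return (unweighted, weighted) response rates when only a fraction
    of mail nonrespondents is followed up by CATI.

    Unweighted: all respondents over the original sample size.
    Weighted: CATI respondents are weighted up by 1 / subsample_fraction
    so that they also represent the nonrespondents not followed up.
    """
    unweighted = (mail_respondents + cati_respondents) / sample_size
    weighted = (mail_respondents
                + cati_respondents / subsample_fraction) / sample_size
    return unweighted, weighted


# Hypothetical counts: 1,000 sampled, 650 respond by mail, 60% of the
# 350 mail nonrespondents (210) go to CATI, and 120 of those respond.
unweighted_rate, weighted_rate = response_rates(
    sample_size=1000,
    mail_respondents=650,
    cati_respondents=120,
    subsample_fraction=0.60,
)
```

The unweighted rate is pulled down because the cases dropped by subsampling stay in the denominator but can never appear in the numerator.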
Although response rates are generally high for both the NSRCG and the SDR, the NSCG is the largest component of SESTAT and thus has the greatest impact on overall response rates. Despite the high conditional response rates achieved for the NSCG, over time the cumulative effects of attrition will inevitably lead to very low overall response rates.
Weighting in SESTAT
The initial purpose of weighting is to compensate for differential probabilities of selection. This compensation is achieved through the "base weight," which is defined as the reciprocal of the probability of selecting a person for the study. The final weights used for the survey analysis may include one or more adjustments. For example, nonresponse-weighting adjustments are often used to inflate the base weights to compensate for unit nonresponse. Poststratification adjustments are used to make the weighted sample counts conform to known population totals, and thus provide a way of adjusting for possible undercoverage of the target population.
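The three weighting steps described above can be sketched as follows. This is a simplified illustration rather than the SESTAT procedure itself: the adjustment cells, selection probabilities, and control totals are hypothetical, and both adjustments are done as simple ratio adjustments within cells.

```python
def base_weights(selection_probs):
    """Base weight = reciprocal of the selection probability."""
    return [1.0 / p for p in selection_probs]


def nonresponse_adjust(weights, responded, cells):
    """Within each cell, inflate respondents' weights so they sum to the
    weighted total of all sampled cases; nonrespondents get weight 0.
    Assumes every cell has at least one respondent."""
    all_sum, resp_sum = {}, {}
    for w, r, c in zip(weights, responded, cells):
        all_sum[c] = all_sum.get(c, 0.0) + w
        if r:
            resp_sum[c] = resp_sum.get(c, 0.0) + w
    return [w * all_sum[c] / resp_sum[c] if r else 0.0
            for w, r, c in zip(weights, responded, cells)]


def poststratify(weights, cells, control_totals):
    """Scale weights within each cell to match known population totals."""
    cell_sum = {}
    for w, c in zip(weights, cells):
        cell_sum[c] = cell_sum.get(c, 0.0) + w
    return [w * control_totals[c] / cell_sum[c]
            for w, c in zip(weights, cells)]


# Hypothetical four-person sample in two adjustment cells.
probs = [0.10, 0.10, 0.05, 0.05]
responded = [True, False, True, True]
cells = ["A", "A", "B", "B"]

w_base = base_weights(probs)
w_nr = nonresponse_adjust(w_base, responded, cells)
w_final = poststratify(w_nr, cells, {"A": 25.0, "B": 50.0})
```

After poststratification the weights in each cell sum to the control total for that cell, which is how the adjustment compensates for possible undercoverage.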
Although some aspects of weighting have varied from year to year, the main features of the weighting procedures used in SESTAT can be summarized as follows.
Variance Estimation in SESTAT
Replication methods have been used to produce estimates of sampling errors for all three components of SESTAT. Because the variance estimates were developed by different survey organizations, different methods have been used. For example, for the NSCG, balanced repeated replication (BRR) was used for variance estimation. For the NSRCG, a jackknife method for a paired selection sample design, referred to as "JK2," was used (see Westat 2000); for the 1993 NSRCG, 50 replicate weights were produced for variance estimation, whereas for the 1995–99 NSRCG, 86 replicate weights were produced. Additional details of the methods used in the NSRCG are given in Westat (1999). For the 1993 SDR, BRR was used with 16 replicate weights for variance estimation. For later SDR samples, the number of replicate weights was increased to 48. Details of the methods used in the SDR for variance estimation are given in U.S. Census Bureau (2001).
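Both methods share the same basic computation: the statistic is re-estimated with each set of replicate weights, and the squared deviations from the full-sample estimate are combined. A minimal sketch follows, using the standard textbook scaling factors (1/R for basic BRR without a Fay adjustment, 1 for JK2); the exact factors used in the SESTAT surveys are not given in the text, and the replicate estimates below are hypothetical.

```python
def replicate_variance(full_estimate, replicate_estimates, method):
    """Variance of an estimate from replicate estimates.

    method: "BRR" divides the sum of squared deviations by the number of
    replicates; "JK2" (paired jackknife) uses the sum as is.
    """
    sum_sq = sum((rep - full_estimate) ** 2 for rep in replicate_estimates)
    if method == "BRR":
        return sum_sq / len(replicate_estimates)
    if method == "JK2":
        return sum_sq
    raise ValueError(f"unknown method: {method}")


# Hypothetical full-sample estimate and four replicate estimates.
full = 100.0
reps = [98.0, 102.0, 101.0, 99.0]
var_brr = replicate_variance(full, reps, "BRR")
var_jk2 = replicate_variance(full, reps, "JK2")
```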
The variance of an estimated total obtained by pooling the various SESTAT samples can be calculated by simply adding the variances of the individual components. As long as the individual variance estimates are approximately unbiased (and the samples can be assumed to be roughly independent), the resulting total variance is also approximately unbiased no matter what methods are used to obtain the individual variances. Because there is a large number of questionnaire items in the components of SESTAT, generalized variance curves are also used to provide approximate variances for broad classes of statistics based on observed relationships between the weighted estimate and the calculated sampling error.
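As a small illustration of pooling under the independence assumption, the sketch below adds component totals and their variances and then takes the square root to get a standard error. The component estimates and variances are hypothetical.

```python
import math


def pooled_total_and_se(components):
    """components: list of (estimated_total, variance) pairs, one per
    survey component, assumed to come from independent samples."""
    total = sum(estimate for estimate, _ in components)
    variance = sum(var for _, var in components)
    return total, math.sqrt(variance)


# Hypothetical component totals and variances.
pooled_total, pooled_se = pooled_total_and_se([(1000.0, 900.0),
                                               (500.0, 1600.0)])
```

Note that the variances add, not the standard errors: here the component standard errors are 30 and 40, and the pooled standard error is 50 rather than 70.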
The U.S. Census Bureau was required to perform the data collection for the NSCG sample derived from the 1990 census. However, for the 1997 and 1999 SESTAT cycles, the data for the NSCG panel subsamples originally selected as part of the NSRCG were collected not by the Census Bureau but as part of the NSRCG panel survey. The data for these cases were integrated with the NSCG and SDR to form the full SESTAT integrated database.