National Science Foundation National Center for Science and Engineering Statistics
Design Options for SESTAT for the Current Decade: Statistical Issues

SESTAT Redesign Options


NSF proposed four options for the redesign of the SESTAT suite of surveys. Each option is outlined in this chapter. One of the options was essentially a repeat of the design implemented during the 1990s. For all options, the target population is theoretically the same—residents of the United States with at least a bachelor's degree and who, as of the survey reference period, were noninstitutionalized, age 75 years or younger, and either have an S&E degree or are working in an S&E occupation. Each option has some coverage gaps; these gaps are different for each option.
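The target population definition above can be read as a simple eligibility rule. The following is a minimal sketch of that rule, not NSF's actual processing code; the `Person` record and its field names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # All field names are hypothetical; they mirror the criteria in the text.
    us_resident: bool
    has_bachelors_or_higher: bool
    institutionalized: bool
    age: int                 # age as of the survey reference period
    has_se_degree: bool
    in_se_occupation: bool


def in_target_population(p: Person) -> bool:
    """Return True if the person meets the stated SESTAT target definition."""
    return (
        p.us_resident
        and p.has_bachelors_or_higher
        and not p.institutionalized
        and p.age <= 75
        and (p.has_se_degree or p.in_se_occupation)
    )


# A non-S&E graduate working in an S&E occupation is in scope.
example = Person(True, True, False, 40, False, True)
print(in_target_population(example))
```

The final clause is the one that drives screening costs in the options below: a person with only non-S&E degrees enters the population through occupation, which cannot be determined from the frame in advance.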

Figure 1 is the options chart, which graphically shows information about each option for the 2003 data collection. The chart has a separate row for each population component included in the target population. The last three rows are for population components that are currently undercovered or not covered in the SESTAT surveys but may be part of the target population. The first three columns indicate the survey that currently covers each population component, the population component involved, and the frame from which the survey sample is currently selected. For example, the first row shows that the doctorate population is currently surveyed by the SDR, which uses the SED as the sampling frame. The population components currently included in the NSRCG are described by the academic year of degree receipt. For example, "1991–92 Bac/Master" includes degrees received from July 1990 through June 1992 (2 academic years). The next set of columns shows how each population component would be sampled and surveyed under each of the four options. Thus, in option 1, the doctorates are covered by the SED/SDR, the bachelor's/master's from pre-1990 through March 2000 are included in a 2000 postcensal survey, the bachelor's/master's from April through June 2000 are a panel selected from the 2001 NSRCG survey, and the bachelor's/master's from July 2000 through June 2002 are included in the 2003 NSRCG.

The four options have some characteristics in common. First, the SDR would continue its current system of sampling and data collection for the U.S. doctorate population under all of the options because a more efficient frame has not been identified. Furthermore, the design of the SDR would not be affected by the choice among the four options. Second, the NSRCG would not change. Third, under all four options the sample would be updated for new entrants into the S&E population every 2 years using new S&E graduates from the NSRCG. Fourth, the SESTAT survey would be conducted about every 2 years from 2003 to 2009 under each of these options.

Figure 1 also discusses the differences between the options with regard to coverage gaps, sample attrition, and screening requirements.

Option 1: A Replication of Design of the 1990s

Option 1 is a replication of the 1990s design. Under this option, the Census Bureau would conduct a postcensal survey in 2003 based on the 2000 census. This survey would include college graduates who, in April 2000, had received at least a bachelor's degree and were age 72 years or younger (and thus would be age 75 years or younger in 2003), noninstitutionalized, and living in the United States or serving in the Armed Forces overseas. In 2003, this sample would be contacted and interviewed for the NSCG. On the basis of the interview, those individuals in the 2003 sample who (a) were college graduates with S&E degrees or (b) had a college degree and were working in S&E occupations would be screened into the 2005 NSCG sample. Foreign-degreed individuals who were in the 2000 census as well as those with non-S&E degrees but who were working in S&E occupations in 2003 would be included in the followup NSCG sample.

Under this option two major groups in the target population would not be covered or would be poorly covered in 2003: (a) individuals eligible for the SESTAT integrated database who lived abroad as of the 2000 decennial census who later came to live in the United States and who did not earn a bachelor's or higher S&E degree from a U.S. institution after April 2000 and (b) individuals with only non-S&E degrees obtained after April 2000 who held S&E occupations in the survey reference period. In addition, individuals with only non-S&E degrees (who had obtained at least one degree before April 2000) who did not hold S&E occupations in 2003 but held such occupations in a later survey reference period would not be covered in the followup NSCG surveys after 2003.

As in 1993, this option for the NSCG is essentially a large screening effort. As an indication of how much screening might be needed, in the 1993 NSCG 214,643 individuals were selected from the 4,728,000 people who were reported on the 1990 census long form as meeting the degree and eligible age requirements. There were 148,932 eligible respondents for the 1993 NSCG from this sample and 19,224 others who were classified as ineligible for reasons such as being deceased, over age 75, or no longer living in the United States. The NSCG eligible respondents yielded a total of 74,693 respondents who were eligible for inclusion in the SESTAT database. About half of those who were interviewed in the 1993 NSCG were not eligible for inclusion in SESTAT because they did not have an S&E degree and were not working in an S&E occupation. This would imply that if the same response and eligibility rates occur in 2003, less than 35% of those individuals selected from the 2000 census would respond to the 2003 NSCG and be eligible for inclusion in SESTAT. This corresponds to a screening rate of almost 3 to 1 (i.e., the ratio of the initial sample size to the number of SESTAT-eligible respondents), which is the maximum screening ratio among the four options considered.
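The screening figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the 1993 counts quoted above; the variable names are illustrative.

```python
# Counts quoted in the text for the 1993 NSCG.
selected = 214_643             # sampled from the 1990 census long form
nscg_eligible_resp = 148_932   # eligible NSCG respondents
sestat_eligible = 74_693       # respondents eligible for the SESTAT database

# Share of NSCG respondents who were also SESTAT-eligible ("about half").
sestat_share = sestat_eligible / nscg_eligible_resp

# Share of the initial sample that responded and was SESTAT-eligible.
overall_yield = sestat_eligible / selected

# Initial sample size per SESTAT-eligible respondent (the screening ratio).
screening_ratio = selected / sestat_eligible

print(f"SESTAT-eligible share of respondents: {sestat_share:.1%}")
print(f"Overall yield from initial sample:    {overall_yield:.1%}")
print(f"Screening ratio:                      {screening_ratio:.2f} to 1")
```

The overall yield comes out just under 35%, and the screening ratio just under 3 to 1, which matches the statements in the text.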

Under option 1, and indeed for each option, the coverage of foreign-degreed individuals and those with non-S&E degrees who have S&E occupations in 2003 will obviously be much greater than in the 1999 SESTAT database because there were no attempts during the 1990s to update the original NSCG sample with these subgroups. This can cause a "discontinuity" when comparing results from the new design with those from the previous design. However, if the subgroups of foreign-degreed individuals and those with non-S&E degrees are removed from the analysis, then the resulting population of inference will be similar to that represented by the 1999 SESTAT sample and time trends will be maintained—although differences will be subject to larger sampling errors than if the samples were overlapping. It should also be noted that time-trend comparisons may be affected by differential nonsampling errors, such as nonresponse and panel conditioning, in the surveys being compared. In addition, small domain estimates (e.g., demographic groups with small numbers in particular degree fields) may be unstable.

Because of the declining unconditional response rates that can be expected in the NSCG at each successive round (see the section "Response Rates"), the sample under option 1 (which is a "new" sample) will begin with an estimated 15% smaller nonresponse than the existing SESTAT sample, provided the 1993 initial response rate is maintained in the 2003 postcensal survey. Also, to the extent that any conditioning biases are introduced into the panel over time, the new sample will be subject to smaller panel effects than the existing sample.

A methodological approach that NSF also considered for option 1 is the use of decentralized CATI collection for followup of mail nonrespondents in the postcensal survey. This approach could be more cost effective and may result in higher response rates than centralized CATI collection for this survey. (The decentralized CATI option, which was named option 5 by NSF, is not considered separately in this report because any of the four options discussed here could have a decentralized CATI version.)


Option 2: A Continuation of Current Panels

Under option 2, the current sample based on the 1990 census would continue with some attempt to fill gaps in coverage, where possible and cost effective. Targeted samples would be screened from the 2000 census to update the sample of foreign-degreed individuals and college graduates with only non-S&E degrees who work in S&E occupations. To decrease the nonresponse rate, NSF could also go back to the original samples from 1993 as well as the later panels and try to trace the nonrespondents.

Individuals with at least a bachelor's degree who were born and educated abroad can be sampled directly from the 2000 census by stratifying according to country of origin and year of entry into the United States, but the foreign degrees would need to be screened to identify those in S&E fields. U.S. citizens educated abroad since the 1990 census would not be covered. Individuals with only non-S&E degrees who are working in S&E occupations cannot be identified in advance of sampling. Thus, if these individuals are to be included in the sample at the same rate as in option 1, the initial sample size from the census would have to be the same as in option 1. In this case, option 2 would offer no advantage over option 1 in terms of reducing screening costs.

On the other hand, if it is possible to identify occupational groups in which large numbers of individuals with only non-S&E degrees work (e.g., the computer science field), then it may be possible to either target (restrict) the sample of individuals with non-S&E degrees to selected occupations (however, this would result in some undercoverage in the occupations not sampled) or, alternatively, to sample the occupations differentially to reduce the total screening effort. NSF could conduct a study of S&E occupational fields to identify those occupations in which individuals with non-S&E degrees represent a sizable proportion of workers, and this information could then be used for sample design purposes to determine how to sample the various occupational groups.

Under this option (where the 2000 census would only be used to sample individuals born and educated abroad and those targeted as having non-S&E degrees), the coverage gaps would grow through the decade similarly to option 1. Although the existing SESTAT surveys do not cover foreign-degreed individuals and those with only non-S&E degrees after April 1990, "refreshing" the NSCG sample with the foreign degreed and those with only non-S&E degrees from the 2000 census would produce coverage similar to option 1.

A variant of the above approach would be to use the 2000 census to sample only the foreign-degreed (who can more readily be identified from the census long form than can those working in S&E occupations without S&E degrees). Individuals with only non-S&E degrees in S&E occupations would not be covered except, if desired, to the extent that they are represented in the 1990 census sample. This variant would allow inferences to be made to the restricted population of individuals with S&E degrees. However, even with this restricted definition there would still be some undercoverage, namely individuals born in the United States who received only foreign degrees between April 1990 and April 2000.

The level of effort for screening under option 2 depends on the desired sample size for individuals with non-S&E degrees screened from a sample of college graduates identified in the 2000 census. If the intent is to maintain the same expected sample size for this subgroup as in option 1, then the level of screening would be roughly the same as in option 1. On the other hand, if the SESTAT population were restricted to individuals with S&E degrees (so those without S&E degrees would be excluded from sampling), then screening the census population would essentially be eliminated. Between the two extremes are intermediate positions in which the population of individuals with only non-S&E degrees who are working in S&E occupations is (a) restricted to certain occupations or (b) sampled at varying rates depending on occupational group. Position (a) would lead to undercoverage bias whereas position (b) would avoid undercoverage but lead to increased sampling errors. Either of these positions could plausibly lead to screening rates that are one-half to three-quarters the size of those for option 1. Further analysis would be required to determine the approximately optimum sampling rate under the various scenarios.

This option allows for only limited flexibility in terms of allocating sample size in the old sample. Furthermore, unless the tracing efforts are successful, the response rate that is a concern in the current design would not be improved. Weighting adjustments needed to compensate for nonresponse tend to increase the variation in the weights, which in turn tends to increase sampling errors.
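The link between weight variation and sampling error can be illustrated with Kish's approximate design effect for unequal weighting, deff = n Σw² / (Σw)². This formula is a common rule of thumb and is not taken from this report; the base weights and nonresponse adjustment factors below are hypothetical values chosen only to show the direction of the effect.

```python
def kish_deff(weights):
    """Kish's approximate design effect from unequal weights.

    Equals 1.0 when all weights are equal and grows as the weights
    become more variable.
    """
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)


# Hypothetical equal base weights for eight respondents.
base_weights = [100.0] * 8

# Hypothetical nonresponse adjustment factors: respondents in classes
# with lower response rates receive larger adjustments.
adjustment_factors = [1.0, 1.0, 1.2, 1.2, 1.5, 1.5, 2.0, 2.0]
adjusted_weights = [w * f for w, f in zip(base_weights, adjustment_factors)]

print(kish_deff(base_weights))      # no weight variation
print(kish_deff(adjusted_weights))  # larger, reflecting added weight variation
```

A design effect above 1 means the effective sample size is smaller than the nominal number of respondents, which is the sense in which more variable weights tend to increase sampling errors.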

The primary advantage of this option (apart from the cost savings associated with retaining previous respondents) is that it maintains the longitudinal aspects of the original SESTAT design and small domain estimates are more stable over time. Specifically, option 2 will permit analysis of change among individuals who were in the study for two or more rounds in different decades. The potential for longitudinal analysis would be lost under option 1 until subsequent data collection rounds are completed, and would be nonexistent for comparisons involving a time interval of more than 10 years (e.g., spanning different decades).


Option 3: A Split Design Combining Options 1 and 2

This dual-frame option combines features of option 1 and option 2. In this dual-frame design, a portion of the sample would be selected using the frame from option 1 and the remainder using the 1999 NSCG panel frame from option 2. A 50-50 split of the sample is roughly optimum for making comparisons between two options in most such designs; however, a 50-50 split is not necessary and in this case is not the optimum allocation for producing estimates from the combined sample.
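The statement that an even split is roughly optimum for comparisons can be seen in a simplified setting. The sketch below assumes two independent simple random samples with equal element variance and equal cost per case, which is an idealization and not the actual SESTAT design; under those assumptions the variance of the difference between the two sample means is σ²(1/n₁ + 1/n₂).

```python
def diff_variance(n_total, frac_new, sigma2=1.0):
    """Variance of the difference between two independent sample means.

    Assumes equal element variance sigma2 in both samples (an assumption
    made for illustration) and a fixed total sample size n_total.
    """
    n_new = n_total * frac_new
    n_old = n_total - n_new
    return sigma2 / n_new + sigma2 / n_old


# Candidate shares of the total sample allocated to the new frame.
fractions = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
best = min(fractions, key=lambda f: diff_variance(1000, f))

print(best)
```

Under these assumptions the minimum falls at an equal split. As the text notes, that allocation is tuned for comparing the two samples and need not be the best choice for estimates from the combined sample.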

Under this option, the sample of college graduates drawn from the 2000 census would be smaller than the census sample required if the whole design followed option 1. An advantage of using the 2000 census for part of the sample is that the subpopulations with coverage problems would be represented. These subpopulations consist of foreign-degreed individuals and those with non-S&E degrees who have moved into S&E occupations since the 1990 census. The remaining portion of the option 3 sample would be selected from the existing NSCG panel. The advantage of using the 1999 NSCG panel for part of the sample is that the screening for S&E cases has already been done, so each sample case yields an eligible S&E case for analysis, whereas samples from the 2000 census yield only about one eligible S&E case for every three selected.

It should be noted that the 1999 NSCG panel (option 2) has only about 46,000 cases, thus limiting the option 2 allocation. Therefore, if the total number of samples approaches 200,000, as it did in past redesigns, the 1999 panel samples would make up the minority of cases, even if all 46,000 were used. The remaining sample of over 150,000 cases would be drawn from the 2000 census. Assuming further that the one-in-three screening rate for finding S&E cases obtained in past postcensal surveys still holds, the 2000 census (option 1 portion) would yield about 50,000 in-scope cases. Thus, the overall S&E sample from the option 3 design would be about 96,000 cases (46,000 + 50,000) compared with 67,000 cases (one-third of 200,000) from the option 1 design. Subsampling of the old panel cases could be used to bring the "old" sample size to a smaller portion of the total sample size if desired.
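The sample-size comparison in this paragraph can be reproduced directly. The sketch below uses the rounded figures from the text (a census sample of about 150,000 and a one-in-three yield); the function name is illustrative.

```python
def in_scope_cases(census_sample, panel_cases=0, census_yield=1 / 3):
    """Expected number of S&E (in-scope) cases.

    Panel cases are already screened, so each counts in full; census
    cases yield roughly one in-scope case per three selected.
    """
    return panel_cases + census_sample * census_yield


# Option 3: all 46,000 panel cases plus about 150,000 census cases.
option3 = in_scope_cases(census_sample=150_000, panel_cases=46_000)

# Option 1: the full sample of about 200,000 drawn from the census.
option1 = in_scope_cases(census_sample=200_000)

print(round(option3))       # about 96,000
print(round(option1, -3))   # about 67,000
```

The difference arises entirely from the panel cases needing no further screening, so the combined design delivers more in-scope cases from roughly the same total sample.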

The coverage gaps under option 3 would be the same as those under option 1. However, the sample sizes for the undercovered subgroups may not be the same. For example, because it is not possible to identify individuals with non-S&E degrees who are working in S&E occupations in the census frame, the screening required for the option 1 portion of the sample in the hypothetical example above will yield about three-quarters of the sample size for the subgroup of individuals in S&E without an S&E degree (assuming a roughly 50,000/150,000 sample split and no other targeting, as described below). However, if it is cost effective to target selected occupations related to workers of interest with a non-S&E degree (as described previously under option 2), the sample for the targeted occupations can be increased accordingly. On the other hand, because it is probable that foreign-degreed individuals can be identified in the census files in advance of sampling, it will be feasible to sample this subgroup at the full rate without unduly increasing screening levels.

Option 3 permits an assessment of possible "panel effects" in the existing NSCG sample. If the two samples produce comparable results, they can be combined with relatively little loss in efficiency and potentially a net increase in the size of the S&E sample. On the other hand, if the comparisons indicate that estimates from the existing panel are markedly different from those based on the new sample, then it may be presumed that there is differential bias. In this case, the new sample can be used to make cross-sectional estimates as in the past (although with slightly reduced precision due to the smaller sample). Analysis of the differences might also provide improved nonresponse adjustment methodology that would bring the estimates from both frames back into effective joint use.


Option 4: A Variant of Option 2 Supplemented by the NSRCG

Option 4 is a variation of option 2 supplemented by the NSRCG. The old panels would continue as in option 2. Assuming that it is feasible to select new samples from the old NSRCG panel lists, additional samples would be taken to supplement the old panels. There could also be an attempt to recontact old nonrespondents. The difficulty is that there is no viable frame for the 1993 postcensal sample or for the 1993 NSRCG sample (the 1993 NSRCG frame is no longer available).

As in option 2, the 2000 census would be used for the limited purpose of refreshing the sample with the "missing" subpopulation of the foreign degreed and, perhaps, those with non-S&E degrees who have moved into S&E occupations.

The coverage gaps would be the same as in option 2. Sampling from the old NSRCG frames would not cover any missing subgroup but would only be an attempt to increase sample size in response to attrition. The resampling will not eliminate any existing nonresponse bias in the old cohorts but will merely provide another independent sample for comparison. As with option 3, the nonresponse bias, if any, can be studied.

This option would require extensive tracing to locate individuals selected from the old NSRCG lists and may not have any cost advantage. It adds very little as compared with option 2.


Other Issues Pertaining to Design Options

Other issues that affect decisions about the SESTAT redesign include the following:

  • frequency of the surveys
  • relationships to other surveys
  • methods of sample selection
  • maintenance of trends with earlier data

There appears to be very little difference between the four options on most of these issues. All four options would allow the same frequency of surveying. All four options have exactly the same relationships to other surveys (e.g., Current Population Survey and American Community Survey). The four options are based on similar data sources and there is nothing in any option that would give it an advantage in terms of sample selection methodology.

The maintenance of trends within a longitudinal sample is the one area where some difference exists. Because the NSCG is a fresh sample in option 1, there may be a discontinuity in the trends from the 1990s and there is no basis for estimating individual gross changes across the decades. The other three options all allow some overlap of sample for investigation of gross differences over time. Each of the four options continues to have gaps in coverage; the types of gaps are conceptually the same but the extent varies. The greatest difference is in the way in which the options attempt to address screening levels (and costs), attrition, and nonresponse bias.

Design Options for SESTAT for the Current Decade: Statistical Issues
Working Paper | SRS 07-201 | June 2007