Design Options for SESTAT for the Current Decade: Statistical Issues
The target population for the SESTAT surveys includes residents of the United States who, as of the survey reference period, were noninstitutionalized, age 75 years or younger, had at least a bachelor's degree, and either had a bachelor's degree or higher in an S&E field or were working as a scientist or an engineer. However, certain groups that are intended to be in the target population are only partially covered in the current designs for the SESTAT surveys. The two main groups that are partially covered by the SESTAT surveys are those referred to as the "foreign degreed" and individuals who are working in an S&E occupation but do not have an S&E degree. Although these groups were mostly included in the NSCG, the sample has not been refreshed for these groups in later cycles.
Another important consideration for the design of the SESTAT surveys is the impact of attrition on sample size. Even with relatively high response rates in any given survey year, the cumulative response rates will decline over time. The three SESTAT surveys show somewhat different patterns of response. For the NSCG, the initial response rate was 80% in 1993 and the overall response rate in the second cycle was 76% (the 80% initial rate times a 95% conditional rate in the second cycle). The overall response rate decreased to 71% in 1997 and would be expected to decrease similarly in further rounds of the survey. The NSRCG response rates declined slightly from 1993 to 1997, but were 82% or more in all years surveyed. In 1993, the unweighted SDR response rate was 87%. The lower unweighted SDR response rate of 77% in 1995 was due to the subsampling of nonrespondents, which was conducted in that cycle only. The 1995 weighted response rate, which reflects the subsampling of nonrespondents, was 85%. This is consistent with the 84% response rate achieved in the 1997 SDR. Thus, it may be reasonable to expect that SDR response rates will remain at 80% or higher in subsequent rounds.
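The arithmetic behind the cumulative NSCG rates can be illustrated with a short sketch. It assumes, as the parenthetical calculation above suggests, that the overall response rate is the product of the conditional response rates of successive cycles. The function name is illustrative only, and the conditional rate for 1997 is merely implied by the published figures, not reported in the text.

```python
from math import prod

def cumulative_response_rate(conditional_rates):
    """Overall response rate as the product of per-cycle conditional rates."""
    return prod(conditional_rates)

# Figures from the text: 80% initial response in 1993,
# 95% conditional response in the second cycle.
overall_second_cycle = cumulative_response_rate([0.80, 0.95])
print(f"{overall_second_cycle:.0%}")  # 76%

# Conditional rate implied by the drop from 76% to 71% overall in 1997
# (derived here for illustration; not a figure reported in the text).
implied_1997_conditional = 0.71 / overall_second_cycle
print(f"{implied_1997_conditional:.1%}")  # 93.4%
```

Even conditional rates above 90% in every cycle therefore pull the overall rate steadily downward.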
Although response rates generally have been high for both the NSRCG and the SDR, the NSCG is the largest component of the SESTAT integrated database and thus has had the greatest impact on overall response rates. Despite the high conditional response rates achieved for the NSCG, over time the cumulative effects of attrition inevitably have led to low overall response rates.
NSF has proposed four options for the redesign of the samples for the SESTAT surveys. Each option has strengths and weaknesses with regard to coverage gaps, sample attrition, and screening requirements. The four options also have some characteristics in common. First, the SDR would continue its current system of sampling and data collection under all of the options and the design of the SDR would not be affected by the choice among the four options. Second, under all four options the sample of experienced scientists and engineers would be updated every 2 years with new graduates from the NSRCG. Third, the NSRCG would not change. Fourth, the SESTAT surveys would be conducted about every 2 years from 2003 through 2009 regardless of the option chosen.
An important feature of probability sampling methods is that they permit the calculation of the sampling errors associated with the survey estimates. None of the four design options has a significant advantage over the others in regard to variance estimation. As long as appropriate variance stratum and unit identifiers are available in the data and related survey control files, reasonably good variance estimators can be developed using any of a number of well-known techniques. Because the four options all involve the same basic design elements (e.g., use of unclustered census samples plus what are essentially independent samples of recent college graduates and doctorate recipients), the same general approach can be used (with minor modifications) for all four options.
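As an illustration of how variance stratum and unit identifiers are used, the following minimal sketch computes a weighted total and a variance estimate for it. It uses one well-known technique, the stratified with-replacement (linearization) estimator, chosen here only as an example; the text does not say which technique the SESTAT surveys use. The record layout and function name are assumptions, and finite population corrections are ignored.

```python
from collections import defaultdict

def stratified_total_variance(records):
    """Estimate a weighted total and its variance.

    records: iterable of (stratum_id, weight, y) tuples, one per unit.
    Units are treated as sampled with replacement within each stratum.
    """
    by_stratum = defaultdict(list)
    for stratum, weight, y in records:
        by_stratum[stratum].append(weight * y)

    total = 0.0
    variance = 0.0
    for z in by_stratum.values():
        n_h = len(z)
        if n_h < 2:
            raise ValueError("each variance stratum needs at least two units")
        mean_z = sum(z) / n_h
        total += sum(z)
        # Between-unit variability of the weighted values within the stratum.
        variance += n_h / (n_h - 1) * sum((v - mean_z) ** 2 for v in z)
    return total, variance

# Hypothetical data: two strata with two sampled units each.
sample = [("A", 10, 1), ("A", 10, 0), ("B", 5, 1), ("B", 5, 1)]
total, variance = stratified_total_variance(sample)
print(total, variance)  # 20.0 100.0
```

Because the computation needs only the stratum identifier, the unit, and the weight, the same routine would apply with minor modifications to the census-based sample and to the samples of recent graduates and doctorate recipients.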
Option 1 is a repetition of the current design. Under this option, the Census Bureau would conduct a postcensal NSCG survey in 2003 based on the 2000 census. Because the census data do not include the level of detail necessary to identify graduates eligible for the SESTAT database, a significant screening process would be needed. If the same response and eligibility rates occur in 2003 as occurred in 1993, fewer than 35% of those individuals selected from the 2000 census long form records would respond to the 2003 NSCG and be eligible for inclusion in the SESTAT database. Option 1 therefore requires the most screening of the four options considered.
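The screening burden implied by this yield can be sketched as follows. The calculation assumes that the required number of census records is the target number of eligible respondents divided by the combined response and eligibility yield. The target of 10,000 eligible respondents is a hypothetical figure used only for illustration, and because the text gives the yield as less than 35%, the result is a lower bound.

```python
import math

def required_screening_sample(target_eligible, yield_rate):
    """Census records to select so that, on average, target_eligible
    selected persons both respond and prove eligible."""
    if not 0 < yield_rate <= 1:
        raise ValueError("yield_rate must be in (0, 1]")
    return math.ceil(target_eligible / yield_rate)

# Hypothetical target of 10,000 eligible respondents; 35% is the
# upper bound on the yield stated in the text.
print(required_screening_sample(10_000, 0.35))  # 28572
```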
Individuals with foreign degrees who were included in the 2000 census as well as those with non-S&E degrees who are working in S&E occupations at the time of the NSCG would be included as eligible sample members. Under this option, coverage of the foreign degreed and individuals with only non-S&E degrees who hold S&E occupations in 2003 would be much greater than in the 1999 NSCG because there have been no attempts to update the original NSCG sample with these subgroups.
Under this option, three major groups would not be covered or would be poorly covered in the SESTAT surveys: (a) eligible individuals who lived abroad as of the 2000 decennial census who later came to live in the United States and did not earn a bachelor's or higher S&E degree from a U.S. institution after April 2000; (b) individuals with only non-S&E degrees obtained after April 2000 who hold S&E occupations in the survey reference period; and (c) individuals with only non-S&E degrees, with at least one degree obtained before April 2000, who did not hold an S&E occupation in 2003 but held such an occupation in a later survey reference period.
If a new NSCG sample is drawn under option 1 and the response rate in 2003 is about the same as the 1993 response rate, that rate would be an estimated 15% higher than the response rate for the existing NSCG sample, which declined during the decade. Also, if any conditioning biases have been introduced into the panel over time, the new sample will be subject to smaller panel effects than the existing sample.
Under option 2, the current NSCG sample based on the 1990 census would continue, and an attempt would be made to fill gaps in coverage where possible and cost effective. To decrease the nonresponse rate, NSF would go back to the original samples from 1993 as well as the later panels and try to trace the nonrespondents.
The 2000 census would be used in a limited way to augment the sample with the "missing" subpopulations of the foreign degreed and, perhaps, those with non-S&E degrees in April 2000 who held S&E occupations in the survey reference period in 2003. Screening could be done efficiently for individuals who were born and educated abroad, who can be sampled almost directly from the census by stratifying by country of origin and year of entry into the United States, but the foreign degrees would need to be screened to identify S&E degrees. U.S. citizens educated abroad since the 1990 census would not be covered. Individuals with only non-S&E degrees who are working in S&E occupations cannot be identified in advance of sampling, so a screening activity would be needed. The level of effort for screening under option 2 depends on the desired sample size for individuals with non-S&E degrees screened from a sample of college graduates identified in the 2000 census. If the expected sample size for this subgroup were the same as for option 1, then the level of screening would be roughly the same as for option 1. On the other hand, if the SESTAT population were restricted to individuals with S&E degrees (so those without S&E degrees would be excluded from sampling), then screening the census population would essentially be eliminated. Between the two extremes are intermediate positions in which the population of individuals with only non-S&E degrees who are working in S&E occupations is (a) restricted to certain occupations or (b) sampled at varying rates depending on occupational group. Position (a) would lead to undercoverage bias, and position (b) would avoid undercoverage but lead to increased sampling errors. Either of these positions could plausibly lead to screening rates that are one-half to three-quarters of the rates for option 1. Further analysis would be required to determine the approximate optimum sampling rate under the various scenarios.
Under this option (in which the 2000 census would only be used to sample the foreign degreed and individuals with only non-S&E degrees), the coverage gaps would grow throughout the decade in the same way as with option 1. The refreshment of the NSCG sample with the foreign degreed and individuals without S&E degrees from the 2000 census would produce coverage similar to option 1.
This option allows limited flexibility in terms of allocating sample size in the old sample. Furthermore, unless the tracing efforts are successful, the response rate that is a concern in the current design would not be improved. Weighting adjustments needed to compensate for nonresponse tend to increase the variation in the weights, which in turn tends to increase sampling errors.
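The link between weight variation and sampling error can be illustrated with Kish's approximate design effect due to unequal weighting, deff = n * sum(w^2) / (sum(w))^2, which equals 1 plus the squared coefficient of variation of the weights. This formula is a standard approximation introduced here for illustration and is not taken from the SESTAT documentation; it assumes the weights are unrelated to the survey variable. The weights below are hypothetical.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights."""
    n = len(weights)
    if n == 0:
        raise ValueError("weights must not be empty")
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return n * sum_w2 / sum_w ** 2

# Hypothetical weights before and after a nonresponse adjustment that
# doubles the weights of half the respondents.
before = [1.0, 1.0, 1.0, 1.0]
after = [1.0, 1.0, 2.0, 2.0]

print(kish_design_effect(before))            # 1.0
print(round(kish_design_effect(after), 3))   # 1.111
# Effective sample size: nominal size divided by the design effect.
print(round(len(after) / kish_design_effect(after), 1))  # 3.6
```

In this example the adjustment leaves four respondents with the precision of about 3.6 equally weighted respondents.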
The primary advantages of this option (apart from the cost savings associated with retaining previous respondents) are (1) it maintains the longitudinal aspects of the original SESTAT design and (2) small domain estimates are more stable over time. Specifically, option 2 would permit analysis of change among individuals who were in the study for two or more rounds in different decades. This potential for longitudinal analysis would be lost under option 1 until subsequent data collection rounds are completed, and would be nonexistent for comparisons involving a time interval of more than 10 years (e.g., spanning different decades).
Option 3 combines features of options 1 and 2. Part of the NSCG sample would be selected using option 1 and the remainder would be selected using option 2. Under this option, the 2000 census would be used to draw a sample of college graduates that is larger than half the size of the sample planned under option 1. The 2000 census sample size can be set to yield sufficient S&E cases to feed into the followup panel about 2 years later with the balance available from samples from the 1999 NSCG panel. The subpopulations consisting of the foreign degreed and those with non-S&E degrees who have moved into S&E occupations since the 1990 census would be represented in this part of the total sample (with these subpopulations oversampled to provide about the same amount of coverage as in option 1). The remaining part of the sample would be derived from a large subsample of the 1999 NSCG panel. The coverage gaps under this option would be the same as in option 1. However, the sample sizes for the undercovered subgroups may not be the same.
This option permits an assessment of possible "panel effects" in the existing NSCG sample. If the two samples produce comparable results, then they can be combined with relatively little loss in efficiency. On the other hand, if the comparisons indicate that estimates from the existing panel are markedly different from those based on the new sample, then differential bias may be presumed. In this case, the new sample can still be used to make cross-sectional estimates (but with reduced levels of precision). Analysis of the differences might also provide improved nonresponse adjustment methodology that would bring the estimates from both frames back into effective joint use. If the samples can be combined, this option would produce the most S&E cases and would provide the best control for rare demographic S&E groups.
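One way the comparison and combination described above might be carried out is sketched below. The z-test for the difference and the inverse-variance weighting of the two estimates are assumptions made for illustration; the text does not specify either method. The sketch also assumes that the new sample and the existing panel are independent, and all numerical inputs are hypothetical.

```python
import math

def compare_and_combine(est_new, var_new, est_old, var_old, z_crit=1.96):
    """Test whether two independent estimates differ; combine them if not."""
    z = (est_new - est_old) / math.sqrt(var_new + var_old)
    if abs(z) > z_crit:
        # Marked difference: rely on the new sample alone, at the cost
        # of the larger variance of a single sample.
        return {"z": z, "combined": False,
                "estimate": est_new, "variance": var_new}
    # Comparable results: inverse-variance weighted average.
    precision = 1 / var_new + 1 / var_old
    w_new = (1 / var_new) / precision
    return {"z": z, "combined": True,
            "estimate": w_new * est_new + (1 - w_new) * est_old,
            "variance": 1 / precision}

# Hypothetical estimates of a proportion from the new sample and the old panel.
agree = compare_and_combine(0.50, 0.0004, 0.52, 0.0004)
differ = compare_and_combine(0.50, 0.0004, 0.60, 0.0004)
print(agree["combined"], round(agree["estimate"], 2))    # True 0.51
print(differ["combined"], round(differ["estimate"], 2))  # False 0.5
```

When the estimates agree, the combined variance is smaller than that of either sample; when they do not, the fallback to the new sample alone shows the reduced precision noted above.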
Option 4 is a variant of option 2 with supplementation from the NSRCG. The old panels would continue as in option 2. Assuming that new samples can be drawn from the old NSRCG panel lists, these samples could be used to supplement the old panels. There could also be an attempt to contact old nonrespondents. As in option 2, the 2000 census would be used for the limited purpose of refreshing the sample with the "missing" subpopulation of the foreign degreed and, perhaps, those with non-S&E degrees who held S&E occupations in 2003.
The coverage gaps would be the same as in option 2. Sampling from the old NSRCG frames would not cover any missing subgroup but would only address the decline in sample size through attrition. It should be noted that the resampling would not reduce the existing nonresponse bias but would merely provide another independent sample for comparison. As with option 3, the nonresponse bias, if any, could be studied. This option would require extensive tracing to locate individuals selected from the old NSRCG lists, and it would add very little compared with option 2.