Current and Alternative Sources of Data on the Science and Engineering Workforce
Current Design of the SESTAT Surveys
As described in the previous sections, the SESTAT data are collected from three surveys that have been conducted every 2 years since 1993. A brief description of the three SESTAT component surveys follows. See the SESTAT website http://sestat.nsf.gov/ for further details.
The NSCG is a panel survey that started with a sample of individuals with at least a bachelor's degree at the time of the 1990 census and then added samples of recent S&E graduates at subsequent SESTAT rounds. The 1993 NSCG, which primarily covered the experienced S&E population, was conducted by the U.S. Census Bureau, using a subsample of the 1990 decennial census long form sample. A sample of eligible respondents from the 1993 NSCG was followed in subsequent SESTAT rounds and supplemented by samples from the NSRCG.
The 1993 NSCG was a special baseline survey of a sample of all those who had earned a bachelor's degree or higher (in any field) before 1 April 1990, the date of the 1990 decennial census, and were age 72 or younger at that time. The sample design was a two-phase stratified random sample of individuals with at least a bachelor's degree. Phase 1 consisted of the procedure used by the Census Bureau for sampling households for the census long form. That procedure was a stratified systematic sample, with differing sampling rates for administrative areas of different sizes. Phase 2 consisted of subsampling individuals with at least a bachelor's degree and age 72 or younger from the long form records, within strata defined according to demographic characteristics (race/ethnicity, citizenship, and disability status), highest degree achieved, occupation, and sex. Within each stratum, individuals were selected using probability-proportional-to-size (PPS) systematic sampling. The long form sampling weight was used as the size measure for selection to compensate as much as possible for the differing long form sampling rates, and hence to come as close as possible to an overall self-weighting sample within each phase 2 stratum. The maximum sampling rate was 3.00%, but most strata were sampled at rates between 2.03% and 2.82%. The unweighted response rate for the 1993 NSCG was 78%, yielding more than 148,000 individuals who had at least a bachelor's degree, and identifying an additional 19,000 people not eligible to receive the survey (e.g., those who did not have a bachelor's degree, were deceased, were over 75, or were no longer living in the United States). Survey responses were then used to determine whether the respondents fit into SESTAT's target population of scientists and engineers by virtue of having an S&E degree at the time of the census and/or working in an S&E occupation at the time of the 1993 NSCG. More than 74,000 survey respondents matched the SESTAT definition.
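The phase 2 selection described above can be illustrated with a minimal Python sketch of systematic PPS sampling. The function name `pps_systematic_sample` and the toy long form weights are hypothetical and are not taken from the Census Bureau's procedure. The sketch only shows why using the phase 1 weight as the size measure pushes the overall weight toward a constant within a stratum.

```python
import random


def pps_systematic_sample(sizes, n, rng):
    """Return n indices chosen by systematic PPS sampling (illustrative only).

    A unit whose size exceeds the sampling interval can be hit more than once.
    """
    total = sum(sizes)
    interval = total / n
    start = rng.random() * interval  # random start in [0, interval)
    points = [start + k * interval for k in range(n)]

    selected = []
    cumulative = 0.0
    next_point = 0
    for index, size in enumerate(sizes):
        cumulative += size
        # Select this unit once for every selection point inside its range.
        while next_point < n and points[next_point] < cumulative:
            selected.append(index)
            next_point += 1
    return selected


# Hypothetical long form (phase 1) weights for one phase 2 stratum.
long_form_weights = [6.0, 6.0, 8.0, 2.0, 8.0, 6.0, 2.0, 2.0]
n_phase2 = 2

chosen = pps_systematic_sample(long_form_weights, n_phase2, random.Random(1993))

# The phase 2 selection probability is proportional to the phase 1 weight, so
# the overall weight (phase 1 weight / phase 2 probability) is the same for
# every unit in the stratum.
total_weight = sum(long_form_weights)
overall_weights = [
    w / (n_phase2 * w / total_weight) for w in long_form_weights
]
```

In this toy stratum every unit ends up with the same overall weight (the stratum weight total divided by the phase 2 sample size), which is the self-weighting property the design aimed for.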
The 1995 NSCG sample was selected from 1993 NSCG eligible respondents and the 1993 NSRCG respondents (described in the next section), using a unique identification rule to avoid two chances of selection for individuals eligible for both surveys. The 1995 NSCG frame was stratified by factors such as highest S&E degree level, highest S&E major study field, demographic group, and sex. A sample of 62,004 individuals was selected for the survey using PPS sampling within these strata, with the 1993 analysis weight being used as the size measure for PPS sampling. Sampled individuals were contacted initially by mail. A total of 41,522 eligible sample members responded to the mail component of the survey. Nonrespondents were then subsampled for computer-assisted telephone interview (CATI) or computer-assisted personal interview (CAPI) follow-up. Across all data collection modes, a total of 53,448 eligible scientists and engineers responded to the 1995 NSCG. The conditional unweighted response rate (conditional on having responded in 1993) was 95%. The unconditional response rate (taking into account nonresponse in 1993) was approximately 74%.
The 1997 NSCG was selected from eligible respondents to the 1995 NSCG (itself derived from 1993 NSCG and 1993 NSRCG respondents) and augmented by a sample of the 1995 NSRCG respondents. The 1995 NSRCG respondents were oversampled to support more detailed analyses of recent S&E college graduates, and a few extra questions were added to the NSCG questionnaire for these individuals. The cases originally sampled in the 1993 NSRCG and 1995 NSRCG were referred to in 1997 and subsequent years as the 1993 NSRCG panel and the 1995 NSRCG panel. The unweighted conditional response rate for the 1997 NSCG was 94% and the unconditional response rate is estimated to be about 70%.
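The relationship between the conditional and unconditional rates quoted for the NSCG panel can be checked with a short sketch. It assumes that the unconditional rate is approximated by the running product of the conditional rates; the source does not state the exact calculation, so this is an approximation rather than the published method.

```python
from itertools import accumulate
from operator import mul

# Conditional unweighted response rates quoted in the text for the NSCG panel.
conditional_rates = {1993: 0.78, 1995: 0.95, 1997: 0.94}

# Assumption: the unconditional rate is the running product of the
# conditional rates across rounds.
unconditional_rates = dict(
    zip(conditional_rates, accumulate(conditional_rates.values(), mul))
)

for year, rate in unconditional_rates.items():
    print(f"{year}: {rate:.1%}")
```

Under this assumption the products come to roughly 74% for 1995 and 70% for 1997, in line with the figures reported above.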
The 1999 NSCG sample was drawn from a frame consisting of eligible respondents from the 1997 NSCG (itself including 1993 NSCG and 1993 and 1995 NSRCG panels) and the 1997 NSRCG. The conditional unweighted response rates were 90% for the 1993 NSCG and 1993 NSRCG panel components combined, and 81% for the 1995 and 1997 NSRCG panel components combined. The unconditional response rate was in the region of 60%.
During the 1990s, the NSRCG covered those individuals who received an S&E degree from a U.S. educational institution in the 2 academic years before the SESTAT survey reference dates. Specifically, the 1993 NSRCG covered individuals who received bachelor's or master's degrees in an S&E field from a U.S. educational institution between 1 April 1990 and 30 June 1992. The 1995 NSRCG covered those who received bachelor's or master's degrees in an S&E field from a U.S. educational institution in the period from 1 July 1992 to 30 June 1994; the 1997 NSRCG covered the period from 1 July 1994 to 30 June 1996; and the 1999 NSRCG covered the period from 1 July 1996 to 30 June 1998.
A two-stage sample design was used in each round of the NSRCG. Educational institutions were sampled at the first stage, and S&E bachelor's degree and master's degree graduates were sampled from within these institutions at the second stage. The Integrated Postsecondary Education Data System (IPEDS) was used to construct the sampling frame for educational institutions. For the 1993 NSRCG, 196 of the eligible institutions had such large numbers of S&E graduates that they were selected with certainty. From the remaining institutions, 79 were selected using systematic, PPS sampling from a file sorted by ethnicity, region, public/private status, and presence of agricultural courses. The measure of size was devised to account for the rareness of certain fields of study and for the incidence of Hispanic, black, and noncitizen students. Of the 275 sampled institutions, 273 provided lists of their students receiving bachelor's or master's degrees in S&E fields between 1 April 1990 and 30 June 1992. From the 273 responding institutions, 25,785 students were selected using stratified systematic sampling, with markedly varying sampling rates by stratum field. Of the 25,785 selected students, a total of 19,426 eligible degree recipients responded to the 1993 NSRCG. The unconditional unweighted response rate over both stages of sampling was 85%.
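The split between certainty and noncertainty institutions can be sketched as follows. The source does not give the threshold that was used, so the rule here (an institution is a certainty selection when its size measure is at least the PPS sampling interval, applied repeatedly) is an assumption based on common practice, and the function name `split_certainty` and the size measures are hypothetical.

```python
def split_certainty(sizes, n):
    """Split units into certainty and noncertainty groups (illustrative only).

    Returns (certainty_indices, remaining_indices, n_remaining), where
    n_remaining is the number of units still to be drawn by PPS sampling.
    """
    certainty = set()
    while True:
        remaining = [i for i in range(len(sizes)) if i not in certainty]
        n_left = n - len(certainty)
        if n_left <= 0 or not remaining:
            break
        interval = sum(sizes[i] for i in remaining) / n_left
        # Units at least as large as the interval would be hit with
        # probability 1, so they are taken with certainty.
        newly_certain = {i for i in remaining if sizes[i] >= interval}
        if not newly_certain:
            break
        certainty |= newly_certain
    remaining = [i for i in range(len(sizes)) if i not in certainty]
    return sorted(certainty), remaining, n - len(certainty)


# Hypothetical size measures for six institutions, with three to be sampled.
institution_sizes = [500, 300, 40, 30, 20, 10]
certain, noncertain, n_to_draw = split_certainty(institution_sizes, 3)
```

In this example the two largest institutions are taken with certainty, and the one remaining selection is drawn by PPS sampling from the other four.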
The 1995 design for the NSRCG was similar to the 1993 design. In the two-stage sampling approach, educational institutions were again sampled in the first stage using PPS sampling. In 1995, a composite measure of size was introduced that was designed to facilitate oversampling of rare domains of interest (e.g., minority graduates). The 1991–92 IPEDS was used to construct the sampling frame for institutions. There were 102 institutions that were so large that they were selected with certainty. In addition, 173 institutions were sampled from the remaining portion of the frame after stratifying by region, control (public versus private), and percentage of S&E degrees. From the 266 responding institutions, 23,771 students were selected using stratified systematic sampling. Initial nonrespondents and those who had to be traced were subsampled for further follow-up, thus reducing the sample size to 21,000 graduates. A total of 16,338 eligible degree recipients responded to the 1995 NSRCG. The unconditional unweighted response rate was about 83%.
The 1997 NSRCG retained the same sample of institutions that was selected for the 1995 cycle, that is, the same 102 certainty selections and 173 noncertainty selections. All of the 275 sampled institutions responded. A total of 14,057 graduates were sampled. Of these, 1,032 were ineligible, and 10,452 eligible respondents completed the survey. The unconditional unweighted response rate was 82%.
The 1999 NSRCG retained the 275 institutions sampled in 1995 and surveyed again in 1997. Also, four additional institutions were included in the sample, all selected with certainty, to improve population coverage. Of the 279 sampled institutions, 1 turned out to be ineligible and 1 did not provide a graduate list. A total of 13,918 graduates were sampled, of whom 987 were ineligible and 9,984 were eligible and completed the survey. The unconditional unweighted response rate was 78%.
The SDR covers individuals who received a doctorate degree in an S&E field from a U.S. educational institution after 1 January 1942. (Non-U.S. doctorates in S&E are covered by NSCG if the recipient entered the United States by 1 April 1990.) The sampling frame for SDR was constructed from the Doctorate Records File, which is a database of all U.S. research doctorate recipients since 1920. The 1993, 1995, 1997, and 1999 rounds of SDR covered those who received degrees up to 30 June 1992, 30 June 1994, 30 June 1996, and 30 June 1998, respectively.
The SDR is a panel survey of doctorate recipients. Samples of new cohorts are added to the base sample every 2 years, and the existing sample is reduced to maintain the overall sample size. The SDR frame is restricted to two groups of recipients of U.S. S&E doctorates under age 76: (1) U.S. citizens, and (2) non-U.S. citizens who plan to remain in the United States after receiving their doctoral degrees. A two-phase sample design was used in 1991 and 1993 for the SDR to facilitate oversampling of the disabled and certain minority groups and to facilitate overall sample size reduction for the survey.
The overall sampling rate for the 1993 SDR was 8.8%, with rates for individual sampling strata ranging from 4.5% to 66.7%. The sample size was 49,228 doctorate recipients, of whom 39,495 were eligible S&E doctorate respondents. The conditional unweighted response rate was 87%.
For the 1995 SDR, a sample of those earning doctoral degrees at U.S. institutions between 1 July 1992 and 30 June 1994 (the new cohort) was added, and the previous sample of doctorate recipients whose degrees were received between 1 January 1942 and 30 June 1992 (the old cohort) was subsampled to produce a combined sample of about the same size as the 1993 sample. The sampling rates for the new and old cohorts were similar within strata defined by demographic group, field of study, and sex. An initial sample of 49,829 cases was selected. Of these, 31,243 responded by mail. Nonrespondents were subsampled for CATI follow-up, again using stratified PPS sampling procedures. Across all modes of data collection, 35,370 eligible doctorate recipients responded to the survey. The conditional unweighted response rate for the 1995 SDR was 77%. The corresponding weighted rate, allowing for the nonrespondent subsampling for CATI follow-up, was 85%.
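The gap between the unweighted and weighted rates arises because each case followed up by CATI stands in for several mail nonrespondents. The sketch below shows one plausible reading of the two rates. The counts are hypothetical (the source does not report the subsampling rate), and the sketch assumes equal base weights and that every case is eligible.

```python
def response_rates(mail_respondents, mail_nonrespondents,
                   followup_sample, followup_respondents):
    """Return (unweighted, weighted) response rates (illustrative only)."""
    eligible = mail_respondents + mail_nonrespondents

    # Unweighted: nonrespondents not chosen for follow-up count against the rate.
    unweighted = (mail_respondents + followup_respondents) / eligible

    # Weighted: each followed-up case represents this many mail nonrespondents.
    followup_weight = mail_nonrespondents / followup_sample
    weighted = (
        mail_respondents + followup_weight * followup_respondents
    ) / eligible
    return unweighted, weighted


# Hypothetical counts: 600 mail respondents, 400 mail nonrespondents,
# 200 of them subsampled for follow-up, 120 of whom respond.
unweighted_rate, weighted_rate = response_rates(600, 400, 200, 120)
```

With these made-up counts the unweighted rate is 72% and the weighted rate is 84%, the same direction of difference as the 77% and 85% reported for the 1995 SDR.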
Following the same procedures, the 1997 SDR sample was composed of those earning doctoral degrees at U.S. institutions between 1 July 1994 and 30 June 1996, and a subsample of the previous sample of doctorate recipients (degrees received between 1 January 1942 and 30 June 1994). The new cohort cases were sampled at about twice the rate of the old cohort cases, but the proportion of cases within strata (defined by demographic group, field of study, and sex) was similar for the old and new cohort. An initial sample of 54,103 degree recipients was selected, 28,886 of whom responded by mail. Of these cases, 27,382 were deemed complete interviews, with the remainder being either permanently or temporarily out of scope. Nonrespondents were subsampled for CATI follow-up, again using stratified PPS sampling procedures. Across all modes of data collection, 35,667 eligible doctorate recipients responded to the survey. The conditional unweighted response rate for the 1997 SDR was 85%.
Like the previous rounds of the survey, the 1999 SDR added a new cohort. In this case the new cohort comprised those who earned S&E doctorate degrees between 1 July 1996 and 30 June 1998. The new cohort was oversampled such that out of a total sample size of 40,000 doctorates, 4,000 were allocated to the new cohort. The sample from the 1997 SDR was divided into two subgroups, termed the old cohort and the nearly new cohort, corresponding to those who earned their doctorates before 1 July 1992 and those who earned their doctorates between 1 July 1992 and 30 June 1996. The nearly new cohort was then sampled at a somewhat higher rate than the old cohort for the 1999 SDR. Of the initial sample of 40,000 individuals, 27,269 responded by mail, and of those 26,216 were deemed complete responses; the remainder were out of scope. Mail nonrespondents were followed up by CATI, which yielded 5,102 complete interviews. Thus overall 31,318 eligible doctorate recipients responded. The conditional unweighted response rate was 82%.
Selection probabilities for the SESTAT surveys vary greatly. The sampling weight for each sampled individual is defined as the reciprocal of that individual's probability of selection. The sampling weights are then adjusted for nonresponse and poststratification using weighting class adjustment procedures. The final adjusted sampling weights become the analysis weights, which have been added to each individual's record in the survey database.
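The weighting steps can be sketched in a few lines of Python. The record layout, class labels, and selection probabilities below are hypothetical, and the adjustment shown is a generic weighting class nonresponse adjustment rather than the exact procedure used for SESTAT.

```python
from collections import defaultdict


def weighting_class_adjust(records):
    """Inflate respondents' weights to cover nonrespondents in their class.

    Each record is a dict with keys 'cls', 'selection_prob', and 'responded'.
    Returns respondent records with an added 'analysis_weight'.
    """
    total = defaultdict(float)      # sum of base weights, all sampled cases
    responding = defaultdict(float)  # sum of base weights, respondents only
    for r in records:
        base = 1.0 / r["selection_prob"]  # reciprocal of selection probability
        total[r["cls"]] += base
        if r["responded"]:
            responding[r["cls"]] += base

    adjusted = []
    for r in records:
        if r["responded"]:
            base = 1.0 / r["selection_prob"]
            factor = total[r["cls"]] / responding[r["cls"]]
            adjusted.append({**r, "analysis_weight": base * factor})
    return adjusted


# Hypothetical sample of five cases in two weighting classes.
sample = [
    {"cls": "A", "selection_prob": 0.02, "responded": True},
    {"cls": "A", "selection_prob": 0.02, "responded": False},
    {"cls": "B", "selection_prob": 0.05, "responded": True},
    {"cls": "B", "selection_prob": 0.05, "responded": True},
    {"cls": "B", "selection_prob": 0.05, "responded": False},
]
respondents = weighting_class_adjust(sample)
```

After the adjustment, the respondents' analysis weights in each class sum to the base weight total of all sampled cases in that class, so the weighted totals are preserved despite nonresponse.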
In the 1993 NSCG, poststratification adjustment was used to adjust the weighted counts for survey respondents to the 1990 decennial census long form sample estimates. In the 1993 NSRCG and the 1993 SDR, the weights were adjusted only for nonresponse. Similarly, for the 1995, 1997, and 1999 surveys, weights were adjusted for nonresponse with no poststratification adjustment.
The SESTAT database was constructed for each survey round by combining the three component surveys, which meant addressing the potential for cross-survey multiplicity. Scientists and engineers in SESTAT could belong to the surveyed population of more than one component survey, depending on their degrees and when they received them. For instance, a person with a bachelor's degree at the time of the 1990 decennial census who went on to complete a master's degree in 1991 could be selected in the 1993 NSCG and the 1993 NSRCG. The following unique-linkage rule was devised to remove these multiple selection opportunities: each member of SESTAT's target population is uniquely linked to one and only one component survey, and that individual is included in the integrated SESTAT database only when he or she is selected for the linked survey. As a result, each person has only one chance of being selected into the combined SESTAT database. Individuals with multiple selection opportunities were first linked to the SDR, and then to the NSRCG if the individual was not linked to the SDR. In the NSCG, sampled individuals who also had a chance of being selected for the NSRCG or the SDR in that year were assigned zero as their SESTAT analysis weight. Similarly, sampled individuals in the NSRCG who also had a chance of being selected for the SDR in that year were assigned zero as their SESTAT analysis weight. The component survey's analysis weight for all other cases was used to develop the SESTAT analysis weight. Cases with a zero weight are not eligible to be sampled in future waves of the longitudinal samples.
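The unique-linkage rule can be expressed as a simple priority lookup. The function names and example weights below are hypothetical, and the sketch simplifies by passing the component survey weight through unchanged, whereas the text says only that the component weight was used to develop the SESTAT weight.

```python
# Linkage priority described in the text: SDR first, then NSRCG, then NSCG.
LINK_PRIORITY = ("SDR", "NSRCG", "NSCG")


def linked_survey(eligible_surveys):
    """Return the single component survey a person is linked to."""
    for survey in LINK_PRIORITY:
        if survey in eligible_surveys:
            return survey
    raise ValueError("person is not in any component survey population")


def sestat_weight(sampled_in, eligible_surveys, component_weight):
    """Zero the weight unless the person was sampled in the linked survey."""
    if sampled_in == linked_survey(eligible_surveys):
        return component_weight
    return 0.0


# The example from the text: a bachelor's degree by the 1990 census and a
# master's degree in 1991 make the person eligible for both surveys.
eligible = {"NSCG", "NSRCG"}
weight_if_sampled_in_nscg = sestat_weight("NSCG", eligible, 45.0)
weight_if_sampled_in_nsrcg = sestat_weight("NSRCG", eligible, 12.5)
```

Here the person is linked to the NSRCG, so a selection through the NSCG receives a zero SESTAT weight and only a selection through the NSRCG contributes to the integrated database.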
As can be seen from the foregoing, the current NSF approach is complex and has sample selection and coverage problems. The two most significant problems are as follows:
It is important to recognize that a substantial number of members of the S&E population are not S&E graduates from U.S. educational institutions. Data from the 1993 SESTAT database substantiate this point. The data showed that there were 593,600 individuals in S&E occupations with non-S&E degrees only; this number excludes those who graduated in non-S&E fields between 1 April 1990 and 1993 who were working in S&E occupations in 1993. Additionally, there were approximately 428,000 individuals in the SESTAT database who had only foreign degrees. There was some overlap between these two populations, so in 1993 a total of approximately 1,020,000 individuals in the SESTAT database either (a) had an S&E occupation but no S&E degree, or (b) had only foreign degrees. These individuals comprised approximately 9% of the 1993 SESTAT universe of 11.6 million individuals.
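The arithmetic behind the 9% figure can be checked directly from the numbers quoted above. All inputs come from the text; because they are rounded, the sketch does not try to pin down the size of the overlap between the two groups.

```python
# Figures quoted in the text for the 1993 SESTAT database.
non_se_degree_only = 593_600     # S&E occupation, non-S&E degrees only
foreign_degree_only = 428_000    # only foreign degrees (approximate)
either_group = 1_020_000         # in at least one group (approximate)
sestat_universe = 11_600_000

# Without any overlap the two groups would sum to this upper bound.
upper_bound = non_se_degree_only + foreign_degree_only

# Share of the SESTAT universe falling in either group.
share = either_group / sestat_universe
print(f"{share:.1%}")
```

The share works out to just under 9%, consistent with the approximate figure given in the text.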
NSF seeks to explore alternative survey approaches for developing a cost-effective S&E personnel data system that provides a more complete representation of the universe of scientists and engineers than the current approach. If possible, the approach should provide means to include individuals who are working in S&E occupations but do not hold a bachelor's degree or higher in an S&E field. Some alternative approaches are discussed in the next section, as is the possibility of using establishment survey frames for this purpose.
The type of sample design chosen should derive from user data needs and the types of analysis to be conducted. One important issue is whether analysts are interested in longitudinal analyses of data from these surveys. There would appear to be great potential for such analyses, but there has been limited use of the longitudinal data to date. The current design provides longitudinal data as a byproduct of sample generation, which is not the case with the designs based on the alternative surveys discussed in this report, such as the American Community Survey and the National Immunization Survey. Another consideration that would affect the choice of sample design is survey frequency. Currently, the SESTAT surveys are conducted biennially. The assumption for this review is that this will continue to be the case, although the Division of Science Resources Statistics (SRS) has considered other cycle lengths.
We consider four types of sample design and briefly discuss the data analysis implications and the response burden issues associated with each.
Although conceptually the 1995 NSRCG panel is a component of the NSCG, it has been conducted as part of the NSRCG data collection and not as part of the NSCG data collection.