1. Overview (2008 survey cycle)
The Survey of Doctorate Recipients (SDR) is a longitudinal study of individuals who received a doctoral degree from a U.S. institution in a science, engineering, or health (SEH) field. The goal of the SDR is to provide policymakers and researchers with high-quality data and findings for making informed decisions related to the educational and occupational achievements and career movement of the nation's doctoral scientists and engineers. This group is of special interest to many decision makers because it comprises the most highly educated individuals in the U.S. workforce.
The SDR has been conducted every 2 to 3 years since 1973 for the National Science Foundation (NSF) in conjunction with the National Institutes of Health (NIH). The survey follows a sample of individuals with SEH doctorates throughout their careers from the year of their degree award through age 75. The panel is refreshed in each survey cycle with a sample of new SEH doctoral degree earners.
The data from this survey are combined with those from two other NSF surveys of scientists and engineers, the National Survey of College Graduates (NSCG) and the National Survey of Recent College Graduates (NSRCG). The three surveys are closely coordinated and share the same reference date and nearly identical instruments. The database developed from the three surveys, the Scientists and Engineers Statistical Data System (SESTAT), provides a comprehensive picture of the number and characteristics of individuals with training and/or employment in science, engineering, or related fields in the United States.
Respondents have the following characteristics:
- Hold a research doctorate in an SEH field from a U.S. institution
- Live in the United States during the survey reference week
- Are not institutionalized
- Are less than 76 years of age
c. Key variables
- Citizenship status
- Country of birth
- Country of citizenship
- Date of birth
- Disability status
- Educational history (for each degree held: field, level, institution, year received)
- Employment status (unemployed, employed part time, or employed full time)
- Geographic place of employment
- Marital status
- Number of children
- Primary occupation (current or past job)
- Primary work activity (e.g., teaching, basic research, etc.)
- Postdoctoral status
- Publication and patent activities
- Satisfaction with and importance of various aspects of job
- School enrollment status
- Sector of employment (e.g., academia, industry, government, etc.)
- Work-related training
2. Survey Design
a. Target population and sample frame
The target population of the 2008 survey consisted of all individuals less than 76 years of age as of the survey reference date (i.e., born on or after 1 October 1932) who had received a research doctorate in a science, engineering, or health field from a U.S. academic institution, were not institutionalized, and were living in the United States or a U.S. territory during the survey reference week of 1 October 2008.
The sample frame used both to identify the initial panel of respondents and to refresh the panel over time with new SEH doctorate recipients is the Doctorate Records File (DRF), maintained by the NSF. The primary source of information for the DRF is the Survey of Earned Doctorates (SED).
The 2008 SDR sampling frame included individuals who:
- Earned a research doctoral degree from a U.S. college or university in a science, engineering, or health field through 30 June 2007;
- Were U.S. citizens at time of receipt of their doctoral degree or, if non-U.S. citizens, indicated on the SED that they had plans to remain in the United States after degree award; and
- Were younger than 76 years of age as of 1 October 2008 (the survey reference date) or if age was unknown, had not received a baccalaureate degree prior to 1950.
The 2008 SDR frame was constructed as two separate cohort frames, an existing 2006 SDR cohort frame and a new cohort frame. The cohorts are defined by the year of receipt of their first U.S.-granted doctoral degree. The existing cohort frame represents individuals who had received their SEH doctorate before 1 July 2005; the new cohort frame represents individuals who had received an SEH doctorate between 1 July 2005 and 30 June 2007. The new cohort frame was a "primary frame" that included all known newly eligible cases; the existing cohort frame was a "secondary frame" that carried forward the SDR cohort from the previous survey cycle and each member's sampling weight from the previous cycle.
Existing Frame Construction. The SDR existing (or old cohort) frame was constructed from the final operational sample file used for data collection in the previous survey cycle less cases determined to be permanently ineligible in that prior cycle (e.g., sample members determined to be deceased or over age 75 during 2006 survey operations). Existing frame cases were originally selected into the SDR as new cohort members who were sampled from the SED.
New Cohort Frame Construction. The data source for constructing the SDR new cohort sampling frame for 2008 was the two most recent doctoral cohorts included in the SED. The most recent SED cohort always lags one year behind the current SDR reference year; the two most recent cohorts for the 2008 SDR were thus the 2006 and 2007 (academic year) doctoral cohorts.
The cases within all of the frame sources were analyzed individually for SDR eligibility requirements. Persons who did not meet the age criteria or who were known to be deceased, terminally ill or incapacitated, or permanently institutionalized in a correctional or health care facility were dropped from the 2008 sampling frames. Sample persons who were non-U.S. citizens and were known to be residing outside the United States or U.S. territories during at least two prior consecutive survey cycles were also permanently eliminated from the SDR sampling frame. After ineligible cases were removed from consideration, the remaining cases from the two sources were combined to create the 2008 SDR sampling frame. In total, there were 102,579 cases in the 2008 SDR frame, including 41,612 existing cohort cases and 60,967 new cohort cases.
b. Sample design
The 2008 SDR has a stratified probability sampling design that was similar to the 2006 SDR design. The total number of cases selected for the 2008 SDR sample was 40,093. The main difference between the 2008 and 2006 sampling designs was a reduction in the number of sampling strata from 164 in 2006 to 150 in 2008. This was due to the elimination of missing-race strata; in 2008, cases with missing race were imputed based on surname or their U.S. state or non-U.S. country of birth. Thus in 2008, the frame was stratified into 150 strata based on three variables: demographic group, degree field, and sex. The demographic group variable included the following nine categories defined by race/ethnicity, disability status, and citizenship at birth:
- Hispanics, regardless of race, citizenship-at-birth, and disability
- Non-Hispanic blacks, regardless of citizenship-at-birth and disability
- U.S.-citizen-at-birth, non-Hispanic Asians, regardless of disability
- Non-Hispanic American Indians/Alaska Natives, regardless of citizenship-at-birth and disability
- Non-Hispanic Native Hawaiians/Other Pacific Islanders, regardless of citizenship-at-birth and disability
- U.S.-citizen-at-birth, disabled, non-Hispanic whites
- U.S.-citizen-at-birth, nondisabled, non-Hispanic whites
- Non-U.S.-citizen-at-birth, non-Hispanic whites, regardless of disability
- Non-U.S.-citizen-at-birth, non-Hispanic Asians, regardless of disability
These nine categories were defined in a hierarchical manner as the category definitions imply to ensure higher selection probability for rarer population groups. For example, all Hispanics belonged to the first demographic group regardless of other demographic characteristics. Similarly, all non-Hispanic blacks belonged to the second demographic group regardless of other characteristics. Prior to 2003, a 15-category degree field variable was used to stratify the categories within demographic groups, resulting in a large number of strata with very small populations. Beginning in 2003, only the three largest categories within the demographic group (U.S. white, non-U.S. white, and non-U.S. Asian) were stratified by the 15-category degree field variable. All other categories within demographic group were stratified by a 7-category degree field variable except American Indians/Alaska Natives and Native Hawaiians/Other Pacific Islanders, who were stratified only by sex. Thus, the 2008 SDR design featured a total of 150 strata defined by a revised demographic group variable (i.e., missing race was no longer a demographic group), two degree field variables (i.e., a 7-level and a 15-level aggregated field of degree variable), and sex. The sample was then selected from each stratum systematically. The goal of the 2008 SDR sample stratification design was to create strata that conformed as closely as possible to the reporting domains used by analysts and for which the associated subpopulations were large enough to be suitable for separate estimation and reporting.
The 2008 SDR sample allocation strategy consisted of three main components: (1) ensuring a minimum sample size for the smallest strata through a supplemental stratum allocation; (2) allotting extra sample for specific demographic group-by-sex domains through a supplemental domain allocation; and (3) allocating the remaining sample proportionately across all strata. The final sample allocation was therefore based on the sum of a proportional allocation across all strata, a domain-specific supplement allocated proportionately across strata in that domain, and a stratum-specific supplement added to obtain the minimum stratum size.
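As a rough illustration of the three-part allocation logic described above, the following sketch combines a proportional allocation, a supplemental domain allocation, and a minimum stratum size. The stratum names, frame counts, and supplement values are invented for illustration; they are not the actual 2008 SDR figures.

```python
# Hypothetical sketch of the three-component allocation; all inputs invented.

def allocate(frame_sizes, total_n, domain_supplements, min_per_stratum):
    """frame_sizes: {stratum: frame count}.
    domain_supplements: {stratum: extra cases from the domain supplement}."""
    total_frame = sum(frame_sizes.values())
    alloc = {}
    for stratum, size in frame_sizes.items():
        n = round(total_n * size / total_frame)   # (3) proportional allocation
        n += domain_supplements.get(stratum, 0)   # (2) supplemental domain allocation
        n = max(n, min_per_stratum)               # (1) minimum stratum size
        alloc[stratum] = min(n, size)             # cannot exceed the frame count
    return alloc

print(allocate({"A": 5000, "B": 300, "C": 40}, 1000, {"B": 50}, 25))
# → {'A': 936, 'B': 106, 'C': 25}
```

Note how the small stratum "C" is lifted to the minimum size and the supplemented stratum "B" receives extra sample beyond its proportional share.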
The 2008 SDR sample selection was carried out independently for each stratum and cohort-substratum. For the existing cohort strata, the past practice of selecting the sample with probability proportional to size continued, where the measure of size was the base weight associated with the previous survey cycle. For each stratum, the sampling algorithm started by identifying and removing self-representing cases through an iterative procedure. Next, the non-self-representing cases within each stratum were sorted by citizenship, disability status, degree field, and year of doctoral degree award. Finally, the balance of the sample (i.e., the total allocation minus the number of self-representing cases) was selected from each stratum systematically with probability proportional to size.
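The two-step selection described above (certainty removal, then a systematic probability-proportional-to-size draw) can be sketched as follows. This is a simplified illustration, not the production algorithm: the sort here is by size only for certainty detection, whereas the actual procedure sorted by citizenship, disability status, degree field, and year of degree; case identifiers and sizes are invented.

```python
import random

def systematic_pps(cases, n, seed=0):
    """cases: list of (case_id, size), size being the prior-cycle base weight.
    Returns case_ids selected with probability proportional to size."""
    rng = random.Random(seed)
    selected = []
    pool = sorted(cases, key=lambda c: c[1], reverse=True)
    # Step 1: iteratively remove self-representing cases -- any case whose
    # size would give it a selection probability of 1 or more is taken with
    # certainty and excluded from the systematic draw.
    while pool:
        remaining = n - len(selected)
        total = sum(size for _, size in pool)
        if remaining <= 0 or pool[0][1] * remaining < total:
            break
        selected.append(pool.pop(0)[0])
    # Step 2: systematic PPS draw for the balance of the sample, using a
    # single random start and a fixed skip interval on the cumulated sizes.
    remaining = n - len(selected)
    if remaining > 0 and pool:
        total = sum(size for _, size in pool)
        interval = total / remaining
        start = rng.uniform(0, interval)
        cum, i = 0.0, 0
        for case_id, size in pool:
            cum += size
            while i < remaining and start + i * interval < cum:
                selected.append(case_id)
                i += 1
    return selected
```

Because self-representing cases are removed first, every remaining size is smaller than the skip interval, so no case can be selected twice in the systematic draw.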
The new cohort sample was selected using the same algorithm used to select the existing cohort sample. However, because the base weight for every case in the new cohort frame was identical, each stratum sample from the new cohort was actually an equal-probability or self-weighting sample.
The 2008 SDR sample of 40,093 consisted of 36,644 cases from the existing cohort frame and 3,449 cases from the new cohort frame. The overall sampling rate was about 1 in 20 (5.0%) but sampling rates varied considerably across the strata.
c. Data collection techniques
The 2008 SDR was conducted for the NSF's National Center for Science and Engineering Statistics (NCSES) and the NIH by the National Opinion Research Center (NORC) at the University of Chicago (Chicago, IL), a survey contractor. The 2006, 2003, and 1997 surveys were also conducted by NORC. The U.S. Census Bureau conducted the 1999 and 2001 surveys. Prior to 1997, the survey was conducted by the National Research Council of the National Academy of Sciences.
The data collection approach from 1991 to 2001 consisted of mailing letters, self-administered paper questionnaires (mail SAQ), and postcards, with a sequence of follow-up mailings to non-responders according to a set schedule. Final follow-up contacts among remaining non-responding sample members were via telephone. The telephone contact urged prompt return of the self-administered paper survey or offered an immediate computer-assisted telephone interview (CATI) to complete the survey. In 2003, CATI and self-administered online questionnaires (Web) were introduced as initial response modes on an experimental basis. The experiment results indicated that the CATI and Web approaches had merit and that for certain types of cases, starting in CATI or Web improved both response quality and response rate.
The tri-mode data collection approach initiated in 2003 was continued in the 2006 and the 2008 survey cycles. Using mode preference information reported in the 2006 SDR and taking into account results from the 2003 SDR mode experiments, the 2008 selected sample was assigned to various starting mode data collection protocols. Existing cohort sample members who responded to the 2006 SDR were stratified by explicit or implicit mode preference and the cases were assigned to a starting mode accordingly. Explicit mode preference was determined by the response to the mode preference question on the 2006 SDR survey. For those who did not respond to the preference question or indicated no preference, implicit preference was defined as the mode they used to complete the 2006 SDR. The 2008 SDR included a mode experiment for sample members who did not report a mode preference, although they had completed the survey by telephone or paper in the 2006 cycle. These sample members were randomly assigned to either the mail SAQ or Web starting modes. In addition, 2006 SDR non-respondents were assigned a starting mode based on analysis conducted on the 2003 data, which indicated that past refusals are more likely to cooperate if started in the mail SAQ mode and other non-respondents are more likely to cooperate if started in the Web mode. New cohort members with complete sampling stratification variables were assigned to the mail SAQ mode; this decision was also based on analysis conducted on the 2003 SDR data. New cohort sample members with missing sampling stratification variables were assigned to the CATI start mode to facilitate collection of the missing sampling data. Those who were living abroad and had not completed the 2006 SDR were started in the Web mode to decrease mailing costs for sample members most likely to be ineligible for the 2008 SDR. Those without any physical or e-mail address were started in CATI.
At the start of data collection, 15,119 cases received the paper questionnaire in the mail as their initial mode (37.7%), 1,788 cases were started in the CATI mode (4.4%), and 22,826 cases were started in the Web mode (56.9%). The 360 remaining cases were not assigned to a mode because they were determined to be deceased or hostile refusals. Based on Dillman's Total Design Method, different data collection protocols were developed for each of the three different data collection approaches. At any given time, sample members could request to complete the survey in a mode other than the mode to which they were originally assigned. A total of 28.0% of the SDR respondents completed the survey in a mode that was different from their start mode (n=8,693). The data collection protocols for the mail SAQ, CATI, and Web starting modes are described below.
Mail SAQ. Sample members who were started in the mail SAQ first received an advance notification letter from the NSF to notify them of the survey. One week later on the survey reference date, the first questionnaire mailing occurred followed by a thank-you/reminder postcard the following week. Approximately six weeks after the first questionnaire mailing, sample members who had not returned a completed questionnaire (by any mode) were sent a second questionnaire by mail. Three weeks later, any cases who still had not responded received a prompting notice via e-mail to verify receipt of the paper form and encourage cooperation. Telephone follow-up calls began two weeks later for all mail SAQ start mode nonrespondents to request participation, preferably by CATI.
CATI. Sample members who were started in the CATI mode first received an advance notification letter from the NSF to notify them of the survey. One week later on the survey reference date, telephone contacting and interviewing began. Approximately six weeks later, sample members who had not yet responded were sent an e-mail prompt to solicit survey participation in any mode. Three weeks later, any cases who still had not responded received a first questionnaire mailing sent via U.S. mail, followed by a thank-you/reminder postcard one week later. Seven weeks after the first questionnaire mailing, a second questionnaire was mailed to the remaining nonrespondents.
Web. Sample members who were started in the Web mode first received an advance notification letter from the NSF to notify them of the survey. One week later on the survey reference date, they were sent an e-mail with a PIN and password to access the Web survey. Two and a half weeks later, nonrespondents were sent a follow-up letter and an e-mail. Two weeks later, any cases who had not responded received a prompting telephone call to verify receipt of the Web survey access information and encourage cooperation. Four weeks later, any cases who still had not responded received a first questionnaire mailing, followed by a thank-you/reminder postcard one week later. Seven weeks after the first questionnaire mailing, a second questionnaire was mailed to the remaining nonrespondents.
Three additional prompting contacts were sent later in the data collection field period to any remaining nonrespondents from any of the starting mode groups in January, April, and June 2009. Additionally, there were 234 cases that strongly refused in prior SDR rounds to ever participate. These cases only received an advance notification letter from the NSF to notify them of the survey. CATI follow-up, locating and prompting ended 30 June 2009, data entry processes ended on 15 July 2009, and the Web survey was closed on 17 August 2009. Quality assurance procedures were in place at each step (address updating, printing, package assembly and mailout, questionnaire receipt, data entry, coding, CATI, and post data collection processing) to ensure that the designated sample member received the correct information (e.g., PIN and password) and that respondent information was carried forward accurately during receipt and data processing.
d. Estimation techniques
The SDR is based on a complex sampling design (see Section 2.b above) and uses sampling weights that are attached to each responding sample member's record to produce accurate population estimates. The primary purpose of the weights is to adjust for unequal sampling probabilities and nonresponse. The final analysis weights were calculated in three stages:
- A base weight was calculated for every case in the sample to account for its selection probability under the sample design.
- An adjustment for unknown eligibility was made to the base weight by distributing the weight of the unknown eligibility cases to the known eligibility cases proportionately to the observed eligibility rate within each adjustment class.
- An adjustment for nonresponse was made to the adjusted base weight to account for the known eligible sample cases for which no response was obtained.
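The three stages above can be sketched numerically. The toy example below uses a single adjustment class with invented weights; status codes are "R" (respondent), "NR" (known-eligible nonrespondent), "I" (ineligible), and "U" (unknown eligibility). It is a simplified illustration of the logic, not the SDR production weighting.

```python
# Toy sketch of the three weighting stages within one adjustment class.

def final_weights(cases):
    """cases: list of (base_weight, status). Returns the final analysis
    weight for each responding case, in input order."""
    w_elig = sum(w for w, s in cases if s in ("R", "NR"))   # known eligible
    w_inelig = sum(w for w, s in cases if s == "I")
    w_unknown = sum(w for w, s in cases if s == "U")

    # Stage 2: distribute the weight of unknown-eligibility cases to the
    # known-eligible cases in proportion to the observed eligibility rate.
    elig_rate = w_elig / (w_elig + w_inelig)
    f_elig = (w_elig + elig_rate * w_unknown) / w_elig

    # Stage 3: respondents absorb the adjusted weight of the known-eligible
    # nonrespondents.
    w_resp = sum(w for w, s in cases if s == "R")
    f_nonresp = w_elig / w_resp

    return [w * f_elig * f_nonresp for w, s in cases if s == "R"]

cases = [(10, "R"), (10, "R"), (10, "NR"), (10, "I"), (10, "U")]
print(final_weights(cases))  # → [18.75, 18.75]
```

The respondents' final weights sum to 37.5, the estimated weight of the eligible population (the 30 known-eligible units plus the 75% of the unknown-eligibility weight assumed eligible).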
3. Survey Quality Measures
a. Sampling variability
Estimates based on the total sample have relatively small sampling errors. However, sampling error increases and can be quite substantial when estimating characteristics of small subpopulations. Estimates of the sampling errors associated with various measures are included in the methodology report for the 2008 survey (available upon request) and in the forthcoming publication Characteristics of Doctoral Scientists and Engineers in the United States: 2008.
b. Coverage
The SDR has minimal coverage error given the minimal coverage error in the SED, which is the sample frame for most of the SDR sample. However, for the years prior to 1957 (the commencement of the SED), the sample frame was compiled from a variety of sources. Although this component of the sample frame was likely more subject to coverage problems than later cohorts, pre-1957 doctorates constitute less than 0.01% of the target population in 2008.
c. Nonresponse
(1) Unit nonresponse - The unweighted response rate for this survey in 2008 was 80.7%. Adjustment for unit nonresponse was based on statistical weighting techniques (see section 2.d above). The weighted response rate was 80.5%.
(2) Item nonresponse - In 2008, the item nonresponse rates for key items (employment status, sector of employment, field of occupation, and primary work activity) ranged from 0.0% to 2.7%. Some of the remaining variables had nonresponse rates that were considerably higher. For example, particularly sensitive variables, such as salary and earned income, had item nonresponse rates of 8.6% and 11.7%, respectively. Personal demographic data, such as marital status, citizenship, and race/ethnicity, had item nonresponse rates ranging from 0.0% to 3.7%. Cases missing the primary critical items were classified as survey nonresponse. Primary critical items included working for pay or profit, looking for work, last job, principal job, and living in the United States. All missing data were imputed using a hot-deck imputation procedure, except for the primary critical items, verbatim text items, and some coded variables based on verbatim text items.
(3) Imputation - The 2008 SDR used a combination of logical imputation and statistical imputation.
Logical Imputation. For the most part, logical imputation was accomplished as part of editing. In the editing phase, the answer to a question with missing data was sometimes determined by the answer to another question. In some circumstances, editing was also used to create "missing" data for statistical imputation. During sample frame building for the SDR, some demographic frame variables were found to be missing for sample members. The values for these variables were imputed at the frame construction stage, and when possible, were updated with data obtained during data collection.
Statistical Imputation. The primary method of statistical imputation in the 2008 SDR was hot-deck imputation. Almost all SDR variables were subjected to hot-deck imputation, where each variable had its own class and sort variable structure, created based on a regression analysis. Critical items (which must be collected for all respondents) and text variables were not imputed. Efforts were made to collect missing sampling variables directly from sample members during data collection, in order to replace the imputed values with respondent-reported information.
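A minimal sequential hot-deck sketch of this approach is shown below: records are grouped into imputation classes, sorted within class, and each missing value is filled from the most recent donor in the sorted order. The field names ("sector", "year", "salary") are invented for illustration; the actual SDR class and sort variables were chosen per item via regression analysis.

```python
from collections import defaultdict

# Simplified sequential hot-deck imputation; field names are hypothetical.

def hot_deck(records, class_key, sort_key, field):
    """Impute records[i][field] in place (None marks a missing value)."""
    classes = defaultdict(list)
    for rec in records:
        classes[class_key(rec)].append(rec)   # group into imputation classes
    for recs in classes.values():
        recs.sort(key=sort_key)               # order donors within the class
        donor = None
        for rec in recs:
            if rec[field] is not None:
                donor = rec[field]            # remember most recent donor value
            elif donor is not None:
                rec[field] = donor            # fill from the nearest prior donor
    return records

recs = [
    {"sector": "acad", "year": 2000, "salary": 60},
    {"sector": "acad", "year": 2001, "salary": None},
    {"sector": "ind",  "year": 2000, "salary": 90},
    {"sector": "ind",  "year": 2002, "salary": None},
]
hot_deck(recs, lambda r: r["sector"], lambda r: r["year"], "salary")
```

Each missing salary is filled from a donor in the same sector, so imputed values stay plausible for the class they fall in.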
For some variables, there is no set of class and sort variables that reliably relates to or predicts the missing value. In these instances, consistency was better achieved outside the hot-deck procedure by using random imputation. For example, respondents may have answered questions E2 or E3 regarding their spouse's or partner's employment status but failed to answer question E1 regarding their marital status. This implies that E1 should be '1' (Married) or '2' (Living in a marriage-like relationship). The procedure was to assign a random value for E1 with probability proportional to the number of cases holding each valid value (e.g., if there are three married respondents for every respondent living in a marriage-like relationship, then missing values of E1 are filled in with '1' 75% of the time and '2' 25% of the time).
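The random imputation just described can be sketched as a frequency-proportional draw. The code below mirrors the E1 example ('1' = married, '2' = marriage-like relationship) but is only an illustrative sketch, not the production routine.

```python
import random
from collections import Counter

# Fill missing responses by drawing from the valid codes with probability
# proportional to each code's observed frequency.

def random_impute(values, valid_codes, seed=0):
    rng = random.Random(seed)
    counts = Counter(v for v in values if v in valid_codes)
    codes, weights = zip(*counts.items())
    return [v if v is not None else rng.choices(codes, weights=weights)[0]
            for v in values]

# Three married respondents per marriage-like respondent: a missing E1 is
# filled with '1' about 75% of the time and '2' about 25% of the time.
observed = ["1"] * 75 + ["2"] * 25 + [None] * 8
filled = random_impute(observed, {"1", "2"})
```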
d. Measurement
Some of the key variables in this survey can be difficult to measure. For example, individuals do not always know the precise definitions of occupations that are used by experts in the field and thus may select occupational fields that are technically incorrect. To reduce measurement error, the 2006 SDR survey instruments were pretested using cognitive interviews and a mail pretest. The SDR instrument also benefited from extensive pretesting of the NSCG and NSRCG instruments, because most SDR questions also appear on the NSCG and the NSRCG. The 2008 SDR instrument was consistent with the 2006 SDR instrument.
As is true for any multimode survey, it is likely that the measurement errors associated with the different modalities are somewhat different. This possible source of measurement error is especially troublesome, because the proclivity to respond by one mode or another may be associated with variables of interest in the survey. To the extent that certain types of individuals may be relatively likely to respond by one mode compared to another, the multimodal approach may introduce some systematic biases into the data. However, a study of differences across modes was conducted after the 2003 survey and showed that all three modes yielded comparable data for the most critical data items. Further, data captured in the Web mode had lower item nonresponse for contacting variables and more complete verbatim responses for the occupation questions than did the data captured in the mail SAQ mode.
4. Trend Data
There have been a number of changes in the definition of the population surveyed over time. For example, prior to 1991, the survey included some individuals who had received doctoral degrees in fields outside of SEH or had received their degrees from non-U.S. universities. Because coverage of these individuals had declined over time, the decision was made to delete them beginning with the 1991 survey. Survey improvements made in 1993 are sufficiently great that NCSES staff suggest that trend analyses between the data from the surveys after 1991 and the surveys in prior years must be performed very cautiously, if at all. Individuals who wish to explore such analyses are encouraged to discuss this issue further with the survey project officer listed below.
5. Availability of Data
a. Publications
The data from this survey are published biennially in Detailed Statistical Tables in the series Characteristics of Doctoral Scientists and Engineers in the United States, as well as in several InfoBriefs and Special Reports. Information from this survey is also included in Science and Engineering Indicators; Women, Minorities, and Persons With Disabilities in Science and Engineering; and Science and Engineering State Profiles.
b. Electronic access
Data from this survey are available on the NCSES website and on the SESTAT website. Selected aggregate data are available in public use data files upon request. Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement.
c. Contact for more information
Additional information about this survey may be obtained by contacting:
Lynn Milan, Ph.D.
Human Resources Statistics Program
National Center for Science and Engineering Statistics
National Science Foundation
4201 Wilson Boulevard, Suite 965
Arlington, VA 22230
Phone: (703) 292-2275
SEH fields include biological, agricultural, and environmental life sciences; computer and information sciences; mathematics and statistics; the physical sciences; psychology; the social sciences; engineering; and health fields.
Individuals who turn 76 on the survey reference date are considered eligible, in order to simplify survey operations.
The SDR frame is based on the first U.S. doctoral degree earned. Prior to the 2003 SED, persons who had earned two doctoral degrees, where the first was a non-SEH degree and the second an SEH degree, were not included in the SDR frame. Based on information collected annually by the SED about the number and characteristics of those earning two doctorates, this exclusion results in negligible undercoverage bias. From 1983 to 2000, for example, the total number of double doctorate recipients with a non-SEH first degree and an SEH second doctorate was 154, representing 0.046% of the total number of SEH doctorates awarded in that period.
Dillman, Don A. 1978. Mail and Telephone Surveys: The Total Design Method. New York: Wiley-Interscience.
This type of edit would occur when the respondent provides data that are inconsistent with previously reported data or with the allowable range of responses according to the logic of the question. For instance, if the respondent reports working in 2008 and reports starting the job before 4/2006, but consistency checks show that the respondent marked never worked in 2006, then the reported start year and month would be set to missing and a start date between 4/2006 and 10/2008, inclusive, would be imputed. Another example would be a case in which a respondent is asked to designate the two most important reasons for taking a postdoc but reports the same reason for both. The second reason would be set to missing and its value imputed from the list of reasons supplied in the previous question, provided the respondent reported more than two valid reasons for pursuing a postdoc.
A small number of 2008 SDR old cohort and new cohort cases were missing critical SDR sampling stratification variables. Those variables are race, ethnicity, citizenship, and sex.