Characteristics of Doctoral Scientists and Engineers in the United States: 2006
Appendix A. Technical Notes
The Survey of Doctorate Recipients (SDR) gathers information from individuals who have obtained doctoral degrees from U.S. institutions in a science, engineering, or health (SEH) field. The SDR is a panel study (i.e., a longitudinal survey) that is conducted every 2 years on a nationally representative cohort of SEH research doctorate recipients. These technical notes on the 2006 SDR include information on the target population and sample design, data collection and response rates, data editing, imputation, weighting, reliability of estimates including sampling and nonsampling errors, and changes from previous cycles of the SDR. In addition, this appendix includes standard error tables (tables A-1 to A-81) that provide an estimate of the standard error for each estimate in the corresponding detailed statistical table. A more thorough discussion of the SDR protocol is provided in the 2006 SDR methodology report (available upon request).
The primary sponsor of the SDR is the National Science Foundation, Division of Science Resources Statistics (SRS). The National Institutes of Health also provide funding for the survey. The SDR is designed to complement two other surveys of scientists and engineers conducted by SRS, the National Survey of College Graduates (NSCG) and the National Survey of Recent College Graduates (NSRCG). The surveys are collectively known as the Scientists and Engineers Statistical Data System (SESTAT, http://www.nsf.gov/statistics/sestat/). The three surveys are closely coordinated and share the same reference date and nearly identical instruments. In addition, the three surveys are combined into a merged database that provides a comprehensive picture of the number and characteristics of individuals with bachelor’s level or higher education and/or employment in science, engineering, or health fields in the United States. Additional data on education and demographic information in the SDR come from the Survey of Earned Doctorates (SED), an annual census of research doctorates earned in the United States that began in 1957–58 (SED, http://www.nsf.gov/statistics/doctorates/). The SED provided a sampling frame for establishing the SDR in 1973 and continues to provide a sampling frame to replenish the SDR panel with new doctorate recipients each survey cycle.
Target Population and Sample Design
The 2006 SDR target population consisted of individuals who (1) had earned a research doctorate in a science, engineering, or health field from a U.S. institution, (2) were less than 76 years of age, and (3) were residing in the United States during the survey reference week of 1 April 2006.
As in previous cycles, the 2006 SDR sampling frame was constructed from two separate listings: the existing 2003 SDR cohort and a new cohort frame. The cohorts are defined by the year of receipt of the first U.S.-granted SEH doctoral degree. The existing cohort frame represents individuals who received their science, engineering, or health doctorate before 1 July 2002; the new cohort frame represents individuals who received their science, engineering, or health doctorate between 1 July 2002 and 30 June 2005.
The cases within the existing and new cohort frames were analyzed individually for SDR eligibility requirements. Persons who did not meet the age criteria or who were known to be deceased, terminally ill, incapacitated, or permanently institutionalized in a correctional or health care facility were dropped from the sampling frames. Sample persons who were non-U.S. citizens and were known to be residing outside the United States or one of its territories during at least two prior consecutive survey cycles were also eliminated from the existing frame. After ineligible cases were removed from consideration, the remaining cases from the existing and new cohort frames were used to create the sampling frame for the 2006 SDR. In total, there were 89,139 eligible cases in the 2006 SDR sampling frame, 49,703 new cohort cases and 39,436 existing cohort cases.
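As a concrete illustration, the eligibility screen described above can be sketched as a simple filter. The FrameCase fields, the example records, and the reduced rule set (age, vital/institutional status, and residence abroad) are assumptions for illustration, not the survey system's actual variables.

```python
# Hypothetical sketch of the frame-eligibility screen; fields and records
# are invented for illustration and simplify the rules described in the text.
from dataclasses import dataclass

@dataclass
class FrameCase:
    age: int
    deceased: bool = False
    institutionalized: bool = False
    us_citizen: bool = True
    cycles_abroad: int = 0  # consecutive prior survey cycles residing abroad

def is_eligible(case: FrameCase) -> bool:
    """Apply the simplified screening rules to a single frame case."""
    if case.age > 75:                                   # age criterion
        return False
    if case.deceased or case.institutionalized:         # vital/institutional status
        return False
    if not case.us_citizen and case.cycles_abroad >= 2:  # abroad two+ cycles
        return False
    return True

frame = [
    FrameCase(age=50),
    FrameCase(age=80),                                      # dropped: over 75
    FrameCase(age=45, us_citizen=False, cycles_abroad=2),   # dropped: abroad
]
eligible = [c for c in frame if is_eligible(c)]
```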
The 2006 SDR sample design was basically the same as the 2003 SDR design. The 2006 SDR sample consisted of 42,955 cases. The frame was stratified into 164 strata by three variables: demographic group, degree field, and sex. The sample was then selected from each stratum systematically. The goal of the 2006 SDR sample stratification design was to create strata that conformed as closely as possible to the reporting domains used by analysts and for which the associated subpopulations were large enough to be suitable for separate estimation and reporting. The demographic group variable included 10 categories defined by race/ethnicity, disability status, and citizenship at birth. The classification of frame cases into these categories was done in a hierarchical manner to ensure higher selection probability for rarer population groups.
Prior to 2003, a 15-category degree field variable was used to stratify all demographic groups, resulting in a large number of strata with very small populations. NSF decided that an alternative degree field variable was needed to stratify the smaller demographic groups. Beginning in 2003, only the three largest demographic groups (U.S.-citizen-at-birth, non-disabled, non-Hispanic whites; non-U.S.-citizen-at-birth, non-Hispanic whites regardless of disability status; and non-U.S.-citizen-at-birth, non-Hispanic Asians regardless of disability status) were stratified by the 15-category degree field variable. All other demographic groups were stratified by a 7-category degree field variable except for non-Hispanic American Indians (including Alaskan Natives) regardless of citizenship-at-birth and disability status, and non-Hispanic Pacific Islanders (including Native Hawaiians) regardless of citizenship-at-birth and disability status who were stratified only by sex. Thus, the 2006 SDR design featured a total of 164 strata defined by a revised demographic group variable, two degree-field variables, and sex.
The 2006 SDR sample allocation strategy consisted of three main components: (1) allocate a minimum sample size for the smallest strata through a supplemental stratum allocation; (2) allocate extra sample for specific demographic group-by-sex domains through a supplemental domain allocation; and (3) allocate the remaining sample proportionately across all strata. The final sample allocation was therefore based on the sum of a proportional allocation across all strata, a domain-specific supplement allocated proportionately across strata in that domain, and a stratum-specific supplement added to obtain the minimum stratum size.
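A minimal sketch of the three-part allocation follows; the stratum counts, domain labels, total sample, minimum, and supplement sizes are invented for the example, and a production allocation would also control how rounding error is distributed across strata.

```python
# Illustrative sketch of the three-component allocation described above.
def allocate(frame_sizes, total, minimum, domain_extra, domains):
    """frame_sizes: stratum -> frame count; domains: stratum -> domain label;
    domain_extra: domain label -> supplemental sample for that domain."""
    n_frame = sum(frame_sizes.values())
    # (3) proportional allocation of the base sample across all strata
    alloc = {s: round(total * n / n_frame) for s, n in frame_sizes.items()}
    # (2) domain supplements, spread proportionately across strata in the domain
    for d, extra in domain_extra.items():
        d_total = sum(n for s, n in frame_sizes.items() if domains[s] == d)
        for s, n in frame_sizes.items():
            if domains[s] == d:
                alloc[s] += round(extra * n / d_total)
    # (1) top up any stratum that still falls below the minimum size
    return {s: max(a, minimum) for s, a in alloc.items()}

alloc = allocate({"A": 800, "B": 150, "C": 50}, total=100, minimum=10,
                 domain_extra={"d2": 20},
                 domains={"A": "d1", "B": "d1", "C": "d2"})
```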
The 2006 SDR sample selection was carried out independently for each stratum and cohort-substratum. For the existing cohort strata, the past practice of selecting the sample with probability proportional to size continued, where the measure of size was the base weight associated with the previous survey cycle. For each stratum, the sampling algorithm started by identifying and removing self-representing cases (i.e., those with a base weight = 1) through an iterative procedure. Next, the non-self-representing cases (i.e., those with a base weight > 1) within each stratum were sorted by citizenship, disability status, Doctorate Records File degree field, and year of doctoral degree award. Finally, the balance of the sample (i.e., the total allocation minus the number of self-representing cases) was selected from each stratum systematically with probability proportional to size.
The new cohort sample was selected using the same algorithm used to select the existing cohort sample. However, because the base weight for every case in the new cohort frame was identical, each stratum sample from the new cohort was actually an equal-probability or self-weighting sample. Thus, the 2006 SDR sample of 42,955 consisted of 38,027 cases from the existing cohort frame and 4,928 cases from the new cohort frame. The overall sampling rate was about 1 in 18 (5.5%). However, sampling rates varied considerably across the strata.
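The selection step can be sketched as systematic sampling with probability proportional to size, with the self-representing cases taken with certainty first. The sort key, the fixed random start, and the example data below are illustrative stand-ins for the citizenship/degree-field/award-year sort and the actual frame.

```python
# Minimal sketch of systematic PPS selection within one stratum.
def systematic_pps(cases, n, start=0.5):
    """cases: list of (case_id, base_weight); n: total stratum allocation.
    start is a fixed fractional random start, used here for reproducibility."""
    certain = [cid for cid, w in cases if w == 1]            # self-representing
    # stand-in sort key for the citizenship/degree-field/award-year sort
    rest = sorted((c for c in cases if c[1] > 1), key=lambda c: c[0])
    k = n - len(certain)                                     # balance to select
    interval = sum(w for _, w in rest) / k
    selected, cum, next_hit = [], 0.0, start * interval
    for cid, w in rest:
        cum += w
        while cum > next_hit:            # this case's size spans the next hit
            selected.append(cid)
            next_hit += interval
    return certain + selected

sample = systematic_pps([("a", 1), ("b", 4), ("c", 4), ("d", 4), ("e", 4)], n=3)
```

Because every new cohort case carried the same base weight, this same algorithm reduces to an equal-probability systematic sample for the new cohort strata.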
Data Collection and Response Rates
Data collection for the 2006 SDR used three protocols. Each protocol had a different initial mode of data capture based primarily on the existing cohort’s prior indication of mode preference: self-administered paper questionnaire (SAQ), computer-assisted telephone interview (CATI), and self-administered online questionnaire (Web). After the initial contact, each protocol included sequential contacts by postal mail, telephone, and e-mail and ran in parallel throughout the data collection period. In addition, sample members were encouraged to switch to any other mode for their convenience in providing their response.
SAQ. The protocol for those starting in the SAQ mode (37% of sample members) was as follows: sample members first received an advance notification letter from NSF to acquaint them with the survey. The first questionnaire mailing occurred a week later, followed by a thank you/reminder postcard the following week. Approximately seven weeks after the first questionnaire mailing, the sample members who had not returned a completed questionnaire (by any mode) were sent a second questionnaire by U.S. priority mail. Five weeks later, any cases still not responding received a prompting notice via e-mail to verify receipt of the paper form and encourage cooperation. Telephone follow-up calls began three weeks later for all outstanding mail-start mode nonrespondents and requested participation, preferably by the CATI mode.
CATI. The protocol for those starting in the CATI mode (18% of sample members) was as follows: sample members first received an advance notification letter from NSF to notify them about the survey. One week later, telephone contacting and interviewing began. Approximately seven weeks later, sample members who had not yet responded were sent an e-mail prompt to solicit survey participation in any mode. Four weeks later, any cases still not responding received a first questionnaire mailing sent via U.S. mail, followed by a thank you/reminder postcard one week later. Seven weeks after the first questionnaire mailing, a second questionnaire was mailed to the remaining nonrespondents.
Web. The protocol for those starting in the Web mode (45% of sample members) was as follows: sample members first received a survey notification letter via U.S. mail and e-mail. Three weeks later, nonrespondents were sent a follow-up letter via U.S. mail and e-mail. Three weeks later, any cases still not responding received a prompting telephone call to verify receipt of the Web-survey access information and encourage cooperation. Telephone follow-up calls to complete the CATI for all Web-start mode nonrespondents began four weeks later. Six weeks later, any cases still not responding received a first questionnaire mailing sent via U.S. mail, followed by a thank you/reminder postcard one week later. Seven weeks after the first questionnaire mailing, a second questionnaire was mailed to the remaining nonrespondents.
At the very end of the field period, an additional notice to gain cooperation was sent via U.S. mail and e-mail to all remaining nonrespondents regardless of their initial start-mode protocol.
Quality assurance procedures were in place at each step (address updating, printing, package assembly and mailing, questionnaire receipt, data entry, coding, CATI, and post-data-collection processing). Active data collection ended in December 2006. The telephone contact and data entry processes ended on 14 December 2006; however, the Web survey remained accessible through January 2007 to capture any last-minute responses. Overall, 32% of the responses came by SAQ, 21% by CATI, and 47% by Web, with approximately 25% of respondents choosing to respond in a mode other than their initial start mode.
Extensive locating and follow-up efforts were conducted to find sample members and obtain their responses. The overall unweighted response rate was 77.9%; the weighted response rate was 78.3%. Both rates are comparable to those obtained in past survey cycles. Lower response rates generally occurred among non-U.S. citizens (weighted response rate = 68.2%) and among persons with missing demographic data (weighted response rate = 48.4%). Missing demographic data typically indicated incomplete records from the SED, which made these cases more difficult to locate. Prior experience has shown that sample members who are located generally complete the survey; individuals who could not be located accounted for the majority of nonresponse cases (62.4%).
Data Editing and Coding
Complete case data were captured and edited under the three separate data collection modes for the 2006 SDR. A computer-assisted data entry system was used to process the SAQ paper forms, whereas the CATI system (including an additional CATI instrument used to collect critical-item follow-up data) and the Web survey had internal editing controls. Mail and Web returns were reviewed for any missing critical items (working status, job code, or resident status in the United States), and telephone callbacks were initiated to obtain this information so that the response could be considered complete. All completed CATI responses included the critical items. After receipt of this information, data from the three modes were merged into a single database for all subsequent coding, editing, and cleaning.
Following established SESTAT guidelines, staff were trained in conducting a standardized review and coding of occupation and education information, "Other/Specify" verbatim responses, state and country geographical information, and postsecondary institution information. For standardized coding of occupation, the respondent's occupational data were reviewed along with other work-related data from the questionnaire by specially trained coders to correct known respondent self-reporting problems to obtain the best occupation codes. The assignment of an education code for a newly earned degree was based solely on the verbatim response for degree field.
Imputation of Missing Data
Item nonresponse for key employment items, such as employment status, sector of employment, and primary work activity, ranged from 0.0% to 2.2%. Nonresponse to a few questions deemed somewhat sensitive, such as salary or earned income, was between 8.2% and 12.2%. Personal demographic data, such as marital status, citizenship, and race/ethnicity, had item nonresponse rates ranging from 0.0% to 3.6%. Item nonresponse was imputed using logical imputation and hot-deck imputation methods.
For the most part, logical imputation was accomplished as part of editing. In the editing phase, the answer to a question with missing data was sometimes determined by the answer to another question. In some circumstances, editing procedures found inconsistent data that were blanked out and therefore subject to statistical imputation as well. During sample frame building for the SDR, some demographic frame variables, such as race or ethnicity, that were found to be missing for sample members were imputed at the frame construction stage using additional information on the sampling frame.
The 2006 SDR primary method for statistical imputation was hot-deck imputation. Almost all SDR variables were subjected to hot-deck imputation, where each variable had its own class and sort variables structured by a multiple regression analysis. However, imputation was not performed on critical items (which must be provided for a case to be considered complete) and text variables. For some variables, there was no set of class and sort variables that were reliably related to or suitable for predicting the missing value. In these instances consistency was better achieved outside of the hot deck procedures using random imputation.
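A toy sketch of class-based hot-deck imputation follows, under simplifying assumptions: one class variable, one sort variable, and a donor taken as the nearest preceding reported value. The production procedure used regression-derived class and sort variables per item and more elaborate donor selection.

```python
# Toy class-and-sort hot deck: within each imputation class, records are
# sorted and a missing value is filled from the nearest preceding donor.
# Variable names and data are illustrative.
def hot_deck(records, class_var, sort_var, target):
    """records: list of dicts; a missing target value is None (filled in place)."""
    by_class = {}
    for r in records:
        by_class.setdefault(r[class_var], []).append(r)
    for group in by_class.values():
        group.sort(key=lambda r: r[sort_var])
        donor = None
        for r in group:
            if r[target] is not None:
                donor = r[target]      # remember the most recent reported value
            elif donor is not None:
                r[target] = donor      # fill from the nearest preceding donor
    return records

records = [
    {"field": "bio", "year": 1990, "salary": 70},
    {"field": "bio", "year": 1995, "salary": None},   # imputed from the 1990 case
    {"field": "eng", "year": 1992, "salary": 90},
]
hot_deck(records, "field", "year", "salary")
```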
Weighting
To enable weighted analyses of the 2006 SDR data, a final weight was calculated for every person in the sample. In general, a final weight approximates the number of persons in the population of recipients of U.S. doctorates that a sampled person represents. The primary purpose of the weights is to adjust the statistical estimates for potential bias due to unequal selection probabilities and nonresponse. The first step of the weighting process calculated a base weight for all cases selected into the 2006 SDR sample. The base weight accounts for the sample design, and it is defined as the reciprocal of the probability of selection under the sample design. In the next step, an adjustment for nonresponse was performed on completed cases to account for the sample cases that did not complete the survey. Nonresponse-adjusted weights were assigned to both respondents and known ineligible cases (i.e., cases who were deceased, institutionalized, over 75 years of age, or living abroad during the survey reference period), but eligible nonrespondents and cases with unknown eligibility received a weight of zero. The total weight carried by unknown-eligibility cases was distributed to respondents assuming the same eligibility rate as observed among the respondents. Thus the sum of weights equals the frame size.
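A simplified one-cell sketch of these weighting steps, under stated assumptions: four status codes and a single adjustment class (the production adjustment is done within weighting classes), with the unknown-eligibility weight split between respondents and known ineligibles at an assumed eligibility rate so that the weights still sum to the frame total.

```python
# Simplified weighting sketch: base weight, then one nonresponse adjustment.
def base_weight(p_selection):
    """Step one: the base weight is the reciprocal of the selection probability."""
    return 1.0 / p_selection

def adjust_weights(cases):
    """cases: list of (status, base_weight) pairs; returns adjusted weights."""
    w = {"respondent": 0.0, "ineligible": 0.0, "nonrespondent": 0.0, "unknown": 0.0}
    for status, bw in cases:
        w[status] += bw
    # eligibility rate among cases of known status
    elig = (w["respondent"] + w["nonrespondent"]) / (
        w["respondent"] + w["nonrespondent"] + w["ineligible"])
    # respondents absorb eligible nonrespondents plus the assumed-eligible share
    # of the unknown weight; known ineligibles absorb the remainder
    f = {
        "respondent": (w["respondent"] + w["nonrespondent"]
                       + elig * w["unknown"]) / w["respondent"],
        "ineligible": ((w["ineligible"] + (1 - elig) * w["unknown"])
                       / w["ineligible"] if w["ineligible"] else 0.0),
        "nonrespondent": 0.0,   # eligible nonrespondents weighted to zero
        "unknown": 0.0,         # unknown-eligibility cases weighted to zero
    }
    return [bw * f[status] for status, bw in cases]

cases = [("respondent", 10.0), ("respondent", 10.0), ("nonrespondent", 5.0),
         ("ineligible", 5.0), ("unknown", 2.0)]
adjusted = adjust_weights(cases)
```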
Reliability of Estimates
Because the estimates produced from the SDR are based on a probability sample, they may vary from those that would have been obtained if all members of the target population had been surveyed using the same data-collection procedures. Two types of error are possible when population estimates are derived from a sample survey: sampling error and nonsampling error. By looking at these errors, the accuracy and precision of the survey estimates can be assessed for reliability in relation to sampling error and for bias in relation to nonsampling error.
Sampling error is the variation that occurs by chance because a sample, rather than the entire population, is surveyed. The particular sample that was used to estimate the 2006 population of science, engineering, and health doctorate recipients in the United States is one of a large number of samples that could have been selected using the same sample design and sample size. Estimates based on each of these samples would be apt to vary, and such random variation across all possible samples is called the sampling error. Sampling error is measured by the variance or standard error of the survey estimate. The 2006 SDR sample is a systematic sample selected independently from each sampling stratum. The successive difference replication method (SUD) was used to estimate the sampling errors. The theoretical basis for the SUD is described in Wolter (1984) and in Fay and Train (1995). As with any replication method, successive differences replication involves constructing a number of subsamples (replicates) from the full sample and computing the statistics of interest for each replicate. The mean square error of the replicate estimates around their corresponding full sample estimate provides an estimate of the sampling variance of the statistic of interest.
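In generic form, a replication variance estimate of this kind can be sketched as follows. The 4/R scaling is the factor commonly used with successive difference replicate weights; the values and replicate weights below are invented for illustration.

```python
# Generic replication variance sketch: scaled mean squared deviation of the
# replicate estimates around the full-sample estimate.
def replication_variance(values, full_weights, replicate_weights):
    """Estimate the sampling variance of a weighted total."""
    def weighted_total(w):
        return sum(v * wi for v, wi in zip(values, w))
    full = weighted_total(full_weights)
    reps = [weighted_total(w) for w in replicate_weights]
    return (4.0 / len(reps)) * sum((r - full) ** 2 for r in reps)

variance = replication_variance(
    values=[1, 2, 3],
    full_weights=[1, 1, 1],
    replicate_weights=[[1.5, 1, 0.5], [0.5, 1, 1.5]],
)
standard_error = variance ** 0.5
```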
Standard Error Tables
Each statistical data table included in this report has a corresponding standard error table in this appendix, computed by the method described above. For example, table A-1 is the standard error table that corresponds to table 1. The standard error of an estimate can be used to construct a confidence interval for the estimate. To construct a 95% confidence interval, the standard error is multiplied by a z-score of 1.96 (the reliability coefficient); the product is added to the estimate to establish the upper bound of the interval and subtracted from the estimate to establish the lower bound.
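The construction reads simply in code; the estimate and standard error below are placeholders, not published values.

```python
# 95% confidence interval from an estimate and its standard error.
def confidence_interval(estimate, standard_error, z=1.96):
    """Return (lower, upper) bounds: estimate +/- z * SE."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width

low, high = confidence_interval(10000, 250)   # placeholder values
```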
In addition to sampling error, survey estimates are subject to nonsampling error, which can arise at many points in the survey process. Sources of nonsampling error include (1) nonresponse error, which arises when the characteristics of respondents differ systematically from nonrespondents; (2) measurement error, which arises when the variables of interest cannot be precisely measured; (3) coverage error, which arises when some members of the target population are excluded from the frame and thus do not have a chance to be selected for the sample; (4) respondent error, which occurs when respondents provide incorrect data; and (5) processing error, which can arise at the point of data editing, coding, or data entry. The analyst should be aware of potential nonsampling errors, but these errors are much harder to quantify than sampling errors. As noted previously, quality assurance procedures are included throughout the various stages of data collection and data processing to reduce possibilities for nonsampling error.
Changes in the Survey
Caution should be exercised when making comparisons with previous SDR results. In all previous cycles of the SDR, the new cohort consisted of graduates from the two academic years immediately preceding the survey year. However, in 2006, data were collected from graduates in the three previous academic years.
Before 2003, data on employed doctorate recipients were presented in only two categories: by employment in an S&E occupation and by employment in a non-S&E occupation. In 2003 a third category, S&E-related occupations, was added. S&E-related occupations include health-related occupations, S&E managers, S&E precollege teachers, and S&E technicians and technologists.
The 2006 SDR maintained the questionnaire design changes that were implemented in 1993 (for the survey questionnaire, see appendix C). The questionnaire comprises a large set of core data items that are retained in each survey round to enable trend comparisons, and several sets of module questions asked intermittently on special topics of interest. In the 2006 SDR, the questionnaire included a module on history of postdoctoral appointments, awarded primarily for gaining additional education and training in research, as a follow-up to a similar module included in the 1995 SDR. A module on international collaboration among doctorate recipients also was part of the 2006 questionnaire.
In addition to the postdoctoral appointment module, new questions were added to request the current job title among those working during the reference period and the last job title held among those not working during the reference period. A question on overall job satisfaction and a question regarding academic position among those working at a postsecondary academic institution, both added in 2003, were retained in 2006. A special module on publication and patenting activities during the past 2-year period, first introduced in 1995 and fielded in 2001 and 2003, was dropped from the questionnaire in 2006. Also dropped from the 2006 questionnaire were questions asked of foreign-born doctorate recipients in the 2003 SDR to obtain information about immigration.
Definitions and Explanations
Employer location. Survey question A8 asked for the location of the principal employer, and the data are based primarily on responses to this question. Individuals not reporting a place of employment were classified by their last mailing address.
Field of doctorate. The doctoral field is as specified by the respondent in the SED at the time of degree conferral. These codes were subsequently recoded to the field of study codes used in SESTAT questionnaires. (See appendix table B-1 for field-of-study codes.)
Full time and part time employment. Full time (working 35 hours or more per week) and part time (working less than 35 hours per week) employment status is for principal job only, not for all jobs held in the labor force. For example, an individual could work part time in his/her principal job, but full time in the labor force. Full time and part time employment status is not comparable to data reported in previous years when no distinction was made between the principal job and other jobs held by the individual.
Involuntarily out-of-field rate. The involuntarily out-of-field rate is the percentage of employed individuals who reported working part time exclusively because a suitable job was not available and/or reported working in an area not related to the first doctoral degree (in their principal job), at least partially because a job in the doctoral field was not available.
Labor force participation rate. The labor force participation rate (RLF) is the ratio (E + U) / P, where E (employed) + U (unemployed; those not-employed persons actively seeking work) = the total labor force, and P = population, defined as all science, engineering, and health doctorate holders less than 76 years of age who were residing in the United States during the week of 1 April 2006 and who earned their doctorates from U.S. institutions.
Non-U.S. citizen, temporary resident. This citizenship status category does not include individuals who at the time they received their doctorate reported plans to leave the United States and thus were excluded from the sampling frame.
Occupation data. These data were derived from responses to several questions on the kind of work primarily performed by the respondent. The occupational classification of the respondent was based on his/her principal job (including job title) held during the reference week—or last job held, if not employed in the reference week (survey questions A17/A18 or A5/A6). Also used in the occupational classification was a respondent-selected job code (survey question A19 or A7). (See appendix table B-2 for the list of occupations.)
Race/ethnicity. American Indian/Alaska Native, Asian, black, Native Hawaiian/Other Pacific Islander, white, and persons reporting more than one race refer to non-Hispanic individuals only. These race/ethnicity data are from prior rounds of the SDR and the SED. The most recently reported race/ethnicity data were given precedence.
Salary. Median annual salaries are reported for the principal job, are rounded to the nearest $100, and are computed for full-time employed scientists and engineers. For individuals employed by educational institutions, no accommodation was made to convert academic-year salaries to calendar-year salaries. Users are advised that due to changes in the salary question after 1993, salary data for 1995–2006 are not strictly comparable with 1993 salary data.
Sector of employment. "Employment sector" is a derived variable based on responses to survey questions A11 and A13. In the detailed tables, the category "4-year educational institutions" includes 4-year colleges or universities, medical schools (including university-affiliated hospitals or medical centers), and university-affiliated research institutions. "Other educational institutions" include 2-year colleges, community colleges, or technical institutes and other precollege institutions. "Private-for-profit" includes those self-employed in an incorporated business. "Self-employed" includes those self-employed or a business owner in a non-incorporated business.
Unemployment rate. The unemployment rate (Ru) is the ratio U / (E + U), where U = unemployed (those not-employed persons actively seeking work), and E (employed) + U = the total labor force.
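Both rate definitions can be written out directly from the counts of employed persons (E), unemployed persons (U), and the population (P); the counts below are invented for illustration.

```python
# The two labor force rates defined above.
def labor_force_participation_rate(E, U, P):
    """R_LF = (E + U) / P, where U counts not-employed persons seeking work."""
    return (E + U) / P

def unemployment_rate(E, U):
    """R_u = U / (E + U), the unemployed share of the total labor force."""
    return U / (E + U)

E, U, P = 900, 20, 1000   # illustrative counts only
```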
Changes in the Detailed Statistical Tables
The 2006 SDR report adds nine tables to the complement of tables provided in the 2003 SDR. Six of these report data from the 2006 SDR questionnaire module on temporary postdoctoral appointments awarded primarily for gaining additional education and training in research. The remaining three tables provide data about the population of doctoral scientists and engineers with disabilities. The rest of the changes to the 2006 report were made to labels and headers of existing tables. Tables for the 2006 SDR report retain the changes made in the 2003 SDR that provided for more detailed field-of-doctorate and occupation classifications than in tables in earlier survey reports.
Fay RE, Train GF. 1995. Aspects of survey and model-based postcensal estimation of income and poverty characteristics for states and counties. ASA Proceedings of the Section on Government Statistics: 154–159.
Wolter KM. 1984. An investigation of some estimators of variance for systematic sampling. Journal of the American Statistical Association 79:781–790.
 The SDR frame is based on the first U.S. doctorate earned in a science, engineering, or health (SEH) field. Prior to 2003, recipients of two doctorates whose first degree was in a non-SEH field were not included in the SDR frame, even if their second doctorate was in a SEH field. Based on information collected annually by the Survey of Earned Doctorates on the number and characteristics of those earning two doctorates, this exclusion resulted in a slight undercoverage bias. Between 1983 and 2000, for example, the total number of double doctorate recipients with a non-SEH first doctorate and a SEH second doctorate was 154, representing 0.046% of the total number of SEH doctorates awarded in that period. Starting in 2003, the new cohort frame included all SEH doctorate recipients except those who earned an SEH doctorate in a prior year.
 For more complete details regarding the 2006 SDR mode assignments and data collection protocols, see "2006 Survey of Doctorate Recipients Mode Assignment Analysis Report," Grigorian and Hoffer, 2007.
Standard Error Tables