Characteristics of Doctoral Scientists and Engineers in the United States: 2008
Appendix A. Technical Notes
The SDR is a panel study conducted every 2 years on a nationally representative cohort of individuals who have received a research doctorate in a science, engineering, or health (SEH) field. The National Science Foundation (NSF), through its National Center for Science and Engineering Statistics (NCSES), is the primary sponsor of the SDR. The National Institutes of Health also provides funding for the survey. The reference date for the 2008 SDR was 1 October 2008. The 2008 SDR was conducted by the National Opinion Research Center (NORC) at the University of Chicago.
The SDR is designed to complement two other surveys of scientists and engineers conducted by NCSES: the National Survey of College Graduates (http://www.nsf.gov/statistics/srvygrads/) and the National Survey of Recent College Graduates (http://www.nsf.gov/statistics/srvyrecentgrads/). These three surveys share a reference date and have overlapping and compatible questionnaires. Results from the three surveys are combined in the Scientists and Engineers Statistical Data System (SESTAT) database (see "Data Availability").
Additional data on education and demographic information in the SDR come from the Survey of Earned Doctorates (SED), an annual census of research doctorates earned in the United States that began in 1957 (http://www.nsf.gov/statistics/doctorates/). The SED provided a sampling frame for establishing the SDR in 1973 and continues to provide a sampling frame to replenish the SDR panel with new doctorate recipients for each new SDR survey cycle.
This appendix provides an overview of the SDR protocol. More thorough discussion is provided in the 2008 SDR methodology report, available upon request from the project officer.
The 2008 SDR target population consisted of individuals who earned a research doctorate in an SEH field from a U.S. institution, were less than 76 years of age, and resided in the United States during the survey reference week of 1 October 2008.
As in previous cycles, the 2008 SDR sampling frame was constructed from two separate listings: the existing 2006 SDR cohort and a new cohort frame. The cohorts are defined by the year of receipt of the first U.S.-granted SEH doctoral degree. (See appendix B-1 for SEH fields included in the 2008 SDR sampling frame.) The existing cohort frame represents individuals who received their SEH doctorate before 1 July 2005; the new cohort frame represents individuals who received their SEH doctorate between 1 July 2005 and 30 June 2007. The existing cohort frame is a secondary frame—it consists of the SDR sample selected for the previous survey cycle, and each frame member carried a sampling weight from the previous cycle. The new cohort frame is a primary frame, including all known eligible cases from the two most recent doctoral award years.
The cases within the existing and new cohort frames were reviewed individually against the SDR eligibility requirements. Individuals who did not meet the age criterion or who were known to be deceased, terminally ill, incapacitated, or permanently institutionalized in a correctional or health care facility were dropped from the sampling frames. Sample persons who were not U.S. citizens and were known to be residing outside the United States or one of its territories during at least two prior consecutive survey cycles were also eliminated from the existing frame. After ineligible cases were removed from consideration, remaining cases from the two frame sources were combined to create the 2008 SDR sampling frame. In total, there were 102,579 eligible cases in the 2008 SDR frame: 41,612 existing cohort cases and 60,967 new cohort cases.
The 2008 SDR sample design reduced the number of sampling strata from 164 (in 2006 and 2003) to 150 by eliminating the strata for cases with missing race and ethnicity values. Missing race and ethnicity values were logically imputed from surname or place of birth.
The frame was stratified into the 150 strata by three variables—demographic group, degree field, and sex. The sample was then systematically selected from each stratum. The demographic group variable included nine categories defined by race/ethnicity, disability status, and citizenship at birth. To ensure higher selection probability for rarer population groups, classification of frame cases into these categories was done hierarchically. The goal of the 2008 sample stratification design was to create strata that conformed as closely as possible to the reporting domains used by analysts, provided that the associated subpopulations were large enough to be suitable for separate estimation and reporting.
The 2008 SDR sample selection was carried out independently for each stratum and cohort-substratum. For existing cohort strata, the past practice of selecting the sample with probability proportional to size continued, where the measure of size was the base weight associated with the previous survey cycle. For each stratum, the sampling algorithm started by identifying and removing self-representing cases (i.e., those with a base weight = 1) through an iterative procedure. Next, the non-self-representing cases (i.e., those with a base weight > 1) within each stratum were sorted by citizenship, disability status, Doctorate Records File degree field, and year of doctoral degree award. Finally, the balance of the sample (i.e., the total allocation minus the number of self-representing cases) was selected from each stratum systematically with probability proportional to size.
The new cohort sample was selected using the same algorithm used to select the existing cohort sample. However, since the base weight for every case in the new cohort frame was identical, each stratum sample from the new cohort was actually an equal-probability or self-weighting sample.
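The systematic probability-proportional-to-size selection described above, applied after self-representing cases have been removed, can be sketched as follows. This is an illustrative implementation under simplified assumptions (the function name and toy inputs are not from the SDR documentation); the production algorithm also handles the iterative removal of self-representing cases and the stratum sort order.

```python
import random

def systematic_pps(sizes, n, seed=None):
    """Select n of len(sizes) units systematically with probability
    proportional to size (here, size = prior-cycle base weight).
    Assumes self-representing units (size >= skip interval) were
    already removed, as in the SDR procedure."""
    rng = random.Random(seed)
    total = sum(sizes)
    interval = total / n                 # skip interval
    start = rng.uniform(0, interval)     # random start in [0, interval)
    targets = [start + k * interval for k in range(n)]
    selected, cum, i = [], 0.0, 0
    for idx, size in enumerate(sizes):
        cum += size                      # walk the cumulative size scale
        while i < n and targets[i] < cum:
            selected.append(idx)         # target falls inside this unit
            i += 1
    return selected
```

With equal sizes (as in the new cohort frame, where every base weight was identical) this reduces to an equal-probability systematic sample, consistent with the self-weighting property noted above.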
Thus, the 2008 SDR sample of 40,093 cases consisted of 36,644 cases from the existing cohort frame and 3,449 cases from the new cohort frame. The overall sampling rate was about 1 in 20 (5.0%), although sampling rates varied considerably across strata. Of these 40,093 sampled cases, 29,974 completed the survey and were eligible for inclusion in SESTAT. All critical items must be provided for a case to be considered complete. These completed eligible cases consisted of 27,252 cases from the existing cohort frame and 2,722 cases from the new cohort frame.
The questionnaire comprises a large set of core data items that are retained in each survey round to enable trend comparisons and several sets of module questions asked intermittently on special topics of interest. The module for the 2008 SDR gathered information on sample members' second job (previously asked on the 2001 SDR). Two sets of questions from the 2003 questionnaire were also reinstated: (1) questions measuring the technical expertise required for the primary job held by respondents and by respondents' spouses; and (2) questions measuring respondents' research productivity (authorship or co-authorship of papers, articles, books, or monographs, and the number and type of patents earned). The modules on history of postdoctoral appointments and international collaboration among doctorate recipients from the 2006 SDR were not used. (See appendix C for the questionnaire.)
As noted, critical items are required for a case to be considered complete. After indicating their residence location (in or out of the United States) and employment status (working or not working on the reference date), all respondents must provide the title, description, and category of their current or most recent job, and non-working respondents must also indicate whether they were looking for employment during the four weeks prior to the reference date.
Data collection for the 2008 SDR employed three protocols. Each protocol used a different initial mode for data capture based primarily on the existing cohort's prior indication of mode preference:
After initial contact, each protocol included sequential contacts by postal mail, telephone, and e-mail that ran in parallel throughout the data collection period. In addition, sample members were encouraged to participate in the mode that was most convenient for them.
SAQ protocol (38% of sample members; 15,119). Initial contact was an advance notification letter from NSF. The first questionnaire was mailed 1 week after initial contact, followed by a thank you/reminder postcard mailed 1 week later. Approximately 6 weeks after the first questionnaire mailing, sample members who had not returned a completed questionnaire (by any mode) were sent a second questionnaire by U.S. priority mail. Three weeks later, any cases still not responding received a prompting notice via e-mail to verify receipt of the paper form and encourage cooperation. Telephone follow-up calls began 2 weeks later for all outstanding mail-start mode nonrespondents and requested participation, preferably by the CATI mode.
CATI protocol (5% of sample members; 1,788). Initial contact was an advance notification letter from NSF. Telephone contact and interviewing began 1 week after initial contact. Approximately 6 weeks later, sample members who had not yet responded were sent an e-mail prompt to solicit survey participation in any mode. Three weeks later, any cases still not responding received a first questionnaire mailing sent via U.S. mail, followed by a thank you/reminder postcard one week later. Seven weeks after the first questionnaire mailing, a second questionnaire was mailed to the remaining nonrespondents.
Web protocol (57% of sample members; 22,826). Initial contact was a survey notification letter via U.S. mail and e-mail. Two and one-half weeks after initial contact, sample members who had not yet responded were sent a follow-up letter via U.S. mail and e-mail. Two weeks later, any cases still not responding received a prompting telephone call to verify receipt of the Web-survey access information and encourage cooperation. Telephone follow-up calls to complete the CATI for all Web-start mode nonrespondents began 2 weeks later. Four weeks after the start of telephone contact, any cases still not responding received a first paper questionnaire via U.S. mail, followed by a thank you/reminder postcard 1 week later. Seven weeks after the first questionnaire mailing, a second questionnaire was mailed to the remaining nonrespondents. At the end of the field period, an additional notice to gain cooperation was sent via U.S. mail and e-mail to all remaining nonrespondents regardless of their initial start-mode protocol.
Quality assurance procedures were in place at each step (address updating, printing, package assembly and mailing, questionnaire receipt, data entry, coding, CATI, and post-data-collection processing). Active data collection ended in July 2009. The telephone contact and data entry processes ended on 15 July 2009. However, the Web-survey access remained available until 17 August 2009 to capture any last-minute responses. Overall, 30.1% of the responses were SAQ, 11.7% were CATI, and 55.1% were Web surveys, with approximately 28% of the respondents choosing to respond in a mode other than their initial start mode.
Response rates were calculated on complete responses, as determined by the presence of critical items. The overall unweighted response rate was 80.7%; the weighted response rate was 80.5%. The 2008 SDR unweighted and weighted response rates are comparable to the response rates obtained in past survey cycles. Lower response rates generally occurred among non-U.S. citizens (unweighted response rate = 71.0%) and among persons with missing demographic data (unweighted response rate = 47.2%). Missing demographic data typically indicated incomplete records from the SED; these cases are more difficult to locate. Prior experience has shown that sample members who are located usually complete the survey. Individuals who could not be located accounted for a high proportion of nonresponse cases (42.7%).
Data Editing and Coding
Complete case data were captured and edited under the three separate data collection modes for the 2008 SDR. A computer-assisted data-entry system was used to process the SAQ paper forms. The CATI system, including an additional CATI instrument used to collect critical-item follow-up data, and the Web survey had internal editing controls. Mail questionnaire data and Web-based returns were reviewed for any missing critical items (working status, job code, and resident status in the United States). Telephone callbacks were used to obtain this information for a complete response. (All completed CATI responses included critical items.) Complete responses from the three separate modes were merged into a single database for all subsequent coding, editing, and cleaning.
Following established SESTAT guidelines, staff were trained in conducting a standardized review and coding of occupation and education information, "Other/Specify" verbatim responses, state and country geographical information, and postsecondary institution information. For standardized coding of occupation, the respondent's occupational and other work-related data from the questionnaire were reviewed by specially trained coders who corrected known respondent self-reporting problems to obtain the best occupation codes. The education code for a newly earned degree or first bachelor's degree earned was assigned solely on the basis of the verbatim response for degree field.
Imputation of Missing Data
Item nonresponse for key employment items, such as employment status, sector of employment, and primary work activity, ranged from 0.0% to 2.7%. Nonresponse to a few questions deemed somewhat sensitive, such as salary or earned income, had values between 8.6% and 11.7%. Personal demographic data, such as marital status, citizenship, ethnicity, and race, had item nonresponse rates ranging from 0.0% to 3.7%. Item nonresponse was imputed using logical imputation and hot-deck imputation methods.
For the most part, logical imputation was accomplished as part of editing. In the editing phase, the answer to a question with missing data was sometimes determined by the answer to another question. In some circumstances, editing procedures found inconsistent data that were blanked out and therefore subject to statistical imputation. During sample frame building for the SDR, some missing demographic variables, such as race and ethnicity, were imputed before sample selection by using other existing information from the sampling frame.
The 2008 SDR primary method for statistical imputation was hot-deck imputation. Almost all SDR variables were subjected to hot-deck imputation, with each variable having its own class and sort variables structured by a multiple regression analysis. However, imputation was not performed on critical items or on text variables. For some variables, there was no set of class and sort variables that were reliably related to or suitable for predicting the missing value. In these instances, consistency was better achieved outside of hot-deck procedures using random imputation.
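The hot-deck idea described above can be illustrated with a minimal sketch: a missing value is filled from a donor record that shares the same imputation-class variables. The function name, record layout, and random donor choice are illustrative assumptions; the actual SDR procedure also uses sort variables within classes and variable-specific class definitions.

```python
import random
from collections import defaultdict

def hot_deck(records, target, class_vars, seed=None):
    """Fill missing `target` values from donors that share the same
    imputation-class variables (a simplified random-within-class hot deck)."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for r in records:
        if r[target] is not None:        # reported value -> potential donor
            donors[tuple(r[v] for v in class_vars)].append(r[target])
    for r in records:
        if r[target] is None:            # missing value -> recipient
            key = tuple(r[v] for v in class_vars)
            if donors[key]:              # donor available in the same class
                r[target] = rng.choice(donors[key])
    return records
```

A class with no donors leaves the value missing, which in practice would trigger a coarser class definition or, as the text notes, random imputation outside the hot deck.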
To enable weighted analyses of the 2008 SDR data, a final weight was calculated for every person in the sample. In general, a final weight approximates the number of persons in the population of recipients of U.S. doctorates that a sampled person represents. The primary purpose of weights is to adjust statistical estimates for potential bias due to unequal selection probabilities and nonresponse. The first step of the weighting process calculated a base weight for all cases selected into the 2008 SDR sample. The base weight accounts for sample design, and it is defined as the reciprocal of the probability of selection under the sample design. In the next step, an adjustment for nonresponse was performed on completed cases to account for the sample cases that did not complete the survey. Nonresponse-adjusted weights were assigned both to respondents and to known ineligible cases (i.e., cases known to be deceased, institutionalized, over 75 years of age, or living abroad during the survey reference period), but eligible nonrespondents and cases with unknown eligibility received a weight of zero. The total weight carried by unknown-eligibility cases was distributed to respondents assuming the same eligibility rate as observed among the respondents. Thus, the sum of weights equals the frame size.
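The core of the nonresponse step can be sketched as follows: within an adjustment cell, respondents' base weights are scaled up so that the weighted total matches the full cell, while nonrespondents receive a weight of zero. This is a simplified sketch under assumed inputs; it ignores the separate treatment of known ineligibles and unknown-eligibility cases described above.

```python
def adjust_for_nonresponse(cases):
    """Scale respondents' base weights within one adjustment cell so the
    weighted total is preserved; nonrespondents get weight 0.
    `cases` is a list of (base_weight, responded) pairs."""
    total = sum(w for w, _ in cases)                      # full cell total
    resp_total = sum(w for w, resp in cases if resp)      # respondent total
    factor = total / resp_total                           # adjustment factor
    return [w * factor if resp else 0.0 for w, resp in cases]
```

Because the respondents' adjusted weights sum to the cell's original weight total, summing across all cells reproduces the frame size, as the text states.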
Reliability of Estimates
The particular sample used to estimate the 2008 population of SEH doctorate recipients in the United States is one of a large number of samples that could have been selected using the same sample design and sample size. Estimates based on each of these samples would differ from one another; this random variation across all possible samples is called sampling error. Sampling error is measured by the variance or standard error of the survey estimate.
The 2008 SDR sample is a systematic sample selected independently from each sampling stratum. The successive difference replication method (SUD) was used to estimate sampling errors. The theoretical basis for the SUD is described in Wolter (1984) and in Fay and Train (1995). As with any replication method, successive difference replication involves constructing a number of subsamples (replicates) from the full sample and computing the statistic of interest for each replicate. The mean square error of the replicate estimates around their corresponding full sample estimate provides an estimate of the sampling variance of the statistic of interest.
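Given replicate estimates, the variance computation itself is simple. The sketch below uses the common Fay–Train form of the successive difference replication estimator, in which the mean squared deviation of the replicate estimates around the full-sample estimate is scaled by a factor of 4; treat that factor and the function name as assumptions, since the exact scaling depends on how the replicate weights are constructed.

```python
import math

def sud_variance(replicate_estimates, full_sample_estimate):
    """Successive difference replication variance estimate: scaled mean
    squared deviation of replicate estimates around the full-sample
    estimate (factor of 4 per the Fay-Train formulation)."""
    R = len(replicate_estimates)
    return (4.0 / R) * sum((t - full_sample_estimate) ** 2
                           for t in replicate_estimates)

def standard_error(replicate_estimates, full_sample_estimate):
    """Standard error is the square root of the variance estimate."""
    return math.sqrt(sud_variance(replicate_estimates, full_sample_estimate))
```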
Each statistical data table in this report has a corresponding standard error table in this appendix, based on the method described above. For example, table A-1 is the standard error table corresponding to table 1. The standard error of an estimate can be used to construct a confidence interval for the estimate. For a 95% confidence interval, the standard error is multiplied by a z-score of 1.96 (the reliability coefficient); the product is added to the estimate to establish the upper bound of the interval and subtracted from the estimate to establish the lower bound.
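The interval construction just described amounts to one line of arithmetic; the sketch below makes it explicit (the function name is illustrative).

```python
def confidence_interval(estimate, se, z=1.96):
    """Confidence interval: estimate +/- z * standard error.
    z = 1.96 gives the 95% interval used in this report."""
    return estimate - z * se, estimate + z * se
```

For example, an estimate of 100.0 with a standard error of 10.0 yields a 95% confidence interval of roughly (80.4, 119.6).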
Sources of nonsampling error include (1) nonresponse error, which arises when the characteristics of respondents differ systematically from nonrespondents; (2) measurement error, which arises when the variables of interest cannot be precisely measured; (3) coverage error, which arises when some members of the target population are excluded from the frame and therefore do not have a chance to be selected for the sample; (4) respondent error, which occurs when respondents provide incorrect data; and (5) processing error, which can occur at the point of data editing, coding, or data entry. The analyst should be aware of potential nonsampling errors, but these errors are far harder to quantify than sampling errors. Quality assurance procedures were included throughout the stages of data collection and data processing to reduce the possibilities of nonsampling error.
Changes in the Detailed Statistical Tables
The number of detailed tables published in this edition of the series has been reduced. The complete list of tables produced for the 2008 SDR is shown in exhibit 1. The published tables are designated by table number in the first column. The remaining tabulations, designated as "supplemental," are available on request from the SDR Project Officer. NCSES is developing a new system for delivering tabular data. When fully implemented, it will provide online access to the expanded set of detailed tabulations associated with this series. This system will also provide the opportunity for table customization. Select data tables will continue to be published, together with the survey's technical documentation.
NOTES: Prior-year numbering for tables published in this report is in boldface. Tables designated by "S" are available on request from the project officer.
The 2008 SDR questionnaire did not include extensive questions regarding respondents' postdoctoral (postdoc) history; thus, three postdoc tables from 2006 were dropped: (1) Number of postdocs ever held by doctoral scientists and engineers, by years since doctorate and broad field of doctorate; (2) Primary reason for holding postdoc for doctoral scientists and engineers, by number of postdocs and broad field of doctorate; and (3) Benefit of current postdoc to doctoral scientists and engineers, by broad field of doctorate.
Four new tables have been added, which report on data from reinstated modules and questions: (1) Employed doctoral scientists and engineers engaged in patent-related activities, by field of doctorate and employment sector; (2) Employed doctoral scientists and engineers engaged in publication-related activities, by field of doctorate and employment sector; (3) Employed doctoral scientists and engineers working in second jobs, by field of doctorate and principal job employment sector; and (4) Employed doctoral scientists and engineers working in second jobs, by selected demographic characteristics and broad occupation of principal job.
To reduce redundancy, tables that previously reported both counts and percentages now report only counts. The remaining changes to the 2008 report were made to labels and headers of existing tables.
Changes in the Survey
Caution should be exercised when making comparisons with previous SDR results. As in 2008, the new cohort in most previous cycles of the SDR consisted of graduates from the 2 academic years immediately preceding the survey year. In 2006, however, the SDR collected data from graduates of the 3 previous academic years.
2003. Data on employed doctorate recipients were expanded to include the category "S&E-related occupations." S&E-related occupations include health-related occupations, S&E managers, S&E precollege teachers, and S&E technicians and technologists.
2002 and prior. Data on employed doctorate recipients were presented in two categories: employment in an S&E occupation and employment in a non-S&E occupation.
2006. The questionnaire included a module on history of postdoctoral appointments, awarded primarily for gaining additional education and training in research, as a follow-up to a similar module included in the 1995 SDR, plus a module on international collaboration among doctorate recipients.
2003. Beginning with 2003, the new cohort frame includes all SEH doctorate recipients except those who earned an SEH doctorate in a prior year. The SDR frame is based on the first U.S. research doctorate earned in an SEH field.
2002 and prior. Recipients of two doctorates whose first degree was in a non-SEH field were not included in the SDR frame, even if their second doctorate was in an SEH field. Based on information collected annually by the SED on the number and characteristics of those earning two doctorates, this exclusion resulted in a slight undercoverage bias. Between 1983 and 2000, for example, the total number of double doctorate recipients with a non-SEH first doctorate and an SEH second doctorate was 154, representing 0.046% of the total number of SEH doctorates awarded in that period.
Definitions and Explanations
Employer location. Survey question A8 includes location of the principal employer, and data were based primarily on responses to this question. Individuals not reporting place of employment were classified by their last mailing address.
Field of doctorate. The doctoral field is as specified by the respondent in the SED at the time of degree conferral. These codes were subsequently recoded to the field of study codes used in SESTAT questionnaires. (See appendix table B-1 for field-of-study codes.)
Full-time and part-time employment. Full-time (working 35 hours or more per week) and part-time (working less than 35 hours per week) employment status is for principal job only, not for all jobs held in the labor force. For example, an individual could work part time in his/her principal job, but full time in the labor force. Full-time and part-time employment status is not comparable to data reported in previous years when no distinction was made between the principal job and other jobs held by the individual.
Involuntarily out-of-field rate. Involuntarily-out-of-field is the percentage of employed individuals who reported, for their principal job, working in an area not related to the first doctoral degree at least partially because a job in their doctoral field was not available.
Labor force participation rate. The labor force participation rate (RLF) is the ratio (E + U) / P, where E (employed) + U (unemployed; those not-employed persons actively seeking work) = the total labor force, and P = population, defined as all SEH doctorate holders less than 76 years of age who resided in the United States during the week of 1 October 2008 and who earned their doctorate from a U.S. institution.
Non-U.S. citizen, temporary resident. This citizenship status category does not include individuals who, at the time they received their doctorate, reported plans to leave the United States, and who therefore were excluded from the sampling frame.
Occupation data. Occupation data were derived from responses to several questions about the kind of work primarily performed by the respondent. The occupational classification of the respondent was based on his/her principal job (including job title) held during the reference week—or on his/her last job held, if not employed in the reference week (survey questions A19/A20 or A5/A6). Also used in the occupational classification was a respondent-selected job code (survey question A21 or A7). (See appendix table B-2 for a list of occupations.)
Race and ethnicity. Race values include American Indian/Alaska Native, Asian, black, Native Hawaiian/Other Pacific Islander, white, and multiple race; these race categories refer only to individuals not of Hispanic origin. Race and ethnicity data are from prior rounds of the SDR and the SED. The most recently reported race and ethnicity data are given precedence.
Salary. Median annual salaries are reported for the principal job, rounded to the nearest $100, and computed for full-time employed scientists and engineers. For individuals employed by educational institutions, no accommodation was made to convert academic-year salaries to calendar-year salaries. Users are advised that due to changes in the salary question after 1993, salary data for 1995–2008 are not strictly comparable with 1993 salary data.
Sector of employment. "Employment sector" is a derived variable based on responses to survey questions A13 and A15. In the detailed tables, the category "4-year educational institutions" includes 4-year colleges or universities, medical schools (including university-affiliated hospitals or medical centers), and university-affiliated research institutes. "Other educational institutions" include 2-year colleges, community colleges, technical institutes, precollege institutions, and "other" educational institutions. Users should note that prior to 2008, "other" educational institutions were grouped with 4-year educational institutions. "Private-for-profit" includes respondents who were self-employed in an incorporated business. "Self-employed" includes respondents who were self-employed or were a business owner in a non-incorporated business.
Unemployment rate. The unemployment rate (Ru) is the ratio U / (E + U), where U = unemployed (those not-employed persons actively seeking work), and E (employed) + U = the total labor force.
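The two rate definitions above share the same inputs and can be computed together; the sketch below uses illustrative counts, not SDR data.

```python
def labor_force_rates(employed, unemployed, population):
    """Return (labor force participation rate, unemployment rate):
    RLF = (E + U) / P and Ru = U / (E + U), per the definitions above."""
    labor_force = employed + unemployed   # E + U
    return labor_force / population, unemployed / labor_force
```

For example, with E = 90, U = 10, and P = 125, the participation rate is 0.80 and the unemployment rate is 0.10.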
Additional data and reports from the SDR are available at http://www.nsf.gov/statistics/doctoratework/. Data from the SDR are also available in the Scientists and Engineers Statistical Data System (SESTAT) at http://www.nsf.gov/statistics/sestat/. SESTAT provides an integrated database of information on employment, education, and demographic characteristics of scientists and engineers in the United States collected through the SDR, the National Survey of College Graduates (http://www.nsf.gov/statistics/srvygrads/), and the National Survey of Recent College Graduates (http://www.nsf.gov/statistics/srvyrecentgrads/).
Fay RE, Train GF. 1995. Aspects of survey and model-based postcensal estimation of income and poverty characteristics for states and counties. ASA Proceedings of the Section on Government Statistics: 154–159.
Wolter K. 1984. An investigation of some estimators of variance for systematic sampling. Journal of the American Statistical Association 79(388): 781–790.
Standard Error Tables