Characteristics of Doctoral Scientists and Engineers in the United States: 1995

Technical Notes

The Sampling Frame and Target Population
Sample Design
Data Collection
Survey Design and Content
Response Rates
Data Preparation
Weighting and Estimation
Notes on the Tables
Selected Employment Characteristics
Appendix Tables

The data on doctoral scientists and engineers contained in this report come from the 1995 Survey of Doctorate Recipients (SDR). The SDR has been conducted biennially since 1973 by the National Research Council (NRC) for the National Science Foundation (NSF). Additional data on education and demographic information come from the National Research Council's Doctorate Records File (DRF). The DRF contains data from an ongoing census of research doctorates earned in the United States since 1920.

The Sampling Frame and Target Population

For the 1995 SDR the sampling frame for scientists and engineers was selected from the DRF to include individuals who

  1. had earned a doctoral degree from a U.S. college or university in a science or engineering field;

  2. were U.S. citizens or, if non-U.S. citizens, indicated they had plans to remain in the United States after degree award; and

  3. were under 76 years of age as of April 1995 (the survey reference date).

The 1995 frame consisted of graduates who had earned their degrees between January 1942 and June 1994. Persons who did not meet the age criterion, or who were known to have died, were eliminated from the sample.

The survey had two additional eligibility criteria for the survey target population: sampled members had to be residing in the United States and not institutionalized as of the reference date.

Sample Design

In 1995, the SDR sample size was 49,829. The total sample was selected from 2 groups:

  1. 1993 sample members who were still eligible in 1995, and

  2. a sample of the 1993-94 graduating cohort.

Group 1 cases were included with certainty because they form the core sample that is carried forward from cycle to cycle; group 2 cases were sampled and added to the core sample to form the total sample. A maintenance cut was then applied to keep the sample size roughly the same as it was in 1993.

The basic sample design was a stratified random sample. The variables used for stratification were 15 broad fields of degree, 2 genders, and an 8-category "group" variable combining race/ethnicity, handicap status, and citizenship status.

The overall sampling rate was about 1 in 12 (8 percent) in the 1995 SDR, applied to a population of 594,300. However, sampling rates varied considerably within and between the strata. These differences resulted from oversampling of women, minority groups, and other groups of special interest, and from the accumulation of sample size adjustments over the years.
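As an illustrative sketch of the design described above, stratified sampling with unequal rates and inverse-probability weights can be simulated as follows. The strata labels, sizes, and rates here are invented, not the actual SDR strata (which crossed 15 fields, 2 genders, and an 8-category group variable):

```python
import random

# Illustrative stratified sample with unequal rates; strata and rates
# are invented, not the actual SDR design.
frame = (
    [{"id": i, "stratum": "field_a"} for i in range(1_000)]
    + [{"id": 1_000 + i, "stratum": "field_a_oversampled"} for i in range(100)]
)
rates = {"field_a": 0.08, "field_a_oversampled": 0.25}  # oversampled group

random.seed(42)
sample = []
for person in frame:
    rate = rates[person["stratum"]]
    if random.random() < rate:
        person["weight"] = 1.0 / rate  # inverse of selection probability
        sample.append(person)

# Summing the weights should roughly recover the frame size (1,100).
estimate = sum(p["weight"] for p in sample)
```

Oversampling a small stratum (here at 25 percent rather than 8 percent) yields more respondents from that group while the inverse-probability weights keep population estimates unbiased.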

Data Collection

In 1995, there were 2 phases of data collection: a mail survey and telephone follow-up interviewing with nonrespondents. The mail survey consisted of an advance letter and 2 waves of a personalized mailing package, with a reminder postcard between waves 1 and 2. The first-wave mailing was sent in May 1995, with the follow-up mailing sent by priority mail in July.

Phase 2 consisted of telephone interviewing. A 60 percent sample of nonrespondents to the mail survey was followed up using computer-assisted telephone interviewing (CATI). Telephone interviewing was conducted between November 1995 and February 1996.

Survey Design and Content

The 1995 SDR retained questionnaire design changes that were implemented in 1993. Most items on the 1995 questionnaire were the same as in 1993 with the addition of a section to collect data on employment history and periods of unemployment.

Response Rates

The overall response rate for the 1995 SDR was 85 percent. The response to the mail phase of the survey was about 62 percent. (Response rates were calculated as the weighted response divided by the weighted sample cases.)

Data Preparation

As completed survey mail questionnaires were received, they were logged and transferred to the editing and coding unit at the NRC for processing. The coders carried out a variety of checks to prepare the documents for data entry. Specifically, they resolved incomplete or contradictory answers, imputed missing answers if logically appropriate, reviewed "other specify" responses for possible backcoding to a listed response, and assigned numeric codes to open-ended questions such as employer name.

Once questionnaires were edited and coded, they were sent to data entry. The data entry program contained a full complement of range and consistency checks to check for entry errors and inconsistent answers. The range and consistency checks were also applied to the CATI data via batch processing. Further computer checks were performed to test for inconsistent values; these were corrected and the process repeated until no inconsistencies remained.

At this point, the survey data file was ready for imputation of missing data. As a first step, basic frequency distributions were produced to show nonresponse rates to each question—these were generally less than 2 percent, with the exception of salary, which was 5.9 percent. Two methods for imputation were adopted. The first, cold decking, was used mainly for demographic variables that are static, i.e., not subject to change. Using this method, historical data provided by respondents in previous years were used to fill a missing response. For example, if a respondent indicated in 1993 that his birth year was 1947, but left the item blank in 1995, then "1947" was assigned to his birth year in 1995. In cases where no historical data were available, and for nondemographic variables (such as employment status, primary work activity, and salary), hot decking was used. This is the process of finding a donor with characteristics similar to the case with the missing value and using the response given by the donor as a proxy response. Hot decking involves creating groups of cases with common characteristics (through the cross-classification of auxiliary variables) and then selecting a donor at random for the case with the missing value. As a general rule, no data value was imputed from a donor in one cell to a recipient in another cell.
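A minimal sketch of the hot-deck step described above, using hypothetical auxiliary variables (field and sector) to form donor cells:

```python
import random

# Minimal hot-deck sketch: cases are grouped into cells by auxiliary
# variables (field and sector here are hypothetical), and a donor is
# drawn at random from the same cell; no borrowing across cells.
records = [
    {"field": "chem", "sector": "acad", "salary": 52_000},
    {"field": "chem", "sector": "acad", "salary": 55_000},
    {"field": "chem", "sector": "acad", "salary": None},  # missing
    {"field": "eng", "sector": "ind", "salary": 71_000},
]

random.seed(0)
cells = {}
for r in records:
    cells.setdefault((r["field"], r["sector"]), []).append(r)

for cell in cells.values():
    donors = [r for r in cell if r["salary"] is not None]
    for r in cell:
        if r["salary"] is None and donors:
            # Proxy response taken from a randomly chosen donor.
            r["salary"] = random.choice(donors)["salary"]
```

The missing salary is filled from one of the two academic chemistry donors, never from the engineering case in the other cell.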

For a few variables, such as employer name and zip code, imputation was not performed.

Weighting and Estimation

The next phase of the survey process involved weighting the survey data to compensate for unequal probabilities of selection to the sample and to adjust for the effects of unit nonresponse. The first step was the construction of sampling weights, which were calculated as the inverse of the probability of selection, taking into account all stages of the sample selection process over time. The sampling weight can be viewed as the number of population members the sample member represents. Sampling weights varied within cells because different sampling rates were used depending on the year of selection and the stratification in effect at that time.

The second step was to construct a combined weight, which took into account the subsampling of nonrespondents at the CATI phase. All respondents received a combined weight, which for mail respondents was equal to the sample weight and for CATI respondents was a combination of their original sample weight and their CATI subsample weight.

The third step was to adjust the sampling weights for unit nonresponse. (Unit nonresponse occurs when the sample member refuses to participate or cannot be located.) This was done within nonresponse adjustment cells created through poststratification. Within each nonresponse adjustment cell, a weighted response rate, which took into account both mail and CATI nonresponse, was calculated. The nonresponse adjustment factor was the inverse of this weighted response rate. The initial set of nonresponse adjustment factors was examined and, under certain conditions, some of the cells were collapsed if use of the adjustment factor would create excessive variance.

The final weights for respondents were calculated by multiplying their respective combined weights by the nonresponse adjustment factor. In data analysis, population estimates are made by summing the final weights of all respondents who possess a particular characteristic.
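The weighting steps can be sketched with made-up numbers. The 60 percent CATI subsampling rate comes from the survey design described earlier; the sampling weights and the cell's weighted response rate below are invented for illustration:

```python
# Sketch of the weighting steps for one adjustment cell. Sampling
# weights and the weighted response rate are invented.
CATI_SUBSAMPLE_RATE = 0.6

respondents = [
    {"sampling_weight": 12.0, "mode": "mail"},
    {"sampling_weight": 12.0, "mode": "cati"},
]

# Step 2: combined weight (mail keeps the sampling weight; CATI cases
# carry the extra subsampling factor 1 / 0.6).
for r in respondents:
    factor = 1.0 if r["mode"] == "mail" else 1.0 / CATI_SUBSAMPLE_RATE
    r["combined_weight"] = r["sampling_weight"] * factor

# Step 3: nonresponse adjustment = inverse of the weighted response rate.
adjustment = 1.0 / 0.85  # assumed weighted response rate for this cell

# Final weight; population estimates sum the final weights.
for r in respondents:
    r["final_weight"] = r["combined_weight"] * adjustment

population_estimate = sum(r["final_weight"] for r in respondents)
```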


The statistics in this report are subject to both sampling and nonsampling error.[1] Sampling variability occurs because a sample rather than an entire population is surveyed. Sampling errors were developed using a generalized variance procedure in order to provide approximate sampling errors that would be applicable to a wide variety of items. As a result, these sampling errors provide an indication of the order of magnitude of a sampling error rather than a precise sampling error for any specific item.

Information provided in table A-3 permits the user to calculate approximate standard errors. The general form of the equation used to model the generalized variances is V = a + b/x, where V is the relative variance (the squared relative standard error) of the estimated total x.

The following computational form can be used for estimating the standard error of totals:

Sx = [ax^2 + bx]^1/2

where "x" equals the estimated total and "a" and "b" are the regression coefficients provided. Values of "a" and "b" by S&E fields for selected groups are given in table A-3.[2]
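Under the generalized-variance model V = a + b/x, the standard error of a total can be computed as a short function. The a and b values below are invented for illustration; actual parameters must be read from table A-3:

```python
# Approximate standard error of an estimated total under the
# generalized-variance model V = a + b/x (V = relative variance).
def se_total(x, a, b):
    """Standard error of an estimated total x: (a*x^2 + b*x)**0.5."""
    return (a * x**2 + b * x) ** 0.5

# Illustrative parameters only; real values come from table A-3.
se = se_total(10_000, a=0.0001, b=50)  # roughly 714 for these inputs
```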

Tables A-4, A-5, A-6, A-7, and A-8 present approximate standard errors associated with totals for different segments of the doctoral population. Tables A-9, A-10, A-11, A-12, and A-13 present standard error estimates for the estimated percent[3] of a subgroup having a particular characteristic.

The approximate standard error of percentages also was developed using the same general model form. Standard errors for percentages may be estimated using the computational formula

Sp = p[b((1/x) - (1/y))]^1/2

where p equals the percentage possessing the specific characteristic and x and y represent the numerator and denominator totals, respectively, of the ratio that yields the observed percentage.
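The percentage formula can likewise be written as a short function. The b value below is invented for illustration; actual parameters come from table A-3:

```python
# Approximate standard error of a percentage p = 100 * x / y, where the
# numerator total x is a subset of the denominator total y.
def se_percent(p, x, y, b):
    return p * (b * (1.0 / x - 1.0 / y)) ** 0.5

# Illustrative parameter only; real b values come from table A-3.
se = se_percent(p=25.0, x=2_500, y=10_000, b=30)  # about 2.4 points
```

Note that when x equals y (the percentage is 100 percent of the subgroup), the formula correctly yields a standard error of zero.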

In addition to sampling error, data are subject to nonsampling error. Sources of nonsampling error include nonresponse bias, which arises when individuals who do not respond to a survey differ significantly from those who do, and measurement error, which arises when we are not able to precisely measure the variables of interest. These sources of error are much harder to estimate than sampling errors.

Notes on the Tables

The following notes facilitate use of data in the detailed tables.

Because of the changes introduced to the 1993 SDR and retained in the 1995 SDR, users are advised that data in this report are not strictly comparable with SDR data published by NSF prior to 1993.

Field of doctorate is the field of degree as specified by the respondent in the Survey of Earned Doctorates at the time of degree conferral.

Occupation data were derived from responses to several questions on the kind of work done by the respondent. The occupational classification of the respondent was based on his or her principal job held during the reference week—or last job held, if not employed on the reference week (questions A18 and A5). Also used in the occupational classification was a respondent-selected job code (questions A19 and A6).

Sector of employment was based on responses to questions A13 and A15. The category "universities and 4-year colleges" includes 4-year colleges or universities, medical schools (including university-affiliated hospitals or medical centers), and university-affiliated research institutions. "Private-for-profit" includes the self-employed in incorporated businesses.

Geographic division was based primarily on responses to question A11 on the location of employment. Individuals not reporting place of employment were classified by their mailing address.

Place of birth categories were defined as follows:

U.S. = Fifty states plus the Virgin Islands, Panama Canal Zone, Puerto Rico, American Samoa, Trust Territory, and Guam
Europe = Albania, Armenia, Austria, Belarus, Bosnia-Herzegovina, Bulgaria, Czech Republic, Croatia, Estonia, Georgia, Greece, Hungary, Latvia, Lithuania, Poland, Romania, Russia, Slovakia, Ukraine, Federal Republic of Yugoslavia, Andorra, Belgium, France, Gibraltar, Luxembourg, Monaco, The Netherlands, Portugal, Spain, Switzerland, Germany, Italy, Liechtenstein, Malta, Denmark, England, Finland, Iceland, Northern Ireland, Republic of Ireland, Norway, Scotland, Sweden, Wales, Europe, not specified
Asia = Afghanistan, Bahrain, Bangladesh, Cyprus, India, Iran, Iraq, Israel, Jordan, Kuwait, Lebanon, Nepal, Palestine, Saudi Arabia, Sri Lanka, Syria, Turkey, Cambodia, People's Republic of China, Philippines, Taiwan, China Unspecified, Hong Kong, Japan, Republic of Korea, Korea Unspecified, Laos, Malaysia, Singapore, Thailand, Democratic Republic of Vietnam, Republic of Vietnam, Asia, not specified
North America = Bermuda, Canada, Greenland, North America, not specified
Central America = Belize, Costa Rica, El Salvador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Central America, not specified
Caribbean = Barbados, Cuba, Dominican Republic, Haiti, Jamaica, Caribbean not specified
South America = Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, French Guiana, Guyana, Paraguay, Peru, Suriname, Uruguay, Venezuela, South America, not specified
Africa = Algeria, Egypt, Ethiopia, Ghana, Kenya, Libya, Morocco, Nigeria, South Africa, Sudan, Africa, not specified
Oceania = Australia, Indonesia, New Zealand, Oceania, not specified

Primary work activity was determined from responses to question A27. "Development" includes the development of equipment, products, and systems. "Design" includes the design of equipment, processes, and models.

Federal support was determined from responses to questions A40 and A41.

Tenure status was obtained from the response to question A17.

Race/ethnicity categories of white, black, Asian/Pacific Islander, and Native American refer to non-Hispanic individuals only.

The citizenship status category "Non-U.S., temporary resident" does not include individuals who, at the time they received their doctorate, expressed plans to leave the U.S. These individuals were excluded from the sampling frame.

Salary data were derived from responses to question A37, which requested annual salary before deductions for income tax, social security, and retirement, excluding bonuses, overtime pay, and summer teaching. Salaries reported are median annual salaries, rounded to the nearest $100 and computed for full-time employed scientists and engineers. For individuals employed by educational institutions, no adjustment was made to convert academic-year salaries to calendar-year salaries. Users are advised that, due to a wording change in the salary question, 1995 salary data are not strictly comparable with 1993 salary data.

Selected Employment Characteristics

This report contains several derived statistical measures reflecting labor force and employment rates as of April 1995:

Labor force participation rate. The labor force is defined as those employed (E) plus those unemployed (U—i.e., those not-employed persons actively seeking work). The labor force participation rate (RLF) is the ratio of the labor force to the population (P).

RLF = (E+U) / P

Unemployment rate. The unemployment rate (RU) is the ratio of those who are unemployed but seeking employment (U) to the total labor force (E+U).

RU = U / (E+U)

S&E involuntarily out-of-field rate. The S&E involuntarily out-of-field rate is the percent of employed individuals who reported one or both of the following:

  1. working part-time exclusively because suitable full-time work was not available; or

  2. working in an area not related to the first doctoral degree (in their principal job) at least partially because suitable work in the field was not available.
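As a worked example, the derived rates above can be computed from hypothetical counts (all numbers here are invented for illustration):

```python
# Worked example of the derived rates, using invented counts.
E = 470_000  # employed
U = 8_000    # unemployed and actively seeking work
P = 500_000  # doctoral population

rlf = (E + U) / P  # labor force participation rate
ru = U / (E + U)   # unemployment rate

# Involuntarily out-of-field rate: employed persons meeting either
# condition, counting those meeting both only once (counts invented).
part_time_only = 6_000
out_of_field = 14_000
both = 1_000
out_of_field_rate = (part_time_only + out_of_field - both) / E
```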


[1] The data and material on sampling reliability presented here are from The Methodological Report of the 1995 Survey of Doctorate Recipients (Washington, D.C.: Office of Scientific and Engineering Personnel, National Research Council, forthcoming).
[2] The generalized error estimates in this report were based on a set of assumptions that did not appear to hold in the case of some small subpopulations. In such cases, the parameters listed for a higher-level field within a demographic group or a higher-level demographic group within a field were considered a useful substitute as a generalized error estimate.
[3] The estimated percent is based on the ratio of two estimated totals, where the numerator is a subset of the denominator.
