## Appendix C

## Technical Notes

## Data Used in This Report

### The Survey of Doctorate Recipients

The 1993 Survey of Doctorate Recipients (SDR) includes individuals under 76 years of age who received a research doctorate in science or engineering from a U.S. university in 1992 or earlier. This report is restricted to individuals who were in the labor market[40] at the time of the survey (April 1993). Thus, individuals who were neither employed nor seeking employment at that time were excluded from the analyses. The available sample size was approximately 36,000 cases.

### Historical Data

Changes have been made in the population definition and data collection procedures for the SDR that reduce the direct comparability of the earlier surveys with the 1993 survey.[41] The 1973 data were adjusted to make them as comparable as possible to the 1993 data.[42]

A report by the National Science Foundation (NSF), Unemployment Rates and Employment Characteristics for Scientists and Engineers, 1971, is used for comparison purposes within this report, even though the NSF study differed considerably in population definition and research design from the 1993 SDR. The scientists for the earlier NSF survey were those included in the 1970 National Register of Scientific and Technical Personnel. To be included in the register, individuals were required to have "full professional standing based on academic training and work experience, as determined by the appropriate scientific professional society for the fields of science covered."[43] Approximately 60 percent of the scientists did not have doctorates.[44] Engineers were selected from a mailing list maintained by the Engineers Joint Council that "consisted of 23 major engineering societies and constituted about 40 percent of the total number of individuals in the Nation identified as engineers."[45] Thus, the definitions of scientist and engineer in the earlier NSF study were not strictly comparable to one another, nor were they comparable to the 1993 definition of an individual with a doctorate in one of the science and engineering fields.

### Total Population Data

Information on total population unemployment was taken from data collected by the Bureau of Labor Statistics (BLS) in the Current Population Survey (CPS). The definition of unemployment used in the CPS is essentially the same as that used in the SDR.

## Trend Analysis

Changes in the SDR methodology over time (e.g., fluctuating response rates and population definitions) have affected the size of the unemployment rate estimates. In 1973, the response rate for the survey was 75 percent. Between 1973 and 1989, the response rate gradually declined to 55 percent. In 1991, extensive locating and telephone follow-up procedures were instituted that helped raise the response rate in 1991 and 1993 to approximately 87 percent.

During the 1991 redesign of the SDR, the population definition was modified. The 1973 study used a sample frame that included many individuals who received doctoral degrees from non-U.S. institutions. However, after 1973, only individuals with doctoral degrees from U.S. institutions were added to the survey. By 1991, it was clear that the coverage of the non-U.S.-educated population was extremely poor. Since improving this coverage within the SDR was not practical, this segment was deleted entirely.

To understand the likely impact of the 1991 changes on the unemployment rate, rates were calculated for 1989 and 1991 using population and methodological definitions that were as similar as possible. Foreign-educated individuals were excluded from the 1989 estimate, and individuals who responded during the telephone follow-up stage in 1991 were also excluded from this comparison. The resulting unemployment rate for the 1989 group was 0.8 percent; the 1991 rate was 1.3 percent. The published rates, which reflect the differing population definitions and methodology, were 0.8 percent for 1989 and 1.4 percent for 1991. Therefore, it appears that the changes in methodology and population definition resulted in a slight increase in the estimated unemployment rate between the two years. Since the impact of the changed methodology on unemployment rates did not appear to be substantial, it was decided that a fairly good approximation for trend analysis purposes could be made by adding 0.1 percentage point to the pre-1991 unemployment rates.
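The adjustment described above amounts to simple arithmetic, sketched below. The function and constant names are illustrative assumptions, not part of the original analysis (which was performed in SAS); the rates are the ones quoted in the preceding paragraph.

```python
# Illustrative sketch of the trend adjustment; names are hypothetical.
REDESIGN_YEAR = 1991

# 1991 rates quoted in the text (percent): published vs. comparable-method.
PUBLISHED_1991 = 1.4
COMPARABLE_1991 = 1.3

# The methodology effect is the gap between the two 1991 estimates.
METHOD_OFFSET = round(PUBLISHED_1991 - COMPARABLE_1991, 1)  # percentage points


def adjust_rate(year: int, published_rate_pct: float) -> float:
    """Shift a pre-redesign published rate so it is roughly comparable
    to rates estimated with the 1991 and later methodology."""
    if year < REDESIGN_YEAR:
        return round(published_rate_pct + METHOD_OFFSET, 1)
    return published_rate_pct


# Published rates quoted in the text for 1989 and 1991.
published = {1989: 0.8, 1991: 1.4}
adjusted = {year: adjust_rate(year, rate) for year, rate in published.items()}
print(adjusted)
```

Only the pre-1991 value moves; rates from 1991 onward are already on the redesigned basis and are returned unchanged.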

There are some discrepancies in reported doctoral unemployment rates for 1973. The rate reported in the National Academy of Sciences publications was 1.2 percent, although the 1973 rate reported in the NSF's Characteristics of Doctoral Scientists and Engineers in the United States: 1989 was 1.1 percent. Since the latter rate was published as part of the trend analysis used to calculate the adjusted 1989 rate, it was assumed that the NSF rate was the best rate for use in calculating adjusted pre-1989 figures.

## Variable Definitions

### Unemployment Rate

The definition of unemployment used in this report is the standard Federal definition: the percent of individuals in the labor force who were not employed. The labor force is defined as individuals who were employed, were on lay-off, or had sought work within the preceding four weeks. Although this is the most commonly used measure of unemployment, other measures are used. For example, a 1995 Bureau of Labor Statistics article discusses a variety of alternative measures used for different purposes (Bregger and Haugen).

### Involuntary Part-Time Rate

The involuntary part-time rate is defined as the number of individuals who reported working part-time exclusively because suitable full-time work was not available, divided by the number of individuals in the labor force.

### Involuntary Out-of-Field Rate

For this report, the involuntary out-of-field rate is defined as the number of individuals (other than those who were involuntarily part-time employed) who reported that they were working out of their doctoral field at least partially because suitable work in the field was not available, divided by the number of individuals in the labor force. This is slightly different from the definition used in the NSF report, Characteristics of Doctoral Scientists and Engineers in the United States: 1993, which combines individuals who are involuntarily part-time or involuntarily out-of-field into a single measure, referred to as involuntary out-of-field. For the purposes of this report, the components are broken out. This report also uses the number of individuals in the labor force as the denominator for calculating this rate, rather than the number of employed individuals, in order to facilitate combining the three measures of adverse career events.
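Because all three measures share the labor force as their denominator, they can be added directly. The sketch below illustrates this; the class, the function, and the counts are invented for illustration and are not SDR estimates.

```python
from dataclasses import dataclass


@dataclass
class LaborForceCounts:
    """Hypothetical counts for one group; not actual SDR data."""
    labor_force: int               # employed, on layoff, or sought work in prior 4 weeks
    unemployed: int                # in the labor force but not employed
    involuntary_part_time: int     # part-time only because no full-time work was available
    involuntary_out_of_field: int  # excludes anyone already counted as involuntary part-time


def adverse_event_rates(c: LaborForceCounts) -> dict:
    """Return each rate in percent, all with the labor force as denominator."""
    def pct(n: int) -> float:
        return 100.0 * n / c.labor_force

    combined = c.unemployed + c.involuntary_part_time + c.involuntary_out_of_field
    return {
        "unemployment": pct(c.unemployed),
        "involuntary_part_time": pct(c.involuntary_part_time),
        "involuntary_out_of_field": pct(c.involuntary_out_of_field),
        "combined": pct(combined),
    }


example = LaborForceCounts(
    labor_force=10_000,
    unemployed=160,
    involuntary_part_time=140,
    involuntary_out_of_field=430,
)
rates = adverse_event_rates(example)
print(rates)
```

The combined figure is meaningful only because the three categories are mutually exclusive and share one denominator; with employed individuals as the denominator for the out-of-field rate, the sum would mix two bases.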

### Occupation

Standard SRS occupational groupings were used for coding the 1988 science and engineering occupations. These codes are detailed in NSF 96-302. For non-science and engineering occupations, a further breakdown of occupations into managerial or professional specialty positions was made. Non-management/professional specialty occupations included: technologists and technicians; clerical/administrative support; computer programmers; surveyors; farmers, foresters, and fishermen; nurses; sales and marketing; service occupations other than health; and elementary and secondary teachers. Jobs in this category were selected based on the characteristics of individuals in these jobs in the 1993 National Survey of College Graduates. The remaining non-S&E occupations were considered to be managerial and professional specialty jobs. This category includes the clergy, lawyers, and managers, where high-level degrees are common.

### Variables Related to 1988 Employment and Occupational Status

The 1993 SDR included a series of questions about the employment status of individuals in 1988. These questions asked whether the individual had changed employer or occupation since 1988 and, if so, asked for information about the 1988 position. This retrospective information was used throughout the report to describe 1988 occupational characteristics.

### Other Variables

In examining associations between single variables and the unemployment rate, the goal was to restrict analyses to groups that consisted of at least 400 sample cases. This relatively large cut-off was chosen because of the high sampling variability encountered in small samples when rates are as low as 1.6 percent. Meeting the minimum sample size goal required collapsing categories. When logical combinations did not permit the desired sample size goal to be met, smaller sample sizes were retained. If this was not feasible, small residual categories were treated as missing for the purposes of examining the bivariate relationships between the independent variables of interest and unemployment status.
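The sensitivity of a low rate to sample size can be illustrated with the simple-random-sample standard error of a proportion. This is a minimal sketch: the function name is an assumption, and the sample sizes other than 400 are arbitrary comparison points.

```python
import math


def srs_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


rate = 0.016  # a rate of 1.6 percent, as mentioned in the text
for n in (100, 400, 1600):
    se = srs_standard_error(rate, n)
    # Express the standard error relative to the rate itself.
    print(f"n={n}: SE = {100 * se:.2f} percentage points "
          f"({100 * se / rate:.0f} percent of the rate)")
```

Even at 400 cases, the standard error is a sizable fraction of a 1.6 percent rate, which suggests why smaller groups were collapsed wherever logical combinations allowed.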

The categories used in the bivariate analyses were also used as a starting point for creating dummy variables for the multivariate work. However, since the regression routines used in the multivariate analyses ignore all cases with missing values for one or more variables, the missing value codes were examined again before conducting the multivariate analysis. Some categories (for example, "Other Physical Science" under degree field) that were not displayed in the univariate analysis were used in the multivariate analysis. The remaining missing value cases were treated as if they belonged to whichever dummy variable category had been selected for omission in the dummy variable regression. Normally, this was the modal category for the variable.
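The treatment of the remaining missing values can be sketched as follows. With dummy coding, a case whose indicators are all zero is indistinguishable from a case in the omitted category, so coding missing values as all zeros assigns them to that category. The category labels and function name here are hypothetical.

```python
from typing import Dict, List, Optional


def dummy_code(value: Optional[str], categories: List[str], omitted: str) -> Dict[str, int]:
    """Return one 0/1 indicator per non-omitted category.

    A missing value (None) matches no category, so every indicator is 0,
    which is the same coding the omitted category receives.
    """
    return {c: int(value == c) for c in categories if c != omitted}


# Hypothetical degree-field categories; "life_sciences" stands in for the
# modal category selected for omission.
fields = ["life_sciences", "engineering", "other_physical_science"]

coded_omitted = dummy_code("life_sciences", fields, omitted="life_sciences")
coded_missing = dummy_code(None, fields, omitted="life_sciences")
coded_engineer = dummy_code("engineering", fields, omitted="life_sciences")
print(coded_omitted, coded_missing, coded_engineer)
```

This keeps every case in the regression, at the cost of assuming that cases with missing values resemble the modal group.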

## Standard Errors and Tests of Significance

Observed differences in unstandardized unemployment rates between groups were tested for statistical significance at the .05 level. Standard errors for these tests were calculated using the equation appropriate for a simple random sample. This is equivalent to assuming that there is no design effect. Although this methodology provides only an approximate estimate of the standard error, it greatly simplifies computation. Since the sample design for the survey was a stratified random sample, this approach should provide reasonably good estimates.
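A test of this kind can be sketched as below. The text does not specify the exact test statistic, so the two-sided z test with an unpooled standard error is an assumption, as are the function names and the example rates and sample sizes.

```python
import math

Z_CRITICAL_05 = 1.96  # two-sided critical value at the .05 level


def srs_variance(p: float, n: int) -> float:
    """Variance of a proportion under simple random sampling (no design effect)."""
    return p * (1.0 - p) / n


def z_for_difference(p1: float, n1: int, p2: float, n2: int) -> float:
    """z statistic for the difference between two independent proportions."""
    se_diff = math.sqrt(srs_variance(p1, n1) + srs_variance(p2, n2))
    return (p1 - p2) / se_diff


def is_significant(p1: float, n1: int, p2: float, n2: int) -> bool:
    return abs(z_for_difference(p1, n1, p2, n2)) > Z_CRITICAL_05


# Hypothetical groups: 1.6 percent of 2,000 cases vs. 3.0 percent of 1,500 cases.
z = z_for_difference(0.016, 2000, 0.030, 1500)
print(f"z = {z:.2f}, significant at .05: {is_significant(0.016, 2000, 0.030, 1500)}")
```

If the stratified design produced a design effect above 1, the true standard errors would be somewhat larger than these, and borderline differences would be less likely to reach significance.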

Sample sizes for some of the 1973 subgroups used in comparing 1973 and 1993 unemployment rates were not readily available. Therefore, the number of cases in the subgroup was estimated by multiplying the 1993 sample size for the group by the ratio of total 1973 sample size to total 1993 sample size. Although this is a fairly rough test, it provides general guidance on the probable statistical significance of observed differences.
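The subgroup size estimate described above is a proportional scaling, sketched here. The function name is an assumption; the 1993 total is the approximate figure given earlier in this appendix, while the 1973 total and the subgroup size are placeholders, not actual survey counts.

```python
def estimate_1973_subgroup_n(n_1993_subgroup: float,
                             n_1973_total: float,
                             n_1993_total: float) -> float:
    """Scale a 1993 subgroup sample size by the ratio of total sample sizes."""
    return n_1993_subgroup * (n_1973_total / n_1993_total)


N_1993_TOTAL = 36_000   # approximate 1993 sample size given in the text
N_1973_TOTAL = 18_000   # placeholder value, not the actual 1973 sample size
N_1993_SUBGROUP = 1_200  # placeholder subgroup size

n_1973_estimate = estimate_1973_subgroup_n(N_1993_SUBGROUP, N_1973_TOTAL, N_1993_TOTAL)
print(n_1973_estimate)
```

The estimate assumes the subgroup made up the same share of the sample in both years, which is one reason the resulting test is only a rough guide.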

## Standardization Methodology

The first step in developing a model for estimating unemployment was an examination of the bivariate associations between the independent variables of interest and unemployment. After examining these relationships, some variables were eliminated from further consideration, based primarily on whether the observed bivariate relationship could reasonably be interpreted as one in which the independent variable affected unemployment. For example, non-work-related training appeared to be associated with high unemployment rates. However, it seems more reasonable to believe that being unemployed leads one to seek additional training than that obtaining additional training increases the probability of unemployment. The bivariate relationships for these omitted variables are discussed in Appendix B.

The preliminary analysis also suggested the appropriate shape of curves to fit in the multivariate analysis. For example, for variables (such as years since the doctorate was earned) that display high unemployment rates at the extremes of the distribution, parabolic relationships were fit by including squared and linear terms for the relevant independent variables.
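A parabolic relationship of this kind can be sketched as follows. The coefficients are hypothetical, chosen only to produce a U-shaped curve; they are not the parameters reported in table C-1.

```python
def logit_contribution(years: float, b_linear: float, b_squared: float) -> float:
    """Contribution of a continuous variable to the model, using a linear
    term and a squared term so the fitted curve can bend."""
    return b_linear * years + b_squared * years ** 2


def turning_point(b_linear: float, b_squared: float) -> float:
    """Value of the variable at which the fitted parabola reaches its extreme."""
    return -b_linear / (2.0 * b_squared)


# Hypothetical coefficients for years since the doctorate was earned.
B_LINEAR = -0.12
B_SQUARED = 0.003

low_point = turning_point(B_LINEAR, B_SQUARED)
for years in (0, 10, 20, 30, 40):
    print(years, round(logit_contribution(years, B_LINEAR, B_SQUARED), 3))
```

A positive coefficient on the squared term gives higher predicted values at both extremes, matching the pattern described for years since the doctorate.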

Once the preliminary independent variables were identified, a multiple regression analysis was performed to identify possible problems with multicollinearity that required the deletion of additional variables.[46] Stepwise regression analysis was also used to determine if there were additional variables that could be deleted due to a lack of statistical significance. Variables omitted at this stage included gender, race/ethnicity, and whether the individual had children. At that point, a limited number of plausible two-way interactions were introduced into the analysis and tested for statistical significance (for example, gender by whether children are present). The next step was to perform a logistic regression analysis. The preliminary logistic model was simplified by eliminating variables that were not statistically significant.[47] The parameters for the final logistic regression model are presented in table C-1.

A problem with using logistic regression analysis is that interpretation of the results is not straightforward. The impact of an independent variable on unemployment depends on the value of the other variables in the model. Since such complex relationships are difficult to comprehend, a standardization technique was used. For most variables, iterative techniques were used to select a standardization value for all factors other than the independent variable of interest. The value was selected so that the resulting total unemployment rate equaled the observed unemployment rate.
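One plausible reading of this standardization step is sketched below: a single constant stands in for the combined logit contribution of all other factors, and it is found by bisection so that the category rates, weighted by the observed distribution, reproduce the observed total rate. The bisection method, the function names, the category parameters, and the shares are all assumptions; the report does not state which iterative technique was used.

```python
import math
from typing import Dict


def expit(x: float) -> float:
    """Inverse logit: convert a logit value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def total_rate(c: float, params: Dict[str, float], shares: Dict[str, float]) -> float:
    """Weighted average of category rates when all other factors equal c."""
    return sum(shares[k] * expit(c + params[k]) for k in params)


def solve_standardization_value(params: Dict[str, float], shares: Dict[str, float],
                                observed_rate: float,
                                lo: float = -20.0, hi: float = 20.0,
                                tol: float = 1e-12) -> float:
    """Bisection on c; total_rate increases with c, so the root is unique."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if total_rate(mid, params, shares) < observed_rate:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0


# Hypothetical logit parameters (category "A" is the omitted category) and
# hypothetical observed shares of each category.
params = {"A": 0.0, "B": 0.5, "C": -0.4}
shares = {"A": 0.5, "B": 0.3, "C": 0.2}
observed = 0.016  # hypothetical observed rate

c = solve_standardization_value(params, shares, observed)
standardized = {k: expit(c + params[k]) for k in params}
print({k: round(100 * v, 2) for k, v in standardized.items()})
```

The standardized rates differ across categories only through the parameter for the variable of interest, while their weighted average still matches the observed total.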

The standardization methodology selected was modified slightly to deal with situations where there was a logical dependence between categoric independent variables in the analysis. For example, individuals categorized as not employed in 1988 in the occupational analysis were categorized the same way in the sector variable. Logit regression parameters were calculated for each category formed by cross-classifying the interdependent independent variables. For example, chemists in the private sector would have a combined logit parameter equal to the sum of the parameters for the dummy variable for chemists, the value for the dummy variable for the private sector, and the values of the dummy variables used to indicate employment and student status. Standardization was performed for these detailed occupation-by-sector categories, and the value for each sector and occupation was obtained by weighting these subcategories (for example, sector categories within an occupation group) according to the observed distribution.
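The combination of parameters for cross-classified categories can be sketched as follows. All parameter values, the standardization constant, and the sector shares are hypothetical, and the two sector labels are used only as examples.

```python
import math


def expit(x: float) -> float:
    """Inverse logit: convert a logit value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logit parameters for the dummy variables involved.
OCCUPATION_PARAM = {"chemist": 0.2}
SECTOR_PARAM = {"private": 0.3, "academic": -0.1}
STATUS_PARAM = 0.0            # employment and student status dummies (hypothetical)
STANDARDIZATION_VALUE = -4.2  # stands in for all other factors (hypothetical)

# Hypothetical observed distribution of chemists across sectors.
SECTOR_SHARES = {"private": 0.6, "academic": 0.4}

# Combined logit parameter for each occupation-by-sector cell is the sum
# of the component dummy-variable parameters.
cell_rate = {
    sector: expit(STANDARDIZATION_VALUE
                  + OCCUPATION_PARAM["chemist"]
                  + SECTOR_PARAM[sector]
                  + STATUS_PARAM)
    for sector in SECTOR_SHARES
}

# The occupation-level value weights the cells by the observed distribution.
chemist_rate = sum(SECTOR_SHARES[s] * cell_rate[s] for s in SECTOR_SHARES)
print({s: round(100 * r, 2) for s, r in cell_rate.items()}, round(100 * chemist_rate, 2))
```

Because the cells are weighted after conversion to rates, the occupation-level value always lies between the lowest and highest cell rates.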

An exception to this general treatment was made for the variable characterizing 1988 occupation by the percent of individuals within the occupation in the 1993 NSCG who were involuntarily out-of-field (an indicator of the perceived desirability of the occupation). For this variable, the standardized values for the two "unemployed in 1988" categories were set equal to the values obtained in the analysis of 1988 occupation and 1988 sector described above, before standardized values were calculated for the remaining categories.

For continuous variables, standardization was done within categories. For the purpose of evaluating the regression values, the midpoint of the category was used to estimate the dependent variable mean unless knowledge of the data suggested a different value would be more appropriate.

In standardizing for disability status, the categories were not mutually exclusive, because of the possibility that an individual could have multiple disabilities. Instead of standardizing to the total observed unemployment rate or forcing the categories to be mutually exclusive, unemployment rates were standardized to a hypothetical total unemployment rate calculated from the observed values of the univariate disability categories.

## Footnotes

[40] An individual in the labor market is defined as employed or, if not employed, having actively sought work within the preceding four weeks or being on layoff.

[41] See the Technical Notes for a discussion of changes in the SDR over time.

[42] See below (under Trend Analysis) for more information on this adjustment.

[43] NSF 1972, pp. 112-113.

[44] NSF 1972, p. 15.

[45] NSF 1972, pp. 114-115.

[46] All analyses for this report were performed using SAS.

[47] Note that interaction effects were tested after a decision was made on whether the primary variables should be retained.