
Technical Notes
Decomposition of Salary Gaps


To examine salary equity, this analysis uses statistical techniques that permit a more comprehensive approach than the cross-tabulations used in most of this report. Although these techniques are widely used in the scientific literature to analyze similar issues, they have some disadvantages relative to cross-tabulation. Most important, they require the researcher to make a number of "simplifying assumptions." If these assumptions are correct (or approximately correct), the estimates of the salary gaps "explained" by differences in group characteristics are likely to be superior to those obtained by examining cross-tabulations. If the assumptions are far from correct, however, the researcher may reach erroneous conclusions.


Data from the 1993 Survey of Doctorate Recipients (SDR) were used in the decomposition of salary gaps in chapter 5. Part-time employees and self-employed individuals were excluded from the analysis, because salary data for these individuals are not likely to be comparable to those for individuals who are employed full time. Approximately 31,100 cases were usable for the analysis.

Basic Statistical Methodology

The first step in the analysis of the salary gaps was to fit a single least-squares regression equation to the total eligible sample, using log salary as the dependent variable and using as independent variables a large number of variables from the SDR. The demographic variables of interest (gender, race/ethnicity, whether U.S.-born, and disability status) were excluded from the equation. Those independent variables that did not have a statistically significant relationship with salary (at the 0.001 level) were deleted from further consideration at this stage. [44] This relatively high level for exclusion was selected, primarily because the large sample size resulted in a large array of statistically significant variables. Even at this conservative level, the number of variables retained makes comprehension of the model difficult. [45]

The parameters of the reduced regression equation were used to decompose the salary gaps of interest, using a modification of the Oaxaca (1973) methodology frequently used for decomposing salary gaps. In this revised methodology, the proportion of a salary gap explained is considered to be equal to:

$$b_t(\bar{X}_1 - \bar{X}_2)$$

where $b_t$ is the vector of parameters from the reduced regression equation, $\bar{X}_1$ is the vector of means for the nonminority group of interest (i.e., men, whites, U.S.-born whites, non-U.S.-born whites, or persons without disabilities), and $\bar{X}_2$ is the vector of means for the corresponding minority group of interest.
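The calculation can be illustrated with a minimal sketch. The variable names, coefficients, group means, and total gap below are hypothetical and are not taken from the SDR model; the sketch assumes that the total gap is measured as the difference in mean log salary and that dividing the explained amount by the total gap gives the share explained.

```python
def explained_gap(coefs, means_1, means_2):
    """Return the dot product of the coefficients with the difference in group means."""
    return sum(coefs[name] * (means_1[name] - means_2[name]) for name in coefs)

# Hypothetical pooled-equation coefficients (log-salary units per unit of each variable).
b_t = {"years_since_phd": 0.02, "works_in_industry": 0.15}

# Hypothetical group means for the same variables.
means_men = {"years_since_phd": 16.0, "works_in_industry": 0.40}
means_women = {"years_since_phd": 11.0, "works_in_industry": 0.30}

explained = explained_gap(b_t, means_men, means_women)

# Hypothetical total gap in mean log salary between the two groups.
total_gap = 0.20
share_explained = explained / total_gap

print(f"explained log gap: {explained:.3f}")
print(f"share of gap explained: {share_explained:.1%}")
```

The intercept is omitted because its "mean" is 1 in both groups and therefore contributes nothing to the difference.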

Current Methodology Compared With Alternate Approaches

The current methodology deviates from the Oaxaca methodology in the selection of the regression equation used for standardization. We have standardized to the regression equation for the total population, whereas the most common application of the Oaxaca methodology is to standardize to the equation for the nonminority group (i.e., using $b_1$ instead of $b_t$ in the above equation).

We opted to use the regression equation for the total population rather than the nonminority group for three reasons. First, using the total population is consistent with the null hypothesis that no discrimination on the basis of demographic characteristics occurs; this is, of course, the primary null hypothesis of interest. [46] Second, when multiple overlapping groups are considered (i.e., groups based on gender, race/ethnicity, birthplace, and disability status), the Oaxaca approach is conceptually more confusing than that adopted. Do we, for example, use the regression coefficients for men when comparing women with men and use the regression coefficients for whites for the analysis of racial/ethnic groups, or do we compare all of the groups to U.S.-born white men without disabilities? If the latter, does it make sense to compare all women to U.S.-born white men without disabilities or must we consider all 60 groups formed by cross-classifying the demographic variables of interest? Third, by using the same regression equation for all of the decompositions, meaningful comparisons of the salary gaps between different groups are more easily made, e.g., comparisons of the gender salary gap with the black/white salary gap.

To determine the sensitivity of the analysis to the choice of the regression equation used for standardization, an Oaxaca-type decomposition was made for the gender salary gap. The total percentage explained, standardizing to the equation for men rather than the total equation, was 88 percent rather than 90 percent, a fairly trivial difference. Yet another alternative is to standardize to the minority group equation. [47] Using this approach for the gender salary gap led to an estimated total percentage explained of 80 percent. Although this latter alternative provides a substantially lower estimate than that obtained for the model selected, standardization to the minority group equation is not a commonly accepted procedure.
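A sensitivity check of this kind amounts to repeating the decomposition with a different coefficient vector while holding the group means fixed. The sketch below uses hypothetical coefficients, means, and total gap; it does not reproduce the 90, 88, and 80 percent figures reported above.

```python
def percent_explained(coefs, means_1, means_2, total_gap):
    """Percentage of the total log-salary gap explained by differences in means."""
    explained = sum(coefs[k] * (means_1[k] - means_2[k]) for k in coefs)
    return 100.0 * explained / total_gap

# Hypothetical group means and total gap in mean log salary.
means_men = {"years_experience": 15.0, "manager": 0.30}
means_women = {"years_experience": 10.0, "manager": 0.20}
total_gap = 0.20

# Hypothetical coefficient vectors from three separately fitted equations.
reference_equations = {
    "total population": {"years_experience": 0.030, "manager": 0.20},
    "nonminority group (men)": {"years_experience": 0.028, "manager": 0.25},
    "minority group (women)": {"years_experience": 0.026, "manager": 0.15},
}

results = {
    label: percent_explained(coefs, means_men, means_women, total_gap)
    for label, coefs in reference_equations.items()
}

for label, pct in results.items():
    print(f"standardized to {label}: {pct:.1f} percent explained")
```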

Another approach to estimating the impact of demographic variables on salary is to do a multiple regression analysis, using dummy variables to measure the demographic groups of interest. This approach is used less frequently in the literature than is the Oaxaca approach. This approach does permit examination of the effects of each of the demographic variables of interest, however, while controlling for the other demographic variables of interest. It also has the advantage of permitting tests of significance for the effects of the demographic variables on salary and permits examination of specific interactions of interest. This approach was, therefore, used to supplement the basic decomposition approach used in the report. The parameter estimates and standard errors for this equation are included in appendix table 5-46. [48]
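The dummy-variable approach can be sketched as follows. The eight observations are invented for illustration, the single control variable stands in for the much larger set of SDR variables, and the small least-squares routine is a plain textbook implementation rather than the software used for the report.

```python
import math

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
         for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(k):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[k:] for row in m]

def ols(X, y):
    """Ordinary least squares; returns (coefficients, standard errors)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(bi * xi for bi, xi in zip(b, row)) for row, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return b, se

# Hypothetical data: years since doctorate, a female dummy, and log salary.
years = [2, 5, 10, 15, 3, 6, 12, 20]
female = [0, 0, 0, 0, 1, 1, 1, 1]
noise = [0.01, -0.01, 0.01, -0.01, -0.01, 0.01, -0.01, 0.01]
log_salary = [10.5 + 0.02 * yr - 0.05 * f + e
              for yr, f, e in zip(years, female, noise)]

# Design matrix columns: intercept, years, female dummy.
X = [[1.0, float(yr), float(f)] for yr, f in zip(years, female)]
b, se = ols(X, log_salary)

# The dummy coefficient estimates the log-salary difference for women,
# holding years since doctorate constant; b / se is the usual t ratio.
t_female = b[2] / se[2]
print(f"female coefficient: {b[2]:.4f} (SE {se[2]:.4f}, t = {t_female:.2f})")
```

Interaction terms, such as race by gender, would enter the design matrix as additional columns formed by multiplying the relevant dummies.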

Variable Selection

As noted in the text, the adequacy of the analysis is contingent, in large part, on the independent variables used in the analysis. If major variables are omitted, the estimate of how much of the salary gap is "explained" will be inaccurate. Similarly, if variables that are not truly explanatory factors are included, the model will be inadequate.

As discussed in the text, some variables that could have influenced salaries (such as measures of productivity and direct measures of the relative importance of salary to other job rewards) were not collected in the SDR. Other variables were excluded for theoretical reasons or because the empirical evidence indicated that they were not, in fact, determinants of salary.

Among the available variables that were omitted for theoretical reasons, the most controversial omission was academic rank and tenure. A number of analyses of the academic labor market include these variables; however, they are not always included. [49] We believe that academic rank and tenure are themselves best viewed as rewards for work performed rather than as "control" variables that help explain the salary gap. [50] To understand how sensitive the findings are to this decision, the doctoral gender salary gap was decomposed with academic rank and tenure included in the model. The inclusion of these two variables resulted in an estimate of the explained gender gap of 91 percent rather than the 90 percent observed in the model used in chapter 5. It is thus unlikely that their inclusion would have substantially altered the findings in the chapter. [51]

We also excluded from consideration for theoretical reasons whether pay, job unavailability, or layoffs were factors in taking a job outside of the field of degree or in changing jobs. We believe that such responses may be more indicative of events that directly affect salary than they are of life choices. For example, if women and men were equally interested in being promoted, but men were promoted more often than women, men would more frequently report job changes for pay and promotion.

Note that one could argue that some of the variables included also should have been excluded. For example, one can argue that differences between groups with respect to management activities may be reflective of "discrimination" in the labor market. To the extent that this is true, one can argue that the inclusion of these variables has artificially increased the amount explained by the model.

The variables excluded for lack of statistical significance at the 0.001 level were

Finally, some variables that would have required extensive recoding were not included because of time constraints. In making these decisions, the amount of time needed to recode the variable was weighed against the likelihood of the recoding making a significant difference in the analysis. For example, with a modest amount of effort, it would have been possible to categorize field of degree for those who obtained a degree subsequent to the doctorate. The most important fields for such a break-out, however, are indicated by the type of degree, because over half of individuals with additional degrees had degrees that indicate the field of study (MBA, M.D., and the law degrees). On the other hand, productivity measures that would have been very interesting to include would require an extensive amount of matching of data files with citation indices.

Variable Measurement

The measurement of most of the variables in this analysis was quite straightforward, given the basic coding structure of the SDR. [53] In a few cases noted below, however, some modifications to the coding need to be explained.

Salary: In the 1993 SDR, individuals were asked to report their salary or earned income for their primary job, using whatever unit (e.g., hour, week) they preferred. These have been annualized on the SDR database using appropriate inflators (e.g., 2,080 times hourly wage, 52 times weekly wage). It is difficult, however, to know what the correct inflator is for academic-year salaries. The 1993 database did not inflate academic-year salaries, whereas previous SDR surveys used an inflator of 11/9. The first option is tantamount to assuming that the individual does not work in the summer, and the second assumes that the individual has a typical research grant that pays 2/9 of his or her academic-year salary. Although both approaches are somewhat arbitrary, using the 11/9 inflator is the more reasonable approach and is roughly comparable to multiplying a weekly wage by 52 under the assumption that the worker is employed all year.
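The annualization rules can be written as a short sketch. The inflators 2,080, 52, and 11/9 come from the text; the unit labels and function name are hypothetical and do not correspond to actual SDR codes.

```python
# Inflators stated in the text; the unit labels are illustrative assumptions.
INFLATORS = {"hour": 2080, "week": 52, "year": 1}

def annualize(amount, unit, academic_inflator=11 / 9):
    """Convert a reported salary to an annual figure.

    Passing academic_inflator=1 mimics the 1993 database, which left
    academic-year salaries uninflated.
    """
    if unit == "academic_year":
        return amount * academic_inflator
    return amount * INFLATORS[unit]

print(annualize(25, "hour"))                 # hourly wage times 2,080
print(annualize(1000, "week"))               # weekly wage times 52
print(round(annualize(45000, "academic_year")))  # academic-year salary times 11/9
```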

The dependent variable in the regression analysis is the logarithm of salary, which is often used in analyzing salary because it is consistent with the concept that salary increases are typically expressed as percentage increases rather than in absolute dollars. [54] Because the log of salary was used as the dependent variable in the regression equations, the average salaries presented in the chapter are geometric means. [55] Like the median, the geometric mean places less emphasis on extremely high values in the calculation of the average, so that the geometric mean for salary will normally be lower than the arithmetic mean.
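A short sketch shows how the geometric mean is computed and why it falls below the arithmetic mean when a few salaries are very high. The salary figures are hypothetical.

```python
import math

def geometric_mean(values):
    """Antilogarithm of the mean of the logarithms of the observations."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical salaries with one very high value.
salaries = [40000, 50000, 60000, 250000]

arithmetic = sum(salaries) / len(salaries)
geometric = geometric_mean(salaries)

print(f"arithmetic mean: {arithmetic:,.0f}")
print(f"geometric mean:  {geometric:,.0f}")
```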

Years since receipt of doctorate, age at PhD, years of full-time experience, and years of part-time experience: The model fitted included squared terms for age when the doctoral degree was received, years since receiving the doctorate, years of full-time experience, and years of part-time experience in addition to the linear terms for these variables. Incorporation of such squared terms is common in the literature (cf. Weiler 1990). Its use was also verified through visual inspection of the graphed relationships between salary and these variables and by verifying that the squared terms were statistically significant at the .001 level when incorporated into the model after inclusion of the linear terms. It should be noted that a quadratic formulation is consistent with the idea that salary may decline toward the end of one's career.
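The link between the quadratic form and a late-career decline can be made concrete. If log salary is modeled as a + b1*x + b2*x^2 and b2 is negative, predicted salary peaks at x = -b1 / (2*b2) and declines afterward. The coefficients below are hypothetical, not estimates from the SDR model.

```python
def peak_year(b_linear, b_squared):
    """Return the value of x at which a + b_linear*x + b_squared*x**2 peaks.

    Returns None when the squared coefficient is not negative, because the
    curve then has no interior maximum.
    """
    if b_squared >= 0:
        return None
    return -b_linear / (2 * b_squared)

# Hypothetical coefficients on years since doctorate and its square.
b1, b2 = 0.04, -0.0008
peak = peak_year(b1, b2)
print(f"predicted log salary peaks about {peak:.0f} years after the doctorate")
```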

In addition to these variables, it would have been interesting to include a measure of time not in the labor force in the model, but the 1993 SDR does not include a direct measure of this.

Occupation: Occupation was measured using NSF's standard detailed coding of occupations, except that non-science-and-engineering occupations were split into "low" and "high" status occupations [56] on the basis of information from the 1993 National Survey of College Graduates (NSCG). Non-science-and-engineering occupations were classified in the "low status" category if fewer than 10 percent of the NSCG respondents in the occupation had doctorate degrees and if the average salary of NSCG respondents in the occupation in 1993 was under $45,000.
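The classification rule can be expressed as a simple predicate. The thresholds come from the text; the function name, argument names, and example values are hypothetical.

```python
def is_low_status(share_with_doctorate, mean_salary_1993):
    """Apply the two-part rule: both conditions must hold for 'low status'."""
    return share_with_doctorate < 0.10 and mean_salary_1993 < 45_000

# Hypothetical occupations: (share of NSCG respondents with doctorates, mean 1993 salary).
examples = {
    "occupation A": (0.03, 32_000),   # meets both conditions
    "occupation B": (0.03, 60_000),   # salary too high
    "occupation C": (0.25, 40_000),   # doctorate share too high
}

for name, (share, salary) in examples.items():
    status = "low" if is_low_status(share, salary) else "high"
    print(f"{name}: {status} status")
```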

Type of employer: The SDR contains two highly related variables that describe the type of employer: sector of employment and, for those in academia, Carnegie classification of employer. Sector of employment in the SDR is based on individuals' self-report of the sector to which they belong, using the following categories: 2-year college; 4-year college; medical school; health-related school other than medical school; university-affiliated research institute; other educational institution; elementary, middle, or secondary school; private for-profit company; private not-for-profit organization; local government; State government; U.S. military service; U.S. Government (civilian employee); and other employer type. [57] The Carnegie classification of academic institutions is a commonly used classification of postsecondary institutions, based on level of degree awarded, fields in which degrees are conferred and, in some cases, enrollment, Federal research support, and selectivity of admissions criteria. It was not possible to include dummy variables for all categories of both of these variables in the regression analysis, because the high correlations between some of the sector variables and some of the Carnegie classification variables led to severe multicollinearity problems. After deletion of redundant measures, a set of dummy variables remained that are not strictly mutually exclusive but collectively describe the type of employer.

[43] Individuals with questions on the methodology employed are encouraged to contact Carolyn Shettle, Division of Science Resources Studies, Room 965, 4201 Wilson Boulevard, Arlington, VA 22230; (703) 306-1780; cshettle@nsf.gov. For background information on salary regression models and on variables used in this model, see Shettle (1972), Blinder (1973), Centra (1974), Kennedy (1992), Kahn (1993), and Wright (1994).
[44] When multiple dummy variables were derived from a single categorical variable, the 0.001 criterion for retention was applied to the entire categorical variable.
[45] See appendix table 5-46 for a list of the variables included in the final regression model along with estimates of the regression coefficients for the variables retained and their standard errors.
[46] This is analogous to using a pooled estimate of a proportion in calculating the standard error for the difference between two proportions, when testing the null hypothesis that the difference between two proportions is equal to 0.
[47] Barbezat (1991) used this approach in addition to using the Oaxaca approach.
[48] This appendix table presents only those demographic variables that had a statistically significant impact on salary at the 0.05 level. Excluded for lack of statistical significance were type of disability (seeing, hearing, walking, and/or lifting) and interaction terms between race and gender and between race, gender, and whether born in the United States.
[49] See Barbezat (1991) for a discussion of this issue.
[50] See Weiler (1990) for a discussion of this issue.
[51] The coefficients for this model are included in appendix table 5-46. Analysts interested in performing a more detailed analysis of the salary gap based on this model can download the relevant appendix tables in spreadsheet format through the Science Resources Studies' Web site (http://www.nsf.gov/statistics/) or can obtain copies of the spreadsheets by contacting Carolyn Shettle (703-306-1780, cshettle@nsf.gov).
[52] This variable was close to being statistically significant. Note also that Formby et al. (1993) found this variable to be important among highly ranked economics departments.
[53] Individuals wishing a copy of the SDR code book or more information on variable coding should contact Carolyn Shettle (703-306-1780, cshettle@nsf.gov).
[54] See, for example, Barbezat (1991), Broder (1993), and Formby et al. (1993).
[55] A geometric mean for a variable is the antilogarithm of the mean of the logarithms of the individual observations on that variable.
[56] The "low status" group included science-related occupations such as technologists and technicians and computer programmers, as well as occupations such as clerical/administrative support, precollegiate teachers/professors, and mechanics and repairers.
[57] Although the question permits individuals to classify themselves as self-employed, self-employed individuals were excluded from the current analysis.
