U.S. Academic Scientific Publishing
4.0 Research Approach: Scope, Data Sources, and Analysis Methods
This section provides an overview of the scope of work, the variables that were included in the database, and analysis methods.
Section 4.1 discusses the two major research efforts in this study: 1) to construct a database on U.S. academic institutions, and 2) to develop models that address the institutional characteristics related to article production, how changes over time in these characteristics relate to changes in article production, and how variables related to article production differ among fields of science.
Section 4.2 discusses the categories of variables contained in the database that was constructed for this study. Specific information about the variable definitions and the methods used to construct the database is contained in the appendices.
Section 4.3 contains a description of the primary analysis methods used for this study.
4.1 Scope of Work
The scope of work for this effort encompassed two major research tasks. In the first, a database on U.S. academic institutions was constructed. This database, called the Publication Trends database, included available information on numerous variables, including publications, citations, patents, research and development (R&D) funding, research-related personnel, institutional and departmental quality, and, where available, the academic fields associated with these measures. The database variables are described in greater detail later in this chapter and in appendix C. Field breakdowns were specified that are consistent with SRS fine field taxonomies and that allow analyses either by individual field of science or by larger aggregations. To the maximum extent possible, relevant variables drawn from different databases, including field of science and institutional names and boundaries, were measured consistently in the newly constructed database. A codebook that thoroughly documents the database was developed. The database was delivered to SRS so that SRS analysts could conduct further examination and analysis.
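After the construction of the database, SRI consulted with SRS subject matter experts and statisticians and performed regression and other multivariate analyses to develop models from the newly constructed database. These models address which institutional characteristics are related to article production, how changes over time in these characteristics relate to changes in article production, and how the variables related to article production differ among fields of science.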
This report discusses the database development process and documents the analyses that were performed and the results of those analyses.
4.2 Variables in the Database
4.3 Statistical Analysis Methods
Most of the analyses in this study were conducted using linear regression in which the dependent (outcome) variable was a publication or citation count, and the independent (explanatory) variables were personnel counts (e.g., faculty by rank, S&E postdoctorates, and S&E Ph.D. recipients), R&D expenditures, or university characteristics (such as Carnegie classification and number of patents approved). R&D expenditures were deflated using the GDP implicit price deflator. All independent variables were lagged to reflect the average time between research and publication and between publication and subsequent citation; the lag lengths were determined by varying the lags and maximizing the correlations between lagged variables. Most analyses were conducted at the institution-year level or the field group-year level. Because the independent variables were highly correlated, stepwise linear regression analysis was employed. In addition to statistical significance, we typically also imposed the requirement that each variable added to the model increase the proportion of explained variance (i.e., R-squared) by at least 0.01.
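The variable-entry rule described above can be illustrated with a minimal sketch. It is not the study's code: the data are synthetic, the variable names (rd_expenditures, postdocs, unrelated) and the function names are hypothetical, and only the R-squared criterion is implemented (the accompanying significance test is omitted).

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y, names, min_gain=0.01):
    """Add predictors one at a time; stop when the best gain in R-squared < min_gain."""
    selected, current_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # R-squared obtained by adding each remaining candidate to the current model.
        trial = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        best = max(trial, key=trial.get)
        if trial[best] - current_r2 < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current_r2 = trial[best]
    return [names[j] for j in selected], current_r2

# Synthetic data (assumption): two correlated predictors that matter, one that does not.
rng = np.random.default_rng(0)
n = 200
rd = rng.normal(size=n)
postdocs = 0.8 * rd + 0.6 * rng.normal(size=n)
unrelated = rng.normal(size=n)
pubs = 2.0 * rd + 1.0 * postdocs + rng.normal(scale=0.5, size=n)

X = np.column_stack([rd, postdocs, unrelated])
chosen, r2 = forward_stepwise(X, pubs, ["rd_expenditures", "postdocs", "unrelated"])
print(chosen, round(r2, 3))
```

The 0.01 threshold keeps a variable out of the model when it adds little explained variance beyond the variables already entered, which matters when candidate predictors are highly correlated.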
Because the data on institutions were longitudinal, it was important to examine whether statistical significance levels obtained via linear regression were accurate. We repeated some important analyses using Hierarchical Linear Modeling (HLM), which accommodates the clustering that occurs in longitudinal data, and found that all coefficients that were added to the model with linear regression were also statistically significant with HLM.
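The concern about significance levels can be illustrated with a short sketch. This is not HLM and not the study's procedure; it substitutes a simpler device, cluster-robust (sandwich) standard errors, to show how repeated observations on the same institution can make ordinary regression standard errors too small. All data are synthetic and all names are hypothetical.

```python
import numpy as np

# Synthetic panel (assumption): 50 institutions observed for 10 years each.
rng = np.random.default_rng(3)
n_inst, n_years = 50, 10
inst = np.repeat(np.arange(n_inst), n_years)
rd = np.repeat(rng.normal(size=n_inst), n_years)           # institution-level predictor
inst_effect = np.repeat(rng.normal(size=n_inst), n_years)  # shared within an institution
pubs = 1.0 * rd + inst_effect + rng.normal(size=n_inst * n_years)

# Ordinary least squares with an intercept.
X = np.column_stack([np.ones(len(pubs)), rd])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ pubs
resid = pubs - X @ beta

# Conventional standard errors, which treat all 500 rows as independent.
sigma2 = (resid @ resid) / (len(pubs) - X.shape[1])
naive_se = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust standard errors: residuals may be correlated within an institution.
meat = np.zeros((X.shape[1], X.shape[1]))
for g in range(n_inst):
    rows = inst == g
    score = X[rows].T @ resid[rows]
    meat += np.outer(score, score)
cluster_se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("slope:", beta[1])
print("naive SE:", naive_se[1], "cluster-robust SE:", cluster_se[1])
```

When the two standard errors differ substantially, significance tests from the plain regression are suspect, which is the situation a multilevel model such as HLM is designed to handle.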
Publication counts could be modeled using three primary variables: academic R&D expenditures, the number of S&E postdoctorates, and the number of S&E Ph.D. recipients. However, these three explanatory variables are not independent. We postulated that increases in academic R&D expenditures would affect publication counts both directly and indirectly through the hiring of additional S&E postdoctorates and support for S&E Ph.D. recipients. To estimate the full effect of academic R&D expenditures on publication counts, we used path analytic modeling. Linear regression was used to estimate the number of additional S&E postdoctorates and S&E Ph.D. recipients that would result from an increase in academic R&D expenditures. The total effect of increased academic R&D expenditures on publication counts could then be estimated as the sum of the direct effects and indirect effects (via additional S&E postdoctorates and Ph.D. recipients).
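The decomposition into direct and indirect effects can be sketched as follows. The data and coefficients are synthetic and purely illustrative (they are not results from the study), and the helper name ols and all variable names are assumptions.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients [intercept, slope(s)...] of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Synthetic data (assumption): R&D spending drives postdoc and Ph.D. counts,
# and all three drive publication counts.
rng = np.random.default_rng(1)
n = 500
rd = rng.normal(10.0, 2.0, size=n)
postdocs = 3.0 * rd + rng.normal(0.0, 2.0, size=n)
phds = 1.5 * rd + rng.normal(0.0, 2.0, size=n)
pubs = 4.0 * rd + 2.0 * postdocs + 1.0 * phds + rng.normal(0.0, 5.0, size=n)

# Outcome equation: publications on all three explanatory variables.
_, direct, b_postdocs, b_phds = ols(np.column_stack([rd, postdocs, phds]), pubs)

# Mediator equations: additional postdocs and Ph.D.s per unit of R&D spending.
a_postdocs = ols(rd, postdocs)[1]
a_phds = ols(rd, phds)[1]

# Total effect = direct effect + indirect effects through each mediator.
indirect = a_postdocs * b_postdocs + a_phds * b_phds
total = direct + indirect

print("direct:", direct, "indirect:", indirect, "total:", total)
```

In a purely linear system fitted by least squares, this total equals the slope obtained by regressing publications on R&D expenditures alone, which provides a convenient arithmetic check on the decomposition.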
Factor analysis was used to examine the relationships among the various publication and citation counts. We found that these counts were very highly correlated, and a single factor (which was approximately an average of normalized versions of these counts) accounted for almost all of their variability. This factor (the first principal component) was used in many of our preliminary analyses. However, we have chosen not to use the first principal component as the dependent variable in our regressions. Rather, to make the analysis more easily interpretable, the dependent variable in each regression is a specific type of fractional or whole publication count (for example, fractional publication counts in the expanding journal set).
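The observation that a single factor is approximately an average of the normalized counts can be illustrated with a small principal component sketch. The three count series are synthetic and hypothetical, and the code computes a first principal component from the correlation matrix rather than reproducing the study's factor analysis.

```python
import numpy as np

# Synthetic data (assumption): three output measures driven by one underlying scale.
rng = np.random.default_rng(2)
n = 300
scale = rng.normal(size=n)
counts = np.column_stack([
    100 + 30 * scale + rng.normal(scale=3, size=n),    # whole publication count
    60 + 20 * scale + rng.normal(scale=2, size=n),     # fractional publication count
    900 + 250 * scale + rng.normal(scale=30, size=n),  # citation count
])

# Normalize each count to mean 0 and standard deviation 1.
z = (counts - counts.mean(axis=0)) / counts.std(axis=0)

# Eigen-decomposition of the correlation matrix; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
share = eigvals[-1] / eigvals.sum()   # proportion of variance on the first component
loadings = eigvecs[:, -1]
if loadings.sum() < 0:                # the sign of an eigenvector is arbitrary
    loadings = -loadings

pc1 = z @ loadings                    # first principal component scores
simple_avg = z.mean(axis=1)           # plain average of the normalized counts
agreement = np.corrcoef(pc1, simple_avg)[0, 1]

print("variance share:", share, "correlation with simple average:", agreement)
```

When the counts are this highly correlated, the loadings are nearly equal, so the component carries little information beyond a simple average, which is consistent with the choice to report regressions on specific, more interpretable counts.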