National Science Foundation National Center for Science and Engineering Statistics
U.S. Academic Scientific Publishing

4.0 Research Approach: Scope, Data Sources, and Analysis Methods


This section provides an overview of the scope of work, the variables included in the database, and the analysis methods used.

Section 4.1 discusses the two major research efforts in this study: 1) constructing a database on U.S. academic institutions, and 2) developing models that address which institutional characteristics relate to article production, how changes over time in these characteristics relate to changes in article production, and how the variables related to article production differ among fields of science.

Section 4.2 discusses the categories of variables contained in the database that was constructed for this study. Specific information about the variable definitions and the methods used to construct the database is contained in the appendices.

Section 4.3 contains a description of the primary analysis methods used for this study.

4.1 Scope of Work

The scope of work for this effort encompassed two major research tasks. In the first, a database on U.S. academic institutions was constructed. This database, called the Publication Trends database, included available information on numerous variables, including publications, citations, patents, research and development (R&D) funding, research-related personnel, institutional and departmental quality, and, where available, academic fields associated with these various measures; the database variables are described in greater detail later in this chapter and in appendix C. Field breakdowns were specified that are consistent with SRS fine field taxonomies, and that allow analyses either by individual field of science or larger aggregations. To the maximum extent possible, the relevant variables, including field of science and institutional names/boundaries, drawn from different databases were measured consistently in the newly constructed database. A codebook that thoroughly documents the database was developed. The database was delivered to SRS to allow SRS analysts to conduct further examination and analysis.

After the construction of the database, SRI consulted with SRS subject matter experts and statisticians and performed regression and other multivariate analyses to develop models from the newly constructed database to address the following research questions:

  • How do various institutional characteristics (e.g., NRC quality ratings, R&D funding, institutional control, availability of science and engineering graduate students and Ph.D.s, and patenting activity), and changes over time in those characteristics, relate to article production (measured in the range of ways possible in the database)?
  • How do the variables related to an institution's article production differ for different fields of science?

This report discusses the database development process and documents the analyses that were performed and the results of those analyses.


4.2 Variables in the Database

Brief descriptions of the variables in the Publication Trends database are presented below. For more details about these variables and source references, see Appendices A, B, and C.

  • Federal Interagency Committee on Education (FICE) codes—The FICE code is an institutional identifier used by the NSF Survey of Research and Development Expenditures at Colleges and Universities to collect data. Each institutional group was identified using the FICE code for the parent institution (called the PFICE code) and the parent's institutional name. The database contains information on the top 200 institutional groups, referred to in the remainder of this report as the top 200 R&D-performing academic institutions. FICE codes are used to distinguish one institution from another for analytical purposes.
  • Field codes—A standardized code for science and engineering academic disciplines as defined in the NSF WebCASPAR system. Appendix E lists the WebCASPAR fields.
  • Publications and Citation Counts—The two types of outcome (dependent) variables are: 1) annual article counts and 2) annual citation counts (i.e., the number of times an article was cited in the given year). Each outcome variable can be measured in four ways: 1) whole counts with the fixed journal set, 2) whole counts with the expanding journal set, 3) fractional counts with the fixed journal set, and 4) fractional counts with the expanding journal set. This yields eight outcome variables in the database.
  • Patents—Number of patents granted by year and associated with an institution (from the U.S. Patent Office database, http://www.uspto.gov/web/offices/cio/cis/prodsvc.htm).
  • R&D Funding—Amount of annual R&D funding received by an institution, disaggregated by source (from the NSF Survey of Research and Development Expenditures, 1988 to 2001, available in the WebCASPAR system of the SRS web site).
  • Quality Measures—Departmental quality as measured in the National Research Council's Assessment of Research-Doctorate Programs (1993).
  • Graduate Students and Postdoctorates—Annual number of graduate students and postdoctorate appointees in science and engineering (S&E) at an institution, as reported in the NSF-NIH Survey of Graduate Students and Postdoctorates in Science and Engineering (1988–2001).
  • Doctoral Recipients—Annual number of doctoral recipients in S&E at the institutional and departmental levels, as reported in the Survey of Earned Doctorates (SED) between 1988 and 2001. Departmental affiliation was obtained from the 3-digit SED/Doctorate Records File (DRF) specialty code.
  • Degree Data—Annual number of degrees granted at the undergraduate, master's, and Ph.D. levels by institution and by department, as reported in the NCES Integrated Postsecondary Educational Data System (IPEDS) Completions Survey (1988–2001).
  • Institutional Type and Control—The 1994 Carnegie classification and institutional control (public versus private).
  • General Financial—General financial data at the institutional level regarding reliance on Federal, state, and tuition income and spending for education, research, and other functions, as reported in the NCES IPEDS Finance Survey (1988–1996). Significant changes in these data after 1996 render the pre- and post-1996 data incomparable. Because the changes in publication trends of greatest interest to NSF began around 1996, these data were not used in the analysis.
  • Faculty Counts—Annual number of faculty at the institutional level, including breakdowns by faculty rank, as reported in the NCES IPEDS Salaries, Tenure and Fringe Benefits of Full-time Instructional Faculty Survey (1988–1997, 1999).
  • Number of S&E Ph.D.s Employed—Annual number of S&E Ph.D.s holding U.S. doctorates employed at each institution, obtained from the NSF Survey of Doctorate Recipients. Although these data were included in the database, they were not used in the analyses because of data limitations (see appendix B).
  • Non-Faculty Doctoral Research Staff—Annual number of non-faculty staff with doctorates at the departmental level by presence or absence of an MD degree, as reported in the NSF-NIH Survey of Graduate Students and Postdoctorates in Science and Engineering (1988–2001).
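The distinction above between whole and fractional publication counts can be sketched as follows. This is a minimal illustration with hypothetical toy articles and institution names, not code or data from the study; under whole counting each contributing institution receives full credit for an article, while under fractional counting credit is split equally among the contributing institutions.

```python
# Hypothetical articles, each listing its contributing institutions.
articles = [
    {"id": 1, "institutions": ["Univ A"]},
    {"id": 2, "institutions": ["Univ A", "Univ B"]},
    {"id": 3, "institutions": ["Univ A", "Univ B", "Univ C"]},
]

def whole_counts(articles):
    # Whole counting: each contributing institution gets full credit (1.0).
    counts = {}
    for art in articles:
        for inst in set(art["institutions"]):
            counts[inst] = counts.get(inst, 0.0) + 1.0
    return counts

def fractional_counts(articles):
    # Fractional counting: credit for each article is split equally
    # among its contributing institutions.
    counts = {}
    for art in articles:
        insts = set(art["institutions"])
        share = 1.0 / len(insts)
        for inst in insts:
            counts[inst] = counts.get(inst, 0.0) + share
    return counts
```

Note that fractional counts sum to the total number of articles, whereas whole counts double-count multi-institution articles; this is why the database carries both measures.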

4.3 Statistical Analysis Methods

Most of the analyses in this study were conducted using linear regression, where the dependent (outcome) variable was a publication or citation count and the independent (explanatory) variables were personnel counts (e.g., faculty by rank, S&E postdoctorates, and S&E Ph.D. recipients), R&D expenditures, or university characteristics (such as Carnegie classification and number of patents approved). R&D expenditures were deflated using the GDP implicit price deflator. All independent variables were lagged to reflect the average time between research and publication and between publication and subsequent citation; the lags were chosen by varying them and selecting those that maximized the correlations between the lagged variables. Most analyses were conducted at the institution-year level or the field group-year level. Because the independent variables were highly correlated, stepwise linear regression analysis was employed. In addition to statistical significance, we typically required that each variable added to the model increase the proportion of explained variance (R-squared) by at least 0.01.
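The lag-selection step described above can be sketched as follows. The helper names and the short funding/publication series are hypothetical illustrations, not part of the study's code: for each candidate lag, the explanatory series is shifted and correlated with the outcome, and the lag with the highest correlation is kept.

```python
def pearson(x, y):
    # Pearson correlation coefficient of two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def best_lag(x, y, max_lag):
    # Correlate y[t] with x[t - lag] for each candidate lag and keep
    # the lag that maximizes the correlation.
    best = None
    for lag in range(max_lag + 1):
        pairs = [(x[t - lag], y[t]) for t in range(lag, len(y))]
        xs, ys = zip(*pairs)
        r = pearson(list(xs), list(ys))
        if best is None or r > best[1]:
            best = (lag, r)
    return best

# Hypothetical annual series: R&D funding and publication counts.
funding = [10, 12, 15, 14, 18, 20, 22, 21, 25, 27]
pubs = [5, 5, 6, 7, 8, 8, 9, 10, 11, 11]
lag, r = best_lag(funding, pubs, max_lag=4)
```

In the study's analyses the same idea was applied to the actual database series, with the chosen lag then used to align the explanatory variables in the regressions.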

Because the data on institutions were longitudinal, it was important to examine whether statistical significance levels obtained via linear regression were accurate. We repeated some important analyses using Hierarchical Linear Modeling (HLM), which accommodates the clustering that occurs in longitudinal data, and found that all coefficients that were added to the model with linear regression were also statistically significant with HLM.

Publication counts could be modeled using three primary variables: academic R&D expenditures, the number of S&E postdoctorates, and the number of S&E Ph.D. recipients. However, these three explanatory variables are not independent. We postulated that increases in academic R&D expenditures would affect publication counts both directly and indirectly through the hiring of additional S&E postdoctorates and support for S&E Ph.D. recipients. To estimate the full effect of academic R&D expenditures on publication counts, we used path analytic modeling. Linear regression was used to estimate the number of additional S&E postdoctorates and S&E Ph.D. recipients that would result from an increase in academic R&D expenditures. The total effect of increased academic R&D expenditures on publication counts could then be estimated as the sum of the direct effects and indirect effects (via additional S&E postdoctorates and Ph.D. recipients).
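The path-analytic decomposition described above reduces to simple arithmetic once the component regressions are estimated. The coefficient values below are hypothetical placeholders, not results from the study; the sketch only shows how direct and indirect effects combine into a total effect.

```python
# Hypothetical regression coefficients (illustrative only).
direct = 0.40            # direct effect of R&D expenditures on publications

# Effect of R&D expenditures on each mediator:
rd_to_postdocs = 2.5     # additional S&E postdocs per unit of R&D
rd_to_phds = 1.2         # additional S&E Ph.D. recipients per unit of R&D

# Effect of each mediator on publication counts:
postdocs_to_pubs = 0.10
phds_to_pubs = 0.05

# Each indirect effect is the product of the two path coefficients;
# the total effect is the direct effect plus all indirect effects.
indirect = rd_to_postdocs * postdocs_to_pubs + rd_to_phds * phds_to_pubs
total = direct + indirect
```

With these placeholder values, more than 40 percent of the total effect flows through the two mediators, which is why estimating only the direct coefficient would understate the full effect of R&D expenditures.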

Factor analysis was used to examine the relationships among the various publication and citation counts. We found that these counts were very highly correlated, and a single factor (which was approximately an average of normalized versions of these counts) accounted for almost all of their variability. This factor (the first principal component) was used in many of our preliminary analyses. However, we have chosen not to use the first principal component as the dependent variable in our regressions. Rather, to make the analysis more easily interpretable, the dependent variable in each regression is a specific type of fractional or whole publication count (for example, fractional publication counts in the expanding journal set).
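The first principal component of a set of highly correlated count variables can be sketched as follows, assuming hypothetical toy data; the study's factor analysis was run on the full database. The columns are standardized (the "normalized versions" of the counts), and power iteration recovers the leading eigenvector of their correlation matrix.

```python
def first_principal_component(rows, iters=200):
    # rows: observations over k count variables (one column per count).
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    stds = [(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) ** 0.5
            for j in range(k)]
    # Standardize each column.
    z = [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in rows]
    # Correlation matrix of the standardized columns.
    corr = [[sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1)
             for b in range(k)] for a in range(k)]
    # Power iteration converges to the leading eigenvector (the loadings
    # of the first principal component).
    v = [1.0] * k
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

When the input counts are nearly perfectly correlated, the loadings come out nearly equal, so the component is approximately an average of the normalized counts, consistent with the description above.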


Working Paper | SRS 11-201 | November 2010