Summary of March status report
Steps taken by the National Center for Science and Engineering Statistics (formerly SRS) of NSF in fall 2007 to strengthen the confidentiality protections applied to the Survey of Earned Doctorates (SED) impacted the reporting of data about the race/ethnicity and gender of doctorate recipients by fine field of degree. SED data users reacted negatively to these steps and to the consequent diminished utility of two SED publications – the Interagency Summary Report and the Race/Ethnicity/Gender (REG) Tables report. In response, NCSES initiated efforts to learn more about the data needs and uses of the SED data user community, and to solicit feedback that would inform the redesign of these reports to increase their utility while still protecting the confidentiality of respondent data. These efforts included informal meetings with a variety of concerned groups and individuals, a web survey of SED data users (373 respondents) about their data needs and uses, and eight outreach meetings with segments of the SED data user community that especially value the SED race/ethnicity data (e.g., representatives of minority-serving doctoral degree-granting institutions and STEM professional organizations).
Data suppression and data perturbation disclosure protection methods were judged to be infeasible for achieving a desirable balance between data utility and disclosure risk when reporting REG data from the SED. Consequently, NCSES turned to data aggregation and developed three options, one for each of the three data dimensions reported in the REG Tables:
- Aggregate across race/ethnicity groups that have small counts to create an "underrepresented minorities" category.
- Aggregate across fields of degree that have small counts.
- Aggregate across years (2-year, 3-year, and 4-year options were developed).
At the time of the March status report, NCSES was engaged in the process of analyzing the information collected from the web survey and outreach meetings and determining which data aggregation approach would best meet the needs of data users while still protecting the confidentiality of data provided by SED respondents.
Update since March
After analyzing the feedback from data users and assessing the risks and technical feasibility of the three data aggregation alternatives, NCSES selected "field aggregation" as the basis for the new disclosure protection approach for reporting SED race/ethnicity/gender data by field of degree. At the request of NCSES, the Committee on National Statistics (CNSTAT) of the National Academies convened an Expert Panel, which held a workshop in May 2009 to address NCSES's proposed disclosure protection approach. NCSES developed a background paper on its proposed approach, which provided input to the CNSTAT panel's review. The workshop afforded experts in the statistical confidentiality field an opportunity to analyze the new approach and assess the extent to which it effectively balances disclosure risk and data utility. The CNSTAT panel confirmed that NCSES needs to use some form of disclosure protection technique when reporting SED race/ethnicity/gender data by fine field of degree, and judged that the proposed field aggregation approach would offer a sufficient level of protection. The CNSTAT Expert Panel's report, entitled "Protecting and Accessing Data from the Survey of Earned Doctorates: A Workshop Summary," was published by the National Academies Press in October 2009.
It is important to note that the "field aggregation" disclosure protection strategy is being applied to a small (but important) subset of SED data: statistical tables that report counts of doctorate recipients by fine field of degree within race/ethnicity and/or gender categories. Still, the application of the new strategy necessitated the redesign of the entire REG Tables report, several tables within the Interagency Summary Report, and a few tables each in the S&E Doctorate Awards report and the S&E Degree by Race/Ethnicity report.
Beginning with the forthcoming editions, the REG Tables report and the Interagency Summary Report will be published by NSF and will conform to NSF publication standards and formats. Both reports had previously been published by the SED survey contractor. Further, users will no longer be charged for the REG Tables. The 2007 Summary Report was published in December 2008 in abbreviated form, that is, without narrative text and with only a subset of the statistical tables usually published. The combined 2007/2008 Summary Report includes the complete set of 2008 tables plus narrative discussion of the 2008 SED data, and also includes the full array of 2007 Summary Report tables as an appendix. The following schedule notes the projected publication dates for the next editions of SED reports.
- 2007/2008 (combined) Summary Report; 2008 Data Release InfoBrief
- 2008 REG Tables
- 2008 S&E Doctorate Awards
- S&E Degree, by Race/Ethnicity of Recipients: 1999–2008
- S&E Degree: 1966–2008
Changes are also forthcoming in the way SED data will be made available through WebCASPAR, the data reporting system that is accessible via the NCSES website. The changes will strengthen the confidentiality protections applied to SED data on WebCASPAR. By mid-January 2010, the following variables from the 2007 SED and 2008 SED will be loaded into WebCASPAR: academic institution, academic discipline (both broad and detailed fields), institutional control (i.e., public versus private), highest degree (another categorization of institutions), and state. The system will no longer permit the generation of certain types of complex tables (e.g., counts of doctorate recipients by institution and classified by race/ethnicity/gender) that are currently possible. NCSES is in the initial stages of testing a more technologically advanced approach to providing SED data to its data users. Public online access to a limited set of SED variables will become available to all data users in spring 2010. This new system will give SED data users access to a broader array of SED data, and more sophisticated data exploration and analysis tools, while still protecting the confidentiality of the data provided by respondents to the SED.
Outline of field aggregation as disclosure protection approach
The overarching concern with reporting small counts of doctorate recipients in race/ethnicity/gender categories by field of degree is that, given the close-knit character of some academic fields, reporting these data makes it easier to uncover the names and sensitive information of particular individuals by relating data cell values to other (e.g., internet-accessible) information sources. Field aggregation reduces this disclosure risk by combining the data in a small field of degree with one or more related fields, so that the degree count in the new "aggregated field" is large enough that it becomes very difficult to identify the individuals whose data are reported in the cell. The field aggregation approach adopted by NCSES involves the following three general steps.
1. Determine which fields must be aggregated
NCSES judged that 25 doctorate recipients is an appropriate threshold for determining whether the total number of doctorates awarded in a field of degree is large enough to safely report doctorate recipient counts by race/ethnicity/gender within that field. This judgment is based on two numbers: the standard threshold for cell suppression previously used in SED reports to protect the confidentiality of respondent data (a data cell value is suppressed if the cell count is less than five) and the number of basic race/ethnicity categories (also five: American Indian/Alaska Native, Asian, Black, Hispanic, and White), which together give 5 × 5 = 25. Hence, if 25 or more individuals earn a doctorate degree in a particular field in a given year, the counts of doctorate recipients in all race/ethnicity/gender categories within that field are reported, however small the number of doctorates in any one cell. If fewer than 25 doctorates are awarded in a field, the counts of doctorate recipients are aggregated with those of one or more related fields until the total number of doctorates in the aggregated field rises to at least 25. After aggregation, any data cell value in a race/ethnicity/gender category of the aggregated field can be safely reported, however small its count, but the constituent fields that make up the aggregated field are not reported separately.
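The threshold test in this step can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not NCSES's production code: the field names and counts are invented, and the names `FIELD_THRESHOLD` and `needs_aggregation` are hypothetical.

```python
# Illustrative sketch only; field names and counts below are made up.
SUPPRESSION_THRESHOLD = 5  # earlier rule: suppress a cell whose count is below five
RACE_ETHNICITY_CATEGORIES = (
    "American Indian/Alaska Native", "Asian", "Black", "Hispanic", "White",
)
# 5 categories x 5 per cell = 25 doctorates per field
FIELD_THRESHOLD = SUPPRESSION_THRESHOLD * len(RACE_ETHNICITY_CATEGORIES)


def needs_aggregation(field_total: int) -> bool:
    """Return True when a field's total doctorate count is below the threshold."""
    return field_total < FIELD_THRESHOLD


# Hypothetical totals of doctorates awarded per fine field in a single year.
field_totals = {"Field A": 140, "Field B": 25, "Field C": 24, "Field D": 3}

below_threshold = [
    name for name, total in field_totals.items() if needs_aggregation(total)
]
print(FIELD_THRESHOLD)
print(below_threshold)
```

Note that the test is applied to the field's total count, not to individual race/ethnicity/gender cells: a field with exactly 25 doctorates is reported in full, even if some of its cells contain only one or two recipients.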
2. Determine which fields are candidate aggregation partners
The choice of aggregation partners for below-threshold fine fields, that is, the determination of which fields are "related" and therefore suitable aggregation partners, is guided by the Classification of Instructional Programs (CIP) taxonomy. The CIP is a taxonomic scheme developed by the U.S. Department of Education's National Center for Education Statistics to support the tracking, assessment, and reporting of fields of study and program completions activity. The CIP taxonomy lends legitimacy to field aggregation decisions because it is a widely accepted framework that describes relationships among fields of degree, and many SED data users have used it. The CIP taxonomy encompasses over 50 high-level categories of instructional fields (the 2-digit CIP code level), several hundred categories of 4-digit CIP fields, and well over one thousand 6-digit CIP fields. In general, the 270+ fine fields of degree in the 2006 SED taxonomy correspond to fields at the 6-digit level of the CIP taxonomy, and the 30+ major and 7 broad fields of degree in the SED taxonomy correspond to CIP fields at the 2-digit level. Consequently, in many cases 4-digit CIP codes can serve as an intermediate level of field aggregation between the fine fields and major fields of the SED taxonomy, and can help determine the candidate aggregation partners for a below-threshold fine field of degree. That is, the search for aggregation partners for a below-threshold fine field of degree can begin with other fine fields that share the same 4-digit CIP code, because the common 4-digit code signifies that the fine fields are related or similar to some extent.
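The search for candidate partners can be pictured as grouping fine fields by their shared 4-digit CIP prefix. The following Python sketch assumes 6-digit CIP codes written in the usual "XX.XXXX" form; the field names, code assignments, and counts are placeholders chosen for illustration and are not the actual SED-to-CIP correspondence.

```python
from collections import defaultdict

FIELD_THRESHOLD = 25  # threshold from step 1


def cip4(cip6: str) -> str:
    """Truncate a 6-digit CIP code 'XX.XXXX' to its 4-digit form 'XX.XX'."""
    return cip6[:5]


# (fine field name, 6-digit CIP code, doctorates awarded) -- all hypothetical
fine_fields = [
    ("Field A", "40.0801", 120),
    ("Field B", "40.0802", 12),
    ("Field C", "40.0810", 9),
    ("Field D", "40.0501", 60),
    ("Field E", "26.0202", 18),
]

# Group fine field names by their 4-digit CIP code.
by_cip4 = defaultdict(list)
for name, code, _total in fine_fields:
    by_cip4[cip4(code)].append(name)

# For each below-threshold field, list the other fields sharing its 4-digit code.
candidates = {
    name: [other for other in by_cip4[cip4(code)] if other != name]
    for name, code, total in fine_fields
    if total < FIELD_THRESHOLD
}
print(candidates)
```

In this made-up data, "Field E" has no neighbor at the 4-digit level, so its candidate list is empty. The 4-digit code is only where the search begins; a field like this would need partners found by some other means.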
3. Select aggregation partners from among candidates
NCSES developed a set of "Aggregation Rules" that specify how below-threshold fine fields are to be assigned particular aggregation partners from among potential aggregation candidates.
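The Aggregation Rules themselves are not reproduced in this report. Purely to make the idea concrete, the Python sketch below assumes one simple, hypothetical rule: within a group of candidate partners, pool all below-threshold fields, then add the smallest remaining fields one at a time until the pooled total reaches 25. The function name `aggregate_group` and all counts are invented, and the actual NCSES rules may differ substantially.

```python
from typing import Dict, List, Tuple

FIELD_THRESHOLD = 25  # threshold from step 1


def aggregate_group(totals: Dict[str, int]) -> Tuple[List[str], int]:
    """Apply a hypothetical greedy rule to one group of candidate partners.

    Returns the names pooled into the aggregated field and the pooled total.
    An empty list means no field in the group needed aggregation.
    """
    pooled = sorted(name for name, n in totals.items() if n < FIELD_THRESHOLD)
    if not pooled:
        return [], 0
    pooled_total = sum(totals[name] for name in pooled)
    # Add above-threshold fields, smallest first, only while still below 25.
    remaining = sorted((n, name) for name, n in totals.items() if name not in pooled)
    for n, name in remaining:
        if pooled_total >= FIELD_THRESHOLD:
            break
        pooled.append(name)
        pooled_total += n
    return pooled, pooled_total


# Hypothetical group of fine fields sharing a 4-digit CIP code.
group = {"Field A": 120, "Field B": 12, "Field C": 9, "Field F": 40}
members, total = aggregate_group(group)
print(members, total, total >= FIELD_THRESHOLD)
```

With these invented counts, the two small fields sum to 21, which is still below the threshold, so the sketch also pulls in the smallest above-threshold field. A group whose pooled total never reaches 25 would be returned below threshold and would need a wider search for partners.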
An in-depth discussion of the origins of the REG data reporting issue, the process by which NCSES collected information from SED data users, the three aggregation alternatives presented to outreach meeting participants, and the details of the field aggregation approach to disclosure protection—including illustrations of the Aggregation Rules using 2006 REG Table data—appears in the NCSES working paper, "Disclosure Protection Strategies for Race/Ethnicity/Gender (REG) Tables for the Survey of Earned Doctorates (SED)" (forthcoming, 2010), which will be accessible via this webpage. This is the report NCSES submitted to the CNSTAT Expert Panel in April 2009. For a shorter discussion of the same topics, see "Obtaining User Needs While Protecting Individually Identifiable Data for the Survey of Earned Doctorates," a paper presented by NCSES staff at the annual conference of the American Statistical Association, August 2009, and published in the conference proceedings (please contact Mark Fiegener at firstname.lastname@example.org to request an electronic copy of this paper).