
HDR: Harnessing the Data Revolution


The National Science Foundation's (NSF) Harnessing the Data Revolution (HDR) Big Idea is a national-scale activity to enable new modes of data-driven discovery that will allow new fundamental questions to be asked and answered at the frontiers of science and engineering. Through this NSF-wide activity, HDR will generate new knowledge and understanding, and accelerate discovery and innovation. The HDR vision is realized through an interrelated set of efforts in:

  • The foundations of data science;
  • Algorithms and systems for data science;
  • Data-intensive science and engineering;
  • Data cyberinfrastructure; and
  • Education and workforce development.

Each of these efforts is designed to amplify the intrinsically multidisciplinary nature of the emerging field of data science. The HDR Big Idea will establish theoretical, technical, and ethical frameworks that will be applied to tackle data-intensive problems in science and engineering, contributing to data-driven decision-making that impacts society.

Below are examples of science and engineering challenges that HDR Institutes may choose to address. This list is meant to be illustrative, not exhaustive.


Most ecological forecasting in the past has focused on centennial-scale climate responses. However, understanding how ecosystems and the resources they provide will change over the next decade, and how human decisions affect those trajectories, will require harnessing data from disparate sources, such as the National Ecological Observatory Network (NEON), Long Term Ecological Research (LTER) sites, NOAA satellites, and awards made through the EarthCube and Advances in Biological Informatics programs. Predicting the "Rules of Life" that drive these processes will require convergent, translational teams of data scientists, engineers, and domain scientists who integrate heterogeneous data sets in new and innovative ways, translating these data resources into increased understanding and improved human decision making. The Near-Term Ecological Forecasting data challenge is to stimulate activities that use diverse data sets to model and better understand the natural processes that drive ecological change over decadal time frames.


The electric power grid, which plays a crucial role in supporting modern civilization, is not positioned to meet the needs of the 21st century, and similar technological hurdles exist for other critical industries, such as transportation and cyber-enabled manufacturing. Radical transformation of these industries and infrastructures is vital to maintaining U.S. economic competitiveness and quality of life. Next-generation engineering systems need to systematically integrate dynamic sensor data, real-time learning, and distributed decision making, and exploit both data-driven methodologies and domain-specific knowledge. This will require leveraging advances in novel sensors and networks, high-bandwidth wireless communications, and widespread use of smart devices, as well as integrating real-time information from users, resources, and service providers with domain knowledge. Research is needed to develop predictive frameworks by combining data-driven approaches with domain-specific models and systems that incorporate strategic human behavior, sub-system interactions, technological constraints, and privacy and security constraints. The envisioned cross-disciplinary and convergent research is expected to result in a paradigm shift wherein network intelligence and computation will be dynamically distributed between the network edge and a centralized cloud system, in response to dynamic operational requirements. The Real-Time Sensing, Learning and Decision-Making data challenge is to integrate expertise in sensing and communication systems, signal processing, machine learning, control and optimization techniques, and the social, economic and network sciences with domain-specific knowledge, to develop radically new paradigms for the design, analysis and optimization of dynamic and complex engineering systems.


There is a paradigm shift underway in Earth system science. Until recently, geoscience measurements were taken in situ, for example, within a watershed, on a ship, or with seismic stations. Considering the expanse of the Earth, however, these measurements are few and far between, which has limited the development and validation of forecasting models. Today, measurements are also captured via sensor networks, drones, aircraft, and satellites, and thus provide near-real-time Earth information around the entire globe. The information gathered has the potential to revolutionize the development of new physical models and forecasting across the geosciences. In addition, data assimilation spanning scientific domains, such as the atmospheric and ocean sciences, can significantly advance forecasting at various scales. Realizing this potential, however, requires overcoming myriad challenges, including sensors gathering data at disparate spatial and temporal scales, integrating that heterogeneous information, and enabling models to ingest those data. For instance, surface brightness temperature is not the same as surface temperature, and land surface elevation data gathered from airborne LIDAR or radar are not strictly measuring ground elevation. When data are integrated across domains, the volumes and types of data complicate how they are used to improve models and, subsequently, how those models can interpret observations at varying scales. Addressing the data science challenges to advance our Understanding and Predictability of the Earth System will require convergence between domain scientists in the geosciences and closely affiliated fields, spanning climate, weather, hydrologic, seismic, and space weather hazards, and data and computer scientists.


Advanced materials stimulate and enable the realization of technological innovation and are a conduit to the discovery of new fundamental science and new states of matter. Synergistic interaction of computation, experiment, and theory combined with new algorithms, advanced instrumentation, and the innovative use of shared data provides a means to accelerate the process from materials discovery to deployment. Shared data and their innovative use play a key role as they offer mechanisms to provide insight and break bottlenecks in the solution of challenging scientific problems along the way. Examples include but are not limited to: the prediction of first-synthesis pathways to realize new materials; advances in predictive modeling that leverage, for instance, machine learning, artificial intelligence, data mining, and sparse approximation; and real-time control of autonomous experiments. Challenges include accelerating the process to discover advanced materials, related states of matter, and fundamental science through the support of data-intensive research that integrates heterogeneous data. The goal is to develop new Data Analytic Methods and Algorithms for Advanced Materials Discovery using innovative data cyberinfrastructure to effectively find, share, access, and understand materials data.


Chemical space is vast and largely unexplored. Although the value of data-driven approaches and rational design in chemistry is recognized, there remain significant barriers to their broad implementation in bench-top research. Integration of data science paradigms and tools with chemical synthesis, characterization, and simulation would provide better mechanistic insights and spark new directions in chemistry research. Example areas of interest include, but are not limited to, the use of chemical data, both successes and failures, to discover new chemical reactions; prediction and design of next-generation catalysts; elucidation of fundamental principles for emergent molecular properties arising from atomic-scale interactions; and enabling real-time, complex decision making in chemical systems. The Predictive Design and Decision Making in Chemistry data challenge is to manage, query, and analyze complex chemical data sets to accelerate novel knowledge discovery, advance our capacity to tackle sophisticated chemical problems, and improve our understanding of molecular systems.


The first observation of a binary neutron star merger by LIGO/Virgo, Fermi, and many ground-based observatories around the world announced the arrival of the long-anticipated field of Multi-Messenger Astrophysics (MMA). Observation and analysis of just this single event took weeks and consumed a significant fraction of the human and computational resources of the global gravitational and astronomical communities. The IceCube Neutrino Observatory at the South Pole has recently identified a very high-energy (>PeV) neutrino, and the Fermi and MAGIC observatories have identified high-energy gamma rays coincident in time and location with that neutrino. The Large Synoptic Survey Telescope (LSST) primary survey will start in late 2022, needing extra capacity to ensure the daily massive data sets are processed and coordinated with gravitational, particle, and other astronomical and astrophysical observations, all in near-real-time and with very low latency. MMA now needs to address the challenges of this increase in the amount and complexity of data that will soon be upon us, as we start around-the-clock monitoring of frequent multi-messenger events. The Massive Data in MMA challenge is to develop a new community-scale data cyberinfrastructure for timely handling, processing, analysis, and modeling of multi-messenger astrophysical data.


Understanding human behavior and social processes is critical for important societal goals such as designing better work environments, promoting well-being and safety, and improving life-long learning. Large amounts of data at different levels of granularity provide opportunities for new research tools to support these efforts. The multiple types of relevant data include, but are not limited to, brain scans, perceptual and cognitive performance, surveys and social media, and market and administrative data. In addition to methods for structuring each of these separate data types, research is needed on ways to combine heterogeneous data with different degrees of predictive validity that also vary across space and time. At the same time, given the data about individuals that are now accessible, it is essential to identify mechanisms to protect the confidentiality and privacy associated with information in these large data sets. Realizing the potential of big data for improved human productivity and innovation will require convergent efforts by teams of data scientists and social and behavioral scientists to develop the requisite models, software, and cyberinfrastructure. The challenge of Integrating Heterogeneous Data to Support Human Productivity and Innovation motivates the creation of an infrastructure that will stimulate the use of diverse and complex data sets to understand fundamental questions of societal importance. Further investments in integrating heterogeneous data will empower the social, behavioral, and economic sciences communities to address national priorities through robust and reliable convergence research.


Broadening participation and improving student success across Science, Technology, Engineering, and Mathematics (STEM) disciplines, including data science, can be achieved by harnessing heterogeneous data about student learning. Massive amounts of heterogeneous data are being collected from physical classrooms and virtual learning environments. If harnessed to their best potential, these data could answer fundamental questions about what works in STEM education, for whom, and why. By combining existing heterogeneous data sets with new data streams (e.g., video, social media, GPS), these data would also enable development of predictive analytics to support personalized learning interventions, with the potential to promote the success of each student. In this context, the Understanding Student Learning and Success Across STEM Disciplines challenge can motivate the development and use of learning analytics, data mining tools, models, and algorithms that cross data scales, from individual student behaviors to massive government data sets. Using these tools to understand differences among groups of STEM learners is a priority, because this knowledge could support development of personalized learning strategies to increase equitable access to STEM careers, thus broadening participation in STEM.


Visual science spans the study of biological vision systems and computer vision systems. Researchers in biology, neuroscience, cognitive science, and biomedical engineering use computational models to understand the basis of human and animal vision. Computer scientists and engineers build computer systems that turn pixels into high-level conceptual representations for a variety of practical applications. Optical scientists discover principles of light and materials that constrain both biological and computer vision. There has long been fruitful interplay among the life science, computer science, and optical science approaches to vision. For example, many of today's breakthroughs in deep learning algorithms for computer vision are based on insights from animal studies of early visual processing. Many researchers agree, however, that there needs to be greater sharing of ideas between the fields. The Harnessing Visual Science for Scientific and Societal Impact challenge can bring scientists and engineers together to harness modern data-intensive models for biological and computer vision, leading to breakthroughs in both scientific knowledge and applications with broad positive social impacts.