Synopsis
Enormous digital datasets abound in all facets of our lives - in e-commerce, in World Wide Web information resources, and in many realms of science and engineering. Looking ahead, the pace of data production will only accelerate with the increasing digitization of communication and entertainment and the continuing assimilation of computing into everyday life. Data will arise from many sources, will require complex processing, may be highly dynamic, may be subject to high demand, and will be important in a range of end-use tasks. The broad availability of data, coupled with the increased capabilities and decreased costs of both storage and computing technologies, has led to a rethinking of how we solve problems that were previously impractical or, in some cases, even impossible to solve. Yet despite these continuing advances and decreasing costs, data production and collection are outstripping our ability to process and store data. This compels us to rethink how we will manage - store, retrieve, explore, analyze, and communicate - this abundance of data.
These technical and social drivers have made it urgent to support computation on data at far larger scales than previously contemplated. Data-intensive computing is at the forefront of ultra-large-scale commercial data processing, and industry has taken the lead in creating data centers comprising myriad servers that store petabytes of data to support business objectives and to provide services at Internet scale. Such data centers are instances of data-intensive computing environments, the target of this solicitation. In data-intensive computing, the volume of data is the dominant issue, and the emphasis is on the data-intensive nature of the computation.
Data-intensive computing demands a fundamentally different set of principles from those of mainstream computing. Many data-intensive applications admit large-scale parallelism over the data and are well suited to specification via high-level programming primitives in which the run-time system manages parallelism and data access. Increasingly capacious and economical storage technologies greatly change the role that storage plays in such large-scale computing. Many data-intensive applications also require extremely high degrees of fault tolerance, reliability, and availability. Applications often face real-time responsiveness requirements as well, and must confront heterogeneous data types and noise and uncertainty in the data. Scale will affect a system's ability to retrieve new and updated data and to provide, whenever appropriate, guarantees of integrity and availability as part of the system's basic functionality in the face of varying levels of uncertainty.
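As an illustration of such high-level primitives, the following is a minimal sketch in Python of a map-and-reduce style word count. It is not drawn from the program description: the function names (`map_partition`, `merge_counts`, `run_job`) and the thread-pool "run-time" are hypothetical stand-ins, chosen only to show how an application can state what to compute per partition while a run-time layer decides how the work is scheduled.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def run_job(partitions, workers=4):
    """Hypothetical run-time: schedules the map tasks, then merges results.

    The application code above never mentions threads or data placement;
    a real data-intensive platform would also handle distribution across
    machines and recovery from failures, which this sketch does not.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())
```

For example, `run_job([["a b a"], ["b c"]])` treats each inner list as one partition and returns the combined word counts. The design point is the separation of concerns: only `run_job` would need to change to move from threads on one machine to a distributed platform.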
The Data-intensive Computing program seeks to increase our understanding of the capabilities and limitations of data-intensive computing. How can we best program data-intensive computing platforms to exploit massive parallelism and to best serve the varied tasks that may be executed on them? How can we express high-level parallelism at this scale in a natural way for users? What new programming abstractions (including models, languages, and algorithms) can accentuate these fundamental capabilities? How can data-intensive computing platforms be designed to support extremely high levels of reliability, efficiency, and availability? How can they be designed in ways that reflect desirable resource sensibilities, such as in power consumption, human maintainability, environmental footprint, and economic feasibility? What (new) applications can best exploit this computing paradigm, and how must this computing paradigm evolve to best support the data-intensive applications we may seek? These are examples of questions that at their core ask how we can support data-intensive computing when the volume of data surpasses the capabilities of the underlying computing and storage technologies.
The program will fund projects in all areas of computer and information science and engineering that increase our ability to build and use data-intensive computing systems and applications, help us understand their limitations, and create a knowledgeable workforce capable of operating and using these systems as they increasingly become a major force in our economy and society.
This program also supports research previously supported separately by the Cluster Exploratory (CluE) program, which made available for data-intensive computing projects a massively scaled, highly distributed computing resource supported by Google and IBM and a similar resource at the University of Illinois in partnership with Hewlett-Packard, Intel, and Yahoo!. The Data-Intensive Computing program welcomes proposals that request and use any such resources available to or accessible by the proposer(s), in order to pursue innovative research ideas in data-intensive computing and to explore the potential benefits this technology may have for science and engineering research as well as for applications that may benefit society more broadly.
Proposals requesting or intending to use such resources are required to include, in a separate section of their Project Description, a description of the computing resources needed to test and evaluate their research ideas. This description should state which facility or facilities the proposers plan to access and how, with as much detail as possible (e.g., schedule of use, time, space, data upload) to show the viability of the project.
Data-intensive Computing Point of Contact: Chitaranjan Das, Point of Contact, Data-intensive Computing Program, 1115S, telephone: (703) 292-8910, fax: (703) 292-9059, email: cdas@nsf.gov
Funding Opportunities for Data-intensive Computing: