Award Abstract # 1942794
CAREER: Towards Gray-Fault Tolerant Cloud through Harnessing and Enhancing System Observability

NSF Org: CNS
Division Of Computer and Network Systems
Awardee: JOHNS HOPKINS UNIVERSITY, THE
Initial Amendment Date: June 17, 2020
Latest Amendment Date: June 17, 2020
Award Number: 1942794
Award Instrument: Continuing Grant
Program Manager: Marilyn McClure
mmcclure@nsf.gov
 (703)292-5197
CNS
 Division Of Computer and Network Systems
CSE
 Direct For Computer & Info Scie & Enginr
Start Date: July 1, 2020
End Date: June 30, 2025 (Estimated)
Total Intended Award Amount: $609,497.00
Total Awarded Amount to Date: $243,124.00
Funds Obligated to Date: FY 2020 = $243,124.00
History of Investigator:
  • Peng Huang (Principal Investigator)
    huang@cs.jhu.edu
Awardee Sponsored Research Office: Johns Hopkins University
1101 E 33rd St
Baltimore
MD  US  21218-2686
(443)997-1898
Sponsor Congressional District: 07
Primary Place of Performance: Johns Hopkins University
3400 N Charles Street
Baltimore
MD  US  21218-2608
Primary Place of Performance Congressional District: 07
DUNS ID: 001910777
Parent DUNS ID: 001910777
NSF Program(s): CSR-Computer Systems Research
Primary Program Source: 040100 NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045
Program Element Code(s): 7354
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Cloud systems are crucial infrastructure for many of today's services. Ensuring that cloud software runs continuously without disruption is both vital and challenging. Decades of research have produced mature techniques to detect and mask faults in distributed systems, but these techniques often rely on a simple model that assumes a system component either works or stops completely. Numerous real-world cloud incidents, however, suggest that production cloud systems frequently experience gray failures: a degraded operational mode in which a system component appears to be working but is in fact severely impaired. Current solutions cannot deal with gray failures effectively. The overall objective of this proposal is to develop a holistic approach to detect, pinpoint, and diagnose gray failures in production cloud systems.
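
The contrast between the fail-stop model and a gray failure can be made concrete with a small simulation. The sketch below is illustrative only and is not the project's method: the `Component` class, the node names, and the 200 ms latency threshold are all hypothetical. It shows a heartbeat-style check reporting a component as up while the latencies its clients observe indicate severe impairment.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Component:
    """Hypothetical service component, simulated for illustration only."""
    name: str
    process_alive: bool          # what a heartbeat probe would see
    request_latencies_ms: list   # what clients of the component observed


def heartbeat_healthy(c: Component) -> bool:
    # Fail-stop view: a component is either up or down.
    return c.process_alive


def client_view_healthy(c: Component, slo_ms: float = 200.0) -> bool:
    # Observer view: judge health by the median latency clients saw.
    # The 200 ms threshold is an assumed value, not taken from the project.
    return median(c.request_latencies_ms) <= slo_ms


def classify(c: Component) -> str:
    # A stopped component is caught by the heartbeat alone.
    if not heartbeat_healthy(c):
        return "stopped"
    # A component that is alive but fails the client-side check
    # is the gray-failure case the fail-stop model misses.
    return "healthy" if client_view_healthy(c) else "gray failure"


nodes = [
    Component("storage-1", True, [12.0, 15.0, 11.0]),
    Component("storage-2", True, [950.0, 1200.0, 30.0]),
    Component("storage-3", False, []),
]
report = {c.name: classify(c) for c in nodes}
print(report)
```

In this toy setup, `storage-2` passes the heartbeat check yet is classified as a gray failure, because the health signal that matters comes from what its clients observe rather than from the component's own liveness report.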

To realize this objective, four synergistic research activities are proposed. First, the project studies real-world gray failure cases in popular distributed systems and measures and characterizes the observability of existing systems. Second, it designs a novel hybrid analysis that automatically inserts report-generation hooks across the whole system stack to harness observability for detecting gray failures. Third, to pinpoint the culprit component, it proposes algorithms to infer causality from the collected observations. Lastly, it designs a runtime checking framework for increasing observability and diagnosing gray failures online.

Gray failures are a common cause of cloud service outages, resulting in significant financial loss. This project can improve our understanding of gray failures and help detect and debug them, reducing their impact on ubiquitous cloud infrastructure. Software is becoming more distributed, with increasingly subtle failure modes. Observability, fault detection, and fault localization are critical skills for this paradigm shift but are rarely covered in existing curricula. This project addresses this educational gap through curriculum development and student training. It also promotes Computer Science education among underrepresented Baltimore high school students by organizing workshops, in partnership with the non-profit organization Code in the Schools, that showcase cloud and system failure concepts.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Choi, Brian and Burns, Randal and Huang, Peng "Understanding and dealing with hard faults in persistent memory systems" Proceedings of the Sixteenth European Conference on Computer Systems, 2021 https://doi.org/10.1145/3447786.3456252
Levy, Sebastien and Yao, Randolph and Wu, Youjiang and Dang, Yingnong and Huang, Peng and Mu, Zheng and Zhao, Pu and Ramani, Tarun and Govindraju, Naga and Li, Xukun and Lin, Qingwei and Shafriri, Gil Lapid and Chintalapati, Murali "Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions" Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, 2020 https://doi.org/
Lou, Chang and Huang, Peng and Smith, Scott "Understanding, Detecting and Localizing Partial Failures in Large System Software" Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation, 2020 https://doi.org/

Please report errors in award information by writing to: awardsearch@nsf.gov.
