Award Abstract # 1942794
CAREER: Towards Gray-Fault Tolerant Cloud through Harnessing and Enhancing System Observability
NSF Org: CNS, Division Of Computer and Network Systems
Awardee: JOHNS HOPKINS UNIVERSITY, THE
Initial Amendment Date: June 17, 2020
Latest Amendment Date: June 17, 2020
Award Number: 1942794
Award Instrument: Continuing Grant
Program Manager: Marilyn McClure, mmcclure@nsf.gov, (703)292-5197, CNS Division Of Computer and Network Systems, CSE Direct For Computer & Info Scie & Enginr
Start Date: July 1, 2020
End Date: June 30, 2025 (Estimated)
Total Intended Award Amount: $609,497.00
Total Awarded Amount to Date: $243,124.00
Funds Obligated to Date: FY 2020 = $243,124.00
History of Investigator: Peng Huang (Principal Investigator), huang@cs.jhu.edu
Awardee Sponsored Research Office: Johns Hopkins University, 1101 E 33rd St, Baltimore, MD 21218-2686, US, (443)997-1898
Sponsor Congressional District: 07
Primary Place of Performance: Johns Hopkins University, 3400 N Charles Street, Baltimore, MD 21218-2608, US
Primary Place of Performance Congressional District: 07
DUNS ID: 001910777
Parent DUNS ID: 001910777
NSF Program(s): CSR-Computer Systems Research
Primary Program Source: 040100 NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045
Program Element Code(s): 7354
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT

Cloud systems are crucial infrastructure for many of today's services. Ensuring that cloud software runs continuously without disruption is both vital and challenging. Decades of research have produced mature techniques to detect and mask faults in distributed systems, but these techniques often assume a simple model in which a system component either works or completely stops. Numerous real-world cloud incidents, however, suggest that production cloud systems frequently experience gray failures: a degraded operational mode in which a system component appears to be working but is in fact severely impaired. Current solutions cannot deal with gray failures effectively. The overall objective of this proposal is to develop a holistic approach to detect, pinpoint, and diagnose gray failures in production cloud systems.
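As a rough illustration of the distinction drawn above, the following Python sketch contrasts a fail-stop check (a heartbeat) with an end-to-end probe of the request path. It is not the project's method. The `Component` class, the `classify` function, and the 100 ms latency budget are all hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical service component (illustrative only)."""
    name: str
    process_alive: bool = True        # what a heartbeat would observe
    request_latency_ms: float = 5.0   # what a client would experience

    def heartbeat(self) -> bool:
        # A fail-stop detector only asks whether the process still responds.
        return self.process_alive

    def probe_latency_ms(self) -> float:
        # An end-to-end probe exercises the actual request path.
        return self.request_latency_ms


def classify(component: Component, latency_budget_ms: float = 100.0) -> str:
    """Label a component as 'fail-stop', 'gray', or 'healthy'.

    The latency budget is an assumed threshold, not a value from the award.
    """
    if not component.heartbeat():
        return "fail-stop"
    if component.probe_latency_ms() > latency_budget_ms:
        # The heartbeat passes, yet the component is severely impaired.
        return "gray"
    return "healthy"


healthy = Component("storage-1")
stalled = Component("storage-2", request_latency_ms=30_000.0)
crashed = Component("storage-3", process_alive=False)

for c in (healthy, stalled, crashed):
    print(c.name, classify(c))
```

In this sketch a detector that relies on the heartbeat alone would report `storage-2` as working, which is the situation the abstract describes as a gray failure.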
To realize this objective, four synergistic research activities are proposed. First, the project conducts a study of real-world gray failure cases in popular distributed systems and measures and characterizes the observability of existing systems. Second, it designs a novel hybrid analysis that automatically inserts report-generation hooks across the whole system stack to harness observability for detecting gray failures. Third, to pinpoint the culprit component, it proposes algorithms to infer causality from the collected observations. Lastly, it designs a runtime checking framework that increases observability and supports online diagnosis of gray failures.
Gray failures are a common cause of cloud service outages, resulting in significant financial loss. This project can improve our understanding of gray failures and help detect and debug them, reducing their impact on ubiquitous cloud infrastructure. Software is becoming more distributed, with increasingly subtle failure modes. Observability, fault detection, and localization are critical skills for this paradigm shift but are rarely covered in the existing curriculum. This project addresses this educational gap through curriculum development and student training. It also promotes Computer Science education among underrepresented Baltimore high school students by organizing workshops, in partnership with the non-profit organization Code in the Schools, that showcase cloud and system failure concepts to local high school students.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Levy, Sebastien and Yao, Randolph and Wu, Youjiang and Dang, Yingnong and Huang, Peng and Mu, Zheng and Zhao, Pu and Ramani, Tarun and Govindraju, Naga and Li, Xukun and Lin, Qingwei and Shafriri, Gil Lapid and Chintalapati, Murali
"Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions"
Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, 2020
Lou, Chang and Huang, Peng and Smith, Scott
"Understanding, Detecting and Localizing Partial Failures in Large System Software"
Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation, 2020