Meta Research Silent Data Corruptions at Scale

Closing Date: 21/03/2022

Funding for research on silent data corruptions in large-scale infrastructure systems.

Meta Research (formerly Facebook) works on cutting-edge research with a practical focus, and builds long-term relationships with top research institutions around the world. It also publishes papers, gives talks, and collaborates broadly with the academic community.

Meta is soliciting proposals on mitigating silent data corruptions in internet applications caused by hardware faults that affect the data centre computing stack (from hardware to compilers to applications). Proposals could range from hardware- and architecture-level mitigations and design strategies, through test architecture evolution, to software resiliency against silent data corruption.

Example topics include the following:

1. Computer architecture approaches to handle silent data corruptions

  • Architectural solutions, such as enhanced compute-block ECC mechanisms, to handle and mitigate silent data corruptions.
  • Self-test architectural blocks and modes, such as lockstep computing, checkpointing, and redundant computing, evaluated for compute cost and performance trade-offs.
  • Novel architectural solutions for compute and memory error handling, including but not limited to enhancements of traditional RAS architectures.

2. Distributed computing solutions to silent error propagation

  • Multi-machine computational resiliency models/solutions for silent error containment and propagation.
  • Error detection capability across multiple subsystems.
  • Distributed/fleet scale error containment and testing mechanisms.
  • Self-test (for silent data corruption) distributed system architecture and recovery solutions.

3. Service resiliency, software redundancy

  • Software-level solutions for silent error resiliency including redundancy and probabilistic algorithmic fault tolerance.
  • Enabling corruption-resilient, general-purpose compute and data movement libraries.
  • Real-time software-level detection and containment strategies for silent corruptions, evaluated for compute cost and performance.
  • Algorithmic solutions for recovering from historical data corruptions.

4. Silicon design

  • Silicon design and manufacturing strategies towards mitigation of silent data corruption.
  • Advanced simulation, emulation, and testing strategies within silicon fabrication.
  • Silicon testing coverage assessment and probabilistic evaluation of fault occurrence within silicon modules.
  • Test routine development for manufacturing and fleet use cases for silent error detection.
  • Degradation assessment and modeling for silicon modules.

Funding body: Meta Research
Maximum value: 50,000 USD
Reference ID: S23442
Category: Science and Technology
Fund or call: Fund