MathCS Seminar

Title: The SMURFS Project: Simulation and Modeling for Understanding Resilience and Faults at Scale
Seminar: Computer Science
Speaker: Dorian Arnold of the University of New Mexico
Contact: James Lu,
Date: 2017-02-09 at 4:00PM
Venue: W201
Abstract:
Current HPC research targets computer systems with exaflop capabilities (10^18, or one quintillion, floating-point operations per second). Such computational power will enable important new discoveries across all basic science domains. Application resilience is a major challenge to realizing extreme-scale computing systems. The SMURFS Project addresses this challenge by developing methods to improve our predictive understanding of the complex interactions among a given application, a given real or hypothetical hardware and software environment, and a given fault-tolerance strategy at extreme scale. Specifically, SMURFS explores: (1) advanced simulation and modeling capabilities for studying application resilience at scale; (2) comprehensive, comparative studies of existing and new fault-tolerance strategies; (3) a detailed understanding of how application features interact with different fault-tolerance strategies and hardware technologies; and (4) effective prescriptions to guide application developers, hardware architects, and system designers in realizing efficient, resilient extreme-scale capabilities. (This project is a collaboration among the University of New Mexico, the University of Tennessee, and Sandia National Labs. It is funded in part by the National Science Foundation.)