Catching Silent Errors in Deep Learning Training
TrainCheck learns semantic invariants from sample pipelines and enforces proactive checks at runtime to catch silent training errors early.
OSDI 2025
Read Paper
Our research spans broadly across operating systems, distributed systems, cloud computing, mobile systems, and ML infrastructure, while specializing in reliability,
fault tolerance, and performance. Our work combines systems building with deep insights to address real-world challenges facing modern systems and enable ORDER.
Our research innovations cover:
ORDER := {Observable, Reliable, Defensible, Efficient, Responsive}
TrainVerify verifies the parallelization logic of LLM training to eliminate subtle correctness bugs.
SOSP 2025
Read Paper
Phoenix introduces OS-level mechanisms of partial process state preservation and optimistic recovery to improve application availability.
SOSP 2025
Read Paper
Atropos is an overload control framework that uses targeted task cancellation to reduce application resource overload.
SOSP 2025
Read Paper
Xinda diagnoses and mitigates slow faults with adaptive mechanisms tailored to modern distributed system behavior.
NSDI 2025
Read Paper
Updates on lab research, milestones, and practices.
Blog entries are standard Jekyll posts. Create a markdown file under _posts/ with this naming pattern:
Read Post
We have added a dedicated blog to the lab website to share technical updates in a faster, more narrative format than conference papers.
Read Post
We thank our sponsors for their funding and support, which make our research possible.