Invariant Checking & Observability for AI Training
Stop flying blind. Automatically validate training dynamics, catch silent errors, and debug with confidence.
✅ Continuous Invariant Checking
TrainCheck validates the "physics" of your training process in real time. It ensures your model adheres to learned invariants (such as gradient norms, tensor shapes, and update magnitudes), catching silent corruption before it wastes GPU hours.
🚀 Holistic Observability
Traditional tools only tell you whether your training crashed. TrainCheck shows you why it is degrading by analyzing internal state dynamics that loss curves miss.
🧠 Zero-Config Validation
No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
⚡ Universal Compatibility
Drop-in support for PyTorch, Hugging Face, and industrial-scale workloads built on DeepSpeed, Megatron, and more.
How It Works
- Instrument: We wrap your training loop with lightweight probes. No code changes needed.
- Learn: We analyze correct runs to infer invariants (mathematical rules of healthy training).
- Check: We monitor new runs in real-time, verifying every step against learned invariants to catch silent logic bugs and hardware faults.
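The learn/check steps above can be sketched in a few lines. This is a conceptual illustration only, not TrainCheck's actual API: it learns simple min/max bounds on per-step metrics (e.g., gradient norms) from healthy runs, then flags steps in a new run that fall outside the learned range. The function names and the tolerance parameter are hypothetical.

```python
# Conceptual sketch of invariant learning and checking (NOT TrainCheck's API).
# A run is a list of per-step metric dicts, e.g. [{"grad_norm": 0.9}, ...].

def learn_invariants(healthy_runs):
    """Learn (min, max) bounds per metric from a set of healthy runs."""
    bounds = {}
    for run in healthy_runs:
        for step in run:
            for name, value in step.items():
                lo, hi = bounds.get(name, (value, value))
                bounds[name] = (min(lo, value), max(hi, value))
    return bounds

def check_step(step_metrics, bounds, tolerance=1.5):
    """Return the invariants violated by one training step."""
    violations = []
    for name, value in step_metrics.items():
        if name not in bounds:
            continue  # no invariant learned for this metric
        lo, hi = bounds[name]
        if value < lo / tolerance or value > hi * tolerance:
            violations.append((name, value, (lo, hi)))
    return violations

# Learn from two healthy runs, then catch an exploding gradient norm.
healthy = [
    [{"grad_norm": 0.9}, {"grad_norm": 1.1}],
    [{"grad_norm": 1.0}, {"grad_norm": 1.2}],
]
bounds = learn_invariants(healthy)
print(check_step({"grad_norm": 50.0}, bounds))  # grad_norm flagged
print(check_step({"grad_norm": 1.0}, bounds))   # no violations
```

Real invariants can of course be richer than numeric bounds (relations between tensors, shapes, update rules), but the learn-from-healthy-runs, check-every-step loop is the same.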

🔥 Try TrainCheck
Work through the 5‑Minute Experience with TrainCheck tutorial. You’ll learn how to:
- Instrument a training script and collect a trace
- Automatically infer invariants
- Uncover silent bugs in the training script
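To make the first step concrete, here is a hypothetical illustration of trace collection via a lightweight probe (not TrainCheck's actual instrumentation): a decorator wraps a training-step function and appends its reported metrics to a trace, which can later be fed to invariant inference. The `probe` helper and the toy `train_step` are assumptions for illustration.

```python
# Hypothetical probe that records per-step metrics into a trace
# (for illustration only; TrainCheck's real instrumentation differs).
import functools

trace = []

def probe(train_step):
    """Wrap a training-step function and log the metrics it returns."""
    @functools.wraps(train_step)
    def wrapped(step, *args, **kwargs):
        metrics = train_step(step, *args, **kwargs)
        trace.append({"step": step, **metrics})
        return metrics
    return wrapped

@probe
def train_step(step):
    # Stand-in for a real forward/backward pass.
    return {"loss": 1.0 / (step + 1)}

for s in range(3):
    train_step(s)

print(trace)  # three recorded steps with their loss values
```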
Documentation
- Installation Guide
- Usage Guide: Scenarios and Limitations
- TrainCheck Technical Doc
- TrainCheck Dev Roadmap
Status
TrainCheck is under active development. Please join our 💬 Discord server or file a GitHub issue for support. We welcome feedback and contributions from early adopters.
Contributing
We welcome and value any contributions and collaborations. Please check out Contributing to TrainCheck for how to get involved.
License
TrainCheck is licensed under the Apache License 2.0.
Citation
If TrainCheck is relevant to your work, please cite our paper:
@inproceedings{TrainCheckOSDI2025,
  author    = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
  title     = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
  booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
  series    = {OSDI '25},
  month     = {July},
  year      = {2025},
  address   = {Boston, MA, USA},
  publisher = {USENIX Association},
}
Artifact Evaluation
🕵️‍♀️ OSDI AE members, please see the TrainCheck AE Guide.