
TrainCheck

Invariant Checking & Observability for AI Training

Stop flying blind. Automatically validate training dynamics, catch silent errors, and debug with confidence.

Get Started · 5-Min Tutorial · View on GitHub

✅ Continuous Invariant Checking

TrainCheck validates the "physics" of your training process in real time. It ensures your model adheres to learned invariants (such as gradient norms, tensor shapes, and update magnitudes), catching silent corruption before it wastes GPU hours.
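To make "learned invariant" concrete, here is a deliberately simplified sketch of one such rule: the total gradient norm should stay within the range observed in known-good runs. The function name, range representation, and slack factor below are illustrative assumptions, not TrainCheck's actual invariant format.

```python
import torch


def grad_norm_in_healthy_range(model: torch.nn.Module,
                               observed_min: float,
                               observed_max: float,
                               slack: float = 0.1) -> bool:
    """Return True if the total gradient norm lies within the range seen in healthy runs."""
    grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    if not grads:
        return True  # nothing to check before the first backward pass
    total_norm = torch.stack(grads).norm().item()
    return observed_min * (1 - slack) <= total_norm <= observed_max * (1 + slack)
```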

🚀 Holistic Observability

Traditional tools only show you if your model crashed. TrainCheck shows you why it's degrading, analyzing internal state dynamics that loss curves miss.

🧠 Zero-Config Validation

No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.

⚡ Universal Compatibility

Drop-in support for PyTorch, Hugging Face, and large-scale industry workloads built on DeepSpeed, Megatron, and more.


How It Works

  1. Instrument: We wrap your training loop with lightweight probes. No code changes needed.
  2. Learn: We analyze correct runs to infer invariants (mathematical rules of healthy training).
  3. Check: We monitor new runs in real time, verifying every step against the learned invariants to catch silent logic bugs and hardware faults (a conceptual sketch follows this list).
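The sketch below illustrates the Learn and Check steps on a toy trace format: invariants are plain per-signal value ranges mined from healthy runs, and a step of a new run is flagged when a signal leaves its range. The trace representation and function names are hypothetical; TrainCheck's real invariants capture richer relations between program events.

```python
from typing import Dict, Iterable, List, Tuple


def learn_invariants(healthy_traces: Iterable[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Learn a (min, max) range for every traced signal across healthy steps."""
    ranges: Dict[str, Tuple[float, float]] = {}
    for step in healthy_traces:
        for name, value in step.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def check_step(step: Dict[str, float],
               invariants: Dict[str, Tuple[float, float]],
               slack: float = 0.05) -> List[str]:
    """Return the names of signals whose values leave their learned range."""
    violations = []
    for name, value in step.items():
        if name not in invariants:
            continue  # signal never seen in healthy runs; nothing to compare against
        lo, hi = invariants[name]
        margin = slack * max(abs(lo), abs(hi), 1.0)
        if not (lo - margin <= value <= hi + margin):
            violations.append(name)
    return violations


# Example: learn from two healthy steps, then flag a vanished gradient in a new run.
invariants = learn_invariants([{"grad_norm": 1.2, "loss": 2.3},
                               {"grad_norm": 0.9, "loss": 2.1}])
print(check_step({"grad_norm": 0.0, "loss": 2.2}, invariants))  # -> ['grad_norm']
```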

(Workflow diagram)

🔥 Try TrainCheck

Work through the 5-Minute Experience with TrainCheck tutorial. You'll learn how to:

- Instrument a training script and collect a trace
- Automatically infer invariants
- Uncover silent bugs in the training script (a minimal illustration follows this list)
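As an example of the kind of silent bug the tutorial targets, the sketch below injects a classic one: a layer's parameters are never passed to the optimizer, so they quietly stop updating while training appears to proceed. An invariant learned from healthy runs, such as "every trainable parameter changes after optimizer.step()", exposes it at the first step. The model, bug, and check are hypothetical illustrations, not taken from the tutorial.

```python
import torch

# Toy model with two layers; the bug: only the first layer's parameters are
# handed to the optimizer, so the second layer silently never updates.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 1))
optimizer = torch.optim.SGD(model[0].parameters(), lr=0.1)

before = {name: p.detach().clone() for name, p in model.named_parameters()}
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()

# Invariant learned from healthy runs: every trainable parameter changes after step().
for name, p in model.named_parameters():
    if p.requires_grad and torch.equal(before[name], p.detach()):
        print(f"invariant violated: {name} did not update after optimizer.step()")
```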

Documentation

Status

TrainCheck is under active development. Please join our 💬 Discord server or file a GitHub issue for support. We welcome feedback and contributions from early adopters.

Contributing

We welcome and value any contributions and collaborations. Please check out Contributing to TrainCheck for how to get involved.

License

TrainCheck is licensed under the Apache License 2.0.

Citation

If TrainCheck is relevant to your work, please cite our paper:

@inproceedings{TrainCheckOSDI2025,
  author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
  title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
  booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
  series = {OSDI '25},
  month = {July},
  year = {2025},
  address = {Boston, MA, USA},
  publisher = {USENIX Association},
}

Artifact Evaluation

🕵️‍♀️ OSDI AE members, please see TrainCheck AE Guide.